utf8 Namespace Reference

Utilities to convert between std::string and std::wstring.

Enumerations

enum  TextEncoding {
  encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE,
  encUTF32BE, encUTF32LE, encSCSU, encUTF7,
  encUTFEBCDIC, encBOCU1
}

Functions

DSOEXPORT std::wstring decodeCanonicalString (const std::string &str, int version)
 Converts a std::string with multibyte characters into a std::wstring.
DSOEXPORT std::string encodeCanonicalString (const std::wstring &wstr, int version)
 Converts a std::wstring into canonical std::string.
DSOEXPORT boost::uint32_t decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e)
 Return the next Unicode character in the UTF-8 encoded string.
DSOEXPORT std::string encodeUnicodeCharacter (boost::uint32_t ucs_character)
 Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
DSOEXPORT std::string encodeLatin1Character (boost::uint32_t ucsCharacter)
 Encodes the given wide character into an at least 8-bit character.
DSOEXPORT char * stripBOM (char *in, size_t &size, TextEncoding &encoding)
 Interpret (and skip) Byte Order Mark in input stream.
DSOEXPORT const char * textEncodingName (TextEncoding enc)
 Return name of a text encoding.

Detailed Description

Utilities to convert between std::string and std::wstring.

Strings in Gnash are generally stored as std::strings. We must, however, handle characters outside the 7-bit ASCII range (code points 128 and above), which can be encoded in two different ways.

SWF6 and later use UTF-8, a multibyte encoding that allows many thousands of unique codes. Multibyte characters are difficult to handle because the character count of a string, which many string operations need, is not known without parsing the string. Converting the string to a std::wstring makes string operations easier, because the length of the wstring equals the number of valid characters. (A wchar_t is generally 32 bits wide and at least two bytes; the proprietary player seems to handle only characters up to 65535.)
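To illustrate why the byte length of a UTF-8 string differs from its character count, here is a minimal sketch. The helper `countCodePoints` is hypothetical and not part of Gnash; it assumes well-formed UTF-8 input and simply skips continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Gnash): counts the code points in a
// UTF-8 string by counting every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "café" encoded as UTF-8, `size()` reports 5 bytes while the helper reports 4 characters, which is the mismatch that converting to a std::wstring avoids.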

SWF5 and earlier, however, used the ISO-8859 specification, allowing the standard 128 ASCII characters plus 128 extra characters that depend on the particular subset of ISO-8859. Characters are 8 bits, not the ASCII standard 7. SWF5 cannot handle multi-byte characters without special functions.

It is important that SWF5 can distinguish between the two encodings, so we cannot convert all strings to UTF-8.

Presently, this code is used for the AS String object, gnash::edit_text_character, ord() and chr().


Enumeration Type Documentation

enum utf8::TextEncoding

Enumerator:
encUNSPECIFIED 
encUTF8 
encUTF16BE 
encUTF16LE 
encUTF32BE 
encUTF32LE 
encSCSU 
encUTF7 
encUTFEBCDIC 
encBOCU1 

Function Documentation

std::wstring utf8::decodeCanonicalString (const std::string &str, int version)

Converts a std::string with multibyte characters into a std::wstring.

Returns:
a version-dependent wstring.
Parameters:
str the canonical string to convert.
version the SWF version, used to decide how to decode the string. For SWF5, UTF-8 (or any other) multibyte encoded characters are converted char by char, mangling the string.

References decodeNextUnicodeCharacter(), gnash::key::e, and INVALID_CHAR.

Referenced by gnash::TextField::replaceSelection(), gnash::TextField::TextField(), gnash::TextField::updateHtmlText(), and gnash::TextField::updateText().
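The char-by-char conversion described for SWF5 can be pictured with the sketch below. `widenBytes` is a hypothetical stand-in, not Gnash's code; it assumes that "char by char" means each byte becomes one wide character.

```cpp
#include <string>

// Hypothetical stand-in (not Gnash's implementation): widens every byte
// of the input to one wchar_t, which is what a char-by-char conversion
// amounts to. Multibyte UTF-8 sequences are therefore split into
// several wide characters.
std::wstring widenBytes(const std::string& s)
{
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

Under this assumption, the two-byte UTF-8 sequence for "é" (0xC3 0xA9) becomes two wide characters, U+00C3 and U+00A9, which is the mangling the parameter description warns about.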

boost::uint32_t utf8::decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e)

Return the next Unicode character in the UTF-8 encoded string.

Invalid UTF-8 sequences produce a U+FFFD character as output. Advances the string iterator past the character returned, unless the returned character is '\0', in which case the iterator does not advance.

References FIRST_BYTE, and NEXT_BYTE.

Referenced by decodeCanonicalString().
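As a rough illustration of this kind of iterator-based decoding, the sketch below reads one code point and substitutes U+FFFD for invalid input. `nextCodePoint` is a hypothetical function, not Gnash's implementation; it accepts sequences of up to four bytes, and returning 0 without advancing at the end of input is an assumption of the sketch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch (not Gnash's implementation) of decoding one
// code point from a UTF-8 string through an iterator.
// Returns U+FFFD for invalid sequences. Returns 0 without advancing
// when the input is exhausted or the next byte is NUL (assumption).
std::uint32_t nextCodePoint(std::string::const_iterator& it,
                            const std::string::const_iterator& end)
{
    const std::uint32_t invalid = 0xFFFD;

    if (it == end) return 0;
    const unsigned char lead = static_cast<unsigned char>(*it);
    if (lead == 0) return 0;          // NUL: do not advance
    ++it;
    if (lead < 0x80) return lead;     // plain ASCII

    int extra = 0;                    // number of continuation bytes
    std::uint32_t cp = 0;             // code point being assembled
    std::uint32_t minimum = 0;        // smallest value for this length
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minimum = 0x10000; }
    else return invalid;              // stray continuation or bad lead byte

    for (int i = 0; i < extra; ++i) {
        if (it == end) return invalid;            // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return invalid;   // leave c for the next call
        cp = (cp << 6) | (c & 0x3F);
        ++it;
    }

    if (cp < minimum) return invalid;                 // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return invalid; // surrogate range
    if (cp > 0x10FFFF) return invalid;                // beyond Unicode
    return cp;
}
```

A caller would typically loop until the iterator reaches the end, appending each returned value to a wide string.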

std::string utf8::encodeCanonicalString (const std::wstring &wstr, int version)

Converts a std::wstring into canonical std::string.

Returns:
a version-dependent encoded std::string.
Parameters:
wstr the wide string to convert.
version the SWF version, used to decide how to encode the string.

For SWF 5, each character is stored as an 8-bit (at least) char, rather than converting it to a canonical UTF-8 byte sequence. Gnash can then distinguish between 8-bit characters, which it handles correctly, and multi-byte characters, which are regarded as multiple characters for string methods.

References encodeLatin1Character(), and encodeUnicodeCharacter().

Referenced by gnash::TextField::get_htmltext_value(), gnash::TextField::get_text_value(), gnash::TextField::setHtmlTextValue(), and gnash::TextField::setTextValue().

std::string utf8::encodeLatin1Character (boost::uint32_t ucsCharacter)

Encodes the given wide character into an at least 8-bit character.

Allows storage of Latin1 (ISO-8859-1) characters. This is the format of SWF5 and below.

Referenced by encodeCanonicalString().
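A minimal sketch of single-byte Latin-1 storage follows. `toLatin1` is hypothetical and not Gnash's implementation; in particular, replacing code points above 0xFF with '?' is a choice made for this sketch only, since the documentation does not say how such values are handled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not Gnash's implementation): stores a code point
// as a single Latin-1 (ISO-8859-1) byte. Latin-1 maps code points
// 0x00-0xFF directly to byte values. Code points above 0xFF have no
// Latin-1 form; substituting '?' is this sketch's own assumption.
std::string toLatin1(std::uint32_t cp)
{
    if (cp > 0xFF) {
        return "?";
    }
    return std::string(1, static_cast<char>(static_cast<unsigned char>(cp)));
}
```

With this scheme "é" (U+00E9) occupies one byte, 0xE9, rather than the two bytes it would need in UTF-8.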

std::string utf8::encodeUnicodeCharacter (boost::uint32_t ucs_character)

Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.

Referenced by encodeCanonicalString().
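For reference, UTF-8 encoding of a single code point can be sketched as below. `toUtf8` is a hypothetical function, not Gnash's implementation. It stops at four bytes, which covers every code point up to U+10FFFF; the original UTF-8 scheme allowed up to six bytes for larger values, which this sketch does not produce. Emitting U+FFFD for out-of-range input is an assumption, and surrogate values are not checked.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch (not Gnash's implementation): encodes one code
// point as UTF-8 using one to four bytes. Values above U+10FFFF are
// replaced by U+FFFD (an assumption of this sketch). Surrogate code
// points are not rejected.
std::string toUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        return toUtf8(0xFFFD);
    }
    return out;
}
```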

char * utf8::stripBOM (char *in, size_t &size, TextEncoding &encoding)

Interpret (and skip) Byte Order Mark in input stream.

This function takes a pointer to a buffer and returns a pointer to the start of the actual data, after the BOM if one is present. No conversion is performed and no bytes are copied; the BOM is only skipped, and its interpretation is returned through the encoding output parameter.

See http://en.wikipedia.org/wiki/Byte-order_mark
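The skipping logic can be sketched as a table lookup over known BOM byte sequences. Everything named here (`Bom`, `skipBom`) is hypothetical and not Gnash's code; the sketch covers only the UTF-8, UTF-16, and UTF-32 marks, and it checks the four-byte marks first because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical encoding tags for this sketch (not Gnash's TextEncoding).
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Hypothetical sketch (not Gnash's implementation): returns a pointer
// past any recognised BOM, shrinks `size` by the BOM length, and always
// sets `found`. Nothing is copied or converted.
const char* skipBom(const char* in, std::size_t& size, Bom& found)
{
    struct Entry { const char* bytes; std::size_t len; Bom kind; };
    // Longer marks come first so UTF-32LE is not mistaken for UTF-16LE.
    static const Entry table[] = {
        { "\x00\x00\xFE\xFF", 4, Bom::Utf32BE },
        { "\xFF\xFE\x00\x00", 4, Bom::Utf32LE },
        { "\xEF\xBB\xBF",     3, Bom::Utf8 },
        { "\xFE\xFF",         2, Bom::Utf16BE },
        { "\xFF\xFE",         2, Bom::Utf16LE },
    };

    found = Bom::None;
    for (const Entry& entry : table) {
        if (size >= entry.len && std::memcmp(in, entry.bytes, entry.len) == 0) {
            found = entry.kind;
            size -= entry.len;
            return in + entry.len;
        }
    }
    return in;  // no BOM: pointer and size are unchanged
}
```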

Parameters:
in The input buffer.
size Size of the input buffer; decremented by the size of the BOM, if any.
encoding Output parameter; always set. It is encUNSPECIFIED if no BOM is found.
Returns:
A pointer either equal to 'in' or a few bytes past it.

const char * utf8::textEncodingName (TextEncoding enc)

Return name of a text encoding.

References encBOCU1, encSCSU, encUNSPECIFIED, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encUTF7, encUTF8, and encUTFEBCDIC.