3. Definitions¶

3.1. Character¶

Generic term for a semantic symbol. Many possible interpretations exist in the context of encoding.

In computing, the most important aspect is that characters can be letters, spaces or control characters which represent the end of a file or can be used to trigger a sound.

3.2. Glyph¶

One or more shapes that may be combined into a grapheme.

In Latin, a glyph often has 2 variants like ‘A’ and ‘a’ and Arabic often has four. This term is context dependent and different styles or formats can be considered different glyphs.

Most relevant in programming is that diacritic marks (e.g. accents like  and ^) are also glyphs, which are sometimes represented with another at one point, like the à in ISO 8859-1 or as two separate glyphs, so an a and the combining  (U+0300 and U+0061 combined as U+00E0).

3.3. Code point¶

A code point is an unsigned integer. The smallest code point is zero. Code points are usually written as hexadecimal, e.g. “0x20AC” (8,364 in decimal).

3.4. Character set (charset)¶

A character set, abbreviated charset, is a mapping between code points and characters. The mapping has a fixed size. For example, most 7 bits encodings have 128 entries, and most 8 bits encodings have 256 entries. The biggest charset is the Unicode Character Set 6.0 with 1,114,112 entries.

In some charsets, code points are not all contiguous. For example, the cp1252 charset maps code points from 0 though 255, but it has only 251 entries: 0x81, 0x8D, 0x8F, 0x90 and 0x9D code points are not assigned.

Examples of the ASCII charset: the digit five (“5”, U+0035) is assigned to the code point 0x35 (53 in decimal), and the uppercase letter “A” (U+0041) to the code point 0x41 (65).

The biggest code point depends on the size of the charset. For example, the biggest code point of the ASCII charset is 127 ($$2^7-1$$)

Charset examples:

Charset Code point Character
ASCII 0x35 5 (U+0035)
ASCII 0x41 A (U+0041)
ISO-8859-15 0xA4 € (U+20AC)
Unicode Character Set 0x20AC € (U+20AC)

3.5. Character string¶

A character string, or “Unicode string”, is a string where each unit is a character. Depending on the implementation, each character can be any Unicode character, or only characters in the range U+0000—U+FFFF, range called the Basic Multilingual Plane (BMP). There are 3 different implementations of character strings:

• array of 32 bits unsigned integers (the UCS-4 encoding): full Unicode range
• array of 16 bits unsigned integers (UCS-2): BMP only
• array of 16 bits unsigned integers with surrogate pairs (UTF-16): full Unicode range

UCS-4 uses twice as much memory than UCS-2, but it supports all Unicode characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the BMP range use one UTF-16 unit (16 bits), characters outside this range use two UTF-16 units (a surrogate pair, 32 bits). This advantage is also the main disadvantage of this kind of character string.

The length of a character string implemented using UTF-16 is the number of UTF-16 units, and not the number of characters, which is confusing. For example, the U+10FFFF character is encoded as two UTF-16 units: {U+DBFF, U+DFFF}. If the character string only contains characters of the BMP range, the length is the number of characters. Getting the nth character or the length in characters using UTF-16 has a complexity of $$O(n)$$, whereas it has a complexity of $$O(1)$$ for UCS-2 and UCS-4 strings.

The Java language, the Qt library and Windows 2000 implement character strings with UTF-16. The C and Python languages use UTF-16 or UCS-4 depending on: the size of the wchar_t type (16 or 32 bits) for C, and the compilation mode (narrow or wide) for Python. Windows 95 uses UCS-2 strings.

UCS-2, UCS-4 and UTF-16 encodings, and surrogate pairs.

3.6. Byte string¶

A byte string is a character string encoded to an encoding. It is implemented as an array of 8 bits unsigned integers. It can be called by its encoding. For example, a byte string encoded to ASCII is called an “ASCII encoded string”, or simply an “ASCII string”.

The character range supported by a byte string depends on its encoding, because an encoding is associated with a charset. For example, an ASCII string can only store characters in the range U+0000—U+007F.

The encoding is not stored explicitly in a byte string. If the encoding is not documented or attached to the byte string, the encoding has to be guessed, which is a difficult task. If a byte string is decoded from the wrong encoding, it will not be displayed correctly, leading to a well known issue: mojibake.

The same problem occurs if two byte strings encoded to different encodings are concatenated. Never concatenate byte strings encoded to different encodings! Use character strings, instead of byte strings, to avoid mojibake issues.

PHP5 only supports byte strings. In the C language, “strings” are usually byte strings which are implemented as the char* type (or const char*).

The char* type of the C language and the mojibake issue.

3.7. UTF-8 encoded strings and UTF-16 character strings¶

A UTF-8 string is a particular case, because UTF-8 is able to encode all Unicode characters [1] . But a UTF-8 string is not a Unicode string because the string unit is byte and not character: you can get an individual byte of a multibyte character.

Another difference between UTF-8 strings and Unicode strings is the complexity of getting the nth character: $$O(n)$$ for the byte string and $$O(1)$$ for the Unicode string. There is one exception: if the Unicode string is implemented using UTF-16: it has also a complexity of $$O(n)$$.

 [1] A UTF-8 encoder should not encode surrogate characters (U+D800—U+DFFF).

3.8. Encoding¶

An encoding describes how to encode code points to bytes and how to decode bytes to code points.

An encoding is always associated with a charset. For example, the UTF-8 encoding is associated with the Unicode charset. So we can say that an encoding encodes characters to bytes and decode bytes to characters, or more generally, it encodes a character string to a byte string and decodes a byte string to a character string.

The 7 and 8 bits charsets have the simplest encoding: store a code point as a single byte. Since these charsets are also called encodings, it is easy to confuse them. The best example is the ISO-8859-1 encoding: all of the 256 possible bytes are considered as 8 bit code points (0 through 255) and are mapped to characters. For example, the character A (U+0041) has the code point 65 (0x41 in hexadecimal) and is stored as the byte 0x41.

Charsets with more than 256 entries cannot encode all code points into a single byte. The encoding encodes all code points into byte sequences of the same length or of variable length. For example, UTF-8 is a variable length encoding: code points lower than 128 use a single byte, whereas higher code points take 2, 3 or 4 bytes. The UCS-2 encoding encodes all code points into sequences of two bytes (16 bits).

3.9. Encode a character string¶

Encode a character string to a byte string, to an encoding. For example, encode “Hé” to UTF-8 gives 0x48 0xC3 0xA9.

By default, most libraries are strict: raise an error at the first unencodable character. Some libraries allow to choose how to handle them.

Most encodings are stateless, but some encoding requires a stateful encoder. For example, the UTF-16 encoding starts by generating a BOM, 0xFF 0xFE or 0xFE 0xFF depending on the endian.

3.10. Decode a byte string¶

Decode a byte string from an encoding to a character string. For example, decode 0x48 0xC3 0xA9 from UTF-8 gives “Hé”.

By default, most libraries raise an error if a byte sequence cannot be decoded. Some libraries allow to choose how to handle them.

Most encodings are stateless, but some encoding requires a stateful decoder. For example, the UTF-16 encoding decodes the two first bytes as a BOM to read the endian (use UTF-16-LE or UTF-16-BE).

3.11. Mojibake¶

When a byte strings is decoded from the wrong encoding, or when two byte strings encoded to different encodings are concatenated, a program will display mojibake.

The classical example is a latin string (with diacritics) encoded to UTF-8 but decoded from ISO-8859-1. It displays Ã© {U+00C3, U+00A9} for the é (U+00E9) letter, because é is encoded to 0xC3 0xA9 in UTF-8.

Other examples:

Text Encoded to Decoded from Result
Noël UTF-8 ISO-8859-1 NoÃ«l
Русский KOI-8 ISO-8859-1 òÕÓÓËÉÊ

Note

“Mojibake” is japanese word meaning literally “unintelligible sequence of characters”. This issue is called “Кракозябры” (krakozyabry) in Russian.

2004. An image of a post envelope with address written in krakozyabry (кракозя́бры) AKA Mojibake. The envelope contained a Harry Potter book. This letter was sent to a Russian student by her French friend, who manually wrote the address that she received by e-mail. Her e-mail client, unfortunately, was not set up correctly to display Cyrillic characters, so they were substituted with diacritic symbols from the Western charset (ISO-8859-1) The original message was in KOI8-R.

The address was deciphered by the postal employees and delivered successfully. Some of the correct characters (red) were written above the wrong ones (black).