4. Unicode

Unicode is a character set. It is a superset of all the other character sets. As of version 6.0, Unicode has 1,114,112 code points (the last code point is U+10FFFF). Unicode 1.0 was limited to 65,536 code points (the last code point was U+FFFF); the range U+0000–U+FFFF is called the BMP (Basic Multilingual Plane). I call the characters in the range U+10000–U+10FFFF non-BMP characters.

4.1. Unicode Character Set

The Unicode Character Set (UCS) contains 1,114,112 code points: U+0000–U+10FFFF. Characters and code point ranges are grouped into categories. Only encodings of the UTF family are able to encode the UCS.
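The size of the code space and the BMP boundary can be checked with a few lines of Python. This is a minimal sketch, assuming Python 3.3 or newer (where `sys.maxunicode` is always 0x10FFFF); the constant names and the `is_bmp` helper are illustrative and do not come from the text.

```python
import sys

BMP_LAST = 0xFFFF        # last code point of the Basic Multilingual Plane
UNICODE_LAST = 0x10FFFF  # last code point of the Unicode Character Set

# Code points are numbered from 0, so the total is the last value plus one.
total_code_points = UNICODE_LAST + 1


def is_bmp(char):
    """Return True if the single character lies in U+0000-U+FFFF."""
    return ord(char) <= BMP_LAST


print(f"{total_code_points:,} code points")
print(is_bmp("A"))           # a BMP character
print(is_bmp("\U0001F600"))  # a non-BMP character (U+1F600)
```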

4.2. Categories

Unicode 6.0 has 7 character categories, and each category has subcategories:

Letter (L): lowercase (Ll), modifier (Lm), titlecase (Lt), uppercase (Lu), other (Lo)

Mark (M): spacing combining (Mc), enclosing (Me), non-spacing (Mn)

Number (N): decimal digit (Nd), letter (Nl), other (No)

Punctuation (P): connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)

Symbol (S): currency (Sc), modifier (Sk), math (Sm), other (So)

Separator (Z): line (Zl), paragraph (Zp), space (Zs)

Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)
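The two-letter codes listed above are the values returned by Python's standard `unicodedata.category()` function. The sketch below looks up a few sample characters; the `major_category` helper is an illustrative name, and the database bundled with the interpreter may be newer than Unicode 6.0.

```python
import unicodedata


def major_category(char):
    """Return the one-letter major category (L, M, N, P, S, Z or C)."""
    return unicodedata.category(char)[0]


# A few characters picked to cover each of the 7 categories.
samples = ["a", "A", "\u0327", "1", "-", "$", "+", " ", "\n", "\ue000"]
for char in samples:
    print(f"U+{ord(char):04X}: {unicodedata.category(char)}")
```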

There are 3 ranges reserved for private use (Co subcategory): U+E000–U+F8FF (6,400 code points), U+F0000–U+FFFFD (65,534) and U+100000–U+10FFFD (65,534). Surrogates (Cs subcategory) use the range U+D800–U+DFFF (2,048 code points).
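The sizes quoted for these ranges follow from the inclusive bounds, as this short sketch shows. The variable names and the `range_size` helper are illustrative.

```python
import unicodedata

# Inclusive (first, last) bounds taken from the text.
private_use_ranges = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
surrogate_range = (0xD800, 0xDFFF)


def range_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


private_use_sizes = [range_size(first, last) for first, last in private_use_ranges]
surrogate_size = range_size(*surrogate_range)

print(private_use_sizes, surrogate_size)
# The range bounds carry the expected subcategories.
print(unicodedata.category(chr(0xE000)), unicodedata.category(chr(0xD800)))
```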

4.3. Statistics

Out of a total of 1,114,112 possible code points, only 248,966 are assigned: 77.6% are not assigned. The following statistics exclude the not assigned (Cn), private use (Co) and surrogate (Cs) subcategories:

Letter: 100,520 (91.8%)

Symbol: 5,508 (5.0%)

Mark: 1,498 (1.4%)

Number: 1,100 (1.0%)

Punctuation: 598 (0.5%)

Other: 205 (0.2%)

Separator: 20 (0.0%)
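A table of this kind can be regenerated by walking the whole code space with `unicodedata`. This is a sketch, not the script used for the figures above: the function name is illustrative, and because the interpreter usually bundles a Unicode version newer than 6.0 (see `unicodedata.unidata_version`), the counts it prints will differ from the ones listed.

```python
import sys
import unicodedata
from collections import Counter

# Subcategories left out of the statistics, as in the text.
EXCLUDED = {"Cn", "Co", "Cs"}


def count_major_categories():
    """Count code points per major category, skipping Cn, Co and Cs."""
    counts = Counter()
    for code in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(code))
        if category not in EXCLUDED:
            counts[category[0]] += 1
    return counts


counts = count_major_categories()
total = sum(counts.values())
print(f"Unicode {unicodedata.unidata_version}")
for major, count in counts.most_common():
    print(f"{major}: {count:,} ({count / total:.1%})")
```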

Out of a total of 106,028 letters and symbols, 101,482 are in the “other” subcategories (Lo and So): only 4.3% have well-defined subcategories:

Letter, lowercase (Ll): 1,759

Letter, uppercase (Lu): 1,436

Symbol, math (Sm): 948

Letter, modifier (Lm): 210

Symbol, modifier (Sk): 115

Symbol, currency (Sc): 47

Letter, titlecase (Lt): 31
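The 106,028, 101,482 and 4.3% figures can be cross-checked from the numbers given in this section. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Figures copied from the statistics above (Unicode 6.0).
letters = 100_520
symbols = 5_508
well_defined = {
    "Ll": 1_759,
    "Lu": 1_436,
    "Sm": 948,
    "Lm": 210,
    "Sk": 115,
    "Sc": 47,
    "Lt": 31,
}

letters_and_symbols = letters + symbols
well_defined_total = sum(well_defined.values())
other_total = letters_and_symbols - well_defined_total  # Lo and So
well_defined_share = well_defined_total / letters_and_symbols

print(letters_and_symbols, other_total, f"{well_defined_share:.1%}")
```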

4.4. Normalization

The Unicode standard explains how to decompose a character. For example, the precomposed character Ç (U+00C7, Latin capital letter C with cedilla) can be written as a sequence of two characters, the base letter followed by the combining mark: {C (U+0043, Latin capital letter C), ◌̧ (U+0327, Combining cedilla)}. This decomposition can be useful when searching for a substring in a text, e.g. removing diacritics makes the search more practical for the user. The decomposed form is called Normal Form D (NFD) and the precomposed form is called Normal Form C (NFC).

Form	String	Unicode
NFC	Ç	U+00C7
NFD	Ç	{U+0043, U+0327}
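In Python, both forms are available through the standard `unicodedata.normalize()` function. The sketch below converts U+00C7 between NFC and NFD and shows one way to drop diacritics for searching; the `strip_marks` helper is an illustrative name and a simplification (it removes every character of the Mark category after decomposition).

```python
import unicodedata

precomposed = "\u00c7"  # Latin capital letter C with cedilla

# NFD yields the base letter followed by the combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
# NFC goes back to the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(char):04X}" for char in decomposed])
print([f"U+{ord(char):04X}" for char in recomposed])


def strip_marks(text):
    """Decompose the text, then drop all combining marks (category M*)."""
    return "".join(
        char
        for char in unicodedata.normalize("NFD", text)
        if not unicodedata.category(char).startswith("M")
    )


print(strip_marks("Fran\u00e7ais"))
```

Normalizing both the text and the query the same way before comparing them avoids mismatches between precomposed and decomposed input.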

The Unicode database also contains a compatibility layer: if a character cannot be rendered (no font contains the requested character) or encoded to a specific encoding, Unicode proposes a replacement character sequence that looks like the character but may have a different meaning.

For example, ĳ (U+0133, Latin small ligature ij) is replaced by the two characters {i (U+0069, Latin small letter I), j (U+006A, Latin small letter J)}. The ĳ character cannot be encoded to ISO 8859-1, whereas the characters ij can.

Two extra normal forms use this compatibility layer: NFKD (decomposed) and NFKC (precomposed).
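The ĳ example can be reproduced with `unicodedata.normalize()`. This is a sketch under the assumption that a compatibility replacement is acceptable for the data at hand; `encode_latin1_with_fallback` is a hypothetical helper, and it can still raise `UnicodeEncodeError` when NFKC does not produce encodable characters.

```python
import unicodedata

ligature = "\u0133"  # Latin small ligature ij

# Canonical decomposition leaves the ligature alone...
canonical = unicodedata.normalize("NFD", ligature)
# ...whereas compatibility decomposition replaces it with "i" + "j".
compatible = unicodedata.normalize("NFKD", ligature)


def encode_latin1_with_fallback(text):
    """Encode to ISO 8859-1, retrying with the NFKC form on failure."""
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # May raise again if the NFKC form is still not encodable.
        return unicodedata.normalize("NFKC", text).encode("iso-8859-1")


print(repr(canonical), repr(compatible))
print(encode_latin1_with_fallback(ligature))
```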

Note

The precomposed forms (NFC and NFKC) begin with a decomposition (canonical for NFC, compatibility for NFKC) and then recompose the characters that have a precomposed form.
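The decomposition step is visible on characters that do not survive the round trip. The sketch below uses two examples chosen for illustration, not taken from the text: U+212B (Angstrom sign), which NFC turns into U+00C5, and U+FB01 (Latin small ligature fi), which only the compatibility forms replace.

```python
import unicodedata

angstrom_sign = "\u212b"  # canonically equivalent to U+00C5
fi_ligature = "\ufb01"    # has a compatibility decomposition only

# NFC decomposes U+212B to "A" + U+030A, then recomposes to U+00C5.
angstrom_nfd = unicodedata.normalize("NFD", angstrom_sign)
angstrom_nfc = unicodedata.normalize("NFC", angstrom_sign)

# NFC leaves the ligature unchanged; NFKC replaces it with "f" + "i".
fi_nfc = unicodedata.normalize("NFC", fi_ligature)
fi_nfkc = unicodedata.normalize("NFKC", fi_ligature)

for label, value in [
    ("NFD(U+212B)", angstrom_nfd),
    ("NFC(U+212B)", angstrom_nfc),
    ("NFC(U+FB01)", fi_nfc),
    ("NFKC(U+FB01)", fi_nfkc),
]:
    print(label, [f"U+{ord(char):04X}" for char in value])
```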
