13. Libraries

Most programming languages have little or no built-in Unicode support. Libraries are required to get full Unicode support on all platforms.

13.1. Qt library

Qt is a large C++ library covering many areas, but it is typically used to build graphical interfaces. It is distributed under the GNU LGPL (version 2.1) and is also available under a commercial license.

13.1.1. Character and string classes

QChar is a Unicode character that can only store BMP characters. It is implemented as a 16-bit unsigned integer. Interesting QChar methods:

  • isSpace(): true if the character category is a separator (Zl, Zp or Zs)

  • toUpper(): convert to uppercase

QString is a character string implemented as an array of QChar using UTF-16. A non-BMP character is stored as two QChar values (a surrogate pair). Interesting QString methods:

  • toAscii(), fromAscii(): encode to/decode from ASCII

  • toLatin1(), fromLatin1(): encode to/decode from ISO 8859-1

  • utf16(), fromUtf16(): encode to/decode from UTF-16 (in host byte order)

  • normalized(): normalize to NFC, NFD, NFKC or NFKD
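
The surrogate pair storage described above can be sketched in plain C. This is only an illustration of the UTF-16 arithmetic, not Qt code: utf16_encode() is a hypothetical helper name, and the 16-bit units stand in for QChar values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units (host byte order).
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value.
 * utf16_encode() is a hypothetical helper, not a Qt function. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP character: a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+00E9 fits in one unit (0x00E9), whereas U+1F600 needs the two units 0xD83D and 0xDE00.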

Qt decodes literal byte strings from ISO 8859-1 using the QLatin1String class, a thin wrapper around char*. QLatin1String is a character string storing each character as a single byte. This is possible because it only supports characters in the U+0000..U+00FF range. QLatin1String is not meant for manipulating text: its API is much smaller than that of QString. For example, it is not possible to concatenate two QLatin1String strings.
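
Storing one character per byte works because ISO 8859-1 byte values are identical to the first 256 Unicode code points, so decoding is a plain widening copy. A minimal C sketch, assuming 16-bit output units comparable to QChar (latin1_decode() is a hypothetical name, not part of Qt):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n ISO 8859-1 bytes into 16-bit code units.
 * Each byte value is the code point itself (U+0000 to U+00FF),
 * so the conversion cannot fail.
 * latin1_decode() is a hypothetical helper, not a Qt function. */
void latin1_decode(const unsigned char *in, size_t n, uint16_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];                 /* widen 8 bits to 16 bits */
}
```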

13.1.2. Codec

QTextCodec::codecForLocale() returns the codec of the locale encoding:

  • Windows: ANSI code page

  • Otherwise: the locale encoding. Try nl_langinfo(CODESET), then the LC_ALL, LC_CTYPE and LANG environment variables. If none of them gives useful information, fall back to ISO 8859-1.
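
The non-Windows lookup can be approximated in C with the POSIX API. This is a simplified sketch, not Qt's implementation: it lets setlocale() read LC_ALL, LC_CTYPE and LANG instead of parsing them by hand, and locale_encoding() is a hypothetical name.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return the name of the locale encoding, or "ISO-8859-1" as a fallback.
 * Assumes a POSIX system. Note that setlocale() changes process-wide state.
 * locale_encoding() is a hypothetical helper, not a Qt function. */
const char *locale_encoding(void)
{
    /* Adopt the locale described by LC_ALL, LC_CTYPE or LANG. */
    setlocale(LC_CTYPE, "");

    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);            /* POSIX returns a string, possibly empty */
    if (codeset[0] == '\0')
        return "ISO-8859-1";            /* no useful information: fall back */
    return codeset;
}
```

The returned name depends on the environment, for instance "UTF-8" in a UTF-8 locale.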

13.1.3. Filesystem


QFile::encodeName() encodes a filename from a QString to a byte string. QFile::decodeName() is the reverse operation.

Qt has two implementations of its QFSFileEngine:

  • Windows: uses the native Windows API

  • UNIX: uses the POSIX API. Examples: fopen(), getcwd() or get_current_dir_name(), mkdir(), etc.

Related classes: QFile, QFileInfo, QAbstractFileEngineHandler, QFSFileEngine.

13.2. The glib library

The glib library is a general-purpose C library distributed under the GNU LGPL license (version 2.1).

13.2.1. Character strings

The gunichar type represents a single character. It is able to store any Unicode 6.0 character (U+0000..U+10FFFF).

The glib library has no character string type. It represents strings as byte strings of type gchar*, and most functions expect them to be encoded in UTF-8.

13.2.2. Codec functions

  • g_convert(): decode from an encoding and encode to another encoding with the iconv library. Use g_convert_with_fallback() to choose how to handle undecodable bytes and unencodable characters.

  • g_locale_from_utf8() / g_locale_to_utf8(): encode to/decode from the current locale encoding.

  • g_get_charset(): get the locale encoding

    • Windows: current ANSI code page

    • OS/2: current code page (call DosQueryCp())

    • other: try nl_langinfo(CODESET), or LC_ALL, LC_CTYPE or LANG environment variables

  • g_utf8_get_char(): get the first character of a UTF-8 string as gunichar

13.2.3. Filename functions

  • g_filename_from_utf8() / g_filename_to_utf8(): encode/decode a filename to/from UTF-8

  • g_filename_display_name(): human-readable version of a filename. Try to decode the filename from each encoding in the list returned by g_get_filename_charsets(). If every attempt fails, decode the filename from UTF-8 and replace undecodable bytes with � (U+FFFD).

  • g_get_filename_charsets(): get the list of charsets used to decode and encode filenames. g_filename_display_name() tries each encoding in this list; other functions only use the first one. Use UTF-8 on Windows. On other operating systems, use:

    • G_FILENAME_ENCODING environment variable (if set): comma-separated list of character set names; the special token "@locale" is taken to mean the locale encoding

    • or the locale encoding (call g_get_charset()) if the G_BROKEN_FILENAMES environment variable is set

    • or UTF-8 otherwise

13.3. iconv library

libiconv is a library to encode and decode text in different encodings. It is distributed under the GNU LGPL license. It supports a lot of encodings including rare and old encodings.

By default, libiconv is strict: an unencodable character raises an error. You can ignore these characters by adding the //IGNORE suffix to the encoding name. There is also the //TRANSLIT suffix to replace unencodable characters with similar-looking characters.
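
The following C sketch wraps the standard iconv_open(), iconv() and iconv_close() calls in one function so that the target encoding name, with or without a suffix, can be varied. convert() is a hypothetical helper name, and the sketch assumes the whole result fits in the caller's buffer. On some systems the program must be linked with -liconv.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from encoding `from` to `to`.
 * Returns the number of bytes written to `out` (excluding the final NUL),
 * or -1 on error with errno set by iconv_open() or iconv().
 * convert() is a hypothetical helper, not part of the iconv API. */
long convert(const char *to, const char *from,
             const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;                      /* unknown encoding or suffix */

    char *inp = (char *)in;             /* iconv() takes char **, input is not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;       /* keep room for the final NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)               /* flush any pending shift sequence */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    int saved_errno = errno;
    iconv_close(cd);

    if (rc == (size_t)-1) {
        errno = saved_errno;            /* e.g. EILSEQ or E2BIG */
        return -1;
    }
    *outp = '\0';
    return (long)(outp - out);
}
```

With a strict implementation, converting the UTF-8 string "café" to "ASCII" is expected to fail with EILSEQ, whereas "ASCII//TRANSLIT" should produce an approximation such as "cafe" or "caf?". The exact result, and the return value reported when //IGNORE skips a character, vary between iconv implementations and locales.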

PHP has a built-in iconv binding.

13.4. ICU libraries

International Components for Unicode (ICU) is a mature, widely used set of C, C++ and Java libraries providing Unicode and globalization support for software applications. ICU is an open source project: releases up to ICU 57 were distributed under the ICU license (an MIT/X-style license), and later releases use the Unicode license.

13.5. libunistring

libunistring provides functions for manipulating Unicode strings and for manipulating C strings according to the Unicode standard. It is distributed under the GNU LGPL license version 3.