10. Operating systems

10.1. Windows

Since Windows 2000, Windows has offered a Unicode API that supports non-BMP characters. It uses Unicode strings implemented as wchar_t* strings (LPWSTR). wchar_t is 16 bits wide on Windows, so these strings are UTF-16: a non-BMP character is stored as two wchar_t units (a surrogate pair), and the length of a string is the number of UTF-16 units, not the number of characters.
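The difference between UTF-16 units and characters can be sketched in portable C. The helper names utf16_encode() and utf16_count_chars() are hypothetical and are not part of the Windows API; uint16_t stands in for the 16-bit wchar_t of Windows.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (U+0000..U+10FFFF, assumed not to be a surrogate)
   as UTF-16. Writes 1 or 2 units to out and returns the number written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Count the characters in a buffer of n UTF-16 units: a high surrogate
   followed by a low surrogate counts as one character. */
size_t utf16_count_chars(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < n
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                /* skip the low surrogate */
        count++;
    }
    return count;
}
```

For example, U+1F600 is encoded as the surrogate pair 0xD83D 0xDE00: the string length grows by two units for a single character.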

Windows 95, 98 and Me also had Unicode strings, but were limited to BMP characters: they used UCS-2 instead of UTF-16.

10.1.1. Code pages

A Windows application has two encodings, called code pages (abbreviated “cp”): ANSI and OEM code pages. The ANSI code page, CP_ACP, is used for the ANSI version of the Windows API to decode byte strings to character strings and has a number between 874 and 1258. The OEM code page or “IBM PC” code page, CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.

There are code page constants:

  • CP_ACP: Windows ANSI code page

  • CP_MACCP: Macintosh code page

  • CP_OEMCP: current OEM code page

  • CP_SYMBOL (42): Symbol code page

  • CP_THREAD_ACP: ANSI code page of the current thread

  • CP_UTF7 (65000): UTF-7

  • CP_UTF8 (65001): UTF-8

Functions.

UINT GetACP()

Get the ANSI code page number.

UINT GetOEMCP()

Get the OEM code page number.

BOOL SetThreadLocale(LCID locale)

Set the locale. It can be used to change the ANSI code page of current thread (CP_THREAD_ACP).

See also

Wikipedia article: Windows code page.

10.1.2. Encode and decode functions

Encode and decode functions of <windows.h>.

MultiByteToWideChar()

Decode a byte string from a code page to a character string. Use the MB_ERR_INVALID_CHARS flag to return an error on an undecodable byte sequence.

The default behaviour (flags=0) depends on the Windows version, as the examples below show.

In strict mode (MB_ERR_INVALID_CHARS), the UTF-8 decoder (CP_UTF8) returns an error on surrogate characters on Windows Vista and later. On Windows XP, the UTF-8 decoder is not strict: surrogates can be decoded in any mode.

The UTF-7 decoder (CP_UTF7) only supports flags=0.

Examples on any Windows version:

Bytes, code page      flags: default (0)     flags: MB_ERR_INVALID_CHARS
--------------------  ---------------------  ---------------------------
0xE9 0x80, cp1252     é€ {U+00E9, U+20AC}    é€ {U+00E9, U+20AC}
0xC3 0xA9, CP_UTF8    é {U+00E9}             é {U+00E9}
0xFF, cp932           {U+F8F3}               decoding error
0xFF, CP_UTF7         {U+FF}                 invalid flags

Examples on Windows Vista and later:

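Bytes, code page          flags: default (0)         flags: MB_ERR_INVALID_CHARS
------------------------  -------------------------  ---------------------------
0x81 0x00, cp932          {U+30FB, U+0000}           decoding error
0xFF, CP_UTF8             {U+FFFD}                   decoding error
0xED 0xB2 0x80, CP_UTF8   {U+FFFD, U+FFFD, U+FFFD}   decoding error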

Examples on Windows 2000, XP, 2003:

Bytes, code page          flags: default (0)   flags: MB_ERR_INVALID_CHARS
------------------------  -------------------  ---------------------------
0x81 0x00, cp932          {U+0000}             decoding error
0xFF, CP_UTF8             decoding error       decoding error
0xED 0xB2 0x80, CP_UTF8   {U+DC80}             {U+DC80}

Note

The U+30FB character is the Katakana middle dot (・). The U+F8F3 code point is part of a Unicode range reserved for private use (U+E000-U+F8FF).
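A strict decoder built on MultiByteToWideChar() might look like the following sketch. decode_strict() is a hypothetical helper name; the code only compiles on Windows, hence the _WIN32 guard, and it assumes the caller releases the result with free().

```c
#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/* Decode `size` bytes from `code_page` in strict mode.
   Returns a malloc()ed, NUL-terminated wide string, or NULL on a
   decoding error or allocation failure. */
wchar_t *decode_strict(UINT code_page, const char *bytes, int size)
{
    wchar_t *text;
    int len;

    if (size == 0)  /* MultiByteToWideChar() rejects an empty input */
        return calloc(1, sizeof(wchar_t));

    /* First call: cchWideChar = 0 asks for the length in UTF-16 units. */
    len = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                              bytes, size, NULL, 0);
    if (len == 0)
        return NULL;  /* undecodable input or invalid arguments */

    text = malloc(((size_t)len + 1) * sizeof(wchar_t));
    if (text == NULL)
        return NULL;

    /* Second call: perform the conversion into the buffer. */
    if (MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                            bytes, size, text, len) == 0) {
        free(text);
        return NULL;
    }
    text[len] = L'\0';
    return text;
}
#endif
```

As noted above, this flag cannot be used with CP_UTF7, so the sketch always fails for that code page.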

WideCharToMultiByte()

Encode a character string to a byte string. The behaviour on unencodable characters depends on the code page, the Windows version and the flags.

Code page   Windows version   Flags                  Behaviour
---------   ---------------   --------------------   ----------------------------
CP_UTF8     2000, XP, 2003    0                      Encode surrogates
CP_UTF8     Vista or later    0                      Replace surrogates by U+FFFD
CP_UTF8     Vista or later    WC_ERR_INVALID_CHARS   Strict
CP_UTF7     all versions      0                      Encode surrogates
Others      all versions      0                      Replace by similar glyph
Others      all versions      WC_NO_BEST_FIT_CHARS   Replace by ? (1)

  1. Strict if you check the lpUsedDefaultChar output parameter.

lpUsedDefaultChar is not supported by CP_UTF7 or CP_UTF8.

To get a strict encoder, use the WC_ERR_INVALID_CHARS flag for CP_UTF8, which returns an error on an unencodable character; for other code pages, use the WC_NO_BEST_FIT_CHARS flag and check lpUsedDefaultChar. By default, if a character cannot be encoded, it is replaced by a character with a similar glyph or by “?” (U+003F). For example, with cp1252, Ł (U+0141) is replaced by L (U+004C).

On Windows Vista or later with WC_ERR_INVALID_CHARS flag, the UTF-8 encoder (CP_UTF8) returns an error on surrogate characters. The default behaviour (flags=0) depends on the Windows version: surrogates are replaced by U+FFFD on Windows Vista and later, and are encoded to UTF-8 on older Windows versions. The WC_NO_BEST_FIT_CHARS flag is not supported by the UTF-8 encoder.

The WC_ERR_INVALID_CHARS flag is only supported by CP_UTF8 and only on Windows Vista or later.

The UTF-7 encoder (CP_UTF7) only supports flags=0. It is not strict: it encodes surrogate characters.
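For code pages other than CP_UTF7 and CP_UTF8, a strict encoder can be sketched by combining WC_NO_BEST_FIT_CHARS with the lpUsedDefaultChar output parameter. encode_strict() is a hypothetical helper name; the code is Windows-only and the caller is assumed to free() the result.

```c
#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/* Encode a NUL-terminated wide string to an ANSI or OEM code page.
   Returns a malloc()ed, NUL-terminated byte string, or NULL on error
   or if any character had to be replaced by the default character. */
char *encode_strict(UINT code_page, const wchar_t *text)
{
    BOOL used_default = FALSE;
    char *bytes;
    int size;

    /* cchWideChar = -1: the input is NUL-terminated and the returned
       size includes the terminating NUL byte. */
    size = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                               text, -1, NULL, 0, NULL, NULL);
    if (size == 0)
        return NULL;

    bytes = malloc((size_t)size);
    if (bytes == NULL)
        return NULL;

    size = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                               text, -1, bytes, size,
                               NULL, &used_default);
    if (size == 0 || used_default) {
        free(bytes);  /* conversion failed or a "?" was substituted */
        return NULL;
    }
    return bytes;
}
#endif
```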

Examples (on any Windows version):

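Character, code page   flags: default (0)                flags: WC_NO_BEST_FIT_CHARS
--------------------   -------------------------------   ---------------------------
ÿ (U+00FF), cp932      0x79 (y)                          0x3F (?)
Ł (U+0141), cp1252     0x4C (L)                          0x3F (?)
€ (U+20AC), cp1252     0x80                              0x80
U+DC80, CP_UTF7        0x2b 0x33 0x49 0x41 0x2d (+3IA-)  invalid flags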

Examples on Windows Vista and later:

Character, code page   default (0)      WC_ERR_INVALID_CHARS   WC_NO_BEST_FIT_CHARS
--------------------   --------------   --------------------   --------------------
U+DC80, CP_UTF8        0xEF 0xBF 0xBD   encoding error         invalid flags

Examples on Windows 2000, XP, 2003:

Character, code page   default (0)      WC_ERR_INVALID_CHARS   WC_NO_BEST_FIT_CHARS
--------------------   --------------   --------------------   --------------------
U+DC80, CP_UTF8        0xED 0xB2 0x80   invalid flags          invalid flags

Note

MultiByteToWideChar() and WideCharToMultiByte() functions are similar to mbstowcs() and wcstombs() functions.
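The byte sequences in the CP_UTF8 examples above can be checked by hand with the three-byte UTF-8 layout. The sketch below is plain arithmetic, not a Windows function; utf8_encode_3bytes() is a hypothetical name, and it deliberately does not reject surrogates, like the non-strict encoder of Windows 2000, XP and 2003.

```c
#include <stdint.h>

/* Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes
   (layout 1110xxxx 10xxxxxx 10xxxxxx). Surrogates (U+D800..U+DFFF) are
   encoded like any other value; a strict encoder would refuse them. */
void utf8_encode_3bytes(uint32_t cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* top 4 bits */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* low 6 bits */
}
```

U+DC80 gives 0xED 0xB2 0x80, the output of the older Windows versions, and U+FFFD gives 0xEF 0xBF 0xBD, the replacement written by Windows Vista and later.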

10.1.3. Windows API: ANSI and wide versions

Windows has two versions of each function of its API: the ANSI version (A suffix), which uses byte strings encoded to the ANSI code page, and the wide version (W suffix), which uses wide character strings. There are also names without a suffix that use TCHAR* strings: if _UNICODE is defined, TCHAR is replaced by wchar_t and the wide functions are used; otherwise TCHAR is replaced by char and the ANSI functions are used. (The Windows headers themselves test UNICODE, without the leading underscore, while the C runtime tests _UNICODE; the two are normally defined together.) Example:

  • CreateFileA(): bytes version, use byte strings encoded to the ANSI code page

  • CreateFileW(): Unicode version, use wide character strings

  • CreateFile(): TCHAR version depending on the _UNICODE define

Always prefer the Unicode version to avoid encoding/decoding errors, and use the W suffix directly to avoid compilation issues.

Note

There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define _MBCS to use the MBCS functions. For example, _tcsrev() is replaced by _mbsrev() if _MBCS is defined, by _wcsrev() if _UNICODE is defined, or by _strrev() otherwise.
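Calling the W function directly, as recommended above, can be sketched as follows. open_for_reading() is a hypothetical wrapper; the access, sharing and attribute flags are one reasonable choice for reading an existing file, not the only one.

```c
#ifdef _WIN32
#include <windows.h>

/* Open an existing file for reading through the wide (W) API, so the
   path is never converted through the ANSI code page.
   Returns INVALID_HANDLE_VALUE on failure; close with CloseHandle(). */
HANDLE open_for_reading(const wchar_t *path)
{
    return CreateFileW(path,
                       GENERIC_READ,          /* desired access */
                       FILE_SHARE_READ,       /* share mode */
                       NULL,                  /* default security */
                       OPEN_EXISTING,         /* do not create the file */
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);                 /* no template file */
}
#endif
```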

10.1.4. Windows string types

  • LPSTR (LPCSTR): byte string, char* (const char*)

  • LPWSTR (LPCWSTR): wide character string, wchar_t* (const wchar_t*)

  • LPTSTR (LPCTSTR): byte or wide character string depending on the _UNICODE define, TCHAR* (const TCHAR*)

10.1.5. Filenames

Windows stores filenames as Unicode in the filesystem. Filesystem wide character POSIX-like API:

int _wstat(const wchar_t *filename, struct _stat *statbuf)

Unicode version of stat().

FILE *_wfopen(const wchar_t *filename, const wchar_t *mode)

Unicode version of fopen().

int _wopen(const wchar_t *filename, int oflag[, int pmode])

Unicode version of open().

POSIX functions, like fopen(), use the ANSI code page to encode/decode strings.

10.1.6. Windows console

Console functions.

GetConsoleCP()

Get the code page of the standard input (stdin) of the console.

GetConsoleOutputCP()

Get the code page of the standard output (stdout and stderr) of the console.

WriteConsoleW()

Write a character string into the console.

To improve the Unicode support of the console, set the console font to a TrueType font (e.g. “Lucida Console”) and use the wide character API.

If the console is unable to render a character, it tries to use a character with a similar glyph. For example, with OEM code page 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, “?” (U+003F) is displayed instead.

In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page (and console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.

Note

Setting the console code page to cp65001 (UTF-8) does not improve Unicode support; it does the opposite: non-ASCII characters are not rendered correctly and typing non-ASCII characters (e.g. using the keyboard) does not work correctly, especially with raster fonts.
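Writing through the wide character API of the console might be sketched as follows. console_write() is a hypothetical helper; it assumes that a failing GetConsoleMode() means stdout is not a console (for example when output is redirected to a file), in which case WriteConsoleW() does not apply.

```c
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Write a NUL-terminated wide string to the console attached to stdout.
   Returns 1 on success, 0 if stdout is not a console or the write fails. */
int console_write(const wchar_t *text)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    DWORD written = 0;

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return 0;
    if (!GetConsoleMode(out, &mode))
        return 0;  /* stdout is redirected: not a console handle */

    /* The length is given in UTF-16 units, not in bytes. */
    return WriteConsoleW(out, text, (DWORD)wcslen(text),
                         &written, NULL) != 0;
}
#endif
```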

10.1.7. File mode

_setmode() and _wsopen() are special functions to set the encoding of a file:

  • _O_U8TEXT: UTF-8 without BOM

  • _O_U16TEXT: UTF-16 without BOM

  • _O_WTEXT: UTF-16 with BOM

fopen() can use these modes using ccs= in the file mode:

  • ccs=UNICODE: _O_WTEXT

  • ccs=UTF-8: _O_U8TEXT

  • ccs=UTF-16LE: _O_U16TEXT
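Switching an already open stream with _setmode() can be sketched as follows. stdout_set_utf16() is a hypothetical helper; the sketch assumes the Microsoft C runtime, where wide character functions such as wprintf() should be used on the stream after the switch.

```c
#ifdef _WIN32
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

/* Switch stdout to UTF-16 text mode (without BOM).
   Returns the previous translation mode, or -1 on error. */
int stdout_set_utf16(void)
{
    fflush(stdout);  /* do not mix buffered output of two modes */
    return _setmode(_fileno(stdout), _O_U16TEXT);
}
#endif
```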

10.2. Mac OS X

Mac OS X uses UTF-8 for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error. The filenames are decomposed to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: “For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed.”

10.3. Locales

To support different languages and encodings, UNIX and BSD operating systems have “locales”. Locales are process-wide: if a thread or a library changes the locale, the whole process is impacted.

10.3.1. Locale categories

Locale categories:

  • LC_COLLATE: compare and sort strings

  • LC_CTYPE: decode byte strings and encode character strings

  • LC_MESSAGES: language of messages

  • LC_MONETARY: monetary formatting

  • LC_NUMERIC: number formatting (e.g. thousands separator)

  • LC_TIME: time and date formatting

LC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.

Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in that order, and the first non-empty variable is used. If all variables are unset, the C locale is used as a fallback.
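The lookup order of the environment variables can be sketched as follows. locale_from_env() is a hypothetical helper, not a libc function: in a real program, setlocale(category, "") performs this lookup itself.

```c
#include <stdlib.h>

/* Return the locale name selected by the environment for one category.
   category_var is the name of the category variable, e.g. "LC_CTYPE".
   Lookup order: LC_ALL, then the category variable, then LANG;
   the "C" locale is the fallback. */
const char *locale_from_env(const char *category_var)
{
    const char *names[3];
    int i;

    names[0] = "LC_ALL";
    names[1] = category_var;
    names[2] = "LANG";

    for (i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;  /* first non-empty variable wins */
    }
    return "C";
}
```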

Note

The gettext library reads LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.

10.3.2. The C locale

When a program starts, it does not directly get the user locale: it uses the default locale which is called the “C” locale or the “POSIX” locale. It is also used if no locale environment variable is set. For LC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For LC_MESSAGES, the C locale means to speak the original language of the program, which is usually English.

10.3.3. Locale encoding

For Unicode, the most important locale category is LC_CTYPE: it is used to set the “locale encoding”.

To get the locale encoding:

  • Copy the current locale: setlocale(LC_CTYPE, NULL)

  • Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")

  • Use nl_langinfo(CODESET) if available

  • or setlocale(LC_CTYPE, NULL)

For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. “US-ASCII” or “646”). But on FreeBSD, Solaris and Mac OS X, codec functions (e.g. mbstowcs()) use ISO 8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO 8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").
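The steps listed above can be sketched as follows, assuming a POSIX system that provides <langinfo.h>. get_locale_encoding() is a hypothetical helper; it restores the previous LC_CTYPE locale before returning, and copies the encoding name because the string returned by nl_langinfo() may be overwritten by later calls.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy the name of the user's preferred locale encoding into buf.
   Returns 0 on success, -1 on failure. */
int get_locale_encoding(char *buf, size_t size)
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved;
    size_t n;
    int result = -1;

    if (current == NULL)
        return -1;
    n = strlen(current) + 1;
    saved = malloc(n);            /* copy: setlocale() may reuse its buffer */
    if (saved == NULL)
        return -1;
    memcpy(saved, current, n);

    if (setlocale(LC_CTYPE, "") != NULL) {   /* user preference */
        const char *codeset = nl_langinfo(CODESET);
        if (codeset[0] != '\0' && strlen(codeset) < size) {
            strcpy(buf, codeset);
            result = 0;
        }
    }

    setlocale(LC_CTYPE, saved);   /* restore the previous locale */
    free(saved);
    return result;
}
```

Because locales are process-wide, the temporary change is visible to other threads while the function runs; a program that can do so should query the encoding once at startup.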

10.3.4. Locale functions

<locale.h> functions.

char *setlocale(category, NULL)

Get the value of the specified locale category.

char *setlocale(category, name)

Set the value of the specified locale category.

<langinfo.h> functions.

char *nl_langinfo(CODESET)

Get the name of the locale encoding.

<stdlib.h> functions.

size_t mbstowcs(wchar_t *dest, const char *src, size_t n)

Decode a byte string from the locale encoding to a character string. The decoder is strict: it returns an error on an undecodable byte sequence. If available, prefer the reentrant version: mbsrtowcs().

size_t wcstombs(char *dest, const wchar_t *src, size_t n)

Encode a character string to a byte string in the locale encoding. The encoder is strict: it returns an error if a character cannot be encoded. If available, prefer the reentrant version: wcsrtombs().

mbstowcs() and wcstombs() are strict and don’t support error handlers.
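A strict decoder built on mbstowcs() can be sketched as follows. decode_locale() is a hypothetical helper; it uses the encoding of the current LC_CTYPE locale and sizes its buffer from the fact that each decoded character consumes at least one byte.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Decode a NUL-terminated byte string from the current locale encoding.
   Returns a malloc()ed, NUL-terminated wide string, or NULL on an
   undecodable byte sequence or allocation failure. */
wchar_t *decode_locale(const char *bytes)
{
    size_t max_len = strlen(bytes);  /* upper bound on the result length */
    wchar_t *text = malloc((max_len + 1) * sizeof(wchar_t));
    size_t len;

    if (text == NULL)
        return NULL;

    len = mbstowcs(text, bytes, max_len + 1);
    if (len == (size_t)-1) {         /* strict: no error handler */
        free(text);
        return NULL;
    }
    return text;
}
```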

Note

“mbs” stands for “multibyte string” (byte string) and “wcs” stands for “wide character string”.

On Windows, the “locale encodings” are the ANSI and OEM code pages. A Windows program uses the user preferred code pages at startup, whereas a program starts with the C locale on UNIX.

10.4. Filesystems (filenames)

10.4.1. CD-ROM and DVD

CD-ROM uses the ISO 9660 filesystem which stores filenames as byte strings. This filesystem is very restrictive: only A-Z, 0-9, _ and “.” are allowed. Microsoft developed the Joliet extension, which stores filenames as UCS-2, up to 64 characters (BMP only). It was first supported by Windows 95. Today, all operating systems are able to read it.

UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.

10.4.2. Microsoft: FAT and NTFS filesystems

MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page (mojibake issue).

Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports “long filenames”, which are stored as UCS-2, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters can be used: UTF-16 replaces UCS-2 and the limit is now 255 UTF-16 units.

The NTFS filesystem stores filenames using UTF-16 encoding.

10.4.3. Apple: HFS and HFS+ filesystems

HFS stores filenames as byte strings.

HFS+ stores filenames as UTF-16: the maximum length is 255 UTF-16 units.

10.4.4. Others

JFS and ZFS also use Unicode.

The ext family (ext2, ext3, ext4) stores filenames as byte strings.