10. Operating systems

10.1. Windows

Since Windows 2000, Windows offers a nice Unicode API and supports non-BMP characters. It uses Unicode strings implemented as wchar_t* strings (LPWSTR). wchar_t is 16 bits long on Windows and so it uses UTF-16: non-BMP characters are stored as two wchar_t (a surrogate pair), and the length of a string is the number of UTF-16 units and not the number of characters.
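The difference between the number of UTF-16 units and the number of characters can be sketched in portable C. The helper name utf16_count_code_points is hypothetical, and uint16_t stands in for the 16-bit Windows wchar_t so that the sketch behaves the same on platforms where wchar_t is 32 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the code points in a UTF-16 string of `len` units.
 * A high surrogate (U+D800..U+DBFF) followed by a low surrogate
 * (U+DC00..U+DFFF) counts as one code point. No validation is done:
 * a lone surrogate is simply counted as one code point too. */
size_t utf16_count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i++;  /* skip the low surrogate of the pair */
        }
        count++;
    }
    return count;
}
```

For the four units {0x0041, 0xD83D, 0xDE00, 0x00E9} (A, U+1F600, é), the function returns 3, whereas the string length in UTF-16 units is 4.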

Windows 95, 98 and Me also had Unicode strings, but they were limited to BMP characters: they used UCS-2 instead of UTF-16.

10.1.1. Code pages

A Windows application has two encodings, called code pages (abbreviated “cp”): ANSI and OEM code pages. The ANSI code page, CP_ACP, is used for the ANSI version of the Windows API to decode byte strings to character strings and has a number between 874 and 1258. The OEM code page or “IBM PC” code page, CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.

There are code page constants:

CP_ACP: Windows ANSI code page

CP_MACCP: Macintosh code page

CP_OEMCP: OEM code page of the current process

CP_SYMBOL (42): Symbol code page

CP_THREAD_ACP: ANSI code page of the current thread

CP_UTF7 (65000): UTF-7

CP_UTF8 (65001): UTF-8

Functions.

UINT GetACP(): Get the ANSI code page number.

UINT GetOEMCP(): Get the OEM code page number.

BOOL SetThreadLocale(LCID locale): Set the locale. It can be used to change the ANSI code page of the current thread (CP_THREAD_ACP).

10.1.2. Encode and decode functions

Encode and decode functions of <windows.h>.

MultiByteToWideChar()

Decode a byte string from a code page to a character string. Use the MB_ERR_INVALID_CHARS flag to return an error on an undecodable byte sequence.

The default behaviour (flags=0) depends on the Windows version:

Windows Vista and later: replace undecodable bytes

Windows 2000, XP and 2003: ignore undecodable bytes

In strict mode (MB_ERR_INVALID_CHARS), the UTF-8 decoder (CP_UTF8) returns an error on surrogate characters on Windows Vista and later. On Windows XP, the UTF-8 decoder is not strict: surrogates can be decoded in any mode.

The UTF-7 decoder (CP_UTF7) only supports flags=0.

Examples on any Windows version:

Flags	default (0)	MB_ERR_INVALID_CHARS
`0xE9 0x80`, cp1252	é€ {U+00E9, U+20AC}	é€ {U+00E9, U+20AC}
`0xC3 0xA9`, CP_UTF8	é {U+00E9}	é {U+00E9}
`0xFF`, cp932	{U+F8F3}	decoding error
`0xFF`, CP_UTF7	{U+00FF}	invalid flags

Examples on Windows Vista and later:

Flags	default (0)	MB_ERR_INVALID_CHARS
`0x81 0x00`, cp932	{U+30FB, U+0000}	decoding error
`0xFF`, CP_UTF8	{U+FFFD}	decoding error
`0xED 0xB2 0x80`, CP_UTF8	{U+FFFD, U+FFFD, U+FFFD}	decoding error

Examples on Windows 2000, XP, 2003:

Flags	default (0)	MB_ERR_INVALID_CHARS
`0x81 0x00`, cp932	{U+0000}	decoding error
`0xFF`, CP_UTF8	decoding error	decoding error
`0xED 0xB2 0x80`, CP_UTF8	{U+DC80}	{U+DC80}

Note

The U+30FB character is the Katakana middle dot (・). The U+F8F3 code point is part of a Unicode range reserved for private use (U+E000 to U+F8FF).
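A strict decoder built on MultiByteToWideChar() could look like the sketch below. The helper name decode_strict is hypothetical, and the two-call pattern (first ask for the size, then decode) is one common way to use the function rather than the only one. The #ifdef guard only keeps the file compilable on other platforms, where the helper always returns NULL. As noted above, CP_UTF7 does not accept the MB_ERR_INVALID_CHARS flag, so this sketch is not suitable for that code page.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: strictly decode `nbytes` bytes from `code_page`.
 * Returns a newly allocated, null-terminated wide string, or NULL on a
 * decoding error, on allocation failure, or on non-Windows builds. */
wchar_t *decode_strict(unsigned int code_page, const char *bytes, int nbytes)
{
#ifdef _WIN32
    if (nbytes <= 0)
        return NULL;  /* MultiByteToWideChar() fails on an empty input */

    /* First call: ask for the required size, in wchar_t units. */
    int n = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                                bytes, nbytes, NULL, 0);
    if (n == 0)
        return NULL;  /* e.g. ERROR_NO_UNICODE_TRANSLATION */

    wchar_t *wide = malloc(((size_t)n + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    /* Second call: decode into the buffer. */
    n = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                            bytes, nbytes, wide, n);
    if (n == 0) {
        free(wide);
        return NULL;
    }
    wide[n] = L'\0';
    return wide;
#else
    (void)code_page; (void)bytes; (void)nbytes;
    return NULL;  /* the API only exists on Windows */
#endif
}
```

With this helper, decode_strict(CP_UTF8, "\xC3\xA9", 2) is expected to yield é (U+00E9), and decode_strict(CP_UTF8, "\xFF", 1) to return NULL, matching the MB_ERR_INVALID_CHARS column of the tables above.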

WideCharToMultiByte()

Encode a character string to a byte string. The behaviour on unencodable characters depends on the code page, the Windows version and the flags.

Code page	Windows version	Flags	Behaviour
CP_UTF8	2000, XP, 2003	0	Encode surrogates
CP_UTF8	Vista or later	0	Replace surrogates by U+FFFD
CP_UTF8	Vista or later	WC_ERR_INVALID_CHARS	Strict
CP_UTF7	all versions	0	Encode surrogates
Others	all versions	0	Replace by similar glyph
Others	all versions	WC_NO_BEST_FIT_CHARS	Replace by ? (1)

(1) Strict if you check the value stored through the lpUsedDefaultChar pointer.

lpUsedDefaultChar is not supported by CP_UTF7 or CP_UTF8: it must be NULL for these code pages.

To get a strict encoder, use the WC_ERR_INVALID_CHARS flag for CP_UTF8, or the WC_NO_BEST_FIT_CHARS flag together with a check of lpUsedDefaultChar for other code pages. By default, if a character cannot be encoded, it is replaced by a character with a similar glyph or by “?” (U+003F). For example, with cp1252, Ł (U+0141) is replaced by L (U+004C).

On Windows Vista or later with WC_ERR_INVALID_CHARS flag, the UTF-8 encoder (CP_UTF8) returns an error on surrogate characters. The default behaviour (flags=0) depends on the Windows version: surrogates are replaced by U+FFFD on Windows Vista and later, and are encoded to UTF-8 on older Windows versions. The WC_NO_BEST_FIT_CHARS flag is not supported by the UTF-8 encoder.

The WC_ERR_INVALID_CHARS flag is only supported by CP_UTF8 and only on Windows Vista or later.

The UTF-7 encoder (CP_UTF7) only supports flags=0. It is not strict: it encodes surrogate characters.
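The three UTF-8 behaviours described above for a lone surrogate can be illustrated with a small portable sketch. This is not the Windows implementation: the function utf8_encode_unit and the surrogate_policy enum are hypothetical names, and the sketch handles one UTF-16 unit at a time (a valid surrogate pair would have to be combined into one code point before encoding).

```c
#include <stddef.h>
#include <stdint.h>

enum surrogate_policy {
    SURROGATE_ENCODE,   /* like flags=0 on Windows 2000, XP, 2003 */
    SURROGATE_REPLACE,  /* like flags=0 on Windows Vista and later */
    SURROGATE_ERROR     /* like WC_ERR_INVALID_CHARS */
};

/* Encode one UTF-16 unit (a BMP code point or a lone surrogate) to UTF-8.
 * Writes at most 3 bytes to `out`; returns the byte count, or 0 on error. */
size_t utf8_encode_unit(uint16_t unit, enum surrogate_policy policy,
                        unsigned char out[3])
{
    uint32_t cp = unit;
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        if (policy == SURROGATE_ERROR)
            return 0;
        if (policy == SURROGATE_REPLACE)
            cp = 0xFFFD;  /* REPLACEMENT CHARACTER */
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For U+DC80, SURROGATE_ENCODE produces 0xED 0xB2 0x80, SURROGATE_REPLACE produces 0xEF 0xBF 0xBD (U+FFFD), and SURROGATE_ERROR returns 0.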

Examples (on any Windows version):

Flags	default (0)	WC_NO_BEST_FIT_CHARS
ÿ (U+00FF), cp932	`0x79` (y)	`0x3F` (?)
Ł (U+0141), cp1252	`0x4C` (L)	`0x3F` (?)
€ (U+20AC), cp1252	`0x80`	`0x80`
U+DC80, CP_UTF7	`0x2B 0x33 0x49 0x41 0x2D` (+3IA-)	invalid flags

Examples on Windows Vista and later:

Flags	default (0)	WC_ERR_INVALID_CHARS	WC_NO_BEST_FIT_CHARS
U+DC80, CP_UTF8	`0xEF 0xBF 0xBD`	encoding error	invalid flags

Examples on Windows 2000, XP, 2003:

Flags	default (0)	WC_ERR_INVALID_CHARS	WC_NO_BEST_FIT_CHARS
U+DC80, CP_UTF8	`0xED 0xB2 0x80`	invalid flags	invalid flags

Note

MultiByteToWideChar() and WideCharToMultiByte() functions are similar to mbstowcs() and wcstombs() functions.
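As an illustration of strict encoding with WideCharToMultiByte() for an ANSI or OEM code page, here is a minimal sketch. The helper name encode_strict is hypothetical. It combines WC_NO_BEST_FIT_CHARS with a check of the used-default-character output, so it must not be used with CP_UTF7 or CP_UTF8. The #ifdef guard only keeps the file compilable on other platforms, where the helper always returns NULL.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: strictly encode `nchars` UTF-16 units to the given
 * ANSI/OEM code page. Returns a newly allocated, null-terminated byte
 * string, or NULL if a character cannot be encoded, on any other failure,
 * or on non-Windows builds. */
char *encode_strict(unsigned int code_page, const wchar_t *wide, int nchars)
{
#ifdef _WIN32
    if (nchars <= 0)
        return NULL;  /* WideCharToMultiByte() fails on an empty input */

    /* First call: ask for the required size, in bytes. */
    int n = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                                wide, nchars, NULL, 0, NULL, NULL);
    if (n == 0)
        return NULL;

    char *bytes = malloc((size_t)n + 1);
    if (bytes == NULL)
        return NULL;

    /* Second call: encode, and learn whether "?" had to be substituted. */
    BOOL used_default = FALSE;
    n = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                            wide, nchars, bytes, n, NULL, &used_default);
    if (n == 0 || used_default) {
        free(bytes);
        return NULL;
    }
    bytes[n] = '\0';
    return bytes;
#else
    (void)code_page; (void)wide; (void)nchars;
    return NULL;  /* the API only exists on Windows */
#endif
}
```

With cp1252, this helper is expected to encode € (U+20AC) to the single byte 0x80 and to return NULL for Ł (U+0141), instead of silently producing L or ?.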

10.1.3. Windows API: ANSI and wide versions

Windows has two versions of each API function that takes strings: the ANSI version (A suffix), which uses byte strings encoded to the ANSI code page, and the wide version (W suffix), which uses wide character strings. There are also names without a suffix that use TCHAR* strings: if the UNICODE preprocessor macro is defined (the C runtime headers check _UNICODE instead; the two are normally defined together), TCHAR is replaced by wchar_t and the wide functions are used; otherwise TCHAR is replaced by char and the ANSI functions are used. Example:

CreateFileA(): bytes version, use byte strings encoded to the ANSI code page

CreateFileW(): Unicode version, use wide character strings

CreateFile(): TCHAR version, resolved to CreateFileA() or CreateFileW() depending on the UNICODE define

Always prefer the Unicode version to avoid encoding and decoding errors, and use the W suffix directly to avoid compilation issues.
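Calling the W version explicitly could look like the following sketch. The helper name can_read_file is hypothetical, and the CreateFileW() arguments shown are just one plausible choice for a read-only open. The #ifdef guard only keeps the file compilable on other platforms, where the helper always returns 0.

```c
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: return 1 if `path` names an existing file that can
 * be opened for reading, 0 otherwise (always 0 on non-Windows builds). */
int can_read_file(const wchar_t *path)
{
#ifdef _WIN32
    /* Explicit W suffix: the call takes a wide string whatever the TCHAR
     * configuration is, and never goes through the ANSI code page. */
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 0;
    CloseHandle(h);
    return 1;
#else
    (void)path;
    return 0;
#endif
}
```

A call such as can_read_file(L"\u0141\u00F3d\u017A.txt") passes the name Łódź.txt unchanged, even when the ANSI code page cannot represent all of its characters.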

Note

There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define _MBCS to use the MBCS functions. For example, _tcsrev() is replaced by _mbsrev() if _MBCS is defined, by _wcsrev() if _UNICODE is defined, or by _strrev() otherwise.

10.1.4. Windows string types

LPSTR (LPCSTR): byte string, char* (const char*)

LPWSTR (LPCWSTR): wide character string, wchar_t* (const wchar_t*)

LPTSTR (LPCTSTR): byte or wide character string depending on the UNICODE define, TCHAR* (const TCHAR*)

10.1.5. Filenames

Windows stores filenames as Unicode in the filesystem. Filesystem wide character POSIX-like API:

int _wstat(const wchar_t *filename, struct _stat *statbuf): Unicode version of stat().

FILE *_wfopen(const wchar_t *filename, const wchar_t *mode): Unicode version of fopen().

int _wopen(const wchar_t *filename, int oflag[, int pmode]): Unicode version of open().

POSIX functions, like fopen(), use the ANSI code page to encode/decode strings.

10.1.6. Windows console

Console functions.

GetConsoleCP(): Get the code page of the standard input (stdin) of the console.

GetConsoleOutputCP(): Get the code page of the standard output (stdout and stderr) of the console.

WriteConsoleW(): Write a character string into the console.

To improve the Unicode support of the console, set the console font to a TrueType font (e.g. “Lucida Console”) and use the wide character API.

If the console is unable to render a character, it tries to use a character with a similar glyph. For example, with OEM code page 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, “?” (U+003F) is displayed instead.

In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page (and console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.

10.1.7. File mode

_setmode() and _wsopen() are special functions to set the encoding of a file:

_O_U8TEXT: UTF-8 without BOM

_O_U16TEXT: UTF-16 without BOM

_O_WTEXT: UTF-16 with BOM

fopen() can use these modes using ccs= in the file mode:

ccs=UNICODE: _O_WTEXT

ccs=UTF-8: _O_U8TEXT

ccs=UTF-16LE: _O_U16TEXT

10.2. Mac OS X

Mac OS X uses UTF-8 for filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error. Filenames are decomposed to an incompatible variant of the Normal Form D (NFD). Extract from the Technical Q&A QA1173: “For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed.”

10.3. Locales

To support different languages and encodings, UNIX and BSD operating systems have “locales”. Locales are process-wide: if a thread or a library changes the locale, the whole process is impacted.

10.3.1. Locale categories

Locale categories:

LC_COLLATE: compare and sort strings

LC_CTYPE: decode byte strings and encode character strings

LC_MESSAGES: language of messages

LC_MONETARY: monetary formatting

LC_NUMERIC: number formatting (e.g. thousands separator)

LC_TIME: time and date formatting

LC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.

Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in this order, and the first non-empty variable is used. If all variables are unset, the C locale is used as a fallback.
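The lookup order can be written down as a short sketch. The helper name effective_locale_name is hypothetical; in a real program setlocale(category, "") performs this lookup itself, so the code below only illustrates the precedence rule.

```c
#include <stdlib.h>

/* Hypothetical helper mirroring the lookup order described above.
 * `category_var` is the environment variable of the category,
 * e.g. "LC_CTYPE" or "LC_MESSAGES". */
const char *effective_locale_name(const char *category_var)
{
    const char *names[3] = { "LC_ALL", category_var, "LANG" };
    for (int i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;  /* the first non-empty variable wins */
    }
    return "C";  /* nothing is set: fall back to the C locale */
}
```

For example, with LC_ALL empty, LC_CTYPE=fr_FR.UTF-8 and LANG=en_US.UTF-8, the helper returns fr_FR.UTF-8 for "LC_CTYPE" and en_US.UTF-8 for a category whose own variable is unset.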

Note

The gettext library reads LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.

10.3.2. The C locale

When a program starts, it does not directly get the user locale: it uses the default locale, which is called the “C” locale or the “POSIX” locale. It is also used if no locale environment variable is set. For LC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For LC_MESSAGES, the C locale means to speak the original language of the program, which is usually English.

10.3.3. Locale encoding

For Unicode, the most important locale category is LC_CTYPE: it is used to set the “locale encoding”.

To get the locale encoding:

Copy the current locale: setlocale(LC_CTYPE, NULL)

Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")

Use nl_langinfo(CODESET) if available, or otherwise the locale name returned by setlocale(LC_CTYPE, NULL)

For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. “US-ASCII” or “646”). But on FreeBSD, Solaris and Mac OS X, codec functions (e.g. mbstowcs()) use ISO 8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO 8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").
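On a POSIX system, the steps listed above can be sketched as follows. The helper names get_user_locale_encoding and copy_string are hypothetical. Restoring the previous locale at the end is a design choice of this sketch, not something the steps require, and the sketch assumes nl_langinfo() is available, so it does not run on Windows.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of `s`, or NULL on allocation failure. */
static char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Hypothetical helper: return a malloc'ed copy of the name of the user's
 * locale encoding, or NULL on failure. The LC_CTYPE locale in effect
 * before the call is restored before returning. */
char *get_user_locale_encoding(void)
{
    /* 1. Copy the current locale name: the string returned by
     *    setlocale() may be overwritten by later calls. */
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved = (current != NULL) ? copy_string(current) : NULL;
    if (saved == NULL)
        return NULL;

    /* 2. Switch LC_CTYPE to the user preference. If the requested
     *    locale is not installed, the current locale is kept. */
    setlocale(LC_CTYPE, "");

    /* 3. Ask for the encoding name and copy it for the same reason. */
    char *encoding = copy_string(nl_langinfo(CODESET));

    /* 4. Restore the locale saved in step 1. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return encoding;
}
```

The caller owns the returned string and releases it with free(). Because of the caveat above, a result naming ASCII does not guarantee that the platform's codec functions really reject bytes above 0x7F.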

10.3.4. Locale functions

<locale.h> functions.

char *setlocale(category, NULL): Get the value of the specified locale category.

char *setlocale(category, name): Set the value of the specified locale category.

<langinfo.h> functions.

char *nl_langinfo(CODESET): Get the name of the locale encoding.

<stdlib.h> functions.

size_t mbstowcs(wchar_t *dest, const char *src, size_t n): Decode a byte string from the locale encoding to a character string. The decoder is strict: it returns an error on an undecodable byte sequence. If available, prefer the reentrant version: mbsrtowcs().

size_t wcstombs(char *dest, const wchar_t *src, size_t n): Encode a character string to a byte string in the locale encoding. The encoder is strict: it returns an error if a character cannot be encoded. If available, prefer the reentrant version: wcsrtombs().

mbstowcs() and wcstombs() are strict and don’t support error handlers.
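A minimal sketch of a strict decode with mbstowcs() is shown below. The helper name decode_locale is hypothetical. The first call passes a NULL destination to obtain the length, which is a POSIX extension (also available in common C runtimes) rather than something ISO C guarantees, so treat it as an assumption about the target platform.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode a null-terminated byte string from the
 * current locale encoding. Returns a newly allocated wide string, or NULL
 * on an undecodable byte sequence or on allocation failure. */
wchar_t *decode_locale(const char *bytes)
{
    /* With a NULL destination, mbstowcs() only computes the number of
     * wide characters needed, not counting the terminator. */
    size_t n = mbstowcs(NULL, bytes, 0);
    if (n == (size_t)-1)
        return NULL;  /* strict decoder: invalid byte sequence */

    wchar_t *wide = malloc((n + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    /* n + 1 leaves room for the terminating null wide character. */
    if (mbstowcs(wide, bytes, n + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

The result depends on the LC_CTYPE locale in effect: a program that never calls setlocale() decodes with the C locale, not with the user's encoding.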

Note

“mbs” stands for “multibyte string” (byte string) and “wcs” stands for “wide character string”.

On Windows, the “locale encodings” are the ANSI and OEM code pages. A Windows program uses the user preferred code pages at startup, whereas a program starts with the C locale on UNIX.

10.4. Filesystems (filenames)

10.4.1. CD-ROM and DVD

CD-ROM uses the ISO 9660 filesystem, which stores filenames as byte strings. This filesystem is very restrictive: only A-Z, 0-9, _ and “.” are allowed. Microsoft developed the Joliet extension, which stores filenames as UCS-2, up to 64 characters (BMP only). It was first supported by Windows 95. Today, all operating systems are able to read it.

UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.

10.4.2. Microsoft: FAT and NTFS filesystems

MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page (mojibake issue).

Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports “long filenames”, which are stored as UCS-2, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters can be used: UTF-16 replaces UCS-2 and the limit is now 255 UTF-16 units.

The NTFS filesystem stores filenames using UTF-16 encoding.

10.4.3. Apple: HFS and HFS+ filesystems

HFS stores filenames as byte strings.

HFS+ stores filenames as UTF-16: the maximum length is 255 UTF-16 units.

10.4.4. Others

JFS and ZFS also use Unicode.

The ext family (ext2, ext3, ext4) stores filenames as byte strings.
