10. Operating systems
10.1. Windows
Since Windows 2000, Windows offers a Unicode API that supports non-BMP characters. It uses Unicode strings implemented as wchar_t* strings (LPWSTR). wchar_t is 16 bits long on Windows, so it uses UTF-16: a non-BMP character is stored as two wchar_t units (a surrogate pair), and the length of a string is the number of UTF-16 units, not the number of characters.
Windows 95, 98 and Me also had Unicode strings, but were limited to BMP characters: they used UCS-2 instead of UTF-16.
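The difference between UTF-16 units and characters can be illustrated with a small, portable sketch. It does not use the Windows API: the function names utf16_encode() and utf16_count_code_points() are hypothetical helpers introduced only for this illustration, and uint16_t stands in for the 16-bit wchar_t of Windows.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (U+0000..U+10FFFF, not a surrogate) as UTF-16.
   Writes 1 or 2 units into out and returns the number of units written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Count the characters (code points) in a buffer of n UTF-16 units.
   A high surrogate followed by a low surrogate counts as one character;
   a lone surrogate is counted as one character as well. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                 /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For a string made of "A" followed by one non-BMP character, the buffer holds three UTF-16 units but utf16_count_code_points() reports two characters.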
10.1.1. Code pages
A Windows application has two encodings, called code pages (abbreviated “cp”): the ANSI and OEM code pages. The ANSI code page, CP_ACP, is used by the ANSI version of the Windows API to decode byte strings to character strings and has a number between 874 and 1258. The OEM code page or “IBM PC” code page, CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.
There are code page constants, such as CP_ACP, CP_OEMCP, CP_THREAD_ACP, CP_UTF7 and CP_UTF8.
Functions.
- UINT GetACP()
Get the ANSI code page number.
- UINT GetOEMCP()
Get the OEM code page number.
- BOOL SetThreadLocale(LCID locale)
Set the locale. It can be used to change the ANSI code page of the current thread (CP_THREAD_ACP).
See also
Wikipedia article: Windows code page.
10.1.2. Encode and decode functions
Encode and decode functions of <windows.h>.
- MultiByteToWideChar()
Decode a byte string from a code page to a character string. Use the MB_ERR_INVALID_CHARS flag to return an error on an undecodable byte sequence.
The default behaviour (flags=0) depends on the Windows version:
Windows Vista and later: replace undecodable bytes
Windows 2000, XP and 2003: ignore undecodable bytes
In strict mode (MB_ERR_INVALID_CHARS), the UTF-8 decoder (CP_UTF8) returns an error on surrogate characters on Windows Vista and later. On Windows XP, the UTF-8 decoder is not strict: surrogates can be decoded in any mode.
The UTF-7 decoder (CP_UTF7) only supports flags=0.
Examples on any Windows version:
Bytes, code page | flags=0 (default) | MB_ERR_INVALID_CHARS
0xE9 0x80, cp1252 | é€ {U+00E9, U+20AC} | é€ {U+00E9, U+20AC}
0xC3 0xA9, CP_UTF8 | é {U+00E9} | é {U+00E9}
0xFF, cp932 | {U+F8F3} | decoding error
0xFF, CP_UTF7 | {U+FF} | invalid flags
Examples on Windows Vista and later:
Bytes, code page | flags=0 (default) | MB_ERR_INVALID_CHARS
0x81 0x00, cp932 | {U+30FB, U+0000} | decoding error
0xFF, CP_UTF8 | {U+FFFD} | decoding error
0xED 0xB2 0x80, CP_UTF8 | {U+FFFD, U+FFFD, U+FFFD} | decoding error
Examples on Windows 2000, XP, 2003:
Bytes, code page | flags=0 (default) | MB_ERR_INVALID_CHARS
0x81 0x00, cp932 | {U+0000} | decoding error
0xFF, CP_UTF8 | decoding error | decoding error
0xED 0xB2 0x80, CP_UTF8 | {U+DC80} | {U+DC80}
Note
The U+30FB character is the Katakana middle dot (・). The U+F8F3 code point is part of a Unicode range reserved for private use (U+E000 to U+F8FF).
- WideCharToMultiByte()
Encode a character string to a byte string. The behaviour on unencodable characters depends on the code page, the Windows version and the flags.
Code page | Windows version | Flags | Behaviour
CP_UTF8 | 2000, XP, 2003 | 0 | Encode surrogates
CP_UTF8 | Vista or later | 0 | Replace surrogates by U+FFFD
CP_UTF8 | Vista or later | WC_ERR_INVALID_CHARS | Strict
CP_UTF7 | all versions | 0 | Encode surrogates
Others | all versions | 0 | Replace by similar glyph
Others | all versions | WC_NO_BEST_FIT_CHARS | Replace by ? (1)
(1) Strict if you check the lpUsedDefaultChar output parameter. lpUsedDefaultChar is not supported by CP_UTF7 or CP_UTF8.
Use the WC_NO_BEST_FIT_CHARS flag (or the WC_ERR_INVALID_CHARS flag for CP_UTF8) to have a strict encoder: return an error on an unencodable character. By default, if a character cannot be encoded, it is replaced by a character with a similar glyph or by “?” (U+003F). For example, with cp1252, Ł (U+0141) is replaced by L (U+004C).
On Windows Vista or later with the WC_ERR_INVALID_CHARS flag, the UTF-8 encoder (CP_UTF8) returns an error on surrogate characters. The default behaviour (flags=0) depends on the Windows version: surrogates are replaced by U+FFFD on Windows Vista and later, and are encoded to UTF-8 on older Windows versions. The WC_NO_BEST_FIT_CHARS flag is not supported by the UTF-8 encoder.
The WC_ERR_INVALID_CHARS flag is only supported by CP_UTF8 and only on Windows Vista or later.
The UTF-7 encoder (CP_UTF7) only supports flags=0. It is not strict: it encodes surrogate characters.
Examples (on any Windows version):
Character, code page | flags=0 (default) | WC_NO_BEST_FIT_CHARS
ÿ (U+00FF), cp932 | 0x79 (y) | 0x3F (?)
Ł (U+0141), cp1252 | 0x4C (L) | 0x3F (?)
€ (U+20AC), cp1252 | 0x80 | 0x80
U+DC80, CP_UTF7 | 0x2b 0x33 0x49 0x41 0x2d (+3IA-) | invalid flags
Examples on Windows Vista and later:
Character, code page | flags=0 (default) | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS
U+DC80, CP_UTF8 | 0xEF 0xBF 0xBD | encoding error | invalid flags
Examples on Windows 2000, XP, 2003:
Character, code page | flags=0 (default) | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS
U+DC80, CP_UTF8 | 0xED 0xB2 0x80 | invalid flags | invalid flags
Note
The MultiByteToWideChar() and WideCharToMultiByte() functions are similar to the mbstowcs() and wcstombs() functions.
10.1.3. Windows API: ANSI and wide versions
Windows has two versions of each function of its API: the ANSI version using byte strings (A suffix) and the ANSI code page, and the wide version (W suffix) using character strings. There are also functions without suffix using TCHAR* strings: if the C define _UNICODE is defined, TCHAR is replaced by wchar_t and the Unicode functions are used; otherwise TCHAR is replaced by char and the ANSI functions are used.
Example:
CreateFileA(): bytes version, uses byte strings encoded to the ANSI code page
CreateFileW(): Unicode version, uses wide character strings
CreateFile(): TCHAR version, depending on the _UNICODE define
Always prefer the Unicode version to avoid encoding/decoding errors, and use the W suffix directly to avoid compilation issues.
Note
There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define _MBCS to use the MBCS functions. For example, _tcsrev() is replaced by _mbsrev() if _MBCS is defined, by _wcsrev() if _UNICODE is defined, or by _strrev() otherwise.
10.1.4. Windows string types
LPSTR (LPCSTR): byte string, char* (const char*)
LPWSTR (LPCWSTR): wide character string, wchar_t* (const wchar_t*)
LPTSTR (LPCTSTR): byte or wide character string depending on the _UNICODE define, TCHAR* (const TCHAR*)
10.1.5. Filenames
Windows stores filenames as Unicode in the filesystem. It also provides a wide character, POSIX-like filesystem API (for example _wfopen()).
POSIX functions, like fopen(), use the ANSI code page to encode/decode strings.
10.1.6. Windows console
Console functions.
- GetConsoleCP()
Get the code page of the standard input (stdin) of the console.
- GetConsoleOutputCP()
Get the code page of the standard output (stdout and stderr) of the console.
- WriteConsoleW()
Write a character string into the console.
To improve the Unicode support of the console, set the console font to a TrueType font (e.g. “Lucida Console”) and use the wide character API.
If the console is unable to render a character, it tries to use a character with a similar glyph. For example, with OEM code page 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, “?” (U+003F) is displayed instead.
In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page (and console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.
See also
Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? (Michael S. Kaplan, 2008) and the Python bug report #1602: windows console doesn’t print or input Unicode.
10.1.7. File mode
_setmode() and _wsopen() are special functions to set the encoding of a file.
fopen() can select such encodings using ccs= in the file mode:
ccs=UNICODE: _O_WTEXT
ccs=UTF-8: _O_UTF8
ccs=UTF-16LE: _O_UTF16
10.2. Mac OS X
Mac OS X uses UTF-8 for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error. The filenames are decomposed to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: “For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed.”
10.3. Locales
To support different languages and encodings, UNIX and BSD operating systems have “locales”. Locales are process-wide: if a thread or a library changes the locale, the whole process is affected.
10.3.1. Locale categories
Locale categories:
LC_COLLATE: compare and sort strings
LC_CTYPE: decode byte strings and encode character strings
LC_MESSAGES: language of messages
LC_MONETARY: monetary formatting
LC_NUMERIC: number formatting (e.g. thousands separator)
LC_TIME: time and date formatting
LC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.
Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in this order, and the first non-empty variable is used. If all variables are unset, the C locale is used as a fallback.
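The lookup order described above can be sketched in a few lines of C. This is only an illustration of the rule, not the code of any C library: locale_from_env() is a hypothetical helper, and a real setlocale(category, "") implementation may apply further checks (for example, whether the named locale is installed).

```c
#include <stdlib.h>

/* Return the locale name selected by the environment for one category.
   category_var is the name of the category's own variable, e.g. "LC_CTYPE".
   Lookup order: LC_ALL, then the category variable, then LANG.
   Falls back to "C" if none of them is set to a non-empty value. */
const char *locale_from_env(const char *category_var)
{
    const char *names[3] = { "LC_ALL", category_var, "LANG" };
    for (int i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;       /* the first non-empty variable wins */
    }
    return "C";
}
```

For example, with LC_ALL empty, LC_CTYPE=fr_FR.UTF-8 and LANG=en_US.UTF-8, locale_from_env("LC_CTYPE") returns the LC_CTYPE value, while locale_from_env("LC_TIME") falls through to LANG.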
Note
The gettext library reads the LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.
10.3.2. The C locale
When a program starts, it does not directly get the user locale: it uses the default locale, which is called the “C” locale or the “POSIX” locale. It is also used if no locale environment variable is set. For LC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For LC_MESSAGES, the C locale means to speak the original language of the program, which is usually English.
10.3.3. Locale encoding
For Unicode, the most important locale category is LC_CTYPE: it is used to set the “locale encoding”.
To get the locale encoding:
Copy the current locale: setlocale(LC_CTYPE, NULL)
Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")
Use nl_langinfo(CODESET) if available
or setlocale(LC_CTYPE, NULL)
For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. “US-ASCII” or “646”). But on FreeBSD, Solaris and Mac OS X, codec functions (e.g. mbstowcs()) use ISO 8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO 8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").
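A minimal sketch of the main steps, assuming a POSIX system where nl_langinfo() is available. The helper name user_locale_encoding() is hypothetical. For brevity it does not save and restore the previous LC_CTYPE value, so the process keeps the user locale after the call.

```c
#define _POSIX_C_SOURCE 200809L  /* nl_langinfo() is POSIX, not ISO C */
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Switch LC_CTYPE to the user's preferred locale and return the name of
   its encoding. Returns NULL if the locale cannot be set.
   The returned string may be overwritten by later locale calls, so copy
   it if it has to be kept. */
const char *user_locale_encoding(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return NULL;            /* the environment names an unknown locale */
    return nl_langinfo(CODESET);
}
```

The exact name returned depends on the platform: the same encoding may be reported under different aliases, as noted above for ASCII.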
10.3.4. Locale functions
<locale.h> functions.
- char *setlocale(category, NULL)
Get the value of the specified locale category.
- char *setlocale(category, name)
Set the value of the specified locale category.
<langinfo.h> functions.
- char *nl_langinfo(CODESET)
Get the name of the locale encoding.
<stdlib.h> functions.
- size_t mbstowcs(wchar_t *dest, const char *src, size_t n)
Decode a byte string from the locale encoding to a character string. The decoder is strict: it returns an error on an undecodable byte sequence. If available, prefer the reentrant version: mbsrtowcs().
- size_t wcstombs(char *dest, const wchar_t *src, size_t n)
Encode a character string to a byte string in the locale encoding. The encoder is strict: it returns an error if a character cannot be encoded. If available, prefer the reentrant version: wcsrtombs().
mbstowcs() and wcstombs() are strict and don’t support error handlers.
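Because there is no error handler, the caller has to test the return value against (size_t)-1. The following sketch shows one way to do it. decode_locale() is a hypothetical helper, and the first call relies on the POSIX behaviour that a NULL destination makes mbstowcs() return the required length, which ISO C does not guarantee.

```c
#include <stdlib.h>
#include <wchar.h>

/* Decode a NUL-terminated byte string from the current locale encoding.
   Returns a newly allocated wide string (the caller frees it), or NULL on
   an undecodable byte sequence or on allocation failure. */
wchar_t *decode_locale(const char *bytes)
{
    /* Assumption (POSIX): with a NULL destination, only the length is computed. */
    size_t len = mbstowcs(NULL, bytes, 0);
    if (len == (size_t)-1)
        return NULL;            /* undecodable byte sequence */

    wchar_t *wide = malloc((len + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    if (mbstowcs(wide, bytes, len + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;                /* len characters plus the terminating L'\0' */
}
```

The result depends on the current LC_CTYPE locale, so a program that wants the user encoding should call setlocale(LC_CTYPE, "") first.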
Note
“mbs” stands for “multibyte string” (byte string) and “wcs” stands for “wide character string”.
On Windows, the “locale encodings” are the ANSI and OEM code pages. A Windows program uses the user's preferred code pages at startup, whereas a program starts with the C locale on UNIX.
10.4. Filesystems (filenames)
10.4.1. CD-ROM and DVD
CD-ROM uses the ISO 9660 filesystem, which stores filenames as byte strings. This filesystem is very restrictive: only A-Z, 0-9, _ and “.” are allowed. Microsoft developed the Joliet extension, which stores filenames as UCS-2, up to 64 characters (BMP only). It was first supported by Windows 95. Today, all operating systems are able to read it.
UDF (Universal Disk Format) is the filesystem of DVDs: it stores filenames as character strings.
10.4.2. Microsoft: FAT and NTFS filesystems
MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page (mojibake issue).
Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports “long filenames”, which are stored as UCS-2, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters can be used: UTF-16 replaces UCS-2 and the limit is now 255 UTF-16 units.
The NTFS filesystem stores filenames using UTF-16 encoding.
10.4.3. Apple: HFS and HFS+ filesystems
HFS stores filenames as byte strings.
HFS+ stores filenames as UTF-16: the maximum length is 255 UTF-16 units.
10.4.4. Others
JFS and ZFS also use Unicode.
The ext family (ext2, ext3, ext4) stores filenames as byte strings.