.. _oses:

Operating systems
=================

.. todo:: write an intro for all OS?

.. _win:
.. _Windows:

Windows
-------

Since Windows 2000, Windows offers a nice Unicode API and supports
:ref:`non-BMP characters `. It uses :ref:`Unicode strings ` implemented as
:c:type:`wchar_t*` strings (LPWSTR). :c:type:`wchar_t` is 16 bits long on
Windows and so it uses :ref:`UTF-16 `: :ref:`non-BMP ` characters are stored
as two :c:type:`wchar_t` (a :ref:`surrogate pair `), and the length of a
string is the number of UTF-16 units and not the number of characters.

Windows 95, 98 and Me also had Unicode strings, but were limited to
:ref:`BMP characters `: they used :ref:`UCS-2 ` instead of UTF-16.

.. todo:: And Windows CE?

.. index: Code page
.. _codepage:

Code pages
''''''''''

A Windows application has two encodings, called code pages (abbreviated
"cp"): ANSI and OEM code pages. The ANSI code page, :c:macro:`CP_ACP`, is
used for the ANSI version of the :ref:`Windows API ` to decode
:ref:`byte strings ` to :ref:`character strings ` and has a number between
874 and 1258. The OEM code page or "IBM PC" code page, :c:macro:`CP_OEMCP`,
comes from MS-DOS, is used for the :ref:`Windows console `, contains glyphs
to create text interfaces (draw boxes) and has a number between 437 and 874.
Example of a French setup: ANSI is :ref:`cp1252` and OEM is cp850.

There are code page constants:

* :c:macro:`CP_ACP`: Windows ANSI code page
* :c:macro:`CP_MACCP`: Macintosh code page
* :c:macro:`CP_OEMCP`: current OEM code page
* :c:macro:`CP_SYMBOL` (42): Symbol code page
* :c:macro:`CP_THREAD_ACP`: ANSI code page of the current thread
* :c:macro:`CP_UTF7` (65000): :ref:`UTF-7 `
* :c:macro:`CP_UTF8` (65001): :ref:`UTF-8 `

Functions.

.. c:function:: UINT GetACP()

   Get the ANSI code page number.

.. c:function:: UINT GetOEMCP()

   Get the OEM code page number.

.. c:function:: BOOL SetThreadLocale(LCID locale)

   Set the locale.
   It can be used to change the ANSI code page of the current thread
   (:c:macro:`CP_THREAD_ACP`).

.. seealso::

   Wikipedia article: `Windows code page `_.

Encode and decode functions
'''''''''''''''''''''''''''

Encode and decode functions of ````.

.. c:function:: MultiByteToWideChar()

   :ref:`Decode ` a :ref:`byte string ` from a code page to a
   :ref:`character string `. Use the :c:macro:`MB_ERR_INVALID_CHARS` flag to
   :ref:`return an error ` on an :ref:`undecodable byte sequence `.

   The default behaviour (flags=0) depends on the Windows version:

   - Windows Vista and later: :ref:`replace ` :ref:`undecodable bytes `
   - Windows 2000, XP and 2003: :ref:`ignore ` :ref:`undecodable bytes `

   In strict mode (:c:macro:`MB_ERR_INVALID_CHARS`), the :ref:`UTF-8 `
   decoder (:c:macro:`CP_UTF8`) returns an error on
   :ref:`surrogate characters ` on Windows Vista and later. On Windows XP,
   the :ref:`UTF-8 decoder is not strict `: surrogates can be decoded in any
   mode.

   The :ref:`UTF-7 ` decoder (:c:macro:`CP_UTF7`) only supports flags=0.
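To make the surrogate case concrete: a lone surrogate such as U+DC80 corresponds to the three-byte sequence ``0xED 0xB2 0x80``, which a strict UTF-8 decoder has to reject. The portable sketch below is not part of the Windows API; ``is_surrogate`` and ``utf8_decode3`` are hypothetical helper names used only for illustration. It decodes one three-byte sequence and, when asked to be strict, adds the surrogate check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return non-zero if cp is in the surrogate range U+D800..U+DFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes do not have that shape,
 * if the encoding is overlong (below U+0800), or, when strict is
 * non-zero, if the sequence encodes a surrogate. */
long utf8_decode3(const unsigned char *s, size_t len, int strict)
{
    uint32_t cp;

    assert(s != NULL);
    if (len < 3 || (s[0] & 0xF0) != 0xE0
        || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return -1;
    cp = ((uint32_t)(s[0] & 0x0F) << 12)
       | ((uint32_t)(s[1] & 0x3F) << 6)
       |  (uint32_t)(s[2] & 0x3F);
    if (cp < 0x800)
        return -1;                  /* overlong encoding */
    if (strict && is_surrogate(cp))
        return -1;                  /* strict decoders reject surrogates */
    return (long)cp;
}
```

With ``strict`` set to 0, the bytes ``0xED 0xB2 0x80`` decode to U+DC80, which matches the non-strict behaviour described for Windows XP; with ``strict`` set to 1 the same bytes are rejected, as a strict decoder would do.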
   Examples on any Windows version:

   +------------------------+---------------------+----------------------+
   | Flags                  | default (0)         | MB_ERR_INVALID_CHARS |
   +========================+=====================+======================+
   | ``0xE9 0x80``, cp1252  | é€ {U+00E9, U+20AC} | é€ {U+00E9, U+20AC}  |
   +------------------------+---------------------+----------------------+
   | ``0xC3 0xA9``, CP_UTF8 | é {U+00E9}          | é {U+00E9}           |
   +------------------------+---------------------+----------------------+
   | ``0xFF``, cp932        | {U+F8F3}            | *decoding error*     |
   +------------------------+---------------------+----------------------+
   | ``0xFF``, CP_UTF7      | {U+FF}              | *invalid flags*      |
   +------------------------+---------------------+----------------------+

   Examples on Windows Vista and later:

   +-----------------------------+--------------------------+----------------------+
   | Flags                       | default (0)              | MB_ERR_INVALID_CHARS |
   +=============================+==========================+======================+
   | ``0x81 0x00``, cp932        | {U+30FB, U+0000}         | *decoding error*     |
   +-----------------------------+--------------------------+----------------------+
   | ``0xFF``, CP_UTF8           | {U+FFFD}                 | *decoding error*     |
   +-----------------------------+--------------------------+----------------------+
   | ``0xED 0xB2 0x80``, CP_UTF8 | {U+FFFD, U+FFFD, U+FFFD} | *decoding error*     |
   +-----------------------------+--------------------------+----------------------+

   Examples on Windows 2000, XP, 2003:

   +-----------------------------+------------------+----------------------+
   | Flags                       | default (0)      | MB_ERR_INVALID_CHARS |
   +=============================+==================+======================+
   | ``0x81 0x00``, cp932        | {U+0000}         | *decoding error*     |
   +-----------------------------+------------------+----------------------+
   | ``0xFF``, CP_UTF8           | *decoding error* | *decoding error*     |
   +-----------------------------+------------------+----------------------+
   | ``0xED 0xB2 0x80``, CP_UTF8 | {U+DC80}         | {U+DC80}             |
   +-----------------------------+------------------+----------------------+

   .. note::

      The U+30FB character is the Katakana middle dot (・). The U+F8F3 code
      point is part of a Unicode range reserved for private use
      (U+E000-U+F8FF).

.. c:function:: WideCharToMultiByte()

   :ref:`Encode ` a :ref:`character string ` to a :ref:`byte string `. The
   behaviour on :ref:`unencodable characters ` depends on the code page, the
   Windows version and the flags.

   +-----------+----------------------+----------------------+------------------------------+
   | Code page | Windows version      | Flags                | Behaviour                    |
   +===========+======================+======================+==============================+
   |           | 2000, XP, 2003       | 0                    | Encode surrogates            |
   |           +----------------------+----------------------+------------------------------+
   | CP_UTF8   |                      | 0                    | Replace surrogates by U+FFFD |
   |           | Vista or later       +----------------------+------------------------------+
   |           |                      | WC_ERR_INVALID_CHARS | Strict                       |
   +-----------+----------------------+----------------------+------------------------------+
   | CP_UTF7   | *all versions*       | 0                    | Encode surrogates            |
   +-----------+----------------------+----------------------+------------------------------+
   | Others    |                      | 0                    | Replace by similar glyph     |
   |           | *all versions*       +----------------------+------------------------------+
   |           |                      | WC_NO_BEST_FIT_CHARS | Replace by ? (1)             |
   +-----------+----------------------+----------------------+------------------------------+

   (1): Strict if you check the flag returned through the
   ``lpUsedDefaultChar`` pointer. ``lpUsedDefaultChar`` is not supported by
   CP_UTF7 or CP_UTF8.

   Use the :c:macro:`WC_NO_BEST_FIT_CHARS` flag (or the
   :c:macro:`WC_ERR_INVALID_CHARS` flag for :c:macro:`CP_UTF8`) to have a
   strict encoder: :ref:`return an error ` on an
   :ref:`unencodable character `. By default, if
   :ref:`a character cannot be encoded `, it is
   :ref:`replaced by a character with a similar glyph ` or by "?" (U+003F).
   For example, with :ref:`cp1252`, Ł (U+0141) is replaced by L (U+004C).
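The "replace by ? and report it" pattern of footnote (1) can be sketched portably. The code below is not ``WideCharToMultiByte()``: it is a hypothetical ``encode_latin1`` helper that targets ISO-8859-1 (chosen only because its mapping is trivial, it is not a Windows code page) and mimics the idea of a default character plus a "used default char" flag. A caller that treats the flag as an error obtains a strict encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Encode the nul-terminated string src to ISO-8859-1 into dest, a
 * buffer of n bytes (including room for the final nul byte).
 * Characters above U+00FF are replaced by '?' and *used_default is
 * set to 1. Returns the number of bytes written (nul excluded), or
 * (size_t)-1 if dest is too small. */
size_t encode_latin1(char *dest, size_t n, const wchar_t *src,
                     int *used_default)
{
    size_t i;

    assert(dest != NULL && src != NULL && used_default != NULL);
    *used_default = 0;
    for (i = 0; src[i] != L'\0'; i++) {
        if (i + 1 >= n)
            return (size_t)-1;      /* no room for this byte + nul */
        if ((unsigned long)src[i] <= 0xFF) {
            dest[i] = (char)(unsigned char)src[i];
        } else {
            dest[i] = '?';          /* unencodable: default character */
            *used_default = 1;
        }
    }
    if (i >= n)
        return (size_t)-1;          /* n == 0: no room for the nul */
    dest[i] = '\0';
    return i;
}
```

Encoding ``L"a\u0141"`` with this helper writes ``a?`` and sets the flag, so a strict caller only has to test ``used_default`` after the call.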
   On Windows Vista or later with the :c:macro:`WC_ERR_INVALID_CHARS` flag,
   the :ref:`UTF-8 ` encoder (:c:macro:`CP_UTF8`) returns an error on
   :ref:`surrogate characters `. The default behaviour (flags=0) depends on
   the Windows version: surrogates are replaced by U+FFFD on Windows Vista
   and later, and are encoded to UTF-8 on older Windows versions. The
   :c:macro:`WC_NO_BEST_FIT_CHARS` flag is not supported by the UTF-8
   encoder.

   The :c:macro:`WC_ERR_INVALID_CHARS` flag is only supported by
   :c:macro:`CP_UTF8` and only on Windows Vista or later.

   The :ref:`UTF-7 ` encoder (:c:macro:`CP_UTF7`) only supports flags=0. It
   is not strict: it encodes :ref:`surrogate characters `.

   Examples (on any Windows version):

   +--------------------+--------------------------------------+----------------------+
   | Flags              | default (0)                          | WC_NO_BEST_FIT_CHARS |
   +====================+======================================+======================+
   | ÿ (U+00FF), cp932  | ``0x79`` (y)                         | ``0x3F`` (?)         |
   +--------------------+--------------------------------------+----------------------+
   | Ł (U+0141), cp1252 | ``0x4C`` (L)                         | ``0x3F`` (?)         |
   +--------------------+--------------------------------------+----------------------+
   | € (U+20AC), cp1252 | ``0x80``                             | ``0x80``             |
   +--------------------+--------------------------------------+----------------------+
   | U+DC80, CP_UTF7    | ``0x2b 0x33 0x49 0x41 0x2d`` (+3IA-) | *invalid flags*      |
   +--------------------+--------------------------------------+----------------------+

   Examples on Windows Vista and later:

   +--------------------+--------------------+----------------------+----------------------+
   | Flags              | default (0)        | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS |
   +====================+====================+======================+======================+
   | U+DC80, CP_UTF8    | ``0xEF 0xBF 0xBD`` | *encoding error*     | *invalid flags*      |
   +--------------------+--------------------+----------------------+----------------------+

   Examples on Windows 2000, XP, 2003:

   +--------------------+--------------------+----------------------+----------------------+
   | Flags              | default (0)        | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS |
   +====================+====================+======================+======================+
   | U+DC80, CP_UTF8    | ``0xED 0xB2 0x80`` | *invalid flags*      | *invalid flags*      |
   +--------------------+--------------------+----------------------+----------------------+

.. note::

   :c:func:`MultiByteToWideChar` and :c:func:`WideCharToMultiByte` functions
   are similar to :c:func:`mbstowcs` and :c:func:`wcstombs` functions.

.. todo:: Document NormalizeString()
.. todo:: Document the replacement character?

.. _win_api:

Windows API: ANSI and wide versions
'''''''''''''''''''''''''''''''''''

Windows has two versions of each function of its API: the ANSI version using
:ref:`byte strings ` (``A`` suffix) and the :ref:`ANSI code page `, and the
wide version (``W`` suffix) using :ref:`character strings `.
There are also functions without suffix using :c:type:`TCHAR*` strings: if
the :ref:`C ` macro :c:macro:`_UNICODE` is defined, :c:type:`TCHAR` is
replaced by :c:type:`wchar_t` and the Unicode functions are used; otherwise
:c:type:`TCHAR` is replaced by :c:type:`char` and the ANSI functions are
used. Example:

* :c:func:`CreateFileA()`: bytes version, uses :ref:`byte strings ` encoded
  to the ANSI code page
* :c:func:`CreateFileW()`: Unicode version, uses
  :ref:`wide character strings `
* :c:func:`CreateFile()`: :c:type:`TCHAR` version depending on the
  :c:macro:`_UNICODE` define

Always prefer the Unicode version to avoid encoding/decoding errors, and use
the ``W`` suffix directly to avoid compilation issues.

.. note::

   There is a third version of the API: the MBCS API (multibyte character
   string). Use the TCHAR functions and define :c:macro:`_MBCS` to use the
   MBCS functions. For example, :c:func:`_tcsrev` is replaced by
   :c:func:`_mbsrev` if :c:macro:`_MBCS` is defined, by :c:func:`_wcsrev` if
   :c:macro:`_UNICODE` is defined, or by :c:func:`_strrev` otherwise.

Windows string types
''''''''''''''''''''

* LPSTR (LPCSTR): :ref:`byte string `, :c:type:`char*`
  (:c:type:`const char*`)
* LPWSTR (LPCWSTR): :ref:`wide character string `, :c:type:`wchar_t*`
  (:c:type:`const wchar_t*`)
* LPTSTR (LPCTSTR): byte or wide character string depending on the
  ``_UNICODE`` define, :c:type:`TCHAR*` (:c:type:`const TCHAR*`)

Filenames
'''''''''

Windows stores filenames as Unicode in the filesystem. Filesystem wide
character POSIX-like API:

.. c:function:: int _wstat(const wchar_t* filename, struct _stat *statbuf)

   Unicode version of :c:func:`stat()`.

.. c:function:: FILE *_wfopen(const wchar_t* filename, const wchar_t *mode)

   Unicode version of :c:func:`fopen`.

.. c:function:: int _wopen(const wchar_t *filename, int oflag[, int pmode])

   Unicode version of :c:func:`open`.

POSIX functions, like :c:func:`fopen()`, use the :ref:`ANSI code page ` to
encode/decode strings.

.. _win_console:

Windows console
'''''''''''''''

Console functions.

.. c:function:: GetConsoleCP()

   Get the code page of the standard input (stdin) of the console.

.. c:function:: GetConsoleOutputCP()

   Get the code page of the standard output (stdout and stderr) of the
   console.

.. c:function:: WriteConsoleW()

   Write a :ref:`character string ` into the console.

.. todo:: document ReadConsoleW()?

To improve the :ref:`Unicode support ` of the console, set the console font
to a TrueType font (e.g. "Lucida Console") and use the wide character API.

If the console is unable to render a character, it tries to use a
:ref:`character with a similar glyph `. For example, with OEM
:ref:`code page ` 850, Ł (U+0141) is replaced by L (U+004C). If no
replacement character can be found, "?" (U+003F) is displayed instead.

In a console (``cmd.exe``), the ``chcp`` command can be used to display or to
change the :ref:`OEM code page ` (and console code page). Changing the
console code page is not a good idea because the ANSI API of the console
still expects characters encoded to the previous console code page.

.. seealso::

   `Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? `_
   (Michael S. Kaplan, 2008) and the Python bug report #1602:
   `windows console doesn't print or input Unicode `_.

.. note::

   Setting the console :ref:`code page ` to cp65001 (:ref:`UTF-8`) does not
   improve Unicode support. On the contrary: non-ASCII characters are not
   rendered correctly and typing non-ASCII characters (e.g. using the
   keyboard) does not work correctly, especially with raster fonts.
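A minimal sketch of writing text through the wide character API follows. ``console_write`` is a hypothetical helper, not a Windows function. On Windows it calls ``WriteConsoleW()`` when standard output is a real console (detected here with ``GetConsoleMode()``, which fails on files and pipes); otherwise, and on other systems where the Windows branch is compiled out, it falls back to ``fputws()``. Error handling is reduced to a 0 or -1 return value.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Write a wide character string to standard output.
 * Returns 0 on success, -1 on error. */
int console_write(const wchar_t *text)
{
    assert(text != NULL);
#ifdef _WIN32
    {
        HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode;
        /* GetConsoleMode() only succeeds on a console handle,
         * not on a file or a pipe. */
        if (out != INVALID_HANDLE_VALUE && GetConsoleMode(out, &mode)) {
            DWORD written;
            if (!WriteConsoleW(out, text, (DWORD)wcslen(text),
                               &written, NULL))
                return -1;
            return 0;
        }
    }
#endif
    /* Redirected output, or not Windows: use the C library. */
    return fputws(text, stdout) < 0 ? -1 : 0;
}
```

The console branch passes UTF-16 units straight to the console and does not go through the console code page; whether a character is then displayed correctly still depends on the console font.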
File mode
'''''''''

:c:func:`_setmode` and :c:func:`_wsopen` are special functions to set the
encoding of a file:

* :c:macro:`_O_U8TEXT`: :ref:`UTF-8` without :ref:`BOM `
* :c:macro:`_O_U16TEXT`: :ref:`UTF-16 ` without BOM
* :c:macro:`_O_WTEXT`: UTF-16 with BOM

:c:func:`fopen` can use these modes using ``ccs=`` in the file mode:

* ``ccs=UNICODE``: :c:macro:`_O_WTEXT`
* ``ccs=UTF-8``: :c:macro:`_O_UTF8`
* ``ccs=UTF-16LE``: :c:macro:`_O_UTF16`

.. todo:: Consequences on TTY and pipes?

.. _osx:

Mac OS X
--------

Mac OS X uses :ref:`UTF-8` for the filenames. If a filename is an invalid
UTF-8 byte string, Mac OS X :ref:`returns an error `. The filenames are
:ref:`decomposed ` to an incompatible variant of the Normal Form D (NFD).
Extract from the `Technical Q&A QA1173 `_: "For example, HFS Plus uses a
variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through
U+FAFF, and U+2F800 through U+2FAFF are not decomposed."

.. _locales:

Locales
-------

To support different languages and encodings, UNIX and BSD operating systems
have "locales". Locales are process-wide: if a thread or a library changes
the locale, the whole process is impacted.

.. _locale categories:

Locale categories
'''''''''''''''''

Locale categories:

* :c:macro:`LC_COLLATE`: compare and sort strings
* :c:macro:`LC_CTYPE`: decode :ref:`byte strings ` and encode
  :ref:`character strings `
* :c:macro:`LC_MESSAGES`: language of messages
* :c:macro:`LC_MONETARY`: monetary formatting
* :c:macro:`LC_NUMERIC`: number formatting (e.g. thousands separator)
* :c:macro:`LC_TIME`: time and date formatting

:c:macro:`LC_ALL` is a special category: if you set a locale using this
category, it sets the locale for all categories.

Each category has its own environment variable with the same name. For
example, ``LC_MESSAGES=C`` displays error messages in English. To get the
value of a locale category, the ``LC_ALL``, ``LC_xxx`` (e.g. ``LC_CTYPE``)
and ``LANG`` environment variables are checked in this order: the first
non-empty variable is used. If all variables are unset, the C locale is used
as a fallback.

.. note::

   The gettext library reads ``LANGUAGE``, ``LC_ALL`` and ``LANG``
   environment variables (and some others) to get the user language. The
   ``LANGUAGE`` variable is specific to gettext and is not related to
   locales.

The C locale
''''''''''''

When a program starts, it does not directly get the user locale: it uses the
default locale, which is called the "C" locale or the "POSIX" locale. It is
also used if no locale environment variable is set. For
:c:macro:`LC_CTYPE`, the C locale usually means :ref:`ASCII`, but not always
(see the locale encoding section). For :c:macro:`LC_MESSAGES`, the C locale
means to speak the original language of the program, which is usually
English.

.. _locale encoding:

Locale encoding
'''''''''''''''

For Unicode, the most important locale category is ``LC_CTYPE``: it is used
to set the "locale encoding".

To get the locale encoding:

* Copy the current locale: ``setlocale(LC_CTYPE, NULL)``
* Set the current locale encoding to the user preference:
  ``setlocale(LC_CTYPE, "")``
* Use ``nl_langinfo(CODESET)`` if available
* or ``setlocale(LC_CTYPE, NULL)``

.. todo:: write a full example in C

For the C locale, ``nl_langinfo(CODESET)`` returns :ref:`ASCII`, or an alias
to this encoding (e.g. "US-ASCII" or "646"). But on FreeBSD, Solaris and
:ref:`Mac OS X `, codec functions (e.g. :c:func:`mbstowcs`) use
:ref:`ISO-8859-1` even if ``nl_langinfo(CODESET)`` announces ASCII encoding.
AIX uses :ref:`ISO-8859-1` for the C locale (and ``nl_langinfo(CODESET)``
returns ``"ISO8859-1"``).

Locale functions
''''''''''''''''

```` functions.

.. c:function:: char* setlocale(category, NULL)

   Get the value of the specified locale category.

.. c:function:: char* setlocale(category, name)

   Set the value of the specified locale category.

.. todo:: setlocale("") means user preference

```` functions.

.. c:function:: char* nl_langinfo(CODESET)

   Get the name of the locale encoding.

```` functions.

.. c:function:: size_t mbstowcs(wchar_t *dest, const char *src, size_t n)

   :ref:`Decode ` a :ref:`byte string ` from the :ref:`locale encoding ` to
   a :ref:`character string `. The decoder is :ref:`strict `: it returns an
   error on an :ref:`undecodable byte sequence `. If available, prefer the
   reentrant version: :c:func:`mbsrtowcs`.

.. c:function:: size_t wcstombs(char *dest, const wchar_t *src, size_t n)

   :ref:`Encode ` a :ref:`character string ` to a :ref:`byte string ` in the
   :ref:`locale encoding `. The encoder is :ref:`strict `: it returns an
   error if :ref:`a character cannot be encoded `. If available, prefer the
   reentrant version: :c:func:`wcsrtombs`.

mbstowcs() and wcstombs() are :ref:`strict ` and don't support
:ref:`error handlers `.

.. note::

   "mbs" stands for "multibyte string" (byte string) and "wcs" stands for
   "wide character string".

On Windows, the "locale encoding" is given by the
:ref:`ANSI and OEM code pages `. A Windows program uses the user preferred
code pages at startup, whereas a program starts with the C locale on UNIX.

.. _filename:

Filesystems (filenames)
-----------------------

CD-ROM and DVD
''''''''''''''

CD-ROM uses the ISO 9660 filesystem which stores filenames as
:ref:`byte strings `. This filesystem is very restrictive: only A-Z, 0-9, _
and "." are allowed.

Microsoft has developed the Joliet extension: it stores filenames as
:ref:`UCS-2 `, up to 64 characters (:ref:`BMP ` only). It was first
supported by Windows 95. Today, all operating systems are able to read it.

UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as
character strings.

.. todo:: UDF encoding?

Microsoft: FAT and NTFS filesystems
'''''''''''''''''''''''''''''''''''

MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are
stored as :ref:`byte strings `.
Filenames are limited to 8+3 characters (8 for the name, 3 for the
extension) and displayed differently depending on the :ref:`code page `
(:ref:`mojibake issue `).

Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT)
supports "long filenames", filenames are stored as :ref:`UCS-2 `, up to 255
characters (BMP only). Starting with Windows 2000,
:ref:`non-BMP characters ` can be used: :ref:`UTF-16 ` replaces UCS-2 and
the limit is now 255 UTF-16 units.

The NTFS filesystem stores filenames using `UTF-16 encoding `_.

Apple: HFS and HFS+ filesystems
'''''''''''''''''''''''''''''''

HFS stores filenames as byte strings.

HFS+ stores filenames as :ref:`UTF-16 `: the maximum length is 255 UTF-16
units.

Others
''''''

JFS and ZFS also use Unicode.

The ext family (ext2, ext3, ext4) stores filenames as byte strings.

.. todo:: Linux: mount options (FAT, NFSv3)
.. todo:: USB keys, camera, memory cards
.. todo:: Network filesystems like NFS (NFS4 supports Unicode?)
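Several limits in this chapter (Windows string lengths, the 255-unit filename limit of VFAT and HFS+) are counted in UTF-16 units, not in characters. As an illustration, the following portable sketch counts the characters stored in a UTF-16 buffer. ``utf16_count_chars`` is a hypothetical helper, not an operating system function, and counting a lone surrogate as one character is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the code points in a UTF-16 buffer of `units` 16-bit units.
 * A high surrogate (U+D800..U+DBFF) followed by a low surrogate
 * (U+DC00..U+DFFF) counts as one character; a lone surrogate is
 * counted as one character too. */
size_t utf16_count_chars(const uint16_t *s, size_t units)
{
    size_t i = 0, chars = 0;

    assert(s != NULL || units == 0);
    while (i < units) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF
            && i + 1 < units
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;     /* surrogate pair: one non-BMP character */
        else
            i += 1;     /* BMP character (or lone surrogate) */
        chars++;
    }
    return chars;
}
```

A 255-unit filename made only of non-BMP characters can therefore hold at most 127 characters, since each of them needs two units.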