# 8. How to guess the encoding of a document?

Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM, UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document. For all other encodings, you have to trust heuristics based on statistics.
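As a rough illustration, the reliable checks can be chained in a single function. This is only a sketch: the name `guess_reliable_encoding` and the order of the tests are assumptions made for this example, UTF-7 is left out, and a `None` result simply means that a statistical heuristic is needed.

```python
import codecs

# The 4-byte UTF-32 marks are listed before the 2-byte UTF-16 marks,
# because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
)

def guess_reliable_encoding(data):
    """Return an encoding name, or None if only heuristics can help.

    Hypothetical helper written for this example.
    """
    # 1. A BOM, if present, names the encoding.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    # 2. Otherwise try the two encodings that can be validated strictly.
    for encoding in ("ASCII", "UTF-8"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    # 3. Anything else needs a statistical heuristic.
    return None
```

ASCII is tested before UTF-8 because every ASCII document is also valid UTF-8, so the more specific answer is returned first.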

## 8.1. Is ASCII?

Checking whether a document is encoded in ASCII is simple: test that bit 7 of every byte is unset (`0b0xxxxxxx`).

Example in C:

```c
#include <stddef.h>   /* size_t */

int isASCII(const char *data, size_t size)
{
    const unsigned char *str = (const unsigned char*)data;
    const unsigned char *end = str + size;
    for (; str != end; str++) {
        /* bit 7 set: not an ASCII byte */
        if (*str & 0x80)
            return 0;
    }
    return 1;
}
```

In Python, the ASCII decoder can be used:

```python
def isASCII(data):
    try:
        data.decode('ASCII')
    except UnicodeDecodeError:
        return False
    else:
        return True
```

Note

Only use the Python function on short strings because it decodes the whole string into memory. For long strings, it is better to use the algorithm of the C function because it doesn’t allocate any memory.
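For long inputs, one possible way to stay in Python without decoding everything at once is to test the bytes chunk by chunk. The sketch below assumes a binary file-like object; the function name `is_ascii_stream` and the default chunk size are arbitrary choices made for this example.

```python
import io

def is_ascii_stream(stream, chunk_size=64 * 1024):
    """Return True if every byte read from a binary stream is below 0x80.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of the stream: no non-ASCII byte was found.
            return True
        # bytes.isascii() exists since Python 3.7; on older versions,
        # all(byte < 0x80 for byte in chunk) is an equivalent test.
        if not chunk.isascii():
            return False
```

For example, `is_ascii_stream(io.BytesIO(b"hello"))` is true, and the same function accepts a file opened with `open(path, "rb")`.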

## 8.2. Check for BOM markers

If the string begins with a BOM, the encoding can be extracted from the BOM. But there is a problem with UTF-16-LE and UTF-32-LE: the UTF-32-LE BOM starts with the UTF-16-LE BOM.

Example of a function written in C to check if a BOM is present:

```c
#include <string.h>   /* memcmp(), size_t, NULL */

const char *UTF_16_BE_BOM = "\xFE\xFF";
const char *UTF_16_LE_BOM = "\xFF\xFE";
const char *UTF_8_BOM = "\xEF\xBB\xBF";
const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";

/* Return the name of the encoding (a string literal), or NULL if
 * the data doesn't start with a known BOM. */
const char* check_bom(const char *data, size_t size)
{
    if (size >= 3) {
        if (memcmp(data, UTF_8_BOM, 3) == 0)
            return "UTF-8";
    }
    /* Test UTF-32 before UTF-16: the UTF-32-LE BOM starts with
     * the UTF-16-LE BOM. */
    if (size >= 4) {
        if (memcmp(data, UTF_32_LE_BOM, 4) == 0)
            return "UTF-32-LE";
        if (memcmp(data, UTF_32_BE_BOM, 4) == 0)
            return "UTF-32-BE";
    }
    if (size >= 2) {
        if (memcmp(data, UTF_16_LE_BOM, 2) == 0)
            return "UTF-16-LE";
        if (memcmp(data, UTF_16_BE_BOM, 2) == 0)
            return "UTF-16-BE";
    }
    return NULL;
}
```

For the UTF-16-LE/UTF-32-LE BOM conflict: this function returns `"UTF-32-LE"` if the string begins with `"\xFF\xFE\x00\x00"`, even if this string can be decoded from UTF-16-LE.

Example in Python getting the BOMs from the codecs library:

```python
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]
```

This function is different from the C function: it returns a list. It returns `['UTF-32-LE', 'UTF-16-LE']` if the string begins with `b"\xFF\xFE\x00\x00"`.
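When a single answer is needed, one option is to keep the longest matching BOM, which gives the same result as the C function, and to remove it before decoding. The following is a minimal sketch; `strip_bom` and `BOM_TABLE` are names invented for this example.

```python
import codecs

BOM_TABLE = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
)

def strip_bom(data):
    """Return (encoding, payload); encoding is None when no BOM is found.

    Hypothetical helper: if several BOMs match, the longest one wins,
    so b"\\xFF\\xFE\\x00\\x00" is reported as UTF-32-LE.
    """
    matches = [(bom, name) for bom, name in BOM_TABLE if data.startswith(bom)]
    if not matches:
        return None, data
    bom, name = max(matches, key=lambda item: len(item[0]))
    # Drop the BOM so that the payload can be decoded directly.
    return name, data[len(bom):]
```

Preferring the longest BOM is a convention, not a proof: a UTF-16-LE document whose first character is U+0000 begins with the same four bytes.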

## 8.3. Is UTF-8?

The UTF-8 encoding adds markers to each byte, so it is possible to write a reliable algorithm to check whether a byte string is encoded in UTF-8.

Example of a strict C function to check whether a string is encoded in UTF-8. It rejects overlong sequences (e.g. `0xC0 0x80`) and surrogate characters (e.g. `0xED 0xB2 0x80`, U+DC80).

```c
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint32_t */

int isUTF8(const char *data, size_t size)
{
    const unsigned char *str = (const unsigned char*)data;
    const unsigned char *end = str + size;
    unsigned char byte;
    unsigned int code_length, i;
    uint32_t ch;
    while (str != end) {
        byte = *str;
        if (byte <= 0x7F) {
            /* 1 byte sequence: U+0000..U+007F */
            str += 1;
            continue;
        }

        if (0xC2 <= byte && byte <= 0xDF)
            /* 0b110xxxxx: 2 bytes sequence */
            code_length = 2;
        else if (0xE0 <= byte && byte <= 0xEF)
            /* 0b1110xxxx: 3 bytes sequence */
            code_length = 3;
        else if (0xF0 <= byte && byte <= 0xF4)
            /* 0b11110xxx: 4 bytes sequence */
            code_length = 4;
        else {
            /* invalid first byte of a multibyte character */
            return 0;
        }

        if (str + (code_length - 1) >= end) {
            /* truncated string or invalid byte sequence */
            return 0;
        }

        /* Check continuation bytes: bit 7 should be set, bit 6 should be
         * unset (0b10xxxxxx). */
        for (i=1; i < code_length; i++) {
            if ((str[i] & 0xC0) != 0x80)
                return 0;
        }

        if (code_length == 2) {
            /* 2 bytes sequence: U+0080..U+07FF */
            ch = ((str[0] & 0x1f) << 6) + (str[1] & 0x3f);
            /* str[0] >= 0xC2, so ch >= 0x0080.
               str[0] <= 0xDF, (str[1] & 0x3f) <= 0x3f, so ch <= 0x07ff */
        } else if (code_length == 3) {
            /* 3 bytes sequence: U+0800..U+FFFF */
            ch = ((str[0] & 0x0f) << 12) + ((str[1] & 0x3f) << 6) +
                  (str[2] & 0x3f);
            /* (0xff & 0x0f) << 12 | (0xff & 0x3f) << 6 | (0xff & 0x3f) = 0xffff,
               so ch <= 0xffff */
            if (ch < 0x0800)
                return 0;

            /* surrogates (U+D800-U+DFFF) are invalid in UTF-8:
               test if (0xD800 <= ch && ch <= 0xDFFF) */
            if ((ch >> 11) == 0x1b)
                return 0;
        } else if (code_length == 4) {
            /* 4 bytes sequence: U+10000..U+10FFFF */
            ch = ((str[0] & 0x07) << 18) + ((str[1] & 0x3f) << 12) +
                 ((str[2] & 0x3f) << 6) + (str[3] & 0x3f);
            if ((ch < 0x10000) || (0x10FFFF < ch))
                return 0;
        }
        str += code_length;
    }
    return 1;
}
```

In Python, the UTF-8 decoder can be used:

```python
def isUTF8(data):
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True
```

In Python 2, this function is more tolerant than the C function, because the UTF-8 decoder of Python 2 accepts surrogate characters (U+D800..U+DFFF). For example, `isUTF8(b'\xED\xB2\x80')` returns `True`. With Python 3, the Python function is equivalent to the C function. If you would like to reject surrogate characters in Python 2, use the following strict function:

```python
def isUTF8Strict(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True
```
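The Python functions above decode the whole byte string at once. For a large file, a possible alternative is the incremental decoder of the `codecs` module, which validates the data chunk by chunk. The sketch below assumes Python 3 and a binary file-like object; `is_utf8_stream` and its default chunk size are choices made for this example.

```python
import codecs
import io

def is_utf8_stream(stream, chunk_size=64 * 1024):
    """Return True if a binary stream contains only valid UTF-8.

    Hypothetical helper: the decoder keeps the bytes of an incomplete
    sequence between two chunks, so a chunk may end in the middle of a
    multibyte character.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True raises an error if the input stops in the middle
        # of a multibyte sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

The decoded text of each chunk is discarded immediately, so memory use depends on the chunk size and not on the size of the document.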

## 8.4. Libraries

PHP has a builtin function to detect the encoding of a byte string: `mb_detect_encoding()`.

• chardet: Python version of the “chardet” algorithm implemented in Mozilla

• UTRAC: command line program (written in C) to recognize the encoding of an input file and its end-of-line type

• charguess: Ruby library to guess the charset of a document