I recently had an opportunity to dig into how character encoding works. I’ll freely admit, it wasn’t something I had dedicated much thought to until last week, when a particularly weird case at work demanded a better understanding.
Character encoding is the process of taking characters that make up the building blocks of our various languages and turning them into 1’s and 0’s. I think to really understand how it works, it’s worth starting closer to the beginning: with ASCII.
ASCII
ASCII is a 7-bit character encoding, which means it supports 128 characters. The highest value it can represent is
- binary: 1111111
- hex: 0x7f
- decimal: 127
So, when writing a file, each character would consume 1 octet of space
- 1 octet
- 1 byte
- 8 bits
For example, the letter a is:
- the 98th character
- 97 in decimal
- 0x61 in hex
- 01100001 in binary
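We can double-check those values with a few of Python’s built-ins (just a quick sketch; ord, hex, and bin are standard functions):

# Inspect the letter 'a' in its different representations.
code = ord("a")     # 97 in decimal
print(hex(code))    # 0x61
print(bin(code))    # 0b1100001  (7 significant bits)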
Because it was a 7-bit encoding, but written into 8-bit bytes, the most significant (8th) bit would be ignored, and some programs would even use it to encode “hidden” data within ASCII.
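Here’s a small sketch to make that concrete: every byte of ASCII text has its most significant bit clear, so masking it off with 0x7f changes nothing (the variable names here are mine, nothing standard):

# The high bit of every ASCII byte is 0, so masking it away is a no-op.
data = "hello".encode("ascii")
assert all(byte <= 0x7F for byte in data)      # high bit is never set
assert bytes(b & 0x7F for b in data) == data   # stripping it changes nothing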
UTF-8
Let’s skip ahead to UTF-8. Where ASCII is only concerned with 128 characters, UTF-8 is capable of encoding all of the Unicode code-points. That’s a potential for 1,112,064 characters! Despite this feat, it manages to do so while often consuming the same amount of space! How!?
It comes down to that ignored 8th bit in ASCII. We can get a clearer picture by looking at the UTF-8 man page:
The following byte sequences are used to represent a character. The
sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character
code number in binary representation, most significant bit first
(big-endian). Only the shortest possible multibyte sequence which
can represent the code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as
0xfffe and 0xffff (UCS noncharacters) should not appear in conforming
UTF-8 streams. According to RFC 3629 no point above U+10FFFF should
be used, which limits characters to four bytes.
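One way to internalize the table is to turn its ranges into code. Below is a small helper of my own (not something from the man page) that reports how many octets a code-point needs under the four-byte RFC 3629 limit, spot-checked against Python’s built-in encoder:

def utf8_length(code_point: int) -> int:
    """Number of octets UTF-8 uses for a code-point, per the table above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:  # RFC 3629 cap
        return 4
    raise ValueError("beyond the Unicode range")

# Spot-check against Python's encoder.
for ch in ("a", "é", "€", "😼"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))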
I want to call your attention to an important part here:
0x00000000 - 0x0000007F:
0xxxxxxx
If the character being encoded is in the range 0x00 to 0x7f, the first bit will be zero and it will be encoded directly in a single octet. It’s worth noting that this is the same range as ASCII! So if we are writing ASCII characters, they are encoded exactly the same way they would be in ASCII. Given that computers operated for years on this 128-character set, it is reasonable to imagine that the majority of characters being encoded fall into it. At least in English-speaking countries.
This has two important byproducts:
- ASCII with a leading 0 bit is valid UTF-8
- ASCII characters consume the same amount of space when encoded in UTF-8
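Both are easy to demonstrate with a quick, illustrative check using Python’s standard codecs:

text = "plain old ASCII"
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

assert as_ascii == as_utf8               # byte-for-byte identical
assert as_ascii.decode("utf-8") == text  # ASCII bytes are valid UTF-8
assert len(as_ascii) == len(as_utf8)     # same amount of space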
What happens when the code-point doesn’t fall inside the ASCII range, though? The leading bit will be used to indicate as much, and multiple octets will be used to encode that character. I think explaining this will be easiest with an example:
Encoding 😼 in UTF-8
😼 has the Unicode code-point U+1F63C. If we look at the table above, this falls into the range:
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The template has 21 x’s, so here is this character in binary, padded with leading 0’s to a length of 21:
000011111011000111100
And inserted into the template:
000 011111 011000 111100
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-----------------------------------
11110000 10011111 10011000 10111100
0xf0 0x9f 0x98 0xbc
We can verify this with hexyl:
$ printf '😼' | hexyl --plain
f0 9f 98 bc
Lo and behold, this matches what we expected! We just successfully encoded 😼 into UTF-8 together!
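If you’d rather let Python do the bit-twiddling, the same four-byte template can be written out explicitly. This is just a sketch of the construction above for U+1F63C, checked against the built-in encoder:

cp = 0x1F63C  # code-point for 😼

# Slice the 21 payload bits into 3 + 6 + 6 + 6, per the 4-byte template.
encoded = bytes([
    0b11110000 | (cp >> 18),           # 11110xxx
    0b10000000 | ((cp >> 12) & 0x3F),  # 10xxxxxx
    0b10000000 | ((cp >> 6) & 0x3F),   # 10xxxxxx
    0b10000000 | (cp & 0x3F),          # 10xxxxxx
])

assert encoded == "😼".encode("utf-8")
print(encoded.hex(" "))  # f0 9f 98 bc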