I recently had an opportunity to dig into how character encoding works. I’ll freely admit, it wasn’t something I had dedicated much thought to until last week, when a particularly weird case at work demanded a better understanding.
Character encoding is the process of taking characters that make up the building blocks of our various languages and turning them into 1’s and 0’s. I think to really understand how it works, it’s worth starting closer to the beginning: with ASCII.
ASCII
ASCII is a 7-bit character encoding, which means it supports 128 characters. The highest value it can represent is
- binary: 1111111
- hex: 0x7f
- decimal: 127
So, when writing a file, each character would consume 1 octet of space
- 1 octet
- 1 byte
- 8 bits
For example, the letter a is:
- the 98th character
- 97 in decimal
- 0x61 in hex
- 01100001 in binary
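We can double-check those values with a few of Python’s built-ins (just a quick sketch; ord, hex, and bin are standard functions):

# Inspect the letter 'a' in its different representations.
code = ord("a")     # 97 in decimal
print(hex(code))    # 0x61
print(bin(code))    # 0b1100001  (7 significant bits)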
Because it was a 7-bit encoding, but written into 8-bit bytes, the most significant (8th) bit would be ignored, and some programs would even use it to encode “hidden” data within ASCII.
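Here’s a small sketch to make that concrete: every byte of ASCII text has its most significant bit clear, so masking it off with 0x7f changes nothing (the variable names here are mine, nothing standard):

# The high bit of every ASCII byte is 0, so masking it away is a no-op.
data = "hello".encode("ascii")
assert all(byte <= 0x7F for byte in data)      # high bit is never set
assert bytes(b & 0x7F for b in data) == data   # stripping it changes nothing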
UTF-8
Let’s skip ahead to UTF-8. Where ASCII is only concerned with 128 characters, UTF-8 is capable of encoding all of the Unicode code-points. That’s a potential for 1,112,064 characters! Despite this feat, it manages to do so while often consuming the same amount of space! How!?
It comes down to that ignored 8th bit in ASCII. We can get a clearer picture by looking at the UTF-8 man page:
The following byte sequences are used to represent a character. The
sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character
code number in binary representation, most significant bit first
(big-endian). Only the shortest possible multibyte sequence which
can represent the code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as
0xfffe and 0xffff (UCS noncharacters) should not appear in conforming
UTF-8 streams. According to RFC 3629 no point above U+10FFFF should
be used, which limits characters to four bytes.
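One way to internalize the table is to turn its ranges into code. Below is a small helper of my own (not something from the man page) that reports how many octets a code-point needs under the four-byte RFC 3629 limit, spot-checked against Python’s built-in encoder:

def utf8_length(code_point: int) -> int:
    """Number of octets UTF-8 uses for a code-point, per the table above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:  # RFC 3629 cap
        return 4
    raise ValueError("beyond the Unicode range")

# Spot-check against Python's encoder.
for ch in ("a", "é", "€", "😼"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))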
I want to call your attention to an important part here:
0x00000000 - 0x0000007F:
0xxxxxxx
If the character being encoded is in the range 0x00 to 0x7f, the first bit will be zero and it will be encoded directly in a single octet. It’s worth noting that this is the same range as ASCII! So if we are writing ASCII characters, they are encoded exactly the same way they would be in ASCII. Given that computers operated for years on this 128-character set, it is reasonable to imagine that the majority of characters being encoded fall into it. At least in English-speaking countries.
This has two important byproducts:
- ASCII with a leading 0 bit is valid UTF-8
- ASCII characters consume the same amount of space when encoded in UTF-8
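Both are easy to demonstrate with a quick, illustrative check using Python’s standard codecs:

text = "plain old ASCII"
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

assert as_ascii == as_utf8               # byte-for-byte identical
assert as_ascii.decode("utf-8") == text  # ASCII bytes are valid UTF-8
assert len(as_ascii) == len(as_utf8)     # same amount of space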
What happens when the code-point doesn’t fall inside the ASCII range, though? The leading bit will be used to indicate as much, and multiple octets will be used to encode that character. I think explaining this will be easiest with an example:
Encoding 😼 in UTF-8
😼 has the Unicode code-point U+1F63C. If we look at the table above, this falls into the range:
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The template has 21 x’s, so here is this character in binary, padded with leading 0’s to a length of 21:
000011111011000111100
And inserted into the template:
000 011111 011000 111100
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-----------------------------------
11110000 10011111 10011000 10111100
0xf0 0x9f 0x98 0xbc
We can verify this with hexyl:
$ printf '😼' | hexyl --plain
f0 9f 98 bc
Lo and behold, this matches what we expected! We just successfully encoded 😼 into UTF-8 together!
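If you’d rather let Python do the bit-twiddling, the same four-byte template can be written out explicitly. This is just a sketch of the construction above for U+1F63C, checked against the built-in encoder:

cp = 0x1F63C  # code-point for 😼

# Slice the 21 payload bits into 3 + 6 + 6 + 6, per the 4-byte template.
encoded = bytes([
    0b11110000 | (cp >> 18),           # 11110xxx
    0b10000000 | ((cp >> 12) & 0x3F),  # 10xxxxxx
    0b10000000 | ((cp >> 6) & 0x3F),   # 10xxxxxx
    0b10000000 | (cp & 0x3F),          # 10xxxxxx
])

assert encoded == "😼".encode("utf-8")
print(encoded.hex(" "))  # f0 9f 98 bc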