Understanding Encodings

Ezio Melotti

Something about me

As Joel Spolsky said

"If you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine.
I swear I will."

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)

In a submarine...

by mnorri, http://www.flickr.com/photos/mnorri/2368605084/

...peeling onions

by वंपायर, http://www.flickr.com/photos/c0t0s0d0/2795245345/

Outline

What is a Character Set?

ASCII

ASCII


ASCII 0x00 to 0x7F
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~

ASCII

from http://traceyullomdesign.com/2011/05/20/glyphs-and-typographic-symbols/

8-bit Character Sets

ASCII, ISO-8859-1

8-bit Character Sets

ISO-8859-1
(aka Latin1)
0x00 to 0x7F (ASCII)
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~
0x80 to 0xFF
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
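
The relationship between the two ranges can be checked from Python 3. This is a minimal sketch; the variable names are made up for the illustration.

```python
# Bytes 0x20-0x7E decode to the same text in ASCII and in ISO-8859-1.
ascii_bytes = bytes(range(0x20, 0x7F))
same_in_both = ascii_bytes.decode('ascii') == ascii_bytes.decode('iso-8859-1')

# A byte in the 0x80-0xFF range is valid in ISO-8859-1 but not in ASCII.
high_byte = b'\xe4'
latin1_char = high_byte.decode('iso-8859-1')
try:
    high_byte.decode('ascii')
    valid_ascii = True
except UnicodeDecodeError:
    valid_ascii = False

print(same_in_both, latin1_char, valid_ascii)
```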

ISO-8859-* Family

ASCII, ISO-8859-1

ISO-8859-* Family

ASCII, ISO-8859-1, ISO-8859-5

ISO-8859-* Family

ASCII, ISO-8859-1, ISO-8859-5, ISO-8859-5

ISO-8859-* Family

  • -1 Western European
  • -2 Central European
  • -3 South European
  • -4 North European
  • -5 Latin/Cyrillic
  • -6 Latin/Arabic
  • -7 Latin/Greek
  • -8 Latin/Hebrew
  • -9 Turkish
  • -10 Nordic
  • -11 Latin/Thai
  • -13 Baltic Rim
  • -14 Celtic
  • -15 Western European 2
  • -16 South-Eastern European
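
The parts of the family agree on the ASCII half and differ in the upper half, so the same byte can stand for different characters depending on the part. A small sketch, assuming the `iso-8859-N` codec names that ship with CPython; the `decoded` dictionary is only for illustration.

```python
raw = b'\xe4'  # one byte from the upper half

# Decode the same byte with three members of the family.
decoded = {}
for part in (1, 5, 7):
    codec = 'iso-8859-%d' % part
    decoded[codec] = raw.decode(codec)
    print(codec, decoded[codec])
```

This is why a byte string is meaningless until you know which character set it was written in.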

Still not enough...

en.wikipedia.org/wiki/File:IqaluitStop.jpg

We need a better solution...

Introducing Unicode

Unicode

Introducing Unicode

Unicode

Introducing Unicode

Codepoints

Unicode Planes

Unicode is organized in 17 planes (0 to 16), with 65536 codepoints each
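
A minimal sketch of how a codepoint maps to its plane, assuming planes are numbered 0 to 16 and each one spans 0x10000 codepoints. `plane_of` and `PLANE_SIZE` are hypothetical names, not part of the standard library.

```python
PLANE_SIZE = 0x10000  # 65536 codepoints per plane


def plane_of(char):
    """Return the number of the plane that contains the character."""
    return ord(char) // PLANE_SIZE


for char in ('a', '☃', '\U0001F029', '\U0010FFFF'):
    print('U+%04X is in plane %d' % (ord(char), plane_of(char)))

# Total number of codepoints from U+0000 to U+10FFFF.
total_codepoints = (plane_of('\U0010FFFF') + 1) * PLANE_SIZE
print(total_codepoints)
```

Plane 0 is the Basic Multilingual Plane (BMP).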

Unicode characters

Unicode U+0000 to U+007F
(ASCII)
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~
U+0080 to U+00FF
(Latin-1 Supplement)
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
U+0100 to U+017F
(Latin Extended-A)
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
U+0180 to U+10FFFF ...

How many characters?

512 chars per slide → 2176 slides to represent all 1,114,112 codepoints

Ian Albert, ian-albert.com/unicode_chart/

Unicode

Growth of Unicode on the Web

googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

What is an Encoding?

Character   Bit sequence   Hex
'a'         0b01100001     0x61
'A'         0b01000001     0x41
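
The table can be reproduced from Python 3 with `ord()` and `format()`. This is only a sketch; `describe` is a hypothetical helper introduced for the example.

```python
def describe(char):
    """Return the bit sequence and hex value of a character's codepoint."""
    code = ord(char)
    return format(code, '#010b'), format(code, '#04x')


for char in 'aA':
    bits, hexval = describe(char)
    print(repr(char), bits, hexval)

# For ASCII characters the encoded byte has the same value as the codepoint.
encoded = 'a'.encode('ascii')
print(encoded[0] == ord('a'))
```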

Encoding/decoding

Encoding types

Encodings can be divided into:

  • Single-byte encodings
    use only 1 byte per character, so they are limited to 256 characters
  • Multi-byte encodings
    can use more than 1 byte per character, and are divided into:
      • Fixed width
        a fixed number of bytes is used for each character
      • Variable width
        a variable number of bytes is used, depending on the character
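
The three kinds can be compared by counting the bytes each character needs. A rough sketch, using ISO-8859-1, UTF-32 and UTF-8 as one example of each kind; the variable names are illustrative.

```python
sample = 'aä☃'

widths = {
    # Fixed width, multi-byte: always 4 bytes per character.
    'utf-32-be': [len(ch.encode('utf-32-be')) for ch in sample],
    # Variable width, multi-byte: between 1 and 4 bytes per character.
    'utf-8': [len(ch.encode('utf-8')) for ch in sample],
}

# Single-byte: 1 byte per character, but most characters do not fit.
latin1_width = len('ä'.encode('iso-8859-1'))
try:
    '☃'.encode('iso-8859-1')
    snowman_fits = True
except UnicodeEncodeError:
    snowman_fits = False

print(widths, latin1_width, snowman_fits)
```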

8-bit/single-byte Encodings

Unicode Transformation Format (UTF)

UTF-32

Character   Codepoint   Bytes (hex)
'a'         U+0061      00 00 00 61
'ä'         U+00E4      00 00 00 E4
'☃'         U+2603      00 00 26 03
'🀩'         U+1F029     00 01 F0 29
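
As the table suggests, UTF-32 is just the codepoint written as a 4-byte integer. A minimal sketch, assuming big-endian byte order and no byte order mark; `utf32_be` is a hypothetical helper that is compared against Python's own `utf-32-be` codec.

```python
def utf32_be(char):
    """Hand-rolled UTF-32 (big endian): the codepoint as a 4-byte integer."""
    return ord(char).to_bytes(4, 'big')


for char in 'aä☃\U0001F029':
    # The hand-rolled version should agree with the built-in codec.
    matches_codec = utf32_be(char) == char.encode('utf-32-be')
    print(repr(char), 'U+%04X' % ord(char), utf32_be(char).hex(), matches_codec)
```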

UTF-16

UTF-16 is similar to UTF-32, but uses 2-byte units instead of 4-byte ones

Character   Codepoint   Bytes (hex)
'a'         U+0061      00 61
'ä'         U+00E4      00 E4
'☃'         U+2603      26 03
'🀩'         U+1F029     ????

but 2 bytes are enough for BMP chars only...

UTF-16 - Surrogates

Surrogates
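
A sketch of the surrogate pair calculation that UTF-16 uses for characters outside the BMP: subtract 0x10000 from the codepoint, then split the remaining 20 bits into two 10-bit halves. `utf16_surrogates` is a hypothetical helper written for this illustration.

```python
def utf16_surrogates(char):
    """Return the (high, low) surrogate pair for a non-BMP character."""
    codepoint = ord(char)
    if codepoint < 0x10000:
        raise ValueError('BMP characters do not need surrogates')
    offset = codepoint - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = utf16_surrogates('\U0001F029')
print('%04X %04X' % (high, low))

# Cross-check against Python's own UTF-16 codec (big endian, no BOM).
print('\U0001F029'.encode('utf-16-be').hex())
```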

UTF-8

UTF-8

Character   Codepoint   Bytes (hex)
'a'         U+0061      61
'ä'         U+00E4      C3 A4
'☃'         U+2603      E2 98 83
'🀩'         U+1F029     F0 9F 80 A9

UTF-8

Bit pattern   Meaning
0xxxxxxx      ASCII byte
10xxxxxx      continuation byte
110xxxxx      start byte of a 2-byte sequence
1110xxxx      start byte of a 3-byte sequence
11110xxx      start byte of a 4-byte sequence
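
The bit patterns can be tested with simple masks. The sketch below only looks at the leading bits of each byte; it does not reject byte values that a real UTF-8 decoder would refuse for other reasons. `classify` is a hypothetical helper.

```python
def classify(byte):
    """Describe a byte (an int from 0 to 255) by its leading bits."""
    if (byte & 0b10000000) == 0b00000000:
        return 'ASCII byte'
    if (byte & 0b11000000) == 0b10000000:
        return 'continuation byte'
    if (byte & 0b11100000) == 0b11000000:
        return 'start byte of a 2-byte sequence'
    if (byte & 0b11110000) == 0b11100000:
        return 'start byte of a 3-byte sequence'
    if (byte & 0b11111000) == 0b11110000:
        return 'start byte of a 4-byte sequence'
    return 'not a valid UTF-8 bit pattern'


for byte in 'a☃'.encode('utf-8'):
    print('0x%02X' % byte, format(byte, '08b'), classify(byte))
```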

UTF-8 - decoding example

To find the codepoint, the bytes are converted to binary:
    0xE2     0x98     0x83
11100010 10011000 10000011
1110xxxx 10xxxxyy 10yyyyyy

The leading bits used to identify the start of a 3-byte
sequence (1110) and continuation bytes (10) are removed:
----xxxx --xxxxyy --yyyyyy
----0010 --011000 --000011

Divide the 'x' and 'y' bits and convert them to hex:
xxxxxxxx|yyyyyyyy
00100110|00000011
    0x26|    0x03

The values are combined to create the codepoint:
U+2603 = ☃
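
The same steps can be written with bit masks and shifts. This is a sketch for 3-byte sequences only, not a complete decoder; `decode_3_bytes` is a hypothetical name.

```python
def decode_3_bytes(b1, b2, b3):
    """Decode a 3-byte UTF-8 sequence into a codepoint."""
    if (b1 & 0b11110000) != 0b11100000:
        raise ValueError('first byte is not a 3-byte start byte')
    if (b2 & 0b11000000) != 0b10000000 or (b3 & 0b11000000) != 0b10000000:
        raise ValueError('expected two continuation bytes')
    # Drop the marker bits and join the remaining 4 + 6 + 6 payload bits.
    return ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)


codepoint = decode_3_bytes(0xE2, 0x98, 0x83)
print('U+%04X' % codepoint, chr(codepoint))
```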

Recommendations

Encoding/Decoding - UnicodeErrors


>>> unistr = 'Minä tykkään Unicodesta!' # Python 3
>>> unistr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character
  '\xe4' in position 3: ordinal not in range(128)
>>> bytestr = unistr.encode('iso-8859-1')
>>> bytestr.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes
  in position 3-5: invalid data
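
Both errors come from using a codec that does not match the data. A short sketch of two ways to deal with them: decode with the encoding that was actually used, or pass one of the standard `errors=` handlers when some loss is acceptable. The variable names are illustrative.

```python
unistr = 'Minä tykkään Unicodesta!'

# Decoding with the same encoding used for encoding round-trips cleanly.
bytestr = unistr.encode('iso-8859-1')
roundtrip = bytestr.decode('iso-8859-1')

# Error handlers trade exceptions for lossy or escaped output.
replaced = unistr.encode('ascii', errors='replace')
escaped = unistr.encode('ascii', errors='backslashreplace')

print(roundtrip == unistr)
print(replaced)
print(escaped)
```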

Implicit Decoding on Python 2

Source Encoding

Mojibake

NOT mojibake                  Minä tykkään Unicodesta!
UTF-8 shown as ISO-8859-1     MinÃ¤ tykkÃ¤Ã¤n Unicodesta!
ISO-8859-1 shown as UTF-8     Min� tykk��n Unicodesta!
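
Both kinds of mojibake can be reproduced by encoding with one codec and decoding with the other. A minimal sketch; the variable names are made up, and `errors='replace'` stands in for what a viewer does when it displays � instead of failing.

```python
text = 'Minä tykkään Unicodesta!'

# UTF-8 bytes read as ISO-8859-1: every 'ä' turns into two characters.
utf8_as_latin1 = text.encode('utf-8').decode('iso-8859-1')

# ISO-8859-1 bytes read as UTF-8: every 'ä' is an invalid sequence.
latin1_as_utf8 = text.encode('iso-8859-1').decode('utf-8', errors='replace')

# The first kind is reversible, because no information was lost.
repaired = utf8_as_latin1.encode('iso-8859-1').decode('utf-8')

print(utf8_as_latin1)
print(latin1_as_utf8)
print(repaired)
```

The second kind cannot be repaired from the displayed text alone, since every invalid byte has been replaced by the same U+FFFD character.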

Narrow vs Wide

Two different Python builds:

Narrow                        Wide
uses UTF-16 internally        uses UTF-32 internally
2 bytes per char              4 bytes per char
sys.maxunicode == 65535       sys.maxunicode == 1114111
len('🀩') == 2                 len('🀩') == 1
'🀩'[0] == '\ud83c'            '🀩'[0] == '🀩'

PEP 393
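
Since PEP 393 (implemented in CPython 3.3) the narrow/wide distinction is gone: every build behaves like a wide build for indexing and `len()`. A quick check, assuming Python 3.3 or newer; `tile` and `utf16_units` are illustrative names.

```python
import sys

tile = '\U0001F029'  # the domino tile used in the tables above

print(sys.maxunicode == 0x10FFFF)   # the full Unicode range is supported
print(len(tile))                    # one character, not two surrogates
print(tile[0] == tile)

# For comparison, the number of UTF-16 code units a narrow build would count.
utf16_units = len(tile.encode('utf-16-le')) // 2
print(utf16_units)
```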

The end

Questions?

98% of the people working on submarines
have no idea what Unicode is. [citation needed]