"If you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine.
I swear I will."
Joel Spolsky,
The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)
by वंपायर, http://www.flickr.com/photos/c0t0s0d0/2795245345/
ASCII | 0x00 to 0x7F | |
from http://traceyullomdesign.com/2011/05/20/glyphs-and-typographic-symbols/
ASCII
ASCII, ISO-8859-1
ISO-8859-1 (aka Latin1) | 0x00 to 0x7F (ASCII) |
 | 0x80 to 0xFF |
ASCII, ISO-8859-1, ISO-8859-5
ASCII, ISO-8859-1, ISO-8859-5, ISO-8859-5
en.wikipedia.org/wiki/File:IqaluitStop.jpg
We need a better solution...
Unicode
Unicode is organized in 17 planes (0 to 16), with 65536 codepoints each
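Since every plane holds 65536 (0x10000) codepoints, the plane of a character can be read off its codepoint by dropping the low 16 bits. A minimal sketch in Python 3; `plane_of` is a hypothetical helper name, not something from the slides:

```python
def plane_of(char):
    """Return the Unicode plane of a single character (0 is the BMP)."""
    # Each plane spans 0x10000 codepoints, so shifting right by 16 bits
    # leaves only the plane number.
    return ord(char) >> 16

# 'a' (U+0061), snowman (U+2603), domino tile (U+1F029)
for ch in ('a', '\u2603', '\U0001F029'):
    print('U+%04X is in plane %d' % (ord(ch), plane_of(ch)))
```

Characters in plane 0 fit in 16 bits; anything in a higher plane, like the domino tile, does not.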
Unicode range | Block |
---|---|
U+0000 to U+007F | ASCII |
U+0080 to U+00FF | Latin-1 Supplement |
U+0100 to U+017F | Latin Extended-A |
U+0180 to U+10FFFF | ... |
512 chars per slide → 2176 slides to represent them all
Ian Albert, ian-albert.com/unicode_chart/
Character | Bit sequence | Hex |
---|---|---|
'a' | 0b01100001 | 0x61 |
'A' | 0b01000001 | 0x41 |
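The mapping in this table can be reproduced with Python 3 built-ins. This is only an illustrative sketch; `describe` is a hypothetical helper introduced here:

```python
def describe(char):
    """Return the bit sequence and hex value of a single character."""
    code = ord(char)  # integer value assigned to the character
    return '0b{:08b}'.format(code), '0x{:02X}'.format(code)

for ch in ('a', 'A'):
    bits, hexa = describe(ch)
    print(repr(ch), bits, hexa)
```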
Encodings can be divided in:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 00 00 61 |
'ä' | U+00E4 | 00 00 00 E4 |
'☃' | U+2603 | 00 00 26 03 |
'🀩' | U+1F029 | 00 01 F0 29 |
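The UTF-32 bytes above can be checked in Python 3. The sketch assumes the table shows big-endian bytes without a byte order mark, which is what the `utf-32-be` codec produces; `utf32_hex` is a hypothetical helper name:

```python
def utf32_hex(char):
    """Return the big-endian UTF-32 bytes of a character as spaced hex."""
    encoded = char.encode('utf-32-be')  # always 4 bytes per codepoint, no BOM
    return ' '.join('{:02X}'.format(byte) for byte in encoded)

# 'a', 'ä' (U+00E4), snowman (U+2603), domino tile (U+1F029)
for ch in ('a', '\u00e4', '\u2603', '\U0001F029'):
    print('U+%04X' % ord(ch), utf32_hex(ch))
```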
UTF-16 is similar to UTF-32, but uses only 2 bytes
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 61 |
'ä' | U+00E4 | 00 E4 |
'☃' | U+2603 | 26 03 |
'🀩' | U+1F029 | ???? |
but 2 bytes are enough for BMP chars only...
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 61 |
'ä' | U+00E4 | 00 E4 |
'☃' | U+2603 | 26 03 |
'🀩' | U+1F029 | D8 3C DC 29 |
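The four bytes for U+1F029 are a surrogate pair. A minimal sketch of the standard UTF-16 arithmetic in Python 3; `to_surrogates` is a hypothetical helper, and the result is cross-checked against the built-in `utf-16-be` codec:

```python
def to_surrogates(codepoint):
    """Split a non-BMP codepoint into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError('only codepoints outside the BMP need surrogates')
    offset = codepoint - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F029)
print('%04X %04X' % (high, low))
# The built-in codec should produce the same four bytes.
print('\U0001F029'.encode('utf-16-be') == bytes([0xD8, 0x3C, 0xDC, 0x29]))
```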
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 61 |
'ä' | U+00E4 | C3 A4 |
'☃' | U+2603 | E2 98 83 |
'🀩' | U+1F029 | F0 9F 80 A9 |
Bit pattern | Meaning |
---|---|
0xxxxxxx | ASCII byte |
10xxxxxx | continuation byte |
110xxxxx | start byte of a 2-byte sequence |
1110xxxx | start byte of a 3-byte sequence |
11110xxx | start byte of a 4-byte sequence |
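The bit patterns in this table translate directly into mask tests. A rough Python 3 sketch; `classify` is a hypothetical helper that only looks at the leading bits and does not validate whole sequences:

```python
def classify(byte):
    """Describe a single UTF-8 byte based on its leading bits."""
    if (byte & 0b10000000) == 0b00000000:
        return 'ASCII byte'
    if (byte & 0b11000000) == 0b10000000:
        return 'continuation byte'
    if (byte & 0b11100000) == 0b11000000:
        return 'start byte of a 2-byte sequence'
    if (byte & 0b11110000) == 0b11100000:
        return 'start byte of a 3-byte sequence'
    if (byte & 0b11111000) == 0b11110000:
        return 'start byte of a 4-byte sequence'
    return 'not valid in UTF-8'

# Bytes of the snowman (U+2603) encoded as UTF-8
for byte in b'\xe2\x98\x83':
    print('0x%02X' % byte, classify(byte))
```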
To find the codepoint, the bytes are converted to binary:
0xE2 0x98 0x83
11100010 10011000 10000011
1110xxxx 10xxxxyy 10yyyyyy
The leading bits used to identify the start of a 3-byte
sequence (1110) and continuation bytes (10) are removed:
----xxxx --xxxxyy --yyyyyy
----0010 --011000 --000011
Divide the 'x' and 'y' bits and convert them to hex:
xxxxxxxx|yyyyyyyy
00100110|00000011
0x26| 0x03
The values are combined to create the codepoint:
U+2603 = ☃
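The same steps can be written with bit masks and shifts. A minimal Python 3 sketch for the 3-byte case only; `decode_3byte` is a hypothetical helper and it assumes the input is already a well-formed sequence:

```python
def decode_3byte(b0, b1, b2):
    """Rebuild a codepoint from a well-formed 3-byte UTF-8 sequence."""
    top = b0 & 0b00001111      # drop the 1110 marker, keep 4 bits
    middle = b1 & 0b00111111   # drop the 10 marker, keep 6 bits
    bottom = b2 & 0b00111111   # drop the 10 marker, keep 6 bits
    return (top << 12) | (middle << 6) | bottom

codepoint = decode_3byte(0xE2, 0x98, 0x83)
print('U+%04X' % codepoint)
# The built-in decoder should agree with the manual computation.
print(chr(codepoint) == b'\xe2\x98\x83'.decode('utf-8'))
```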
>>> unistr = 'Minä tykkään Unicodesta!' # Python 3
>>> unistr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character
'\xe4' in position 3: ordinal not in range(128)
>>> bytestr = unistr.encode('iso-8859-1')
>>> bytestr.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes
in position 3-5: invalid data
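When a lossy result is acceptable, the `errors` argument of `str.encode()` can be used instead of letting the exception propagate. A small Python 3 sketch of that option, alongside a round trip that uses the same codec in both directions:

```python
text = 'Min\u00e4 tykk\u00e4\u00e4n Unicodesta!'  # 'Minä tykkään Unicodesta!'

# ASCII cannot represent 'ä'; error handlers trade exceptions for data loss.
ascii_replaced = text.encode('ascii', errors='replace')
ascii_ignored = text.encode('ascii', errors='ignore')

# Encoding and decoding with the same codec is lossless.
roundtrip = text.encode('iso-8859-1').decode('iso-8859-1')

print(ascii_replaced)
print(ascii_ignored)
print(roundtrip == text)
```

Both handlers silently lose the 'ä' characters, so they are a fallback rather than a fix; choosing an encoding that covers the text (such as UTF-8) avoids the problem.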
>>> u'Unicode and ' + 'Bytes'  # Python 2
u'Unicode and Bytes'
In Python 2 the 'str' is implicitly decoded to 'unicode':
sys.getdefaultencoding() is used
sys.getdefaultencoding() == 'ascii'
if the 'str' contains non-ASCII chars the decoding fails
>>> u'Unicode and ' + 'ByteSnowMan: ☃'
UnicodeDecodeError: 'ascii' codec can't
decode byte 0xe2 in position 13: ordinal
not in range(128)
>>> 'Unicode and ' + b'Bytes'  # Python 3
TypeError: Can't convert 'bytes' object
to str implicitly
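Because Python 3 refuses to mix the two types, the conversion has to be explicit in one direction or the other. A minimal sketch, assuming the bytes are known to be ASCII:

```python
text = 'Unicode and '
data = b'Bytes'

# Decode the bytes to get a str result...
combined_text = text + data.decode('ascii')

# ...or encode the str to get a bytes result.
combined_bytes = text.encode('ascii') + data

print(combined_text)
print(combined_bytes)
```

Which direction is right depends on the program; a common approach is to decode as soon as data comes in and encode only when it goes out.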
# -*- coding: utf-8 -*-
print u'This file is saved in UTF-8 ☺'
NOT mojibake | Minä tykkään Unicodesta! |
UTF-8 shown as ISO-8859-1 | MinÃ¤ tykkÃ¤Ã¤n Unicodesta! |
ISO-8859-1 shown as UTF-8 | Min� tykk��n Unicodesta! |
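Both kinds of mojibake can be reproduced in Python 3 by encoding with one codec and decoding with the other. A short sketch; `errors='replace'` is used in the second case because the ISO-8859-1 bytes are not valid UTF-8:

```python
text = 'Min\u00e4 tykk\u00e4\u00e4n Unicodesta!'  # 'Minä tykkään Unicodesta!'

# UTF-8 bytes misread as ISO-8859-1: each 'ä' turns into two characters.
utf8_as_latin1 = text.encode('utf-8').decode('iso-8859-1')

# ISO-8859-1 bytes misread as UTF-8: each 'ä' becomes U+FFFD.
latin1_as_utf8 = text.encode('iso-8859-1').decode('utf-8', errors='replace')

# The first kind is reversible, because no bytes were lost.
repaired = utf8_as_latin1.encode('iso-8859-1').decode('utf-8')

print(utf8_as_latin1)
print(latin1_as_utf8)
print(repaired == text)
```

The second kind cannot be undone from the decoded string alone, since the replacement characters no longer carry the original byte values.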
'�' (U+FFFD) is displayed in case of errors
Two different Python builds (before Python 3.3):
Narrow | Wide |
---|---|
uses UTF-16 internally | uses UTF-32 internally |
2 bytes per char | 4 bytes per char |
sys.maxunicode == 65535 | sys.maxunicode == 1114111 |
len('🀩') == 2 | len('🀩') == 1 |
'🀩'[0] == '\ud83c' | '🀩'[0] == '🀩' |
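To see which column an interpreter falls into, the same checks can be run directly. The sketch below assumes CPython 3.3 or later, where the narrow/wide distinction no longer exists and strings always behave like the wide column; the UTF-16 length is computed separately to show where the narrow build's `2` came from:

```python
import sys

tile = '\U0001F029'  # domino tile, outside the BMP

# Number of 16-bit code units needed in UTF-16 (what a narrow build counted).
utf16_units = len(tile.encode('utf-16-be')) // 2

print(sys.maxunicode == 0x10FFFF)
print(len(tile))
print(utf16_units)
```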
Questions?
☃ 98% of the people working on submarines
have no idea what Unicode is. [citation needed]