Ezio Melotti
"If you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine.
I swear I will."
Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by mnorri
by वंपायर
A Character Set is a collection of elements used to represent textual information.
ASCII | 0x00 to 0x7F | !"#$%&'()*+,-./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_` abcdefghijklmnopqrstuvwxyz{|}~ |
8th bit initially used for parity checking
by Tracey Ullom
8th bit then used to represent more characters
ASCII
ASCII, ISO-8859-1
ISO-8859-1 (aka Latin1) | 0x00 to 0x7F (ASCII) | !"#$%&'()*+,-./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_` abcdefghijklmnopqrstuvwxyz{|}~ |
0x80 to 0xFF | ¡¢£¤¥¦§¨©ª«¬ ¯°±²³´µ¶·¸¹º»¼½¾¿ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ |
ASCII, ISO-8859-1, ISO-8859-5
ASCII, ISO-8859-1, ISO-8859-5, ISO-8859-5
ISO-8859 part | ISO-8859 part |
---|---|
1 Western European | 9 Turkish |
2 Central European | 10 Nordic |
3 South European | 11 Latin/Thai |
4 North European | 13 Baltic Rim |
5 Latin/Cyrillic | 14 Celtic |
6 Latin/Arabic | 15 Western European 2 |
7 Latin/Greek | 16 South-Eastern European |
8 Latin/Hebrew | |
What if I want to mix русский and العربية?
What if I want to use ᐃᓄᒃᑎᑐᑦ (Inuktitut)?!
en.wikipedia.org/wiki/File:IqaluitStop.jpg
We need a better solution...
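The limitation can be tried out directly. The following is a minimal sketch, assuming Python 3 and its built-in `iso-8859-5`, `iso-8859-6` and `utf-8` codecs; `can_encode` is a hypothetical helper written only for this illustration.

```python
# can_encode is a hypothetical helper written for this sketch.
def can_encode(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


mixed = "русский and العربية"

# Each 8-bit ISO-8859 part covers only one extra script besides ASCII,
# so a string that mixes Cyrillic and Arabic fits in neither part.
for encoding in ("iso-8859-5", "iso-8859-6", "utf-8"):
    print(encoding, can_encode(mixed, encoding))
```

Only the Unicode-based encoding can hold both scripts in the same string.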
Unicode
Unicode is organized in 17 planes, with 65536 codepoints each
Unicode | U+0000 to U+007F (ASCII) | !"#$%&'()*+,-./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_` abcdefghijklmnopqrstuvwxyz{|}~ |
U+0080 to U+00FF (Latin-1 Supplement) | ¡¢£¤¥¦§¨©ª«¬ ¯°±²³´µ¶·¸¹º»¼½¾¿ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ |
U+0100 to U+017F (Latin Extended-A) | ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ |
U+0180 to U+10FFFF | ... |
512 chars per slide → 2176 slides to represent them all
by Ian Albert
by Google
A Character Set is a collection of elements used to represent textual information.
An Encoding is a mapping from a character set definition to the bit sequences used to represent the data
Character | Bit sequence | Hex |
---|---|---|
'a' | 0b01100001 | 0x61 |
'A' | 0b01000001 | 0x41 |
Encoding: text → bytes
Decoding: bytes → text
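As a minimal sketch of the two directions, assuming Python 3 (where `str` holds text and `bytes` holds encoded data), the 'a' row of the table above can be reproduced like this:

```python
# Encoding turns text (str) into bytes; decoding turns bytes back into text.
text = "a"
encoded = text.encode("ascii")      # text -> bytes
decoded = encoded.decode("ascii")   # bytes -> text

# Indexing a bytes object yields an integer, which can be shown
# as a bit sequence or as a hex value.
bits = format(encoded[0], "#010b")
hex_value = hex(encoded[0])

print(encoded, bits, hex_value, decoded)
```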
Encodings can be divided into:
Single-byte encodings
Multi-byte encodings
For 8-bit character sets, the terms "character set" and "encoding" might overlap
Cause of endless confusion:
<meta charset="utf-8">
this should actually be called 'encoding'
UTF family: UTF-8, UTF-16, UTF-32
UTF-8
UTF-32 is a fixed-width encoding
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 00 00 61 |
'ä' | U+00E4 | 00 00 00 E4 |
'☃' | U+2603 | 00 00 26 03 |
'🀩' | U+1F029 | 00 01 F0 29 |
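The table can be reproduced with a short sketch, assuming Python 3. The `utf-32-be` codec is used here because it writes big-endian bytes without a byte order mark, which matches the byte order shown above; `hex_bytes` is a hypothetical helper for formatting.

```python
# hex_bytes is a hypothetical formatting helper for this sketch.
def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


samples = ["a", "\u00e4", "\u2603", "\U0001F029"]

# Every codepoint takes exactly 4 bytes, whatever its value.
utf32_rows = [
    (f"U+{ord(char):04X}", hex_bytes(char.encode("utf-32-be")))
    for char in samples
]

for codepoint, encoded in utf32_rows:
    print(codepoint, encoded)
```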
UTF-16 is similar to UTF-32, but uses only 2 bytes
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 61 |
'ä' | U+00E4 | 00 E4 |
'☃' | U+2603 | 26 03 |
'🀩' | U+1F029 | ???? |
but 2 bytes are enough for BMP chars only...
To represent non-BMP chars, a surrogate pair is used
- two codepoints in the range U+D800–U+DFFF (a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF) combined to obtain a non-BMP char
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 00 61 |
'ä' | U+00E4 | 00 E4 |
'☃' | U+2603 | 26 03 |
'🀩' | U+1F029 | D8 3C DC 29 |
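The surrogate pair for U+1F029 can be computed by hand. This is a minimal sketch of the standard UTF-16 arithmetic, assuming Python 3; `to_surrogate_pair` is a hypothetical function name, and the result is compared against Python's own `utf-16-be` codec.

```python
# to_surrogate_pair is a hypothetical name used for this sketch.
def to_surrogate_pair(codepoint):
    """Split a non-BMP codepoint into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only non-BMP codepoints need a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F029)
utf16_bytes = "\U0001F029".encode("utf-16-be")

print(f"{high:04X} {low:04X}", utf16_bytes.hex())
```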
Surrogates are invalid in UTF-8 and UTF-32
Surrogates are valid in UTF-16 only if paired correctly
Often they are ignored
Many things break with surrogates
UTF-8 is a variable-width multibyte encoding:
UTF-8 uses 1 to 4 bytes per codepoint
The start byte specifies how many continuation bytes there will be
Character | Codepoint | Bytes (hex) |
---|---|---|
'a' | U+0061 | 61 |
'ä' | U+00E4 | C3 A4 |
'☃' | U+2603 | E2 98 83 |
'🀩' | U+1F029 | F0 9F 80 A9 |
Bit pattern | Meaning |
---|---|
0xxxxxxx | ASCII byte |
10xxxxxx | continuation byte |
110xxxxx | start byte of a 2-byte sequence |
1110xxxx | start byte of a 3-byte sequence |
11110xxx | start byte of a 4-byte sequence |
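The bit patterns in the table translate into simple mask tests. The following is a sketch only, assuming Python 3; `classify_utf8_byte` is a hypothetical helper, and it looks at the leading bits alone, so it does not reject bytes such as 0xC0 that match a pattern but never occur in valid UTF-8.

```python
# classify_utf8_byte is a hypothetical helper for this sketch.
def classify_utf8_byte(byte):
    """Name the role of a byte based only on its leading bits."""
    if byte & 0b10000000 == 0b00000000:
        return "ASCII byte"
    if byte & 0b11000000 == 0b10000000:
        return "continuation byte"
    if byte & 0b11100000 == 0b11000000:
        return "start byte of a 2-byte sequence"
    if byte & 0b11110000 == 0b11100000:
        return "start byte of a 3-byte sequence"
    if byte & 0b11111000 == 0b11110000:
        return "start byte of a 4-byte sequence"
    return "not a UTF-8 bit pattern"


snowman_roles = [classify_utf8_byte(b) for b in "\u2603".encode("utf-8")]
print(snowman_roles)
```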
To find the codepoint, the bytes are converted to binary:
0xE2     0x98     0x83
11100010 10011000 10000011
1110xxxx 10xxxxyy 10yyyyyy
The leading bits used to identify the start of a 3-byte sequence (1110) and the continuation bytes (10) are removed:
----xxxx --xxxxyy --yyyyyy
----0010 --011000 --000011
Group the 'x' and 'y' bits into two bytes and convert them to hex:
xxxxxxxx|yyyyyyyy
00100110|00000011
    0x26|    0x03
The values are combined to create the codepoint:
U+2603 = ☃
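The same steps can be written as bit operations. This is a minimal sketch for this one 3-byte sequence, assuming Python 3; it is not a general UTF-8 decoder, and the built-in codec is used only to cross-check the result.

```python
data = bytes([0xE2, 0x98, 0x83])

# Strip the marker bits: 1110 from the start byte, 10 from each continuation byte.
payload = [
    data[0] & 0b00001111,   # 4 payload bits
    data[1] & 0b00111111,   # 6 payload bits
    data[2] & 0b00111111,   # 6 payload bits
]

# Concatenate the 4 + 6 + 6 payload bits into one 16-bit value.
codepoint = (payload[0] << 12) | (payload[1] << 6) | payload[2]

print(f"U+{codepoint:04X}", chr(codepoint))
```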
Remember:
and:
UnicodeError:
UnicodeEncodeError: raised during encoding
UnicodeDecodeError: raised during decoding
>>> unistr = 'Minä tykkään Unicodesta!'  # Python 3
>>> unistr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 3: ordinal not in range(128)
>>> bytestr = unistr.encode('iso-8859-1')
>>> bytestr.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
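In a script, both errors can be caught and inspected. This is a small sketch assuming Python 3; the exact wording of the error messages differs between Python versions, so only the exception types and the `start` attribute are used here.

```python
text = "Min\u00e4 tykk\u00e4\u00e4n Unicodesta!"

encode_error = None
try:
    text.encode("ascii")                 # 'ä' has no ASCII representation
except UnicodeEncodeError as exc:
    encode_error = exc

latin1_bytes = text.encode("iso-8859-1")

decode_error = None
try:
    latin1_bytes.decode("utf-8")         # 0xE4 alone is not valid UTF-8
except UnicodeDecodeError as exc:
    decode_error = exc

# Both exceptions are subclasses of UnicodeError and report the failing position.
print(type(encode_error).__name__, encode_error.start)
print(type(decode_error).__name__, decode_error.start)
```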
Python 2 allows mixing Unicode and byte strings:
>>> u'Unicode and ' + 'Bytes'
u'Unicode and Bytes'
>>> u'Unicode and ' + 'ByteSnowMan: ☃'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
On Python 3 text and bytes can NOT be mixed:
>>> 'Unicode and ' + b'Bytes'
TypeError: Can't convert 'bytes' object to str implicitly
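The usual way around this in Python 3 is to decode the bytes explicitly before combining them with text. A minimal sketch, assuming the bytes are known to be ASCII:

```python
text = "Unicode and "
raw = b"Bytes"

mixed_failed = False
try:
    text + raw                      # str + bytes is rejected
except TypeError:
    mixed_failed = True

# Decode first, with an encoding chosen by the programmer, then concatenate.
combined = text + raw.decode("ascii")
print(mixed_failed, combined)
```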
PEP-0263: Defining Python Source Code Encodings
# -*- coding: utf-8 -*-
print u'This file is saved in UTF-8 ☺'
Related only to the text in the source file
If not specified, the default is ASCII on Python 2 and UTF-8 on Python 3
Mojibake (文字化け): "unintelligible sequence of characters"
NOT mojibake | Minä tykkään Unicodesta! |
UTF-8 shown as ISO-8859-1 | MinÃ¤ tykkÃ¤Ã¤n Unicodesta! |
ISO-8859-1 shown as UTF-8 | Min� tykk��n Unicodesta! |
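Both kinds of mojibake can be produced on purpose by decoding with the wrong encoding. This is a minimal sketch assuming Python 3; `errors="replace"` is used in the second case so that undecodable bytes become U+FFFD (�) instead of raising an error.

```python
original = "Min\u00e4 tykk\u00e4\u00e4n Unicodesta!"

# UTF-8 bytes read as ISO-8859-1: each 2-byte sequence becomes two characters.
utf8_as_latin1 = original.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: the lone 0xE4 bytes are invalid.
latin1_as_utf8 = original.encode("iso-8859-1").decode("utf-8", errors="replace")

# The first kind is reversible, because no information was lost.
repaired = utf8_as_latin1.encode("iso-8859-1").decode("utf-8")

print(utf8_as_latin1)
print(latin1_as_utf8)
print(repaired)
```

The second kind cannot be undone once the replacement characters have been written out, since the original bytes are gone.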
On Python <3.3 there are two different Python builds:
Narrow | Wide |
---|---|
uses UTF-16 internally | uses UTF-32 internally |
2 bytes per char | 4 bytes per char |
sys.maxunicode == 65535 | sys.maxunicode == 1114111 |
len('🀩') == 2 | len('🀩') == 1 |
'🀩'[0] == '\ud83c' | '🀩'[0] == '🀩' |
Fixed in Python 3.3! Thanks to PEP 393
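On a current interpreter the difference is easy to check. A short sketch, assuming Python 3.3 or later, where the narrow/wide distinction no longer exists:

```python
import sys

tile = "\U0001F029"

# With PEP 393 every build reports the full Unicode range,
# and a non-BMP character is a single element of the string.
print(sys.maxunicode == 0x10FFFF, len(tile), tile[0] == tile)
```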
Questions? ☃