Unicode and Python 3

Ezio Melotti

Something about me

...in Finland U+2603 - SNOWMAN

Outline

ASCII


ASCII 0x00 to 0x7F
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~

8-bit Character Sets

ISO-8859-1 0x00 to 0x7F (ASCII)
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~
0x80 to 0xFF
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

ISO-8859-* Family

  • -1 Western European
  • -2 Central European
  • -3 South European
  • -4 North European
  • -5 Latin/Cyrillic
  • -6 Latin/Arabic
  • -7 Latin/Greek
  • -8 Latin/Hebrew
  • -9 Turkish
  • -10 Nordic
  • -11 Latin/Thai
  • -13 Baltic Rim
  • -14 Celtic
  • -15 Western European 2
  • -16 South-Eastern European
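
A small sketch of why the part number matters: the same byte above 0x7F maps to a different character in each part of the family. The byte 0xE4 and the three codecs picked here are only illustrative choices.

```python
# One byte above 0x7F, decoded with three members of the ISO-8859 family.
raw = b'\xe4'

decoded = {name: raw.decode(name)
           for name in ('iso-8859-1', 'iso-8859-5', 'iso-8859-7')}

for name, char in decoded.items():
    # ascii() shows the codepoint escape even if the terminal cannot render it
    print(name, ascii(char))
```

The byte comes out as a Latin, a Cyrillic and a Greek letter respectively, so bytes alone are not enough: the encoding has to be known too.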

Unicode

Unicode characters

Unicode U+0000 to U+007F
(ASCII)
 !"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~
U+0080 to U+00FF
(Latin-1 Supplement)
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
U+0100 to U+017F
(Latin Extended-A)
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
U+0180 to U+10FFFF ...

How many characters?

512 chars per slide → 2175 slides to represent them all
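
The 2175 figure is consistent with dividing the highest codepoint, U+10FFFF, by 512 and rounding down. That is an assumption about how it was derived, not something the slide states. Counting U+0000 as well gives 1,114,112 codepoints, which fill exactly 2176 slides. The sketch below counts codepoints, not assigned characters, and assumes Python 3.3 or later, where `sys.maxunicode` is always 0x10FFFF.

```python
import sys

CHARS_PER_SLIDE = 512  # figure taken from the slide

# Highest codepoint Python can represent: U+10FFFF (Python 3.3+)
highest = sys.maxunicode
total_codepoints = highest + 1  # U+0000 is a codepoint too

slides_floor = highest // CHARS_PER_SLIDE           # 2175
slides_exact = total_codepoints // CHARS_PER_SLIDE  # 2176

print(hex(highest), total_codepoints, slides_floor, slides_exact)
```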

Unicode

Growth of Unicode on the Web http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

Character Sets and Encoding

Codepoints

Unicode Transformation Format (UTF)
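
As a rough illustration of what a UTF does, the sketch below encodes one codepoint, U+2603, with three UTFs and compares the resulting bytes. The little-endian variants are chosen here only to keep the byte order mark out of the output.

```python
snowman = '\u2603'  # SNOWMAN

# Explicit little-endian variants avoid the byte order mark (BOM)
utf8 = snowman.encode('utf-8')
utf16 = snowman.encode('utf-16-le')
utf32 = snowman.encode('utf-32-le')

for name, data in (('utf-8', utf8), ('utf-16-le', utf16), ('utf-32-le', utf32)):
    print(name, len(data), data)
```

One codepoint, three different byte sequences of 3, 2 and 4 bytes.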

Unicode strings

Python supports Unicode strings:
Python 2:
>>> unistr = u'I am a Unicode string on Python 2'
>>> type(unistr)
<type 'unicode'>
Python 3:
>>> unistr = 'I am a Unicode string on Python 3'
>>> type(unistr)
<class 'str'>

Escape sequences in string literals

Python 2:
>>> u'☃' == u'\u2603' == u'\U00002603' == u'\N{SNOWMAN}'
True
Python 3:
>>> '☃' == '\u2603' == '\U00002603' == '\N{SNOWMAN}'
True
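
The escapes above go from a codepoint or name to a character. A minimal Python 3 sketch of the opposite direction, using the standard `unicodedata` module:

```python
import unicodedata

snowman = '\N{SNOWMAN}'

codepoint = ord(snowman)              # integer codepoint
name = unicodedata.name(snowman)      # official character name
same = unicodedata.lookup('SNOWMAN')  # name back to character

label = 'U+{:04X} - {}'.format(codepoint, name)
print(label)
```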

Unicode is "abstract"

Byte strings

Python also supports Byte strings:
Python 2:
>>> bytestr = 'I am a Byte string on Python 2'
>>> type(bytestr)
<type 'str'>
Python 3:
>>> bytestr = b'I am a Byte string on Python 3'
>>> type(bytestr)
<class 'bytes'>

String types in Python 2 and 3

                 Sequence of  Python 2           Python 3
Unicode strings  Codepoints   u"Unicode string"  "Unicode string"
                              <type 'unicode'>   <class 'str'>
Byte strings     Bytes        "Byte string"      b"Byte string"
                              <type 'str'>       <class 'bytes'>

Unicode in Python 3

If you are using Python 3:

Encoding/Decoding

Encoding/Decoding - Errors

>>> snowman = 'Snowman: ☃' # Python 3
>>> snowman.encode('ascii', 'strict') # default
UnicodeEncodeError: 'ascii' codec can't encode character
  '\u2603' in position 9: ordinal not in range(128)
>>> snowman.encode('ascii', 'replace')
b'Snowman: ?'
>>> snowman.encode('ascii', 'ignore')
b'Snowman: '
>>> snowman.encode('ascii', 'xmlcharrefreplace')
b'Snowman: &#9731;'
>>> snowman.encode('ascii', 'backslashreplace')
b'Snowman: \\u2603'

>>> snowman.encode('utf-8')
b'Snowman: \xe2\x98\x83'
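
The same error handlers apply in the other direction. A minimal Python 3 sketch, decoding the UTF-8 bytes from the last example with the wrong codec and then with the right one:

```python
data = b'Snowman: \xe2\x98\x83'  # UTF-8 bytes of 'Snowman: ☃'

try:
    data.decode('ascii')  # 'strict' is the default
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print(err)

replaced = data.decode('ascii', 'replace')  # each bad byte becomes U+FFFD
ignored = data.decode('ascii', 'ignore')    # bad bytes are dropped
correct = data.decode('utf-8')              # the codec the bytes were made with
```

Only the last call recovers the snowman. `'replace'` and `'ignore'` avoid the exception but lose data.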

Implicit Decoding on Python 2
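
On Python 2, mixing a byte string with a Unicode string triggers an implicit decode with the default ASCII codec, which fails only once non-ASCII bytes show up. As a contrast, a minimal sketch of what Python 3 does with the same mix:

```python
text = 'Snowman: '
data = '\u2603'.encode('utf-8')

try:
    text + data  # str + bytes
    mixed_ok = True
except TypeError as err:
    mixed_ok = False
    print(err)

# Python 3 requires the decoding step to be explicit
result = text + data.decode('utf-8')
```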

Encoding/Decoding - UnicodeErrors


>>> unistr = 'Minä tykkään Unicodesta!' # Python 3
>>> unistr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character
  '\xe4' in position 3: ordinal not in range(128)
>>> bytestr = unistr.encode('iso-8859-1')
>>> bytestr.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes
  in position 3-5: invalid data
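
One way to cope with a `UnicodeDecodeError` is to catch it and try another encoding. The helper below, `decode_with_fallback`, is hypothetical and only a sketch. Its default list of encodings is an assumption, not a recommendation.

```python
def decode_with_fallback(data, encodings=('utf-8', 'iso-8859-1')):
    """Hypothetical helper: try each encoding in order, return (text, encoding)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the encodings matched: %r' % (encodings,))

bytestr = 'Minä tykkään Unicodesta!'.encode('iso-8859-1')
text, used = decode_with_fallback(bytestr)
print(used, text)
```

ISO-8859-1 accepts every byte value, so as a last entry it never fails. That makes the result a guess, not a detection, and knowing the real encoding is still better.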

When to use Unicode and Byte strings

  1. Decode the input as soon as you receive it
  2. Work only with Unicode
  3. Encode the output just before sending it
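
The three steps above can be sketched as follows on Python 3. The in-memory streams stand in for a real input and output, and `shout` is a hypothetical processing step. UTF-8 is assumed on both ends.

```python
import io

def shout(text):
    """Hypothetical processing step that only ever sees Unicode."""
    return text.upper()

incoming = io.BytesIO('Minä tykkään Unicodesta!'.encode('utf-8'))
outgoing = io.BytesIO()

text = incoming.read().decode('utf-8')  # 1. decode on input
text = shout(text)                      # 2. work with Unicode
outgoing.write(text.encode('utf-8'))    # 3. encode on output
```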
But...

Know the Encoding

When to use Unicode and Byte strings

Why?

>>> unistr = u'Minä tykkään Unicodesta!' # Python 2
>>> # I ♥ Unicode
>>> bytestr = unistr.encode('utf-8')
>>> print len(unistr)
24
>>> print len(bytestr)
27
>>> print unistr[:4]
Minä
>>> print bytestr[:4]
Min├
>>> bytestr
'Min\xc3\xa4 tykk\xc3\xa4\xc3\xa4n Unicodesta!'
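
A Python 3 counterpart of the same session, as a sketch. The lengths differ in the same way, and the four-byte slice is not even valid UTF-8 any more.

```python
unistr = 'Minä tykkään Unicodesta!'
bytestr = unistr.encode('utf-8')

char_count = len(unistr)   # counts codepoints
byte_count = len(bytestr)  # counts bytes; each 'ä' takes two in UTF-8

text_slice = unistr[:4]    # a whole number of characters
byte_slice = bytestr[:4]   # cuts the 'ä' in half

try:
    byte_slice.decode('utf-8')
    slice_decodes = True
except UnicodeDecodeError:
    slice_decodes = False
```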

Where?

Different sources where the text is encoded

Source Encoding

Web Pages

Web Pages - Find the Encoding
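
One common place where a page declares its encoding is the `charset` parameter of the HTTP `Content-Type` header. The sketch below parses such a value with the standard library. The helper name `charset_from_content_type` and the header values are made up for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Hypothetical helper: extract the charset parameter of a Content-Type value."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)

declared = charset_from_content_type('text/html; charset=ISO-8859-1')
missing = charset_from_content_type('text/html', default='unknown')
print(declared, missing)
```

A declared charset can still be wrong, so decoding errors have to be handled even when the header is present.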

Files

Files - 2
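
A minimal Python 3 sketch of reading and writing text files with an explicit encoding. The temporary directory and the file name `example.txt` are placeholders.

```python
import os
import tempfile

text = 'Minä tykkään Unicodesta!'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.txt')  # hypothetical file name

    # Text mode: the file object encodes on write and decodes on read
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, encoding='utf-8') as f:
        read_back = f.read()

    # Binary mode: raw bytes, no decoding
    with open(path, 'rb') as f:
        raw = f.read()
```

Passing `encoding` explicitly avoids depending on the platform default.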

os module

Python 2:
>>> os.getcwd()
'/home/wolf'
>>> os.getcwdu()
u'/home/wolf'
>>> os.listdir('/home')
['wolf', '.directory']
>>> os.listdir(u'/home')
[u'wolf', u'.directory']
Python 3:
>>> os.getcwdb()
b'/home/wolf'
>>> os.getcwd()
'/home/wolf'
>>> os.listdir(b'/home')
[b'wolf', b'.directory']
>>> os.listdir('/home')
['wolf', '.directory']
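
On Python 3, `os.fsencode()` and `os.fsdecode()` convert file names between `str` and `bytes` using the filesystem encoding. A short sketch, using the current directory so that it runs anywhere:

```python
import os

name = 'wolf'
as_bytes = os.fsencode(name)     # str -> bytes, filesystem encoding
as_text = os.fsdecode(as_bytes)  # bytes -> str, filesystem encoding

# listdir() mirrors the type of its argument
text_entries = os.listdir('.')
byte_entries = os.listdir(b'.')
```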

Network

Terminal

>>> import os, sys
>>> os.environ['LANG']
'en_US.UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.errors # Python 3 only!
'strict'
>>> sys.stdout.encoding = 'ascii'
AttributeError: can't set attribute
>>> sys.stdout.errors = 'replace'
AttributeError: can't set attribute
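
The attributes are read-only, but on Python 3.7 and later a text stream can be changed through `reconfigure()`. The sketch below assumes that version and uses an in-memory stream as a stand-in for `sys.stdout`, so that it does not depend on the real terminal.

```python
import io

buffer = io.BytesIO()
# Stand-in for sys.stdout: an ASCII-only text stream with strict errors
stream = io.TextIOWrapper(buffer, encoding='ascii', errors='strict')

stream.reconfigure(errors='replace')  # Python 3.7+
stream.write('Snowman: ☃')
stream.flush()

written = buffer.getvalue()
print(written)
```

On a real terminal the equivalent call would be `sys.stdout.reconfigure(errors='replace')`, provided `sys.stdout` has not been replaced by an object without that method.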

Terminal - 2

String representation in Python 3

Defined in PEP 3138
Python 2:
>>> unistr = u'Minä tykkään Unicodesta!'
>>> print repr(unistr)
u'Min\xe4 tykk\xe4\xe4n Unicodesta!'
Python 3:
>>> unistr = 'Minä tykkään Unicodesta!'
>>> print(repr(unistr))
'Minä tykkään Unicodesta!'
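
When the escaped form is still wanted on Python 3, the built-in `ascii()` produces it. A short sketch:

```python
unistr = 'Minä tykkään Unicodesta!'

readable = repr(unistr)  # keeps printable non-ASCII characters
escaped = ascii(unistr)  # escapes everything outside ASCII

print(readable)
print(escaped)
```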

Python 2 → Python 3

Other things

The end

Questions?

U+2FD3 - KANGXI RADICAL DRAGON