Character set | Code range |
---|---|
ASCII | 0x00 to 0x7F |
ISO-8859-1 | 0x00 to 0x7F (ASCII) |
| 0x80 to 0xFF |
Unicode | U+0000 to U+007F (ASCII) |
| U+0080 to U+00FF (Latin-1 Supplement) |
| U+0100 to U+017F (Latin Extended-A) |
| U+0180 to U+10FFFF (...) |
512 chars per slide → 2175 slides to represent them all
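The ranges listed above can be checked from Python 3 with ord(). This is only an illustrative sketch: the helper name charset_range is hypothetical, and it distinguishes just the ranges named above.

```python
def charset_range(char):
    """Return the name of the range that a single character falls in."""
    codepoint = ord(char)
    if codepoint <= 0x7F:
        return 'ASCII'
    if codepoint <= 0xFF:
        return 'Latin-1 Supplement'
    if codepoint <= 0x17F:
        return 'Latin Extended-A'
    return 'beyond U+017F'

# ISO-8859-1 byte values coincide with the first 256 Unicode code points,
# so byte 0xE4 decodes to U+00E4.
latin1_char = bytes([0xE4]).decode('iso-8859-1')

examples = {char: charset_range(char)
            for char in ('A', latin1_char, '\u0151', '\u2603')}
```

Here '\u0151' is ő and '\u2603' is ☃.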
>>> unistr = u'I am a Unicode string on Python 2'
>>> type(unistr)
<type 'unicode'>
>>> unistr = 'I am a Unicode string on Python 3'
>>> type(unistr)
<class 'str'>
>>> u'☃' == u'\u2603' == u'\U00002603' == u'\N{SNOWMAN}'
True
>>> '☃' == '\u2603' == '\U00002603' == '\N{SNOWMAN}'
True
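The \N{...} escape uses the official character names, which the standard unicodedata module also exposes. A small Python 3 sketch of going between a character, its name, and its code point number:

```python
import unicodedata

snowman = '\N{SNOWMAN}'

name = unicodedata.name(snowman)           # character -> official name
by_name = unicodedata.lookup('SNOWMAN')    # official name -> character
by_number = chr(0x2603)                    # code point number -> character
codepoint = 'U+{:04X}'.format(ord(snowman))
```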
>>> bytestr = 'I am a Byte string on Python 2'
>>> type(bytestr)
<type 'str'>
>>> bytestr = b'I am a Byte string on Python 3'
>>> type(bytestr)
<class 'bytes'>
Type | Sequence of | Python 2 | → | Python 3 |
---|---|---|---|---|
Unicode strings | Codepoints | u"Unicode string" | → | "Unicode string" |
| | <type 'unicode'> | → | <class 'str'> |
Byte strings | Bytes | "Byte string" | → | b"Byte string" |
| | <type 'str'> | → | <class 'bytes'> |
u is no longer necessary for Unicode strings
b is now necessary for Byte strings
>>> bytestr = unistr.encode('utf-8')
>>> unistr = bytestr.decode('utf-8')
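As a rough Python 3 illustration of the encode/decode pair: the same text produces different bytes under different encodings, and decoding with the matching encoding gives the text back. The choice of encodings here is arbitrary.

```python
unistr = 'Minä tykkään Unicodesta!'

encoded = {}
for encoding in ('utf-8', 'utf-16-le', 'iso-8859-1'):
    bytestr = unistr.encode(encoding)
    # Decoding with the same encoding recovers the original text.
    assert bytestr.decode(encoding) == unistr
    encoded[encoding] = bytestr

# The number of bytes depends on the encoding, the number of code points does not.
byte_lengths = {encoding: len(data) for encoding, data in encoded.items()}
```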
>>> snowman = 'Snowman: ☃' # Python 3
>>> snowman.encode('ascii', 'strict') # default
UnicodeEncodeError: 'ascii' codec can't encode character
'\u2603' in position 9: ordinal not in range(128)
>>> snowman.encode('ascii', 'replace')
b'Snowman: ?'
>>> snowman.encode('ascii', 'ignore')
b'Snowman: '
>>> snowman.encode('ascii', 'xmlcharrefreplace')
b'Snowman: &#9731;'
>>> snowman.encode('ascii', 'backslashreplace')
b'Snowman: \\u2603'
>>> snowman.encode('utf-8')
b'Snowman: \xe2\x98\x83'
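The same error handlers apply in the other direction, when decoding. The sketch below assumes Python 3.5 or newer (where 'backslashreplace' also works for decoding); try_decode is a hypothetical helper used only to collect the results.

```python
data = 'Snowman: \u2603'.encode('utf-8')   # b'Snowman: \xe2\x98\x83'

def try_decode(raw, encoding, errors):
    """Decode raw bytes, reporting the failing offset instead of raising."""
    try:
        return raw.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        return 'failed at byte offset {}'.format(exc.start)

handlers = ('strict', 'replace', 'ignore', 'backslashreplace')
results = {errors: try_decode(data, 'ascii', errors) for errors in handlers}
```

With 'replace', every undecodable byte becomes U+FFFD, so one three-byte character turns into three replacement characters.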
>>> u'Unicode and ' + 'Bytes'
u'Unicode and Bytes'
Python 2 decodes the 'str' implicitly before concatenating
sys.getdefaultencoding() is used
sys.getdefaultencoding() == 'ascii'
If the 'str' contains non-ASCII bytes, the decoding fails
>>> u'Unicode and ' + 'ByteSnowMan: ☃'
UnicodeDecodeError: 'ascii' codec can't
decode byte 0xe2 in position 13: ordinal
not in range(128)
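The implicit step can be spelled out explicitly. This Python 3 sketch imitates what Python 2 does behind the scenes (decode the byte string as ASCII, then concatenate); it is not Python 2 code itself.

```python
unicode_part = 'Unicode and '
byte_part = 'ByteSnowMan: \u2603'.encode('utf-8')

try:
    # The explicit equivalent of Python 2's implicit conversion.
    combined = unicode_part + byte_part.decode('ascii')
except UnicodeDecodeError as exc:
    combined = None
    # Which codec failed, at which offset, and on which byte.
    failure = (exc.encoding, exc.start, hex(exc.object[exc.start]))
```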
>>> 'Unicode and ' + b'Bytes'
TypeError: Can't convert 'bytes' object to str implicitly
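On Python 3 the conversion has to be written out. A minimal sketch, assuming the bytes are UTF-8 unless stated otherwise; concat is a hypothetical helper name.

```python
def concat(text, data, encoding='utf-8'):
    """Join a str and a bytes object by decoding the bytes explicitly."""
    return text + data.decode(encoding)

result = concat('Unicode and ', b'Bytes')

# Mixing the two types directly is refused rather than guessed at.
try:
    'Unicode and ' + b'Bytes'
except TypeError:
    mixing_raises = True
else:
    mixing_raises = False
```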
>>> unistr = 'Minä tykkään Unicodesta!' # Python 3
>>> unistr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character
'\xe4' in position 3: ordinal not in range(128)
>>> bytestr = unistr.encode('iso-8859-1')
>>> bytestr.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes
in position 3-5: invalid data
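When the encoding of incoming bytes is uncertain, one possible approach is to try candidates in order. This is only a sketch: decode_with_fallback and its candidate list are assumptions, and because ISO-8859-1 accepts every byte value it can only serve as a last resort that may silently produce wrong characters.

```python
def decode_with_fallback(data, encodings=('utf-8', 'iso-8859-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit the data')

unistr = 'Minä tykkään Unicodesta!'
from_latin1 = decode_with_fallback(unistr.encode('iso-8859-1'))
from_utf8 = decode_with_fallback(unistr.encode('utf-8'))
```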
>>> unistr = u'Minä tykkään Unicodesta!' # Python 2
>>> # I ♥ Unicode
>>> bytestr = unistr.encode('utf-8')
>>> print len(unistr)
24
>>> print len(bytestr)
27
>>> print unistr[:4]
Minä
>>> print bytestr[:4]
Min├
>>> bytestr
'Min\xc3\xa4 tykk\xc3\xa4\xc3\xa4n Unicodesta!'
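The same comparison on Python 3, as a short sketch: slicing text counts code points, slicing bytes counts bytes, and a byte slice can end in the middle of a character.

```python
unistr = 'Minä tykkään Unicodesta!'
bytestr = unistr.encode('utf-8')

text_prefix = unistr[:4]     # four code points
byte_prefix = bytestr[:4]    # four bytes, cut in the middle of 'ä'

# The truncated sequence is not valid UTF-8; 'replace' marks the damage.
repaired = byte_prefix.decode('utf-8', 'replace')
```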
# -*- coding: utf-8 -*-
print u'This file is saved in UTF-8 ☺'
urllib.urlopen → Bytes (Python 2)
urllib.request.urlopen → Bytes (Python 3)
xml.dom.minidom ↔ Bytes/Unicode
content-type: text/html; charset=utf-8
<?xml version="1.0" encoding="utf-8" ?>
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
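Since the network hands over bytes, the declared charset has to be read from the metadata before decoding. One way to sketch this in Python 3 is with the standard email.message.Message class, which can parse a Content-Type value. The helper name and the UTF-8 default are assumptions for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message['Content-Type'] = header_value
    # get_content_charset() returns None when no charset is declared.
    return message.get_content_charset() or default

body = 'Snowman: \u2603'.encode('utf-8')   # stands in for a response body
charset = charset_from_content_type('text/html; charset=utf-8')
text = body.decode(charset)
```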
Python 2:
open('file.txt', 'rb') → Bytes
open('file.txt', 'rt') → Bytes
codecs.open('file.txt', 'rt', 'utf-8') → Unicode
Python 3:
open('file.txt', 'rb') → Bytes
open('file.txt', 'rt') → Unicode
codecs.open() is no longer needed
open() accepts the encoding and errors args as keywords:
open('file.txt', 'rt', encoding='utf-8', errors='strict')
With encoding=None, Python uses locale.getpreferredencoding(False), not sys.getdefaultencoding()
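A minimal Python 3 sketch of text mode versus binary mode on the same file. The temporary directory and file name are placeholders.

```python
import os
import tempfile

text = 'Snowman: \u2603'

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'file.txt')

    # Text mode: str in, encoded to bytes on disk with the given encoding.
    with open(path, 'wt', encoding='utf-8', errors='strict') as handle:
        handle.write(text)

    # Text mode with the same encoding gives the str back.
    with open(path, 'rt', encoding='utf-8') as handle:
        as_text = handle.read()

    # Binary mode exposes the raw bytes, with no decoding at all.
    with open(path, 'rb') as handle:
        as_bytes = handle.read()
```

Passing the encoding explicitly keeps the result independent of the platform default.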
Python 2 | Python 3 |
---|---|
os.getcwd() / os.getcwdu() → Bytes / Unicode | os.getcwdb() / os.getcwd() → Bytes / Unicode |
os.listdir('.') / os.listdir(u'.') → Bytes / Unicode | os.listdir(b'.') / os.listdir('.') → Bytes / Unicode |
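On Python 3 the type of the argument selects the type of the result, and os.fsencode() / os.fsdecode() convert between the two forms using the filesystem encoding. A short sketch against the current directory:

```python
import os

text_names = os.listdir('.')     # str in, str out
byte_names = os.listdir(b'.')    # bytes in, bytes out

cwd_text = os.getcwd()           # str
cwd_bytes = os.getcwdb()         # bytes

# The two forms of the same path convert into each other.
roundtrip_ok = (os.fsdecode(cwd_bytes) == cwd_text
                and os.fsencode(cwd_text) == cwd_bytes)
```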
>>> os.environ['LANG']
'en_US.UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.errors # Python 3 only!
'strict'
>>> sys.stdout.encoding = 'ascii'
AttributeError: can't set attribute
>>> sys.stdout.errors = 'replace'
AttributeError: can't set attribute
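Because these attributes are read-only, one workaround on Python 3 is to wrap a byte stream in a new text stream with the desired settings. In this sketch an io.BytesIO object stands in for sys.stdout.buffer so the result can be inspected; in a real program the wrapper would typically be assigned to sys.stdout instead.

```python
import io

raw = io.BytesIO()   # stand-in for sys.stdout.buffer in this sketch

# newline='\n' disables newline translation so the output is the same everywhere.
stream = io.TextIOWrapper(raw, encoding='ascii', errors='replace', newline='\n')

print('Snowman: \u2603', file=stream)
stream.flush()

written = raw.getvalue()
```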
sys.stdin / sys.stdout can be replaced with other file-like objects
>>> unistr = u'Minä tykkään Unicodesta!'
>>> print repr(unistr)
u'Min\xe4 tykk\xe4\xe4n Unicodesta!'
>>> unistr = 'Minä tykkään Unicodesta!'
>>> print(repr(unistr))
'Minä tykkään Unicodesta!'
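Python 3 still offers the escaped form through the built-in ascii(), which behaves much like Python 2's repr() for non-ASCII text (without the u prefix). A brief sketch:

```python
unistr = 'Minä tykkään Unicodesta!'

readable = repr(unistr)    # keeps printable non-ASCII characters as they are
escaped = ascii(unistr)    # escapes everything outside ASCII
```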
unicode for all unencoded text
str only for binary or encoded data
Remove u from unicode strings
Add b to byte strings
Change unicode(bytestr, enc) to str(bytestr, enc)
unicodedata module
bytearray mutable type (PEP 3137)
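A short Python 3 sketch of what the mutable bytearray type allows compared with bytes. The sample data is arbitrary.

```python
data = bytearray('Minä'.encode('utf-8'))

data[0] = ord('m')     # items are integers and can be assigned in place
data.extend(b'!')      # the buffer can grow

decoded = data.decode('utf-8')

# bytes offers the same read-only interface but rejects modification.
frozen = bytes(data)
try:
    frozen[0] = ord('M')
except TypeError:
    bytes_are_immutable = True
else:
    bytes_are_immutable = False
```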