The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string" : (1) string of characters , and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers : one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes , and they do so through two operations:
Encoding : characters → bytes
Decoding : bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique , because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).
Now, what I have called "character" above is what Unicode calls a " user-perceived character ". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called " code points "—these codes points can be combined together to form a "grapheme cluster". Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them " Unicode strings " (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points , not of user-perceived characters. The code point values are the ones used in Python's
\U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus
s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives
각 len 3 despite
s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as
print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
In Python 2 , Unicode strings are called… "Unicode strings" (
unicode type, literal form
u"…"), while byte arrays are "strings" (
str type, where the array of bytes can for instance be constructed with string literals
"…"). In Python 3 , Unicode strings are simply called "strings" (
str type, literal form
"…"), while byte arrays are "bytes" (
bytes type, literal form
b"…"). As a consequence, something like
"🐍" gives a different result in Python 2 (
'\xf0', a byte) and Python 3 (
"🐍", the first and only character).
With these few key points, you should be able to understand most encoding related questions!
Normally, when you print
u"…" to a terminal , you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python Python 2.7.6 (default, Nov 15 2013, 15:20:37) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> print sys.stdout.encoding UTF-8
If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a
UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat None % python3.4 -c "import sys; print(sys.stdout.encoding)" | cat UTF-8
The encoding of stdin, stdout and stderr can however be set through the
PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat UTF-8
If the printing to a terminal does not produce what you expect, you can check the UTF-8 encoding that you put manually in is correct; for instance, your first character (
\u001A) is not printable, if I'm not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs import locale import sys # Wrap sys.stdout into a StreamWriter to allow writing unicode. sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) uni = u"\u001A\u0BC3\u1451\U0001D10C" print uni
For Python 3, you can check one of the questions asked previously on StackOverflow.