Python has quite decent built-in Unicode support, but sometimes it's hard to work out how to make use of it.
I've lost some time solving a simple problem:
* A program reads a file, which is in the iso-8859-1 encoding.
* The text is converted to Unicode.
* At some point, the program detects that some strings were actually in the windows-1251 encoding.
* The task: how to re-interpret the string?
The only solution I found is hacky:
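    # Python 2
    import codecs

    # Decoder function for windows-1251; it returns a (text, length) tuple
    w1251dec = codecs.getdecoder('windows-1251')

    def reenc(s):
        # Rebuild the original bytes: every character of a latin-1-decoded
        # string has a code point below 256, so chr(ord(ch)) is its byte
        s2 = ''
        for ch in s:
            s2 = s2 + chr(ord(ch))
        # Decode the byte string again and keep only the decoded text
        return w1251dec(s2)[0]

    s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
    print s, repr(s)
    s = reenc(s)
    print s, repr(s)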
Anyway, it works. Running the program gives:
    Ãëàâà u'\xc3\xeb\xe0\xe2\xe0'
    Глава u'\u0413\u043b\u0430\u0432\u0430'
If you know a better solution, please share it!
By the way, do you see the difference between s and s2? The variable s is a unicode string, while s2 is an ordinary string (that is, an array of bytes). I solved the problem only after I came up with this trick.
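A shorter route that seems to do the same job, offered as a sketch rather than a definitive answer: the loop in reenc is effectively encoding the string back to latin-1, so the built-in encode and decode methods can do the round trip directly. The helper name reinterpret and its parameter names are made up for illustration. The sketch is written for Python 3, where str is the Unicode type and bytes is the byte string; in Python 2 the same two method calls exist on unicode objects.

```python
def reinterpret(text, wrong="latin-1", right="windows-1251"):
    """Undo a decode done with the wrong codec and redo it with the right one.

    `reinterpret`, `wrong` and `right` are hypothetical names for this sketch.
    """
    # Encoding with the codec that was wrongly used recovers the original bytes
    raw = text.encode(wrong)
    # Decoding those bytes with the intended codec gives the real text
    return raw.decode(right)


# The same sample as above: windows-1251 bytes that were decoded as latin-1
s = b"\xc3\xeb\xe0\xe2\xe0".decode("latin-1")
fixed = reinterpret(s)
print(ascii(s))      # '\xc3\xeb\xe0\xe2\xe0'
print(ascii(fixed))  # '\u0413\u043b\u0430\u0432\u0430'
```

This relies on latin-1 mapping every one of the 256 byte values to its own character, so the wrong decode loses nothing. If the first decode had been lossy, the original bytes could not be recovered this way.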
See also: shut up, you dummy 7-bit Python.