Python: re-encoding an incorrectly encoded string
Python has quite decent built-in Unicode support, but it is sometimes hard to figure out how to make use of it.
I lost some time solving a simple problem:
* A program reads a file, which is in the iso-8859-1 encoding.
* The text is converted to Unicode.
* At some point, the program detects that some of the strings were actually in the windows-1251 encoding.
* The task: how to re-interpret the string?
The only solution I found is hacky:
import codecs
w1251dec = codecs.getdecoder('windows-1251')
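def reenc(s):
    # Turn each code point (0-255) back into the byte it came from.
    s2 = ''
    for ch in s:
        s2 = s2 + chr(ord(ch))
    # s2 is now a byte string; decode it with the correct encoding.
    return w1251dec(s2)[0]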
s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
print s, repr(s)
s = reenc(s)
print s, repr(s)
Anyway, it works. Running the program gives:
Ãëàâà u'\xc3\xeb\xe0\xe2\xe0'
Глава u'\u0413\u043b\u0430\u0432\u0430'
If you know a better solution, please share it!
By the way, do you see the difference between s and s2? The variable s is a unicode string, while s2 is an ordinary string (that is, an array of bytes). I solved the problem only once I had come up with this trick.
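One possibly simpler way to get the same bytes back, sketched here as an assumption rather than a tested replacement for the code above: latin-1 maps every byte value 0-255 to the code point with the same number, so encoding the wrongly decoded string back to latin-1 should recover the original bytes, which can then be decoded as windows-1251. The function name reinterpret and its parameter names are made up for this illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: undo a wrong decode by round-tripping through the wrong codec.
# "reinterpret", "wrong" and "right" are hypothetical names for this example.

def reinterpret(text, wrong='latin-1', right='windows-1251'):
    """Re-read a unicode string that was decoded with the wrong codec."""
    raw = text.encode(wrong)   # back to the original bytes
    return raw.decode(right)   # decode them with the intended codec

s = u"\xc3\xeb\xe0\xe2\xe0"    # the bytes of "Глава", misread as latin-1
fixed = reinterpret(s)
print(repr(fixed))
```

This only works when the wrong codec is lossless for the bytes involved, as latin-1 is. A codec that rejects or replaces some byte values during the first decode would have already destroyed information.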
See also: shut up, you dummy 7-bit Python.