Python: re-encoding an incorrectly encoded string
Python has quite decent built-in support for Unicode, but sometimes it’s hard to work out how to use it.
I lost some time solving a simple problem:
* A program reads a file, which is in the iso-8859-1 encoding.
* The text is converted to Unicode.
* At some point, the program detects that some strings were actually in the windows-1251 encoding.
* The task: how to re-interpret the string?
The only solution I found is hacky:
import codecs
w1251dec = codecs.getdecoder('windows-1251')
def reenc(s):
    s2 = ''
    for ch in s:
        s2 = s2 + chr(ord(ch))
    return w1251dec(s2)[0]
s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
print s, repr(s)
s = reenc(s)
print s, repr(s)
Anyway, it works. Running the program gives:
Ãëàâà u'\xc3\xeb\xe0\xe2\xe0'
Глава u'\u0413\u043b\u0430\u0432\u0430'
If you know a better solution, please share it!
By the way, do you see the difference between s and s2? The variable s is a unicode string, while s2 is an ordinary string (that is, an array of bytes). I solved the problem only after I came up with this trick.
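A shorter way to express the same round trip, as a minimal sketch: encoding to latin-1 maps every code point below 256 back to the byte with the same value, which is what the chr(ord(ch)) loop does by hand. The function name reinterpret and its parameters are made up for illustration, and the sketch assumes every character in the string fits in latin-1.

```python
def reinterpret(s, wrong='latin-1', right='windows-1251'):
    """Undo a decode done with the wrong codec, then decode with the right one."""
    raw = s.encode(wrong)       # unicode string -> the original bytes
    return raw.decode(right)    # original bytes -> correctly decoded unicode

s = u'\xc3\xeb\xe0\xe2\xe0'      # the mis-decoded string from the post
fixed = reinterpret(s)
print(repr(fixed))
```

If the string contains a character outside latin-1, the encode step raises UnicodeEncodeError instead of silently producing wrong bytes.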
See also: shut up, you dummy 7-bit Python.
August 17th, 2007 at 2:16 pm
Oh, god. The code is just invalid. For my goals, it works:
Still TODO: how to convert a unicode string to an ordinary (byte) string in a given encoding.
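For the TODO above, one possible answer is the encode method of unicode strings, sketched here under the assumption that the target encoding can represent every character in the text. The variable names are illustrative only.

```python
text = u'\u0413\u043b\u0430\u0432\u0430'   # the correctly decoded word from the post
raw = text.encode('windows-1251')           # ordinary byte string in the chosen encoding
print(repr(raw))
```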
August 17th, 2007 at 2:24 pm
Don’t worry, the code is correct. In my case, the source data is actually in the Mac Roman encoding, and it contains a character (less-than-or-equal) that does not exist in latin-1.
That is probably also why I failed with a simple solution: non-latin-1 data can’t be converted to latin-1.
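A small sketch of that failure, assuming Python’s mac_roman codec, in which byte 0xB2 is the less-than-or-equal sign (U+2264). The variable names are illustrative.

```python
ch = b'\xb2'.decode('mac_roman')    # less-than-or-equal sign in Mac Roman
try:
    ch.encode('latin-1')
    encodable = True
except UnicodeEncodeError:
    # latin-1 covers only code points 0..255, and U+2264 is outside that range
    encodable = False
print(repr(ch), encodable)
```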
August 17th, 2007 at 2:31 pm
The final solution is to use the first variant, but wrap chr(ord(ch)) in a try-except block and replace bad characters with “?”. That’s OK for me for now.
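The same effect can probably be had without an explicit try-except, using the codec’s 'replace' error handler, which substitutes “?” for anything the encoding cannot represent. This is a sketch rather than the code actually used here; the name reenc_lossy is made up.

```python
def reenc_lossy(s, right='windows-1251'):
    """Re-interpret s as `right`, turning non-latin-1 characters into '?'."""
    raw = s.encode('latin-1', 'replace')   # unencodable characters become '?'
    return raw.decode(right)

# Two windows-1251 letters followed by a character that latin-1 lacks.
fixed = reenc_lossy(u'\xc3\xeb\u2264')
print(repr(fixed))
```

The replacement is lossy: the original character cannot be recovered from the “?”.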
August 17th, 2007 at 2:34 pm
Can’t believe it! Actually, the problematic symbol was the Russian small letter “r”. Old-time programmers may remember a lot of problems with this letter in the 1990s.
Thanks, Adobe, for the reminder!
July 7th, 2009 at 7:32 am
I’ve noticed that after all the code changes it looks like: