Python: re-decoding an incorrectly decoded string

Python has quite decent built-in support for Unicode, but sometimes it's hard to find out how to make use of it.

I've lost some time solving a simple problem:

* A program reads a file, which is in the iso-8859-1 encoding.
* The text is converted to Unicode.
* At some point, the program detects that some of the strings were actually in the windows-1251 encoding.
* The task: how to re-interpret such a string?

The only solution I found is hacky:

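import codecs
# Decoder function for windows-1251; it returns a (text, length) tuple
w1251dec = codecs.getdecoder('windows-1251')

def reenc(s):
  # Rebuild the byte string: every character of s came from one latin-1 byte
  s2 = ''
  for ch in s:
    s2 = s2 + chr(ord(ch))
  # Decode the recovered bytes as windows-1251 and keep only the text
  return w1251dec(s2)[0]

# Windows-1251 bytes, wrongly decoded as latin-1
s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
print s, repr(s)
s = reenc(s)
print s, repr(s)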

Anyway, it works. Running the program gives:


Ãëàâà u'\xc3\xeb\xe0\xe2\xe0'
Глава u'\u0413\u043b\u0430\u0432\u0430'

If you know a better solution, please share it!

By the way, do you see the difference between s and s2? The variable s is a Unicode string, while s2 is an ordinary string (that is, an array of bytes). I solved the problem only after I came up with this trick.
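The same byte-versus-text idea can probably be written more compactly. Here is a minimal sketch, assuming the text really was decoded as latin-1 in the first place (latin-1 maps every byte 0-255 to the code point with the same value, so encoding back to latin-1 recovers the original bytes exactly). The helper name reenc2 is hypothetical, and the snippet is written to run on both Python 2 and Python 3:

```python
def reenc2(s):
    """Re-interpret text that was wrongly decoded as latin-1 (hypothetical helper)."""
    # Step 1: undo the wrong decoding; latin-1 is a one-to-one byte mapping,
    # so this gives back the original bytes from the file.
    raw = s.encode('latin-1')
    # Step 2: decode those bytes with the encoding they were really in.
    return raw.decode('windows-1251')

# The same sample as above: windows-1251 bytes decoded as latin-1 by mistake.
s = b"\xc3\xeb\xe0\xe2\xe0".decode('latin-1')
fixed = reenc2(s)
print(fixed == u'\u0413\u043b\u0430\u0432\u0430')  # True
```

This only works when the first decoding step lost no information, which is the case for latin-1; a string that contains characters outside the latin-1 range makes encode() raise UnicodeEncodeError.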

See also: shut up, you dummy 7-bit Python.

Categories: python
