python, re-encoding an incorrectly encoded string

Python has quite decent built-in support for Unicode, but sometimes it’s hard to work out how to use it.

I lost some time solving a simple problem:

* A program reads a file, which is in the iso-8859-1 encoding.
* The text is converted to Unicode.
* At some point, the program detects that some strings were actually in the windows-1251 encoding.
* The task: how to re-interpret the string?
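
The situation can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax (an assumption; the code in this post is Python 2), and the names raw and text are made up for illustration:

```python
# Bytes that really hold windows-1251 text (a hypothetical sample word).
raw = "Глава".encode("windows-1251")

# The program assumes iso-8859-1 (latin-1), so each byte becomes the
# character with the same code point and the text turns into mojibake.
text = raw.decode("iso-8859-1")

print(repr(raw))   # the original bytes
print(repr(text))  # the wrongly decoded string
```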

The only solution I found is hacky:

import codecs
w1251dec = codecs.getdecoder('windows-1251')

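def reenc(s):
  # Rebuild the original bytes: each character of s came from one latin-1 byte
  s2 = ''
  for ch in s:
    s2 = s2 + chr(ord(ch))
  # Decode the bytes with the encoding they were really in
  return w1251dec(s2)[0]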

s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
print s, repr(s)
s = reenc(s)
print s, repr(s)

Anyway, it works. Running the program gives:

Ãëàâà u'\xc3\xeb\xe0\xe2\xe0'
Глава u'\u0413\u043b\u0430\u0432\u0430'

If you know a better solution, please share it!

By the way, do you see the difference between s and s2? The variable s is a unicode string, while s2 is an ordinary string (that is, an array of bytes). I solved the problem only after I came up with this trick.
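
The same distinction suggests a shorter route. latin-1 maps every byte 0..255 to the code point with the same number, so encoding the wrongly decoded string back to latin-1 should recover the original bytes, which can then be decoded as windows-1251. Below is a minimal sketch in Python 3 syntax (an assumption; the code in this post is Python 2). The parameters wrong and right are names introduced here for illustration:

```python
def reenc(s, wrong="latin-1", right="windows-1251"):
    """Re-interpret text that was decoded with the wrong encoding."""
    # Undo the wrong decoding to get the original bytes back...
    raw = s.encode(wrong)
    # ...then decode those bytes with the encoding they were really in.
    return raw.decode(right)


s = b"\xc3\xeb\xe0\xe2\xe0".decode("latin-1")
print(repr(s))
s = reenc(s)
print(repr(s))
```

This sketch raises UnicodeEncodeError if the string contains a character that latin-1 cannot represent. Passing errors="replace" to encode would substitute "?" for such characters instead.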

5 Responses to “python, re-encoding an incorrectly encoded string”

  1. olpa Says:

    Oh, god. The code is just invalid. For my purposes, this works:

    import codecs
    w1251dec = codecs.getdecoder('windows-1251')
    def reenc(s):
      s2 = ''
      for ch in s:
        try:
          ch2 = w1251dec(ch)[0]
          s2 = s2 + ch2
        except UnicodeEncodeError:
          # Keep characters that cannot be converted as they are
          s2 = s2 + ch
      return s2
    s = unicode("\xc3\xeb\xe0\xe2\xe0", 'latin-1')
    s = u'\u2264' + s
    print s, repr(s)
    s = reenc(s)
    print s, repr(s)

    Still TODO: how to convert a unicode string to an ordinary string in a given encoding.

  2. olpa Says:

    Don’t worry, the code is correct. In my case, the source data is actually in the Mac Roman encoding, and it contains a character (less-than-or-equal, ≤) that does not exist in latin-1.

    That is probably also why the simple solution failed for me: I can’t convert non-latin-1 data to latin-1.

  3. olpa Says:

    The final solution is to use the first variant, but wrap chr(ord(ch)) in a try-except block and replace bad characters with “?”. That’s OK for me for now.

  4. olpa Says:

    I can’t believe it! The problematic symbol was actually the Russian small letter “r”. Old-time programmers may remember a lot of problems with this letter in the 1990s.

    Thanks, Adobe, for the reminder!

  5. Oleg Says:

    I’ve noticed that, after all the code changes, it looks like this:

    import codecs

    def reenc_next(cin, cout, i, s):
      # Decode byte i with both codecs and swap one character for the other
      ch  = chr(i)
      ch1 = cin(ch)[0]
      ch2 = cout(ch)[0]
      s = s.replace(ch1, ch2)
      return s

    def reenc_string(from_enc, to_enc, s):
      from_dec = codecs.getdecoder(from_enc)
      to_dec   = codecs.getdecoder(to_enc)
      # Try every possible byte value
      for i in xrange(256):
        s = reenc_next(from_dec, to_dec, i, s)
      return s

