shut up, you dummy 7-bit Python

I’m working on a Unicode-aware application. I like to use print to debug programs, but in this case it was a nightmare. The most common result of print was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position 0: ordinal not in range(128)
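The same error can be triggered on purpose, which shows what Python 2 does behind the scenes whenever a byte string has to be turned into a unicode string. This is only a minimal sketch; the sample word is an arbitrary illustration, not taken from the application.

```python
# -*- coding: utf-8 -*-
# Three Cyrillic letters, stored as UTF-8 bytes (the first byte is 0xd0).
raw = u'\u041d\u0435\u0442'.encode('utf-8')

try:
    # Python 2 performs this decode implicitly with the default codec.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
# 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
```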

I spent two hours fixing it, and I hope it’s done. The solution is one of the ugliest hacks I have ever written, but it eases the pain.

1) First, the most obvious idea was to set another codec, not ascii.

It was easy to find the function sys.setdefaultencoding, but the bad news was that this function doesn’t exist by the time a program runs: Python’s site module removes it from sys during startup.

Googling, I found discussions of this sad fact and several recommendations:

* The Illusive setdefaultencoding
* Using Unicode with ElementTidy
* [Zopyrus] sys.setdefaultencoding

I continued with the following:

reload(sys)
sys.setdefaultencoding('utf-8')
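Before resorting to reload(sys), the current state can be inspected. On Python 2 the setter is normally deleted during startup, which is why reloading the module brings it back; on Python 3 it does not exist at all. The sketch below only looks, it changes nothing.

```python
import sys

# After startup the setter is normally gone; only the getter remains.
has_setter = hasattr(sys, 'setdefaultencoding')

print(has_setter)
print(sys.getdefaultencoding())
```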

2) Fixing one bug, we get another

Here is a minimal a posteriori example:

import locale
locale.setlocale(locale.LC_ALL, '')

file_name = unicode('fgdjhkjdfhgjk', 'UTF-8')
try:
    open(file_name)
except IOError, e:
    print "%s: '%s'" % (e.strerror, file_name)

You’ll get the familiar message when you run it this way:

$ LANG=ru_RU.koi8-r python test3.py

(I think you can use any xx_XX.iso-8859-1 locale, as long as the error message “No such file or directory” for this language contains non-ASCII characters.)

When using the trick with setting the default encoding to utf8, the only change is the name of the codec in the error message (‘utf8’ instead of ‘ascii’).
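The failure comes from the string formatting itself: on Python 2, e.strerror is a byte string in the locale’s encoding, and combining it with the unicode file_name forces an implicit decode with the default codec. An alternative to changing the default codec is to decode the message explicitly. This is only a sketch, not what the final code of this post does: to_text is a hypothetical helper, it assumes the message bytes are in the encoding reported by locale.getpreferredencoding() unless told otherwise, and the KOI8-R sample bytes merely stand in for a real localized message.

```python
import locale


def to_text(value, encoding=None):
    """Return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # Assumption: the bytes use the locale's preferred encoding.
        return value.decode(encoding or locale.getpreferredencoding(), 'replace')
    return value


# Stand-in for a localized strerror delivered as KOI8-R bytes.
sample = u'\u041d\u0435\u0442 \u0442\u0430\u043a\u043e\u0433\u043e \u0444\u0430\u0439\u043b\u0430'
koi8_message = sample.encode('koi8-r')

# Both operands are text now, so no implicit decode takes place.
line = u"%s: '%s'" % (to_text(koi8_message, 'koi8-r'), u'fgdjhkjdfhgjk')
```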

3) This problem would disappear if the text of the error message were returned in the UTF-8 encoding.

The simplest way is to set the whole environment to UTF-8. But:

* I’m not ready for this transition yet, and
* what about other people? They’ll refuse to change their setup just to satisfy my program.

The best option is to switch to UTF-8 at the beginning of the program.

OK, no problem. I wrote a loop: for each LC_*, get the current language/encoding pair, and immediately set language/UTF-8.

Did I say no problem? None, except that getlocale returned None/None for every LC_*.

4) The smart workaround is to call setlocale(LC_XXX, '').

Setting the empty locale forces the system to look into the user environment and deduce the desired settings. Now getlocale returns real values, not None.
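The difference can be observed directly. A minimal sketch, using LC_TIME as an arbitrary category; locale_before_and_after is a hypothetical helper name, and the except branch covers environments that name a locale which is not installed.

```python
import locale


def locale_before_and_after(category):
    """Return getlocale() for category before and after setlocale(category, '')."""
    before = locale.getlocale(category)
    try:
        # The empty string means: take the setting from the user's environment.
        locale.setlocale(category, '')
    except locale.Error:
        # The environment names a locale that is not installed; keep the current one.
        pass
    after = locale.getlocale(category)
    return before, after


before, after = locale_before_and_after(locale.LC_TIME)
print(before)
print(after)
```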

But an attempt to set UTF-8 gives:

locale.Error: unsupported locale setting

Just as a test, I wrote:

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

And I got an unexpected error: for some reason, it isn’t supported on my system! Then I tried ru_RU.UTF-8 and was surprised even more: it was not accepted either.
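One likely explanation is that these locales were simply not installed on the system; on most Unix systems the command locale -a lists those that are. A name can also be probed from Python without leaving the process locale changed. This is a sketch with a hypothetical helper, locale_is_available.

```python
import locale


def locale_is_available(name, category=locale.LC_CTYPE):
    """Return True if setlocale accepts name; the previous setting is restored."""
    saved = locale.setlocale(category)  # no second argument: query only
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        locale.setlocale(category, saved)


for candidate in ('C', 'en_US.UTF-8', 'ru_RU.UTF-8'):
    print(candidate, locale_is_available(candidate))
```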

5) But the error message is in Russian, therefore locale settings somehow work?

Probably somewhere in the deep dark corners of the system libraries, the environment is inspected, the variable LANG is examined, and the corresponding locale is used.

Therefore, I also set the environment variables. It worked.

6) After I tried the code on Windows, I got the next error: the attribute locale.LC_MESSAGES is absent there.

Well, I added a dynamic lookup of the LC_* attributes.
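The lookup itself is small. A sketch of the idea in isolation; the names categories and lc_messages are chosen here for illustration only.

```python
import locale

# Collect whatever LC_* categories this platform defines.
categories = dict(
    (name, getattr(locale, name))
    for name in dir(locale)
    if name.startswith('LC_')
)

# LC_MESSAGES is looked up with a default instead of assuming it exists.
lc_messages = getattr(locale, 'LC_MESSAGES', None)

print(sorted(categories))
```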

The final code:

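import sys, locale, os

def force_utf8_hack():
    # site.py removes setdefaultencoding at startup; reload(sys) restores it.
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Look up the LC_* names dynamically: Windows has no LC_MESSAGES.
    for attr in dir(locale):
        if not attr.startswith('LC_'):
            continue
        aref = getattr(locale, attr)
        # An empty locale makes the system read the user's environment.
        locale.setlocale(aref, '')
        (lang, enc) = locale.getlocale(aref)
        if lang is not None:
            try:
                locale.setlocale(aref, (lang, 'UTF-8'))
            except locale.Error:
                # The UTF-8 locale is not accepted: set the environment variable instead.
                os.environ[attr] = lang + '.UTF-8'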

The license is public domain.

4 Responses to “shut up, you dummy 7-bit Python”

  1. zgoda Says:

    Yes, this is an ugly hack.

    If the problem is your “print” statement – fight it there, in the “print” statement, not at the system/platform level.

    Just remember that your terminal is a file-like object: you can’t put unicode objects into it, only encoded byte streams. If you happen to work with unicode objects, you have to encode them to a byte stream before putting them into any file or socket, or simply outputting them anywhere outside your application. Python is not well suited to guessing the required encoding, so it uses ASCII as the default encoding: “when facing ambiguity, refuse to guess”. You have to know the required encoding before producing output.

  2. olpa, OSS developer » Blog Archive » python, re-encoding incorrected encoded string Says:

    […] See also: shut up, you dummy 7-bit Python. […]

  3. individuo7 Says:

    Thanks, just what I needed… My program ran fine but the Django tests gave me a UnicodeDecodeError

  4. olpa Says:

    I have no experience with Django, but the first idea is that the utf8-initialization code should be performed at Django startup. How? I have no idea.

    The second idea is that Django is unlikely to use stdin/stdout operations directly, but instead creates some intermediate objects. Maybe there is an option that controls how such objects are created.
