shut up, you dummy 7-bit Python

I'm working on a Unicode-aware application. I like to use print to debug programs, but in this case it was a nightmare. The most common result of print was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position 0: ordinal not in range(128)

I spent two hours fixing it, and I hope it's done. The solution is one of the ugliest hacks I have ever written, but it takes the pain away.
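For reference, this is the message the 'ascii' codec produces whenever it is asked to decode a byte above 127. Python 2 performs that decoding implicitly when a byte string meets a unicode string; the sketch below makes the step explicit so the failure can be seen on its own. The sample word and the KOI8-R encoding are assumptions chosen for illustration.

```python
# Illustrative sketch: the sample word and the KOI8-R encoding are assumptions.
sample = u'\u041d\u0435\u0442'      # a short Russian word ("no")
raw = sample.encode('koi8-r')       # non-ASCII bytes, as a KOI8-R system would produce

try:
    raw.decode('ascii')             # the conversion Python 2 attempts implicitly
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```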

1) First, the most obvious idea was to set another codec, not ascii.

It was easy to find the function sys.setdefaultencoding, but it was bad news to learn that this function doesn't exist.

Googling, I found discussions of this sad fact and several recommendations:

* The Illusive setdefaultencoding
* Using Unicode with ElementTidy
* [Zopyrus] sys.setdefaultencoding

I continued with the following:

import sys
reload(sys)  # Python 2 only: re-importing sys brings setdefaultencoding back
sys.setdefaultencoding('utf-8')
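Some background on why the trick works, as far as I understand it: Python 2's site module deletes sys.setdefaultencoding during startup, so the function is documented yet missing, and reload(sys) brings it back. The check below is only a sketch for inspecting the current state; it changes nothing. On Python 3 the default encoding is fixed at UTF-8 and the setter is gone for good.

```python
import sys

# Inspect only; nothing is modified here.
default_encoding = sys.getdefaultencoding()
has_setter = hasattr(sys, 'setdefaultencoding')

print('default encoding: %s' % default_encoding)
print('sys.setdefaultencoding available: %s' % has_setter)
```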

2) Fixing one bug, we get another

Here is a minimal a posteriori example:

import locale
locale.setlocale(locale.LC_ALL, '')

file_name = unicode('fgdjhkjdfhgjk', 'UTF-8')
try:
    open(file_name)
except IOError, e:
    print "%s: '%s'" % (e.strerror, file_name)

You'll get the familiar message when you run it this way:

$ LANG=ru_RU.koi8-r python test3.py

(I think any xx_XX.iso-8859-1 locale will do, as long as the error message "No such file or directory" for that language contains non-ASCII characters.)

With the trick of setting the default encoding to UTF-8, the only change is the name of the codec in the error message ('utf8' instead of 'ascii').
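Why only the codec name changes can be sketched without touching any locale settings. The bytes below stand in for the KOI8-R error text (an assumption; the real text comes from the C library). They are not valid UTF-8 either, so decoding fails under both defaults and succeeds only with the encoding the bytes were actually written in.

```python
# Stand-in for the localized strerror text; the exact wording is an assumption.
original = (u'\u041d\u0435\u0442 \u0442\u0430\u043a\u043e\u0433\u043e '
            u'\u0444\u0430\u0439\u043b\u0430')
koi8_bytes = original.encode('koi8-r')

def try_decode(data, codec):
    """Return (text, None) on success, or (None, name of the failing codec)."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc.encoding

for codec in ('ascii', 'utf-8', 'koi8-r'):
    text, failed_codec = try_decode(koi8_bytes, codec)
    if text is None:
        print('%-7s fails (reported codec: %s)' % (codec, failed_codec))
    else:
        print('%-7s ok' % codec)
```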

3) This problem would disappear if the text of the error message were returned in the UTF-8 encoding.

The simplest way is to set the whole environment to UTF-8. But:

* I'm not ready for this transition yet, and
* what about other people? They won't change their own setup just to satisfy my program.

The best option is to switch to UTF-8 at the beginning of the program.

OK, no problems. I wrote a loop: for each LC_*, get the current language/encoding pair, and immediately set language/UTF-8.

Did I say no problems? None, except that getlocale returned None/None for every LC_*.

4) The smart workaround is to call setlocale(LC_XXX, '').

Setting the empty locale forces the system to look into the user environment and deduce the desired settings. Now getlocale returns real values, not None.
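The effect can be observed with a few lines. This is only a sketch, and what it prints depends entirely on the environment it runs in; newer interpreters may already have initialised LC_CTYPE at startup, so the "before" value is not guaranteed to be None/None there. The describe helper is a hypothetical name.

```python
import locale

def describe(category):
    """Return getlocale()'s (language, encoding) pair, or None if it cannot be parsed."""
    try:
        return locale.getlocale(category)
    except ValueError:      # raised for locale names Python does not recognise
        return None

before = describe(locale.LC_CTYPE)
try:
    locale.setlocale(locale.LC_CTYPE, '')   # '' means: take the setting from the environment
except locale.Error:
    pass                                    # the environment names an unsupported locale
after = describe(locale.LC_CTYPE)

print('before: %r' % (before,))
print('after:  %r' % (after,))
```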

But an attempt to set UTF-8 gives:

locale.Error: unsupported locale setting

Just as a test, I wrote:

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

And I got an unexpected error: for some reason, it isn't supported on my system! Then I tried ru_RU.UTF-8 and was surprised even more. It was not accepted either.
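One plausible explanation, offered as an assumption since I never confirmed it, is that those UTF-8 locales were simply not installed on the machine: setlocale accepts only the locales the C library knows about (on most Unix systems the locale -a command lists them). A small probe along these lines can check a name before relying on it; locale_is_available is a hypothetical helper, not part of the locale module.

```python
import locale

def locale_is_available(name, category=locale.LC_CTYPE):
    """Try to switch to `name`; restore the previous setting afterwards."""
    saved = locale.setlocale(category)      # no second argument: query only
    try:
        locale.setlocale(category, name)
    except locale.Error:                    # "unsupported locale setting"
        return False
    locale.setlocale(category, saved)
    return True

for candidate in ('C', 'en_US.UTF-8', 'ru_RU.UTF-8'):
    print('%-12s %s' % (candidate, locale_is_available(candidate)))
```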

5) But the error message is in Russian, so the locale settings must work somehow?

Probably somewhere in the deep dark corners of the system libraries, the environment is inspected, the variable LANG is examined, and the corresponding locale is used.

Therefore, I also set the environment variables. It worked.
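The string handling behind this step can be sketched as follows. The helper name is hypothetical, and the sketch works on a copy of the environment so that nothing is changed by running it. Whether the C library picks up such a change in the middle of a process is platform-dependent, so treat this as an illustration rather than a guaranteed fix.

```python
import os

def with_utf8_codeset(locale_name):
    """'ru_RU.koi8-r' -> 'ru_RU.UTF-8'; any '@modifier' suffix is dropped."""
    language = locale_name.split('@')[0].split('.')[0]
    return language + '.UTF-8'

env = dict(os.environ)                          # a copy, not the live environment
current = env.get('LANG') or 'ru_RU.koi8-r'     # fallback value is an assumption for the demo
env['LANG'] = with_utf8_codeset(current)

print('%s -> %s' % (current, env['LANG']))
```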

6) When I tried it on Windows, I got the next error: the attribute locale.LC_MESSAGES is absent.

Well, I added a dynamic lookup of the LC_* attributes.
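The lookup can be sketched like this: instead of naming each category, list whatever LC_* constants the locale module exposes on the current platform, so a constant that is missing there is never touched.

```python
import locale

# Collect the LC_* constants this platform actually defines.
categories = dict(
    (name, getattr(locale, name))
    for name in dir(locale)
    if name.startswith('LC_')
)

for name in sorted(categories):
    print('%s = %r' % (name, categories[name]))

# A single optional constant can also be probed directly.
lc_messages = getattr(locale, 'LC_MESSAGES', None)   # None where it is absent
```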

The final code:

import sys, locale, os

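def force_utf8_hack():
    # Python 2 only: reload(sys) restores setdefaultencoding, removed at startup.
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Look the LC_* categories up dynamically; Windows has no LC_MESSAGES.
    for attr in dir(locale):
        if not attr.startswith('LC_'):
            continue
        aref = getattr(locale, attr)
        # '' makes the system read the settings from the user's environment.
        locale.setlocale(aref, '')
        (lang, enc) = locale.getlocale(aref)
        if lang is not None:
            try:
                # Keep the language, switch the encoding to UTF-8.
                locale.setlocale(aref, (lang, 'UTF-8'))
            except locale.Error:
                # Locale not supported: fall back to the environment variable.
                os.environ[attr] = lang + '.UTF-8'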

The license is public domain.
