python wtf: strip() eats too much

Much python-xml code is probably wrong. While tracing a bug, I found an interesting WTF. A minimal example:

import string
s1 =  "\xa0x\xa0"
s2 = u"\xa0x\xa0"
print repr(s1.strip())
print repr(s2.strip())
print repr(s2.strip(string.whitespace))

And what do we see in the output?

'\xa0x\xa0'
u'x'
u'\xa0x\xa0'

The second line differs from the other two, and 99.99% of programmers would expect all three lines to be the same.

Formally, Python is correct: the Unicode whitespace class also includes the non-breaking space. But let's think further. We can split the whitespace class into two subclasses, (in TeX terms) droppable glue and undroppable kern. And strip() shouldn't touch the latter.
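
To see why Python is formally right, here is a small sketch using the standard unicodedata module. The variable names are mine, and it only inspects how U+00A0 is classified; it is written so that it should run under both Python 2 and Python 3.

```python
import string
import unicodedata

nbsp = u"\xa0"

# Unicode files U+00A0 under the "space separator" category,
# which is why the unicode strip() treats it as whitespace.
nbsp_name = unicodedata.name(nbsp)          # 'NO-BREAK SPACE'
nbsp_category = unicodedata.category(nbsp)  # 'Zs'
nbsp_is_space = nbsp.isspace()              # True

# Assumption: string.whitespace holds only ASCII whitespace,
# so the non-breaking space is not part of it.
nbsp_in_ascii_set = nbsp in string.whitespace

print(nbsp_name)
print(nbsp_category)
print(nbsp_is_space)
print(nbsp_in_ascii_set)
```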

Fortunately, there is a workaround: pass string.whitespace to strip() explicitly.
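
A minimal sketch of that workaround, wrapped in a helper. The name strip_ascii_whitespace is hypothetical (it is not part of the standard library), and the sketch assumes string.whitespace contains only ASCII whitespace, as the output above suggests.

```python
import string


def strip_ascii_whitespace(text):
    """Hypothetical helper: strip only the characters listed in
    string.whitespace, leaving non-breaking spaces in place."""
    return text.strip(string.whitespace)


padded = u" \t\xa0x\xa0\n"

# The plain strip() also eats the non-breaking spaces ...
greedy = padded.strip()

# ... while the explicit character set stops at them.
careful = strip_ascii_whitespace(padded)

print(repr(greedy))
print(repr(careful))
```

The helper drops the ordinary space, tab and newline but keeps both \xa0 characters around the x, which is the "kern stays, glue goes" behaviour argued for above.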
