WYS is not always WYG in python.re

Development, Python October 2nd, 2008

After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming:

This side project is quite trivial: fetch the HTML content, search for keywords in the thread, and build a table of contents with links for navigation. The only intriguing detail that makes this post worth five minutes of your time is that the page is in Chinese and encoded in GB2312.

Long story short, I am trying to find the total number of posts in the thread using this regular expression:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)')

The first catch is that I have to declare the encoding of the source file, because the Python interpreter complains:

SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

OK, I will stick with UTF-8, so I add this declaration on the second line of the file:
# -*- coding: utf-8 -*-

That alone does not work, and the dumped content of the page is garbled. Oops, I forgot to decode the content to Unicode. Wrapping the handle returned by urlopen in a codec stream reader fixes that:

import codecs
import urllib

gb = codecs.lookup('gb2312')
# load the page and decode it from GB2312 to Unicode
content = gb.streamreader(urllib.urlopen(url)).read()
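To look at the decoding step in isolation, here is a minimal sketch that swaps the network handle for an in-memory byte stream. The sample bytes and the name fake_response are made up for illustration; codecs.getreader returns the same stream reader class that codecs.lookup(...).streamreader exposes.

```python
import codecs
import io

# Hypothetical page fragment, encoded the way the server would send it (GB2312).
raw_bytes = u'<b class="page">\u603b\u6570 42</b>'.encode('gb2312')

# Stand-in for the handle returned by urlopen(): any object with read() will do.
fake_response = io.BytesIO(raw_bytes)

# Wrap the byte stream so that read() yields Unicode text instead of raw bytes.
reader = codecs.getreader('gb2312')(fake_response)
content = reader.read()
```

After this step, content holds Unicode text, so any pattern searched against it has to be Unicode as well.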

And don’t forget to add either the u prefix or the re.UNICODE flag to the pattern.

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE)

Still no luck. Yet the same pattern works in the Python console with faked data, and it also works in the script after a small change:

pattern = re.compile('(?<=<b class="page">.{2} )(?P<total>\d+)', re.UNICODE)

It looks like the troublemaker is the pair of non-Latin characters: 总数. Let’s poke around in the pdb console:

(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'
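The same comparison can be reproduced outside pdb. The sketch below is an illustration rather than part of the original script; it writes the two characters as escapes so that it does not depend on how the file itself is saved, and it shows that the text, its UTF-8 bytes, and its GB2312 bytes are three different things.

```python
text = u'\u603b\u6570'             # the two characters as Unicode text
utf8_bytes = text.encode('utf-8')  # what a UTF-8 source file stores for them
gb_bytes = text.encode('gb2312')   # what the GB2312 page stores for them

# Two characters, but a different number of bytes in each encoding.
print(len(text), len(utf8_bytes), len(gb_bytes))
print(repr(utf8_bytes))
print(repr(gb_bytes))
```

A byte-string pattern typed into a UTF-8 source file carries the UTF-8 bytes, which match neither the decoded text nor the raw GB2312 page.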

And it finally works with the characters hard-coded as escapes in a Unicode literal:

pattern = re.compile(u'(?<=<b class="page">\u603b\u6570 )(?P<total>\d+)', re.UNICODE)

To avoid the ugly escapes and keep the pattern readable, we can call the decode method on the literal instead:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf-8'), re.UNICODE)

Note that the codec passed to decode MUST match the encoding declared at the top of the source file.
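A small sketch of why the two have to agree, assuming the source file is saved as UTF-8. The variable names are mine; errors='replace' is used only so the mismatched decode cannot abort the example.

```python
text = u'\u603b\u6570'

# Bytes as they would sit in a source file saved as UTF-8.
literal_bytes = text.encode('utf-8')

# Decoding with the matching codec restores the intended characters.
good = literal_bytes.decode('utf-8')

# Decoding with a different codec does not. errors='replace' keeps this
# sketch from raising if the bytes are invalid in that codec.
bad = literal_bytes.decode('gb2312', 'replace')

print(good == text, bad == text)
```

Only the pattern decoded with the matching codec contains the characters that appear in the decoded page.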

Some speculations based upon these observations:

  • re.UNICODE does not force Unicode mode; it only redefines escapes such as \b and \w.
  • Unicode mode is triggered implicitly when both the pattern and the searched string are Unicode. That explains why some patterns work only in the Python console: there the pattern and the faked data are both UTF-8 byte strings, so re really runs in 8-bit mode!
  • The Python interpreter does not decode a plain string literal, even when the source encoding is declared.
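These speculations can be approximated in a few lines. Python 3 refuses to mix byte patterns with text, so the sketch below imitates the Python 2 situation by turning each UTF-8 byte of the pattern into the code point with the same value (a latin-1 decode). That mapping is my assumption about how an undecoded pattern ends up being compared with Unicode text, not a trace of the re internals, and the sample page is made up.

```python
import re

# Decoded page text (made-up sample).
page = u'<b class="page">\u603b\u6570 42</b>'

# The UTF-8 bytes of the two characters, viewed as one code point per byte.
# This imitates an undecoded byte-string pattern meeting Unicode text.
bytes_as_chars = u'\u603b\u6570'.encode('utf-8').decode('latin-1')
undecoded = re.compile(
    u'(?<=<b class="page">' + bytes_as_chars + u' )(?P<total>\\d+)',
    re.UNICODE)

# The same pattern built from the real Unicode characters.
decoded = re.compile(
    u'(?<=<b class="page">\u603b\u6570 )(?P<total>\\d+)',
    re.UNICODE)

miss = undecoded.search(page)  # re.UNICODE alone does not rescue this one
hit = decoded.search(page)
print(miss, hit.group('total'))
```

The flag is identical in both cases; only the contents of the pattern differ, which is consistent with the first and third points above.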

Please leave your insight in the comments. Thanks

UPDATE:
Thanks for all the comments. It seems that I made a typo when testing the pattern with the Unicode prefix. Here are the test cases:

patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    ]

print [ pattern.search(s) for pattern in patterns ]

The output is:

[<_sre.SRE_Match object at 0xb7c18260>, <_sre.SRE_Match object at 0xb7c18360>, <_sre.SRE_Match object at 0xb7c183a0>, None]

Download test.py

Build aBridge with wxGTK 2.6.1

Gentoo January 30th, 2006

aBridge is an open-source bridge application based on wxWindows 2.4.1. It is no longer maintained; I hope somebody will pick up the work.

Well, it looks like several users have had problems compiling aBridge. It took me a little while to locate the problem, until I read this post: wxWidgets with unicode is flaky. My wxGTK is compiled with unicode and gtk2 support:

[ebuild U ] x11-libs/wxGTK-2.6.2-r1 [2.6.1] +X -debug +doc -gnome -joystick -odbc +opengl +sdl +unicode 14,160 kB

And I tested the setting with this. Yes, it is indeed the unicode support. OK, just disable it in aclocal.m4.

Compile … several obsolete functions are used, replace them with alternatives… Compile … Redo the previous … Compile … Link …
Oops, another link error:

undefined reference to `pango_x_get_context'

I googled and found this, then added the pango setting to aclocal.m4:

WX_LIBS_ONLY="`$WX_CONFIG_WITH_ARGS --libs --unicode=no`"
PANGO_LIBS="`pkg-config --libs pangox`"
WX_LIBS="$PANGO_LIBS $WX_LIBS_ONLY "

Is this a bug in wxGTK? I don’t know. Make… Run… it works.

Time to put all the pieces together: download abridge-0.4.0-wxGTK-2.6.1.diff and apply it to the original source code:

$ cd abridge-0.4.0
$ patch -p1 < ../abridge-0.4.0-wxGTK-2.6.1.diff
$ ./configure
$ make

Feel free to develop the ebuild based upon my work.