WYS is not always WYG in python.re
Development, Python October 2nd, 2008
After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming.
This side project is quite trivial: fetch the HTML content, search for keywords in the thread, and build a table of contents with links for navigation. The only intriguing detail that makes this post worth five minutes of your time is that the page is in Chinese and encoded in GB2312.
Long story short, I am trying to find the total number of posts in the thread using this regular expression:
The first catch is that I have to declare the encoding of the source file, because the Python interpreter complains:
SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
OK, I will stick to UTF-8, so I add this declaration on the second line:
# -*- coding: utf-8 -*-
It does not work, and the dumped content of the page is a total mess. Oops, I forgot to decode the content to Unicode. Use a codec to wrap the handle returned by urlopen:
# load the page and decode it; gb is assumed to be codecs.lookup('gb2312')
content = gb.streamreader(urllib.urlopen(url)).read()
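For readers on Python 3, the same decoding step can be sketched with in-memory bytes instead of a live URL. This is only an illustration: the sample markup is made up, and `raw` stands in for whatever the server would actually send.

```python
import codecs
import io

# Hypothetical sample: pretend this is the raw GB2312 payload from the server.
raw = '<b class="page">总数 42</b>'.encode('gb2312')

# Option 1: decode the whole byte string at once.
content = raw.decode('gb2312')

# Option 2: wrap a byte stream in a decoding reader, as the post does.
reader = codecs.getreader('gb2312')(io.BytesIO(raw))
streamed = reader.read()
```

Both routes should yield the same text; the stream reader is mainly useful when the payload is read in chunks.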
And don't forget to add either the Unicode prefix (u'...') or the re.UNICODE flag to the pattern.
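As a side note, here is a small sketch of what the flag itself changes. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII is the opt-out, so the contrast is shown the other way round; the sample string is made up.

```python
import re

sample = '总数 42'

# Python 3 str patterns are Unicode-aware by default (re.UNICODE is implied).
unicode_words = re.findall(r'\w+', sample)

# re.ASCII restricts \w, \d, \b and friends to ASCII characters only.
ascii_words = re.findall(r'\w+', sample, re.ASCII)
```

The flag only affects how the character classes are interpreted; it does not turn a byte string into text.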
Still no luck. Yet the same pattern works in the Python console with faked data, and it also works in the script if we change it a little:
It looks like the troublemaker is the non-Latin characters: 总数. Let's play a little in the pdb console:
(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'
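The same inspection can be sketched in Python 3, where every str literal is already Unicode and the byte form has to be requested explicitly. The variable names are only illustrative.

```python
text = '总数'  # a Python 3 str: a sequence of code points, not bytes

# The bytes a UTF-8 source file stores for these two characters.
utf8_bytes = text.encode('utf-8')

# The bytes a GB2312 page would carry for the same two characters.
gb_bytes = text.encode('gb2312')

# The code points that the u'\u603b\u6570' form spells out.
code_points = [hex(ord(ch)) for ch in text]
```

The two byte forms differ, which is why comparing undecoded bytes across encodings cannot work.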
And it finally works with the hard-coded Unicode escapes:
We can use the decode method to avoid the ugly escaped Unicode string and keep the pattern readable:
Note that the codec passed to decode MUST match the encoding declared for the source file.
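A minimal sketch of why the codec has to match, written for Python 3 and assuming the source file is saved as UTF-8. The `errors='replace'` argument is only there so the sketch does not raise on byte sequences that are invalid in the wrong codec.

```python
# What a UTF-8 source file stores on disk for the literal.
source_bytes = '总数'.encode('utf-8')

# Decoding with the codec that matches the file gives the text back.
right = source_bytes.decode('utf-8')

# Decoding with a different codec yields garbage (or an error without 'replace').
wrong = source_bytes.decode('gb2312', errors='replace')
```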
Some speculations based on these observations:
- re.UNICODE does not enforce Unicode mode; it just redefines escapes such as \b and \w.
- A Unicode pattern and a Unicode string implicitly invoke Unicode mode. That explains why some patterns work only in the Python console: there, both the pattern and the string are UTF-8 byte strings, so re really runs in 8-bit mode!
- The Python interpreter does not translate a plain string literal to Unicode even though the source encoding is declared.
Please leave your insights in the comments. Thanks!
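The gist of these speculations can be checked with a small sketch. It is written for Python 3, where byte strings and text are distinct types that re refuses to mix, so it is an analogue of the Python 2 behaviour rather than a replay of it. The sample markup and the simplified pattern are made up.

```python
import re

# Hypothetical page as it arrives over the wire, and its decoded form.
page_bytes = '<b class="page">总数 42</b>'.encode('gb2312')
page_text = page_bytes.decode('gb2312')

# A byte pattern built from UTF-8 source bytes looks for the wrong byte
# sequence inside a GB2312 page, so it finds nothing.
byte_pattern = re.compile('总数 (?P<total>\\d+)'.encode('utf-8'))
no_match = byte_pattern.search(page_bytes)

# A text pattern against decoded text compares code points, so it matches.
text_pattern = re.compile(r'总数 (?P<total>\d+)')
total = text_pattern.search(page_text).group('total')

# Python 3 rejects a bytes pattern applied to text instead of failing silently.
try:
    byte_pattern.search(page_text)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```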
UPDATE:
First, thanks for all the comments. It seems I had a typo when testing the pattern with the Unicode prefix. Here are the test cases:
patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
]
print [ pattern.search(s) for pattern in patterns ]
The output is:

You're making it too complicated.
# -*- coding: utf-8 -*-
import re
import urllib

coding = 'gb2312'  # assumed: the page is GB2312, as the post says
content = urllib.urlopen(url).read()
content = content.decode(coding)
pattern = re.compile(ur'总数 (?P<total>\d+)', re.UNICODE)
Please note:
* strictly speaking, you should get the content encoding from the page itself
* I don't think you need that lookbehind assertion
* but what you definitely need is u'…' or ur'…'!
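The first point can be sketched roughly as follows, in Python 3. The helper `sniff_charset` is hypothetical, not a library function, and it only looks for a charset declaration in the first bytes of the markup; a real page may declare its encoding in the HTTP headers instead. The sample bytes are made up.

```python
import re

def sniff_charset(raw, default='utf-8'):
    """Return the charset named in the HTML markup, or `default` if none is found."""
    # The declaration itself is ASCII, so it can be searched for before decoding.
    found = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return found.group(1).decode('ascii').lower() if found else default

# Hypothetical page bytes, made up for illustration.
raw = ('<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
       '<b class="page">总数 42</b>').encode('gb2312')

coding = sniff_charset(raw)
content = raw.decode(coding)
```

Falling back to a default keeps the sketch simple; whether UTF-8 is the right fallback depends on the site.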
I think this code should work:
pattern = re.compile(u'(?<=总数 )(?P<total>\\d+)', re.UNICODE)
u'(?<=总数 )(?P<total>\\d+)' == '(?<=总数 )(?P<total>\d+)'.decode('utf-8')
I think the problem is that you're not actually using unicode; you're still using byte strings.
Instead of this:
pattern = re.compile('(?<=总数 )(?P<total>\d+)', re.UNICODE)
You should do this:
pattern = re.compile(u'(?<=总数 )(?P<total>\d+)', re.UNICODE)
Notice the "u" prefix on the string literal. That makes it a unicode literal.
I do not possess skills in Python regular expression madness, but I suggest you try the BeautifulSoup module for HTML pattern matching.
'总数'.decode('utf-8')
is equal to
# -*- coding: utf-8 -*-
u'总数'
Notice the u prefix.