Parse HTML file with BeautifulSoup

Development October 12th, 2008

In the last post, I used regular expressions to fetch specific information. For structured information, BeautifulSoup is the better choice because of its simplicity and convenient API:

  • You may override fromEncoding in the constructor, which is very useful for non-Roman, non-standard web pages.
  • Versatile find/findAll on tags and attributes.
  • Developer-friendly syntactic sugar: Tag implements the interfaces of string, list, dict and callable, so there are many ways to access the data. The drawback is that a typo is only caught at run time rather than at compile time.
  • Easy to deploy: there is only one BeautifulSoup.py file.

Something I don’t like:

  • No XPath support, so more effort is needed to port from JavaScript.
  • The API does not support streams or file objects. Laziness is always cherished for pipelining.
  • Why BeautifulSoup? I have mistyped it as soap more than ten times.

Here is the home-brewed script that wades through the Dvbbs thread to find the corresponding messages: elevator.py

WYS is not always WYG in python.re

Development October 2nd, 2008

After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming:

This side project is quite trivial: fetch the HTML content, search for the keywords in the thread, and build a table of contents of links for navigation. The only intriguing highlight that makes this post worth your 5 minutes is that the page is in Chinese and encoded in GB2312.

Long story short, I am trying to find the total number of posts in the thread using this regular expression:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)')

The first catch is that I have to declare the encoding of the source code, as the Python interpreter complains:

SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

OK, I will stick to UTF-8, so I add this declaration on the second line:
# -*- coding: utf-8 -*-

It does not work, and the dumped content of the page is totally garbled. Oops, we forgot to decode the content to Unicode. Use codecs to wrap the urlopen handle:

import codecs
import urllib

gb = codecs.lookup('gb2312')
# load the page and decode it from GB2312 to Unicode while reading
content = gb.streamreader(urllib.urlopen(url)).read()
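
For readers on Python 3, here is a minimal sketch of the same decoding step. It is not the code from elevator.py: the io.BytesIO object and the sample markup are stand-ins I made up for the urlopen handle and the real page, and codecs.getreader is used in place of codecs.lookup(...).streamreader (both give the same StreamReader class).

```python
import codecs
import io

# Hypothetical stand-in for the urlopen() handle: any binary file-like
# object works. The markup is sample data, not the real Dvbbs page.
raw = io.BytesIO('<b class="page">总数 128</b>'.encode("gb2312"))

# codecs.getreader returns the StreamReader class for the codec;
# wrapping the binary handle makes read() return decoded text.
reader = codecs.getreader("gb2312")(raw)
content = reader.read()

print(type(content).__name__)
print(content)
```

The point is only that the bytes are decoded once, at the boundary, so everything after this line works on Unicode text.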

And don’t forget to add either the Unicode prefix or the re.UNICODE flag to the pattern.

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE)

Still no luck, but the same pattern works in the Python console with fake data, and it also works if we change it a little bit:

pattern = re.compile('(?<=<b class="page">.{2} )(?P<total>\d+)', re.UNICODE)

Looks like the trouble maker is the non-Latin characters: 总数. Let’s play a little bit in the pdb console:

(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'
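
The same observation can be reproduced without pdb. This is a small Python 3 sketch (an assumption on my side, since the session above is Python 2, where a plain literal is a byte string): the two characters are one piece of text, but their byte form depends on the codec.

```python
text = "总数"

# The UTF-8 bytes are what the Python 2 byte-string literal above contained.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)          # b'\xe6\x80\xbb\xe6\x95\xb0'

# The code points are what the decoded (Unicode) string contains.
print(ascii(text))         # '\u603b\u6570'

# The page itself is GB2312, where each character takes two bytes,
# so its bytes never equal the UTF-8 bytes of the source file.
gb_bytes = text.encode("gb2312")
print(len(utf8_bytes), len(gb_bytes))
```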

And it works finally with the hard-coded Unicode character:

pattern = re.compile(u'(?<=<b class="page">\u603b\u6570 )(?P<total>\d+)', re.UNICODE)

We can use the decode method to avoid the ugly Unicode string for better readability:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf-8'), re.UNICODE)

Note that the codec passed to decode MUST match the encoding declared at the top of the source file.

Some speculations based upon the observation:

  • re.UNICODE does not enforce Unicode mode; it just redefines escapes such as \b and \w.
  • A Unicode pattern and a Unicode string implicitly invoke Unicode mode. That explains why some patterns work only in the Python console: there, both pattern and string are encoded in UTF-8, so re really runs on 8-bit strings!
  • The Python interpreter does not translate string literals even though the source encoding is specified.

Please leave your insights in the comments. Thanks

UPDATE:
Thanks for all the comments first. It seems that I had a typo when testing the pattern with the Unicode prefix. Here are the test cases:

patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    ]

print [ pattern.search(s) for pattern in patterns ]

The output is:

[<_sre.SRE_Match object at 0xb7c18260>, <_sre.SRE_Match object at 0xb7c18360>, <_sre.SRE_Match object at 0xb7c183a0>, None]
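
Here is a rough Python 3 rendering of why only the last pattern fails. It is a sketch under my own assumptions, not test.py: the sample markup and the count 128 are made up, and since Python 3 has no ur prefix and refuses to mix bytes and text, the failing case is modelled as a UTF-8 byte pattern searched against the raw GB2312 bytes of the page.

```python
import re

# Hypothetical sample data standing in for the fetched page.
page_text = '<b class="page">总数 128</b>'
page_bytes = page_text.encode("gb2312")      # what the server actually sends

source = '(?<=<b class="page">总数 )(?P<total>\\d+)</b>'

# Text pattern against decoded text: both sides are Unicode.
text_pattern = re.compile(source)
text_match = text_pattern.search(page_bytes.decode("gb2312"))

# Byte pattern taken from a UTF-8 source file against GB2312 bytes:
# the bytes for the two Chinese characters differ, so nothing matches.
byte_pattern = re.compile(source.encode("utf-8"))
byte_match = byte_pattern.search(page_bytes)

print(text_match.group("total"))
print(byte_match)
```

So the pattern and the subject have to agree on what a character is. Decoding both to Unicode is the simplest way to get there.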

Download test.py

When RegEx meets WordML

Development May 23rd, 2008

One of my excuses for not updating this blog is some exhausting logistics work I had to tackle in the last couple of weeks. Long story short, the requirement is to load an Excel file, filter it with a lookup table, then retrieve extra information from a line-based text file and render a docx file with some words highlighted. Let’s decompose the problem into tasks, one by one:

Retrieve extra information from a line-based text file
A typical regular expression match example.

Render the docx file with some words highlighted
This task seems easy: as you know, a docx file is ultimately zipped Office Open XML, aka text. We can even replace all the words in one shot, as this recipe suggests. Assume the example sentence is:
Kun loves programming and beer, would you buy me one beer?
The words to be highlighted are programming and beer.

Microsoft Office Word 2007 breaks the sentence into 7 pieces: "Kun loves_", "programming", "_and_", "beer", ", would you buy me one_", "beer" and "?", where _ stands for a leading or trailing space. Each piece is rendered with either the normal style or the highlighted style. That is quite messy.

WordML may support embedded styles somewhere in the bible, but I am going to live with that since it is crunch time and we can cheat: have you noticed that our highlighted words are always followed by normal text? So we can put the whole sentence in a normal style enclosure; whenever the RegEx hits a match, we break the enclosure, insert the highlighted word with the highlighted style, then start a new normal enclosure. Brilliant!

Hold on, the text is rendered in Word 2007 as:
Kun lovesprogrammingandbeer, would you buy me onebeer?
According to WordML spec or the scream of Jeni:

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating specifying that their whitespace is significant via the xml:space element.

So the formal solution for this quiz is to add the xml:space="preserve" attribute whenever the normal text has leading and/or trailing space(s). In our case, Kun loves_, _and_ and , would you buy me one_ need that attribute but ? does not. The versatile re.sub also accepts a callback function instead of a string for more complicated substitutions like this. Whenever a highlighted word is followed by a space, the normal text after it needs to preserve that space, so we can build the pattern like this:

pattern = re.compile("(?<=\W(%s))(\s)" % "|".join(the_list_of_to_be_highlighted_words))

In the callback function, we set the attribute if group(1) is matched. Some corner cases need more post-processing: we need to set the attribute if the highlighted word is not at the head of the line, otherwise we need to eliminate the unnecessary normal enclosure.

Or we can set xml:space="preserve" on all normal text at the cost of a few extra bytes. It is not perfect but good enough.
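
To make the idea concrete, here is a minimal Python 3 sketch. It is not the script I used. It swaps the lookbehind plus re.sub callback for re.split with one capturing group, because Python's re rejects a lookbehind whose alternatives have different lengths (programming vs. beer). The names split_pieces and render_runs and the yellow highlight are assumptions for illustration; a real document needs the full WordprocessingML wrapper around these runs.

```python
import re
from xml.sax.saxutils import escape

def split_pieces(sentence, words):
    """Split into alternating normal / highlighted pieces."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))
    # With one capturing group, re.split keeps the matched words
    # at the odd indexes of the result.
    return pattern.split(sentence)

def render_runs(sentence, words):
    """Render the pieces as <w:r> runs (hypothetical helper)."""
    runs = []
    for index, piece in enumerate(split_pieces(sentence, words)):
        if not piece:
            continue  # e.g. the sentence starts with a highlighted word
        highlighted = index % 2 == 1
        props = '<w:rPr><w:highlight w:val="yellow"/></w:rPr>' if highlighted else ""
        # Only text with leading or trailing whitespace needs the attribute.
        space = ' xml:space="preserve"' if piece != piece.strip() else ""
        runs.append("<w:r>%s<w:t%s>%s</w:t></w:r>" % (props, space, escape(piece)))
    return "".join(runs)

sentence = "Kun loves programming and beer, would you buy me one beer?"
pieces = split_pieces(sentence, ["programming", "beer"])
xml = render_runs(sentence, ["programming", "beer"])
print(pieces)
print(xml)
```

On the example sentence this yields the same 7 pieces Word produces, and only the three normal pieces with leading or trailing spaces carry the attribute.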

I will talk about the CSV later.

PyAWS 0.3.0 released

Development, Web May 6th, 2008

After 6 months, PyAWS 0.3.0 is finally released. You can check out the tarball here.

I almost abandoned this project because I found the XSLT approach more appealing: it is ideal for AJAX applications and easy to integrate via simplejson on the server side. Furthermore, I joined Microsoft, moved to Canada, and had less spare time for less interesting hobby work. The last straw was the unexpected complexity of the BIG FAT refactoring.

Then recently I got an email from a PyAWS user who reported a bug: an unexpected result from the ListLookup operation. It is so good to hear that this library still benefits somebody in the world. So I picked it up, completed the refactoring and released it today. The library is still in active development: the code style stinks, the documentation sucks and, most of all, testing is lacking. I will explain that a little bit here.

I am a big fan of TDD personally, and we have respected testing troops helping to build our products at MSFT as well. However, the complexity of PyAWS is far beyond my capacity: there are tens of operations and twenty-odd response groups, and response groups may be combined, which makes it extremely difficult to cover all the paths. To make it worse, AWS is dynamic: there is no guarantee that consecutive queries return the same result. I may consider automation to facilitate the unit tests. If you have better ideas, please leave a comment here.

Django’s D-day

Development, Web April 7th, 2008

Google just released Google App Engine with a Python development environment. The environment comes loaded with WSGI and Django 0.96 “for convenience”.

I just checked the Datastore API: it is a copycat of the Django reference. Google’s engineers hacked Django’s Model to support Google’s datastore, aka BigTable. Bang! Google Accounts are also supported via the User API; no idea whether it is integrated with Django’s authentication framework though.

I am so glad that Google has made such a move. I bet the number of Django users will grow exponentially in the next couple of months. Today is Django’s D-day.