When RegEx meets WordML

Development May 23rd, 2008

One of the excuses to keep updating this blog is some exhausting logistics work I have had to tackle over the last couple of weeks. Long story short, the requirement is to load an Excel file, filter it with a lookup table, retrieve extra information from a line-based text file, and render a docx file with some words highlighted. Let’s decompose the problem into tasks, one by one:

Retrieve extra information from a line-based text file
A typical regular expression match example.
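
A minimal sketch of that kind of lookup, for illustration only. The real file layout is not described here, so the "<id>: <text>" line format, the pattern and the function name are all assumptions:

```python
import re

# Hypothetical line format: "<id>: <free text>". Adjust the pattern to the
# real file, whose layout is not given in this post.
LINE_PATTERN = re.compile(r"^(?P<key>\w+):\s*(?P<info>.*)$")

def load_extra_info(lines):
    """Map each key to the text that follows it, skipping lines that do not match."""
    info = {}
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\n"))
        if match:
            info[match.group("key")] = match.group("info")
    return info

sample = ["A001: likes programming\n", "not a record\n", "B002: likes beer\n"]
extra = load_extra_info(sample)
```

Any iterable of lines works, so an open file object can be passed in place of the sample list.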

Render the docx file with some words highlighted
This task seems easy: as you know, a docx file is ultimately a zipped Office Open XML package, aka text. We can even replace all the words in one shot, as this recipe suggests. Assume the example sentence is:
Kun loves programming and beer, would you buy me one beer?
The words to be highlighted are programming and beer.
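
Because a docx file is a zip package, its XML can be pulled out with the standard library alone. The sketch below assumes the main part sits at word/document.xml, which is where Word conventionally writes it; the helper name and the in-memory stand-in package are made up for illustration:

```python
import io
import zipfile

def read_document_xml(docx_file):
    """Return the main document part of a .docx package as text.

    Assumes the part is stored at word/document.xml (Word's usual location).
    """
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml").decode("utf-8")

# Build a tiny stand-in package in memory so the sketch runs without a real
# file. The XML here is only a fragment, not a valid Word document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", "<w:t>Kun loves programming and beer</w:t>")
buffer.seek(0)

xml_text = read_document_xml(buffer)
```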

Microsoft Office Word 2007 breaks the sentence into 7 pieces: Kun loves_, programming, _and_, beer, , would you buy me one_, beer and ?; _ stands for a leading or trailing space. Each piece is rendered with either the normal style or the highlighted style. That is quite messy.

WordML may support embedded styles somewhere in the bible, but I am going to live with that since it is crunch time, and we can cheat: have you noticed that our highlighted words are always followed by normal text? So we can put the whole sentence in a normal-style enclosure; whenever the RegEx hits a match, we break the enclosure, insert the highlighted word with the highlighted style, then start a new normal enclosure. Brilliant!
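
A rough sketch of this trick with re.sub and a callback. The run markup (w:r, w:t and a yellow w:highlight) is an assumption about what the template uses, the function name is made up, and the text is assumed to be XML-escaped already:

```python
import re

# Assumed run markup; the real template's run properties may differ.
NORMAL_OPEN = "<w:r><w:t>"
NORMAL_CLOSE = "</w:t></w:r>"
HIGHLIGHT_RUN = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def highlight_naive(sentence, words):
    """Wrap the sentence in one normal run and split it at every highlighted word."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))

    def split_run(match):
        # Close the normal run, emit the highlighted run, reopen a normal run.
        return NORMAL_CLOSE + HIGHLIGHT_RUN % match.group(1) + NORMAL_OPEN

    return NORMAL_OPEN + pattern.sub(split_run, sentence) + NORMAL_CLOSE

markup = highlight_naive(
    "Kun loves programming and beer, would you buy me one beer?",
    ["programming", "beer"],
)
```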

Hold on, the text is rendered in Word 2007 as:
Kun lovesprogrammingandbeer, would you buy me onebeer?
According to the WordML spec, or the scream of Jeni:

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating specifying that their whitespace is significant via the xml:space element.

So the formal solution to this quiz is to add the xml:space="preserve" attribute whenever the normal text has leading and/or trailing space(s). In our case, Kun loves_, _and_ and , would you buy me one_ need that attribute but ? does not. The versatile re.sub also accepts a callback function instead of a string for more complicated substitutions like this. As long as a highlighted word is followed by a space, the following normal text needs to preserve that space, so we can build the pattern like this:

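import re
pattern = re.compile(r"\b(%s)\b(\s?)" % "|".join(map(re.escape, the_list_of_to_be_highlighted_words)))

In the callback function, group(1) is the highlighted word and group(2) is the whitespace that follows it, if any; we set the attribute on the next normal text when group(2) is not empty. (The word is captured with an ordinary group because Python's re module rejects a look-behind whose alternatives differ in length.) Some corner cases need more post-processing: we also need to set the attribute on the preceding normal text when the highlighted word is not at the head of the line; when it is, we need to eliminate the unnecessary empty normal enclosure.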
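
One way to cover those corner cases without a separate post-processing pass is sketched below. It is a variation, not the re.sub callback described above: it walks the matches with finditer and builds each normal run from the text between them. The run markup and the helper names are assumptions, and the text is assumed to be XML-escaped already:

```python
import re

# Assumed markup for a highlighted run; the real template may differ.
HIGHLIGHT_RUN = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def normal_run(text):
    """Emit a normal run, adding xml:space only when the edges carry whitespace."""
    if not text:
        return ""  # drop the empty enclosure, e.g. before a word at the head of the line
    attr = ' xml:space="preserve"' if text != text.strip() else ""
    return "<w:r><w:t%s>%s</w:t></w:r>" % (attr, text)

def render(sentence, words):
    """Alternate normal and highlighted runs for one sentence."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))
    pieces = []
    last_end = 0
    for match in pattern.finditer(sentence):
        pieces.append(normal_run(sentence[last_end:match.start()]))
        pieces.append(HIGHLIGHT_RUN % match.group(1))
        last_end = match.end()
    pieces.append(normal_run(sentence[last_end:]))
    return "".join(pieces)

rendered = render(
    "Kun loves programming and beer, would you buy me one beer?",
    ["programming", "beer"],
)
```

With the example sentence, the three normal pieces that carry edge spaces get the attribute and the final ? does not.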

Or we can set xml:space="preserve" on all normal text, at the cost of a few extra bytes. It is not perfect, but good enough.

I will talk about the CSV later.

In memory of the victims of China earthquake

Misc May 18th, 2008

28,881 victims (confirmed on May 16) were killed in the magnitude 7.9 8.2 earthquake that struck Sichuan province, China, on May 12. Another 150,000 people were injured and are awaiting medical treatment and water.

This blog has been painted black as a vigil for the victims of the disaster.

PyAWS 0.3.0 released

Development, Web May 6th, 2008

After 6 months, PyAWS 0.3.0 is finally released. You can check out the tarball here.

I almost abandoned this project as I found the XSLT approach more appealing: it is ideal for AJAX applications and easy to integrate via simplejson on the server side. Furthermore, I joined Microsoft, moved to Canada, and had less spare time for hobby work I was less interested in. The last straw was the unexpected complexity of the BIG FAT refactoring.

Then, recently, I got an email from a PyAWS user reporting a bug: an unexpected result from the ListLookup operation. It is so good to hear that this library still benefits somebody in the world. So I picked it up, completed the refactoring and released it today. The library is still in active development: the code style stinks, the documentation sucks and, most of all, testing is lacking. I will explain that last point a little here.

I am a big fan of TDD personally, and we have respected testing troops who help build our products at MSFT as well. However, the complexity of PyAWS is far beyond my capacity: there are tens of operations and twenty-odd response groups, and response groups may be combined, which makes it extremely difficult to cover all the paths. To make it worse, AWS is dynamic: there is no guarantee that consecutive queries will return the same result. I may consider automation to facilitate the unit tests. If you have better ideas, please leave a comment here.