One of the excuses that keeps me updating this blog is some exhausting logistics work I have had to tackle over the last couple of weeks. Long story short, the requirement is to load an Excel file, filter it with a lookup table, then retrieve extra information from a line-based text file and render a docx file with some words highlighted. Let’s decompose the problem into tasks, one by one:

Retrieve extra information from a line-based text file
A typical regular expression match example.
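
For illustration only, here is a minimal sketch of that kind of match. The record layout (an item id, a pipe, then a free-text note) and the names LINE_PATTERN and load_extra_info are made up for the example; the real file format is not described in this post.

```python
import re

# Hypothetical layout, one record per line: "<item id> | <free-text note>"
LINE_PATTERN = re.compile(r"^(?P<item_id>\w+)\s*\|\s*(?P<note>.*)$")

def load_extra_info(lines):
    """Map item id -> note for every line that matches the assumed layout."""
    info = {}
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\n"))
        if match:  # lines that do not fit the layout are skipped
            info[match.group("item_id")] = match.group("note")
    return info
```

Any iterable of lines works here, so an open file object can be passed directly.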

Render the docx file with some words highlighted
This task seems easy: as you know, a docx file is ultimately a zipped Office Open XML package, aka text. We can even replace all the words in one shot, as this recipe suggests. Assume the example sentence is:
Kun loves programming and beer, would you buy me one beer?
The to-be highlighted words are programming and beer.
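
Since the docx file is just a zip archive, getting at the text and putting it back can be sketched with the standard zipfile module. This is a rough sketch, not the script used for the job: the helper names are made up, and it assumes the main document part lives at word/document.xml, which is where Word normally puts it.

```python
import zipfile

MAIN_PART = "word/document.xml"  # assumed location of the main document part

def read_document_xml(docx_file):
    """Return the main document part of a .docx as text."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(MAIN_PART).decode("utf-8")

def write_document_xml(source, target, new_xml):
    """Copy a .docx to target, swapping in new_xml as the main document part."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == MAIN_PART:
                data = new_xml.encode("utf-8")
            else:
                data = src.read(item.filename)
            dst.writestr(item, data)
```

Both helpers accept a path or a file-like object, so the rewritten XML can go to a new file without touching the original.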

Microsoft Office Word 2007 breaks the sentence into 7 pieces: Kun loves_, programming, _and_, beer, , would you buy me one_, beer and ?; _ stands for a leading or trailing space. Each piece is rendered with either the normal style or the highlighted style. That is quite messy.
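
The same seven pieces can be reproduced with re.split, which makes the leading and trailing spaces easy to see. This is only a sketch to illustrate the split; split_into_pieces is a made-up helper name.

```python
import re

def split_into_pieces(sentence, words):
    """Split a sentence into alternating normal and highlighted pieces."""
    # The capturing group makes re.split keep the highlighted words in the result.
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))
    return [piece for piece in pattern.split(sentence) if piece]

pieces = split_into_pieces(
    "Kun loves programming and beer, would you buy me one beer?",
    ["programming", "beer"],
)
# ['Kun loves ', 'programming', ' and ', 'beer',
#  ', would you buy me one ', 'beer', '?']
```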

WordML may support embedded styles somewhere in the bible, but I am going to live with that since it is crunch time and we can cheat: have you noticed that our highlighted words are always followed by normal text? So we can put the whole sentence in a normal-style enclosure; whenever the RegEx hits a match, we break the enclosure, insert the highlighted word with the highlighted style, then start a new normal enclosure. Brilliant!
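
The cheat can be sketched roughly like this. The run markup below is an assumption for the example (plain w:r/w:t runs, and w:highlight for the highlighted style); the real document may carry more run properties. It also assumes the highlighted words contain no XML special characters.

```python
import re
from xml.sax.saxutils import escape

# Assumed markup for the two kinds of enclosure.
NORMAL_OPEN = "<w:r><w:t>"
NORMAL_CLOSE = "</w:t></w:r>"
HIGHLIGHT = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def highlight_naive(sentence, words):
    """Wrap the sentence in a normal run and break it at every highlighted word."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))

    def break_enclosure(match):
        # Close the normal run, emit the highlighted run, reopen a normal run.
        return NORMAL_CLOSE + HIGHLIGHT % match.group(1) + NORMAL_OPEN

    return NORMAL_OPEN + pattern.sub(break_enclosure, escape(sentence)) + NORMAL_CLOSE
```

Note that this version does nothing special about the spaces at the edges of each normal run.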

Hold on, the text is rendered in Word 2007 as:
Kun lovesprogrammingandbeer, would you buy me onebeer?
According to the WordML spec, or the scream of Jeni:

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating specifying that their whitespace is significant via the xml:space element.

So the formal solution for this quiz is to add the xml:space="preserve" attribute whenever the normal text has leading and/or trailing space(s). In our case, Kun loves_, _and_ and , would you buy me one_ need that attribute but ? does not. The versatile re.sub also accepts a callback function instead of a string for more complicated substitutions like this. As long as the highlighted word is followed by a space, the succeeding normal text needs to preserve that space, so we can build the pattern like this:

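import re
pattern = re.compile(r"\b(%s)\b(\s)?" % "|".join(map(re.escape, the_list_of_to_be_highlighted_words)))

In the callback function, group(1) is the highlighted word, and we set the attribute on the new normal enclosure if group(2), the space right after the word, is matched. Some corner cases need more post-processing: if the highlighted word is not at the head of the line, the normal enclosure before it ends with a space and needs the attribute as well; otherwise that enclosure is empty and we need to eliminate it.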
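
Putting the callback and the post-processing together might look like the sketch below. It is not the original script: the run markup is assumed (plain w:r/w:t runs and w:highlight), the helper names are made up, and the highlighted words are assumed to contain no XML special characters.

```python
import re
from xml.sax.saxutils import escape

# Assumed markup for the enclosures.
NORMAL_OPEN = "<w:r><w:t>"
PRESERVE_OPEN = '<w:r><w:t xml:space="preserve">'
NORMAL_CLOSE = "</w:t></w:r>"
HIGHLIGHT = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def highlight(sentence, words):
    pattern = re.compile(r"\b(%s)\b(\s)?" % "|".join(map(re.escape, words)))

    def break_enclosure(match):
        # group(2) is the space right after the highlighted word, if any;
        # when present, the next normal run must preserve it.
        space = match.group(2) or ""
        opener = PRESERVE_OPEN if space else NORMAL_OPEN
        return NORMAL_CLOSE + HIGHLIGHT % match.group(1) + opener + space

    xml = NORMAL_OPEN + pattern.sub(break_enclosure, escape(sentence)) + NORMAL_CLOSE
    # Post-process 1: drop empty normal runs (highlighted word at the head or tail).
    xml = xml.replace(NORMAL_OPEN + NORMAL_CLOSE, "")
    # Post-process 2: normal runs that end with a space need the attribute too.
    xml = re.sub(r"<w:r><w:t>([^<]*\s)</w:t></w:r>",
                 PRESERVE_OPEN + r"\1" + NORMAL_CLOSE, xml)
    return xml
```

For the example sentence, the runs holding Kun loves_, _and_ and , would you buy me one_ come out with the attribute, and the final ? run does not.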

Or we can set xml:space="preserve" on all normal text, at the cost of a few extra bytes per enclosure. It is not perfect but good enough.
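
The lazy variant is close to a one-liner. This sketch assumes every text element in the generated XML is written exactly as a bare <w:t> with no attributes yet; preserve_all is a made-up name.

```python
def preserve_all(document_xml):
    """Mark every bare <w:t> element as whitespace-preserving."""
    return document_xml.replace("<w:t>", '<w:t xml:space="preserve">')
```

Each enclosure grows by the length of the added attribute, which is the overhead mentioned above.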

I will talk about the CSV later.


2 Comments to “When RegEx meets WordML”

  1. Shawn Wheatley | May 23rd, 2008 at 12:33 pm

    Sounds interesting. We tackle this sort of thing a lot at work, but mostly outputting SpreadsheetML (Office 2003.) Are you doing the Excel filtering via COM/.NET interop or will you be filtering outside of Excel in a similar manner? Any chance you’ll be able to put together some sample files and the script (or a script using the same concepts?)

  2. bookstack | May 24th, 2008 at 9:27 pm

    Oh, we won’t go that far.

    No intent to mess with the .xlsx file format; CSV is preferred here, as it can be easily tracked by a version control system, parsed by the csv module of Python and edited in Microsoft Office Excel.
