Rewrite WordPress and ZenPhoto for Nginx

Web October 15th, 2008

Nginx also supports URL rewriting. Its syntax is not compatible with Apache’s mod_rewrite, but imho it is more intuitive and more powerful. The only problem is that most applications, WordPress and ZenPhoto in this case, ship only the mod_rewrite snippet and/or update .htaccess for your convenience.

Thanks to the Slicehost community, there is a port of the mod_rewrite rules that covers WordPress and SuperCache perfectly. Here are some minor modifications to make it more generally usable:

# the blog dir, aka where index.php is
set $blog_dir '';
# the wordpress dir where all wp-* stays
set $wordpress_dir '/wordpress';
include wordpress.rewrite;

In nginx.conf, define wordpress_dir and blog_dir. These two variables correspond to the WordPress address (URL) and Blog address (URL) settings, stripped of the host part. Then we can replace the hard-coded /blog path with $wordpress_dir or $blog_dir:

2d1
<
26c25
< set $supercache_file /blog/wp-content/cache/supercache/$http_host/$1index.html;
---
> set $supercache_file $wordpress_dir/wp-content/cache/supercache/$http_host/$1index.html;
36c35
< rewrite . /blog/index.php last;
---
> rewrite ^(.*)$ $blog_dir/index.php?q=$1 last;
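
For anyone who prefers to script the path substitution rather than edit by hand, here is a rough Python 3 sketch. The helper name parameterize and the default /blog prefix are my own assumptions for illustration. It only swaps the path prefix; it does not reproduce the other change in the diff (the ^(.*)$ pattern and the ?q=$1 argument).

```python
def parameterize(rules, hard_coded="/blog"):
    """Swap a hard-coded path prefix for the nginx variables (hypothetical helper)."""
    converted = []
    for line in rules.splitlines():
        # Paths under wp-content live in the WordPress install directory.
        line = line.replace(hard_coded + "/wp-content", "$wordpress_dir/wp-content")
        # index.php is served from the blog directory.
        line = line.replace(hard_coded + "/index.php", "$blog_dir/index.php")
        converted.append(line)
    return "\n".join(converted)


# The two hard-coded lines taken from the diff above.
original = (
    "set $supercache_file /blog/wp-content/cache/supercache/$http_host/$1index.html;\n"
    "rewrite . /blog/index.php last;"
)
print(parameterize(original))
```

The result still needs a manual review before it goes anywhere near a live nginx configuration.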

Here is my zenphoto.rewrite; sivel’s version seems more concise. Either of these should work.

Parse HTML file with BeautifulSoup

Development October 12th, 2008

In the last post, a regular expression was used to fetch specific information. To access structured information, BeautifulSoup is preferred for its simplicity and convenient API:

  • You may override fromEncoding in the constructor, which is very useful for non-Latin, non-standard web pages.
  • Versatile find/findAll on tags and attributes.
  • Developer-friendly syntactic sugar: Tag implements the string, list, dict, and callable interfaces, so there are many ways to access the data as you wish. The drawback of this approach is that typos are caught only at run time rather than at compile time.
  • Easy to deploy: only one BeautifulSoup.py file.

Something I don’t like:

  • No XPath support, so more effort is needed to port from JavaScript.
  • The API does not support streams or file objects. Laziness is always cherished for pipelining.
  • Why BeautifulSoup? I have mistyped it as “soap” more than ten times.

Here is the home-brewed script to wade through a Dvbbs thread and find the corresponding messages: elevator.py

WYS is not always WYG in python.re

Development October 2nd, 2008

After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming:

This side project is quite trivial: fetch the HTML content, search for the keywords in the thread, and build a linked table of contents for navigation. The only intriguing highlight that makes this post worth your 5 minutes is that the page is in Chinese and encoded in GB2312.

Long story short, I am trying to find the total number of posts in the thread using this regular expression:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)')

The first catch is that I have to declare the encoding of the source file, because the Python interpreter complains:

SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

OK, I will stick to UTF-8, so add this declaration in the second line:
# -*- coding: utf-8 -*-

It does not work, and the dumped content of the page is totally garbled. Oops, we forgot to decode the content to Unicode. Use codecs to wrap the urlopen handle:

import codecs
import urllib

gb = codecs.lookup('gb2312')
# load the page and decode it from GB2312 to Unicode
content = gb.streamreader(urllib.urlopen(url)).read()
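
For readers on Python 3, here is a minimal sketch of the same decoding step. The page fragment and the number 128 are made up, and io.BytesIO is only a stand-in for the urlopen handle; codecs.getreader is the standard-library shortcut for fetching a codec's stream reader.

```python
import codecs
import io

# Stand-in for the urlopen() handle: GB2312-encoded bytes of a made-up fragment.
raw_page = '<b class="page">总数 128</b>'.encode("gb2312")
handle = io.BytesIO(raw_page)

# codecs.getreader returns the StreamReader class for the codec;
# wrapping the byte stream with it yields decoded text on read().
reader = codecs.getreader("gb2312")(handle)
content = reader.read()

print(type(content).__name__, content)
```

In Python 3 the result is already a str, so there is no separate Unicode type to worry about afterwards.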

And don’t forget to add either the Unicode prefix or the re.UNICODE flag to the pattern.

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE)

Still no luck, but the same pattern works in the Python console with faked data, and it also works in the script if we change it a little:

pattern = re.compile('(?<=<b class="page">.{2} )(?P<total>\d+)', re.UNICODE)

Looks like the trouble maker is the non-Latin characters: 总数. Let’s play a little bit in the pdb console:

(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'
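
The same inspection can be sketched in Python 3, where the text and its encoded bytes are separate types. This is only an illustration of the byte counts involved, not part of elevator.py.

```python
text = "总数"

# The same two characters have different byte sequences in each encoding.
utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb2312")

# Code points of the two characters, as hex strings.
code_points = [hex(ord(ch)) for ch in text]

print(len(text), len(utf8_bytes), len(gb_bytes))
print(utf8_bytes)
print(code_points)
```

Two characters become six bytes in UTF-8 and four in GB2312, so a byte-oriented match and a character-oriented match do not see the same thing.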

And it works finally with the hard-coded Unicode character:

pattern = re.compile(u'(?<=<b class="page">\u603b\u6570 )(?P<total>\d+)', re.UNICODE)

We can use the decode method to avoid the ugly Unicode string for better readability:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf-8'), re.UNICODE)

Note that the codec passed to decode MUST be consistent with the source encoding declaration.

Some speculations based upon the observation:

  • re.UNICODE does not enforce Unicode mode; it only redefines escape sequences such as \b and \w.
  • A Unicode pattern matched against a Unicode string implicitly invokes Unicode mode. That explains why some patterns work only in the Python console: there both the pattern and the string are UTF-8 byte strings, so re really runs in 8-bit mode!
  • The Python interpreter does not convert a plain string literal to Unicode even when the source encoding is declared.
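
To make the second point concrete, here is a rough Python 3 reproduction. Python 3 will not mix byte patterns with text, so the sketch views the UTF-8 bytes of 总数 as six Latin-1 characters, which I assume is close to how Python 2 compares a byte-string pattern against Unicode text. The page fragment is made up.

```python
import re

# Hypothetical page fragment, already decoded to Unicode text.
content = '<b class="page">总数 128</b>'

# A Python 2 byte-string literal holds the UTF-8 bytes of 总数. Viewing each
# byte as one character (Latin-1) roughly mimics how those six bytes are
# compared against Unicode text.
byte_view = "总数".encode("utf-8").decode("latin-1")

prefix = '(?<=<b class="page">'
suffix = r' )(?P<total>\d+)'

byte_pattern = re.compile(prefix + re.escape(byte_view) + suffix)
unicode_pattern = re.compile(prefix + "总数" + suffix)
wildcard_pattern = re.compile(prefix + ".{2}" + suffix)

for name, pattern in [("bytes-as-chars", byte_pattern),
                      ("unicode", unicode_pattern),
                      ("wildcard", wildcard_pattern)]:
    match = pattern.search(content)
    print(name, match.group("total") if match else None)
```

Only the first search comes back empty: six byte-sized characters never line up with two Unicode characters, while the .{2} wildcard does not care what the two characters are.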

Please leave your insight in the comments. Thanks

UPDATE:
First, thanks for all the comments. It seems that I had a typo when testing the pattern with the Unicode prefix. Here are the test cases:

patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    ]

print [ pattern.search(s) for pattern in patterns ]

The output is:

[<_sre.SRE_Match object at 0xb7c18260>, <_sre.SRE_Match object at 0xb7c18360>, <_sre.SRE_Match object at 0xb7c183a0>, None]

Download test.py