Poor man’s NAS

Gentoo November 14th, 2008

A Network Attached Storage (NAS) box has been on my wish list for quite a long time, and the Live Search Cashback program finally made it happen: a Western Digital MyBook World Edition (500GB). Here is the hardware specification:

  • ARM926EJ-Sid(wb) [41069265] revision 5 (ARMv5TEJ) 99.73 MHz
  • Memory: 32M
  • VIA Networking Velocity Family Gigabit Ethernet
  • WD5000AAVS-0 500G HD

I believe a 100MHz ARM CPU is powerful enough to drive this tiny box, but the limited memory turns it into a lame duck. The sustained file write rate (85G using lftp mirror) is approximately 3.8MB/s, so it hardly qualifies for any service beyond a file server. Now, it is time to hack.
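To put that write rate in perspective, here is a back-of-the-envelope estimate of how long the 85G mirror takes at 3.8MB/s. It is only a sketch: it assumes decimal units (1G = 1000M) and a constant rate, and the function name is made up for illustration.

```python
def transfer_hours(size_gb, rate_mb_per_s):
    """Estimate transfer time in hours, assuming 1 GB = 1000 MB and a steady rate."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


hours = transfer_hours(85, 3.8)
print(f"{hours:.1f} hours")
```

In other words, filling even a fraction of the disk over the network is an overnight job.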

Jailbreak and SSH

The first thing to do is to create a user in the web interface of the MyBook, because logging in as root with a null password is banned for security reasons. Log on with admin and 123456, create a user JOE and set up the password for later use.

Run the script discussed in the wiki, and ssh in as JOE. Now you can su to root with a blank password. 0wned!

User management

MyBook manages users in a very intricate way:

All Samba users are granted shell access, but unix password sync = yes is not set, so /etc/shadow and /var/private/smbpasswd are updated separately by a Perl script behind the web interface. The only reasonable explanation is that the minimized Samba build lacks PAM support.

All user names are capitalized. I assume this is a brute-force approach to reconcile Samba and native Linux accounts: Windows user names are case-insensitive, while Linux user names are case-sensitive.

As the passwords are hashed in /etc/shadow, it is easier to add/delete/update users via the web interface, then fine-tune the corresponding files. The user administration executables are hidden in /usr/www/nbin.

Share with Samba

The default exported directory is /share/internal/PUBLIC; its permission is set to rwsr-sr-x and its owner is www-data, YMMV. So any file or directory created there will be owned by www-data. If you are unhappy with that owner, you may add a user, e.g. joe, as discussed before, then add that user to the www-data group:

# /etc/group, YMMV
www-data:x:33:share

Remember to change the default masks in /etc/smb.conf:

create mask = 0775
directory mask = 0775
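As a rough illustration of what these masks do: Samba ANDs the permission bits it would otherwise give a new file with the create mask, so any bit cleared in the mask can never be set. The sketch below models only that AND step (it ignores force create mode and the DOS attribute mapping), and the helper name is hypothetical. It compares Samba's stock 0744 mask with the 0775 used here.

```python
import stat


def masked_mode(requested, mask):
    """Return the permission bits left after ANDing with a Samba-style mask."""
    return requested & mask


for mask in (0o744, 0o775):
    mode = masked_mode(0o777, mask)
    # stat.filemode renders the bits the way ls -l does
    print(oct(mask), stat.filemode(stat.S_IFREG | mode))
```

The point of 0775 is that the group write bit survives, which is what lets a second user in the www-data group modify the shared files.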

Package management

Though I am a big fan of Gentoo, it is a little bit paranoid to build everything from scratch on this box. A precompiled package manager like Optware makes more sense. Check out this tutorial for bootstrapping.

The essential packages for daily administration, imho, are screen and lftp.

Feature requests

Some features are still missing and itching; if you happen to know a solution or a hint, please drop me a message in the comments:

Access Anywhere: no MioNet, just SSH. If you are a perfectionist, consider porting this Delphi application to the MyBook to host it under your preferred domain.

Download Manager: a web front-end that listens for download requests from Firefox/IE plugins, then delegates them to a wget backend with cookie support. A more aggressive approach may support the Megaupload happy hour.

Rewrite WordPress and ZenPhoto for Nginx

Web October 15th, 2008

Nginx also supports URL rewriting. It is not compatible with Apache’s mod_rewrite, but it is more intuitive and more powerful imho. The only problem is that most applications, WordPress and ZenPhoto in this specific case, ship only the mod_rewrite snippet and/or update the .htaccess for your convenience.

Thanks to the Slicehost community, the port of the mod_rewrite rules covers WordPress and SuperCache perfectly. Here are some minor modifications to make it more generally usable:

# the blog dir, aka where index.php is
set $blog_dir '';
# the wordpress dir where all wp-* stays
set $wordpress_dir '/wordpress';
include wordpress.rewrite;

In nginx.conf, define wordpress_dir and blog_dir; these two variables are equivalent to the WordPress address (URL) and Blog address (URL) settings with the host part stripped. Then we can replace the hard-coded /blog path with $wordpress_dir or $blog_dir:

2d1
<
26c25
< set $supercache_file /blog/wp-content/cache/supercache/$http_host/$1index.html;
---
> set $supercache_file $wordpress_dir/wp-content/cache/supercache/$http_host/$1index.html;
36c35
< rewrite . /blog/index.php last;
---
> rewrite ^(.*)$ $blog_dir/index.php?q=$1 last;
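To make the effect of the two changed lines concrete, here is a small Python sketch that mimics the string substitution nginx performs. It is an illustration, not how nginx works internally: the function names are invented, and the captured argument stands in for nginx's $1, whose exact value depends on a regex elsewhere in the rewrite file that is not shown here.

```python
import re


def supercache_file(wordpress_dir, http_host, captured):
    """Mimic: set $supercache_file $wordpress_dir/wp-content/cache/supercache/$http_host/$1index.html;"""
    return f"{wordpress_dir}/wp-content/cache/supercache/{http_host}/{captured}index.html"


def fallback_rewrite(blog_dir, uri):
    """Mimic: rewrite ^(.*)$ $blog_dir/index.php?q=$1 last;"""
    match = re.match(r"^(.*)$", uri)
    return f"{blog_dir}/index.php?q={match.group(1)}"


# Hypothetical request, using the values set in nginx.conf above.
print(supercache_file("/wordpress", "kunxi.org", "2008/10/hello/"))
print(fallback_rewrite("", "/2008/10/hello/"))
```

With an empty blog_dir the fallback lands on /index.php at the site root, while the cache lookup still points into the /wordpress tree, which is exactly the split the two variables are meant to express.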

Here is my zenphoto.rewrite; sivel’s version seems more concise. Either of these should work.

Parse HTML file with BeautifulSoup

Development October 12th, 2008

In the last post, a regular expression was used to fetch one specific piece of information. To access structured information, BeautifulSoup is preferred for its simplicity and convenient API:

  • You may override fromEncoding in the constructor, which is very useful for non-Roman, non-standard web pages.
  • Versatile find/findAll on tags and attributes.
  • Developer-friendly syntactic sugar: the Tag implements the interfaces of string, list, dict and callable, so there are many ways to access the data as you wish. The drawback of this approach is that a typo is only caught at run time instead of compile time.
  • Easy to deploy: only one BeautifulSoup.py file.

Something I don’t like:

  • No XPath support, so more effort is needed to port from JavaScript.
  • The API does not support streams or file objects; laziness is always cherished for pipelining.
  • Why BeautifulSoup? I have mistyped it as soap more than ten times.

Here is the home-brewed script that wades through a Dvbbs thread to find the corresponding messages: elevator.py

WYS is not always WYG in python.re

Development October 2nd, 2008

After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming:

This side project is quite trivial: fetch the HTML content, search for the keywords in the thread, and build a table of contents of links for navigation. The only intriguing highlight that makes this post worth your five minutes is that the page is in Chinese, encoded in GB2312.

Long story short, I am trying to find the total number of posts in the thread using this regular expression:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)')

The first catch is that I have to declare the encoding of the source code, as the Python interpreter complains:

SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

OK, I will stick to UTF-8, so add this declaration in the second line:
# -*- coding: utf-8 -*-

It does not work, and the dumped content of the page is a total mess. Oops, we forgot to decode the content to Unicode; use codecs to wrap the urlopen handle:

gb = codecs.lookup('gb2312')
# load the page
content = gb.streamreader(urllib.urlopen(url)).read()
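The snippet above is Python 2. For readers on Python 3, the same idea can be sketched as follows; an in-memory byte stream stands in for the urlopen handle so the example runs offline, and the sample markup is invented.

```python
import codecs
import io

# Invented sample page, encoded the way the forum serves it.
raw_page = '<b class="page">总数 42</b>'.encode("gb2312")

# Stand-in for the HTTP response object: any binary file-like object works.
handle = io.BytesIO(raw_page)

# Wrap the byte stream so that read() returns decoded text instead of bytes.
reader = codecs.getreader("gb2312")(handle)
content = reader.read()

print(type(content).__name__, content)
```

The key point carries over unchanged: the page must be decoded with the codec it was served in before any text-level matching makes sense.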

And don’t forget to add either the Unicode prefix or the re.UNICODE flag to the pattern:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE)

Still no luck, but it works in the Python console with the same pattern and faked data, and it also works if we change the pattern a little bit:

pattern = re.compile('(?<=<b class="page">.{2} )(?P<total>\d+)', re.UNICODE)

Looks like the troublemaker is the non-Latin characters: 总数. Let’s play a little bit in the pdb console:

(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'

And it finally works with the hard-coded Unicode escapes (note the u prefix, without which Python 2 does not interpret \u):

pattern = re.compile(u'(?<=<b class="page">\u603b\u6570 )(?P<total>\d+)', re.UNICODE)

We can use the decode method to avoid the ugly Unicode escapes for better readability:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf-8'), re.UNICODE)

Note that the codec passed to decode MUST be consistent with the encoding declaration.

Some speculations based upon the observation:

  • re.UNICODE does not enforce Unicode mode; it just redefines the escaped character classes like \b, \w, etc.
  • A pattern and a string that are both Unicode implicitly invoke Unicode mode. That explains why some patterns work in the Python console only: there, both of them are encoded in UTF-8, so re really runs in 8-bit mode!
  • The Python interpreter will not decode a plain string literal even though the encoding is declared.

Please leave your insights in the comments. Thanks.
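One way to probe these speculations is to compare matching on decoded text with matching on raw UTF-8 bytes. The sketch below uses Python 3, where the two string types cannot be mixed by accident, so it illustrates the underlying byte-versus-character mismatch rather than reproducing the Python 2 behaviour exactly; the sample markup is invented.

```python
import re

text = '<b class="page">总数 42</b>'
raw = text.encode("utf-8")

# On decoded text, 总数 is two characters, so a two-unit wildcard lines up.
two_chars = re.compile(r'(?<=<b class="page">.{2} )(?P<total>\d+)')
text_total = two_chars.search(text).group("total")

# On UTF-8 bytes, the same two characters occupy six bytes,
# so a two-unit wildcard no longer lines up but a six-unit one does.
two_bytes = re.compile(rb'(?<=<b class="page">.{2} )(?P<total>\d+)')
six_bytes = re.compile(rb'(?<=<b class="page">.{6} )(?P<total>\d+)')
miss = two_bytes.search(raw)
byte_total = six_bytes.search(raw).group("total")

print(text_total, miss, byte_total)
```

This is the same effect that bit the original script: a pattern expressed in bytes and a subject expressed in characters disagree on how long 总数 is.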

UPDATE:
Thanks for all the comments first. It seems that I made a typo when testing the pattern with the Unicode prefix. Here are the test cases:

patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)</b>', re.UNICODE),
    ]

print [ pattern.search(s) for pattern in patterns ]

The output is:

[<_sre.SRE_Match object at 0xb7c18260>, <_sre.SRE_Match object at 0xb7c18360>, <_sre.SRE_Match object at 0xb7c183a0>, None]

Download test.py

HOWTO: Serve virtual host with Nginx

Web September 6th, 2008

Given the limited memory budget and my plan to host multiple websites on the VPS, I decided to go with a less versatile but lightweight Apache alternative: Nginx, made by the polar bear.

There is no RPM in the repositories I have enabled, so let’s fall back to the old-school way:

# remove the blocking glibc-dummy-centos-4 package, then get the toolchain:
yum remove glibc-dummy-centos-4
yum -y install gcc openssl-devel
# Now it is time to build the nginx:
./configure --prefix=/opt/nginx --with-http_ssl_module --with-http_stub_status_module
make && sudo make install

Then get PHP with FastCGI support, and lighttpd-fastcgi for the FastCGI loader:

yum install php-cli php-mysql lighttpd-fastcgi

Here is the nginx.conf for that server:

user nobody;
worker_processes 2;
pid logs/nginx.pid;
error_log logs/error.log;

events {
    worker_connections 2048;
    use epoll;
}

http {
    include mime.types;
    include fastcgi_params;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] $request '
                    '"$status" $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log logs/access.log main;
    client_header_timeout 3m;
    client_body_timeout 3m;
    send_timeout 3m;

    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;

    gzip on;
    gzip_min_length 1100;
    gzip_buffers 4 8k;
    gzip_types text/plain;
    gzip_static on;

    output_buffers 1 32k;
    postpone_output 1460;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 75 20;

    server {
        #REVIEW: how to redirect https? using re?
        server_name www.kunxi.org;
        rewrite ^(.*) http://kunxi.org$1 permanent;
    }

    server {
        # kunxi's gallery
        listen 80;
        server_name gallery.kunxi.org;
        root /home/webadmin/$host;
        access_log logs/$host.log main;

        location / {
            root /home/webadmin/$host;
            index index.html index.htm index.php;
            include zenphoto.rewrite;
        }

        # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
        }

        # deny access to .htaccess files.
        location ~ /\.ht {
            deny all;
        }
    }

    server {
        # kunxi's sites
        listen 80;
        server_name kunxi.org *.kunxi.org;
        root /home/webadmin/$host;

        error_page 404 $document_root/404.html;
        error_page 500 502 503 504 $document_root/50x.html;
        access_log logs/$host.log main;

        location / {
            root /home/webadmin/$host;
            index index.html index.htm index.php;

            # the blog dir, aka where index.php is
            set $blog_dir '';
            # the wordpress dir where all wp-* stays
            set $wordpress_dir '/wordpress';

            include wordpress.rewrite;
        }

        # rewrite the /files/
        #
        location /files/ {
            alias /home/webadmin/static.kunxi.org/;
        }

        # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
        #
        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
        }

        # deny access to .htaccess files.
        #
        location ~ /\.ht {
            deny all;
        }
    }
}

Some highlights of the configuration:
Rewrite www.kunxi.org to kunxi.org, yes, we support no-www!

server {
    #REVIEW: how to redirect https? using re?
    server_name www.kunxi.org;
    rewrite ^(.*) http://kunxi.org$1 permanent;
}
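What this block does can be mimicked in a few lines of Python. This is only a sketch of the observable behaviour: the function name is invented, and it ignores the query string, which nginx carries over on its own.

```python
def no_www_redirect(uri, target="http://kunxi.org"):
    """Mimic: rewrite ^(.*) http://kunxi.org$1 permanent;"""
    # "permanent" makes nginx answer with a 301 and a Location header.
    return 301, f"{target}{uri}"


status, location = no_www_redirect("/2008/09/nginx/")
print(status, location)
```

Every path on www.kunxi.org therefore maps to the same path on the bare domain.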

And this wildcard covers all sub-domains powered by PHP:

server_name kunxi.org *.kunxi.org;
root /home/webadmin/$host;
… ….
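The interplay between the exact names and the wildcard can be sketched in Python. This models only the two rules that matter for this configuration (exact names are tried before wildcards, and $host picks the document root); the labels and function names are invented for illustration.

```python
from pathlib import PurePosixPath

WEB_ROOT = PurePosixPath("/home/webadmin")

# Exact names from the three server blocks, tried before any wildcard.
EXACT_NAMES = {
    "www.kunxi.org": "redirect",
    "gallery.kunxi.org": "gallery",
    "kunxi.org": "sites",
}


def pick_server(host):
    """Return a label for the server block that would handle this Host header."""
    if host in EXACT_NAMES:
        return EXACT_NAMES[host]
    if host.endswith(".kunxi.org"):  # the *.kunxi.org wildcard
        return "sites"
    return None  # nginx would fall back to its default server here


def document_root(host):
    """Mimic: root /home/webadmin/$host;"""
    return str(WEB_ROOT / host)


for host in ("gallery.kunxi.org", "blog.kunxi.org"):
    print(host, pick_server(host), document_root(host))
```

Adding a new PHP site is then just a matter of creating a directory named after the sub-domain under /home/webadmin.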

Home-brewed nginx and fastcgi init scripts make it work after a reboot:

chkconfig --add nginx
chkconfig --add fcgi-php
service nginx start
service fcgi-php start

Tips and Traps:
Nginx supports zero-downtime reloads, so if the new nginx.conf is invalid, the server keeps running with the old configuration and swallows the complaint. While debugging rewrite rules, make sure to stop the nginx service and start it again so that errors surface.

fcgi-php seems to have a problem resolving localhost, so I use 127.0.0.1 instead.

The rewrite rules for WordPress and ZenPhoto are explained here and here. (TODO).
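Another way to avoid being bitten by a silently rejected configuration is to test it before touching the running server. nginx has a -t switch that parses the configuration and reports errors; the sketch below wraps it. The binary path assumes the --prefix=/opt/nginx build from earlier, and the helper names are hypothetical.

```python
import subprocess

NGINX_BIN = "/opt/nginx/sbin/nginx"  # assumption: follows from --prefix=/opt/nginx


def config_test_command(conf_path=None, nginx_bin=NGINX_BIN):
    """Build the command line for a syntax check of the configuration."""
    cmd = [nginx_bin, "-t"]
    if conf_path:
        cmd += ["-c", conf_path]  # check a specific file instead of the default
    return cmd


def config_is_valid(conf_path=None, nginx_bin=NGINX_BIN):
    """Run the check; return (ok, diagnostics). nginx writes its report to stderr."""
    result = subprocess.run(
        config_test_command(conf_path, nginx_bin),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

Calling config_is_valid() before restarting and printing the diagnostics when it returns False would show the complaint that a reload otherwise hides.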