Parse HTML file with BeautifulSoup
Development, Python October 12th, 2008
In the last post, a regular expression was used to fetch specific information. To access structured information, BeautifulSoup is preferred for its simplicity and convenient API:
- You may override the detected encoding with fromEncoding in the constructor, which is very useful for non-Roman or non-standard web pages.
- Versatile find/findAll on tag names and attributes.
- Developer-friendly syntactic sugar: the Tag implements the interfaces of string, list, dict and callable, so there are many ways to access the data as you wish. The drawback of this approach is that a typo is only caught at run time instead of compile time.
- Easy to deploy: only one BeautifulSoup.py file.
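The points above can be sketched in a few lines. This is only an illustration, not the script from this post: it assumes the newer bs4 package, where fromEncoding and findAll are spelled from_encoding and find_all, and the markup, ids and class names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical forum markup, encoded as GB2312 bytes to mimic a non-UTF-8 page.
html_bytes = (
    '<html><body>'
    '<div class="post" id="p1"><a href="/user/1">alice</a><p>\u4f60\u597d</p></div>'
    '<div class="post" id="p2"><a href="/user/2">bob</a><p>second</p></div>'
    '</body></html>'
).encode("gb2312")

# Override the encoding instead of relying on auto-detection.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="gb2312")

# find / find_all on tag names and attributes.
posts = soup.find_all("div", attrs={"class": "post"})
first = soup.find("div", id="p1")

# Syntactic sugar: attribute access, dict-style access, and calling a tag.
author = first.a.string        # first <a> inside the div, then its text
href = first.a["href"]         # attributes are read like dict entries
greeting = first.p.string      # decoded correctly thanks to from_encoding
also_posts = soup("div", "post")  # calling a tag is shorthand for find_all

# The drawback: a mistyped tag name is not an error, it just yields None.
typo = first.auther
```

The last line shows why typos are only noticed at run time: the misspelled name silently returns None, and the failure only appears later when something like typo.string is evaluated.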
Some things I don’t like:
- No XPath support, so more effort is needed to port from JavaScript.
- The API does not parse a stream incrementally: even when it is handed a file object, it reads the whole document before building the tree. Laziness is always cherished for pipelining.
- Why the name BeautifulSoup? I have mistyped it as Soap more than ten times.
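For contrast, here is a rough sketch of what stream-friendly parsing looks like. It uses html.parser from the Python standard library rather than BeautifulSoup, and the LinkCollector class and the chunked input are invented for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler that records href values as <a> tags arrive."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Pretend these chunks arrive one by one from a socket or a large file.
# Note that the first <a> tag is split across two chunks.
chunks = [
    '<html><body><a hre',
    'f="/thread/1">first</a>',
    '<a href="/thread/2">second</a></body></html>',
]

collector = LinkCollector()
for chunk in chunks:
    collector.feed(chunk)  # incomplete tags are buffered until the rest arrives
collector.close()

links = collector.links
```

The price of this style is that there is no tree to navigate afterwards, which is exactly the convenience BeautifulSoup provides.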
Here is the home-brewed script to wade through the Dvbbs thread to find the corresponding messages: elevator.py







> Why BeautifulSoup?
Because it makes tagsoups beautiful
Yes, BeautifulSoup is a wonderful module if you have to deal with HTML parsing on the Web. But you can do the same with the power of lxml, so why do you need to use both in your script?
The “Soup” is presumably because HTML is frequently referred to as “tag soup”.
Nice article. Thanks. :) Eugene
I like your post. You have done a great job, and I agree with you that BeautifulSoup is preferred for its simplicity and convenient API. The features you have written about are great.