Hello,

Some time ago I was searching for a library that would simplify mass data scraping/extraction from web pages. A Python XPath implementation seemed like the way to go. The problem was that most of the HTML on the net doesn't conform to XML standards, and that includes pages advertised as valid XHTML.
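
To make the problem concrete, here is a minimal sketch of a strict XML parser rejecting ordinary "tag soup". It uses only the standard library's xml.etree.ElementTree; the sample markup and the helper name parses_as_xml are made up for illustration and are not taken from any real page or from htxpath.

```python
import xml.etree.ElementTree as ET

# Typical tag soup: unquoted attribute value, unclosed <br> and <p>.
TAG_SOUP = "<html><body><p class=intro>Hello<br>world</body></html>"

# The same content written as well-formed XML.
WELL_FORMED = '<html><body><p class="intro">Hello<br/>world</p></body></html>'


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("tag soup accepted:   ", parses_as_xml(TAG_SOUP))
    print("well-formed accepted:", parses_as_xml(WELL_FORMED))
```

Any XPath engine built on top of a strict XML parser fails at this first step, before a single path expression is evaluated.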
I tried to fix that with BeautifulSoup plus regexp filtering of some particular cases I encountered. That was slow, and after my data scraper had been running for some time, a lot of new problems (exceptions from the XPath parser) kept showing up. Not to mention that BeautifulSoup stripped almost all of the content from some heavily broken pages (a 50+ KiB page stripped down to a few hundred bytes). Character encoding conversion was hell too: even UTF-8 pages had some non-standard characters causing issues.

Cutting to the chase, that's when I decided to take matters into my own hands. Overnight I hacked together a solution with a completely new approach. It's called htxpath: a small, lightweight Python library with no dependencies that lets you extract specific tag(s) from an HTML document using a path string whose syntax is very similar to XPath (but more convenient in some cases). It did a very good job for me. Rather than parsing the whole input into a tree, my library processes it like a character stream with regular expressions.

I decided to share it with everyone, so here it is:

http://code.google.com/p/htxpath/

I am aware that it is not beautifully coded, as my experience with Python is rather brief, but I am curious whether it will be useful to anyone (it's also my first potentially [real-world ;)] useful project gone public). If it is, I promise to continue developing it. It's probably full of bugs, but I can't catch them all by myself.

regards,
Filip Sobalski
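
For readers wondering what scanning HTML as a character stream with regular expressions can look like, below is a minimal sketch of the general idea. It is not htxpath's code, API, or path syntax: the function name find_tags and the sample markup are hypothetical, and the sketch only handles a single tag name. It walks the text left to right, looks only at opening and closing tags of the requested name, and keeps a depth counter, so broken markup from other tags never gets a chance to raise a parse error.

```python
import re


def find_tags(html, name):
    """Yield the inner content of each outermost <name>...</name> element.

    The text is scanned left to right with a regex; no tree is built.
    Nested elements of the same name stay inside the outer match.
    """
    # Matches an opening or a closing tag of the requested name only.
    tag_re = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(name), re.IGNORECASE)
    depth = 0
    start = None
    for match in tag_re.finditer(html):
        is_close = match.group(1) == "/"
        if not is_close:
            if depth == 0:
                start = match.end()  # content begins after the opening tag
            depth += 1
        elif depth > 0:  # stray closing tags at depth 0 are ignored
            depth -= 1
            if depth == 0:
                yield html[start:match.start()]


if __name__ == "__main__":
    # Broken on purpose: unquoted attribute, unclosed <br>, <b> and <p>.
    page = ("<html><body><div class=a>one<br><b>bold</div>"
            "<p>skip<div>two <div>nested</div></div></body>")
    for content in find_tags(page, "div"):
        print(repr(content))
```

The trade-off is that the scanner knows nothing about the rest of the document: self-closing forms such as `<div/>` and tags hidden inside comments or scripts would need extra handling, which this sketch leaves out.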