Steven D'Aprano wrote:
> On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
>> Python is not C.
>
> John Nagle is an old hand at Python. He's perfectly aware of this, and
> I'm sure he's not trying to program C in Python.
>
> I'm not entirely sure *what* he is doing, and hopefully he'll speak up
> and say, but whatever the problem is it's not going to be as simple as
> that.

I didn't write this code; I'm just using it. As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use. HTML 5 parsing is well-defined even for the awful
cases that make older browsers incompatible, but it is quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.
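
A minimal sketch of how such a speed comparison might be timed. This is not
code from html5lib or BeautifulSoup: the names BAD_HTML, TagCounter,
parse_with_stdlib and time_parsers are hypothetical, and the standard
library's html.parser is used only as a stand-in so the example runs
without third-party packages.

```python
import timeit
from html.parser import HTMLParser

# Deliberately malformed markup (illustrative only): unclosed <p> tags,
# misnested <b>/<i>, an unquoted attribute, and a stray </div>.
BAD_HTML = (
    "<html><body><p>one<p>two<b>bold<i>both</b>italic</i>"
    "<a href=x>link</div>"
) * 200


class TagCounter(HTMLParser):
    """Stand-in parser that only counts the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def parse_with_stdlib(text):
    """Parse text with the stand-in parser and return the start-tag count."""
    parser = TagCounter()
    parser.feed(text)
    parser.close()
    return parser.count


def time_parsers(parsers, text, repeat=3, number=5):
    """Return {name: best observed seconds per call} for each callable."""
    results = {}
    for name, func in parsers.items():
        # Bind func as a default argument so each lambda times its own parser.
        timings = timeit.repeat(lambda f=func: f(text),
                                repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


if __name__ == "__main__":
    timings = time_parsers({"stdlib html.parser": parse_with_stdlib}, BAD_HTML)
    for name, seconds in timings.items():
        print("%s: %.6f s per parse" % (name, seconds))
```

To compare the two parsers under discussion, one would add callables that
wrap each of them to the dictionary passed to time_parsers and feed them
pages saved from a real crawl rather than the synthetic string above.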
John Nagle