Steven D'Aprano wrote:
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:

Python is not C.

John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python.

I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever the problem is, it's not going to be as simple as that.

    I didn't write this code; I'm just using it.  As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use.  HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

    I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad.  I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up.  So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do.  But if it's slower
than BeautifulSoup, there's a problem.

                                        John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
