Re: Code that ought to run fast, but can't due to Python limitations.

John Nagle Mon, 06 Jul 2009 23:31:26 -0700

Steven D'Aprano wrote:

On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
Python is not C.
John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python.
I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever the problem is it's not going to be as simple as that.


    I didn't write this code; I'm just using it.  As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use.  HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

    I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad.  I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up.  So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do.  But if it's slower
than BeautifulSoup, there's a problem.

                                        John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
