Paul Rubin wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
>
>> And if only the html-parsing is slow, you might consider creating an
>> extension for that. Using e.g. Pyrex.
>
> I just tried using BeautifulSoup to pull some fields out of some html
> files--about 2 million files, output of a web crawler. It parsed very
> nicely at about 5 files per second.
That's about what I'm seeing, and it's the bottleneck of "sitetruth.com".

> By simply treating the html as a big string and using string.find to
> locate the fields I wanted, I got it up to about 800 files/second,
> which made each run about 1/2 hour.

For our application, we have to look at the HTML in some detail, so we
really need it in tree form.

> Simpler still would be if Python just ran about 100x faster than it
> does, a speedup which is not outlandish to hope for.

Right. Looking forward to ShedSkin getting good enough to run
BeautifulSoup.

(Actually, the future of page parsing is probably to use some kind of
stripped-down browser that reads the page, builds the DOM, runs the
startup JavaScript, and then lets you examine the DOM. There are too
many pages now that come through as blank if you don't run the OnLoad
JavaScript.)

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
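[Editor's note: the string.find technique Paul describes can be sketched as below. This is a minimal illustration, not his actual code; the field names, markup, and helper function are hypothetical, and real crawler output would need more defensive handling of malformed HTML.]

```python
def extract_field(html, open_tag, close_tag):
    """Pull the text between open_tag and close_tag using plain string
    searching. No parse tree is built, which is why this is orders of
    magnitude faster than a full HTML parse -- at the cost of breaking
    on attribute variations, nesting, or malformed markup."""
    start = html.find(open_tag)
    if start == -1:
        return None                      # field not present
    start += len(open_tag)
    end = html.find(close_tag, start)
    if end == -1:
        return None                      # unterminated field
    return html[start:end]

# Hypothetical example page:
page = "<html><head><title>Example report</title></head><body></body></html>"
print(extract_field(page, "<title>", "</title>"))  # -> Example report
```

The trade-off is exactly the one in the thread: this is fine for pulling a few known fields out of mostly regular pages, but once you need to "look at the HTML in some detail," you need a real tree from a parser such as BeautifulSoup.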