At 11:21 AM -0700 6/9/08, subeen wrote:
On Jun 10, 12:15 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
 subeen wrote:
 > can use urllib2 module and/or beautiful soup for developing crawler

 Not if you care about a) speed and/or b) memory efficiency.

 > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

 Stefan

Yes, Beautiful Soup is slower, so it's better to use urllib2 for
fetching the data and regular expressions for parsing it.


regards,
Subeen.
http://love-python.blogspot.com/
--
http://mail.python.org/mailman/listinfo/python-list
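The urllib2-plus-regular-expressions approach suggested above might look roughly like this (a minimal sketch using urllib2's Python 3 successor, urllib.request; the link pattern and sample markup are made up for illustration, and the fetch function is shown but not exercised):

```python
import re
import urllib.request  # urllib2's successor in Python 3


def fetch(url):
    # Fetch raw HTML; a real crawler would add retries and error handling.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_links(html):
    # Naive regex-based link extraction: fast, but fragile on odd markup.
    return re.findall(r'<a\s+[^>]*href=["\']([^"\']+)["\']', html,
                      re.IGNORECASE)


sample = '<a href="http://example.com/a">a</a> <a class="x" href="/b">b</a>'
print(extract_links(sample))  # ['http://example.com/a', '/b']
```

This is fast precisely because it does no real parsing, which is also why it breaks on markup a browser would happily render.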

Beautiful Soup is a bit slower, but it will actually parse some of the bizarre HTML you'll download off the web. We've written a couple of crawlers to run over specific clients' sites (I note, we did _not_ create the content on these sites).

Expect to find HTML that looks like this:

<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]
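For what it's worth, the standard library illustrates the difference on that exact snippet: a strict XML parser rejects the mismatched nesting outright, while the tolerant stdlib HTMLParser accepts it without complaint (shown here purely as an illustration of strict vs. forgiving parsing, not as a Beautiful Soup substitute):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

bad_html = "<ul><li><form></li></form></ul>"

# Strict parsing: xml.etree refuses the mismatched </li> / </form> nesting.
try:
    ET.fromstring(bad_html)
    strict_ok = True
except ET.ParseError:
    strict_ok = False
print(strict_ok)  # False


class TagCollector(HTMLParser):
    # Tolerant parsing: just record every start tag we see.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(bad_html)  # no exception raised
print(collector.tags)  # ['ul', 'li', 'form']
```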

I don't know whether some of the quicker parsers discussed require well-formed HTML, since I haven't used them. You may want to consider using one of the quicker HTML parsers and, when it chokes on the downloaded HTML, dropping back to Beautiful Soup -- which usually gets _something_ useful off the page.
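That fallback strategy can be sketched with the standard library alone; here xml.etree stands in for a "quick but strict" parser and the forgiving stdlib HTMLParser stands in for Beautiful Soup's role (both are stand-ins, not the real libraries):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Forgiving fallback: collect text content without caring about nesting.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    try:
        # Fast path: only succeeds on well-formed markup.
        root = ET.fromstring(html)
        return [t.strip() for t in root.itertext() if t.strip()]
    except ET.ParseError:
        # Slow but tolerant path for real-world HTML.
        parser = TextExtractor()
        parser.feed(html)
        return parser.chunks


print(extract_text("<p>well formed</p>"))                     # fast path
print(extract_text("<ul><li>broken<form></li></form></ul>"))  # fallback path
```

The same shape works with the real libraries: attempt the strict parse first, and only pay Beautiful Soup's cost on the pages that need it.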

--Ray

--

Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com
