At 11:21 AM -0700 6/9/08, subeen wrote:
On Jun 10, 12:15 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
 subeen wrote:
 > can use urllib2 module and/or beautiful soup for developing crawler

 Not if you care about a) speed and/or b) memory efficiency.

 > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

 Stefan

Yes, Beautiful Soup is slower, so it's better to use urllib2 for
fetching the data and regular expressions for parsing it.


regards,
Subeen.
http://love-python.blogspot.com/
--
http://mail.python.org/mailman/listinfo/python-list
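The urllib2-plus-regular-expressions approach suggested above might look roughly like this (a minimal sketch using urllib2's Python 3 successor, urllib.request; the link pattern and sample markup are made up for illustration, and the fetch function is shown but not exercised):

```python
import re
import urllib.request  # urllib2's successor in Python 3


def fetch(url):
    # Fetch raw HTML; a real crawler would add retries and error handling.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_links(html):
    # Naive regex-based link extraction: fast, but fragile on odd markup.
    return re.findall(r'<a\s+[^>]*href=["\']([^"\']+)["\']', html,
                      re.IGNORECASE)


sample = '<a href="http://example.com/a">a</a> <a class="x" href="/b">b</a>'
print(extract_links(sample))  # ['http://example.com/a', '/b']
```

This is fast precisely because it does no real parsing, which is also why it breaks on markup a browser would happily render.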

Beautiful Soup is a bit slower, but it will actually parse some of the bizarre HTML you'll download off the web. We've written a couple of crawlers to run over specific clients' sites (I note, we did _not_ create the content on these sites).

Expect to find HTML that looks like this:

<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]
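For what it's worth, the standard library illustrates the difference on that exact snippet: a strict XML parser rejects the mismatched nesting outright, while the tolerant stdlib HTMLParser accepts it without complaint (shown here purely as an illustration of strict vs. forgiving parsing, not as a Beautiful Soup substitute):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

bad_html = "<ul><li><form></li></form></ul>"

# Strict parsing: xml.etree refuses the mismatched </li> / </form> nesting.
try:
    ET.fromstring(bad_html)
    strict_ok = True
except ET.ParseError:
    strict_ok = False
print(strict_ok)  # False


class TagCollector(HTMLParser):
    # Tolerant parsing: just record every start tag we see.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(bad_html)  # no exception raised
print(collector.tags)  # ['ul', 'li', 'form']
```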

I don't know whether some of the quicker parsers discussed require well-formed HTML, since I haven't used them. You may want to consider using one of the quicker HTML parsers and, when it chokes on the downloaded HTML, dropping back to Beautiful Soup -- which usually gets _something_ useful off the page.
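That fallback strategy can be sketched with the standard library alone; here xml.etree stands in for a "quick but strict" parser and the forgiving stdlib HTMLParser stands in for Beautiful Soup's role (both are stand-ins, not the real libraries):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Forgiving fallback: collect text content without caring about nesting.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    try:
        # Fast path: only succeeds on well-formed markup.
        root = ET.fromstring(html)
        return [t.strip() for t in root.itertext() if t.strip()]
    except ET.ParseError:
        # Slow but tolerant path for real-world HTML.
        parser = TextExtractor()
        parser.feed(html)
        return parser.chunks


print(extract_text("<p>well formed</p>"))                     # fast path
print(extract_text("<ul><li>broken<form></li></form></ul>"))  # fallback path
```

The same shape works with the real libraries: attempt the strict parse first, and only pay Beautiful Soup's cost on the pages that need it.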

--Ray

--

Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com
