On Dec 9, 2009, at 7:39 PM, my name wrote:

I'm currently planning to write a web crawler in Python, but I have a
question about how I should design it. My goal is speed and the most
efficient use of the hardware/bandwidth I have available.

As of now I have a dual 2.4 GHz Xeon box, 4 GB of RAM, a 500 GB SATA drive,
and a 20 Mbps bandwidth cap (for now), running FreeBSD.

What would be the best way to design the crawler? Using the thread module?
Would I be able to max out this connection with the hardware listed above
using Python threads?

I wrote a web crawler in Python (under FreeBSD, in fact) and I chose to do it using separate processes. Process A would download pages and write them to disk, process B would attempt to convert them to Unicode, process C would evaluate the content, etc. That worked well for me because the processes were very independent of one another, so they had very little data to share. Each process had a work queue (a Postgres database table); process A would feed B's queue, B would feed C & D's queues, etc.
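To give you a feel for that shape, here's a rough sketch (not my original code): it uses multiprocessing.Queue in place of the Postgres work-queue tables, Python 3's urllib.request for fetching, and a trivial "evaluate" step, just to show how the stages hand work to one another.

# Sketch of a download -> decode -> evaluate pipeline using separate processes.
# Queues stand in for the Postgres work-queue tables I actually used.
import multiprocessing as mp
import urllib.request

def downloader(url_queue, page_queue):
    """Process A: fetch raw pages and pass them downstream."""
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: no more work
            page_queue.put(None)
            break
        try:
            raw = urllib.request.urlopen(url, timeout=30).read()
            page_queue.put((url, raw))
        except Exception as exc:
            print("failed to fetch %s: %s" % (url, exc))

def decoder(page_queue, text_queue):
    """Process B: attempt to convert raw bytes to Unicode."""
    while True:
        item = page_queue.get()
        if item is None:
            text_queue.put(None)
            break
        url, raw = item
        text_queue.put((url, raw.decode("utf-8", errors="replace")))

def evaluator(text_queue):
    """Process C: evaluate the content (here, just report its size)."""
    while True:
        item = text_queue.get()
        if item is None:
            break
        url, text = item
        print("%s: %d characters" % (url, len(text)))

if __name__ == "__main__":
    urls, pages, texts = mp.Queue(), mp.Queue(), mp.Queue()
    workers = [mp.Process(target=downloader, args=(urls, pages)),
               mp.Process(target=decoder, args=(pages, texts)),
               mp.Process(target=evaluator, args=(texts,))]
    for w in workers:
        w.start()
    urls.put("http://example.com/")
    urls.put(None)                   # tell the downloader to stop
    for w in workers:
        w.join()

Because each stage only ever touches its own queue, the processes never need to share state, which is what made this design comfortable for me.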

I should point out that my crawler spidered one site at a time. As a result the downloading process spent a lot of time waiting (in order to be polite to the remote Web server). This sounds pretty different from what you want to do (and indeed from most crawlers).
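For what it's worth, that polite single-site loop amounts to little more than this (a sketch with an assumed fixed delay; a real crawler should also consult robots.txt and any Crawl-delay the site specifies):

# Fetch a list of URLs from one site, pausing between requests.
import time
import urllib.request

CRAWL_DELAY = 2.0    # seconds between requests to the same host (assumed value)

def polite_fetch(urls):
    pages = []
    for url in urls:
        pages.append(urllib.request.urlopen(url, timeout=30).read())
        time.sleep(CRAWL_DELAY)   # most of the wall-clock time is spent here
    return pages

With a delay like that the bottleneck is courtesy, not bandwidth, which is why my numbers wouldn't tell you much about maxing out a 20 Mbps pipe.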

Figuring out the best design for your crawler depends on a host of factors that you haven't mentioned. (What are you doing with the pages you download? Is the box doing anything else? Are you storing the pages long term or discarding them? etc.) I don't think we can do it for you -- I know *I* can't; I have a day job. ;) But I encourage you to try something out. If you find your code isn't doing what you want, come back to the list with a specific problem. It's always easier to help with specific problems than with general ones.

Good luck
Philip
