Sounds interesting. When it's done, would you care to share it?

Sincerely,
Michael H.

-----Original Message-----
From: Philip Semanchuk [mailto:phi...@semanchuk.com]
Sent: Thursday, April 09, 2009 9:46 PM
To: Python
Subject: Re: Open source web crawler with mysql integration

On Apr 9, 2009, at 7:37 PM, Daniel Fetchinson wrote:

>> I'm looking for a crawler that can spider my site and toss the results
>> into mysql so, in turn, that database can be indexed by Sphinx Search.
>>
>> Since I don't want to reinvent the wheel, is anyone aware of any open
>> source projects or code snippets that can already handle this?
>
> Have a look at http://nikitathespider.com/python/

As the author of Nikita, I can say that (a) she used Postgres and (b) the code wasn't open sourced except for a couple of small parts.

The service is now defunct. It wasn't making money. Ideally I'd like to open source the code one day, but it would take a lot of documentation work to make it installable by others, and I won't have the time to do that for the foreseeable future.

At the URL provided there's a nice module for parsing robots.txt files (better than the one in the standard library IMHO) but that's about it.

FYI, I wrote my spider in Python because I couldn't find a decent one written in Python. There's Nutch, but that's not Python (Java I think).

Good luck
Philip
--
http://mail.python.org/mailman/listinfo/python-list
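
For anyone who wants to see what the robots.txt check in a crawler looks like, here is a minimal sketch using the standard library parser that Philip compares his module against. It is not Nikita's code and does not use the module at the URL above. It assumes a current Python 3, where the parser lives in `urllib.robotparser` (in the Python 2 of this thread's era it was the top-level `robotparser` module). The names `ROBOTS_TXT`, `make_checker`, `ExampleBot` and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt, user_agent):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one string.
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# "ExampleBot" is an illustrative user-agent name, not a real crawler.
allowed = make_checker(ROBOTS_TXT, "ExampleBot")
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

A crawler would normally call such a check before each request and skip any URL for which it returns False. Fetching the live file with `set_url()` and `read()` is the usual alternative to parsing an inline string.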