In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote:
> abeen wrote:
> > Hello,
> >
> > I would want to know which could be the best programming language for
> > developing web spider.
> > More information about the spider, much better,,
>
> As someone who actually runs a Python based web spider in production, I
> should comment.
>
> You need a very robust parser to parse real world HTML.
> Even the stock version of BeautifulSoup isn't good enough. We have a
> modified version of BeautifulSoup, plus other library patches, just to
> keep the parser from blowing up or swallowing the entire page into
> a malformed comment or tag. Browsers are incredibly forgiving in this
> regard.
>
> "urllib" needs extra robustness, too. The stock timeout mechanism
> isn't good enough. Some sites do weird things, like open TCP connections
> for HTTP but not send anything.
>
> Python is on the slow side for this. Python is about 60x
> slower than C, and for this application, you definitely see that.
> A Python based spider will go compute bound for seconds per page
> on big pages. The C-based parsers for XML/HTML aren't robust enough for
> this application. And then there's the Global Interpreter Lock; a multicore
> CPU won't help a multithreaded compute-bound process.
>
> I'd recommend using Java or C# for new work in this area
> if you're doing this in volume. Otherwise, you'll need to buy
> many, many extra racks of servers. In practice, the big spiders
> are in C or C++.

I'll throw in an opinion from a different viewpoint. I'm really happy I
used Python to develop my spider. I like the language, it has a good
library and good community support and 3rd party modules.

John, I don't know what your spider does, but you face some hurdles that
I don't. For instance, since I'm focused on validation, if bizarre
(invalid) HTML makes a page look like garbage, I just report the problem
to the author. Performance isn't a big problem for me, either, since this
is not a crawl-as-fast-as-you-can application.
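
As an aside on the quoted point about sites that open a TCP connection
but never send anything: below is a minimal sketch of guarding a fetch
with a socket timeout. It assumes Python 3's urllib.request rather than
the urllib of the quoted post, and the helper name fetch and the
max_bytes cap are hypothetical, chosen only for illustration. Note that
the timeout applies to each blocking socket operation, not to the whole
download, so a server that trickles bytes slowly can still hold a
connection open; a production spider would probably want an overall
deadline as well.

```python
import http.client
import urllib.request


def fetch(url, timeout=10.0, max_bytes=1_000_000):
    """Return up to max_bytes of the response body, or None on failure.

    'timeout' is passed to urlopen and applies to each blocking socket
    operation (connect, reading the response headers, and so on).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Cap the read so one huge page cannot exhaust memory.
            return resp.read(max_bytes)
    except (OSError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # HTTPException covers malformed or truncated responses.
        return None
```

A caller would treat None as "skip this URL for now" and move on, for
example: body = fetch("http://example.com/", timeout=5.0).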
What you said sounds to me entirely correct for your application. The OP
who asked for as much information as possible didn't give a whole lot to
start with.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more