Re: crawler questions..

Grant Ingersoll Wed, 04 Mar 2009 13:42:43 -0800

You might have a look at Droids (http://incubator.apache.org/droids/) or Nutch (http://lucene.apache.org/nutch) and their communities. They are much more focused on crawling (not to say there aren't people here who crawl, just saying those projects are (mostly) about crawling).


On Mar 4, 2009, at 4:30 PM, bruce wrote:

Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..
I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information.
My issue is, now that I have a "basic" app that works, I need to figure out
how I guarantee that I'm correctly crawling the site. How do I know when
I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree?

How do I know when I have a complete "tree"!
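
One possible way to frame the question, as a minimal sketch only: record an
outcome for every node the crawler visits, so that a failed node (and
therefore its unfetched subtree) is visible afterwards. Everything named here
is an assumption for illustration: `fetch` is a hypothetical callable that
returns a page's child URLs or raises on failure, and "complete" is taken to
mean "every discovered node was fetched without error". It cannot detect
links the parser never saw.

```python
from collections import deque


def crawl(root, fetch):
    """Breadth-first crawl that records the outcome of every node.

    `fetch(url)` is a hypothetical callable supplied by the caller: it
    returns a list of child URLs, or raises an exception when the page
    cannot be fetched or parsed.
    """
    status = {}              # url -> "ok" or "error: <reason>"
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in status:    # already visited (the site may contain cycles)
            continue
        try:
            children = fetch(url)
        except Exception as exc:
            # The subtree below this node is unknown, so note the failure
            # instead of silently skipping it.
            status[url] = f"error: {exc}"
            continue
        status[url] = "ok"
        queue.extend(c for c in children if c not in status)
    failed = sorted(u for u, s in status.items() if s != "ok")
    return status, failed


def is_complete(failed):
    """The tree is treated as complete only if no node failed."""
    return not failed
```

The `failed` list can be fed back into a retry pass, and comparing the set of
"ok" nodes between runs gives a rough signal of whether the site structure
has changed.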
I'm looking for someone, or some group/prof that I can talk to about these
issues. My goal is to eventually look at using nutch/lucene if at all
applicable.

Any pointers, or people, or papers, etc... would be helpful.

Thanks





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search

