On Wed, Mar 4, 2009 at 4:41 PM, Grant Ingersoll <gsing...@apache.org> wrote: > You might have a look at Droids (http://incubator.apache.org/droids/) or > Nutch (http://lucene.apache.org/nutch) and their communities. They are much > more focused on crawling (not to say there aren't people here who crawl, > just saying those projects are (mostly) about crawling) > > > On Mar 4, 2009, at 4:30 PM, bruce wrote: > >> Hi... >> >> Sorry that this is a bit off track. Ok, maybe way off track! >> >> But I don't have anyone to bounce this off of.. >> >> I'm working on a crawling project, crawling a college website, to extract >> course/class information. I've built a quick test app in python to crawl >> the >> site. I crawl at the top level, and work my way down to getting the >> required >> course/class schedule. The app works. I can consistently run it and >> extract >> the information. >> >> My issue is now that I have a "basic" app that works, i need to figure out >> how I guarantee that I'm correctly crawling the site. How do I know when >> I've got an error at a given node/branch, so that the app knows that it's >> not going to fetch the underlying branch/nodes of the tree.. >> >> How do I know when I have a complete "tree"! >> >> I'm looking for someone, or some group/prof that I can talk to about these >> issues. My goal is to eventually look at using nutch/lucene if at all >> applicable. >> >> Any pointers, or people, or papers, etc... would be helpful.
The ir-book[1] also has a section on the subject if you hadn't already seen it. --tim [1] - http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org