crawler questions..

bruce Wed, 04 Mar 2009 13:30:39 -0800

Hi,

Sorry that this is a bit off topic. OK, maybe way off topic! But I don't
have anyone to bounce this off of.

I'm working on a crawling project, crawling a college website to extract
course/class information. I've built a quick test app in Python to crawl the
site. I start at the top level and work my way down to the required
course/class schedule. The app works. I can consistently run it and extract
the information.

My issue is that, now that I have a "basic" app that works, I need to figure
out how to guarantee that I'm crawling the site correctly. How do I know when
I've hit an error at a given node/branch, so that the app knows it's not
going to fetch the underlying branches/nodes of the tree?

How do I know when I have a complete "tree"?
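
To make the question concrete, here is a minimal sketch of the kind of
bookkeeping I have in mind. It is not my actual app. The names crawl_tree and
fetch_children are made up for illustration, and fetch_children is assumed to
return a page's child links or raise an exception when the page can't be
fetched or parsed.

```python
from collections import deque


def crawl_tree(root, fetch_children):
    """Breadth-first walk that records the outcome of every visited node.

    fetch_children(url) is a hypothetical callable (an assumption, not a
    real library API): it returns a list of child URLs, or raises an
    exception if the page could not be fetched or parsed.
    """
    status = {}             # url -> "ok" or "error: <reason>"
    parent = {root: None}   # url -> url it was discovered from
    queue = deque([root])

    while queue:
        url = queue.popleft()
        try:
            children = fetch_children(url)
        except Exception as exc:
            # Everything below this node is unknown, so record the failure
            # instead of silently dropping the branch.
            status[url] = "error: %s" % exc
            continue

        status[url] = "ok"
        for child in children:
            if child not in parent:   # skip links we have already seen
                parent[child] = url
                queue.append(child)

    # Any node listed here marks a branch whose subtree was never crawled.
    failed = [url for url, state in status.items() if state != "ok"]
    return status, failed
```

With something like this, an empty failed list would only mean that every
page reached was fetched and parsed without raising. It would not catch a page
that loads fine but is missing links it should have, which is the harder part
of knowing whether the tree is complete.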

I'm looking for someone, or some group/professor, that I can talk to about
these issues. My goal is to eventually look at using Nutch/Lucene, if at all
applicable.

Any pointers, people, papers, etc. would be helpful.

Thanks





