You might have a look at Droids (http://incubator.apache.org/droids/)
or Nutch (http://lucene.apache.org/nutch) and their communities. They
are much more focused on crawling (not to say there aren't people here
who crawl, just saying those projects are (mostly) about crawling)
On Mar 4, 2009, at 4:30 PM, bruce wrote:
Hi...
Sorry that this is a bit off track. Ok, maybe way off track!
But I don't have anyone to bounce this off of..
I'm working on a crawling project, crawling a college website, to
extract
course/class information. I've built a quick test app in python to
crawl the
site. I crawl at the top level, and work my way down to getting the
required
course/class schedule. The app works. I can consistently run it and
extract
the information.
My issue is now that I have a "basic" app that works, i need to
figure out
how I guarantee that I'm correctly crawling the site. How do I know
when
I've got an error at a given node/branch, so that the app knows that
it's
not going to fetch the underlying branch/nodes of the tree..
How do I know when I have a complete "tree"!
I'm looking for someone, or some group/prof that I can talk to about
these
issues. My goal is to eventually look at using nutch/lucene if at all
applicable.
Any pointers, or people, or papers, etc... would be helpful.
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org