Hi. After reading through the docs for Lucene/Nutch, I'm trying to straighten out how it all works.
If I want to crawl through a portion of a web site for the purpose of extracting information, it appears that this would work. However, I'm not sure whether I need Lucene, Nutch, or both. I don't need to do indexing, since I'm not going to be doing any query searching, at least not initially.

I'm also trying to understand just what gets returned when I 'crawl' a portion of a site. Do I get the information back as a series of HTML files? Do I get a database of information? Just what do I get?

I'm looking at being able to take a given URL, say www.foo.com, and crawl through a portion of that site, and I need to figure out how to accomplish this. Once I have the returned information (if it's in a file/text format), I'd like to be able to extract certain information based upon the DOM of each page. If the information returned from the crawler is in a text-file format, I can easily write a parsing function to go through the files and pull out the information I need.

Can someone provide me with insight as to whether Lucene/Nutch is the way to go for this project?

Thanks,
-bruce
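To make the second half concrete, here is a minimal sketch of the kind of post-crawl parsing function I have in mind. It assumes the fetched pages have already been dumped as plain HTML/text files into a local directory (the "dump" directory name and the href-extraction pattern are just placeholders for illustration), and it uses only the JDK, not Nutch or Lucene:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpParser {

    // Placeholder extraction rule: grab the href value of every anchor tag.
    // In practice this would be replaced by whatever DOM-based rule I need.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        // Assumes the crawl output has already been dumped as text/HTML files
        // into a local directory ("dump" here is just a made-up name).
        File dumpDir = new File(args.length > 0 ? args[0] : "dump");
        File[] files = dumpDir.listFiles();
        if (files == null) {
            System.err.println("no such directory: " + dumpDir);
            return;
        }

        for (File file : files) {
            // Read the whole page into memory.
            StringBuffer page = new StringBuffer();
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();

            // Print every extracted value, one per line, tagged with the file name.
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                System.out.println(file.getName() + "\t" + m.group(1));
            }
        }
    }
}

If the crawl output turns out to be a database/segment format rather than flat files, this would obviously change, which is part of what I'm asking.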