Hi. After reading through the docs for Lucene/Nutch, I'm trying to straighten out how it all works.
If I want to crawl through a portion of a web site for the purpose of extracting information, it appears that this would work. However, I'm not sure whether I need Lucene, Nutch, or both. I don't need to do indexing, since I'm not going to be doing any query searching, at least not initially.

I'm also trying to understand just what gets returned when I 'crawl' a portion of a site. Do I get the information back as a series of HTML files? Do I get a database of information? Just what do I get?

I'm looking at being able to take a given URL, say www.foo.com, and crawl through a portion of that site, and I need to figure out how to accomplish this. Once I have the returned information (if it's in a file/text format), I'd like to be able to extract certain information based upon the DOM of each page. If the information returned from the crawler is in a text-file format, I can easily write a parsing function to go through the files and pull out the information I need.

Can someone provide me with insight as to whether Lucene/Nutch is the way to go for this project?

Thanks,
-bruce
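To make the second half concrete, here is a minimal sketch of the kind of post-crawl parsing function I have in mind. It assumes the fetched pages have already been dumped as plain HTML/text files into a local directory (the "dump" directory name and the href-extraction pattern are just placeholders for illustration), and it uses only the JDK, not Nutch or Lucene:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpParser {

    // Placeholder extraction rule: grab the href value of every anchor tag.
    // In practice this would be replaced by whatever DOM-based rule I need.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        // Assumes the crawl output has already been dumped as text/HTML files
        // into a local directory ("dump" here is just a made-up name).
        File dumpDir = new File(args.length > 0 ? args[0] : "dump");
        File[] files = dumpDir.listFiles();
        if (files == null) {
            System.err.println("no such directory: " + dumpDir);
            return;
        }

        for (File file : files) {
            // Read the whole page into memory.
            StringBuffer page = new StringBuffer();
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();

            // Print every extracted value, one per line, tagged with the file name.
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                System.out.println(file.getName() + "\t" + m.group(1));
            }
        }
    }
}

If the crawl output turns out to be a database/segment format rather than flat files, this would obviously change, which is part of what I'm asking.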