I was considering not using Nutch for indexing web documents. I was thinking
of either extracting the full HTML document or, using some kind of web
scraper or HTML parser utility, extracting only the text content from a web
page and then indexing that.
I know it is strange, but I feel I have
I am using this code, which is pretty basic, and it won't index the documents.
I run the index code and print each document to make sure that it gets
indexed, but when I look at the output "gen" and "segments" files, there are
only about 20 bytes of data in them. I am indexing about 300k of te
I have been fine going from my database (discussion forum) to Lucene. I am
taking the simplest approach, e.g., I have a discussion forum whose contents
are just text messages; I take those out of the database and then index them.
I am having trouble because I have hundreds of thousands of messages and i
On Fri, 11 May 2007 09:02:04 -0400, Erick Erickson wrote
> Search the mail archive for Oracle, and there's lengthy discussion. The
> short form is that you query your database, taking selected
> data from it and add it to a Lucene document, then write the
> document to your Lucene index. Repeat thi
I think this is a simple question, or maybe not; I don't know. Is there a way
to automatically convert all tokens to wildcard queries for any given input?
I.e., if I enter 'n' it will convert that to 'n*'. Also, I am using multiple
fields, so this is how I presently have it.
MultiFieldQueryParser parser = new
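
One possible way to get the 'n' to 'n*' behaviour asked about above is to rewrite the raw input string before it reaches the parser. The following is only a minimal sketch under stated assumptions: WildcardRewriter and toPrefixQuery are hypothetical names, not part of Lucene, and the code assumes tokens are separated by whitespace and contain no query syntax characters that would need escaping.

```java
// Hypothetical helper, not part of Lucene: rewrites raw user input so that
// every whitespace-separated token becomes a trailing-wildcard term.
class WildcardRewriter {

    static String toPrefixQuery(String input) {
        if (input == null) {
            return "";
        }
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
            // Leave tokens that already end in a wildcard untouched.
            if (!token.endsWith("*")) {
                out.append('*');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPrefixQuery("n"));           // n*
        System.out.println(toPrefixQuery("luc  index*")); // luc* index*
    }
}
```

The rewritten string could then presumably be handed to the existing parser's parse(String) call, so the multi-field setup stays unchanged. This sketch does not special-case operators such as AND or OR, nor quoted phrases, so it would need extending if users are expected to type those.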