Lucene or nutch for indexing web documents

2007-11-27 Thread bbrown
I am considering not using Nutch for indexing web documents. I was thinking of either extracting the full HTML document or using some kind of web scraper or HTML parser utility to extract only the text content from a web page, and then indexing that. I know it is strange, but I feel I have

Can't code to index documents

2007-11-14 Thread bbrown
I am using this code, which is pretty basic, and it won't index the documents. I run the index code and print each document to make sure that it gets indexed, but when I look at the output "gen" and "segments" files, there are only about 20 bytes of data in them. I am indexing about 300k of te

Indexing in pieces?

2007-08-31 Thread bbrown
I have been fine indexing my database (a discussion forum) with Lucene. I am taking the simplest approach, e.g., I have a discussion forum whose posts are just text messages; I take those out of the database and then index the content. I am having trouble because I have hundreds of thousands of messages and i
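
The snippet above is cut off, so the exact problem is unknown. Going by the subject line, one possible reading is that the messages should be pulled from the database and indexed in fixed-size batches rather than all at once. A minimal sketch of that loop follows, using only the Java standard library. The names `BatchedIndexing`, `MessageSource`, and `indexInBatches` are hypothetical and not part of Lucene; the `indexBatch` callback stands in for whatever code adds documents to the index.

```java
import java.util.List;
import java.util.function.Consumer;

class BatchedIndexing {
    /** Hypothetical paged source, e.g. a database query with an offset and a limit. */
    interface MessageSource {
        List<String> fetch(int offset, int limit);
    }

    /**
     * Pulls messages page by page and hands each page to indexBatch.
     * Returns the total number of messages processed.
     */
    static int indexInBatches(MessageSource source, int batchSize,
                              Consumer<List<String>> indexBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<String> batch = source.fetch(offset, batchSize);
            if (batch.isEmpty()) {
                break; // nothing left to read
            }
            indexBatch.accept(batch); // assumption: caller adds these to the index
            offset += batch.size();
            if (batch.size() < batchSize) {
                break; // a short page means the source is exhausted
            }
        }
        return offset;
    }
}
```

Only one page of messages is held in memory at a time, and the running offset could be recorded so that an interrupted run resumes where it stopped.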

Re: Indexing the ORACLE using lucene

2007-05-11 Thread bbrown
On Fri, 11 May 2007 09:02:04 -0400, Erick Erickson wrote
> Search the mail archive for Oracle, and there's lengthy discussion. The
> short form is that you query your database, taking selected
> data from it and add it to a Lucene document, then write the
> document to your Lucene index. Repeat thi

Simple, always do wildcard or fuzzy query

2007-05-10 Thread bbrown
I think this is a simple question, or I don't know. Is there a way to automatically convert all tokens of any given input into wildcard queries? I.e., if I enter 'n', it will be converted to 'n*'. Also, I am using multiple fields, so this is how I presently have it. MultiFieldQueryParser parser = new
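
One way to get the 'n' to 'n*' behaviour described above is to rewrite the raw input string before it reaches the query parser. The sketch below is a minimal illustration using only the Java standard library. The class `PrefixQueryRewriter` and the method `toPrefixQuery` are hypothetical names, and the code assumes the input is plain whitespace-separated words with no query operators, quotes, or field prefixes.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixQueryRewriter {
    /**
     * Appends '*' to every whitespace-separated token so that each one
     * reads as a prefix query, e.g. "n" becomes "n*".
     * Tokens that already end in '*' are left unchanged.
     */
    static String toPrefixQuery(String userInput) {
        List<String> terms = new ArrayList<>();
        for (String token : userInput.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            terms.add(token.endsWith("*") ? token : token + "*");
        }
        return String.join(" ", terms);
    }
}
```

The rewritten string could then be passed to the existing parser's `parse` method in place of the raw input. Because wildcard terms typically bypass the analyzer, it may be worth checking that the letter case of the input matches what is stored in the index.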