Re: duplication checking while indexing

2008-12-30 Thread Chris Lu
JDBM is surely a better way than an in-memory hash map. But I feel that, since all previous documents are already in the index, although it is not closed yet, there should be a way to read all previous terms. It's OK to use an additional data structure, like JDBM or a hash map, to duplicate the terms, in order to look
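
A minimal sketch of the "additional data structure" idea discussed in this thread, assuming each document carries a unique string key. The class name `SeenKeys` and both method names are hypothetical, and the plain `HashSet` stands in for whatever store is actually used (an in-memory map, or a disk-backed one such as JDBM). No Lucene calls are shown; the comment marks where a document would be handed to the indexer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical side structure that remembers which document keys were already seen. */
final class SeenKeys {
    private final Set<String> keys = new HashSet<>();

    /** Returns true the first time a key is offered and false for every repeat. */
    boolean markIfNew(String key) {
        return keys.add(key);
    }

    /** Keeps only the first occurrence of each key, preserving input order. */
    static List<String> firstOccurrences(List<String> incomingKeys) {
        SeenKeys seen = new SeenKeys();
        List<String> toIndex = new ArrayList<>();
        for (String key : incomingKeys) {
            if (seen.markIfNew(key)) {
                // Assumption: this is where the document would be added to the index.
                toIndex.add(key);
            }
        }
        return toIndex;
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("doc-1", "doc-2", "doc-1", "doc-3");
        System.out.println(firstOccurrences(incoming));
    }
}
```

The set lives outside the index, so it duplicates the keys, which is the trade-off raised above; an in-memory set is also lost on restart unless it is rebuilt or replaced with a persistent store.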

Re: duplication checking while indexing

2008-12-29 Thread liu Ivan
I use JDBM to store each document's key ID.

2008/12/30 Chris Lu
> Otis, thanks for the pointer.
> I think the question can be:
>
> How to access TermEnum or TermInfos during indexing.
>
> If this is possible, things would be easier.
>
> --
> Chris Lu
> -
> Instant Scalable Full-

Re: duplication checking while indexing

2008-12-29 Thread Chris Lu
Otis, thanks for the pointer. I think the question can be:

How to access TermEnum or TermInfos during indexing.

If this is possible, things would be easier.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: h

Re: duplication checking while indexing

2008-12-29 Thread Otis Gospodnetic
Chris,

Mark Miller & Co. are working on (Near) Duplicate Detection. I think the work is in Solr's JIRA, but some of it might be applicable to Lucene.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: Chris Lu
> To: "java-user@lucene.apach