Re: Can I use Lucene to retrieve a list of duplicates

2007-02-27 Thread Paul Taylor
MBER field values if you use a FieldCache instead of fetching each document.

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Chris Hostetter
nt. : Date: Mon, 26 Feb 2007 16:25:11 + : From: Paul Taylor <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED] : To: Erick Erickson <[EMAIL PROTECTED]> : Cc: java-user@lucene.apache.org : Subject: Re: Can I use Lucene to retrieve a list of duplicates : :

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Paul Taylor
Hi, I got it working before I saw your latest mail; the only problem is that it doesn't look very efficient. This is my duplicate method. The problem is that I have to enumerate through *every* term. This was worse before because I was only interested in terms that matched a particular field (

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Erick Erickson
Here's an excerpt from something I wrote to enumerate all the terms for a field. I hacked out some of my tracing, so it may not even compile. Basically, change the line "if (td.next())" to "while (td.next())" and every time you stay in that loop for more than one cycle, you'll have duplicate

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Paul Taylor
Hi, sorry, I don't see how I get access to TermEnums. So far I've created a document per row: the first field holds the row id, then I have one field per column. I checked that the index has been created OK with some search queries. I now want to pass a column to check, and receive a list of all

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Chris Hostetter
: Thanks, this might do it, but do I need to know the terms beforehand? I just want to return any terms with frequency more than one. No, TermEnum will let you iterate over all the terms ... you don't even need TermDocs if you just want the docFreq for each term (which would be 1 if there are no
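
The docFreq idea above can be illustrated without Lucene at all. The following is only a sketch in plain Java, not Lucene API: the class `DuplicateFinder` and the method `findDuplicates` are hypothetical names, and the `TreeMap` is an assumed stand-in for the index's sorted term dictionary. It maps each column value ("term") to the ids of the rows ("documents") that contain it, then keeps only the values whose frequency is greater than one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: a plain-Java stand-in for
// "walk every term of one field and report those with docFreq > 1".
public class DuplicateFinder {

    /**
     * Returns each column value that occurs in more than one row,
     * mapped to the ids (list positions) of the rows that hold it.
     */
    public static Map<String, List<Integer>> findDuplicates(List<String> columnValues) {
        // Tiny inverted index: term -> ids of the rows containing it.
        // TreeMap keeps terms sorted, much like a term dictionary would.
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int rowId = 0; rowId < columnValues.size(); rowId++) {
            postings.computeIfAbsent(columnValues.get(rowId), k -> new ArrayList<>())
                    .add(rowId);
        }
        // Drop every term whose "document frequency" is 1 (no duplicate).
        postings.values().removeIf(rows -> rows.size() < 2);
        return postings;
    }

    public static void main(String[] args) {
        // Made-up sample column, one value per table row.
        List<String> column = Arrays.asList("red", "blue", "red", "green", "blue", "red");
        System.out.println(findDuplicates(column));
    }
}
```

In a real index the postings already exist, so only the filtering step would be needed; the map-building loop here merely simulates what indexing one document per row provides.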

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Yes, I've seen this before, thanks. It was an article that referred to this that pointed me towards Lucene in the first place :) Erik Hatcher wrote: On Feb 23, 2007, at 10:16 AM, Paul Taylor wrote: Hi, I have a Java Swing application with a table, and I was considering using Lucene to index the data i

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Thanks, this might do it, but do I need to know the terms beforehand? I just want to return any terms with frequency more than one. Erick Erickson wrote: Sure, you can use the TermDocs/TermEnum classes. Basically, for a term (probably column value in your app) these let you quickly answer the q

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Erik Hatcher
On Feb 23, 2007, at 10:16 AM, Paul Taylor wrote: Hi, I have a Java Swing application with a table, and I was considering using Lucene to index the data in the table. One task I'd like to do is for the user to select 'Find Duplicate records for Column X', then I would filter the table to show only

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Erick Erickson
Sure, you can use the TermDocs/TermEnum classes. Basically, for a term (probably column value in your app) these let you quickly answer the question "which (and how many) documents does this term appear in". What you get is the Lucene doc id, which lets you fetch all the information about the doc

Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Hi, I have a Java Swing application with a table, and I was considering using Lucene to index the data in the table. One task I'd like to do is for the user to select 'Find Duplicate records for Column X', then I would filter the table to show only records where there is more than one with the same val