Maybe I'm missing something simple, but I don't see how this will work. It looks like this filter will just filter out documents that don't have guid field, but in my case every document has a guid.
In a single index there are no duplicates. Duplicates are only a problem when I search multiple indexes. --Jason > You probably want to build a Filter. > > I've been planning to do exactly this on our own system, only our > duplicates are indicated by documents having the same value in an MD5 > digest field, instead of a GUID field. > > For a single Reader, such a filter would work something like this: > > public class UniqueFilter extends Filter { > public BitSet bits(IndexReader reader) throws IOException { > BitSet result = new BitSet(reader.maxDoc()); > TermDocs termDocs = reader.termDocs(); > TermEnum terms = reader.terms(new Term("guid", "")); > while (terms.next() && terms.term().field().equals("guid")) { > termDocs.seek(terms.term()); > if (termDocs.next()) { > result.set(termDocs.doc()); > } > } > return result; > } > } > > If you were to wrap this in a CachingWrapperFilter, the hard work would > only be executed once, and that's the main benefit of using a filter. > > However, for multiple indexes it might be more tricky. If you're not > willing to switch to MultiReader (we're in the same boat, if that's the > case) then you'll have to build a different set of bits for each reader, > and loop through all readers' TermEnums at once. If you step them all > through one at a time, you can get fairly good efficiency as you can > skip calling termDocs on readers where the term didn't occur. > > Daniel --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]