On Saturday 12 April 2008 00:03:13, Antony Bowesman wrote:
> Paul Elschot wrote:
> > On Friday 11 April 2008 13:49:59, Mathieu Lecarme wrote:
> > > Use Filter and BitSet.
> > > From the personal data, you build a Filter
> > > (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html)
> > > which is used in the main index.
> >
> > With 1 billion mails, and possibly a Filter per user, you may want
> > to use more compact filters than BitSets, which is currently
> > possible in the development trunk of lucene.
>
> Thanks for the pointers. I've already used Solr's DocSet interface
> in my implementation, which I think is where the ideas for the
> current Lucene enhancements came from.

The ideas came from quite a few sources. They can be traced starting
from changes.txt in the sources.

> They work well to reduce the filter's footprint. I'm also caching
> filters.
>
> The intention is that there is a user data index and the mail
> index(es). The search against the user data index will return a set
> of mail Ids, which is the common key between the two. Doc Ids are no
> good between the indexes, so that means a potentially large boolean
> OR query to create the filter of labelled mails in the mail indexes.
> I know it's a theoretical question, but will this perform?

The normal way to collect doc ids for a filter is to iterate over the
indexed ids (mail ids in your case) and set the matching doc ids in a
bitset. A bitset has random access, so there is no need to do this in
doc id order. An OR query has to work in doc id order so that it can
compute a score per doc id, and maintaining that ordering costs some
performance.

Regards,
Paul Elschot
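
P.S. A minimal sketch of the bitset approach described above, in plain
Java with no Lucene dependency. The class name MailFilterSketch, the
method buildFilter and the mailIdToDocId map are all hypothetical: the
map only stands in for looking up each mail id as a term in the mail
index (in Lucene 2.3 that lookup would probably go through
IndexReader.termDocs). The point it illustrates is that the mail ids
can arrive in any order, because setting a bit is a random-access
operation.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MailFilterSketch {

    /**
     * Collects the doc ids of the labelled mails into a BitSet.
     *
     * mailIdToDocId is a hypothetical stand-in for a per-term lookup
     * in the mail index; it is not part of any Lucene API.
     */
    public static BitSet buildFilter(Map<String, Integer> mailIdToDocId,
                                     Iterable<String> labelledMailIds,
                                     int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // Mail ids are processed in whatever order the user data
        // search returned them; no sorting by doc id is needed.
        for (String mailId : labelledMailIds) {
            Integer docId = mailIdToDocId.get(mailId);
            if (docId != null) {   // skip mail ids missing from this index
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Integer> mailIdToDocId = new HashMap<String, Integer>();
        mailIdToDocId.put("mail-a", 0);
        mailIdToDocId.put("mail-b", 1);
        mailIdToDocId.put("mail-c", 2);
        mailIdToDocId.put("mail-d", 3);

        // Result of the (assumed) search against the user data index,
        // deliberately not in doc id order; "mail-x" is not indexed here.
        List<String> labelled = Arrays.asList("mail-d", "mail-a", "mail-x");

        BitSet filter = buildFilter(mailIdToDocId, labelled, 4);
        System.out.println(filter);
    }
}
```

With several mail indexes, one such bitset would be built per index,
since doc ids are only meaningful within a single index.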