Michael

Forgive me, I am not familiar with Lucene's internal code. Can you verify whether these suggested changes are correct?

I am changing line 210 of TermsFilter.

  if (result == null) {
      if (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          result = new FixedBitSet(reader.maxDoc());
          // lazy init but don't do it in the hot loop since we could read many docs
          result.set(docs.docID());
      }
  }
  // below commented out
  // while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
  //     result.set(docs.docID());
  // }

This change seems to have very little impact on performance.

It is taking around 25 seconds to look up the documents associated with murmur hash string IDs on an index of 10 million records.

Thanks in advance

Jamie

On 2015/08/10 2:46 PM, Michael McCandless wrote:
OK, indeed, that version has the changes I was thinking of,
specifically optimizing the case when only a single doc contains a
term by inlining that docID into the terms dict.

You should be able to improve on TermsFilter a bit because you know
only 1 doc matches each ID, so after the first segment finds a given
ID you should stop testing other segments.  Also, since you are doing
bulk lookup, you should pre-sort the IDs so it's a sequential scan
through the terms dict.
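
As a rough illustration of those two ideas (sort the IDs once, and stop
looking for an ID as soon as one segment has matched it), here is a
minimal plain-Java sketch. It does not use the Lucene API: each segment
is modeled as a TreeMap from ID term to segment-local doc number, and
the names PkLookupSketch and lookup are hypothetical, chosen only for
this example.

```java
import java.util.*;

public class PkLookupSketch {
    // Assumption: each "segment" is a sorted map from ID term to a
    // segment-local doc number, standing in for a real terms dictionary.
    static Map<String, int[]> lookup(List<TreeMap<String, Integer>> segments,
                                     Collection<String> ids) {
        // Pre-sort (and de-duplicate) the IDs so every segment is probed
        // in term order rather than in random order.
        List<String> pending = new ArrayList<>(new TreeSet<>(ids));
        Map<String, int[]> hits = new HashMap<>();

        for (int seg = 0; seg < segments.size() && !pending.isEmpty(); seg++) {
            TreeMap<String, Integer> terms = segments.get(seg);
            List<String> stillMissing = new ArrayList<>();
            for (String id : pending) {
                Integer doc = terms.get(id);
                if (doc != null) {
                    // Exactly one doc matches each ID, so record it and
                    // never test this ID against later segments.
                    hits.put(id, new int[] { seg, doc });
                } else {
                    stillMissing.add(id);
                }
            }
            // Stays sorted because it was built by iterating a sorted list.
            pending = stillMissing;
        }
        return hits; // id -> {segment index, segment-local doc number}
    }
}
```

The list of pending IDs shrinks after every segment, so later segments
are probed only for IDs that have not been found yet, and the outer loop
exits early once every ID is resolved.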

There is another thread right now, subject "Mapping doc values back to
doc ID (in decent time)", also talking about how to do faster PK
lookups.

Mike McCandless

http://blog.mikemccandless.com
