Michael
Forgive me, I am not familiar with Lucene's internal code. Can you verify
whether these suggested changes are indeed correct?
I am changing line 210 of TermsFilter.
    if (result == null) {
        if (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            result = new FixedBitSet(reader.maxDoc());
            // lazy init but don't do it in the hot loop since we could read many docs
            result.set(docs.docID());
        }
    }
    // below commented out
    // while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
    //     result.set(docs.docID());
    // }
This change seems to have very little impact on performance.
It is taking around 25 seconds to look up the documents associated with
murmur hash string IDs on an index of 10 million records.
Thanks in advance
Jamie
On 2015/08/10 2:46 PM, Michael McCandless wrote:
OK, indeed, that version has the changes I was thinking of,
specifically optimizing the case when only a single doc contains a
term by inlining that docID into the terms dict.
You should be able to improve on TermsFilter a bit because you know
only 1 doc matches each ID, so after the first segment finds a given
ID you should stop testing other segments. Also, since you are doing
bulk lookup, you should pre-sort the IDs so it's a sequential scan
through the terms dict.
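To make the two suggestions above concrete, here is a minimal, Lucene-free sketch. It is only a model under stated assumptions: the Segment class, the docBase field, and the lookup method are hypothetical names invented for illustration, and each segment's terms dict is stood in for by a sorted map from ID to local doc ID. In real Lucene code the per-segment lookup would go through the segment's terms enumeration instead of a map.

```java
import java.util.LinkedHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeSet;

public class PkLookupSketch {

    /** Hypothetical stand-in for one index segment (not a Lucene class). */
    static final class Segment {
        final int docBase;                      // offset of this segment's docs in the whole index
        final SortedMap<String, Integer> terms; // sorted "terms dict": ID -> local doc ID

        Segment(int docBase, SortedMap<String, Integer> terms) {
            this.docBase = docBase;
            this.terms = terms;
        }
    }

    /** Returns ID -> global doc ID for every requested ID that exists in some segment. */
    static Map<String, Integer> lookup(List<Segment> segments, List<String> ids) {
        // Pre-sort the IDs (TreeSet also drops duplicates) so that every segment
        // is probed in ascending term order.
        TreeSet<String> pending = new TreeSet<>(ids);
        Map<String, Integer> hits = new LinkedHashMap<>();

        for (Segment seg : segments) {
            if (pending.isEmpty()) {
                break; // every ID has been resolved, skip the remaining segments
            }
            Iterator<String> it = pending.iterator();
            while (it.hasNext()) {
                String id = it.next();
                Integer localDoc = seg.terms.get(id);
                if (localDoc != null) {
                    hits.put(id, seg.docBase + localDoc);
                    // Assumption: each ID matches exactly one doc, so once it is
                    // found it is never tested against later segments.
                    it.remove();
                }
            }
        }
        return hits;
    }
}
```

The early exit and the shrinking pending set are the parts that rely on the "only 1 doc matches each ID" assumption; if an ID could occur in several segments, this sketch would report only the first occurrence.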
There is another thread right now, subject "Mapping doc values back to
doc ID (in decent time)", also talking about how to do faster PK
lookups.
Mike McCandless
http://blog.mikemccandless.com