[
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-5938:
---------------------------------
Attachment: LUCENE-5938.patch
I have been trying to think about how to do it and here is a first iteration.
This DocIdSet, called SparseFixedBitSet is similar to FixedBitSet except that
it only stores words that have at least one bit that is set. It is inspired
from hash array mapped tries, and uses Long.bitCount/numberOfTrailingZeros
operations in order to lookup words given an index.
Some more notes:
- like FixedBitSet it supports random access and implements the Bits interface
- although not completely accurate, its cost() method should return numbers
that are very close to the set cardinality, especially on sparse sets that have
bits uniformly dispersed
- memory usage is much better on sparse sets compared to FixedBitSet
- it supports random write access, so could be used eg. in TermsFilter
I re-ran the benchmark that I had for our doc-id sets with this new impl and it
seems to perform very well: http://people.apache.org/~jpountz/doc_id_sets2.html
> New DocIdSet implementation with random write access
> ----------------------------------------------------
>
> Key: LUCENE-5938
> URL: https://issues.apache.org/jira/browse/LUCENE-5938
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Attachments: LUCENE-5938.patch
>
>
> We have a great cost API that is supposed to help make decisions about how to
> best execute queries. However, due to the fact that several of our filter
> implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets,
> either we use the cost API and make bad decisions, or need to fall back to
> heuristics which are not as good such as
> RandomAccessFilterStrategy.useRandomAccess which decides that random access
> should be used if the first doc in the set is less than 100.
> On the other hand, we also have some nice compressed and cacheable DocIdSet
> implementation but we cannot make use of them because TermsFilter requires a
> DocIdSet that has random write access, and FixedBitSet is the only DocIdSet
> that we have that supports random access.
> I think it would be nice to replace FixedBitSet in those filters with another
> DocIdSet that would also support random write access but would have a better
> cost?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]