[
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-5938:
---------------------------------
Attachment: low_freq.tasks
LUCENE-5938.patch
OK, I did something slightly different. It turns out that all queries in the
tasks file match a fairly large number of documents, which favors FixedBitSet.
So I've now configured a threshold: FixedBitSet is used when more than maxDoc /
16384 docs match, and SparseFixedBitSet is used otherwise. Since
SparseFixedBitSet is much faster than FixedBitSet at such low densities, the
cost of starting with a SparseFixedBitSet and then upgrading to a FixedBitSet
is negligible compared to starting with a FixedBitSet from the beginning
(see http://people.apache.org/~jpountz/doc_id_sets2.html).
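To make the threshold concrete, here is a minimal sketch of the "start sparse, upgrade to dense" idea. It is not the code from the patch: the class name ThresholdDocIdSetBuilder and its methods are hypothetical, a TreeSet stands in for SparseFixedBitSet, and java.util.BitSet stands in for FixedBitSet. The only detail taken from the comment above is the maxDoc / 16384 cutoff.

```java
import java.util.BitSet;
import java.util.TreeSet;

/** Hypothetical sketch: collect doc IDs sparsely, upgrade to a dense bit set past a threshold. */
final class ThresholdDocIdSetBuilder {
    private final int maxDoc;
    private final int threshold;                       // maxDoc / 16384
    private TreeSet<Integer> sparse = new TreeSet<>(); // stand-in for SparseFixedBitSet
    private BitSet dense;                              // stand-in for FixedBitSet; null until upgraded

    ThresholdDocIdSetBuilder(int maxDoc) {
        this.maxDoc = maxDoc;
        this.threshold = maxDoc >>> 14;                // same as maxDoc / 16384 for maxDoc >= 0
    }

    /** Random write access: doc IDs may be added in any order. */
    void add(int doc) {
        if (doc < 0 || doc >= maxDoc) {
            throw new IndexOutOfBoundsException("doc=" + doc);
        }
        if (dense != null) {
            dense.set(doc);
            return;
        }
        sparse.add(doc);
        if (sparse.size() > threshold) {
            upgrade();
        }
    }

    /** Copy the sparse contents into a dense bit set and drop the sparse structure. */
    private void upgrade() {
        dense = new BitSet(maxDoc);
        for (int doc : sparse) {
            dense.set(doc);
        }
        sparse = null;
    }

    boolean isDense() {
        return dense != null;
    }

    boolean contains(int doc) {
        return dense != null ? dense.get(doc) : sparse.contains(doc);
    }

    int cardinality() {
        return dense != null ? dense.cardinality() : sparse.size();
    }
}
```

With maxDoc = 1 << 20 the cutoff is 64 docs, so in this sketch the copy performed on upgrade touches at most 65 entries, which is consistent with the upgrade cost being small relative to allocating the dense set up front.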
So now the benchmark looks better for those queries that match many documents:
{noformat}
IntNRQ     7.10  (6.3%)    6.57  (9.6%)   -7.4% ( -21% -    9%)
Prefix3  110.36 (14.8%)  109.88  (9.5%)   -0.4% ( -21% -   28%)
Wildcard  62.83 (14.5%)   66.93  (9.5%)    6.5% ( -15% -   35%)
{noformat}
I don't think the improvement with {{Wildcard}} is noise: I can reproduce it
easily. I think the reason is that, since filter rewrite is now the default, we
don't have to compute the terms intersection twice, which is costly with
wildcard queries.
I also wanted to see how this compares to boolean rewrite on queries that
match fewer documents, so I generated a set of wildcard queries that expand to
a couple of terms and don't match too many documents (see the attached tasks
file):
{noformat}
Wildcard  99.90  (9.0%)  294.66 (30.6%)  194.9% ( 142% -  257%)
{noformat}
For such queries, the new default rewrite method looks much better.
> New DocIdSet implementation with random write access
> ----------------------------------------------------
>
> Key: LUCENE-5938
> URL: https://issues.apache.org/jira/browse/LUCENE-5938
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch,
> low_freq.tasks
>
>
> We have a great cost API that is supposed to help make decisions about how to
> best execute queries. However, because several of our filter
> implementations (e.g. TermsFilter and BooleanFilter) return FixedBitSets,
> either we use the cost API and make bad decisions, or we need to fall back to
> heuristics which are not as good, such as
> RandomAccessFilterStrategy.useRandomAccess, which decides that random access
> should be used if the first doc in the set is less than 100.
> On the other hand, we also have some nice compressed and cacheable DocIdSet
> implementations, but we cannot make use of them because TermsFilter requires a
> DocIdSet that has random write access, and FixedBitSet is the only DocIdSet
> that we have that supports random write access.
> I think it would be nice to replace FixedBitSet in those filters with another
> DocIdSet that would also support random write access but would have a better
> cost.