[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

Adrien Grand (JIRA) Fri, 12 Sep 2014 07:28:04 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-5938:
---------------------------------
    Attachment: low_freq.tasks
                LUCENE-5938.patch

OK, I did something slightly different. It turns out that all queries in the
tasks file match a fairly large number of documents, which favors FixedBitSet.
So I've now configured a threshold: FixedBitSet is used when more than
maxDoc / 16384 docs match, and SparseFixedBitSet is used otherwise. Since
SparseFixedBitSet is much faster than FixedBitSet at such low densities, the
cost of starting with a SparseFixedBitSet and then upgrading to a FixedBitSet
is negligible compared to starting with a FixedBitSet from the beginning
(see http://people.apache.org/~jpountz/doc_id_sets2.html).
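
To make the threshold logic concrete, here is a rough sketch of the "start sparse, upgrade to dense" idea. This is not the code from the patch: the class name UpgradingDocIdSetBuilder and its methods are hypothetical, and java.util.TreeSet and java.util.BitSet are only stand-ins for SparseFixedBitSet and FixedBitSet so that the example runs without Lucene on the classpath.

```java
import java.util.BitSet;
import java.util.TreeSet;

/** Hypothetical illustration only; not the class from LUCENE-5938.patch. */
class UpgradingDocIdSetBuilder {
    private final int maxDoc;
    private final int threshold;                        // maxDoc / 16384
    private TreeSet<Integer> sparse = new TreeSet<>();  // stand-in for SparseFixedBitSet
    private BitSet dense;                               // stand-in for FixedBitSet

    UpgradingDocIdSetBuilder(int maxDoc) {
        this.maxDoc = maxDoc;
        this.threshold = maxDoc >>> 14;                 // 16384 == 1 << 14
    }

    /** Records a matching doc, upgrading once more than threshold docs match. */
    void add(int doc) {
        if (dense != null) {
            dense.set(doc);
            return;
        }
        sparse.add(doc);
        if (sparse.size() > threshold) {
            upgrade();
        }
    }

    /** Copies the few docs collected so far into a dense bit set. */
    private void upgrade() {
        dense = new BitSet(maxDoc);
        for (int doc : sparse) {
            dense.set(doc);
        }
        sparse = null;
    }

    boolean isDense() {
        return dense != null;
    }

    boolean contains(int doc) {
        return dense != null ? dense.get(doc) : sparse.contains(doc);
    }

    int cardinality() {
        return dense != null ? dense.cardinality() : sparse.size();
    }
}
```

With maxDoc = 1 << 20 the threshold is 64, so the builder stays sparse for the first 64 distinct docs and upgrades on the 65th. The upgrade copies at most threshold + 1 entries, which is why its cost stays small relative to allocating the dense set up front.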

So now the benchmark looks better for those queries that match many documents:
{noformat}
                  IntNRQ        7.10      (6.3%)        6.57      (9.6%)   -7.4% ( -21% -    9%)
                 Prefix3      110.36     (14.8%)      109.88      (9.5%)   -0.4% ( -21% -   28%)
                Wildcard       62.83     (14.5%)       66.93      (9.5%)    6.5% ( -15% -   35%)
{noformat}

I don't think the improvement with {{Wildcard}} is noise; I can reproduce it
easily. I think the reason is that, since filter rewrite is now the default,
we no longer have to compute the terms intersection twice, which is costly
with wildcard queries.

I also wanted to see how this compares to boolean rewrite for queries that
match fewer documents, so I generated a set of wildcard queries that expand
to a couple of terms and don't match too many documents (see the attached
tasks file):

{noformat}
                Wildcard       99.90      (9.0%)      294.66     (30.6%)  194.9% ( 142% -  257%)
{noformat}

For such queries, the new default rewrite method looks much better.

> New DocIdSet implementation with random write access
> ----------------------------------------------------
>
>                 Key: LUCENE-5938
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5938
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, 
> low_freq.tasks
>
>
> We have a great cost API that is supposed to help make decisions about how to
> best execute queries. However, because several of our filter
> implementations (e.g. TermsFilter and BooleanFilter) return FixedBitSets,
> we either use the cost API and make bad decisions, or fall back to
> heuristics that are not as good, such as
> RandomAccessFilterStrategy.useRandomAccess, which decides that random access
> should be used if the first doc in the set is less than 100.
> On the other hand, we also have some nice compressed and cacheable DocIdSet
> implementations, but we cannot make use of them because TermsFilter requires a
> DocIdSet that has random write access, and FixedBitSet is the only DocIdSet
> that we have that supports random access.
> I think it would be nice to replace FixedBitSet in those filters with another
> DocIdSet that would also support random write access but would have a better
> cost.



