[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

Michael McCandless (JIRA) Fri, 15 Jun 2012 09:24:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295748#comment-13295748
 ]


Michael McCandless commented on LUCENE-4069:
--------------------------------------------

I ran a benchmark on a 10 M Wikipedia index. For the factory I used 
createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we 
get the 10% saturation on save:

{noformat}
                Task    QPS base  StdDev base   QPS bloom  StdDev bloom       Pct diff
              Fuzzy1      102.47         3.67       41.95          0.78   -61% -  -56%
              Fuzzy2       38.36         1.76       18.68          0.37   -54% -  -47%
             Respell       89.89         4.38       44.09          0.52   -53% -  -47%
            Wildcard       40.48         2.82       36.20          0.64   -17% -   -2%
        SloppyPhrase        7.96         0.28        8.07          0.07    -3% -    5%
             Prefix3       61.94         5.34       63.35          0.37    -6% -   12%
        TermBGroup1M       71.37         6.79       73.73          1.55    -7% -   16%
          AndHighMed       64.09         5.51       66.73          1.75    -6% -   16%
      TermBGroup1M1P       49.55         3.78       51.75          2.67    -7% -   18%
         AndHighHigh       16.05         1.12       16.77          0.53    -5% -   15%
         TermGroup1M       35.87         3.07       37.56          0.74    -5% -   16%
          OrHighHigh        9.60         1.38       10.15          0.65   -13% -   31%
           OrHighMed       11.93         1.91       12.63          0.93   -15% -   35%
              IntNRQ        9.12         1.25        9.68          0.11    -7% -   24%
                Term      154.55        19.60      165.32          0.97    -5% -   23%
              Phrase       11.40         0.33       12.21          0.18     2% -   11%
            SpanNear        4.31         0.07        4.73          0.03     7% -   12%
            PKLookup      122.78         1.42      145.95          5.22    13% -   24%
{noformat}
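
For anyone reading the Pct diff column: it looks like a worst-case to best-case range rather than a single number. The sketch below is an inference from the rows above, not something stated in this comment. It assumes the worst case compares (bloom QPS - bloom StdDev) against (base QPS + base StdDev), the best case does the opposite, and the result is truncated to a whole percent. The class and method names are made up for illustration.

```java
public class PctDiffRange {

    // Pessimistic bound: slowest plausible bloom run vs. fastest plausible baseline.
    // Assumption: truncation toward zero, which is what the table rows suggest.
    static int worst(double qpsBase, double sdBase, double qpsBloom, double sdBloom) {
        double base = qpsBase + sdBase;
        return (int) (100.0 * ((qpsBloom - sdBloom) - base) / base);
    }

    // Optimistic bound: fastest plausible bloom run vs. slowest plausible baseline.
    static int best(double qpsBase, double sdBase, double qpsBloom, double sdBloom) {
        double base = qpsBase - sdBase;
        return (int) (100.0 * ((qpsBloom + sdBloom) - base) / base);
    }

    public static void main(String[] args) {
        // Fuzzy1 row: prints "-61% - -56%"
        System.out.println(worst(102.47, 3.67, 41.95, 0.78) + "% - "
                + best(102.47, 3.67, 41.95, 0.78) + "%");
        // PKLookup row: prints "13% - 24%"
        System.out.println(worst(122.78, 1.42, 145.95, 5.22) + "% - "
                + best(122.78, 1.42, 145.95, 5.22) + "%");
    }
}
```

Read this way, a row only shows a clear win or loss when both ends of the range have the same sign (e.g. PKLookup, Fuzzy1). Because the table prints QPS and StdDev rounded to two decimals, recomputing a row from the printed values can land one percent off when the true value sits near a whole number.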

Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets 
a good improvement.  This is on an index w/ 5 segments at each level.

Other queries seem to speed up as well (eg Term, Or*).

The queries that rely on Terms.intersect got much worse: maybe the 
BloomFilteredFieldsProducer should just pass through intersect to the delegate?
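
To make the pass-through idea concrete, here is a rough, self-contained sketch. It does not use Lucene's real classes: SimpleTerms, SortedTerms and BloomFilteredTerms are hypothetical stand-ins, the filter uses a single hash function for brevity, and the actual patch may be structured quite differently. The point is only that exact-term lookups consult the filter first, while intersect is forwarded untouched so the delegate can use whatever optimized implementation it has.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

public class PassThroughSketch {

    /** Hypothetical stand-in for a per-field terms dictionary (not Lucene's Terms API). */
    interface SimpleTerms {
        boolean seekExact(String term);
        List<String> intersect(Predicate<String> matcher);
    }

    /** Toy delegate: a sorted in-memory term set. */
    static final class SortedTerms implements SimpleTerms {
        private final TreeSet<String> terms = new TreeSet<>();

        SortedTerms(String... ts) {
            for (String t : ts) terms.add(t);
        }

        public boolean seekExact(String term) {
            return terms.contains(term);
        }

        public List<String> intersect(Predicate<String> matcher) {
            List<String> out = new ArrayList<>();
            for (String t : terms) {
                if (matcher.test(t)) out.add(t);
            }
            return out;
        }

        Iterable<String> all() {
            return terms;
        }
    }

    /** Wrapper: filter-guarded exact lookups, intersect passed straight through. */
    static final class BloomFilteredTerms implements SimpleTerms {
        private static final int BITS = 1 << 16;
        private final BitSet bits = new BitSet(BITS);
        private final SimpleTerms delegate;

        BloomFilteredTerms(SimpleTerms delegate, Iterable<String> indexedTerms) {
            this.delegate = delegate;
            for (String t : indexedTerms) bits.set(slot(t));
        }

        // Single hash function for brevity; a real Bloom filter would use several.
        private static int slot(String term) {
            return Math.floorMod(term.hashCode(), BITS);
        }

        public boolean seekExact(String term) {
            // Fast fail: an unset bit means the term is definitely absent.
            if (!bits.get(slot(term))) return false;
            // A set bit may be a false positive, so the delegate has the final say.
            return delegate.seekExact(term);
        }

        public List<String> intersect(Predicate<String> matcher) {
            // The filter cannot help with multi-term enumeration, so just forward.
            return delegate.intersect(matcher);
        }
    }

    public static void main(String[] args) {
        SortedTerms base = new SortedTerms("apple", "apply", "banana");
        SimpleTerms bloom = new BloomFilteredTerms(base, base.all());
        System.out.println(bloom.seekExact("apple"));
        System.out.println(bloom.seekExact("cherry"));
        System.out.println(bloom.intersect(t -> t.startsWith("app")));
    }
}
```

If the wrapper instead fell back to a generic enumerate-and-filter path for intersect, that might explain the slowdown on the Fuzzy, Respell and Wildcard tasks, but that is a guess that would need checking against the patch.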
                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
