[jira] [Commented] (LUCENE-4225) New FixedPostingsFormat for less overhead than SepPostingsFormat

Michael McCandless (JIRA) Mon, 16 Jul 2012 06:09:38 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415083#comment-13415083 ]


Michael McCandless commented on LUCENE-4225:
--------------------------------------------

Initial results are compelling!  On the 10M doc Wikipedia test,
Sep(For) vs Fixed(For):

{noformat}
                Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
              IntNRQ        8.40        0.83        8.33        0.38  -13% -   15%
         TermGroup1M       46.67        1.51       48.95        0.21    1% -    8%
        TermBGroup1M       79.97        1.96       85.05        0.52    3% -    9%
             Prefix3       68.82        2.62       73.96        2.27    0% -   15%
              Fuzzy2       69.54        2.69       75.55        2.29    1% -   16%
      TermBGroup1M1P       42.67        1.07       46.38        0.86    4% -   13%
              Fuzzy1       85.07        3.34       93.16        2.20    2% -   16%
             Respell       67.30        2.20       74.69        3.87    1% -   20%
                Term      156.81        8.62      180.38        6.83    4% -   26%
            Wildcard       42.55        1.13       50.97        0.87   14% -   25%
          OrHighHigh        8.66        0.77       10.46        0.59    4% -   40%
           OrHighMed       15.62        1.54       18.93        1.05    4% -   41%
          AndHighMed       45.80        1.69       57.18        0.80   18% -   31%
            SpanNear        7.59        0.32        9.95        0.14   23% -   38%
         AndHighHigh       11.09        0.32       14.68        0.15   27% -   37%
            PKLookup      143.83        2.80      195.40        4.13   30% -   41%
              Phrase       15.53        1.15       21.34        0.18   26% -   49%
        SloppyPhrase        5.94        0.49        8.74        0.24   32% -   64%
{noformat}

And Fixed(For) vs Lucene40:

{noformat}
                Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
          AndHighMed       60.07        1.69       44.20        1.17  -30% -  -22%
              Phrase       11.97        0.60        9.61        0.20  -25% -  -13%
              IntNRQ        9.77        0.46        8.93        0.38  -16% -    0%
              Fuzzy2       49.08        1.33       48.72        1.08   -5% -    4%
             Respell       61.33        1.52       60.90        1.41   -5% -    4%
            SpanNear        7.72        0.20        7.74        0.07   -3% -    3%
            PKLookup      194.64        3.03      197.83        3.27   -1% -    4%
        SloppyPhrase        4.76        0.19        4.93        0.11   -2% -   10%
              Fuzzy1       63.49        1.07       66.57        1.53    0% -    9%
         TermGroup1M       53.91        1.40       58.24        1.27    3% -   13%
             Prefix3       61.02        1.72       66.14        2.11    2% -   15%
            Wildcard       51.27        1.40       56.26        1.78    3% -   16%
      TermBGroup1M1P       29.65        0.98       32.77        0.79    4% -   17%
        TermBGroup1M       34.37        1.16       38.07        1.14    3% -   18%
                Term       24.98        1.32       28.13        3.31   -5% -   32%
         AndHighHigh       17.08        0.69       19.42        0.52    6% -   21%
          OrHighHigh       10.68        0.40       12.52        0.94    4% -   30%
           OrHighMed       13.66        0.52       16.65        1.34    7% -   36%
{noformat}
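
For readers checking the "Pct diff" column: the ranges printed above are
consistent with a conservative comparison that shifts each mean QPS by one
standard deviation in opposite directions and truncates the result toward
zero. This is inferred from the numbers in the tables, not taken from the
benchmark's source, so treat the sketch below as an assumption. The class
and method names (PctDiffSketch, lowPct, highPct, range) are hypothetical.

```java
final class PctDiffSketch {
    /** Pessimistic bound: candidate one StdDev below its mean, base one StdDev above. */
    static double lowPct(double qpsBase, double sdBase, double qpsCand, double sdCand) {
        return 100.0 * ((qpsCand - sdCand) / (qpsBase + sdBase) - 1.0);
    }

    /** Optimistic bound: candidate one StdDev above its mean, base one StdDev below. */
    static double highPct(double qpsBase, double sdBase, double qpsCand, double sdCand) {
        return 100.0 * ((qpsCand + sdCand) / (qpsBase - sdBase) - 1.0);
    }

    /** Formats the range; the (int) casts truncate toward zero, as the tables appear to do. */
    static String range(double qpsBase, double sdBase, double qpsCand, double sdCand) {
        return String.format("%4d%% - %4d%%",
                (int) lowPct(qpsBase, sdBase, qpsCand, sdCand),
                (int) highPct(qpsBase, sdBase, qpsCand, sdCand));
    }

    public static void main(String[] args) {
        // IntNRQ row of the first table and AndHighMed row of the second table.
        System.out.println("IntNRQ     " + range(8.40, 0.83, 8.33, 0.38));
        System.out.println("AndHighMed " + range(60.07, 1.69, 44.20, 1.17));
    }
}
```

Under this reading, a row is a clear win only when both ends of the range
are positive, and a clear loss only when both are negative.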

So we are still slower than Lucene40 in some cases, but a lot closer
than with Sep!

But these are early results, and the postings format doesn't pass tests
yet, so treat them with caution.


                
> New FixedPostingsFormat for less overhead than SepPostingsFormat
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4225
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4225
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4225.patch
>
>
> I've worked out the start of a new postings format that should have
> less overhead for fixed-int[] encoders (For,PFor)... using ideas from
> the old bulk branch, and new ideas from Robert.
> It's only a start: there's no payloads support yet, and I haven't run
> Lucene's tests with it, except for one new test I added that tries to
> be a thorough PostingsFormat tester (to make it easier to create new
> postings formats).  It does pass luceneutil's performance test, so
> it's at least able to run those queries correctly...
> Like Lucene40, it uses two files (though once we add payloads it may
> be 3).  The .doc file interleaves doc delta and freq blocks, and .pos
> has position delta blocks.  Unlike Sep, blocks are NOT shared across
> terms; instead, it uses block encoding if there are enough ints to
> encode, otherwise the same vInt format as Lucene40.  This means
> low-freq terms (< 128 = current default block size) are always vInts,
> and high-freq terms will have some number of blocks, with a vInt final
> block.
> Skip points are only recorded at block starts.
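
To make the per-term layout described above concrete, here is a minimal
sketch of how a term's doc deltas might split into full blocks plus a vInt
tail. It is not code from the attached patch: the class and method names
are hypothetical, the block size of 128 is the default quoted in the
description, and the assumption that a term whose docFreq is an exact
multiple of the block size has an empty tail is mine. The vInt encoder
follows the usual Lucene convention (7 data bits per byte, low-order bits
first, high bit set when more bytes follow).

```java
import java.io.ByteArrayOutputStream;

final class FixedBlockLayoutSketch {
    /** Default block size quoted in the issue description. */
    static final int BLOCK_SIZE = 128;

    /** Number of full fixed-size blocks a term with this docFreq would get. */
    static int fullBlocks(int docFreq) {
        return docFreq / BLOCK_SIZE;
    }

    /** Number of leftover ints that would be written as vInts after the blocks. */
    static int vIntTailLength(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    /** Writes one non-negative int as a vInt: 7 bits per byte, low-order first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit marks a continuation byte
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes a run of ints (for example, a tail of doc deltas) as vInts. */
    static byte[] encodeVInts(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            writeVInt(out, v);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int docFreq : new int[] {100, 128, 300}) {
            System.out.println("docFreq=" + docFreq
                    + " fullBlocks=" + fullBlocks(docFreq)
                    + " vIntTail=" + vIntTailLength(docFreq));
        }
    }
}
```

With these assumptions a term with docFreq 100 is all vInts, while a term
with docFreq 300 gets two full blocks followed by 44 vInts.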
