[
https://issues.apache.org/jira/browse/LUCENE-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415810#comment-13415810
]
Michael McCandless commented on LUCENE-4225:
--------------------------------------------
bq. I think the slower cases are all explained: the skip interval is crazy, and
lazy-loading the freq blocks should fix IntNRQ. (Though, I don't know how you
get away with AndHighHigh currently.)
Maybe AndHighHigh isn't doing much actual skipping... i.e., the distance
between each doc is probably around the blockSize?
I wonder how much skipping AndMedHigh queries are really
doing either... but I agree we need a smaller skipInterval, since our
"base" skipInterval is so high.
And we should try a smaller block size...
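To make the skipping intuition concrete: with maxDoc documents and a term of docFreq df, the average gap between consecutive docIDs is roughly maxDoc/df. When that gap is near or below the block size (128 here), skip entries recorded only at block starts can rarely jump past a whole block, so conjunctions effectively scan linearly. A back-of-the-envelope sketch (the helper names and index sizes are illustrative, not from the benchmark):

```python
def avg_doc_gap(max_doc: int, doc_freq: int) -> float:
    """Expected gap between consecutive docIDs for a term,
    assuming the docs are roughly uniformly distributed."""
    return max_doc / doc_freq

def skipping_likely_useful(max_doc: int, doc_freq: int, block_size: int = 128) -> bool:
    # With skip entries only at block starts, a skip only saves work when
    # the average jump distance spans more than one block of docs.
    return avg_doc_gap(max_doc, doc_freq) > block_size

# A high-freq term in a 10M-doc index: gap ~ 10 docs, so block-granularity
# skips can't jump anywhere useful (the AndHighHigh situation).
print(skipping_likely_useful(10_000_000, 1_000_000))  # False
# A medium-freq term: gap ~ 10,000 docs, block-level skips pay off.
print(skipping_likely_useful(10_000_000, 1_000))      # True
```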
{quote}
Still, the second benchmark could be confusing: we are mixing concerns,
benchmarking FOR vs vInt and also different index layouts.
Maybe we can benchmark this layout with BulkVInt vs Lucene40 to get a
better idea of just how the index layout is doing?
{quote}
Oh yeah! OK, I cut BulkVInt over to the fixed postings format and compared
it (base) to FOR:
{noformat}
                Task    QPS base  StdDev base    QPS for  StdDev for     Pct diff
        SloppyPhrase        6.90         0.18       6.88        0.17   -5% -   4%
            PKLookup      196.92         4.41     197.38        4.55   -4% -   4%
             Respell       65.25         2.09      65.55        0.80   -3% -   5%
         TermGroup1M       39.07         0.78      39.34        0.94   -3% -   5%
            SpanNear        5.42         0.14       5.48        0.12   -3% -   6%
        TermBGroup1M       44.91         0.44      45.45        0.51    0% -   3%
      TermBGroup1M1P       40.42         0.68      40.95        0.76   -2% -   4%
              Fuzzy2       63.85         1.14      65.01        0.66    0% -   4%
              Phrase       10.23         0.27      10.46        0.33   -3% -   8%
              Fuzzy1       61.89         1.06      63.60        0.61    0% -   5%
              IntNRQ        8.77         0.23       9.02        0.36   -3% -   9%
            Wildcard       29.22         0.40      30.18        0.84    0% -   7%
         AndHighHigh        9.13         0.15       9.49        0.18    0% -   7%
                Term      126.40         0.41     132.48        5.62    0% -   9%
             Prefix3       30.54         0.69      32.21        1.06    0% -  11%
          OrHighHigh        8.69         0.38       9.21        0.37   -2% -  15%
           OrHighMed       28.00         1.15      29.67        1.05   -1% -  14%
          AndHighMed       32.28         0.67      34.29        0.56    2% -  10%
{noformat}
Looks like some small gain over BulkVInt but not much...
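For readers of the table: the Pct diff column is a range, widened by each run's standard deviation, so a range straddling 0% means the difference may be noise. A plausible reconstruction of the computation (an assumption about how luceneutil derives it, not verified against its source) that reproduces the SloppyPhrase row:

```python
def pct_diff_range(base_qps, base_std, cmp_qps, cmp_std):
    """Worst-case and best-case percentage change of cmp vs base,
    shifting each side by its standard deviation."""
    lo = 100.0 * ((cmp_qps - cmp_std) / (base_qps + base_std) - 1.0)
    hi = 100.0 * ((cmp_qps + cmp_std) / (base_qps - base_std) - 1.0)
    return lo, hi

lo, hi = pct_diff_range(6.90, 0.18, 6.88, 0.17)  # SloppyPhrase row
print(int(lo), int(hi))  # -5 4, matching the table's "-5% - 4%"
```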
bq. I like how clean it is without the payloads crap: I still think we probably
need to know up-front if the consumer is going to consume a payload off the
enum for positional queries; without that, it's going to make things like this
really hairy and messy.
I agree! Not looking forward to getting payloads working :)
{quote}
Do you think it's worth it that even for "big terms" we write the last partial
block as vInts the way we do?
Since these terms are going to be biggish anyway (at least enough to fill a
block), this seems not worth the trouble?
{quote}
We could try just leaving partial blocks at the end ... that made me
nervous :) I think there are a lot of terms in the 128 - 256 docFreq
range! But we should try it.
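To see why the 128-256 docFreq range matters here: with the stated default block size of 128, such terms have exactly one full block plus a vInt tail of up to 127 entries, so the tail encoding is far from negligible for them. A sketch of the split (the helper is illustrative, not Lucene code):

```python
def block_layout(doc_freq: int, block_size: int = 128):
    """Return (num_full_blocks, vint_tail_length) for a term's doc deltas."""
    return doc_freq // block_size, doc_freq % block_size

# A term with docFreq in the 128-256 range still carries a sizable vInt tail:
print(block_layout(200))   # (1, 72): one FOR block, 72 docs left as vInts
print(block_layout(127))   # (0, 127): below the block size, pure vInt path
print(block_layout(256))   # (2, 0): fully block-encoded, no tail
```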
{quote}
Instead if we only did this for low-freq terms, the code might even be
clearer/faster, but I guess there would be a downside of
not being able to reuse these enums as much that would hurt e.g.
NIOFSDirectory?
{quote}
Hmm true. We'd need to pair up low and high freq enums? (Like Pulsing).
bq. Thanks for bringing all this back to life... and the new test looks
awesome! I think it will really make our lives a lot easier...
I really want this test to be thorough, so that if it passes on your
new PF, all other tests should too! I know that's overly ambitious
... but when it misses something we should go back and add it.
Because debugging a PF bug when you're in a deep scary stack trace
involving Span*Query is a slow process ... it's too hard to make a new
PF now.
{quote}
I don't like how we double every position in the payloads case to record
whether there is one there, and we shouldn't also
have a condition to indicate whether the length changed. I think practically
it's typically "all or none", e.g. the analysis
process marks a payload like POS or it doesn't, and uses a fixed length across
the whole term or not. So I don't think we
should waste time with this for block encoders, nor should we put this in the
skip data. I think we should just do something
simpler: if payloads are present, we have a block of lengths, with a 0 where
there is no payload. If all the payloads
for the entire term are the same length, mark that length in the term
dictionary and omit the lengths blocks.
We could consider the same approach for offset length.
{quote}
That sounds good!
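The scheme quoted above could look roughly like this (a hedged sketch of the idea, not Lucene code; the function and return shape are mine): keep one length per position with 0 meaning "no payload", and collapse to a single term-level length when every position carries a payload of the same size:

```python
def encode_payload_lengths(payload_lengths):
    """payload_lengths: one entry per position; None means 'no payload'.

    Returns (fixed_length, lengths_block):
      - if every position has a payload of the same length, return
        (that_length, None): the lengths block is omitted and the single
        value would be recorded in the term dictionary;
      - otherwise return (None, per-position lengths, 0 marking 'absent').
    """
    lengths = [0 if n is None else n for n in payload_lengths]
    distinct = set(lengths)
    if len(distinct) == 1 and 0 not in distinct:
        return lengths[0], None
    return None, lengths

print(encode_payload_lengths([4, 4, 4]))     # (4, None): block omitted
print(encode_payload_lengths([4, None, 4]))  # (None, [4, 0, 4])
```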
> New FixedPostingsFormat for less overhead than SepPostingsFormat
> ----------------------------------------------------------------
>
> Key: LUCENE-4225
> URL: https://issues.apache.org/jira/browse/LUCENE-4225
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-4225.patch
>
>
> I've worked out the start of a new postings format that should have
> less overhead for fixed-int[] encoders (For,PFor)... using ideas from
> the old bulk branch, and new ideas from Robert.
> It's only a start: there's no payloads support yet, and I haven't run
> Lucene's tests with it, except for one new test I added that tries to
> be a thorough PostingsFormat tester (to make it easier to create new
> postings formats). It does pass luceneutil's performance test, so
> it's at least able to run those queries correctly...
> Like Lucene40, it uses two files (though once we add payloads it may
> be 3). The .doc file interleaves doc delta and freq blocks, and .pos
> has position delta blocks. Unlike sep, blocks are NOT shared across
> terms; instead, it uses block encoding if there are enough ints to
> encode, else the same Lucene40 vInt format. This means low-freq terms
> (< 128 = current default block size) are always vInts, and high-freq
> terms will have some number of blocks, with a vInt final block.
> Skip points are only recorded at block starts.
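The layout rules in the description above can be summarized in a few lines (a sketch under the stated defaults; the helper names are mine, not Lucene's). Postings store gaps between docIDs rather than absolute values, which is what makes the fixed-int[] blocks compress well, and the encoding choice hinges on whether a term can fill at least one block:

```python
BLOCK_SIZE = 128  # the description's current default block size

def doc_deltas(doc_ids):
    """Convert ascending docIDs to the gaps actually stored on disk."""
    prev, out = 0, []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def choose_encoding(doc_freq: int) -> str:
    # Per the description: low-freq terms (< block size) stay pure vInt;
    # high-freq terms get int blocks followed by a vInt final partial block.
    return "vint" if doc_freq < BLOCK_SIZE else "blocks + vint tail"

print(doc_deltas([3, 7, 20, 21]))  # [3, 4, 13, 1]
print(choose_encoding(100))        # vint
print(choose_encoding(500))        # blocks + vint tail
```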
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira