[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Robert Muir (JIRA) Tue, 16 Dec 2014 18:50:30 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249360#comment-14249360 ]


Robert Muir commented on LUCENE-2878:
-------------------------------------

I ran the benchmark:
{noformat}
                    Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
        HighSloppyPhrase       13.40     (10.4%)       10.31      (5.5%)  -23.1% ( -35% -   -7%)
              HighPhrase       17.98      (5.6%)       13.92      (3.1%)  -22.6% ( -29% -  -14%)
               MedPhrase      257.86      (7.1%)      213.38      (3.6%)  -17.2% ( -26% -   -7%)
               LowPhrase       35.68      (1.8%)       33.32      (1.7%)   -6.6% (  -9% -   -3%)
         MedSloppyPhrase       15.79      (4.2%)       14.92      (3.6%)   -5.5% ( -12% -    2%)
         LowSloppyPhrase      118.09      (2.4%)      112.14      (2.0%)   -5.0% (  -9% -    0%)
                HighTerm      138.18     (10.2%)      136.72      (6.7%)   -1.1% ( -16% -   17%)
                 MedTerm      202.67      (9.6%)      200.94      (6.3%)   -0.9% ( -15% -   16%)
            HighSpanNear      144.67      (4.3%)      144.35      (4.3%)   -0.2% (  -8% -    8%)
             MedSpanNear      143.52      (3.9%)      143.30      (4.0%)   -0.2% (  -7% -    8%)
                 Respell       85.33      (1.8%)       85.32      (2.6%)   -0.0% (  -4% -    4%)
                 LowTerm     1052.81      (8.5%)     1053.59      (5.3%)    0.1% ( -12% -   15%)
             LowSpanNear       27.81      (2.9%)       27.83      (2.9%)    0.1% (  -5% -    6%)
                 Prefix3      232.97      (4.6%)      233.55      (4.5%)    0.3% (  -8% -    9%)
             AndHighHigh       90.67      (1.7%)       91.01      (1.1%)    0.4% (  -2% -    3%)
                  Fuzzy1      102.98      (2.1%)      103.38      (3.5%)    0.4% (  -5% -    6%)
              AndHighLow     1121.50      (4.8%)     1126.02      (3.9%)    0.4% (  -7% -    9%)
              AndHighMed      127.28      (2.0%)      127.88      (1.1%)    0.5% (  -2% -    3%)
                  Fuzzy2       68.39      (2.1%)       68.77      (3.1%)    0.5% (  -4% -    5%)
                Wildcard       48.08      (2.4%)       48.43      (4.2%)    0.7% (  -5% -    7%)
                  IntNRQ        9.69      (5.8%)        9.79      (7.2%)    1.1% ( -11% -   15%)
            OrNotHighLow       67.55      (8.1%)       68.88      (7.8%)    2.0% ( -12% -   19%)
            OrNotHighMed       61.00      (8.3%)       62.38      (8.0%)    2.3% ( -12% -   20%)
           OrNotHighHigh       35.44      (9.5%)       36.50      (9.5%)    3.0% ( -14% -   24%)
           OrHighNotHigh       25.97      (9.6%)       26.80      (9.7%)    3.2% ( -14% -   24%)
            OrHighNotMed       82.14     (10.1%)       84.84     (10.2%)    3.3% ( -15% -   26%)
              OrHighHigh       29.25     (10.3%)       30.27     (10.5%)    3.5% ( -15% -   27%)
            OrHighNotLow      104.15     (10.3%)      107.82     (10.5%)    3.5% ( -15% -   27%)
               OrHighMed       65.67     (10.4%)       68.01     (10.7%)    3.6% ( -15% -   27%)
               OrHighLow       63.61     (10.6%)       65.91     (10.7%)    3.6% ( -16% -   27%)
{noformat}

We should look into the regressions for phrases. But first I need to work on 
LUCENE-6117, it is killing me :)
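
For anyone reading the table: the "Pct diff" column appears to be the relative
change in QPS from trunk to patch. The sketch below assumes that definition and
recomputes it for the six phrase rows. The helper name pct_diff is hypothetical
(it is not part of the benchmark tool), and the range shown in parentheses in
the table is not reproduced here.

```python
# Phrase-query rows copied from the benchmark table: (task, QPS trunk, QPS patch).
ROWS = [
    ("HighSloppyPhrase", 13.40, 10.31),
    ("HighPhrase", 17.98, 13.92),
    ("MedPhrase", 257.86, 213.38),
    ("LowPhrase", 35.68, 33.32),
    ("MedSloppyPhrase", 15.79, 14.92),
    ("LowSloppyPhrase", 118.09, 112.14),
]


def pct_diff(trunk_qps, patch_qps):
    """Relative QPS change from trunk to patch, in percent (assumed definition)."""
    return (patch_qps - trunk_qps) / trunk_qps * 100.0


# Print one line per task with the recomputed percentage.
for task, trunk, patch in ROWS:
    print(f"{task:>16} {pct_diff(trunk, patch):6.1f}%")
```

A negative value means the patch serves fewer queries per second than trunk for
that task.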

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>              Labels: gsoc2014
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, 
> LUCENE-2878_trunk.patch, PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: those that can 
> make use of positions (mainly spans) and payloads (spans), and those that 
> cannot. Yet Span*Query doesn't really do scoring comparable to what other 
> queries do, and at the end of the day they duplicate a lot of code all over 
> Lucene. Span*Queries are also limited to other Span*Query instances, so you 
> cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All these problems have bugged me for a 
> while now, so I started working on this using the bulkpostings API. I would 
> have done the first cut on trunk, but TermScorer there works on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer. To prevent 
> unnecessary positions enums, I added ScorerContext#needsPositions and 
> eventually Scorere#needsPayloads to create the corresponding enum on demand. 
> Yet currently only TermQuery / TermScorer implements this API; the others 
> simply return null instead. 
> To show that the API really works and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementation got some exercise and now :) all works 
> with positions, while Payloads for bulk reading are kind of experimental in 
> the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

