[
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406992#comment-13406992
]
Robert Muir commented on LUCENE-4100:
-------------------------------------
Hello, thank you for working on this!
I have just taken a rough glance at the code, and I think we should look at
what API changes would make this sort of thing fit better into Lucene and
make it easier to implement.
Random thoughts:
Specifically: what you are doing in the PostingsWriter is similar to
computing impacts (I don't have a copy of the paper, so admittedly I don't
know the exact algorithm you are using). But it seems to me that you are
putting a maxScore (as a float) in the term dictionary metadata for each
term's postings.
With the tool you provide, this works because you have access to e.g. the
segment's length normalization information (your PostingsWriter takes a
reader). But we would have to think about how to give PostingsWriters access
to this on flush... it seems possible to me, though.
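To make that concrete, here is a rough sketch (not your patch's actual code)
of the per-term bookkeeping a PostingsWriter could do if norms were somehow
available at flush time; norms and scoreUpperBound() are made-up stand-ins,
the latter for a real Similarity call:

    // Hypothetical bookkeeping inside a PostingsWriter:
    private float termMaxScore;

    void startTerm() {
      termMaxScore = Float.NEGATIVE_INFINITY;
    }

    void startDoc(int docID, int freq) throws IOException {
      byte norm = norms.get(docID);        // the hard part: norms at flush
      termMaxScore = Math.max(termMaxScore, scoreUpperBound(freq, norm));
      // ... encode the posting as usual ...
    }

    void finishTerm() throws IOException {
      // 4-byte float in the term dictionary metadata, as the patch does:
      metadataOut.writeInt(Float.floatToIntBits(termMaxScore));
    }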
Giving the PostingsWriter full statistics (e.g. docFreq) for the Similarity
computation seems difficult: while I think we could accumulate this stuff in
FreqProxTermsWriter before we flush to the codec, that wouldn't solve the
problem at merge time, so you would have to do a 2-pass merge in the codec
somehow...
But the alternative of splitting the "impact" (tf/norm) from the
document-independent weight (e.g. IDF) isn't that pretty either, because it
limits the scoring systems (Similarity implementations) that could use the
optimization.
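For TF-IDF-like scoring the split would look roughly like this (purely
illustrative; Posting, termPostings, and idf() are made up):

    // Document-dependent "impact": boundable per term with no corpus
    // statistics, so it survives merges trivially.
    float maxImpact = Float.NEGATIVE_INFINITY;
    for (Posting p : termPostings) {            // at index/merge time
      maxImpact = Math.max(maxImpact, p.freq / (float) p.norm);
    }

    // Document-independent weight: computed at query time from up-to-date
    // statistics, then combined into an upper bound for maxscore:
    float termWeight = idf(docFreq, numDocs);   // hypothetical helper
    float upperBound = termWeight * maxImpact;

The catch is exactly what I said above: this assumes score = weight *
impact, which not every Similarity satisfies.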
As many terms will be low frequency (e.g. docFreq=1), I think it's not worth
it to encode the maxScore for these low-freq terms: we could save space by
omitting maxScore for them and just treating it as infinitely large?
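i.e. at write time, something like this (the cutoff is made up):

    // Only spend bytes on terms where skipping can actually pay off:
    if (docFreq < MAXSCORE_CUTOFF) {   // hypothetical cutoff, e.g. 128
      // write nothing; readers treat the bound as Float.POSITIVE_INFINITY
    } else {
      metadataOut.writeInt(Float.floatToIntBits(termMaxScore));
    }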
The opposite problem: is it really optimal to encode maxScore for the entire
term? Or would it be better for high-freq terms to encode maxScore for a
range of postings (e.g. a block)? This way, you could skip over ranges of
postings that cannot compete (rather than limiting the optimization to an
entire term). A codec could put this information into a block header, or at
certain intervals into the skip data, etc.
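Inside the codec's enum that could look something like this (all the names
here are hypothetical, just to show the shape of it):

    // Assuming each postings block's header (or skip entry) carries its
    // own maxScore, the enum can jump straight past hopeless blocks:
    while (currentBlockMaxScore() < minCompetitiveScore) {
      if (!advanceToNextBlock()) {
        return NO_MORE_DOCS;  // no remaining block for this term competes
      }
    }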
Do we really need a full 4-byte float? How well would the algorithm work
with degraded precision: e.g. something like SmallFloat? (I think SmallFloat
currently computes a lower bound; we would have to bump to the next byte to
make it an upper bound.)
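That bump could be as simple as this sketch, using the existing SmallFloat
helpers:

    import org.apache.lucene.util.SmallFloat;

    // One-byte encoding that never under-estimates the true maxScore:
    static byte encodeMaxScore(float maxScore) {
      byte b = SmallFloat.floatToByte315(maxScore);
      // floatToByte315 truncates, so decoding can come back smaller than
      // the original; bump to the next representable value for a true
      // upper bound (guarding against overflow at the top of the range).
      if (SmallFloat.byte315ToFloat(b) < maxScore && b != (byte) 0xFF) {
        b++;
      }
      return b;
    }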
Another idea: it might be nice if this optimization could sit underneath the
codec, such that you don't need a special Scorer. One idea here would be for
your collector to set an attribute on the DocsEnum (maxScore): of course a
normal codec would totally ignore this and proceed as today, but codecs like
this one could return NO_MORE_DOCS when postings for that term can no longer
compete. I'm just not positive whether this algorithm can be refactored in
this way, and it would also require some clean way of getting these
attributes from Collector -> Scorer -> DocsEnum. Currently Scorer is in the
way here :)
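Roughly like this, with that plumbing problem hand-waved away
(MaxScoreAttribute is completely made up, it would also need a matching
MaxScoreAttributeImpl, and I'm assuming DocsEnum exposes attributes() the
way TermsEnum does):

    // Hypothetical attribute, not an existing Lucene API:
    public interface MaxScoreAttribute extends Attribute {
      /** lowest score that can still enter the collector's queue */
      void setMinCompetitiveScore(float score);
      float getMinCompetitiveScore();
    }

    // Collector side: publish the current threshold on the enum.
    MaxScoreAttribute att =
        docsEnum.attributes().addAttribute(MaxScoreAttribute.class);
    att.setMinCompetitiveScore(queueBottomScore);

    // Codec side, inside nextDoc(): a normal codec never looks at this; a
    // maxscore-aware one bails out once the term's bound can't compete:
    if (termMaxScore < att.getMinCompetitiveScore()) {
      return NO_MORE_DOCS;
    }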
Just some random thoughts; I'll try to get a copy of the paper so I have a
better idea of what's going on with this particular optimization...
> Maxscore - Efficient Scoring
> ----------------------------
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/codecs, core/query/scoring, core/search
> Affects Versions: 4.0
> Reporter: Stefan Pohl
> Labels: api-change, patch, performance
> Fix For: 4.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood,
> which I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with
> example queries and lucenebench, Mike McCandless's benchmarking package,
> resulting in very significant speedups.
> This ticket is to get the discussion started on including the
> implementation in Lucene's codebase. Because the technique requires
> awareness from the Lucene user/developer, it seems best for it to become a
> contrib/module package so that its use can be a conscious choice.