[
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406992#comment-13406992
]
Robert Muir commented on LUCENE-4100:
-------------------------------------
Hello, thank you for working on this!
I have just taken a rough glance at the code, and I think we should look at
what API changes would make this sort of thing fit better into Lucene and
make it easier to implement.
Random thoughts:
Specifically: what you are doing in the PostingsWriter is similar to
computing impacts (I don't have a copy of the paper, so admittedly I don't
know the exact algorithm you are using). But it seems to me that you are
putting a maxScore (as a float) in the term dictionary metadata for each
term's postings.
With the tool you provide, this works because you have access to e.g. the
segment's length normalization information (your PostingsWriter takes a
reader). But we would have to think about how to give PostingsWriters access
to this on flush... it seems possible to me, though.
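To make that concrete, here is a rough sketch (not your patch's actual code)
of the per-term bookkeeping a PostingsWriter could do if norms were somehow
available at flush time; norms and scoreUpperBound() are made-up stand-ins,
the latter for a real Similarity call:

    // Hypothetical bookkeeping inside a PostingsWriter:
    private float termMaxScore;

    void startTerm() {
      termMaxScore = Float.NEGATIVE_INFINITY;
    }

    void startDoc(int docID, int freq) throws IOException {
      byte norm = norms.get(docID);        // the hard part: norms at flush
      termMaxScore = Math.max(termMaxScore, scoreUpperBound(freq, norm));
      // ... encode the posting as usual ...
    }

    void finishTerm() throws IOException {
      // 4-byte float in the term dictionary metadata, as the patch does:
      metadataOut.writeInt(Float.floatToIntBits(termMaxScore));
    }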
Giving the PostingsWriter full statistics (e.g. docFreq) for the Similarity
computation seems difficult: while I think we could accumulate this stuff in
FreqProxTermsWriter before we flush to the codec, that wouldn't solve the
problem at merge time, so you would have to do a 2-pass merge in the codec
somehow...
But the alternative of splitting the "impact" (tf/norm) from the
document-independent weight (e.g. IDF) isn't that pretty either, because it
limits the scoring systems (Similarity implementations) that could use the
optimization.
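For TF-IDF-like scoring the split would look roughly like this (purely
illustrative; Posting, termPostings, and idf() are made up):

    // Document-dependent "impact": boundable per term with no corpus
    // statistics, so it survives merges trivially.
    float maxImpact = Float.NEGATIVE_INFINITY;
    for (Posting p : termPostings) {            // at index/merge time
      maxImpact = Math.max(maxImpact, p.freq / (float) p.norm);
    }

    // Document-independent weight: computed at query time from up-to-date
    // statistics, then combined into an upper bound for maxscore:
    float termWeight = idf(docFreq, numDocs);   // hypothetical helper
    float upperBound = termWeight * maxImpact;

The catch is exactly what I said above: this assumes score = weight *
impact, which not every Similarity satisfies.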
As many terms will be low frequency (e.g. docFreq=1), I think it's not worth
it to encode the maxScore for these low-freq terms: we could save space by
omitting maxScore for them and just treating it as infinitely large?
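i.e. at write time, something like this (the cutoff is made up):

    // Only spend bytes on terms where skipping can actually pay off:
    if (docFreq < MAXSCORE_CUTOFF) {   // hypothetical cutoff, e.g. 128
      // write nothing; readers treat the bound as Float.POSITIVE_INFINITY
    } else {
      metadataOut.writeInt(Float.floatToIntBits(termMaxScore));
    }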
The opposite problem: is it really optimal to encode maxScore for the entire
term? Or would it be better for high-freq terms to encode maxScore for a
range of postings (e.g. a block)? This way, you could skip over ranges of
postings that cannot compete (rather than limiting the optimization to an
entire term). A codec could put this information into a block header, or at
certain intervals into the skip data, etc.
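Inside the codec's enum that could look something like this (all the names
here are hypothetical, just to show the shape of it):

    // Assuming each postings block's header (or skip entry) carries its
    // own maxScore, the enum can jump straight past hopeless blocks:
    while (currentBlockMaxScore() < minCompetitiveScore) {
      if (!advanceToNextBlock()) {
        return NO_MORE_DOCS;  // no remaining block for this term competes
      }
    }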
Do we really need a full 4-byte float? How well would the algorithm work
with degraded precision: e.g. something like SmallFloat? (I think SmallFloat
currently computes a lower bound; we would have to bump to the next byte to
make it an upper bound.)
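That bump could be as simple as this sketch, using the existing SmallFloat
helpers:

    import org.apache.lucene.util.SmallFloat;

    // One-byte encoding that never under-estimates the true maxScore:
    static byte encodeMaxScore(float maxScore) {
      byte b = SmallFloat.floatToByte315(maxScore);
      // floatToByte315 truncates, so decoding can come back smaller than
      // the original; bump to the next representable value for a true
      // upper bound (guarding against overflow at the top of the range).
      if (SmallFloat.byte315ToFloat(b) < maxScore && b != (byte) 0xFF) {
        b++;
      }
      return b;
    }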
Another idea: it might be nice if this optimization could sit underneath the
codec, such that you don't need a special Scorer. One idea here would be for
your collector to set an attribute on the DocsEnum (maxScore): of course a
normal codec would totally ignore this and proceed as today, but codecs like
this one could return NO_MORE_DOCS when postings for that term can no longer
compete. I'm just not positive whether this algorithm can be refactored in
this way, and it would also require some clean way of getting these
attributes from Collector -> Scorer -> DocsEnum. Currently Scorer is in the
way here :)
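Roughly like this, with that plumbing problem hand-waved away
(MaxScoreAttribute is completely made up, it would also need a matching
MaxScoreAttributeImpl, and I'm assuming DocsEnum exposes attributes() the
way TermsEnum does):

    // Hypothetical attribute, not an existing Lucene API:
    public interface MaxScoreAttribute extends Attribute {
      /** lowest score that can still enter the collector's queue */
      void setMinCompetitiveScore(float score);
      float getMinCompetitiveScore();
    }

    // Collector side: publish the current threshold on the enum.
    MaxScoreAttribute att =
        docsEnum.attributes().addAttribute(MaxScoreAttribute.class);
    att.setMinCompetitiveScore(queueBottomScore);

    // Codec side, inside nextDoc(): a normal codec never looks at this; a
    // maxscore-aware one bails out once the term's bound can't compete:
    if (termMaxScore < att.getMinCompetitiveScore()) {
      return NO_MORE_DOCS;
    }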
Just some random thoughts; I'll try to get a copy of the paper so I have a
better idea of what's going on with this particular optimization...
> Maxscore - Efficient Scoring
> ----------------------------
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/codecs, core/query/scoring, core/search
> Affects Versions: 4.0
> Reporter: Stefan Pohl
> Labels: api-change, patch, performance
> Fix For: 4.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood,
> which I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with
> example queries and lucenebench, Mike McCandless's benchmarking package,
> resulting in very significant speedups.
> This ticket is to get the discussion started on including the
> implementation in Lucene's codebase. Because the technique requires
> awareness from the Lucene user/developer, it seems best for it to become a
> contrib/module package so that its use can be a conscious choice.