Re: Proposal: Scorer api change

Earwin Burrfoot Tue, 08 Jun 2010 15:44:59 -0700

With your proposed API you HAVE to support arbitrary doc scoring in
every scorer.
This can easily lead to heaps of complex yet rarely used code: most
people will still score only the current doc, and scorers will
invariably grow optimized shortcuts for that case.
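
To make the cost concrete, here is a minimal sketch of the two paths.
It is not Lucene code: SimpleDisjunction is a hypothetical stand-in for
a SHOULD-only BQ scorer, postings are plain sorted int arrays, and each
clause contributes a fixed weight. score() reuses what nextDoc() already
found out, while score(int) has to look the doc up in every clause again.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a SHOULD-only boolean scorer.
// This is not Lucene code; postings are plain sorted int arrays.
final class SimpleDisjunction {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[][] postings;   // one sorted doc-id list per clause
    private final float[] weights;    // fixed score contribution per clause
    private final int[] cursor;       // current position in each list
    private final boolean[] matched;  // which clauses hit the current doc
    private int doc = -1;

    SimpleDisjunction(int[][] postings, float[] weights) {
        this.postings = postings;
        this.weights = weights;
        this.cursor = new int[postings.length];
        this.matched = new boolean[postings.length];
    }

    // Advances to the next doc matched by any clause and records,
    // as a by-product, which clauses matched it.
    int nextDoc() {
        int next = NO_MORE_DOCS;
        for (int i = 0; i < postings.length; i++) {
            // Skip entries at or before the current doc.
            while (cursor[i] < postings[i].length && postings[i][cursor[i]] <= doc) {
                cursor[i]++;
            }
            if (cursor[i] < postings[i].length) {
                next = Math.min(next, postings[i][cursor[i]]);
            }
        }
        for (int i = 0; i < postings.length; i++) {
            matched[i] = cursor[i] < postings[i].length && postings[i][cursor[i]] == next;
        }
        doc = next;
        return doc;
    }

    // Cheap path: reuses the match state gathered by nextDoc().
    float score() {
        float sum = 0f;
        for (int i = 0; i < weights.length; i++) {
            if (matched[i]) sum += weights[i];
        }
        return sum;
    }

    // Arbitrary-doc path: has to search every clause for the target again.
    float score(int target) {
        float sum = 0f;
        for (int i = 0; i < weights.length; i++) {
            if (Arrays.binarySearch(postings[i], target) >= 0) sum += weights[i];
        }
        return sum;
    }
}
```

With real scorers the repeated work is not a binary search but
re-positioning every sub-scorer and re-reading freqs/positions, which
is where the complexity comes from.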


The MG4J approach, on the other hand, does not raise complexity. As a nice
side effect, it lets you switch scoring implementations without
extending or reimplementing the whole query/weight/scorer stack
for all the popular queries (term/boolean/phrase).
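
Roughly what I mean, as a sketch only. The names below (MatchState,
ScoreFunction, FixedMatchState, ScoreFunctions) are made up for
illustration and are neither the Lucene nor the MG4J API; the point is
just that the scoring entity takes the matcher's current state rather
than a doc id, so scoring can be swapped without touching the matcher.

```java
// Hypothetical interfaces for illustration only; these are neither the
// Lucene nor the MG4J API.
interface MatchState {
    int docId();                  // document the matcher is positioned on
    int clauseCount();            // number of sub-clauses in the query
    boolean matched(int clause);  // did this clause match docId()?
    int freq(int clause);         // frequency of the clause in docId()
}

// Pluggable scoring: sees the matcher state, never drives iteration.
interface ScoreFunction {
    float score(MatchState state);
}

// A trivial state positioned on one document, standing in for whatever
// a real matcher has already gathered while iterating.
final class FixedMatchState implements MatchState {
    private final int docId;
    private final int[] freqs;

    FixedMatchState(int docId, int[] freqs) {
        this.docId = docId;
        this.freqs = freqs.clone();
    }

    @Override public int docId() { return docId; }
    @Override public int clauseCount() { return freqs.length; }
    @Override public boolean matched(int clause) { return freqs[clause] > 0; }
    @Override public int freq(int clause) { return freqs[clause]; }
}

final class ScoreFunctions {
    private ScoreFunctions() {}

    // Scores by the number of matching clauses.
    static final ScoreFunction COUNT_MATCHES = state -> {
        int n = 0;
        for (int i = 0; i < state.clauseCount(); i++) {
            if (state.matched(i)) n++;
        }
        return n;
    };

    // Scores by the summed frequencies of the matching clauses.
    static final ScoreFunction SUM_FREQS = state -> {
        int sum = 0;
        for (int i = 0; i < state.clauseCount(); i++) {
            if (state.matched(i)) sum += state.freq(i);
        }
        return sum;
    };
}
```

Both functions run against the same state object, so changing the
scoring does not require a new query/weight/scorer per query type.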

2010/6/9 John Wang <[email protected]>:
> Wouldn't you get that as well with the proposed API?
> You would still be able to iterate over the docs and at that point call
> score with the docid. If you call score() along with iteration, you would
> still get the information, no?
> Making the scorer take a docid allows you to score any docid in the reader
> if the query wants it to. Wouldn't that make it more flexible?
> -John
>
> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <[email protected]> wrote:
>>
>> To compute a score you have to see which of your subqueries did not
>> match, which did, and what are the docfreqs/positions for them.
>> When iterating and calling score() only for the current doc, parts of
>> this data (maybe even all of it, not sure) are already gathered for
>> you. If you allow calling score(int doc) for an arbitrary docId, you'll
>> have to redo this work.
>>
>> 2010/6/8 John Wang <[email protected]>:
>> > Hi Earwin:
>> >
>> >      I am not sure I understand here, e.g. what is the difference
>> > between:
>> >
>> >      float myScoringCode() {
>> >          return computeMyScore(scorer.score());
>> >      }
>> >
>> >      and
>> >
>> >      float myScoringCode() {
>> >          return computeMyScore(
>> >              scorer.score(scorer.getDocIDSetIterator().docID()));
>> >      }
>> >
>> >       In the case of BQ, when you get a hit, would you still be able to
>> > call
>> > subscorer.score(hit)? Why is the point of iteration important for BQ?
>> >
>> >       please elaborate.
>> >
>> > Thanks
>> >
>> > -John
>> >
>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <[email protected]>
>> > wrote:
>> >>
>> >> The problem with your proposal is that, currently, Lucene uses the
>> >> current iteration state to compute the score.
>> >> I.e. it already knows which of the SHOULD BQ clauses matched for the
>> >> current doc, so it's easier to calculate the score.
>> >> If you change API to allow scoring arbitrary documents (even those
>> >> that didn't match the query at all), you're opening a can of worms :)
>> >>
>> >> As an alternative, you can try looking at MG4J sources. As far as I
>> >> understand, their scoring is decoupled from matching, just like you
>> >> (and I bet many more people) want. The matcher is separate, and the
>> >> scoring entity accepts current matcher state instead of document id,
>> >> so you get the best of both worlds.
>> >>
>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <[email protected]> wrote:
>> >> > re: But Scorer is itself an iterator, so what prevents you from
>> >> > calling nextDoc and advance on it without score()
>> >> >
>> >> > Nothing. It is just inefficient to pay the method call overhead just
>> >> > to override score.
>> >> >
>> >> > re: If I were in your shoes, I'd simply provide a Query wrapper. If
>> >> > CSQ is not good enough I'd just develop my own.
>> >> >
>> >> > That is what I am doing. I am just proposing the change (see my first
>> >> > email) as an improvement.
>> >> >
>> >> > re: Scorer is itself an iterator
>> >> >
>> >> > Yes, that is the current definition. The point of the proposal is to
>> >> > change this.
>> >> >
>> >> > -John
>> >> >
>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <[email protected]> wrote:
>> >> >>
>> >> >> Well … I don't know the reason either and always thought Scorer and
>> >> >> Similarity are confusing.
>> >> >>
>> >> >> But Scorer is itself an iterator, so what prevents you from calling
>> >> >> nextDoc and advance on it without score()? And what would the
>> >> >> returned DISI do when nextDoc is called, if not delegate to its subs?
>> >> >>
>> >> >> If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
>> >> >> is not good enough I'd just develop my own.
>> >> >>
>> >> >> But perhaps others think differently?
>> >> >>
>> >> >> Shai
>> >> >>
>> >> >> On Tuesday, June 8, 2010, John Wang <[email protected]> wrote:
>> >> >> > Hi Shai:
>> >> >> >     I am not sure I understand how changing Similarity would solve
>> >> >> > this problem. Wouldn't you need the reader?
>> >> >> >     As for PayloadTermQuery, a payload is not always the most
>> >> >> > efficient way of storing such data, especially when the number of
>> >> >> > terms << numdocs. (I am not sure accessing the payload when you
>> >> >> > iterate is a good idea, but that is another discussion.)
>> >> >> >
>> >> >> >     Yes, what I described is exactly a simple CustomScoreQuery for a
>> >> >> > special use-case. The problem also exists in CustomScoreQuery, where
>> >> >> > nextDoc and advance call the sub-scorers through a wrapper. This can
>> >> >> > be avoided if the Scorer returns an iterator instead.
>> >> >> >
>> >> >> >     Separating scoring and doc iteration is a good idea anyway. I
>> >> >> > don't know why they were combined originally.
>> >> >> > Thanks
>> >> >> > -John
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > So wouldn't it make sense to add some method to Similarity? Which
>> >> >> > receives the doc Id in question maybe ... just thinking here.
>> >> >> >
>> >> >> > Factoring Scorer like you propose would create 3 objects for
>> >> >> > scoring/iterating: Scorer (which really becomes an iterator),
>> >> >> > Similarity and
>> >> >> > CustomScoreFunction ...
>> >> >> >
>> >> >> > Maybe you can use CustomScoreQuery or PayloadTermQuery? It depends on
>> >> >> > how you compute your age decay function (where you pull the data
>> >> >> > about the age of the document from).
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <[email protected]>
>> >> >> > wrote:
>> >> >> > Hi Shai:
>> >> >> >     Similarity in many cases is not sufficient for scoring. For
>> >> >> > example, to implement age decay of a document (very useful for
>> >> >> > corpora like news or tweets), you want to project the raw tfidf
>> >> >> > score onto a time curve, say f(x). To do this, you'd have a custom
>> >> >> > scorer that decorates the underlying scorer from, say, your boolean
>> >> >> > query:
>> >> >> >
>> >> >> >
>> >> >> >     public float score() {
>> >> >> >         return myFunc(innerScorer.score());
>> >> >> >     }
>> >> >> >
>> >> >> >     This is fine, but then you would have to do this as well:
>> >> >> >
>> >> >> >     public int nextDoc() {
>> >> >> >         return innerScorer.nextDoc();
>> >> >> >     }
>> >> >> >
>> >> >> > and also:
>> >> >> >
>> >> >> >     public int advance(int target) {
>> >> >> >         return innerScorer.advance(target);
>> >> >> >     }
>> >> >> >
>> >> >> > The difference here is that nextDoc and advance are called far more
>> >> >> > often than score, and you are introducing an extra method call for
>> >> >> > each of them, which is not insignificant for queries that produce
>> >> >> > large recall sets.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Hope this makes sense.
>> >> >> > Thanks
>> >> >> > -John
>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <[email protected]>
>> >> >> > wrote:
>> >> >> > I'm not sure I understand what you mean - Scorer is a DISI itself,
>> >> >> > and
>> >> >> > the scoring formula is mostly controlled by Similarity.
>> >> >> >
>> >> >> > What will be the benefits of the proposed change?
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Hi guys:
>> >> >> >
>> >> >> >     I'd like to make a proposal to change the Scorer class/API to
>> >> >> > the following:
>> >> >> >
>> >> >> >
>> >> >> > public abstract class Scorer {
>> >> >> >     public abstract DocIdSetIterator getDocIDSetIterator();
>> >> >> >     public abstract float score(int docid);
>> >> >> > }
>> >> >> >
>> >> >> > Reasons:
>> >> >> >
>> >> >> > 1) To build a Scorer from an existing Scorer (e.g. one that produces
>> >> >> > raw scores from tfidf), one would decorate it, and that introduces
>> >> >> > overhead (in function calls) around nextDoc and advance, even if you
>> >> >> > just want to augment the score method, which is called far fewer
>> >> >> > times.
>> >> >> >
>> >> >> > 2) The current contract forces scoring on the current doc of the
>> >> >> > underlying iterator. So once you pass "current", you can no longer
>> >> >> > score it. In one of our use-cases, this is very inconvenient.
>> >> >> >
>> >> >> > What do you think? I can go ahead and open an issue and work on a
>> >> >> > patch if I get some agreement.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> > -John
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> For additional commands, e-mail: [email protected]
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Kirill Zakharenko/Кирилл Захаренко ([email protected])
>> >> Phone: +7 (495) 683-567-4
>> >> ICQ: 104465785
>> >>
>> >>
>> >
>> >
>>
>>
>>
>>
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко ([email protected])
Phone: +7 (495) 683-567-4
ICQ: 104465785

