With your proposed API, every scorer HAS to support scoring arbitrary docs. That can easily lead to heaps of complex yet rarely used code: most people will still use the score-only-current-doc approach, and that will invariably produce optimized shortcuts for it.
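To make the extra work concrete, here is a minimal sketch in plain Java. It is not Lucene code: `TermIterator`, `OrScorer`, `weightOf` and the `seeks` counter are hypothetical names invented for this illustration, and a sorted array stands in for a posting list. `score()` reuses what iteration has already positioned, while `score(int doc)` has to look the doc up again in every clause.

```java
import java.util.Arrays;

public class ScoringSketch {

    /** Hypothetical posting list: sorted doc ids with one weight per doc. */
    static final class TermIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs;
        private final float[] weights;
        private int pos = -1;
        int seeks = 0; // counts from-scratch lookups, to make the redone work visible

        TermIterator(int[] docs, float[] weights) {
            this.docs = docs;
            this.weights = weights;
        }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            pos++;
            return docID();
        }

        /** Weight of the doc the iterator is currently positioned on. */
        float weight() {
            return weights[pos];
        }

        /** Random access: ignores iteration state and searches for the doc again. */
        float weightOf(int doc) {
            seeks++;
            int i = Arrays.binarySearch(docs, doc);
            return i >= 0 ? weights[i] : 0f;
        }
    }

    /** Toy disjunction over several clauses (assumption: score = sum of matching weights). */
    static final class OrScorer {
        private final TermIterator[] subs;
        private int current = -1;

        OrScorer(TermIterator... subs) {
            this.subs = subs;
        }

        int nextDoc() {
            int min = TermIterator.NO_MORE_DOCS;
            for (TermIterator s : subs) {
                if (s.docID() == current) s.nextDoc(); // move past the doc just consumed
                min = Math.min(min, s.docID());
            }
            current = min;
            return current;
        }

        /** Scores the current doc; only valid while positioned on a match. */
        float score() {
            float sum = 0f;
            for (TermIterator s : subs) {
                if (s.docID() == current) sum += s.weight(); // match state is already known
            }
            return sum;
        }

        /** Scores an arbitrary doc; every clause must be consulted from scratch. */
        float score(int doc) {
            float sum = 0f;
            for (TermIterator s : subs) sum += s.weightOf(doc);
            return sum;
        }
    }

    static float sumByIteration(OrScorer scorer) {
        float total = 0f;
        while (scorer.nextDoc() != TermIterator.NO_MORE_DOCS) total += scorer.score();
        return total;
    }

    public static void main(String[] args) {
        TermIterator a = new TermIterator(new int[] {1, 3, 5}, new float[] {1f, 1f, 1f});
        TermIterator b = new TermIterator(new int[] {3, 4}, new float[] {2f, 2f});
        OrScorer scorer = new OrScorer(a, b);
        System.out.println("sum over matches: " + sumByIteration(scorer));
        System.out.println("score(3): " + scorer.score(3) + ", lookups: " + (a.seeks + b.seeks));
    }
}
```

In this toy the redone work is one binary search per clause. With real postings it would also mean re-reading frequencies and positions, which is the part that is hard to do cheaply for an arbitrary doc.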
The MG4J approach, on the other hand, does not raise complexity. As a nice side effect, it allows switching scoring implementations without extending or reimplementing from scratch the whole stack of query/weight/scorer for all popular queries (term/boolean/phrase).

2010/6/9 John Wang <[email protected]>:
> Wouldn't you get it as well with proposed api?
> You would still be able to iterate the doc and at that point call score with
> the docid. If you call score() along with iteration, you would still get the
> information no?
> Making scorer take a docid allows you score any docid in the reader if the
> query wants it to. Wouldn't it make it more flexible?
> -John
>
> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <[email protected]> wrote:
>>
>> To compute a score you have to see which of your subqueries did not
>> match, which did, and what are the docfreqs/positions for them.
>> When iterating, and calling score() only for current doc - parts of
>> this data (maybe even all of it, not sure) is already gathered for
>> you. If you allow calling score(int doc) - for arbitrary docId, you'll
>> have to redo this work.
>>
>> 2010/6/8 John Wang <[email protected]>:
>> > Hi Earwin:
>> >
>> > I am not sure I understand here, e.g. what is the difference
>> > between:
>> >
>> > float myscorinCode(){
>> >   return computeMyScore(scorer.score());
>> > }
>> >
>> > and
>> >
>> > float myscorinCode(){
>> >   return computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID()));
>> > }
>> >
>> > In the case of BQ, when you get a hit, would you still be able to call
>> > subscorer.score(hit)? Why is the point of iteration important for BQ?
>> >
>> > please elaborate.
>> >
>> > Thanks
>> >
>> > -John
>> >
>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <[email protected]>
>> > wrote:
>> >>
>> >> The problem with your proposal is that, currently, Lucene uses current
>> >> iteration state to compute score.
>> >> I.e.
>> >> it already knows which of SHOULD BQ clauses matched for current
>> >> doc, so it's easier to calculate the score.
>> >> If you change API to allow scoring arbitrary documents (even those
>> >> that didn't match the query at all), you're opening a can of worms :)
>> >>
>> >> As an alternative, you can try looking at MG4J sources. As far as I
>> >> understand, their scoring is decoupled from matching, just like you
>> >> (and I bet many more people) want. The matcher is separate, and the
>> >> scoring entity accepts current matcher state instead of document id,
>> >> so you get the best of both worlds.
>> >>
>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <[email protected]> wrote:
>> >> > re: But Scorer is itself an iterator, so what prevents you from
>> >> > calling nextDoc and advance on it without score()
>> >> >
>> >> > Nothing. It is just inefficient to pay the method call overhead just
>> >> > to overload score.
>> >> >
>> >> > re: If I were in your shoes, I'd simply provide a Query wrapper. If
>> >> > CSQ is not good enough I'd just develop my own.
>> >> >
>> >> > That is what I am doing. I am just proposing the change (see my first
>> >> > email) as an improvement.
>> >> >
>> >> > re: Scorer is itself an iterator
>> >> >
>> >> > yes, that is the current definition. The point of the proposal is to
>> >> > make this change.
>> >> >
>> >> > -John
>> >> >
>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <[email protected]> wrote:
>> >> >>
>> >> >> Well … I don't know the reason as well and always thought Scorer and
>> >> >> Similarity are confusing.
>> >> >>
>> >> >> But Scorer is itself an iterator, so what prevents you from calling
>> >> >> nextDoc and advance on it without score(). And what would the
>> >> >> returned DISI do when nextDoc is called, if not delegate to its subs?
>> >> >>
>> >> >> If I were in your shoes, I'd simply provide a Query wrapper.
>> >> >> If CSQ is not good enough I'd just develop my own.
>> >> >>
>> >> >> But perhaps others think differently?
>> >> >>
>> >> >> Shai
>> >> >>
>> >> >> On Tuesday, June 8, 2010, John Wang <[email protected]> wrote:
>> >> >> > Hi Shai:
>> >> >> > I am not sure I understand how changing Similarity would solve
>> >> >> > this problem, wouldn't you need the reader?
>> >> >> > As for PayloadTermQuery, payload is not always the most
>> >> >> > efficient way of storing such data, especially when number of
>> >> >> > terms << numdocs. (I am not sure accessing the payload when you
>> >> >> > iterate is a good idea, but that is another discussion)
>> >> >> >
>> >> >> > Yes, what I described is exactly a simple CustomScoreQuery for
>> >> >> > a special use-case. The problem is also in CustomScoreQuery, where
>> >> >> > nextDoc and advance are calling the sub-scorers as a wrapper. This
>> >> >> > can be avoided if the Scorer returns an iterator instead.
>> >> >> >
>> >> >> > Separating scoring and doc iteration is a good idea anyway. I
>> >> >> > don't know the reason to combine them originally.
>> >> >> > Thanks
>> >> >> > -John
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > So wouldn't it make sense to add some method to Similarity? Which
>> >> >> > receives the doc Id in question maybe ... just thinking here.
>> >> >> >
>> >> >> > Factoring Scorer like you propose would create 3 objects for
>> >> >> > scoring/iterating: Scorer (which really becomes an iterator),
>> >> >> > Similarity and CustomScoreFunction ...
>> >> >> >
>> >> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends
>> >> >> > how you compute your age decay function (where you pull the data
>> >> >> > about the age of the document).
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <[email protected]>
>> >> >> > wrote:
>> >> >> > Hi Shai:
>> >> >> > Similarity in many cases is not sufficient for scoring. For
>> >> >> > example, to implement age decaying of a document (very useful for
>> >> >> > corpuses like news or tweets), you want to project the raw tfidf
>> >> >> > score onto a time curve, say f(x). To do this, you'd have a custom
>> >> >> > scorer that decorates the underlying scorer from your, say, boolean
>> >> >> > query:
>> >> >> >
>> >> >> > public float score(){ return myFunc(innerScorer.score());}
>> >> >> >
>> >> >> > This is fine, but then you would have to do this as well:
>> >> >> >
>> >> >> > public int nextDoc(){ return innerScorer.nextDoc();}
>> >> >> >
>> >> >> > and also:
>> >> >> >
>> >> >> > public int advance(int target){ return innerScorer.advance(target);}
>> >> >> >
>> >> >> > The difference here is that nextDoc and advance are called far
>> >> >> > more times than score. And you are introducing an extra method call
>> >> >> > for them, which is not insignificant for queries that result in
>> >> >> > large recall sets.
>> >> >> >
>> >> >> > Hope this makes sense.
>> >> >> > Thanks
>> >> >> > -John
>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <[email protected]>
>> >> >> > wrote:
>> >> >> > I'm not sure I understand what you mean - Scorer is a DISI itself,
>> >> >> > and the scoring formula is mostly controlled by Similarity.
>> >> >> >
>> >> >> > What will be the benefits of the proposed change?
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > Hi guys:
>> >> >> >
>> >> >> > I'd like to make a proposal to change the Scorer class/api to
>> >> >> > the following:
>> >> >> >
>> >> >> > public abstract class Scorer{
>> >> >> >   public abstract DocIdSetIterator getDocIDSetIterator();
>> >> >> >   public abstract float score(int docid);
>> >> >> > }
>> >> >> >
>> >> >> > Reasons:
>> >> >> >
>> >> >> > 1) To build a Scorer from an existing Scorer (e.g. that produces
>> >> >> > raw scores from tfidf), one would decorate it, and it would
>> >> >> > introduce overhead (in function calls) around nextDoc and advance,
>> >> >> > even if you just want to augment the score method which is called
>> >> >> > much fewer times.
>> >> >> >
>> >> >> > 2) The current contract forces scoring on the currentDoc in the
>> >> >> > underlying iterator. So once you pass "current", you can no longer
>> >> >> > score. In one of our use-cases, it is very inconvenient.
>> >> >> >
>> >> >> > What do you think? I can go ahead and open an issue and work on a
>> >> >> > patch if I get some agreement.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> > -John

--
Kirill Zakharenko/Кирилл Захаренко ([email protected])
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
