On Monday 13 October 2008 17:00:06, Andrzej Bialecki wrote:
> Renaud Delbru wrote:
> > Hi Andrzej,
> >
> > sorry for the late reply.
> >
> > I have looked at the code. As far as I understand, you sort the
> > posting lists based on the first doc skip. The first posting list
> > will be the one that has the biggest first document skip.
> > Is the sparseness of posting lists a good predictor for sampling
> > and ordering posting lists? Do you know of any evaluation of such
> > a technique?
>
> It is _some_ predictor ... :) whether it's a good one is another
> question. It's certainly very inexpensive - we don't do any
> additional IO except what we have to do anyway, which is
> scorer.skipTo().
>
> In the general case it's costly to calculate the frequency (or
> sparseness) of matches in a scorer without actually running the
> scorer through all its matches.
>
> > In order to implement sorting based on frequency, we need the
> > document frequency of each term. This information should be
> > propagated through the Scorer classes (from TermScorer to higher
> > level classes such as ConjunctiveScorer). This will require a call to
> > IndexReader.docFreq(term) for each of the term queries. Does a docFreq
> > call mean another IO access?
>
> It sounds like you plan to order scorers by term frequency ... but in
> the general case they won't all be TermScorers, so the frequency of
> documents matching a scorer won't have any particular connection to a
> single term freq.

This could be done, but since not all scorers will be TermScorers it
will be necessary to add a method to Scorer (or perhaps even to its
DocIdSetIterator superclass):

    public abstract int estimatedDocFreq();

and implement this for all existing instances. TermScorer could
implement it without estimating. For AND/OR/NOT such an estimation is
straightforward but for proximity queries it would be more of a guess.

Regards,
Paul Elschot
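
A minimal sketch of how the estimatedDocFreq() idea above might look. It
is an illustration only, not Lucene code: EstScorer, TermEst, AndEst,
OrEst, NotEst and EstimateDemo are hypothetical stand-ins, and the
AND/OR/NOT rules are simple upper bounds chosen for the example. The
term count is assumed to be the value IndexReader.docFreq(term) would
return, and maxDoc is assumed to be the number of documents in the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Scorer; only the proposed estimate is modelled.
abstract class EstScorer {
    /** Estimated number of documents this scorer will match. */
    public abstract int estimatedDocFreq();
}

// A term scorer needs no estimating: it reports the exact document
// frequency (assumed to come from IndexReader.docFreq(term)).
final class TermEst extends EstScorer {
    private final int docFreq;
    TermEst(int docFreq) { this.docFreq = docFreq; }
    @Override public int estimatedDocFreq() { return docFreq; }
}

// AND: cannot match more documents than its sparsest clause (upper bound).
final class AndEst extends EstScorer {
    private final List<EstScorer> clauses;
    AndEst(EstScorer... clauses) { this.clauses = Arrays.asList(clauses); }
    @Override public int estimatedDocFreq() {
        if (clauses.isEmpty()) return 0;
        int min = Integer.MAX_VALUE;
        for (EstScorer s : clauses) min = Math.min(min, s.estimatedDocFreq());
        return min;
    }
}

// OR: at most the sum of its clauses, capped at the index size (upper bound).
final class OrEst extends EstScorer {
    private final int maxDoc;
    private final List<EstScorer> clauses;
    OrEst(int maxDoc, EstScorer... clauses) {
        this.maxDoc = maxDoc;
        this.clauses = Arrays.asList(clauses);
    }
    @Override public int estimatedDocFreq() {
        long sum = 0; // long avoids int overflow before the cap is applied
        for (EstScorer s : clauses) sum += s.estimatedDocFreq();
        return (int) Math.min(sum, maxDoc);
    }
}

// NOT: modelled here as the complement within the index (an assumption).
final class NotEst extends EstScorer {
    private final int maxDoc;
    private final EstScorer inner;
    NotEst(int maxDoc, EstScorer inner) { this.maxDoc = maxDoc; this.inner = inner; }
    @Override public int estimatedDocFreq() {
        return Math.max(0, maxDoc - inner.estimatedDocFreq());
    }
}

final class EstimateDemo {
    // Returns a copy of the clauses ordered sparsest first, so that a
    // conjunction could drive skipTo() from the clause expected to match least.
    static List<EstScorer> sparsestFirst(List<EstScorer> clauses) {
        List<EstScorer> sorted = new ArrayList<>(clauses);
        sorted.sort(Comparator.comparingInt(EstScorer::estimatedDocFreq));
        return sorted;
    }
}
```

A proximity scorer would not fit these rules. Reusing the AND bound of
its terms would be one possible guess, but it could overestimate badly.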