On Thu, Apr 26, 2012 at 5:13 AM, Yang <teddyyyy...@gmail.com> wrote:
> I read the paper by Doug "Space optimizations for total ranking",
>
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)
>
> particularly the total ranking algorithm described there needs to traverse
> down the entire postings list for all the query terms,
> so in case of very common query terms like "yellow dog", either of the 2
> terms may have a very very long postings list in case of web search,
> are they all really traversed in current lucene/Solr ? or any heuristics
> to truncate the list are actually employed?

You can read the papers on early termination; they are closely related to the ranking algorithm. Lucene currently does little in this area, and the same goes for its ranking algorithm.

> in the case of returning top-k results, I can understand that partitioning
> the postings list into multiple machines, and then combining the top-k

That is distributed searching, and Solr has this ability. Even on a single node, for a conjunction query (AND query), Lucene uses the skip lists in the postings to speed things up. For a disjunction query (OR query), Lucene uses BooleanScorer rather than BooleanScorer2. BooleanScorer is a TAAT (Term at a Time) algorithm, while BooleanScorer2 is DAAT (Document at a Time).

> from each would work,
> but if we are required to return "the 100th result page", i.e. results
> ranked from 990--1000th, then each partition would still have to find out
> the top 1000, so
> partitioning would not help much.

Yes, that is why many search engines do not let users go past a certain page number. In most applications, users only look at the top results, which is why the ranking algorithm is important. If you find that your users keep turning to the next page, I think you should reconsider your application: provide more filter conditions or improve the ranking algorithm.
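To make the deep-paging cost concrete, here is a minimal sketch in Python. It is not Solr's or Lucene's actual code; the (score, doc_id) tuples and the names shard_top_n and fetch_page are made up for illustration. It only shows why every partition has to produce its own top page * page_size hits before the merge.

```python
import heapq
from typing import List, Tuple

# Hypothetical hit representation: (score, doc_id). Not a Lucene/Solr type.
Hit = Tuple[float, int]


def shard_top_n(hits: List[Hit], n: int) -> List[Hit]:
    """Return the n best hits of one shard, highest score first."""
    return heapq.nlargest(n, hits)


def fetch_page(shards: List[List[Hit]], page: int, page_size: int) -> List[Hit]:
    """Return result page `page` (1-based), merged across all shards."""
    needed = page * page_size
    # Every shard must contribute its own top `needed` hits: in the worst
    # case the whole global top `needed` lives on a single shard.
    candidates = [hit for shard in shards for hit in shard_top_n(shard, needed)]
    merged = heapq.nlargest(needed, candidates)
    # Keep only the last `page_size` entries of the merged top `needed`.
    return merged[needed - page_size:needed]


if __name__ == "__main__":
    shards = [
        [(0.9, 1), (0.5, 2), (0.1, 3)],
        [(0.8, 4), (0.7, 5), (0.2, 6)],
    ]
    print(fetch_page(shards, page=1, page_size=2))
    print(fetch_page(shards, page=2, page_size=2))
```

The per-shard work and the amount of data sent to the merger grow with page * page_size, not with page_size, so adding partitions does not make page 100 cheap. Ties in score are broken by doc_id here only because Python compares the tuples element by element.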

> overall, is there any up-to-date detailed docs on the internal algorithms
> of lucene?

If you can read Chinese, I recommend this series:
http://www.cnblogs.com/forfuture1978/category/300665.htm
You may also find some of my blog posts about Lucene/Solr at blog.csdn.net/fancyerII (I am not a persistent person, and my plan to keep blogging about Lucene/Solr did not continue). Anyhow, the source code is the best resource.

> Thanks a lot
> Yang