Hi,

> I read the paper by Doug, "Space optimizations for total ranking".
>
> Since it was written a long time ago, I wonder what algorithms Lucene uses
> (regarding postings list traversal, score calculation, and ranking).
The algorithms described in that paper are still in use to merge the posting
lists of several terms. The "total ranking" described there is implemented in
several key parts of Lucene:

#1 Section 4.1 (Parallel Merge) is done in BooleanScorer2.java
#2 Section 4.2 (Block Processing) is done in BooleanScorer.java
#3 Maintaining the queue of top-N results is implemented in
   TopScoreDocCollector.java; the above scorer implementations call the
   collect(docid, score) method of the collector for each hit found.

> particularly the total ranking algorithm described there needs to traverse
> the entire postings list for all the query terms, so in case of very common
> query terms like "yellow dog", either of the 2 terms may have a very very
> long postings list in case of web search, are they all really traversed in
> current lucene/Solr? or are any heuristics to truncate the list actually
> employed?

There are possibilities to truncate those lists, but this is not implemented
in Lucene core. The main problem with Lucene's segmented index structure is
that you cannot exit early, because the very last document in the posting
list could be one with a very high score. Techniques like index pruning or
index sorting can optimize all this, but they need index preprocessing and
lose updatability of the index while in use.

The top-N result collection is optimized in Lucene (item #3) to throw away
all collected documents whose score can never get into the top-N priority
queue (too small a score, once the queue is full). But the score still has
to be calculated for every hit.

> in the case of returning top-k results, I can understand that partitioning
> the postings list onto multiple machines, and then combining the top-k from
> each, would work, but if we are required to return "the 100th result page",
> i.e. results ranked 990th--1000th, then each partition would still have to
> find its top 1000, so partitioning would not help much.
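To make item #3 concrete, here is a toy stand-in for what a top-N score
collector does (this is NOT Lucene's real TopScoreDocCollector code; the
class and method names are made up for illustration): a bounded min-heap
keeps the best N hits, and once it is full, any hit that cannot beat the
weakest queued score is dropped without ever entering the queue.

```java
import java.util.PriorityQueue;

public class TopNCollectorSketch {
    private final int n;
    // Min-heap ordered by score, so peek() is always the weakest queued hit.
    // Each entry is {docId, score} packed into a float[] for brevity.
    private final PriorityQueue<float[]> queue;

    public TopNCollectorSketch(int n) {
        this.n = n;
        this.queue = new PriorityQueue<>((a, b) -> Float.compare(a[1], b[1]));
    }

    // Called once per hit, like the collect(docid, score) call above.
    // Note the score must already be computed to make this decision.
    public void collect(int docId, float score) {
        if (queue.size() < n) {
            queue.add(new float[] { docId, score });
        } else if (score > queue.peek()[1]) {
            queue.poll();                            // evict the weakest hit
            queue.add(new float[] { docId, score });
        }
        // else: this hit can never enter the top-N, throw it away
    }

    // Drain the heap into best-score-first order.
    public int[] topDocs() {
        int[] docs = new int[queue.size()];
        for (int i = docs.length - 1; i >= 0; i--) {
            docs[i] = (int) queue.poll()[0];         // weakest comes out first
        }
        return docs;
    }
}
```

This shows why the optimization saves queue operations but not scoring work:
every hit's score is computed before the drop/keep decision is possible.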
Yes, and because of that almost no search engine supports deep paging. Lucene
uses lots of memory when trying to do this, although there are new collector
implementations in Lucene that allow "optimized" deep paging (included in
item #3): if the top-N of the 99th result page were already found, you can,
when collecting page 100, throw away during the collection phase all
documents with a higher score than the last one of page 99.

> overall, is there any up-to-date detailed doc on the internal algorithms of
> lucene?

Reading the code is the best documentation. But the Javadocs also help; we
recently updated the documentation of Lucene's internals for the coming
version 4:
https://builds.apache.org/job/Lucene-trunk/javadoc/

> Thanks a lot
> Yang

Uwe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
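The deep-paging cutoff described above boils down to one comparison per hit.
A hypothetical sketch (the class name and method are illustrative, not
Lucene's real API): remember the score and docId of the last hit delivered
on the previous page, and drop any hit that would have sorted at or before
it. Ties on score are broken by docId, where the lower docId sorts first.

```java
public class DeepPagingSketch {
    // "After" bookmark: the last hit delivered on the previous page.
    private final float afterScore;
    private final int afterDoc;

    public DeepPagingSketch(float afterScore, int afterDoc) {
        this.afterScore = afterScore;
        this.afterDoc = afterDoc;
    }

    // True if this hit still competes for the current (deeper) page;
    // false if it already appeared on an earlier page, so it can be
    // dropped in the collection phase without entering the priority queue.
    public boolean competes(int docId, float score) {
        if (score > afterScore) {
            return false;                  // scored onto an earlier page
        }
        if (score == afterScore && docId <= afterDoc) {
            return false;                  // same score, earlier docId
        }
        return true;
    }
}
```

This only shrinks the priority queue to one page's worth of entries; every
hit in the posting lists is still visited and scored, as described above.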