Thanks Ralf. basically you are talking about selectivity of columns in a JOIN, right?
but in my above example, "yellow dog", both terms are very common, and both have long postings lists. Yang On Thu, Apr 26, 2012 at 12:17 AM, Ralf Heyde <ralf.he...@gmx.de> wrote: > Hi, > > i dont know the correct implementation. All that I can say is, that you > should take a look at query optimization in state-of-the-art database > systems. The generell solution is to select this part of a query first, > which reduces the resultset most. > > For example you try to search for a term like "I'm looking for a term" - > each token has a selectivity (TF/IDF, histograms (ie. Zipf-Distribution)). > > Term / Frequency > I'm - often > looking - normal > for - stopword (too often and maybe ignored) > a - stopword (too often and maybe ignored) > term - rare > > So - Lucene might to select all documents, which contain "term", then on > the "small" resultset looking for documents matching "looking" ... and so > on. > For that reason Lucene stores information like Histograms and other > "things" ... > > I hope I gave you a simple example, which Lucene might use. > > Greets from Berlin, > > Ralf > > > -------- Original-Nachricht -------- > > Datum: Wed, 25 Apr 2012 14:20:10 -0700 > > Von: Yang <teddyyyy...@gmail.com> > > An: java-user@lucene.apache.org > > Betreff: Re: lucene algorithm ? > > > additionally, anybody knows roughly (of course the details are a secret, > > but I guess the main ideas should be > > common enough these days) how google does fast ranking in cases of > > multi-term queries with AND ? > > (if their postings are sorted by PageRank order, then it's understandable > > that a single term query would quickly return the top-k, but if it's > > multi-term, they would have to traverse the entire lists to find the > > insersection set, because the lists are not sorted by docId, as in the > > Lucene paper case) > > > > > > > > On Wed, Apr 25, 2012 at 2:13 PM, Yang <teddyyyy...@gmail.com> wrote: > > > > > I read the paper by Doug "Space optimizations for total ranking", > > > > > > since it was written a long time ago, I wonder what algorithms lucene > > uses > > > (regarding postings list traversal and score calculation, ranking) > > > > > > > > > particularly the total ranking algorithm described there needs to > > traverse > > > down the entire postings list for all the query terms, > > > so in case of very common query terms like "yellow dog", either of the > 2 > > > terms may have a very very long postings list in case of web search, > > > are they all really traversed in current lucene/Solr ? or any > > heuristics > > > to truncate the list are actually employed? > > > > > > in the case of returning top-k results, I can understand that > > partitioning > > > the postings list into multiple machines, and then combining the top-k > > > from each would work, > > > but if we are required to return "the 100th result page", i.e. results > > > ranked from 990--1000th, then each partition would still have to find > > out > > > the top 1000, so > > > partitioning would not help much. > > > > > > > > > overall, is there any up-to-date detailed docs on the internal > > algorithms > > > of lucene? > > > > > > Thanks a lot > > > Yang > > > > > -- > Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir > belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >