Hi,

On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
> No they don't want that. They just want a small number. What happens is
> they enter some silly query, like searching for all stories with a single
> common non-stop-word in them, and with the usual sort criterion of by date
> (ie. a field) descending, and a limit of, say 25.
>
> So Lucene then presumably has to haul out a massive resultset, sort it, and
> return the top 25 (out of 500,000 or whatever).
I had a similar issue recently: users only want the 100 (or whatever) most recently updated documents which match, and our documents aren't stored in date order.

Originally, we would walk the result set, instantiate a Document for each hit, pull out the timestamp field, and keep around the 100 most recent documents. Obviously this is extremely slow for large result sets.

What I initially did to address this was store a reverse timestamp and walk the list of terms in the reverse timestamp field (they're sorted lexicographically), stopping once I had the 100 most recent matching documents. In most cases this was a lot faster (for a search which returned 153,142 matches, I only had to walk 288 documents to find the 100 most recent), but in some cases it was a lot slower (for another search which returned 339 matches, I had to walk 292,911 documents to find the 100 most recent).

In the end I found that I could walk 5 terms in the time it takes to instantiate 2 documents, and tuned a heuristic so that in the worst case (my second example) searches are 50% slower, but in almost all other cases they're quite a bit faster.

Hope this helps,
Joe
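
To make the trade-off above concrete, here is a minimal Java sketch of one way such a heuristic could work. It is not the code from the message, which does not spell out its rule. It assumes the term walk gets a budget worth 50% of the cost of scanning every match (using the 5 terms per 2 documents ratio) and falls back to the full scan when the budget runs out, which caps the worst case at roughly 50% slower. The class name `RecentMatches`, the constants, and the plain arrays standing in for the index (a newest-first doc list in place of the reverse timestamp terms, a `BitSet` in place of the result set) are all hypothetical, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; arrays stand in for a real index.
class RecentMatches {
    // Cost ratio quoted in the message: 5 term steps take about as long as 2 document loads.
    static final double TERM_STEPS_PER_DOC_LOAD = 5.0 / 2.0;
    // Assumption: the term walk may cost at most 50% of a full scan of the matches.
    static final double MAX_OVERHEAD = 0.5;

    // docsNewestFirst: doc ids in newest-first order (stand-in for the reverse timestamp terms).
    // timestamps[docId]: timestamp that the slow path reads from each loaded document.
    // matches: doc ids that matched the query.
    static List<Integer> mostRecent(int[] docsNewestFirst, long[] timestamps,
                                    BitSet matches, int n) {
        int matchCount = matches.cardinality();
        int wanted = Math.min(n, matchCount);
        long budget = (long) Math.ceil(matchCount * MAX_OVERHEAD * TERM_STEPS_PER_DOC_LOAD);

        // Fast path: walk newest-first and collect matches until done or out of budget.
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < docsNewestFirst.length && i < budget && found.size() < wanted; i++) {
            int docId = docsNewestFirst[i];
            if (matches.get(docId)) {
                found.add(docId);
            }
        }
        if (found.size() == wanted) {
            return found;
        }
        // Budget exhausted: fall back to loading every matching document.
        return topByScan(timestamps, matches, n);
    }

    // Slow path: visit every match and keep the n newest in a bounded min-heap.
    static List<Integer> topByScan(long[] timestamps, BitSet matches, int n) {
        Comparator<Integer> oldestFirst = Comparator.comparingLong((Integer d) -> timestamps[d]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(oldestFirst);
        for (int d = matches.nextSetBit(0); d >= 0; d = matches.nextSetBit(d + 1)) {
            heap.add(d);
            if (heap.size() > n) {
                heap.poll(); // discard the oldest candidate
            }
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort(oldestFirst.reversed()); // newest first
        return out;
    }
}
```

With a dense result set the fast path finishes after a handful of steps. With a sparse one (like the 339-match search) the walk gives up after about 1.25 steps per match and pays for the full scan on top, so the total stays within the assumed 50% overhead.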