Hi folks,

I've run into an interesting situation while trying to enforce a query evaluation time budget in our Lucene application (Amazon Product Search), and I'm curious whether others have run into it or have thoughts on it. There's a reasonable chance this use-case is fairly specific to our application, but if others have seen similar ones, maybe there's a general solution worth pursuing in Lucene itself?
We'd like to enforce a strict time budget for query evaluation, even at the cost of potentially missing some matches that have yet to be seen. One tempting solution is to enforce this in our leaf collectors: each time a collector sees a hit, it checks the budget and throws a CollectionTerminatedException if the budget is exhausted. IndexSearcher handles that exception nicely (we use concurrent search, so this more-or-less enforces an overall time budget).

Our queries follow a two-phase matching approach, and we've run into some interesting edge-cases where the "approximation" phase may produce a very large set of match candidates but the "confirmation" phase only confirms a very small fraction of them. In extreme cases, the entire index could match in the "approximation" phase and none of the hits could be "confirmed" in the second-phase check. The result is that the query may evaluate for a long time before the leaf collectors see any hits (or they may never see hits at all). This boils down to the BulkScorer looping over all "approximate" candidates and attempting to "confirm" each one before the leaf collector "sees" anything. (The same thing can happen when there are many first-phase matches and many of those hits have been deleted.) In these cases, we can run significantly over our time budget.

One solution I've come up with is to create a top-level Query implementation that enforces the time budget each time it produces "approximate" matches. This more-or-less works for our use-case, but has some "rough edges" as a general solution. What I've observed is that Lucene really only supports collectors / leaf collectors throwing CollectionTerminatedException and doesn't necessarily support Query implementations doing this. One of the most glaring issues is that the LRU query caching (if enabled) doesn't handle the exception, so if a Query were to throw while the cache bitsets are being pre-populated, it would terminate the entire search (in a pretty ungraceful way).
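To make the difference between the two check points concrete, here is a minimal, Lucene-free sketch of the loop shape described above. Everything in it is hypothetical and for illustration only: the class `TimeBudgetSketch`, the `BudgetExceededException` stand-in, the `search` method, and the tick-based clock are my own names, and the loop only models how a bulk-scoring loop interleaves approximation, confirmation, and collection. It is not how BulkScorer is actually implemented.

```java
import java.util.function.IntPredicate;
import java.util.function.LongSupplier;

// Simplified model only: no Lucene classes are used here.
final class TimeBudgetSketch {

    /** Hypothetical stand-in for an early-termination signal. */
    static final class BudgetExceededException extends RuntimeException {}

    /** What happened during one simulated search. */
    static final class Result {
        final int examined;            // approximate candidates visited
        final int collected;           // confirmed hits handed to the "collector"
        final boolean terminatedEarly; // true if the budget check fired

        Result(int examined, int collected, boolean terminatedEarly) {
            this.examined = examined;
            this.collected = collected;
            this.terminatedEarly = terminatedEarly;
        }
    }

    /**
     * Visits candidates 0..maxDoc-1 (the "approximation"), confirms each one,
     * and collects confirmed hits. The collector-side budget check always runs
     * when a hit is collected; the approximation-side check is optional.
     */
    static Result search(int maxDoc, IntPredicate confirm, LongSupplier clock,
                         long deadline, boolean checkInApproximation) {
        int examined = 0;
        int collected = 0;
        try {
            for (int doc = 0; doc < maxDoc; doc++) {
                // Check made by a wrapping query each time it yields a candidate.
                if (checkInApproximation && clock.getAsLong() > deadline) {
                    throw new BudgetExceededException();
                }
                examined++;
                if (!confirm.test(doc)) {
                    continue; // unconfirmed: the collector never sees this doc
                }
                // Check made by the leaf collector; only reached for confirmed hits.
                if (clock.getAsLong() > deadline) {
                    throw new BudgetExceededException();
                }
                collected++;
            }
        } catch (BudgetExceededException e) {
            return new Result(examined, collected, true);
        }
        return new Result(examined, collected, false);
    }

    public static void main(String[] args) {
        long[] ticks = {0}; // fake clock: one tick per confirmation attempt
        IntPredicate neverConfirms = doc -> { ticks[0]++; return false; };

        Result collectorOnly = search(1_000, neverConfirms, () -> ticks[0], 10, false);
        System.out.println("collector-only check examined " + collectorOnly.examined);

        ticks[0] = 0;
        Result withQueryCheck = search(1_000, neverConfirms, () -> ticks[0], 10, true);
        System.out.println("approximation check examined " + withQueryCheck.examined);
    }
}
```

With a confirmation step that never succeeds, the collector-only variant walks every candidate because its check is never reached, while the approximation-side check stops shortly after the deadline. That is the gap the top-level Query wrapper is meant to close.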
I'm also aware of ExitableDirectoryReader, but it's trickier to manage for our use case since we read from the index outside of the main query evaluation phase for other purposes. I'm sure there's a solution where we maintain multiple Readers, etc.

So... I'm interested in whether anyone else has run into a similar use-case. Does anyone have thoughts on alternative solutions? Is there any appetite to augment Lucene to allow queries to signal early termination by throwing CollectionTerminatedException? I suspect ExitableDirectoryReader probably provides a good enough solution for others in this situation, but I wanted to raise the topic and see what other folks here think.

Cheers,
-Greg