Posted something similar some time ago, but didn't get any responses, so I thought I'd try again with more details.
We allow end-user queries in our own proprietary query language, which we then translate to a Lucene Query* AST. This has worked well for us. However, a few of the operators we allow match fields with extremely high document frequency, on the order of >60% of the index, and end users sometimes want a count of all documents matching such a field value. Since we're trying to get as close as possible to the 2.1B-document limit per index, this type of query can take more than 20 seconds.

Most of these operators are boolean values, so we could cache them externally ahead of time as in-memory bitset representations, using the docID as the index into the bitset. Based on preliminary testing, we know that bitsets can significantly speed up these count queries.

The question, then, is how to tie the bitset implementation into query evaluation. We considered Filters, but those seem useful mainly when you want to filter a whole result set, whereas in our case these clauses can appear at any level of the query tree. Our next thought was a custom implementation of the Query class (similar to TermQuery, etc.) that knows how to evaluate against the bitset rather than going to the index itself. That looks possible, but fairly involved.

It seems like this can't be a new problem, so we're wondering if there's pre-existing work we're missing that would make this easier. Any thoughts?

Thanks,

Marcos Juarez
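P.S. To make the counting idea concrete, here's a minimal standalone sketch using java.util.BitSet, not wired into Lucene at all. The field names, densities, and maxDoc are invented for illustration; the point is just that a count becomes a popcount over packed words rather than an index scan, and that per-clause bitsets compose with AND/OR at any level of the tree:

```java
import java.util.BitSet;

public class BitsetCountDemo {

    // Build a cached bitset for a hypothetical boolean field ("isActive"),
    // with docID as the bit index. In practice this would be populated from
    // the index ahead of time, not synthesized like this.
    static BitSet buildIsActive(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId = 0; docId < maxDoc; docId++) {
            if (docId % 3 != 0) {  // ~66% density, similar to our real fields
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        BitSet isActive = buildIsActive(maxDoc);

        // A pure count is a popcount over the packed 64-bit words;
        // no per-document index access happens here.
        System.out.println(isActive.cardinality()); // 666666

        // AND-ing with another clause's bitset models this clause appearing
        // deeper in the query tree (e.g. "isActive AND <other clause>").
        BitSet other = new BitSet(maxDoc);
        for (int docId = 0; docId < maxDoc; docId += 2) {
            other.set(docId);
        }
        BitSet both = (BitSet) isActive.clone();
        both.and(other);
        System.out.println(both.cardinality()); // 333333
    }
}
```

The real integration would presumably wrap something like this in a custom Query (with its Weight/Scorer iterating set bits), but the above is only the counting core, shown to explain what we measured in our preliminary tests.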