mccullocht commented on PR #15708: URL: https://github.com/apache/lucene/pull/15708#issuecomment-3924607405
This API is probably fine for the described purpose, but I'm skeptical about how useful it will be. Recall improvements diminish pretty quickly when you increase the query bit rate without also increasing the doc bit rate. I'm optimistic that we could do more to improve recall and performance without exposing this kind of parameter.

To obey the proposed API we would need to be able to compare two vectors of different bit rates, for any pair of bit rates up to, say, 8 bits/dim. The transpose + popcount strategy that we employ for bit and dibit works up to somewhere around 4-8 comparisons/dimension. Once the number of comparisons grows larger than that, it starts to become cheaper to perform a dot product, and how well that works depends a lot on how the vectors are packed. The current 1-bit packing scheme in particular would be difficult to compare against vectors of other bit rates, because of how hard it would be to unpack into the same dimension order as something else. The same problem arises if you look at extending the doc vector with a quantized residual, as described in the [LVQ paper](https://arxiv.org/pdf/2304.04759).

I have another idea, inspired by the statistical bounds on estimated distance described in the RaBitQ paper: if a `minSimilarity` parameter were passed to `score()`, the scorer might be able to eliminate certain candidates after examining only 1 bit of a 4-bit query vector. I'll file an issue for this once I have a better handle on the math.
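
For readers unfamiliar with the transpose + popcount strategy mentioned above, here is a minimal sketch of the general idea. It is not Lucene's actual implementation: the class `BitPlaneDot`, its method names, and the packing layout (dimension `i` stored in bit `i % 64` of word `i / 64`) are all assumptions made for illustration. Each vector is transposed into one bitset per bit position, and the dot product is rebuilt from AND + popcount passes over every pair of planes, so the work per dimension grows as query bits times doc bits.

```java
// Hypothetical illustration only; names and packing layout are assumptions,
// not Lucene's quantized vector format.
final class BitPlaneDot {

    /** Transposes small unsigned ints into `bits` bitsets, one per bit position. */
    static long[][] transpose(int[] values, int bits) {
        int words = (values.length + 63) / 64;
        long[][] planes = new long[bits][words];
        for (int i = 0; i < values.length; i++) {
            for (int p = 0; p < bits; p++) {
                if (((values[i] >>> p) & 1) != 0) {
                    // Dimension i lives in bit (i % 64) of word (i / 64).
                    planes[p][i >>> 6] |= 1L << (i & 63);
                }
            }
        }
        return planes;
    }

    /**
     * Dot product from bit planes. Runs queryBits * docBits AND + popcount
     * passes, each pass weighted by 2^(p + q).
     */
    static long dot(long[][] queryPlanes, long[][] docPlanes) {
        long sum = 0;
        for (int p = 0; p < queryPlanes.length; p++) {
            for (int q = 0; q < docPlanes.length; q++) {
                long count = 0;
                for (int w = 0; w < queryPlanes[p].length; w++) {
                    count += Long.bitCount(queryPlanes[p][w] & docPlanes[q][w]);
                }
                sum += count << (p + q);
            }
        }
        return sum;
    }

    /** Reference implementation on the unpacked values, for comparison. */
    static long naiveDot(int[] query, int[] doc) {
        long sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += (long) query[i] * doc[i];
        }
        return sum;
    }
}
```

With a 4-bit query against a dibit doc vector this is 4 x 2 = 8 passes; an 8-bit by 8-bit pairing would need 64, which is the regime where an ordinary dot product over unpacked values is likely to be cheaper.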
