jpountz commented on issue #13179:
URL: https://github.com/apache/lucene/issues/13179#issuecomment-2149257220
> Then before evaluating if these docs matches TwoPhaseIterator or not, we
> can perform prefetch on these buffered docs (via some prepareMatches mechanism
> on TwoPhaseIterator).
This can be done, but I'd note that this would be a significant change to
our APIs since `TwoPhaseIterator` only supports verifying the current document
that the approximation is on. It is not possible to buffer matching documents
from the approximation, to then check them with the `TwoPhaseIterator`. This is
similar to the point I was making in a previous comment about buffering
documents in collectors: `Scorer#score` only supports scoring the current
document that the scorer is positioned on, so it is not possible to buffer several
documents and then evaluate their scores in `TopScoreDocCollector` (without API
changes).
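To make the constraint concrete, here is a minimal toy model of the approximation/verification contract. `ToyTwoPhase` and `collect` are hypothetical stand-ins written for illustration, not Lucene's actual classes; the point is only that `matches()` takes no doc ID argument, so the caller has to verify each candidate before advancing to the next one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for the approximation/verification contract; NOT Lucene's classes. */
final class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical simplified iterator: an approximation plus a check on the current doc only. */
    static final class ToyTwoPhase {
        private final int[] candidates; // sorted doc IDs returned by the approximation
        private int idx = -1;

        ToyTwoPhase(int[] candidates) {
            this.candidates = candidates;
        }

        /** Advances the approximation to its next candidate. */
        int nextDoc() {
            idx++;
            return docID();
        }

        /** Doc the approximation is currently positioned on (-1 before the first call). */
        int docID() {
            if (idx < 0) {
                return -1;
            }
            return idx < candidates.length ? candidates[idx] : NO_MORE_DOCS;
        }

        /** No doc argument: only the doc the approximation is positioned on can be verified. */
        boolean matches() {
            return docID() % 3 == 0; // arbitrary stand-in for an expensive check
        }
    }

    /** The supported pattern: advance, verify immediately, then advance again. */
    static List<Integer> collect(ToyTwoPhase it) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            if (it.matches()) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

In this model, buffering several candidates first and verifying them afterwards would have every `matches()` call check the last candidate, which is why buffering would need an API change.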
> via some prepareMatches mechanism on TwoPhaseIterator
FWIW one thing that is on my mind is that both postings and doc values take
on the order of 1 or 2 bytes per document. So even a query that matches 0.1% of
docs, evenly distributed in the doc ID space, would still end up fetching all
pages in practice. So very smart prefetching may only perform better than
naive prefetching in the following cases:
- Queries that are _extremely_ sparse.
- Queries whose matches are highly clustered in the doc ID space, because
of index sorting, recursive graph bisection or early termination.
But then I'd still expect some naive readahead logic to perform ok in such
cases. For the extremely sparse case, it would fetch up to X times too many
pages where X is the number of pages that get read ahead. For reasonable values
of X, this should be ok.
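A rough back-of-the-envelope check of the "fetching all pages" estimate, as a sketch under stated assumptions: a 4096-byte page size (not stated above) and matches modeled as independent random draws at the given ratio. The class and method names are hypothetical. Under this model, a 0.1% match ratio touches roughly 87% of pages at 2 bytes per document and roughly 98% at 1 byte per document; with perfectly even spacing, every page would be touched.

```java
/** Estimates how many pages a sparse query touches; an illustrative model only. */
final class PageTouchEstimate {

    /**
     * Expected fraction of pages containing at least one match, assuming each
     * document matches independently with probability matchRatio.
     */
    static double fractionOfPagesTouched(double matchRatio, int docsPerPage) {
        return 1.0 - Math.pow(1.0 - matchRatio, docsPerPage);
    }

    public static void main(String[] args) {
        int pageSizeBytes = 4096;  // assumption: typical OS page size
        double matchRatio = 0.001; // 0.1% of docs match, as in the comment
        for (int bytesPerDoc = 1; bytesPerDoc <= 2; bytesPerDoc++) {
            int docsPerPage = pageSizeBytes / bytesPerDoc;
            double touched = fractionOfPagesTouched(matchRatio, docsPerPage);
            System.out.printf("%d byte(s)/doc, %d docs/page: %.1f%% of pages touched%n",
                    bytesPerDoc, docsPerPage, 100.0 * touched);
        }
    }
}
```

Only when the match ratio drops by a few more orders of magnitude does the touched fraction become small, which corresponds to the "extremely sparse" case above.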
The other thing that is on my mind is that this sort of approach allows us
to do it completely at the OS level, which gives additional efficiency.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]