jpountz commented on code in PR #13364:
URL: https://github.com/apache/lucene/pull/13364#discussion_r1599846740
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsReader.java:
##########
@@ -2049,6 +2074,44 @@ public long cost() {
}
}
+  private void seekAndPrefetchPostings(IndexInput docIn, IntBlockTermState state)
+      throws IOException {
+    if (docIn.getFilePointer() != state.docStartFP) {
+      // Don't prefetch if the input is already positioned at the right offset, which suggests that
+      // the caller is streaming the entire inverted index (e.g. for merging), let the read-ahead
+      // logic do its work instead. Note that this heuristic doesn't work for terms that have skip
+      // data, since skip data is stored after the last term, but handling all terms that have <128
+      // docs is a good start already.
+      docIn.seek(state.docStartFP);
+      if (state.skipOffset < 0) {
+        // This postings list is very short as it doesn't have skip data, prefetch the page that
+        // holds the first byte of the postings list.
+        docIn.prefetch(1);
+      } else if (state.skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
+        // This postings list is short as it fits on a few pages, prefetch it all, plus one byte to
+        // make sure to include some skip data.
+        docIn.prefetch(state.skipOffset + 1);
Review Comment:
This is trying to address your concern about the number of system calls by doing a single system call when the postings list is short, instead of independently doing a madvise call for postings and for skip data. This matters especially since short postings lists are less likely to amortize the cost of system calls, as iterating over all their docs is fast CPU-wise.
When there are multiple clauses in the same query, the dense clauses usually consume a small percentage of their docs while the sparser clauses consume a large percentage of theirs, likely hitting all pages. So faulting in all pages of the sparse clauses made sense to me, but I don't have a strong feeling either way and I'm happy to look at different approaches.
> if the postings are short enough that we are willing to fault them all in at once, why do we even index skip data at all?
You still get the CPU savings in the case when data fits in the page cache. Plus, skip data also records impacts, and there are many cases where having impacts on the sparser clauses is important to be able to skip more hits on the denser clauses.
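
The decision logic in the diff above can be modeled standalone. The following is a minimal sketch of the heuristic only, not the actual Lucene reader code; the class name, the `prefetchLength` helper, and the constant values are assumptions made for illustration:

```java
// Hypothetical model of the prefetch decision discussed in this review:
// one prefetch call for short postings lists, skipping the prefetch
// entirely when the input is already positioned (sequential scan case).
public class PrefetchHeuristic {

  // Assumed values; the real MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH is
  // defined in the PR, not here.
  static final int PAGE_SIZE = 4096;
  static final int MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH = 16 * PAGE_SIZE;

  /**
   * Returns the number of bytes to prefetch starting at docStartFP,
   * or 0 if no prefetch should be issued (read-ahead will handle it).
   */
  static long prefetchLength(long filePointer, long docStartFP, long skipOffset) {
    if (filePointer == docStartFP) {
      // Already positioned: likely a sequential scan such as a merge.
      return 0;
    }
    if (skipOffset < 0) {
      // No skip data: very short list, fault in the first page only.
      return 1;
    }
    if (skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
      // Short list: one system call covers the postings plus one byte
      // of skip data.
      return skipOffset + 1;
    }
    // Long list: only fault in the first page here.
    return 1;
  }

  public static void main(String[] args) {
    System.out.println(prefetchLength(100, 100, 42));  // 0: already positioned
    System.out.println(prefetchLength(0, 100, -1));    // 1: no skip data
    System.out.println(prefetchLength(0, 100, 8192));  // 8193: full prefetch
  }
}
```

The point of collapsing postings and skip data into one `prefetch` call is that each call is a system call (e.g. madvise under the hood), so halving the call count helps exactly where the per-doc CPU work is smallest.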
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]