jpountz commented on code in PR #13364:
URL: https://github.com/apache/lucene/pull/13364#discussion_r1599846740
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsReader.java:
##########
@@ -2049,6 +2074,44 @@ public long cost() {
}
}
+  private void seekAndPrefetchPostings(IndexInput docIn, IntBlockTermState state)
+      throws IOException {
+    if (docIn.getFilePointer() != state.docStartFP) {
+      // Don't prefetch if the input is already positioned at the right offset, which suggests that
+      // the caller is streaming the entire inverted index (e.g. for merging), let the read-ahead
+      // logic do its work instead. Note that this heuristic doesn't work for terms that have skip
+      // data, since skip data is stored after the last term, but handling all terms that have <128
+      // docs is a good start already.
+      docIn.seek(state.docStartFP);
+      if (state.skipOffset < 0) {
+        // This postings list is very short as it doesn't have skip data, prefetch the page that
+        // holds the first byte of the postings list.
+        docIn.prefetch(1);
+      } else if (state.skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
+        // This postings list is short as it fits on a few pages, prefetch it all, plus one byte to
+        // make sure to include some skip data.
+        docIn.prefetch(state.skipOffset + 1);
Review Comment:
This is trying to address your concern about the number of system calls by doing a single system call when the postings list is short, instead of independently doing a madvise call for postings and for skip data. This matters especially since short postings lists are less likely to amortize the cost of system calls, as iterating over all their docs is fast CPU-wise.
When there are multiple clauses in the same query, the dense clauses usually consume a small percentage of their docs while the sparser clauses consume a large percentage of theirs, likely hitting all pages. So faulting in all pages of the sparse clauses made sense to me, but I don't have a strong feeling either way and I'm happy to look at different approaches.
> if the postings are short enough that we are willing to fault them all in at once, why do we even index skip data at all?
You still get the CPU savings in the case when data fits in the page cache. Plus, skip data also records impacts, and there are many cases where having impacts on the sparser clauses is important to be able to skip more hits on the denser clauses.
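
The decision logic in the diff above can be modeled standalone. The following is a minimal sketch of the heuristic only, not the actual Lucene reader code; the class name, the `prefetchLength` helper, and the constant values are assumptions made for illustration:

```java
// Hypothetical model of the prefetch decision discussed in this review:
// one prefetch call for short postings lists, skipping the prefetch
// entirely when the input is already positioned (sequential scan case).
public class PrefetchHeuristic {

  // Assumed values; the real MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH is
  // defined in the PR, not here.
  static final int PAGE_SIZE = 4096;
  static final int MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH = 16 * PAGE_SIZE;

  /**
   * Returns the number of bytes to prefetch starting at docStartFP,
   * or 0 if no prefetch should be issued (read-ahead will handle it).
   */
  static long prefetchLength(long filePointer, long docStartFP, long skipOffset) {
    if (filePointer == docStartFP) {
      // Already positioned: likely a sequential scan such as a merge.
      return 0;
    }
    if (skipOffset < 0) {
      // No skip data: very short list, fault in the first page only.
      return 1;
    }
    if (skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
      // Short list: one system call covers the postings plus one byte
      // of skip data.
      return skipOffset + 1;
    }
    // Long list: only fault in the first page here.
    return 1;
  }

  public static void main(String[] args) {
    System.out.println(prefetchLength(100, 100, 42));  // 0: already positioned
    System.out.println(prefetchLength(0, 100, -1));    // 1: no skip data
    System.out.println(prefetchLength(0, 100, 8192));  // 8193: full prefetch
  }
}
```

The point of collapsing postings and skip data into one `prefetch` call is that each call is a system call (e.g. madvise under the hood), so halving the call count helps exactly where the per-doc CPU work is smallest.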
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]