Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

via GitHub Thu, 15 Aug 2024 11:59:07 -0700


gsmiller commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1718827005



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##########
@@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws 
IOException {
       assert or != 0;
       final int bitsPerValue = PackedInts.bitsRequired(or);
       out.writeByte((byte) bitsPerValue);
-      forUtil.encode(longs, bitsPerValue, out);
+      encodeDeltas(longs, bitsPerValue, out);
+    }
+  }
+
+  private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) 
throws IOException {
+    final int nextPrimitive;
+    final int numLongs;
+    if (bitsPerValue <= 4) {
+      nextPrimitive = 8;
+      numLongs = BLOCK_SIZE / 8;
+      collapse8(longs);
+    } else if (bitsPerValue <= 11) {
+      nextPrimitive = 16;
+      numLongs = BLOCK_SIZE / 4;
+      collapse16(longs);
+    } else {
+      nextPrimitive = 32;
+      numLongs = BLOCK_SIZE / 2;
+      collapse32(longs);
+    }
+
+    final int numLongsPerShift = bitsPerValue * 2;
+    int idx = 0;
+    int shift = nextPrimitive - bitsPerValue;
+    for (int i = 0; i < numLongsPerShift; ++i) {
+      tmp[i] = longs[idx++] << shift;
+    }
+    for (shift = shift - bitsPerValue; shift >= 0; shift -= bitsPerValue) {
+      for (int i = 0; i < numLongsPerShift; ++i) {
+        tmp[i] |= longs[idx++] << shift;
+      }
+    }
+
+    final int remainingBitsPerLong = shift + bitsPerValue;
+    final long maskRemainingBitsPerLong;
+    if (nextPrimitive == 8) {
+      maskRemainingBitsPerLong = MASKS8[remainingBitsPerLong];
+    } else if (nextPrimitive == 16) {
+      maskRemainingBitsPerLong = MASKS16[remainingBitsPerLong];
+    } else {
+      maskRemainingBitsPerLong = MASKS32[remainingBitsPerLong];
+    }
+
+    int tmpIdx = 0;
+    int remainingBitsPerValue = bitsPerValue;
+    while (idx < numLongs) {
+      if (remainingBitsPerValue >= remainingBitsPerLong) {
+        remainingBitsPerValue -= remainingBitsPerLong;
+        tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & 
maskRemainingBitsPerLong;
+        if (remainingBitsPerValue == 0) {
+          idx++;
+          remainingBitsPerValue = bitsPerValue;
+        }
+      } else {
+        final long mask1, mask2;
+        if (nextPrimitive == 8) {
+          mask1 = MASKS8[remainingBitsPerValue];
+          mask2 = MASKS8[remainingBitsPerLong - remainingBitsPerValue];
+        } else if (nextPrimitive == 16) {
+          mask1 = MASKS16[remainingBitsPerValue];
+          mask2 = MASKS16[remainingBitsPerLong - remainingBitsPerValue];
+        } else {
+          mask1 = MASKS32[remainingBitsPerValue];
+          mask2 = MASKS32[remainingBitsPerLong - remainingBitsPerValue];
+        }
+        tmp[tmpIdx] |= (longs[idx++] & mask1) << (remainingBitsPerLong - 
remainingBitsPerValue);
+        remainingBitsPerValue = bitsPerValue - remainingBitsPerLong + 
remainingBitsPerValue;
+        tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & mask2;
+      }
+    }
+
+    for (int i = 0; i < numLongsPerShift; ++i) {
+      out.writeLong(tmp[i]);
+    }
+  }
+
+  /** Number of bytes required to encode 128 integers of {@code bitsPerValue} 
bits per value. */
+  int numBytes(int bitsPerValue) {
+    return bitsPerValue << (BLOCK_SIZE_LOG2 - 3);
+  }
+
+  private static void decodeSlow(

Review Comment:
   And one more method that I think is duplicated from `ForUtil`?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##########
@@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws 
IOException {
       assert or != 0;
       final int bitsPerValue = PackedInts.bitsRequired(or);
       out.writeByte((byte) bitsPerValue);
-      forUtil.encode(longs, bitsPerValue, out);
+      encodeDeltas(longs, bitsPerValue, out);
+    }
+  }
+
+  private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) 
throws IOException {

Review Comment:
   I think this is identical to `ForDelta#encode` right? Can we leverage that 
instead of duplicating it? I guess the issue is that `ForUtil` relies on its 
`tmp` buffer so the method isn't static, but we could split out a helper static 
method that takes the buffer as input that could be common I think.



##########
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java:
##########
@@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) {
 
   @Override
   public void splitLongs(

Review Comment:
   What if we add the "slow" implementation as a static method in 
`PostingDecodingUtil` (as sketched out below) and then delegate to it here and 
in the "slow path" in the vectorized implementation? That we can avoid the code 
duplication.
   
   ```
     protected static void splitLongsSlow(IndexInput in, int count, long[] b, 
int bShift, int dec, long bMask, long[] c, int cIndex, long cMask) throws 
IOException {
       in.readLongs(c, cIndex, count);
       // The below loop is auto-vectorized
       for (int i = 0; i < count; ++i) {
         for (int j = 0, end = (bShift - 1) / dec; j <= end; ++j) {
           b[count * j + i] = (c[cIndex + i] >>> (bShift - j * dec)) & bMask;
         }
         // can we determine cIndex as `count * ((bShift - 1) / dec + 1)`?
         c[cIndex + i] &= cMask;
       }
     }
   ```



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForUtil.java:
##########
@@ -712,14 +465,14 @@ private static void decode7(IndexInput in, 
PostingDecodingUtil pdu, long[] tmp,
     }
   }
 
-  private static void decode8(IndexInput in, PostingDecodingUtil pdu, long[] 
tmp, long[] longs)
+  static void decode8(IndexInput in, PostingDecodingUtil pdu, long[] tmp, 
long[] longs)

Review Comment:
   If I remember correctly, there was a version of the change where you always 
read the longs through `PostingDecodingUtil` and didn't need `IndexInput` for 
these edge-cases. It seems a little easier to understand that way (to me at 
least). Since `PostingDecodingUtil` is always getting created with an 
`IndexInput`, would it make sense to have a method in `PostingDecodingUtil` 
that just does the read delegation? Or an alternative would be to expose 
`IndexInput` (e.g., it could just be a public field in the abstract class)? 
It's a minor detail but it seems like it would make the code wiring slightly 
cleaner to not have to provide both `IndexInput` and `PostingDecodingUtil` when 
initializing `ForUtil` / `ForDeltaUtil`.



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##########
@@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws 
IOException {
       assert or != 0;
       final int bitsPerValue = PackedInts.bitsRequired(or);
       out.writeByte((byte) bitsPerValue);
-      forUtil.encode(longs, bitsPerValue, out);
+      encodeDeltas(longs, bitsPerValue, out);
+    }
+  }
+
+  private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) 
throws IOException {
+    final int nextPrimitive;
+    final int numLongs;
+    if (bitsPerValue <= 4) {
+      nextPrimitive = 8;
+      numLongs = BLOCK_SIZE / 8;
+      collapse8(longs);
+    } else if (bitsPerValue <= 11) {
+      nextPrimitive = 16;
+      numLongs = BLOCK_SIZE / 4;
+      collapse16(longs);
+    } else {
+      nextPrimitive = 32;
+      numLongs = BLOCK_SIZE / 2;
+      collapse32(longs);
+    }
+
+    final int numLongsPerShift = bitsPerValue * 2;
+    int idx = 0;
+    int shift = nextPrimitive - bitsPerValue;
+    for (int i = 0; i < numLongsPerShift; ++i) {
+      tmp[i] = longs[idx++] << shift;
+    }
+    for (shift = shift - bitsPerValue; shift >= 0; shift -= bitsPerValue) {
+      for (int i = 0; i < numLongsPerShift; ++i) {
+        tmp[i] |= longs[idx++] << shift;
+      }
+    }
+
+    final int remainingBitsPerLong = shift + bitsPerValue;
+    final long maskRemainingBitsPerLong;
+    if (nextPrimitive == 8) {
+      maskRemainingBitsPerLong = MASKS8[remainingBitsPerLong];
+    } else if (nextPrimitive == 16) {
+      maskRemainingBitsPerLong = MASKS16[remainingBitsPerLong];
+    } else {
+      maskRemainingBitsPerLong = MASKS32[remainingBitsPerLong];
+    }
+
+    int tmpIdx = 0;
+    int remainingBitsPerValue = bitsPerValue;
+    while (idx < numLongs) {
+      if (remainingBitsPerValue >= remainingBitsPerLong) {
+        remainingBitsPerValue -= remainingBitsPerLong;
+        tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & 
maskRemainingBitsPerLong;
+        if (remainingBitsPerValue == 0) {
+          idx++;
+          remainingBitsPerValue = bitsPerValue;
+        }
+      } else {
+        final long mask1, mask2;
+        if (nextPrimitive == 8) {
+          mask1 = MASKS8[remainingBitsPerValue];
+          mask2 = MASKS8[remainingBitsPerLong - remainingBitsPerValue];
+        } else if (nextPrimitive == 16) {
+          mask1 = MASKS16[remainingBitsPerValue];
+          mask2 = MASKS16[remainingBitsPerLong - remainingBitsPerValue];
+        } else {
+          mask1 = MASKS32[remainingBitsPerValue];
+          mask2 = MASKS32[remainingBitsPerLong - remainingBitsPerValue];
+        }
+        tmp[tmpIdx] |= (longs[idx++] & mask1) << (remainingBitsPerLong - 
remainingBitsPerValue);
+        remainingBitsPerValue = bitsPerValue - remainingBitsPerLong + 
remainingBitsPerValue;
+        tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & mask2;
+      }
+    }
+
+    for (int i = 0; i < numLongsPerShift; ++i) {
+      out.writeLong(tmp[i]);
+    }
+  }
+
+  /** Number of bytes required to encode 128 integers of {@code bitsPerValue} 
bits per value. */
+  int numBytes(int bitsPerValue) {

Review Comment:
   I think this is also identical to what's in `ForUtil`? Could we reference it 
and avoid the duplication?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java:
##########
@@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws 
IOException {
    * and store results into {@code longs}.
    */
   public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) 
throws IOException {

Review Comment:
   Actually, I guess this whole class is now only used by benchmarks? Do we 
actually need it or could benchmarks go directly against `ForUtil` / 
`ForDeltaUtil`?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java:
##########
@@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws 
IOException {
    * and store results into {@code longs}.
    */
   public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) 
throws IOException {

Review Comment:
   It looks like this method is only around for benchmarking? Could benchmarks 
directly call `ForDeltaUtil#decodeAndPrefixSum` so we could not need this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

Reply via email to