Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367380860


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,18 @@ public void writeTo(DataOutputStream dos) throws IOException {
         cpBuf.asIntBuffer().put(codepointIndex);
         dos.write(cpBuf.array());
 
-        // Bigram open-addressing table (keys + values).
+        // Bigram table: sorted-occupied keys (ascending) + parallel values.
+        // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+        // deltas are small because the keys are sorted and dense.
         dos.writeInt(bigramKeys.length);
         dos.writeFloat(bigramQuantMin);
         dos.writeFloat(bigramQuantMax);
-        ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
-                .order(ByteOrder.BIG_ENDIAN);
-        keyBuf.asIntBuffer().put(bigramKeys);
-        dos.write(keyBuf.array());
+        if (bigramKeys.length > 0) {
+            dos.writeInt(bigramKeys[0]);
+            for (int i = 1; i < bigramKeys.length; i++) {
+                writeVarLong(dos, (long) bigramKeys[i] - (long) bigramKeys[i - 1]);
+            }

Review Comment:
   `writeVarLong` is documented as writing a non-negative value, but the delta 
computed from adjacent `bigramKeys` is not validated. If the key array is 
accidentally unsorted, a negative delta will be encoded as a large unsigned 
varint and `readFrom` will reconstruct corrupt keys. Add an explicit 
`delta >= 0` check before the write so an unsorted array fails fast.
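
   A minimal sketch of the suggested write-side check, not the PR's actual 
code. `SortedKeyWriter`, `writeKeys`, and the `writeVarLong` body below are 
hypothetical stand-ins (the real `writeVarLong` is not shown in this hunk), and 
the quantization floats written between the length and the keys are omitted for 
brevity. Throwing `IllegalStateException` is an assumption; any unchecked or 
`IOException` failure would serve the same purpose.

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class SortedKeyWriter {

    /** Hypothetical stand-in for the PR's writeVarLong: unsigned LEB128, 7 bits per byte. */
    static void writeVarLong(DataOutputStream dos, long value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("varint value must be non-negative: " + value);
        }
        while ((value & ~0x7FL) != 0) {
            dos.writeByte((int) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        dos.writeByte((int) value); // final byte, continuation bit clear
    }

    /** Writes the key count, key[0] raw, then validated non-negative deltas. */
    static void writeKeys(DataOutputStream dos, int[] keys) throws IOException {
        dos.writeInt(keys.length);
        if (keys.length == 0) {
            return;
        }
        dos.writeInt(keys[0]);
        for (int i = 1; i < keys.length; i++) {
            // Widen to long first so the subtraction itself cannot overflow.
            long delta = (long) keys[i] - (long) keys[i - 1];
            if (delta < 0) {
                throw new IllegalStateException("keys not sorted ascending at index " + i
                        + ": " + keys[i - 1] + " > " + keys[i]);
            }
            writeVarLong(dos, delta);
        }
    }
}
```

   With this in place, an unsorted array such as `{9, 3}` is rejected at 
serialization time instead of producing a model file that only misbehaves when 
it is read back.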



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -153,9 +155,13 @@ public static BigramTables readFrom(DataInputStream dis) throws IOException {
         int slots = dis.readInt();
         float bMin = dis.readFloat();
         float bMax = dis.readFloat();
-        byte[] keyBytes = dis.readNBytes(slots * 4);
         int[] keys = new int[slots];
-        ByteBuffer.wrap(keyBytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer().get(keys);
+        if (slots > 0) {
+            keys[0] = dis.readInt();
+            for (int i = 1; i < slots; i++) {
+                keys[i] = (int) (keys[i - 1] + readVarLong(dis));
+            }

Review Comment:
   When reconstructing keys from varint deltas, the code casts the running sum 
to `int` without validating bounds or monotonicity. A malformed/corrupt model 
could overflow and silently wrap, producing unsorted keys and incorrect 
binary-search results. Validate that the reconstructed key stays within the 
`int` range (and ideally remains non-decreasing) and throw `IOException` on 
violation.
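
   One way the read-side validation could look, as a sketch under stated 
assumptions rather than the PR's implementation. `SortedKeyReader`, `readKeys`, 
and the `readVarLong` body are hypothetical stand-ins (the real `readVarLong` 
is not shown in this hunk), and the quantization floats between the count and 
the keys are omitted. Bounding each delta before adding it keeps the `long` 
sum itself from overflowing, so the range check on the result is reliable.

```java
import java.io.DataInputStream;
import java.io.IOException;

final class SortedKeyReader {

    /** Largest delta two int keys can differ by: Integer.MAX_VALUE - Integer.MIN_VALUE. */
    private static final long MAX_DELTA = 0xFFFF_FFFFL;

    /** Hypothetical stand-in for the PR's readVarLong: unsigned LEB128, at most 10 bytes. */
    static long readVarLong(DataInputStream dis) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = dis.readUnsignedByte();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint longer than 10 bytes");
    }

    /** Reads the key count, key[0] raw, then deltas, rejecting out-of-range input. */
    static int[] readKeys(DataInputStream dis) throws IOException {
        int slots = dis.readInt();
        if (slots < 0) {
            throw new IOException("negative key count: " + slots);
        }
        int[] keys = new int[slots];
        if (slots > 0) {
            keys[0] = dis.readInt();
            for (int i = 1; i < slots; i++) {
                long delta = readVarLong(dis);
                if (delta < 0 || delta > MAX_DELTA) {
                    throw new IOException("invalid key delta at index " + i + ": " + delta);
                }
                long next = keys[i - 1] + delta; // cannot overflow long: delta is bounded
                if (next > Integer.MAX_VALUE) {
                    throw new IOException("key out of int range at index " + i + ": " + next);
                }
                keys[i] = (int) next; // non-decreasing because delta >= 0
            }
        }
        return keys;
    }
}
```

   Because every accepted delta is non-negative and the sum is checked before 
the narrowing cast, the returned array is guaranteed to be non-decreasing, 
which is the property the binary search depends on.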



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

