Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3366545572


##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -309,9 +309,19 @@ public NaiveBayesBigramEncodingDetector(InputStream 
modelStream) throws IOExcept
                 for (int bg = 0; bg < BIGRAM_SPACE; bg++) {
                     logP8[bg * numClasses + c] = u;
                 }
-                // Overwrite with trained pairs.
+                // Overwrite with trained pairs. Bigram ids are sorted 
ascending and
+                // stored as varint deltas (LEB128) from the previous id.
+                int bigram = 0;
                 for (int i = 0; i < vocabSize; i++) {
-                    int bigram = dis.readUnsignedShort();
+                    int delta = 0;
+                    int shift = 0;
+                    int b;
+                    do {
+                        b = dis.readUnsignedByte();
+                        delta |= (b & 0x7F) << shift;
+                        shift += 7;
+                    } while ((b & 0x80) != 0);
+                    bigram += delta;
                     byte q = dis.readByte();
                     logP8[bigram * numClasses + c] = q;
                 }

Review Comment:
   The new varint-delta decode loop has no bounds/validation: a malformed model 
can drive `shift` past 31 (bit-shift wraps) and/or produce a `bigram` id 
outside `[0, BIGRAM_SPACE)`, leading to incorrect decoding or 
`ArrayIndexOutOfBoundsException` while loading the model. Add a bounded varint 
decode and explicit range checks that throw `IOException` on corrupt input.



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -167,6 +173,28 @@ public static BigramTables readFrom(DataInputStream dis) 
throws IOException {
                 bMin, bMax, uMin, uMax, uFallback, backoffAlpha);
     }
 
+    /** Writes a non-negative long as an unsigned LEB128 varint. */
+    private static void writeVarLong(DataOutputStream dos, long v) throws 
IOException {
+        while ((v & ~0x7FL) != 0) {
+            dos.writeByte((int) ((v & 0x7F) | 0x80));
+            v >>>= 7;
+        }
+        dos.writeByte((int) v);
+    }
+
+    /** Reads an unsigned LEB128 varint written by {@link #writeVarLong}. */
+    private static long readVarLong(DataInputStream dis) throws IOException {
+        long v = 0;
+        int shift = 0;
+        int b;
+        do {
+            b = dis.readUnsignedByte();
+            v |= (long) (b & 0x7F) << shift;
+            shift += 7;
+        } while ((b & 0x80) != 0);
+        return v;
+    }

Review Comment:
   `readVarLong` can accept arbitrarily long/invalid LEB128 sequences: `shift` 
grows without bound and values beyond 64 bits silently overflow. For 
malformed/corrupt model files this can yield incorrect keys, very large 
allocations downstream, or hard-to-diagnose failures. Add a bounded decode and 
throw an IOException when the varint is too long.



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java:
##########
@@ -749,7 +751,20 @@ private static boolean shouldTryStrip(String contentType) {
         return lower.contains("html") || lower.contains("xml");
     }
 
-    private static byte[] readProbe(TikaInputStream tis) throws IOException {
-        return AdaptiveProbe.read(tis, PROBE_CONTENT_TARGET, PROBE_RAW_CAP);
+    private static byte[] readProbe(TikaInputStream tis, ParseContext 
parseContext)
+            throws IOException {
+        EncodingDetectorContext context = 
parseContext.get(EncodingDetectorContext.class);
+        EncodingProbeCache cache = context == null ? null : 
context.getProbeCache();
+        if (cache != null) {
+            byte[] cached = cache.get(PROBE_CONTENT_TARGET, PROBE_RAW_CAP);
+            if (cached != null) {
+                return cached;
+            }
+        }
+        byte[] probe = AdaptiveProbe.read(tis, PROBE_CONTENT_TARGET, 
PROBE_RAW_CAP);
+        if (cache != null) {
+            cache.put(probe, PROBE_CONTENT_TARGET, PROBE_RAW_CAP);
+        }
+        return probe;

Review Comment:
   `readProbe` now unconditionally dereferences `parseContext` to fetch the 
probe cache, which introduces a new `NullPointerException` if callers invoke 
this detector directly with a null ParseContext (previously harmless because 
the method didn't use it). Guard against null and fall back to the uncached 
path.



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/TrainJunkModel.java:
##########
@@ -1037,12 +1037,8 @@ public static BigramTables buildBigramTablesFromCounts(
         // Quantize unigram log-probs.
         QuantizedFloats qUnigram = quantizeFloats(unigramLogP);
 
-        // --- Build the open-addressing bigram table. ---
-        int slots = nextPowerOfTwo((int) Math.max(2, Math.ceil(keptPairs / 
loadFactor)));
-        int[] keys = new int[slots];
-        java.util.Arrays.fill(keys, BigramTables.EMPTY_KEY);
-        // Compute log-probs first, quantize once, then write into the table
-        // alongside its key.
+        // --- Build the sorted-occupied bigram table (binary-search lookup). 
---
+        // Compute log-probs first, quantize once, then sort by key.
         float[] keptLogP = new float[keptPairs];

Review Comment:
   `buildBigramTablesFromCounts` still accepts a `loadFactor` parameter from 
the old open-addressing implementation, but the new sorted-occupied table 
ignores it. This can confuse callers/readers (and the method comment just above 
still refers to open-addressing). Add a short note here clarifying that 
`loadFactor` is retained for API compatibility and is no longer used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to