Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367764856
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,23 @@ public void writeTo(DataOutputStream dos) throws
IOException {
cpBuf.asIntBuffer().put(codepointIndex);
dos.write(cpBuf.array());
- // Bigram open-addressing table (keys + values).
+ // Bigram table: sorted-occupied keys (ascending) + parallel values.
+ // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+ // deltas are small because the keys are sorted and dense.
dos.writeInt(bigramKeys.length);
dos.writeFloat(bigramQuantMin);
dos.writeFloat(bigramQuantMax);
- ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
- .order(ByteOrder.BIG_ENDIAN);
- keyBuf.asIntBuffer().put(bigramKeys);
- dos.write(keyBuf.array());
+ if (bigramKeys.length > 0) {
+ dos.writeInt(bigramKeys[0]);
+ for (int i = 1; i < bigramKeys.length; i++) {
+ long delta = (long) bigramKeys[i] - (long) bigramKeys[i - 1];
+ if (delta < 0) {
+ throw new IOException("bigramKeys must be sorted
ascending; "
+ + "non-monotonic at index " + i);
+ }
Review Comment:
`writeTo` allows equal adjacent keys (delta==0), but `readFrom` rejects
non-strictly-increasing sequences (`next <= prev`). If duplicate keys ever slip
in (e.g., from training bugs or corrupted models), the writer will happily
serialize a delta of 0 and the reader will then fail to load the model. Align
the invariants by enforcing strictly increasing keys at write time (or relax
the reader).
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -309,9 +308,27 @@ public NaiveBayesBigramEncodingDetector(InputStream
modelStream) throws IOExcept
for (int bg = 0; bg < BIGRAM_SPACE; bg++) {
logP8[bg * numClasses + c] = u;
}
- // Overwrite with trained pairs.
+ // Overwrite with trained pairs. Bigram ids are sorted
ascending and
+ // stored as varint deltas (LEB128) from the previous id.
+ int bigram = 0;
for (int i = 0; i < vocabSize; i++) {
- int bigram = dis.readUnsignedShort();
+ int delta = 0;
+ int shift = 0;
+ int b;
+ do {
+ if (shift >= 32) {
+ throw new IOException(
+ "Malformed varint in bigram-id deltas (too
long)");
+ }
+ b = dis.readUnsignedByte();
+ delta |= (b & 0x7F) << shift;
+ shift += 7;
+ } while ((b & 0x80) != 0);
+ bigram += delta;
+ if (bigram < 0 || bigram >= BIGRAM_SPACE) {
+ throw new IOException("Bigram id out of range: " +
bigram
+ + " (expected [0, " + BIGRAM_SPACE + "))");
+ }
Review Comment:
The varint delta decoder accumulates into an `int` via `delta |= (b & 0x7F)
<< shift`. For malformed/corrupt model files, the last shift (28) can set the
sign bit and overflow `int`, producing negative deltas and a confusing
downstream "out of range" error. Decoding into a `long` and bounds-checking
produces clearer failures and avoids intermediate overflow.
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/TrainJunkModel.java:
##########
@@ -1037,12 +1037,8 @@ public static BigramTables buildBigramTablesFromCounts(
// Quantize unigram log-probs.
QuantizedFloats qUnigram = quantizeFloats(unigramLogP);
- // --- Build the open-addressing bigram table. ---
- int slots = nextPowerOfTwo((int) Math.max(2, Math.ceil(keptPairs /
loadFactor)));
- int[] keys = new int[slots];
- java.util.Arrays.fill(keys, BigramTables.EMPTY_KEY);
- // Compute log-probs first, quantize once, then write into the table
- // alongside its key.
+ // --- Build the sorted-occupied bigram table (binary-search lookup).
---
+ // Compute log-probs first, quantize once, then sort by key.
Review Comment:
This method now builds a sorted-occupied bigram table, but the `loadFactor`
parameter (and the javadoc above) are still from the old open-addressing
implementation and are no longer used. That makes the API misleading and can
trigger unused-parameter warnings for callers/maintainers. Either remove
`loadFactor` (if safe) or add an explicit note that it is retained for
compatibility and ignored.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]