[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086612#comment-18086612
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367764856
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,23 @@ public void writeTo(DataOutputStream dos) throws
IOException {
cpBuf.asIntBuffer().put(codepointIndex);
dos.write(cpBuf.array());
- // Bigram open-addressing table (keys + values).
+ // Bigram table: sorted-occupied keys (ascending) + parallel values.
+ // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+ // deltas are small because the keys are sorted and dense.
dos.writeInt(bigramKeys.length);
dos.writeFloat(bigramQuantMin);
dos.writeFloat(bigramQuantMax);
- ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
- .order(ByteOrder.BIG_ENDIAN);
- keyBuf.asIntBuffer().put(bigramKeys);
- dos.write(keyBuf.array());
+ if (bigramKeys.length > 0) {
+ dos.writeInt(bigramKeys[0]);
+ for (int i = 1; i < bigramKeys.length; i++) {
+ long delta = (long) bigramKeys[i] - (long) bigramKeys[i - 1];
+ if (delta < 0) {
+ throw new IOException("bigramKeys must be sorted
ascending; "
+ + "non-monotonic at index " + i);
+ }
Review Comment:
`writeTo` allows equal adjacent keys (delta==0), but `readFrom` rejects
non-strictly-increasing sequences (`next <= prev`). If duplicate keys ever slip
in (e.g., from training bugs or corrupted models), the writer will happily
serialize a delta of 0 and the reader will then fail to load the model. Align
the invariants by enforcing strictly increasing keys at write time (or relax
the reader).
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -309,9 +308,27 @@ public NaiveBayesBigramEncodingDetector(InputStream
modelStream) throws IOExcept
for (int bg = 0; bg < BIGRAM_SPACE; bg++) {
logP8[bg * numClasses + c] = u;
}
- // Overwrite with trained pairs.
+ // Overwrite with trained pairs. Bigram ids are sorted
ascending and
+ // stored as varint deltas (LEB128) from the previous id.
+ int bigram = 0;
for (int i = 0; i < vocabSize; i++) {
- int bigram = dis.readUnsignedShort();
+ int delta = 0;
+ int shift = 0;
+ int b;
+ do {
+ if (shift >= 32) {
+ throw new IOException(
+ "Malformed varint in bigram-id deltas (too
long)");
+ }
+ b = dis.readUnsignedByte();
+ delta |= (b & 0x7F) << shift;
+ shift += 7;
+ } while ((b & 0x80) != 0);
+ bigram += delta;
+ if (bigram < 0 || bigram >= BIGRAM_SPACE) {
+ throw new IOException("Bigram id out of range: " +
bigram
+ + " (expected [0, " + BIGRAM_SPACE + "))");
+ }
Review Comment:
The varint delta decoder accumulates into an `int` via `delta |= (b & 0x7F)
<< shift`. For malformed/corrupt model files, the last shift (28) can set the
sign bit and overflow `int`, producing negative deltas and a confusing
downstream "out of range" error. Decoding into a `long` and bounds-checking
produces clearer failures and avoids intermediate overflow.
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/TrainJunkModel.java:
##########
@@ -1037,12 +1037,8 @@ public static BigramTables buildBigramTablesFromCounts(
// Quantize unigram log-probs.
QuantizedFloats qUnigram = quantizeFloats(unigramLogP);
- // -
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)