[ https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087654#comment-18087654 ]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2886:
URL: https://github.com/apache/tika/pull/2886#discussion_r3380656656
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -102,6 +114,11 @@ public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) {
}
}
if (nHigh < MIN_HIGH_BYTES) {
+ // Pure UTF-8: no legacy high bytes at all but enough UTF-8 sequences
+ // to be confident. Return 1.0 so the CJK veto fires.
+ if (nHigh == 0 && nUtf8Seqs >= MIN_HIGH_BYTES) {
+ return 1.0;
+ }
return -1.0;
}
Review Comment:
The new “pure UTF-8” special case relies on utf8SequenceLength(), which only
checks lead/continuation byte ranges and can treat structurally-invalid UTF-8
(e.g., surrogate-range sequences) as valid. That can incorrectly trigger the
1.0 veto on inputs that aren’t actually valid UTF-8. Consider additionally
verifying the probe with StructuralEncodingRules.checkUtf8(...) before
returning 1.0.
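
To illustrate the gap described above, here is a minimal sketch using only the JDK. It is not Tika code: `looksLikeUtf8` is a hypothetical stand-in for a check that inspects only lead and continuation byte ranges, and `isStrictUtf8` uses a strict `CharsetDecoder` in place of `StructuralEncodingRules.checkUtf8(...)`, whose actual signature is not shown in this thread. The three bytes `ED A0 80` (an encoded surrogate) pass the first check and fail the second.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8StrictCheck {

    // Hypothetical lenient check: validates only the lead-byte range and
    // that each following byte has the 10xxxxxx continuation pattern.
    static boolean looksLikeUtf8(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            int len;
            if (c < 0x80) {
                len = 1;
            } else if (c >= 0xC2 && c <= 0xDF) {
                len = 2;
            } else if (c >= 0xE0 && c <= 0xEF) {
                len = 3;
            } else if (c >= 0xF0 && c <= 0xF4) {
                len = 4;
            } else {
                return false;
            }
            if (i + len > b.length) {
                return false;
            }
            for (int k = 1; k < len; k++) {
                if ((b[i + k] & 0xC0) != 0x80) {
                    return false;
                }
            }
            i += len;
        }
        return true;
    }

    // Strict check: the JDK decoder reports malformed input, which includes
    // surrogate-range and overlong sequences, instead of replacing it.
    static boolean isStrictUtf8(byte[] b) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println("lenient: " + looksLikeUtf8(surrogate));
        System.out.println("strict:  " + isStrictUtf8(surrogate));
    }
}
```

Gating the 1.0 return on a strict check of this kind would keep the veto from firing on probes that only resemble UTF-8 at the byte-range level.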
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java:
##########
@@ -427,17 +427,30 @@ public List<EncodingResult> detect(byte[] probe, Metadata metadata) {
LOG.trace("mojibuster pool empty -> windows-1252 fallback");
return windows1252Fallback();
}
+ // When the top result is STRUCTURAL (clean UTF-8/UTF-32/ISO-2022 grammar),
+ // return only that one result. JunkFilter must not re-open Mojibuster's
+ // internal ordering and pick a lower-ranked STATISTICAL CJK candidate
+ // over the STRUCTURAL winner on non-languagey content — that was the 11k
+ // regression root cause. With a single STRUCTURAL result, JunkFilter
+ // still arbitrates when *another* detector disagrees (lying HTML headers),
+ // which is the intended use case.
+ //
+ // When the top result is STATISTICAL, keep the full ranked list so that
+ // JunkFilter can arbitrate within-family ambiguities (e.g. GB18030 vs
+ // x-windows-949: NB scores Chinese higher than Korean on JS-heavy files
+ // because ASCII bigram distributions differ between training corpora, but
+ // JunkFilter's language-quality scoring correctly prefers Korean text).
+ EncodingResult top = finalResults.get(0);
+ List<EncodingResult> toReturn = (top.getResultType() == EncodingResult.ResultType.STRUCTURAL)
+ ? List.of(top) : finalResults;
Review Comment:
This changes observable behavior by returning only the top STRUCTURAL
candidate (dropping Mojibuster’s lower-ranked STATISTICAL candidates). Please
add a regression test that asserts this contract (e.g., a pure UTF-8 probe
should yield a single UTF-8 STRUCTURAL result) so the JunkFilter interaction
that motivated this change stays covered.
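
As a rough sketch of the contract such a test would pin down, the snippet below models the selection rule with stand-in types. `Result`, `ResultType`, and `select` are hypothetical and only mirror the shape of the diff above; a real regression test would call `MojibusterEncodingDetector.detect(...)` on a pure UTF-8 probe and assert on the returned `EncodingResult` list.

```java
import java.util.List;

final class StructuralContractSketch {

    enum ResultType { STRUCTURAL, STATISTICAL }

    // Hypothetical stand-in for Tika's EncodingResult.
    static final class Result {
        final String charset;
        final ResultType type;

        Result(String charset, ResultType type) {
            this.charset = charset;
            this.type = type;
        }
    }

    // Mirrors the rule in the diff: a STRUCTURAL winner is returned alone,
    // otherwise the full ranked list is passed through unchanged.
    static List<Result> select(List<Result> ranked) {
        Result top = ranked.get(0);
        return top.type == ResultType.STRUCTURAL ? List.of(top) : ranked;
    }

    public static void main(String[] args) {
        List<Result> ranked = List.of(
                new Result("UTF-8", ResultType.STRUCTURAL),
                new Result("GB18030", ResultType.STATISTICAL));
        List<Result> out = select(ranked);
        if (out.size() != 1 || !"UTF-8".equals(out.get(0).charset)) {
            throw new AssertionError("expected a single UTF-8 STRUCTURAL result");
        }
    }
}
```

The same pair of assertions (list size of one, top charset UTF-8 with STRUCTURAL type) is what would guard against the lower-ranked STATISTICAL candidates reappearing.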
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -379,6 +391,13 @@ private static boolean isWhitespace(int b) {
|| b == 0x0d || b == 0x20;
}
+ // BETA-1 WORKAROUND: bigrams containing these HTML/JS markup chars are
+ // over-represented in GB18030 training data and cause misclassification.
+ // Suppressed only for GB18030 in scoreClassesAndCount.
+ static boolean isOffendingAscii(int b) {
+ return b == '{' || b == '"' || b == '&' || b == '<' || b == '>';
+ }
Review Comment:
isOffendingAscii() is only used inside this class; making it private avoids
expanding the class’s surface area unnecessarily.
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.