[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Tue, 09 Jun 2026 05:43:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087654#comment-18087654
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2886:
URL: https://github.com/apache/tika/pull/2886#discussion_r3380656656


##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -102,6 +114,11 @@ public static double strippedFailureRate(byte[] bytes, 
Charset cjkCharset) {
             }
         }
         if (nHigh < MIN_HIGH_BYTES) {
+            // Pure UTF-8: no legacy high bytes at all but enough UTF-8 
sequences
+            // to be confident.  Return 1.0 so the CJK veto fires.
+            if (nHigh == 0 && nUtf8Seqs >= MIN_HIGH_BYTES) {
+                return 1.0;
+            }
             return -1.0;
         }

Review Comment:
   The new “pure UTF-8” special case relies on utf8SequenceLength(), which only 
checks lead/continuation byte ranges and can treat structurally-invalid UTF-8 
(e.g., surrogate-range sequences) as valid. That can incorrectly trigger the 
1.0 veto on inputs that aren’t actually valid UTF-8. Consider additionally 
verifying the probe with StructuralEncodingRules.checkUtf8(...) before 
returning 1.0.



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java:
##########
@@ -427,17 +427,30 @@ public List<EncodingResult> detect(byte[] probe, Metadata 
metadata) {
             LOG.trace("mojibuster pool empty -> windows-1252 fallback");
             return windows1252Fallback();
         }
+        // When the top result is STRUCTURAL (clean UTF-8/UTF-32/ISO-2022 
grammar),
+        // return only that one result.  JunkFilter must not re-open 
Mojibuster's
+        // internal ordering and pick a lower-ranked STATISTICAL CJK candidate
+        // over the STRUCTURAL winner on non-languagey content — that was the 
11k
+        // regression root cause.  With a single STRUCTURAL result, JunkFilter
+        // still arbitrates when *another* detector disagrees (lying HTML 
headers),
+        // which is the intended use case.
+        //
+        // When the top result is STATISTICAL, keep the full ranked list so 
that
+        // JunkFilter can arbitrate within-family ambiguities (e.g. GB18030 vs
+        // x-windows-949: NB scores Chinese higher than Korean on JS-heavy 
files
+        // because ASCII bigram distributions differ between training corpora, 
but
+        // JunkFilter's language-quality scoring correctly prefers Korean 
text).
+        EncodingResult top = finalResults.get(0);
+        List<EncodingResult> toReturn = (top.getResultType() == 
EncodingResult.ResultType.STRUCTURAL)
+                ? List.of(top) : finalResults;

Review Comment:
   This changes observable behavior by returning only the top STRUCTURAL 
candidate (dropping Mojibuster’s lower-ranked STATISTICAL candidates). Please 
add a regression test that asserts this contract (e.g., a pure UTF-8 probe 
should yield a single UTF-8 STRUCTURAL result) so the JunkFilter interaction 
that motivated this change stays covered.



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -379,6 +391,13 @@ private static boolean isWhitespace(int b) {
                 || b == 0x0d || b == 0x20;
     }
 
+    // BETA-1 WORKAROUND: bigrams containing these HTML/JS markup chars are
+    // over-represented in GB18030 training data and cause misclassification.
+    // Suppressed only for GB18030 in scoreClassesAndCount.
+    static boolean isOffendingAscii(int b) {
+        return b == '{' || b == '"' || b == '&' || b == '<' || b == '>';
+    }

Review Comment:
   isOffendingAscii() is only used inside this class; making it private avoids 
expanding the class’ surface area unnecessarily.





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

Reply via email to