[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Fri, 05 Jun 2026 07:15:57 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086381#comment-18086381
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363223702


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
      * once when scanning the text (avoiding a redundant binary search per
      * codepoint).
      */
+    /** Small per-bigram log-prob penalty subtracted from the case-folded
+     *  (lowercase) value when scoring an uppercase pair.  All-caps is a genuinely
+     *  weaker/rarer signal than lowercase, so it should score a hair BELOW its
+     *  lowercase form, not equal to it — and the margin guards the edge case where
+     *  an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+     *  bigram would otherwise score like real lowercase text.  Kept small (0.25):
+     *  the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+     *  0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+     *  "somewhat less languagey" principle. */
+    private static final double CASE_FOLD_PENALTY = 0.25;
+
     private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
                                          BigramTables tables) {
+        double direct = Double.NaN;
         if (idxA >= 0 && idxB >= 0) {
             int slot = lookupBigramSlot(tables, idxA, idxB);
             if (slot >= 0) {
-                return dequantize(tables.bigramValues[slot],
+                direct = dequantize(tables.bigramValues[slot],
                         tables.bigramQuantMin, tables.bigramQuantMax);
             }
         }
+        // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+        // a SEEN lowercase pair is real text wearing a different case (all-caps
+        // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+        // Score it as the BETTER of its own log-prob and its lowercase twin's —
+        // i.e. max(direct, fold).  max (not fold-only-on-miss) is essential: real
+        // all-caps bigrams ARE present in training (from headings) but rare, so the
+        // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+        // otherwise bypass the fold and floor.  This is the discriminator raw
+        // probability cannot be: all-caps real text and all-caps mojibake are both
+        // improbable, but only real text has a SEEN lowercase twin.  Gated on BOTH
+        // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+        // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+        // is borrowed when that pair is actually seen, so all-caps mojibake
+        // (lowercase form also unseen) floors.
+        // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+        // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+        // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+        // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+        // still floors).  Each uppercase letter is folded; sentinels/digits/glue
+        // pass through unchanged.  Folding the edges too is what fully rescues short
+        // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+        // uppercase-letter unigram backoff.
+        if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+            int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+            int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;

Review Comment:
   `JunkDetector` uses TOKEN_START/TOKEN_END sentinels with values above 
`Character.MAX_CODE_POINT` (0x10FFFF). In `scorePairF1`, these sentinels are 
passed to `Character.isUpperCase/isLowerCase/toLowerCase`, whose Javadoc does 
not define behavior for values outside the valid code point range. The result 
is therefore implementation-dependent, and a JDK that rejected such input with 
`IllegalArgumentException` would crash scoring for any text (sentinels are 
emitted for every run). Guard the `Character.*` calls with 
`Character.isValidCodePoint(...)` so the sentinels are treated as non-letters 
while still allowing edge bigram folding when the *other* side is uppercase.
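
A minimal sketch of the suggested guard, offered as an illustration and not as the PR's actual code. The sentinel values and the helper names (`isUpper`, `isLower`, `foldIfUpper`, `shouldFold`) are assumptions made for this example; only the `Character.*` calls are real JDK API.

```java
public class SentinelSafeFold {
    // Hypothetical sentinel values above Character.MAX_CODE_POINT (0x10FFFF);
    // the real TOKEN_START/TOKEN_END constants in JunkDetector may differ.
    static final int TOKEN_START = 0x110000;
    static final int TOKEN_END = 0x110001;

    // Only consult Character.* for valid code points; sentinels count as non-letters.
    static boolean isUpper(int cp) {
        return Character.isValidCodePoint(cp) && Character.isUpperCase(cp);
    }

    static boolean isLower(int cp) {
        return Character.isValidCodePoint(cp) && Character.isLowerCase(cp);
    }

    // Fold uppercase letters to lowercase; everything else passes through unchanged.
    static int foldIfUpper(int cp) {
        return isUpper(cp) ? Character.toLowerCase(cp) : cp;
    }

    // Gate from the diff: at least one uppercase letter and no lowercase letter.
    static boolean shouldFold(int cpA, int cpB) {
        return (isUpper(cpA) || isUpper(cpB)) && !isLower(cpA) && !isLower(cpB);
    }

    public static void main(String[] args) {
        int capitalEm = 0x041C; // Cyrillic capital letter Em
        // Edge pair: sentinel followed by an uppercase letter still folds.
        System.out.println(shouldFold(TOKEN_START, capitalEm));
        // Mixed-case pair: the lowercase letter trips the gate.
        System.out.println(shouldFold('a', 'B'));
        // The sentinel itself is never handed to Character.toLowerCase.
        System.out.println(foldIfUpper(TOKEN_END) == TOKEN_END);
    }
}
```

With helpers like these, the gate and the two `lcA`/`lcB` assignments in `scorePairF1` would keep their current shape and only swap the direct `Character.*` calls for the guarded versions.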



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java:
##########
@@ -286,6 +286,41 @@ public void testHtml5Charset() throws Exception {
         assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
+    /**
+     * A page that declares {@code charset=euc-kr} but actually uses UHC (MS949)
+     * extension Hangul must be decoded with the superset {@code x-windows-949},
+     * not strict EUC-KR (which U+FFFDs the extension syllables).  Mirrors the
+     * promotion {@code AutoDetectReader} already applies for non-HTML.
+     * {@code CONTENT_ENCODING} still reports the detected charset; the superset
+     * actually used is recorded in {@code DECODED_CHARSET}.
+     *
+     * @see org.apache.tika.detect.CharsetSupersets
+     */
+    @Test
+    public void testEucKrPromotedToMs949Superset() throws Exception {
+        // U+AC02 is outside EUC-KR (KS X 1001) but inside x-windows-949 (MS949);
+        // its MS949 bytes 0x81 0x41 decode to U+FFFD followed by 'A' under strict
+        // EUC-KR, so a correct decode proves the superset was used.  U+D55C U+AD6D
+        // ("Korea") is a normal EUC-KR syllable pair.
+        String test = "<html><head><meta charset=\"euc-kr\" />" +
+                "<title>title</title></head>" +
+                "<body><p>\uAC02 \uD55C\uAD6D</p></body></html>";
+        Metadata metadata = new Metadata();
+        BodyContentHandler handler = new BodyContentHandler();
+        try (TikaInputStream tis = TikaInputStream.get(test.getBytes("x-windows-949"))) {
+            new JSoupParser().parse(tis, handler, metadata, new ParseContext());
+        }
+        // Metadata reports the *detected* charset ...
+        assertEquals("EUC-KR", metadata.get(Metadata.CONTENT_ENCODING));
+        // ... but decoding used the superset, recorded in DECODED_CHARSET.
+        assertEquals("x-windows-949", metadata.get(TikaCoreProperties.DECODED_CHARSET));

Review Comment:
   The assertion hard-codes `"x-windows-949"`, but `Charset.name()` returns the 
JVM's canonical name, which can vary (e.g., `windows-949` vs `x-windows-949`). 
Since `JSoupParser` stores `decodeAs.name()`, the test should compare against 
the runtime canonical name for the charset to avoid platform-dependent failures.
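
One way to do this, sketched under the assumption that the test only needs the canonical name the running JVM reports. The class and method names below are hypothetical; `Charset.forName`, `Charset.name` and `Charset.isSupported` are standard `java.nio.charset` API.

```java
import java.nio.charset.Charset;

public class CanonicalCharsetName {
    // Resolve any alias to the canonical name reported by the running JVM.
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        // "UTF8" is a standard alias; the JVM reports its canonical name.
        System.out.println(canonicalName("UTF8"));
        // x-windows-949 is an extended charset and may be missing on a minimal
        // runtime, so check for support before resolving it.
        if (Charset.isSupported("x-windows-949")) {
            System.out.println(canonicalName("x-windows-949"));
        }
    }
}
```

In the test, the expected value could then be computed as `Charset.forName("x-windows-949").name()` instead of being written as a string literal.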





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
