Re: [PR] TIKA-4745-follow-on-junk-improvements [tika]

via GitHub Fri, 05 Jun 2026 07:15:53 -0700


Copilot commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363223702



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
      * once when scanning the text (avoiding a redundant binary search per
      * codepoint).
      */
+    /** Small per-bigram log-prob penalty subtracted from the case-folded
+     *  (lowercase) value when scoring an uppercase pair.  All-caps is a genuinely
+     *  weaker/rarer signal than lowercase, so it should score a hair BELOW its
+     *  lowercase form, not equal to it — and the margin guards the edge case where
+     *  an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+     *  bigram would otherwise score like real lowercase text.  Kept small (0.25):
+     *  the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+     *  0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+     *  "somewhat less languagey" principle. */
+    private static final double CASE_FOLD_PENALTY = 0.25;
+
     private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
                                          BigramTables tables) {
+        double direct = Double.NaN;
         if (idxA >= 0 && idxB >= 0) {
             int slot = lookupBigramSlot(tables, idxA, idxB);
             if (slot >= 0) {
-                return dequantize(tables.bigramValues[slot],
+                direct = dequantize(tables.bigramValues[slot],
                         tables.bigramQuantMin, tables.bigramQuantMax);
             }
         }
+        // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+        // a SEEN lowercase pair is real text wearing a different case (all-caps
+        // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+        // Score it as the BETTER of its own log-prob and its lowercase twin's —
+        // i.e. max(direct, fold).  max (not fold-only-on-miss) is essential: real
+        // all-caps bigrams ARE present in training (from headings) but rare, so the
+        // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+        // otherwise bypass the fold and floor.  This is the discriminator raw
+        // probability cannot be: all-caps real text and all-caps mojibake are both
+        // improbable, but only real text has a SEEN lowercase twin.  Gated on BOTH
+        // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+        // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+        // is borrowed when that pair is actually seen, so all-caps mojibake
+        // (lowercase form also unseen) floors.
+        // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+        // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+        // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+        // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+        // still floors).  Each uppercase letter is folded; sentinels/digits/glue
+        // pass through unchanged.  Folding the edges too is what fully rescues short
+        // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+        // uppercase-letter unigram backoff.
+        if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+            int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+            int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;

Review Comment:
   `JunkDetector` uses TOKEN_START/TOKEN_END sentinels whose values lie above
`Character.MAX_CODE_POINT` (0x10FFFF). In `scorePairF1`, calling
`Character.isUpperCase`, `Character.isLowerCase`, or `Character.toLowerCase` on
these sentinel values can throw `IllegalArgumentException` on some JDKs. Because
sentinels are emitted for every run, that would crash scoring for any text.
Guard the `Character.*` calls with `Character.isValidCodePoint(...)` so the
sentinels are treated as non-letters, while still allowing edge-bigram folding
when the *other* side is uppercase.
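
   A minimal sketch of the suggested guard, not the PR's actual code. The class
name, the helper names, and the two sentinel values are assumptions made for
illustration; the real constants are defined in `JunkDetector`. Only the
`Character.*` calls are standard JDK API.

```java
final class SentinelSafeFold {
    // Hypothetical sentinel values just above the valid code-point range.
    static final int TOKEN_START = Character.MAX_CODE_POINT + 1;
    static final int TOKEN_END = Character.MAX_CODE_POINT + 2;

    // Sentinels (and any other out-of-range int) are never treated as letters.
    static boolean isUpper(int cp) {
        return Character.isValidCodePoint(cp) && Character.isUpperCase(cp);
    }

    static boolean isLower(int cp) {
        return Character.isValidCodePoint(cp) && Character.isLowerCase(cp);
    }

    // Same gate as the diff: at least one uppercase letter and no lowercase letter.
    static boolean foldable(int cpA, int cpB) {
        return (isUpper(cpA) || isUpper(cpB)) && !isLower(cpA) && !isLower(cpB);
    }

    // Uppercase letters are folded; sentinels, digits, and glue pass through unchanged.
    static int fold(int cp) {
        return isUpper(cp) ? Character.toLowerCase(cp) : cp;
    }
}
```

   With this shape an edge pair such as (TOKEN_START, 'М') still passes the gate
because the letter side is uppercase, while the sentinel itself never reaches
an unguarded `Character.*` call.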



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java:
##########
@@ -286,6 +286,41 @@ public void testHtml5Charset() throws Exception {
         assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
+    /**
+     * A page that declares {@code charset=euc-kr} but actually uses UHC (MS949)
+     * extension Hangul must be decoded with the superset {@code x-windows-949},
+     * not strict EUC-KR (which U+FFFDs the extension syllables).  Mirrors the
+     * promotion {@code AutoDetectReader} already applies for non-HTML.
+     * {@code CONTENT_ENCODING} still reports the detected charset; the superset
+     * actually used is recorded in {@code DECODED_CHARSET}.
+     *
+     * @see org.apache.tika.detect.CharsetSupersets
+     */
+    @Test
+    public void testEucKrPromotedToMs949Superset() throws Exception {
+        // U+AC02 is outside EUC-KR (KS X 1001) but inside x-windows-949 (MS949);
+        // its MS949 bytes 0x81 0x41 decode to U+FFFD followed by 'A' under strict
+        // EUC-KR, so a correct decode proves the superset was used.  U+D55C U+AD6D
+        // ("Korea") is a normal EUC-KR syllable pair.
+        String test = "<html><head><meta charset=\"euc-kr\" />" +
+                "<title>title</title></head>" +
+                "<body><p>\uAC02 \uD55C\uAD6D</p></body></html>";
+        Metadata metadata = new Metadata();
+        BodyContentHandler handler = new BodyContentHandler();
+        try (TikaInputStream tis = TikaInputStream.get(test.getBytes("x-windows-949"))) {
+            new JSoupParser().parse(tis, handler, metadata, new ParseContext());
+        }
+        // Metadata reports the *detected* charset ...
+        assertEquals("EUC-KR", metadata.get(Metadata.CONTENT_ENCODING));
+        // ... but decoding used the superset, recorded in DECODED_CHARSET.
+        assertEquals("x-windows-949", metadata.get(TikaCoreProperties.DECODED_CHARSET));

Review Comment:
   The assertion hard-codes `"x-windows-949"`, but `Charset.name()` returns the
JVM's canonical name, which can vary (e.g., `windows-949` vs `x-windows-949`).
Since `JSoupParser` stores `decodeAs.name()`, the test should compare against
the canonical name resolved at runtime for that charset, so that it does not
fail on platforms that report a different name.
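
   One way to do that, sketched under assumptions: the helper class and method
name below are hypothetical, and the commented-out assertion presumes the
running JVM ships the `x-windows-949` charset. `Charset.isSupported`,
`Charset.forName`, and `Charset.name` are standard JDK API.

```java
import java.nio.charset.Charset;

final class CanonicalCharsetName {
    /**
     * Resolves an alias to the canonical name this JVM reports,
     * or returns null when the charset is not installed.
     */
    static String canonicalNameOrNull(String alias) {
        return Charset.isSupported(alias) ? Charset.forName(alias).name() : null;
    }
}

// Possible use in the test, replacing the hard-coded literal:
//   assertEquals(Charset.forName("x-windows-949").name(),
//           metadata.get(TikaCoreProperties.DECODED_CHARSET));
```

   Resolving the expected value through `Charset.forName` keeps the assertion
aligned with whatever `decodeAs.name()` yields on the same JVM.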



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
