[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Fri, 05 Jun 2026 08:51:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086413#comment-18086413
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

tballison commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363772214


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
      * once when scanning the text (avoiding a redundant binary search per
      * codepoint).
      */
+    /** Small per-bigram log-prob penalty subtracted from the case-folded
+     *  (lowercase) value when scoring an uppercase pair.  All-caps is a genuinely
+     *  weaker/rarer signal than lowercase, so it should score a hair BELOW its
+     *  lowercase form, not equal to it — and the margin guards the edge case where
+     *  an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+     *  bigram would otherwise score like real lowercase text.  Kept small (0.25):
+     *  the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+     *  0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+     *  "somewhat less languagey" principle. */
+    private static final double CASE_FOLD_PENALTY = 0.25;
+
     private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
                                          BigramTables tables) {
+        double direct = Double.NaN;
         if (idxA >= 0 && idxB >= 0) {
             int slot = lookupBigramSlot(tables, idxA, idxB);
             if (slot >= 0) {
-                return dequantize(tables.bigramValues[slot],
+                direct = dequantize(tables.bigramValues[slot],
                         tables.bigramQuantMin, tables.bigramQuantMax);
             }
         }
+        // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+        // a SEEN lowercase pair is real text wearing a different case (all-caps
+        // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+        // Score it as the BETTER of its own log-prob and its lowercase twin's —
+        // i.e. max(direct, fold).  max (not fold-only-on-miss) is essential: real
+        // all-caps bigrams ARE present in training (from headings) but rare, so the
+        // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+        // otherwise bypass the fold and floor.  This is the discriminator raw
+        // probability cannot be: all-caps real text and all-caps mojibake are both
+        // improbable, but only real text has a SEEN lowercase twin.  Gated on BOTH
+        // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+        // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+        // is borrowed when that pair is actually seen, so all-caps mojibake
+        // (lowercase form also unseen) floors.
+        // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+        // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+        // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+        // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+        // still floors).  Each uppercase letter is folded; sentinels/digits/glue
+        // pass through unchanged.  Folding the edges too is what fully rescues short
+        // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+        // uppercase-letter unigram backoff.
+        if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+            int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+            int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;

Review Comment:
   I couldn't replicate this, and other agents disagree.
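
   To make the behavior under discussion easier to check, here is a minimal,
   standalone sketch of the gate and the max(direct, fold) rule described in
   the code comment of the hunk. It is not the PR's implementation: the class
   name, the Map-based table (standing in for BigramTables), the FLOOR value,
   and the exact point at which CASE_FOLD_PENALTY is subtracted are all
   assumptions, since the hunk is cut off before the fold lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names other than CASE_FOLD_PENALTY are not from the PR. */
final class CaseFoldSketch {
    static final double CASE_FOLD_PENALTY = 0.25;
    // Assumed floor for unseen pairs; the real backoff value is not shown in the hunk.
    static final double FLOOR = -20.0;

    /** Gate from the hunk: at least one uppercase letter and no lowercase letter. */
    static boolean foldGate(int cpA, int cpB) {
        return (Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB);
    }

    /** Folds an uppercase codepoint; sentinels, digits and glue pass through. */
    static int fold(int cp) {
        return Character.isUpperCase(cp) ? Character.toLowerCase(cp) : cp;
    }

    /** Builds a two-codepoint key for the stand-in table. */
    static String key(int cpA, int cpB) {
        return new StringBuilder().appendCodePoint(cpA).appendCodePoint(cpB).toString();
    }

    /**
     * Returns the better of the direct log-prob and the penalized lowercase
     * twin's log-prob; falls back to FLOOR when neither pair was seen.
     */
    static double score(Map<String, Double> table, int cpA, int cpB) {
        Double direct = table.get(key(cpA, cpB));
        double best = (direct == null) ? Double.NaN : direct;
        if (foldGate(cpA, cpB)) {
            Double folded = table.get(key(fold(cpA), fold(cpB)));
            if (folded != null) {
                // Assumption: the penalty is applied to the folded value before the max.
                double penalized = folded - CASE_FOLD_PENALTY;
                best = Double.isNaN(best) ? penalized : Math.max(best, penalized);
            }
        }
        return Double.isNaN(best) ? FLOOR : best;
    }

    public static void main(String[] args) {
        Map<String, Double> table = new HashMap<>();
        table.put(key(0x041C, 0x0423), -12.4); // uppercase Cyrillic pair (U+041C, U+0423)
        table.put(key(0x043C, 0x0443), -6.7);  // its lowercase twin (U+043C, U+0443)
        System.out.println(score(table, 0x041C, 0x0423));
    }
}
```

   With the values quoted in the comment (direct -12.4, lowercase twin -6.7),
   this sketch scores the uppercase pair at about -6.95 rather than -12.4,
   while a mixed-case pair such as "tH" never reaches the fold lookup.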





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
