Copilot commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3353345071


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java:
##########
@@ -274,6 +348,154 @@ public List<EncodingResult> detect(TikaInputStream tis, Metadata metadata,
         return List.of(new EncodingResult(champion, confidence));
     }
 
+    /** Minimum diff-z margin by which the other family must beat the champion's
+     *  family before the family gate overrides.  Large enough to ignore the
+     *  noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */
+    private static final double FAMILY_DIFF_MARGIN = 2.0;
+
+    private static boolean isCjkCharset(String name) {
+        String n = name.toLowerCase(java.util.Locale.ROOT);
+        return n.contains("gb") || n.contains("big5") || n.contains("euc")
+                || n.contains("shift") || n.contains("jis") || n.contains("2022")
+                || n.contains("949");
+    }
+
+    /** Highest whole-text-z candidate within the requested family (CJK or not). */
+    private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean cjk) {
+        Charset best = null;
+        double bz = Double.NEGATIVE_INFINITY;
+        for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) {
+            if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) {
+                bz = e.getValue();
+                best = e.getKey();
+            }
+        }
+        return best;
+    }
+
+    /** Script-letter "diff" content: codepoints &ge; 0x80 that are letters/
+     *  ideographs — the high bytes where candidate decodes differ.  Shared ASCII
+     *  and non-ASCII punctuation/symbols are dropped (they dilute toward a
+     *  COMMON-dominated tie).  Used only for the CJK-vs-non-CJK family gate. */
+    private static String scriptLetters(String s) {
+        StringBuilder b = new StringBuilder();
+        s.codePoints().forEach(c -> {
+            if (c >= 0x80 && Character.isLetter(c)) {
+                b.appendCodePoint(c);
+            }
+        });
+        return b.toString();
+    }
+
+    /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. */
+    private static final String WIN1252 = "windows-1252";
+
+    /** Latin single-byte charsets the within-Latin letter gate may arbitrate.
+     *  EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek
+     *  -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased
+     *  letters would pollute the count, and all multi-byte CJK (the family
+     *  gate's territory). */
+    private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList(
+            "windows-1252", "windows-1250", "windows-1254", "windows-1257", "windows-1258",
+            "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-9",
+            "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
+            "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", "IBM863", "IBM865",
+            "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", "x-MacIceland"));
+
+    /** Probe must have at least this many high bytes for the gate to act —
+     *  below it the letter gap is noise (most over-picks are sparse). */
+    private static final int LATIN_GATE_MIN_HIGH_BYTES = 16;
+    /** windows-1252 must win the cased-letter count by &gt; max(FLOOR, FRACTION
+     *  * highBytes).  The margin lets the gate cover Central-European / DOS
+     *  siblings safely — genuine CE text wins MORE letters under its true
+     *  charset so the gate stays silent — without the tie-flip that forces the
+     *  mojibuster Western-Latin fallback to scope itself out of those families. */
+    private static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
+    private static final double LATIN_GATE_MARGIN_FRACTION = 0.20;
+
+    /**
+     * Within-Latin letter-plausibility gate (demote-only).  Demotes {@code
+     * champion} to windows-1252 only when windows-1252 is a live candidate, both
+     * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes
+     * clearly MORE cased high-byte letters than the champion — the box-drawing
+     * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to
+     * symbols.  The compare is directional: a genuine Central-European / DOS doc
+     * wins MORE letters under its true charset, so the gate leaves it untouched.
+     * Latin-scoped so it never crosses the CJK boundary (the family gate above)
+     * or touches non-Latin SBCS.  Returns the (possibly demoted) charset.
+     */
+    static Charset applyLatinLetterGate(byte[] probe, Charset champion,
+                                        Set<Charset> candidates) {
+        String name = champion.name();
+        if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) {
+            return champion;
+        }
+        Charset win = null;
+        for (Charset c : candidates) {
+            if (WIN1252.equals(c.name())) {
+                win = c;
+                break;
+            }
+        }
+        if (win == null) {
+            return champion;
+        }
+        int high = HighByteLetterStats.countHighBytes(probe);
+        if (high < LATIN_GATE_MIN_HIGH_BYTES) {
+            return champion;
+        }
+        int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win);
+        int champLetters = HighByteLetterStats.countCasedHighByteLetters(probe, champion);
+        double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high);
+        if (winLetters > champLetters + margin) {
+            LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased high-byte "
+                    + "letters {} vs {}, high={})", name, champLetters, winLetters, high);

Review Comment:
   The trace log arguments are in a misleading order. The message reads “letters … vs …” in the context of demoting to windows-1252, yet it logs `champLetters` first and `winLetters` second. As a result the logged counts don’t match the intended narrative, which makes the gate harder to troubleshoot.
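
   One possible way to remove the ambiguity, sketched here as a standalone example rather than as a patch to the PR: label each count with the charset it belongs to, so the argument order can no longer mislead. The class name `LatinGateLogSketch` and the helpers `shouldDemote` and `describeDemotion` are hypothetical and exist only for illustration. The two margin constants are copied from the diff above, and `String.format` stands in for the `LOG.trace` call so the sketch runs without a logging dependency.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the PR.
final class LatinGateLogSketch {

    // Values copied from the diff above.
    static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
    static final double LATIN_GATE_MARGIN_FRACTION = 0.20;

    /** Same comparison as the gate: windows-1252 must lead by more than the margin. */
    static boolean shouldDemote(int winLetters, int champLetters, int high) {
        double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high);
        return winLetters > champLetters + margin;
    }

    /** Assumed message layout: every count carries the name of the charset it was measured under. */
    static String describeDemotion(String championName, int champLetters,
                                   int winLetters, int high) {
        return String.format(Locale.ROOT,
                "junk-filter latin gate: %s -> windows-1252 (cased high-byte letters: "
                        + "%s=%d, windows-1252=%d, high=%d)",
                championName, championName, champLetters, winLetters, high);
    }

    public static void main(String[] args) {
        // 40 high bytes -> margin = max(6.0, 0.20 * 40) = 8.0; 30 > 12 + 8.0, so the gate demotes.
        if (shouldDemote(30, 12, 40)) {
            System.out.println(describeDemotion("IBM850", 12, 30, 40));
        }
    }
}
```

   With labelled counts, a reader of the trace output does not need to know which placeholder maps to which variable. Whether the PR keeps the "{} vs {}" wording and simply reorders the arguments, or switches to labels like these, is a choice for the author.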


