(tika) branch main updated: TIKA-4745 - add cohort-specific caps (#2848)

tallison Fri, 29 May 2026 06:52:49 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git



The following commit(s) were added to refs/heads/main by this push:
     new 499e703579 TIKA-4745 - add cohort-specific caps (#2848)
499e703579 is described below

commit 499e70357937dbe0cb7ecc559b7d8e8172b609ac
Author: Tim Allison <[email protected]>
AuthorDate: Fri May 29 09:52:32 2026 -0400

    TIKA-4745 - add cohort-specific caps (#2848)
---
 .skills/dev.md                                     |  16 ++
 .skills/tika-eval-encoding-regression.md           | 167 +++++++++++++++++++++
 .../NaiveBayesBigramEncodingDetector.java          | 126 +++++++++++++---
 .../org/apache/tika/parser/apple/IWorkTest.java    |  32 ----
 4 files changed, 286 insertions(+), 55 deletions(-)
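
A note on the detector change in this commit: the per-bigram cap now takes its reference from the best-scoring class outside the top class's cohort, rather than from the overall runner-up. The block below is a minimal sketch of that idea for a single bigram, not the committed implementation. `CohortCapSketch`, `capContributions`, and the three-value `Cohort` enum are hypothetical names used only for illustration, and the final clipping assignment is an assumption, since the hunk ends before the loop body is shown.

```java
import java.util.Arrays;

public class CohortCapSketch {

    /** Hypothetical subset of the detector's Cohort enum. */
    public enum Cohort { LATIN, CJK, UTF }

    /**
     * Clips every contribution that exceeds (best cross-cohort contribution + capNats).
     * Returns a new array; the input is left untouched.
     */
    public static double[] capContributions(double[] contributions,
                                            Cohort[] cohorts,
                                            double capNats) {
        double[] out = contributions.clone();

        // Top-scoring class for this bigram (logPs are negative, so "top" = least negative).
        int topClass = 0;
        for (int c = 1; c < out.length; c++) {
            if (out[c] > out[topClass]) {
                topClass = c;
            }
        }

        // Best contribution from any class outside the top class's cohort.
        double bestCrossCohort = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < out.length; c++) {
            if (cohorts[c] != cohorts[topClass] && out[c] > bestCrossCohort) {
                bestCrossCohort = out[c];
            }
        }
        if (bestCrossCohort == Double.NEGATIVE_INFINITY) {
            // The commit rejects single-cohort models at load time; the sketch rejects them here.
            throw new IllegalArgumentException("need at least two cohorts");
        }

        // Assumption: capped classes are clipped down to capValue.
        double capValue = bestCrossCohort + capNats;
        for (int c = 0; c < out.length; c++) {
            if (out[c] > capValue) {
                out[c] = capValue;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] contributions = {-1.0, -2.0, -30.0};
        Cohort[] cohorts = {Cohort.LATIN, Cohort.LATIN, Cohort.CJK};
        // Prints [-20.0, -20.0, -30.0]
        System.out.println(Arrays.toString(capContributions(contributions, cohorts, 10.0)));
    }
}
```

In this example the two LATIN classes sit close together, so a cap measured against the overall runner-up (-2.0) would never engage. Measured against the best CJK class (-30.0) with a 10-nat cap, both LATIN contributions are clipped to -20.0.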

diff --git a/.skills/dev.md b/.skills/dev.md
index fff9b748e1..52ae923460 100644
--- a/.skills/dev.md
+++ b/.skills/dev.md
@@ -86,6 +86,12 @@ provide the suggested commit message for the user to execute.
 - Spotless formatter runs during build — don't fight it
 - Tests use `@TempDir Path tmp` for temp directories
 - No emojis in code or comments
+- **No local/machine-specific paths** in committed code, tests, docs, or
+  config — never `/home/<user>`, `/Users/<user>`, `C:\Users\<user>`, or a
+  personal `~/data/...`.  Use a placeholder (`<workdir>/`, `<corpus>`),
+  `@TempDir`, or an in-repo `src/test/resources` fixture instead.  *Only*
+  legitimate exception: a path that is the data under test (e.g. an expected
+  metadata value extracted from a test document) — leave those untouched.
 
 ## Testing an End-to-End Change
 
@@ -104,3 +110,13 @@ See `.skills/tika-eval-compare.md` for the full procedure.
 ./mvnw clean test -pl <module> \
   -Dmaven.repo.local=$(pwd)/.local_m2_repo
 ```
+
+Scan the staged diff for machine-specific local paths before committing
+(see Code Conventions). Added lines only; review any hit by hand — a test
+fixture's expected value is allowed, a real config/doc/code path is not:
+
+```bash
+git diff --cached -U0 | grep -E '^\+' \
+  | grep -nE '/home/[A-Za-z0-9._-]+|/Users/[A-Za-z0-9._-]+|[A-Za-z]:\\+Users|~/data/' \
+  && echo "^ local path in staged diff — replace with a placeholder/fixture"
+```
diff --git a/.skills/tika-eval-encoding-regression.md b/.skills/tika-eval-encoding-regression.md
new file mode 100644
index 0000000000..1d3e61a67c
--- /dev/null
+++ b/.skills/tika-eval-encoding-regression.md
@@ -0,0 +1,167 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o <workdir>/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o <workdir>/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a <workdir>/extracts/A -b <workdir>/extracts/B -d <workdir>/extracts/A-vs-B -r -rd <workdir>/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"bom-detector": {}},
+    {"html-encoding-detector": {}},
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
+  ]
+}
+```
+
+### Per-detector isolation configs
+
+Each detector wired alone lives in `<workdir>/configs/`:
+`tika-config-bom.json`, `tika-config-html.json`, `tika-config-htmlstandard.json`,
+`tika-config-universal.json`, `tika-config-icu4j.json`,
+`tika-config-mojibuster.json`, `tika-config-junkfilter-chain.json`.
+Use these for chain-attribution work (which detector did the detection).
+
+## Encoding-pair flip query
+
+`MIMES.MIME_STRING` for text-y mimes is `text/html; charset=X` form. Extract
+the charset with a regex split, group by `(enc_a, enc_b)`, filter pairs.
+A=before/`-a`, B=after/`-b`; join on `pa.ID = pb.ID` (paired by id).
+
+```sql
+SELECT
+  REGEXP_REPLACE(ma.MIME_STRING, '^.*charset=', '') AS enc_a,
+  REGEXP_REPLACE(mb.MIME_STRING, '^.*charset=', '') AS enc_b,
+  COUNT(*) n,
+  SUM(cb.NUM_COMMON_TOKENS - ca.NUM_COMMON_TOKENS) AS delta_common
+FROM PROFILES_A pa
+JOIN PROFILES_B pb ON pa.ID = pb.ID
+JOIN MIMES ma ON pa.MIME_ID = ma.MIME_ID
+JOIN MIMES mb ON pb.MIME_ID = mb.MIME_ID
+JOIN CONTENTS_A ca ON ca.ID = pa.ID
+JOIN CONTENTS_B cb ON cb.ID = pb.ID
+WHERE ma.MIME_STRING LIKE '%charset=%' AND mb.MIME_STRING LIKE '%charset=%'
+  AND REGEXP_REPLACE(ma.MIME_STRING, '^.*charset=', '') <>
+      REGEXP_REPLACE(mb.MIME_STRING, '^.*charset=', '')
+GROUP BY enc_a, enc_b
+ORDER BY n DESC, delta_common ASC LIMIT 50;
+```
+
+Add an `IN (...)` filter on either side to constrain to a family
+(e.g. SBCS-Western → CJK):
+
+```sql
+  AND REGEXP_REPLACE(ma.MIME_STRING,'^.*charset=','')
+      IN ('windows-1252','ISO-8859-1','ISO-8859-15','ISO-8859-2','ISO-8859-3',
+          'windows-1250','windows-1254','windows-1257','ISO-8859-13',
+          'windows-1258','x-MacRoman','IBM850','IBM852')
+  AND REGEXP_REPLACE(mb.MIME_STRING,'^.*charset=','')
+      IN ('GB18030','GBK','GB2312','Big5','Big5-HKSCS','Shift_JIS','EUC-JP',
+          'EUC-KR','x-EUC-TW','x-windows-874','x-windows-949',
+          'ISO-2022-JP','ISO-2022-KR','ISO-2022-CN')
+```
+
+### Per-file drilldown
+
+Join `CONTAINERS` to get the source path; pull `LANG_ID_1` from both sides
+to see whether language detection agrees the content is Western while the
+charset has flipped to CJK (the regression's defining shape):
+
+```sql
+SELECT ct.FILE_PATH,
+       REGEXP_REPLACE(ma.MIME_STRING,'^.*charset=','') AS enc_a,
+       REGEXP_REPLACE(mb.MIME_STRING,'^.*charset=','') AS enc_b,
+       ca.NUM_COMMON_TOKENS AS ca_tok, cb.NUM_COMMON_TOKENS AS cb_tok,
+       cb.NUM_COMMON_TOKENS - ca.NUM_COMMON_TOKENS AS delta,
+       ca.LANG_ID_1 AS lang_a, cb.LANG_ID_1 AS lang_b
+FROM PROFILES_A pa JOIN PROFILES_B pb ON pa.ID = pb.ID
+JOIN MIMES ma ON pa.MIME_ID = ma.MIME_ID JOIN MIMES mb ON pb.MIME_ID = mb.MIME_ID
+JOIN CONTENTS_A ca ON ca.ID = pa.ID JOIN CONTENTS_B cb ON cb.ID = pb.ID
+JOIN CONTAINERS ct ON ct.CONTAINER_ID = pa.CONTAINER_ID
+WHERE <enc_a/enc_b filter as above>
+ORDER BY delta ASC LIMIT 15;
+```
+
+## Per-file detector attribution (`X-TIKA:encodingDetectionTrace`)
+
+Every JSON extract from a chain with multiple detectors carries
+`X-TIKA:encodingDetectionTrace` in metadata. It's a per-detector emission
+log with the META detector's arbitration tag at the end:
+
+```
+MojibusterEncodingDetector->Shift_JIS[STATISTICAL](1.00) [junk-filter-selected]
+```
+
+When investigating "why did B pick X for this file?", read this trace first
+— it tells you which base detector(s) emitted candidates and which one the
+meta detector chose. If the trace shows ONLY Mojibuster firing with a CJK
+pick, the bug is in Mojibuster's emission (pool too narrow), not in
+JunkFilter's arbitration.
+
+`X-TIKA:encodingDetector` is the simple-name credit string;
+`X-TIKA:detectedEncoding` is the final answer (also in `Content-Encoding`).
+
+## Reproducing a single-file detection without a full chain
+
+```bash
+./mvnw -q -pl tika-ml/tika-ml-junkdetect -Dmaven.repo.local=$(pwd)/.local_m2_repo \
+  -Dexec.classpathScope=test \
+  -Dexec.mainClass=org.apache.tika.ml.junkdetect.TraceJunkFilter \
+  -Dexec.args="--file <path> --auto-candidates --content-cleaner --head-bytes 524288 --sample 120" \
+  exec:java
+```
+
+Key flags:
+- `--auto-candidates` — use Mojibuster's per-file pool as the candidate set
+- `--content-cleaner` — decode each candidate then run text through
+  `HtmlContentCleaner` to match the live chain
+- `--head-bytes 524288` — read up to 512 KB raw to match
+  `AdaptiveProbe.DEFAULT_RAW_CAP`. The default `READ_LIMIT` of 16 KB will
+  give a *different* probe than the live chain on long markup-heavy pages
+  and lead you to disagree with the live chain's pick. Always pass this
+  when reconciling a TraceJunkFilter run with a live extract.
+
+Without `--head-bytes`, you are looking at a different probe than the
+chain saw — this is the most common source of "trace says X, chain
+says Y" confusion.
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
index 2460656f0c..4140b6f023 100644
--- a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
@@ -26,7 +26,9 @@ import java.nio.file.Path;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
+import java.util.HashMap;
 import java.util.List;
+import java.util.Map;
 
 import org.apache.commons.io.IOUtils;
 
@@ -104,23 +106,15 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
     public static final double MARGIN_THRESHOLD_NATS_PER_BIGRAM = 0.20;
 
     /**
-     * Per-bigram cross-class total-contribution cap (Type C clipping).
-     * For each distinct bigram in the probe, the top-scoring class's
-     * total contribution (count × logP × idf, after dequantization) is
-     * capped at the runner-up class's contribution + this many nats.
-     *
-     * <p>Defends against corpus-skew pathologies where one class
-     * accumulates extreme bigram mass that swings classification on
-     * one or two byte-pairs alone (e.g., Czech "ČR" digraph in
-     * ISO-8859-2 contributing +186 nats over win-1252 on Italian text).
-     * Length-invariant by construction: the cap is on per-bigram
-     * advantage, regardless of how many times the bigram appears.</p>
-     *
-     * <p>20 nats = e^20 ≈ 5×10^8 probability-ratio advantage per
-     * bigram — preserves legitimate CJK-vs-Latin and other cross-script
-     * signal while bounding the diffuse-corpus-skew tail.</p>
+     * Per-distinct-bigram cap: top-scoring class's contribution is
+     * clipped to the best <em>cross-cohort</em> class's contribution +
+     * this many nats.  Bounds both single-bigram corpus skew and the
+     * diffuse coverage asymmetry where broad-vocab cohorts (CJK,
+     * EBCDIC) collectively swamp narrow-vocab cohorts (LATIN) on
+     * rare-ASCII bigrams that fall to the unseen floor in the narrow
+     * cohort.  See {@link Cohort}.
      */
-    public static final double CAP_PER_BIGRAM_NATS = 20.0;
+    public static final double CAP_PER_BIGRAM_NATS = 10.0;
 
     /**
      * Minimum distinct bigrams required before the per-bigram cap
@@ -149,9 +143,60 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
      */
     public static final int MIN_BIGRAMS_FOR_DIVERSITY_GATE = 100;
 
+    /**
+     * Script / writing-system family used by {@link #CAP_PER_BIGRAM_NATS}.
+     * UTF-8 stands alone so the cap engages on UTF-vs-anything pairs
+     * (UTF-8 misread as win-1252 or as GBK).
+     */
+    public enum Cohort {
+        LATIN, CJK, CYRILLIC, GREEK, HEBREW, ARABIC, THAI, EBCDIC, UTF
+    }
+
+    /**
+     * Class label → cohort.  Must cover every NB-model label; load
+     * fails fast on an unmapped label (model and code travel together
+     * in git, no BWC layer).
+     */
+    private static final Map<String, Cohort> COHORT_TABLE = buildCohortTable();
+
+    private static Map<String, Cohort> buildCohortTable() {
+        Map<String, Cohort> m = new HashMap<>();
+        for (String label : new String[]{
+                "windows-1252", "windows-1250", "windows-1254", "windows-1257",
+                "windows-1258", "ISO-8859-2", "ISO-8859-3", "ISO-8859-16",
+                "x-MacRoman", "IBM850", "IBM852"}) {
+            m.put(label, Cohort.LATIN);
+        }
+        for (String label : new String[]{
+                "Big5-HKSCS", "EUC-JP", "GB18030", "Shift_JIS",
+                "x-EUC-TW", "x-windows-949"}) {
+            m.put(label, Cohort.CJK);
+        }
+        for (String label : new String[]{
+                "windows-1251", "KOI8-R", "KOI8-U", "IBM855", "IBM866",
+                "x-mac-cyrillic"}) {
+            m.put(label, Cohort.CYRILLIC);
+        }
+        m.put("windows-1253", Cohort.GREEK);
+        m.put("windows-1255", Cohort.HEBREW);
+        m.put("windows-1256", Cohort.ARABIC);
+        m.put("windows-874", Cohort.THAI);
+        // Bidi-suffix variants (-ltr/-rtl) share a cohort; toJavaCharsetName
+        // collapses them at Charset lookup, but their bigram tables differ.
+        for (String label : new String[]{
+                "IBM1047", "IBM500", "IBM420-ltr", "IBM420-rtl",
+                "IBM424-ltr", "IBM424-rtl"}) {
+            m.put(label, Cohort.EBCDIC);
+        }
+        m.put("UTF-8", Cohort.UTF);
+        return Collections.unmodifiableMap(m);
+    }
+
     private final String[] labels;
     /** Charset objects cached at load — one {@code Charset.forName} per class, ever. */
     private final Charset[] charsets;
+    /** Per-class cohort, parallel to {@link #labels}. */
+    private final Cohort[] cohorts;
     /**
      * Bigram-major int8 logP layout.  Quantized at load time via
      * per-class scale {@code scale[c] = maxAbs(class c's logP column) / 127}.
@@ -198,6 +243,7 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
             this.numClasses = dis.readInt();
             this.labels = new String[numClasses];
             this.charsets = new Charset[numClasses];
+            this.cohorts = new Cohort[numClasses];
 
             // Read quantized IDF table + scale.
             float idfScale = dis.readFloat();
@@ -228,6 +274,14 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
                     cs = null;
                 }
                 charsets[c] = cs;
+                Cohort cohort = COHORT_TABLE.get(labels[c]);
+                if (cohort == null) {
+                    throw new IOException(
+                            "NB model class label \"" + labels[c]
+                                    + "\" has no cohort assignment; "
+                                    + "update NaiveBayesBigramEncodingDetector.COHORT_TABLE.");
+                }
+                cohorts[c] = cohort;
 
                 scale[c] = dis.readFloat();
                 unseenQ[c] = dis.readByte();
@@ -247,6 +301,23 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
                 }
             }
 
+            // The cohort cap needs a cross-cohort competitor to cap against;
+            // require >=2 cohorts so scoreClassesAndCount never sees an empty
+            // cross-cohort set. Always true for the bundled 9-cohort model;
+            // fails fast only on a single-cohort model shift.
+            boolean multiCohort = false;
+            for (int c = 1; c < numClasses; c++) {
+                if (cohorts[c] != cohorts[0]) {
+                    multiCohort = true;
+                    break;
+                }
+            }
+            if (!multiCohort) {
+                throw new IOException("NB model must span at least two cohorts; got "
+                        + numClasses + " class(es) all in cohort "
+                        + (numClasses == 0 ? "<none>" : cohorts[0]));
+            }
+
             // Per-class dequant constant = scale[c] × idfScale.
             // (B-3 per-class score normalization by log V(c) was
             // removed after empirically backfiring on probes where a
@@ -454,21 +525,30 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
             }
 
             // logPs are negative; "best" class for the bigram = highest
-            // (least negative) contribution after dequant.
+            // (least negative) contribution after dequant.  Cap reference
+            // is the best contribution from a class outside top-1's
+            // cohort, so the cap engages on cross-cohort gaps that a
+            // max-vs-overall-runner-up cap missed when multiple classes
+            // in top-1's cohort sat close together.
+            int topClass = -1;
             double max = Double.NEGATIVE_INFINITY;
-            double secondMax = Double.NEGATIVE_INFINITY;
             for (int c = 0; c < numClasses; c++) {
                 double contrib = logP8[base + c] * countTimesIdf * perClassDequant[c];
                 contributions[c] = contrib;
                 if (contrib > max) {
-                    secondMax = max;
                     max = contrib;
-                } else if (contrib > secondMax) {
-                    secondMax = contrib;
+                    topClass = c;
+                }
+            }
+            Cohort topCohort = cohorts[topClass];
+            double bestCrossCohort = Double.NEGATIVE_INFINITY;
+            for (int c = 0; c < numClasses; c++) {
+                if (cohorts[c] != topCohort && contributions[c] > bestCrossCohort) {
+                    bestCrossCohort = contributions[c];
                 }
             }
-            // Cap any class whose contribution exceeds runner-up + cap.
-            double capValue = secondMax + CAP_PER_BIGRAM_NATS;
+            // bestCrossCohort is always finite here: load requires >=2 cohorts.
+            double capValue = bestCrossCohort + CAP_PER_BIGRAM_NATS;
             if (max > capValue) {
                 for (int c = 0; c < numClasses; c++) {
                     if (contributions[c] > capValue) {
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/apple/IWorkTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/apple/IWorkTest.java
deleted file mode 100644
index 7ce6bba8cd..0000000000
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/apple/IWorkTest.java
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.parser.apple;
-
-import java.nio.file.Path;
-import java.nio.file.Paths;
-
-import org.junit.jupiter.api.Test;
-
-import org.apache.tika.TikaTest;
-
-public class IWorkTest extends TikaTest {
-
-    @Test
-    public void testBasic() throws Exception {
-        Path p = Paths.get("/home/tallison/Downloads/Apple_key_file/keynotecreated.key");
-    }
-}
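
The deleted test above is the kind of machine-specific path that the new rule in `.skills/dev.md` targets. As a rough illustration, the same alternation used by the `grep -E` command in that file can be applied to diff lines in Java. This is only a sketch: `LocalPathScanSketch` and `flagAddedLines` are hypothetical names, and like the shell pipeline it treats any line starting with `+` as an added line (which also includes `+++` file headers), so every hit still needs review by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LocalPathScanSketch {

    // Same alternation as the grep -E pattern in .skills/dev.md; "\\\\+" in Java source
    // is the regex \\+ (one or more literal backslashes).
    public static final Pattern LOCAL_PATH = Pattern.compile(
            "/home/[A-Za-z0-9._-]+|/Users/[A-Za-z0-9._-]+|[A-Za-z]:\\\\+Users|~/data/");

    /** Returns the added ('+') diff lines that contain a local-looking path. */
    public static List<String> flagAddedLines(List<String> diffLines) {
        List<String> hits = new ArrayList<>();
        for (String line : diffLines) {
            if (line.startsWith("+") && LOCAL_PATH.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> diff = Arrays.asList(
                "+        Path p = Paths.get(\"/home/someone/Downloads/example.key\");",
                "-        Path p = Paths.get(\"/home/someone/Downloads/old.key\");",
                "+  -i <corpus> -o <workdir>/extracts/A -n 6");
        // Only the first line is flagged: the second is a removal, the third uses placeholders.
        for (String hit : flagAddedLines(diff)) {
            System.out.println("local path in added line: " + hit);
        }
    }
}
```

Removed lines are deliberately ignored, so a commit that deletes an offending path, as this one does, passes the check.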
