This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch charset-detection
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9f917eca058e12b96363de00df00c1d80a418788
Author: tallison <[email protected]>
AuthorDate: Wed Jun 3 10:51:48 2026 -0400

    TIKA-4745 - charset/junk/tika-eval improvements
---
 .skills/tika-eval-compare.md                       |  10 +
 .skills/tika-eval-encoding-regression.md           |  35 +++
 .skills/tika-eval-h2-query.md                      |  51 ++++
 .../pages/advanced/charset-detection-design.adoc   |  90 ++++++-
 .../integration-testing/tika-eval-regression.adoc  |  21 +-
 .../org/apache/tika/detect/CharsetSupersets.java   |   2 +
 .../apache/tika/detect/HighByteLetterStats.java    |  94 +++++++
 .../apache/tika/detect/CharsetSupersetsTest.java   |  67 +++++
 .../tika/detect/HighByteLetterStatsTest.java       |  72 ++++++
 .../tika/ml/chardetect/CjkDecodeValidator.java     | 151 +++++++++++
 .../tika/ml/chardetect/CosineFamilyArbiter.java    | 241 ++++++++++++++++++
 .../ml/chardetect/MojibusterEncodingDetector.java  | 194 ++++++++------
 .../NaiveBayesBigramEncodingDetector.java          |  21 +-
 .../apache/tika/ml/chardetect/cosine-profiles.bin  | Bin 0 -> 1080313 bytes
 .../org/apache/tika/ml/chardetect/nb-bigram.bin    | Bin 1016638 -> 1008871 bytes
 .../tika/ml/chardetect/CjkDecodeValidatorTest.java |  81 ++++++
 .../tika/ml/chardetect/Iso2022DetectionTest.java   |  83 ++++++
 .../org/apache/tika/eval/app/ExtractComparer.java  |   5 +
 .../tika/eval/app/ExtractComparerRunner.java       |   2 +
 .../apache/tika/eval/app/ExtractProfileRunner.java |   1 +
 .../org/apache/tika/eval/app/ExtractProfiler.java  |   9 +
 .../org/apache/tika/eval/app/ProfilerBase.java     |  46 ++++
 .../java/org/apache/tika/eval/app/db/Cols.java     |   6 +-
 .../eval/app/reports/MarkdownSummaryWriter.java    |   8 +-
 .../tika/eval/core/langid/LanguageIDWrapper.java   |  16 +-
 .../eval/core/textstats/NonAsciiCharCounter.java   |  39 +++
 .../core/textstats/ReplacementCharCounter.java     |  39 +++
 .../ml/junkdetect/JunkFilterEncodingDetector.java  | 281 +++++++++++++++++++--
 .../junkdetect/JunkFilterEncodingDetectorTest.java | 179 +++++++++++++
 .../tika/ml/junkdetect/LatinLetterGateTest.java    | 110 ++++++++
 30 files changed, 1836
insertions(+), 118 deletions(-) diff --git a/.skills/tika-eval-compare.md b/.skills/tika-eval-compare.md index 4fc628cc06..d4549d26c8 100644 --- a/.skills/tika-eval-compare.md +++ b/.skills/tika-eval-compare.md @@ -120,6 +120,16 @@ directory, plus a `summary.md` with key metrics: | Exception count | ≤ A | > A | | Total files (B) vs (A) | equal or higher | lower — missing embedded docs | +### Encoding-detection evals + +For charset/encoding-detector changes, the summary reports don't cover it — query +the db directly (see the **tika-eval-h2-query** skill). The detected encoding is in +the `ENCODINGS_A`/`ENCODINGS_B` tables (`DETECTED_ENCODING`, `ENCODING_DETECTOR`, +`DECLARED_METADATA`), **not** `PROFILES`. Key signals: per-encoding counts (e.g. CJK +total), A→B flips by direction, and OOV on the flipped files (a flip that *worsens* +OOV is a regression; one that *improves* it is a fix). Pair on `ID`; map back to the +source file via `PROFILES_*.FILE_NAME` (the content hash). + ### CRITICAL: Review Checklist The purpose of tika-eval is to find regressions BEFORE a release. After diff --git a/.skills/tika-eval-encoding-regression.md b/.skills/tika-eval-encoding-regression.md index 1d3e61a67c..7532df3981 100644 --- a/.skills/tika-eval-encoding-regression.md +++ b/.skills/tika-eval-encoding-regression.md @@ -123,6 +123,41 @@ WHERE <enc_a/enc_b filter as above> ORDER BY delta ASC LIMIT 15; ``` +## Reading the signals — OOV, languageness, and FFFD together + +No single signal is authoritative. Use `oov` as a **secondary** signal alongside +`languageness` (the junk-model coherence z-score) and the U+FFFD rate — each is +right where the others are blind, so cross-check rather than ranking on any one. +(Established 2026-06-03: a 40-file OOV-"worse" set was mostly metric artifacts +once languageness/FFFD were brought in — only ~6 were real. But OOV is also the +*correct* signal where languageness is blind, so neither dominates.) 
+ +- **OOV can mislead** when langid shifts — a CJK/UTF-8 recovery in B is scored + against a different vocab → higher OOV though B is right — or when a wrong + decode fragments words into more short common tokens (→ higher count for the + WORSE decode). A common-token delta is a signal, not proof. +- **languageness can mislead** on SBCS↔SBCS cross-script mojibake — Greek decoded + as KOI8-R is "coherent" Cyrillic, so `languageness` stays flat while `oov` + correctly flags it. Conversely languageness catches OOV's CJK/script-recovery + blind spot. Each covers the other's blind spot. +- **FFFD rate** flags decode failures (illegal bytes): `num_replacement / + num_non_ascii` (un-diluted; `/ content_length` dilutes to ~0 on ASCII-heavy + docs). Tika strips C0 controls at extraction, so legal-but-wrong (C1) mojibake + does not surface here — that signal belongs in the detector chain, not the eval. +- **In practice:** when the signals agree, high confidence; when they disagree + (OOV-worse but languageness-better, or vice versa), that file needs a look — + the disagreement points you at WHICH files to inspect, it does not by itself + declare OOV or languageness "wrong." Split OOV-worse by languageness direction + (query in `tika-eval-regression.adoc`). + +### Isolate a change against the PRIOR run, not just 3.x + +To see what one chain change actually did, Compare the new run against the +*previous* 4.x run (B-new vs B-prior), not only vs 3.x. The diff should be +*surgical* — e.g. the within-Latin letter gate moved exactly 6 files +(IBM850 / x-MacRoman → windows-1252) vs the prior run and nothing else. A +bigger-than-expected diff means the change fired more broadly than intended. 
+ ## Per-file detector attribution (`X-TIKA:encodingDetectionTrace`) Every JSON extract from a chain with multiple detectors carries diff --git a/.skills/tika-eval-h2-query.md b/.skills/tika-eval-h2-query.md index d4f0b2c378..37532d64c0 100644 --- a/.skills/tika-eval-h2-query.md +++ b/.skills/tika-eval-h2-query.md @@ -40,6 +40,7 @@ it then waits on stdin and appears to hang). |---|---| | `PROFILES_A` / `PROFILES_B` | one row per extracted file: `FILE_NAME`, `MD5`, `MIME_ID`, `CONTAINER_ID`, `EMBEDDED_FILE_PATH`, `LENGTH`, `NUM_PAGES`, … (A = "before"/-a, B = "after"/-b) | | `CONTENTS_A` / `CONTENTS_B` | text profile per file (join on `ID`): `OOV`, `LANGUAGENESS`, `NUM_TOKENS`, `NUM_COMMON_TOKENS`, `LANG_ID_1`/`LANG_ID_PROB_1`, `TOKEN_ENTROPY_RATE`, … | +| `ENCODINGS_A` / `ENCODINGS_B` | detected-encoding per file (join on `ID`): `DETECTED_ENCODING`, `ENCODING_DETECTOR`, `DECLARED_METADATA`. **`DETECTED_ENCODING` lives HERE, not on `PROFILES` — moved out in the encodings-table refactor; querying `PROFILES_*.DETECTED_ENCODING` now errors "Column not found".** A file with no detected encoding has no row. | | `CONTENT_COMPARISONS` | per-file A↔B comparison (`ID`): `DICE_COEFFICIENT`, `OVERLAP`, top token diffs | | `MIMES` | `MIME_ID` → `MIME_STRING` | | `CONTAINERS` | container id → input file path | @@ -86,6 +87,56 @@ FROM CONTENTS_A ca JOIN CONTENTS_B cb ON ca.ID = cb.ID; To bring in mime/path, join `PROFILES_A pa ON pa.ID = ca.ID` (and `pb`/`cc` likewise on the same `id`) — all on `id`. +Detected-encoding queries — `DETECTED_ENCODING` is on `ENCODINGS_A`/`ENCODINGS_B` +(join on `ID`), NOT `PROFILES`. 
CJK count in B (LOWER() — `REGEXP` is case-sensitive, +see below): + +```sql +SELECT COUNT(*) FROM ENCODINGS_B +WHERE LOWER(DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949'; +``` + +Encoding flips A→B by direction (what changed between runs): + +```sql +SELECT ea.DETECTED_ENCODING a_enc, eb.DETECTED_ENCODING b_enc, COUNT(*) n +FROM ENCODINGS_A ea JOIN ENCODINGS_B eb ON ea.ID = eb.ID +WHERE ea.DETECTED_ENCODING <> eb.DETECTED_ENCODING +GROUP BY a_enc, b_enc ORDER BY n DESC; +``` + +Map a flipped file back to its source file — `PROFILES_*.FILE_NAME` is the content +hash (the input file is `<corpus>/<first-2-hex>/<FILE_NAME>`); join `CONTENTS` for OOV: + +```sql +SELECT pb.FILE_NAME, ea.DETECTED_ENCODING a_enc, eb.DETECTED_ENCODING b_enc, + ca.OOV oov_a, cb.OOV oov_b +FROM ENCODINGS_A ea JOIN ENCODINGS_B eb ON ea.ID = eb.ID + JOIN PROFILES_B pb ON ea.ID = pb.ID + JOIN CONTENTS_A ca ON ea.ID = ca.ID JOIN CONTENTS_B cb ON ea.ID = cb.ID +WHERE LOWER(eb.DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949' + AND NOT (LOWER(ea.DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949'); +``` + +## Gotcha: `REGEXP` is case-sensitive (silent wrong results) + +H2's `REGEXP` operator is **case-sensitive**, so `DETECTED_ENCODING REGEXP +'big5|gb|euc'` does **not** match `Big5-HKSCS` or `GB18030` — and it fails +*silently*, quietly dropping/keeping the wrong rows instead of erroring. Always +either lowercase the column or use the inline case-insensitive flag: + +```sql +-- right: +WHERE LOWER(DETECTED_ENCODING) REGEXP 'big5|gb|euc|shift|jis|2022|949' +-- or: +WHERE DETECTED_ENCODING REGEXP '(?i)big5|gb|euc|shift|jis|2022|949' +-- wrong (misses Big5-HKSCS, GB18030, Shift_JIS, ...): +WHERE DETECTED_ENCODING REGEXP 'big5|gb|euc|shift|jis|2022|949' +``` + +(`DETECTED_ENCODING` is on `ENCODINGS_A`/`ENCODINGS_B` — join to `PROFILES`/`CONTENTS` +on `ID` — populated from `X-TIKA:detectedEncoding`.) 
+ ## Tip For a quick interactive session, drop `-sql` and you get an H2 prompt; `SHOW diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc index fc870e5d51..815ed375e4 100644 --- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc +++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc @@ -53,7 +53,7 @@ results are collected into an `EncodingDetectorContext` on the | `MojibusterEncodingDetector` | `tika-encoding-detector-mojibuster` | Structural UTF-32 and UTF-16 detection, UTF-8 grammar gate, HTML - attribute-aware stripping, then a 33-class byte-bigram NB + attribute-aware stripping, then a 34-class byte-bigram NB classifier. STRUCTURAL for structural hits; STATISTICAL for NB predictions. See <<nb-pipeline>>. @@ -135,6 +135,21 @@ sequences. Three outcomes: * `AMBIGUOUS` — no complete multi-byte sequence (pure ASCII, or only a truncated lead at probe-end). No emission. +=== ISO-2022-JP/KR/CN structural detection (pure-ASCII branch) + +ISO-2022 encodings are 7-bit and escape-based (`ESC $ B`, `ESC $ ) C`, …), +so they carry no high bytes and are invisible to the byte-bigram +classifier; without a structural check a real ISO-2022-JP page would fall +through to the windows-1252 default and decode to gibberish. On a +pure-ASCII probe — the only place ISO-2022 can occur — the pipeline scans +for the ISO-2022 designation escape and, if found, *verifies* by decoding: +the result must contain real CJK at a near-zero replacement rate. The +verify rejects a stray `ESC $` in ordinary ASCII (which yields no CJK). +On success an ISO-2022-JP/KR/CN STRUCTURAL candidate is emitted. High-byte +binary that happens to contain an escape sequence never reaches this +check — it fails the pure-ASCII gate and takes the normal NB path, so it +cannot trigger a false ISO-2022 detection. 
+ === Layer 4 — HTML stripping (content-type aware) When the probe looks like HTML/XML (explicit content-type or unknown), @@ -155,10 +170,10 @@ just content bytes for NB feature extraction. Optimizations: === Layer 5 — Naive Bayes byte-bigram classifier -33 classes: CJK multibyte (Big5-HKSCS, EUC-JP, GB18030, Shift_JIS, +34 classes: CJK multibyte (Big5-HKSCS, EUC-JP, GB18030, Shift_JIS, x-EUC-TW, x-windows-949), EBCDIC family (IBM420/424-ltr/rtl, IBM500, IBM1047), DOS OEM (IBM850/852/855/866), Cyrillic (KOI8-R, KOI8-U), -Windows single-byte (1250-1258, 874), ISO-8859-3/16, Mac (x-MacRoman, +Windows single-byte (1250-1258, 874), ISO-8859-2/3/16, Mac (x-MacRoman, x-mac-cyrillic), and UTF-8. Features are **stride-1 byte bigrams** — for probe bytes `b[0..N]`, @@ -212,9 +227,10 @@ every probe length we've measured. * **Empty / near-empty probes (< 2 bytes)** → windows-1252 @ 0.1 confidence. WHATWG default; never returns empty result. -* **Pure ASCII probes** (no bytes ≥ 0x80, no nulls) → windows-1252. - Bigram NB cannot discriminate Latin code pages on pure-ASCII - content; return the HTML5-canonical answer directly. +* **Pure ASCII probes** (no bytes ≥ 0x80, no nulls) → ISO-2022 structural + detection first (see above); otherwise windows-1252. Bigram NB cannot + discriminate Latin code pages on pure-ASCII content; return the + HTML5-canonical answer directly. * **Latin-sibling → windows-1252 rewrite** — on low-evidence probes (< 5 high bytes), if the top NB candidate is a non-1252 member of the Latin family and the probe decodes byte-identically under @@ -223,6 +239,34 @@ every probe length we've measured. threshold are not emitted into the pool. Prevents JunkFilter from scoring weak coincidence picks against NB's confident top. 
+==== CJK decode-failure veto (`CjkDecodeValidator`) + +A legacy multi-byte CJK class (GB18030, Big5-HKSCS, Shift_JIS, EUC-JP, +x-windows-949, x-EUC-TW) that NB picks on Latin/Cyrillic/garbage bytes is +*false-CJK*: those bytes don't validate under the charset, so decoding +produces many malformed/unmappable events, whereas real CJK decodes +cleanly. After NB, each legacy-CJK candidate is decoded under its vendor +superset (`CharsetSupersets`) and its failure rate measured as +`failures / high-bytes`; above ~2.5% the candidate is dropped — and if it +was NB's only pick, the pool empties and windows-1252 wins. Two +corrections make the rate trustworthy: + +* **Decode under the vendor superset, not the strict base** — real + vendor-extension chars (NEC/IBM for Shift_JIS/EUC-JP, HKSCS for Big5) + would otherwise count as failures and penalize genuine CJK. +* **Discount embedded UTF-8** — mixed-encoding pages (legacy CJK body + + UTF-8 widgets) would otherwise read as 2–9.5% failure. The validator + walks the bytes and *skips* positions that begin a valid UTF-8 sequence + (it does NOT physically strip them — that would misalign a pure + legacy-CJK stream and manufacture failures), decoding the legacy charset + in place elsewhere. Post-discount, real CJK (pure or mixed) is ≤1.6% + while genuine false-CJK stays ≥5.3%, so ~2.5% separates them. + +This veto catches *structurally-illegal* false-CJK only. The +*legal-but-wrong* class — Latin/Cyrillic bytes that form a *valid* CJK +decode at ~0 failure — is the typicality layer's job (<<junk-filter>>), +not this veto's. + [[junk-filter]] == JunkFilterEncodingDetector — text-quality arbitration @@ -261,6 +305,36 @@ For plain first-match-wins, omit JunkFilter (see <<opting-out-of-arbitration>>). . **Pairwise tournament** — first candidate seeds champion; each challenger compared via `JunkDetector.compare`; higher z-score wins. +=== Post-tournament demote gates + +Two demote-only refinements run after the champion is chosen. 
Each fires only +to *demote* the champion across one boundary the whole-text z-score reads +poorly under COMMON-dilution; neither can promote, so they cannot cost a +confident detection. + +* **CJK family gate** — the whole-text z coin-flips on the CJK/non-CJK boundary + when markup and digits decode identically and swamp the few discriminating + high bytes. A script-letter "diff" z — scored over only the `>= 0x80` + letters/ideographs, where candidates actually differ — reads that boundary + cleanly. If the champion is CJK and the best non-CJK diff-z beats the best + CJK diff-z by `FAMILY_DIFF_MARGIN` (2.0), demote to the best non-CJK + candidate. The reverse (promote to CJK) regressed at scale and is + unnecessary — genuine CJK is `<meta>`-declared upstream. + +* **Within-Latin letter gate** — among single-byte Latin siblings the z also + coin-flips, occasionally promoting a DOS-OEM / Mac charset (IBM850, + x-MacRoman) whose high bytes decode to box-drawing / symbols over the + windows-1252 truth. Cased-letter count reads this where typicality cannot: + if the champion is a Latin SBCS, a windows-1252 candidate is present, the + probe is high-byte-dense, and windows-1252 decodes clearly more cased + high-byte letters (by a margin), demote to windows-1252. Directional — a + genuine Central-European / DOS document has *more* letters under its true + charset, so the gate stays silent. Latin-scoped, so it never crosses the + CJK boundary (the family gate's job) or touches a non-Latin SBCS, whose + Cyrillic/Greek cased letters would pollute the count. Shares the + `HighByteLetterStats` letter counter with Mojibuster's Western-Latin sibling + fallback. + === JunkDetector scoring `JunkDetector` partitions decoded text into maximal Unicode-script runs @@ -485,7 +559,7 @@ can't encode typographic characters). value, vocabulary size, and each trained bigram as `(uint16 bigram, int8 logP)` pairs. -Files for the shipped 33-class model are ~1 MB on disk. 
Loader +Files for the shipped 34-class model are ~1 MB on disk. Loader materializes a dense `logP8[65 536 × numClasses]` array filled with per-class unseen floors, overwritten by trained pairs. Working-set memory: ~2 MB. @@ -498,7 +572,7 @@ with feature hashing. The move to NB was driven by: * **Speed**: direct bigram indexing removes the hash + bucket-lookup cost. Inner loop is `score[c] += logP[b × numClasses + c] × idf[b]` with no branching (zero-IDF bigrams are skipped before the class - loop). Measured ~15 µs on a full 1 KB probe for 33 classes. + loop). Measured ~15 µs on a full 1 KB probe for 34 classes. * **Memory layout**: bigram-major byte arrays fit in L3 cache for the full table. Sequential access through the hot loop is cache-line efficient. diff --git a/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc b/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc index a81f6fabd4..61a8981633 100644 --- a/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc +++ b/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc @@ -339,7 +339,8 @@ waiting on stdin). Key tables: `profiles_a`/`profiles_b` (one row per extracted file: `file_name`, `mime_id`, `length`, …), `contents_a`/`contents_b` (text profile: `oov`, -`languageness`, `num_tokens`, `lang_id_1`, …), `content_comparisons` +`languageness`, `num_tokens`, `lang_id_1`, `num_replacement` (U+FFFD count), +`num_non_ascii`, …), `content_comparisons` (`dice_coefficient`, `overlap`), `mimes`, `containers`. *A and B are paired by `id`* — the same row `id` is the same file in both runs (this is how the built-in reports join: `join profiles_b pb on pa.id = pb.id`). Always join on `id`. 
@@ -351,6 +352,24 @@ SELECT SUM(CASE WHEN cb.oov < ca.oov THEN 1 ELSE 0 END) AS oov_better, SUM(CASE WHEN cb.oov > ca.oov THEN 1 ELSE 0 END) AS oov_worse FROM contents_a ca JOIN contents_b cb ON ca.id = cb.id; +-- NOTE: OOV is one signal, not the verdict -- read it with languageness and the +-- FFFD rate (use OOV as a secondary signal). OOV can mislead (a langid shift, +-- e.g. a CJK decode recovered in B, inflates oov_worse even when B is correct; a +-- wrong decode that fragments words can LOWER OOV), and languageness can mislead +-- on SBCS-cross-script mojibake -- each is right where the other is blind. When +-- OOV-worse and languageness disagree, that file needs a look (split below): +SELECT SUM(CASE WHEN cb.languageness > ca.languageness + 0.2 THEN 1 ELSE 0 END) AS lang_better_oov_lied, + SUM(CASE WHEN cb.languageness < ca.languageness - 0.2 THEN 1 ELSE 0 END) AS lang_worse_real_candidate +FROM contents_a ca JOIN contents_b cb ON ca.id = cb.id +WHERE cb.oov > ca.oov + 0.02 AND ca.languageness > -90 AND cb.languageness > -90; + +-- FFFD decode-failure rate, un-diluted (over non-ASCII chars, NOT total length, +-- which dilutes to ~0 on ASCII-dominated docs) +SELECT ROUND(100.0 * cb.num_replacement / NULLIF(cb.num_non_ascii, 0), 1) AS fffd_pct, + cb.num_replacement, cb.num_non_ascii +FROM contents_b cb WHERE cb.num_replacement > 0 +ORDER BY cb.num_replacement DESC FETCH FIRST 20 ROWS ONLY; + -- net common-tokens A vs B (headline "more real text recovered" metric) SELECT SUM(ca.num_common_tokens) AS common_a, SUM(cb.num_common_tokens) AS common_b, diff --git a/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java b/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java index f53c98f847..88bd5416bb 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java +++ b/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java @@ -42,6 +42,7 @@ import java.util.Map; * <li>GB2312 → GB18030 (GB18030 is a strict 
superset of both GB2312 and GBK)</li> * <li>GBK → GB18030 (GB18030 is a strict superset; enables 4-byte extension sequences)</li> * <li>Shift_JIS → windows-31j (MS932 is a strict superset with NEC/IBM extensions)</li> + * <li>EUC-JP → x-eucJP-Open (EUC packing of the NEC/IBM vendor extensions)</li> * </ul> */ public final class CharsetSupersets { @@ -59,6 +60,7 @@ public final class CharsetSupersets { m.put("GB2312", "GB18030"); m.put("GBK", "GB18030"); m.put("Shift_JIS", "windows-31j"); + m.put("EUC-JP", "x-eucJP-Open"); SUPERSET_MAP = Collections.unmodifiableMap(m); } diff --git a/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java b/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java new file mode 100644 index 0000000000..a06f426d4b --- /dev/null +++ b/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.detect; + +import java.nio.charset.Charset; + +/** + * High-byte decode-quality statistics shared by the charset detectors. 
+ * + * <p>Used to disambiguate single-byte <em>Latin</em> charset siblings + * (windows-1252 vs IBM850 / x-MacRoman / ISO-8859-x), where a wrong decode maps + * high bytes to box-drawing / symbols while the right one maps them to accented + * letters. The cased-letter count reads that boundary; the byte-bigram + * typicality models cannot (both decodes look like typical Latin, and on + * COMMON-dominated docs the discriminating bytes are diluted to noise).</p> + * + * <p><b>Latin-only.</b> {@link #countCasedHighByteLetters} counts Lu/Ll/Lt, + * which also covers Cyrillic/Greek cased letters and would be polluted by a + * non-Latin SBCS; and it excludes Lo, so a CJK decode (every ideograph is Lo) + * cannot win on "letters". Callers must restrict the comparison to Latin SBCS + * candidates.</p> + */ +public final class HighByteLetterStats { + + private HighByteLetterStats() { + } + + /** Count of bytes ≥ 0x80 in the probe. */ + public static int countHighBytes(byte[] probe) { + if (probe == null) { + return 0; + } + int n = 0; + for (byte b : probe) { + if ((b & 0xFF) >= 0x80) { + n++; + } + } + return n; + } + + /** + * Decode {@code probe} under {@code cs} and count codepoints ≥ 0x80 that + * are Unicode cased letters (Lu/Ll/Lt). Excludes the ordinal / superscript + * indicators ª (U+00AA), º (U+00BA), ⁿ (U+207F): MacRoman's 0xBB/0xBC are + * ª/º while windows-1252's 0xBB is » (punctuation), so without the exclusion + * MacRoman's letter count would beat windows-1252's wherever » appears. + * Lo (CJK / other-letter) is excluded by counting cased categories only. 
+ */ + public static int countCasedHighByteLetters(byte[] probe, Charset cs) { + if (probe == null) { + return 0; + } + String decoded; + try { + decoded = new String(probe, cs); + } catch (Exception e) { + return 0; + } + int count = 0; + for (int i = 0; i < decoded.length(); ) { + int cp = decoded.codePointAt(i); + if (cp >= 0x80 && isCasedLatinishLetter(cp)) { + count++; + } + i += Character.charCount(cp); + } + return count; + } + + private static boolean isCasedLatinishLetter(int cp) { + if (cp == 0x00AA || cp == 0x00BA || cp == 0x207F) { + return false; // ª, º, ⁿ — ordinal / superscript indicators + } + int type = Character.getType(cp); + return type == Character.UPPERCASE_LETTER + || type == Character.LOWERCASE_LETTER + || type == Character.TITLECASE_LETTER; + } +} diff --git a/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java b/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java new file mode 100644 index 0000000000..738fad6196 --- /dev/null +++ b/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.tika.detect; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNull; + +import java.nio.charset.Charset; +import java.nio.charset.StandardCharsets; + +import org.junit.jupiter.api.Test; + +public class CharsetSupersetsTest { + + private static String name(String detected) { + Charset s = CharsetSupersets.supersetOf(Charset.forName(detected)); + return s == null ? null : s.name(); + } + + @Test + public void mapsLegacyCjkToVendorSupersets() { + assertEquals("x-windows-949", name("EUC-KR")); + assertEquals("Big5-HKSCS", name("Big5")); + assertEquals("GB18030", name("GB2312")); + assertEquals("GB18030", name("GBK")); + assertEquals("windows-31j", name("Shift_JIS")); + assertEquals("x-eucJP-Open", name("EUC-JP")); + } + + @Test + public void returnsNullWhenNoSuperset() { + assertNull(CharsetSupersets.supersetOf(null)); + assertNull(CharsetSupersets.supersetOf(StandardCharsets.UTF_8)); + assertNull(CharsetSupersets.supersetOf(Charset.forName("windows-1252"))); + // Superset targets have no further superset. + assertNull(CharsetSupersets.supersetOf(Charset.forName("GB18030"))); + } + + /** The point of the map: vendor-extension bytes the strict base drops to + * U+FFFD decode correctly under the superset. */ + @Test + public void supersetRecoversVendorExtensionChars() { + // CP932/EUC-JP NEC special U+2460 (circled one): strict base fails to + // U+FFFD, superset maps it. 
+ byte[] sjis = {(byte) 0x87, (byte) 0x40}; + assertEquals('\uFFFD', new String(sjis, Charset.forName("Shift_JIS")).charAt(0)); + assertEquals("\u2460", new String(sjis, Charset.forName(name("Shift_JIS")))); + + byte[] eucjp = {(byte) 0xAD, (byte) 0xA1}; + assertEquals('\uFFFD', new String(eucjp, Charset.forName("EUC-JP")).charAt(0)); + assertEquals("\u2460", new String(eucjp, Charset.forName(name("EUC-JP")))); + } +} diff --git a/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java b/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java new file mode 100644 index 0000000000..de4379c99f --- /dev/null +++ b/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.tika.detect; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import java.nio.charset.Charset; + +import org.junit.jupiter.api.Test; + +public class HighByteLetterStatsTest { + + private static final Charset WIN1252 = Charset.forName("windows-1252"); + private static final Charset IBM850 = Charset.forName("IBM850"); + private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS"); + + /** Bytes 0xC0-0xCF are À-Ï (all letters) in windows-1252 but mostly + * box-drawing (└┴┬├─┼ ... ¤) in IBM850 — the box-drawing signature the + * within-Latin gate keys on. */ + @Test + void winBeatsIbm850OnBoxDrawingRange() { + byte[] probe = new byte[16]; + for (int i = 0; i < 16; i++) { + probe[i] = (byte) (0xC0 + i); + } + int win = HighByteLetterStats.countCasedHighByteLetters(probe, WIN1252); + int ibm = HighByteLetterStats.countCasedHighByteLetters(probe, IBM850); + assertEquals(16, win, "all of 0xC0-0xCF are letters in windows-1252"); + assertTrue(ibm <= 4, "IBM850 maps most of 0xC0-0xCF to box-drawing; was " + ibm); + assertTrue(win > ibm + 6, "decisive letter gap expected; win=" + win + " ibm=" + ibm); + } + + /** ª (0xAA), º (0xBA) are ordinal indicators, not letters; é (0xE9) is. */ + @Test + void excludesOrdinalIndicators() { + byte[] probe = {(byte) 0xAA, (byte) 0xBA, (byte) 0xE9}; + assertEquals(1, HighByteLetterStats.countCasedHighByteLetters(probe, WIN1252), + "only é should count; ª and º are ordinal indicators"); + } + + /** CJK ideographs are Lo (other-letter), excluded — so a CJK decode can + * never win the cased-letter comparison against a Latin sibling. 
*/ + @Test + void doesNotCountCjkIdeographs() { + byte[] probe = "日本語の文章".getBytes(SHIFT_JIS); + assertEquals(0, HighByteLetterStats.countCasedHighByteLetters(probe, SHIFT_JIS), + "ideographs are Lo and must not count as cased letters"); + } + + @Test + void countHighBytesIsByteCountAtOrAbove0x80() { + byte[] probe = {0x41, (byte) 0x80, (byte) 0xFF, 0x20, (byte) 0xC3}; + assertEquals(3, HighByteLetterStats.countHighBytes(probe)); + assertEquals(0, HighByteLetterStats.countHighBytes(new byte[0])); + assertEquals(0, HighByteLetterStats.countHighBytes(null)); + } +} diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java new file mode 100644 index 0000000000..4c7254c6ee --- /dev/null +++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.tika.ml.chardetect; + +import java.nio.ByteBuffer; +import java.nio.CharBuffer; +import java.nio.charset.Charset; +import java.nio.charset.CharsetDecoder; +import java.nio.charset.CoderResult; +import java.nio.charset.CodingErrorAction; +import java.util.Locale; + +import org.apache.tika.detect.CharsetSupersets; + +/** + * Structural false-CJK veto: measures how badly a probe fails to decode under a + * legacy multi-byte CJK charset, robustly against embedded UTF-8. + * + * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset decodes + * with many malformed/unmappable sequences; real CJK decodes cleanly. Two + * corrections make the rate meaningful (see the findings doc): + * <ol> + * <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) so + * real vendor-extension chars aren't counted as failures;</li> + * <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages (legacy + * CJK body + UTF-8 widgets) would otherwise inflate the rate. Post-discount, + * real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays ≥5.3%.</li> + * </ol> + * + * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by physically + * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8 + * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing it + * would misalign the stream and manufacture failures on genuine CJK. Instead we + * walk the bytes, skip positions that begin a valid UTF-8 sequence, and decode the + * legacy charset in place everywhere else — so real CJK is never misaligned and + * the rate errs toward <em>not</em> vetoing. + * + * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form <em>valid</em> + * CJK at ~0 failure) — that's the typicality layer's job. + */ +public final class CjkDecodeValidator { + + private CjkDecodeValidator() { + } + + /** Minimum legacy (non-UTF-8) high bytes required before the rate is trusted. 
*/ + public static final int MIN_HIGH_BYTES = 30; + + /** + * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor superset, + * counting only legacy high bytes (embedded UTF-8 is skipped, not counted). + * + * @return failures / legacy-high-bytes, or {@code -1.0} when there is too + * little legacy evidence (legacy high bytes < {@link #MIN_HIGH_BYTES}) + */ + public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) { + Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset); + if (decodeAs == null) { + decodeAs = cjkCharset; + } + CharsetDecoder dec = decodeAs.newDecoder() + .onMalformedInput(CodingErrorAction.REPORT) + .onUnmappableCharacter(CodingErrorAction.REPORT); + CharBuffer one = CharBuffer.allocate(1); + int i = 0; + int n = bytes.length; + int fail = 0; + int nHigh = 0; + while (i < n) { + int x = bytes[i] & 0xFF; + if (x < 0x80) { + i++; + continue; + } + int ulen = utf8SequenceLength(bytes, i); + if (ulen > 0) { + i += ulen; // embedded UTF-8 — not legacy content, skip + continue; + } + nHigh++; + dec.reset(); + one.clear(); + ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i)); + CoderResult r = dec.decode(in, one, true); + if (r.isError()) { + fail++; + i++; + } else { + int consumed = in.position() - i; + i += Math.max(1, consumed); + } + } + if (nHigh < MIN_HIGH_BYTES) { + return -1.0; + } + return (double) fail / nHigh; + } + + /** True for the legacy multi-byte CJK charsets this veto applies to (the + * decode-failure signal is meaningful only for these; ISO-2022 is handled + * structurally and single-byte charsets don't apply). 
*/ + public static boolean appliesTo(String charsetName) { + String name = charsetName.toLowerCase(Locale.ROOT); + if (name.contains("2022")) { + return false; // escape-based, structural + } + return name.contains("gb") || name.contains("big5") || name.contains("euc") + || name.contains("shift") || name.contains("jis") || name.contains("949"); + } + + /** Length (2/3/4) of a valid UTF-8 multi-byte sequence starting at {@code i}, + * or 0 if none. Lead-byte ranges exclude overlong 2-byte (C0/C1) and + * out-of-range (≥F5) leads; continuations must be 0x80–0xBF. */ + static int utf8SequenceLength(byte[] b, int i) { + int x = b[i] & 0xFF; + int len; + if (x >= 0xC2 && x <= 0xDF) { + len = 2; + } else if (x >= 0xE0 && x <= 0xEF) { + len = 3; + } else if (x >= 0xF0 && x <= 0xF4) { + len = 4; + } else { + return 0; + } + if (i + len > b.length) { + return 0; + } + for (int k = 1; k < len; k++) { + int c = b[i + k] & 0xFF; + if (c < 0x80 || c > 0xBF) { + return 0; + } + } + return len; + } +} diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java new file mode 100644 index 0000000000..057bba4a92 --- /dev/null +++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java @@ -0,0 +1,241 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.ml.chardetect; + +import java.io.DataInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.nio.charset.Charset; +import java.nio.charset.IllegalCharsetNameException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; + +import org.apache.tika.detect.EncodingResult; + +/** + * Family-level guard over the NB statistical pick, defending against the + * single-byte-→CJK collision (Cyrillic / Greek / accented-Latin content + * whose high bytes coincide with legal GBK lead/trail pairs and accumulate + * spurious GB18030 / Big5 likelihood under the multinomial NB). + * + * <p>Two complementary, model-light signals, both blind to NB:</p> + * <ul> + * <li><b>high-byte cosine</b> — cosine between the probe's high-byte + * (≥ 0x80) byte-bigram occupancy and each class's control-stripped + * high-byte profile. Direction-based, so length/density-invariant; the + * ASCII quadrant is dropped so shared English text can't dominate. When + * NB picks a CJK class but the cosine argmax is non-CJK (with enough + * high-byte evidence), the CJK pick is vetoed.</li> + * <li><b>GBK illegality</b> — fraction of high-byte lead bytes that do + * not begin a valid GBK 2-byte or GB18030 4-byte sequence. A genuine + * GB18030 document is ~0% illegal; Cyrillic/Greek forced through GBK + * throws illegal trails. 
Scoped to GB18030 only (it says nothing about + * Shift_JIS/EUC).</li> + * </ul> + * + * <p>On veto the CJK pick is replaced by the best non-CJK candidate (by cosine + * when evidence is sufficient, else the highest-ranked non-CJK NB candidate); + * real-CJK picks are left untouched (cosine argmax stays CJK, illegality ~0), + * so the guard is regression-safe for genuine CJK.</p> + */ +public final class CosineFamilyArbiter { + + /** Minimum high-byte bigram count before the cosine veto is trusted. */ + public static final int MIN_HIGH_BYTE_SUPPORT = 15; + + /** GBK-illegality fraction above which a GB18030 pick is refuted. */ + public static final double GBK_ILLEGAL_THRESHOLD = 0.02; + + private static final String GB18030 = "GB18030"; + + private final String[] names; + private final boolean[] cjk; + private final Charset[] charsets; // resolved JVM charset, null if unsupported + private final int[][] bigramIds; + private final float[][] weights; // L2-normalized per class + + public CosineFamilyArbiter(InputStream in) throws IOException { + try (DataInputStream dis = new DataInputStream(in)) { + int nc = dis.readInt(); + names = new String[nc]; + cjk = new boolean[nc]; + charsets = new Charset[nc]; + bigramIds = new int[nc][]; + weights = new float[nc][]; + for (int c = 0; c < nc; c++) { + names[c] = dis.readUTF(); + cjk[c] = isCjkName(names[c]); + charsets[c] = resolve(names[c]); + int nnz = dis.readInt(); + int[] ids = new int[nnz]; + float[] w = new float[nnz]; + for (int k = 0; k < nnz; k++) { + ids[k] = dis.readUnsignedShort(); + w[k] = dis.readFloat(); + } + bigramIds[c] = ids; + weights[c] = w; + } + } + } + + private static Charset resolve(String name) { + try { + return Charset.isSupported(name) ? 
Charset.forName(name) : null; + } catch (IllegalCharsetNameException e) { + return null; + } + } + + static boolean isCjkName(String name) { + String n = name.toLowerCase(Locale.ROOT); + return n.contains("gb") || n.contains("big5") || n.contains("euc") + || n.contains("shift") || n.contains("jis") || n.contains("2022") + || n.contains("949"); + } + + /** + * Apply the family guard to NB's ranked candidates. Returns {@code + * nbResults} unchanged unless NB's top pick is CJK and a veto fires, in + * which case a non-CJK replacement is promoted to the front. + */ + public List<EncodingResult> arbitrate(byte[] probe, List<EncodingResult> nbResults) { + if (nbResults == null || nbResults.isEmpty()) { + return nbResults; + } + if (!isCjkName(nbResults.get(0).getCharset().name())) { + return nbResults; + } + // Build high-byte bigram occupancy. + Map<Integer, Integer> docMap = new HashMap<>(); + long support = 0; + for (int i = 0; i + 1 < probe.length; i++) { + int b0 = probe[i] & 0xFF; + int b1 = probe[i + 1] & 0xFF; + if (b0 >= 0x80 || b1 >= 0x80) { + int bg = (b0 << 8) | b1; + docMap.merge(bg, 1, Integer::sum); + support++; + } + } + double docNorm = 0; + for (int v : docMap.values()) { + docNorm += (double) v * v; + } + docNorm = Math.sqrt(docNorm); + + boolean gbkTop = GB18030.equals(nbResults.get(0).getCharset().name()); + double illegal = gbkIllegalRate(probe); + + int cosArg = -1; + double bestCos = -1; + double[] cos = new double[names.length]; + if (docNorm > 0) { + for (int c = 0; c < names.length; c++) { + double dot = 0; + int[] ids = bigramIds[c]; + float[] w = weights[c]; + for (int k = 0; k < ids.length; k++) { + Integer dc = docMap.get(ids[k]); + if (dc != null) { + dot += w[k] * dc; + } + } + cos[c] = dot / docNorm; + if (cos[c] > bestCos) { + bestCos = cos[c]; + cosArg = c; + } + } + } + + boolean veto = (gbkTop && illegal > GBK_ILLEGAL_THRESHOLD) + || (support >= MIN_HIGH_BYTE_SUPPORT && cosArg >= 0 && !cjk[cosArg]); + if (!veto) { + return 
nbResults; + } + + // Choose replacement: best non-CJK by cosine when evidence is + // sufficient, else the highest-ranked non-CJK NB candidate. + Charset replacement = null; + if (support >= MIN_HIGH_BYTE_SUPPORT && docNorm > 0) { + double bv = -1; + for (int c = 0; c < names.length; c++) { + if (!cjk[c] && charsets[c] != null && cos[c] > bv) { + bv = cos[c]; + replacement = charsets[c]; + } + } + } + float conf = nbResults.get(0).getConfidence(); + List<EncodingResult> out = new ArrayList<>(nbResults.size() + 1); + if (replacement != null) { + out.add(new EncodingResult(replacement, conf, replacement.name(), + EncodingResult.ResultType.STATISTICAL)); + } + for (EncodingResult r : nbResults) { + if (isCjkName(r.getCharset().name())) { + continue; + } + if (replacement != null && r.getCharset().name().equals(replacement.name())) { + continue; + } + out.add(r); + } + // If we couldn't form any non-CJK candidate, don't strand the caller + // with an empty list — leave NB's result untouched. + return out.isEmpty() ? nbResults : out; + } + + /** + * Fraction of high-byte lead bytes that fail to begin a valid GBK 2-byte + * or GB18030 4-byte sequence. 0 for genuine GB18030. + */ + static double gbkIllegalRate(byte[] b) { + int n = b.length; + int i = 0; + int illegal = 0; + int lead = 0; + while (i < n) { + int c = b[i] & 0xFF; + if (c < 0x80) { + i++; + continue; + } + lead++; + if (c >= 0x81 && c <= 0xFE && i + 1 < n) { + int t = b[i + 1] & 0xFF; + if (((t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFE)) && t != 0x7F) { + i += 2; + continue; + } + if (t >= 0x30 && t <= 0x39 && i + 3 < n + && (b[i + 2] & 0xFF) >= 0x81 && (b[i + 2] & 0xFF) <= 0xFE + && (b[i + 3] & 0xFF) >= 0x30 && (b[i + 3] & 0xFF) <= 0x39) { + i += 4; + continue; + } + } + illegal++; + i++; + } + return lead == 0 ? 
0 : (double) illegal / lead; + } +} diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java index 00254dcd96..45a919274a 100644 --- a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java +++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java @@ -30,6 +30,7 @@ import org.slf4j.LoggerFactory; import org.apache.tika.config.TikaComponent; import org.apache.tika.detect.EncodingDetector; import org.apache.tika.detect.EncodingResult; +import org.apache.tika.detect.HighByteLetterStats; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; @@ -124,6 +125,19 @@ public class MojibusterEncodingDetector implements EncodingDetector { */ private static final float UTF8_STRUCTURAL_CONF = 0.95f; + /** Confidence for an ISO-2022-JP/KR/CN structural candidate (7-bit, escape-based). */ + private static final float ISO2022_STRUCTURAL_CONF = 0.95f; + + /** ISO-2022 decode-verify: a stray {@code ESC $} in plain ASCII must not win, so + * require the decode to yield real CJK at near-zero replacement rate. */ + private static final int ISO2022_MIN_CJK = 4; + private static final double ISO2022_MAX_FFFD_RATE = 0.05; + + /** False-CJK veto: drop an NB legacy-CJK candidate whose UTF-8-stripped decode + * fails above this rate. Post-strip, real CJK (pure or mixed) is ≤1.6% and + * genuine false-CJK ≥5.3%, so ~2.5% separates them (see CjkDecodeValidator). */ + private static final double CJK_FAILURE_VETO_THRESHOLD = 0.025; + /** Confidence for the windows-1252 fallback emitted on empty/ASCII probes. 
*/ private static final float FALLBACK_CONFIDENCE = 0.1f; @@ -212,7 +226,7 @@ public class MojibusterEncodingDetector implements EncodingDetector { public List<EncodingResult> detect(byte[] probe, Metadata metadata) { if (LOG.isTraceEnabled()) { int probeLen = probe == null ? 0 : probe.length; - int highBytes = probe == null ? 0 : countHighBytes(probe); + int highBytes = probe == null ? 0 : HighByteLetterStats.countHighBytes(probe); LOG.trace("mojibuster enter probe={}B highBytes={}", probeLen, highBytes); } // Empty / near-empty probes: return the WHATWG default so @@ -233,6 +247,18 @@ public class MojibusterEncodingDetector implements EncodingDetector { // consulting NB so we don't hand back a bias-driven x-MacRoman // or IBM850 pick. if (isPureAscii(probe)) { + // ISO-2022-JP/KR/CN are 7-bit escape-based encodings: NB sees no high + // bytes, so without this they fall to the windows-1252 default and + // decode to gibberish (a 4.x-vs-3.x regression; icu4j catches them). + // Gated to the pure-ASCII branch on purpose — high-byte binary that + // happens to contain an ESC sequence never reaches here, it takes the + // normal NB path. decode-verify guards the rare 7-bit stray-ESC case. 
+ Charset iso2022 = detectIso2022Verified(probe); + if (iso2022 != null) { + LOG.trace("mojibuster -> {} (iso-2022 structural)", iso2022.name()); + return List.of(new EncodingResult(iso2022, ISO2022_STRUCTURAL_CONF, + iso2022.name(), EncodingResult.ResultType.STRUCTURAL)); + } LOG.trace("mojibuster -> windows-1252 fallback (pure ASCII)"); return windows1252Fallback(); } @@ -327,7 +353,13 @@ public class MojibusterEncodingDetector implements EncodingDetector { } } LOG.trace("mojibuster utf8Check={} tolerated={}", utf8, utf8Tolerated); - if (utf8 == StructuralEncodingRules.Utf8Result.LIKELY_UTF8) { + // Emit a structural UTF-8 candidate when the grammar is clean (LIKELY) + // OR essentially-UTF-8 (NOT_UTF8 with malformed bytes within tolerance — + // a few corrupt bytes in otherwise-valid UTF-8). Both exclude legacy + // CJK, which produces many grammar errors (measured: 0/321K labeled CJK + // samples return LIKELY or fall within tolerance). The type-priority + // sort in sortAndDedup then ranks this above NB's statistical pick. + if (utf8 == StructuralEncodingRules.Utf8Result.LIKELY_UTF8 || utf8Tolerated) { pool.add(new EncodingResult( java.nio.charset.StandardCharsets.UTF_8, UTF8_STRUCTURAL_CONF, "UTF-8", @@ -361,6 +393,18 @@ public class MojibusterEncodingDetector implements EncodingDetector { && !utf8Tolerated) { continue; } + // False-CJK veto: a legacy multi-byte CJK pick whose bytes don't + // validate (high decode-failure on the UTF-8-stripped remainder) is + // Latin/Cyrillic/garbage mis-read as CJK. Drop it — if it was NB's + // only candidate the pool empties and the windows-1252 fallback wins. 
+ if (CjkDecodeValidator.appliesTo(name)) { + double failRate = CjkDecodeValidator.strippedFailureRate(nbInput, r.getCharset()); + if (failRate >= CJK_FAILURE_VETO_THRESHOLD) { + LOG.trace("mojibuster veto {} (cjk decode-failure {}%)", name, + String.format(Locale.ROOT, "%.2f", failRate * 100)); + continue; + } + } pool.add(r); } @@ -404,6 +448,50 @@ public class MojibusterEncodingDetector implements EncodingDetector { EncodingResult.ResultType.STATISTICAL)); } + /** + * Detect ISO-2022-JP/KR/CN by escape sequence, then verify the decode is + * real CJK (not a stray {@code ESC $} in ASCII text). Returns the charset + * or {@code null}. Caller guarantees {@code probe} is pure 7-bit ASCII. + */ + private static Charset detectIso2022Verified(byte[] probe) { + Charset cs = StructuralEncodingRules.detectIso2022(probe); + if (cs == null) { + return null; + } + String decoded; + try { + decoded = new String(probe, cs); // REPLACE on malformed/unmappable + } catch (Exception e) { + return null; + } + int cjk = 0; + int fffd = 0; + for (int i = 0; i < decoded.length(); ) { + int cp = decoded.codePointAt(i); + i += Character.charCount(cp); + if (cp == 0xFFFD) { + fffd++; + } else if (isCjkChar(cp)) { + cjk++; + } + } + if (cjk >= ISO2022_MIN_CJK + && fffd <= decoded.length() * ISO2022_MAX_FFFD_RATE) { + return cs; + } + return null; + } + + /** Han / kana / hangul / CJK punctuation — the scripts ISO-2022-JP/KR/CN carry. */ + private static boolean isCjkChar(int cp) { + return (cp >= 0x3040 && cp <= 0x30FF) // hiragana + katakana + || (cp >= 0x4E00 && cp <= 0x9FFF) // CJK unified + || (cp >= 0x3400 && cp <= 0x4DBF) // CJK ext A + || (cp >= 0xAC00 && cp <= 0xD7A3) // hangul syllables + || (cp >= 0xFF66 && cp <= 0xFF9F) // halfwidth katakana + || (cp >= 0x3000 && cp <= 0x303F); // CJK symbols/punctuation + } + /** * Pure 7-bit ASCII test: no bytes ≥ 0x80 and no null bytes. 
* Null-byte exclusion prevents misclassifying UTF-16/32 content @@ -542,8 +630,8 @@ public class MojibusterEncodingDetector implements EncodingDetector { return ranked; } Charset win1252 = Charset.forName(WIN1252); - int winLetters = countHighByteLetters(probe, win1252); - int topLetters = countHighByteLetters(probe, top.getCharset()); + int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win1252); + int topLetters = HighByteLetterStats.countCasedHighByteLetters(probe, top.getCharset()); // Tie goes to windows-1252 (WHATWG-canonical default). if (winLetters < topLetters) { return ranked; @@ -558,85 +646,23 @@ public class MojibusterEncodingDetector implements EncodingDetector { } /** - * Decode the probe under {@code cs} and count codepoints that - * are Unicode "cased letters" (Lu / Ll / Lt) at codepoints ≥ - * 0x80. Used by the Latin sibling fallback to compare decoded- - * text quality between two candidate SBCS encodings. - * - * <p>Deliberately excludes a few "letter-ish but typographic" - * categories that {@link Character#isLetter(int)} would otherwise - * count, because they fooled the rule in earlier evals:</p> - * <ul> - * <li><b>Modifier letters (Lm)</b>: spacing-modifier letterlike - * symbols (ʰ ʷ ˆ ˜ ʻ etc.) that some encodings put at - * byte positions where the truthful encoding has a symbol / - * punctuation.</li> - * <li><b>Ordinal indicators</b>: U+00AA (ª), U+00BA (º), - * U+207F (ⁿ), U+2122 (™ — not Ll, included for safety). - * MacRoman's 0xBB and 0xBC are ª / º respectively; the - * windows-1252 truth for byte 0xBB is » (final punctuation, - * not a letter). 
Without this exclusion, MacRoman's - * letter count beats win-1252's on probes where » appears.</li> - * <li><b>Other letter (Lo)</b>: covers CJK / Korean letterlike - * codepoints that occasionally fall out of byte-level - * decodes; counting those as "Latin letters" would mislead - * the Latin-sibling comparison.</li> - * </ul> - */ - private static int countHighByteLetters(byte[] probe, Charset cs) { - String decoded; - try { - decoded = new String(probe, cs); - } catch (Exception e) { - return 0; - } - int count = 0; - for (int i = 0; i < decoded.length(); ) { - int cp = decoded.codePointAt(i); - if (cp >= 0x80 && isCasedLatinishLetter(cp)) { - count++; - } - i += Character.charCount(cp); - } - return count; - } - - /** - * Returns true for codepoints in Unicode's "cased letter" - * categories (Lu / Ll / Lt) but EXCLUDING specific letterlike - * typographic symbols (ª, º, ⁿ). See {@link #countHighByteLetters}. - */ - private static boolean isCasedLatinishLetter(int cp) { - if (cp == 0x00AA || cp == 0x00BA || cp == 0x207F) { - return false; // ª, º, ⁿ — ordinal / superscript indicators - } - int type = Character.getType(cp); - return type == Character.UPPERCASE_LETTER - || type == Character.LOWERCASE_LETTER - || type == Character.TITLECASE_LETTER; - } - - private static int countHighBytes(byte[] probe) { - int n = 0; - for (byte b : probe) { - if ((b & 0xFF) >= 0x80) { - n++; - } - } - return n; - } - - /** - * Sort pool by confidence descending, deduplicate by charset name - * keeping the highest-confidence instance. Stable ordering is - * good enough for current needs; if we need trust-type tiebreaks - * (STRUCTURAL > DECLARATIVE > STATISTICAL) later, add here. + * Sort pool by trust type (STRUCTURAL > DECLARATIVE > STATISTICAL), + * then by confidence within a type, and deduplicate by charset name keeping + * the first (highest-priority) instance. 
Type priority is load-bearing: + * NB pins its statistical winner to confidence 1.0, so a structural + * candidate (UTF-8 grammar proof, UTF-32 codepoint validity) emitted below + * 1.0 would otherwise lose the sort to NB despite being the stronger signal. */ private static List<EncodingResult> sortAndDedup(List<EncodingResult> pool) { if (pool.isEmpty()) { return Collections.emptyList(); } - pool.sort((a, b) -> Float.compare(b.getConfidence(), a.getConfidence())); + pool.sort((a, b) -> { + int byType = Integer.compare(typeRank(a.getResultType()), + typeRank(b.getResultType())); + return byType != 0 ? byType + : Float.compare(b.getConfidence(), a.getConfidence()); + }); java.util.Set<String> seen = new java.util.LinkedHashSet<>(); List<EncodingResult> out = new java.util.ArrayList<>(pool.size()); for (EncodingResult r : pool) { @@ -647,6 +673,18 @@ public class MojibusterEncodingDetector implements EncodingDetector { return out; } + /** Trust-type priority for sorting: lower wins. */ + private static int typeRank(EncodingResult.ResultType t) { + switch (t) { + case STRUCTURAL: + return 0; + case DECLARATIVE: + return 1; + default: + return 2; + } + } + /** * Returns stripped bytes if the probe contains well-formed HTML/XML * tags; otherwise returns the original probe unchanged. 
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java index 4140b6f023..5becf20ce6 100644 --- a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java +++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java @@ -143,6 +143,22 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector { */ public static final int MIN_BIGRAMS_FOR_DIVERSITY_GATE = 100; + /** + * Sublinear count weighting ("count clipping"). A distinct bigram's raw + * repetition count {@code n} is replaced by {@code 1 + ln(n)} before it + * weights the per-class contribution, so a bigram repeated hundreds of + * times (e.g. a {@code "--"} separator run, observed 864× on one page) + * can no longer dominate the score by sheer volume. + * + * <p>Length-dynamic by construction (no fixed cap) and <em>class-agnostic</em>: + * it bounds <em>repetition</em>, an axis orthogonal to the Type C cap + * (which bounds a single class's per-bigram cross-class advantage) and the + * Type A diversity gate (which abstains only on globally-degenerate input). + * Partial concentration — one bigram repeated heavily inside an otherwise + * diverse probe — falls through all three of those guards; this closes it.</p> + */ + public static final boolean SUBLINEAR_COUNT = true; + /** * Script / writing-system family used by {@link #CAP_PER_BIGRAM_NATS}. 
* UTF-8 stands alone so the cap engages on UTF-vs-anything pairs @@ -513,7 +529,10 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector { } int n = counts.countAt(slot); int w = idf8[bigram]; - double countTimesIdf = (double) n * w; + // Sublinear count weighting: cap a repeated bigram's volume so a + // separator run (e.g. "--" x864) can't dominate by repetition. + double tf = (SUBLINEAR_COUNT && n > 1) ? (1.0 + Math.log(n)) : n; + double countTimesIdf = tf * w; int base = bigram * numClasses; if (!applyCap) { diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin new file mode 100644 index 0000000000..646a7e7923 Binary files /dev/null and b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin differ diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin index bcfce41d67..b89188bb32 100644 Binary files a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin and b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin differ diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java new file mode 100644 index 0000000000..14e074212a --- /dev/null +++ 
b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.ml.chardetect; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import java.io.ByteArrayOutputStream; +import java.nio.charset.Charset; +import java.util.Arrays; + +import org.junit.jupiter.api.Test; + +public class CjkDecodeValidatorTest { + + @Test + public void realJapaneseFarBelowVetoThreshold() throws Exception { + byte[] b = ("日本語のテスト文章をたくさん書いて高バイトを十分に確保します" + + "これは本物の日本語です").getBytes("Shift_JIS"); + double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("Shift_JIS")); + assertTrue(rate >= 0.0 && rate < 0.025, "real JP should be near-zero failure, got " + rate); + } + + @Test + public void realKoreanFarBelowVetoThreshold() throws Exception { + byte[] b = ("안녕하세요 이것은 진짜 한국어 문장입니다 고바이트를 충분히 확보하기 위해 " + + "여러 글자를 적습니다").getBytes("EUC-KR"); + double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("EUC-KR")); + assertTrue(rate >= 0.0 && rate < 0.025, "real KR should be near-zero 
failure, got " + rate); + } + + /** Mixed-encoding: legacy CJK body + an embedded UTF-8 run. Stripping the UTF-8 + * run de-confounds, so the rate stays low (the WS2 breakthrough). */ + @Test + public void mixedLegacyPlusUtf8NotVetoed() throws Exception { + ByteArrayOutputStream bo = new ByteArrayOutputStream(); + bo.writeBytes("日本語の本文をしっかり書いて高バイトを確保する本物のテキスト".getBytes("Shift_JIS")); + bo.writeBytes("これはUTF-8の埋め込みウィジェット".getBytes("UTF-8")); // embedded UTF-8 + double rate = CjkDecodeValidator.strippedFailureRate(bo.toByteArray(), + Charset.forName("Shift_JIS")); + assertTrue(rate >= 0.0 && rate < 0.025, "mixed real CJK should stay low post-strip, got " + rate); + } + + @Test + public void garbageHighBytesVetoed() { + byte[] b = new byte[60]; + Arrays.fill(b, (byte) 0xFF); // 0xFF is not a valid GB18030 lead → all malformed + double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("GB18030")); + assertTrue(rate >= 0.025, "garbage high bytes should be vetoed, got " + rate); + } + + @Test + public void insufficientHighBytesReturnsMinusOne() { + byte[] b = "mostly ascii with a couple high bytes".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1); + assertEquals(-1.0, CjkDecodeValidator.strippedFailureRate(b, Charset.forName("GB18030"))); + } + + @Test + public void appliesToLegacyCjkButNotIso2022OrLatin() { + assertTrue(CjkDecodeValidator.appliesTo("GB18030")); + assertTrue(CjkDecodeValidator.appliesTo("Shift_JIS")); + assertTrue(CjkDecodeValidator.appliesTo("Big5-HKSCS")); + assertTrue(CjkDecodeValidator.appliesTo("x-windows-949")); + assertEquals(false, CjkDecodeValidator.appliesTo("ISO-2022-JP")); + assertEquals(false, CjkDecodeValidator.appliesTo("windows-1252")); + } +} diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java new file 
mode 100644 index 0000000000..30b0332a01 --- /dev/null +++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.ml.chardetect; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotEquals; + +import java.io.ByteArrayOutputStream; +import java.nio.charset.Charset; +import java.nio.charset.StandardCharsets; + +import org.junit.jupiter.api.Test; + +import org.apache.tika.detect.EncodingResult; + +/** + * WS3: ISO-2022-JP/KR/CN are 7-bit escape-based encodings invisible to the NB + * bigram model; the detector recognizes them structurally inside the pure-ASCII + * branch. Binary FPs (high-byte) never reach that branch, and a stray {@code + * ESC $} in real ASCII is rejected by the decode-verify. 
+ */ +public class Iso2022DetectionTest { + + private final MojibusterEncodingDetector det = newDetector(); + + private static MojibusterEncodingDetector newDetector() { + try { + return new MojibusterEncodingDetector(); + } catch (Exception e) { + throw new RuntimeException(e); + } + } + + @Test + public void detectsRealIso2022Jp() throws Exception { + byte[] b = ("日本語のテスト文章です。これは ISO-2022-JP でエンコードされた" + + "純粋に7ビットの文書です。").getBytes("ISO-2022-JP"); + EncodingResult top = det.detect(b).get(0); + assertEquals("ISO-2022-JP", top.getCharset().name()); + assertEquals(EncodingResult.ResultType.STRUCTURAL, top.getResultType()); + } + + @Test + public void detectsRealIso2022Kr() throws Exception { + byte[] b = ("안녕하세요 이것은 ISO-2022-KR 로 인코딩된 한국어 문서입니다 " + + "순수한 7비트 텍스트입니다").getBytes("ISO-2022-KR"); + assertEquals("ISO-2022-KR", det.detect(b).get(0).getCharset().name()); + } + + @Test + public void plainAsciiIsNotIso2022() { + byte[] b = "Hello world, this is ordinary 7-bit ASCII prose with no escapes." + .getBytes(StandardCharsets.US_ASCII); + Charset top = det.detect(b).get(0).getCharset(); + assertNotEquals("ISO-2022-JP", top.name()); + assertNotEquals("ISO-2022-KR", top.name()); + } + + /** A real {@code ESC(0x1B) $ B} with an empty JIS section embedded in ASCII + * yields zero CJK, so the decode-verify must reject it (not crown ISO-2022-JP). 
*/ + @Test + public void strayEscapeInAsciiIsNotIso2022() { + ByteArrayOutputStream bo = new ByteArrayOutputStream(); + bo.writeBytes("terminal log dump: ".getBytes(StandardCharsets.US_ASCII)); + bo.writeBytes(new byte[] {0x1b, '$', 'B', 0x1b, '(', 'B'}); // enter then exit JIS + bo.writeBytes("back to ascii output".getBytes(StandardCharsets.US_ASCII)); + assertNotEquals("ISO-2022-JP", det.detect(bo.toByteArray()).get(0).getCharset().name()); + } +} diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java index b4679c1845..06e5486cb4 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java @@ -68,6 +68,8 @@ public class ExtractComparer extends ProfilerBase { public static TableInfo EMBEDDED_FILE_PATH_TABLE_B = new TableInfo("emb_path_b", ExtractProfiler.EMBEDDED_FILE_PATH_TABLE.getColInfos()); public static TableInfo CONTENTS_TABLE_A = new TableInfo("contents_a", ExtractProfiler.CONTENTS_TABLE.getColInfos()); public static TableInfo CONTENTS_TABLE_B = new TableInfo("contents_b", ExtractProfiler.CONTENTS_TABLE.getColInfos()); + public static TableInfo ENCODINGS_TABLE_A = new TableInfo("encodings_a", ExtractProfiler.ENCODINGS_TABLE.getColInfos()); + public static TableInfo ENCODINGS_TABLE_B = new TableInfo("encodings_b", ExtractProfiler.ENCODINGS_TABLE.getColInfos()); public static TableInfo TAGS_TABLE_A = new TableInfo("tags_a", ExtractProfiler.TAGS_TABLE.getColInfos()); public static TableInfo TAGS_TABLE_B = new TableInfo("tags_b", ExtractProfiler.TAGS_TABLE.getColInfos()); public static TableInfo EXCEPTION_TABLE_A = new TableInfo("exceptions_a", ExtractProfiler.EXCEPTION_TABLE.getColInfos()); @@ -207,6 +209,7 @@ public class ExtractComparer extends ProfilerBase { writeTagData(fileId, contentTagsA, TAGS_TABLE_A); 
writeProfileData(fpsA, i, contentTagsA, metadataA, fileId, containerID, numAttachmentsA, PROFILES_A); + writeEncodingData(fileId, metadataA, ENCODINGS_TABLE_A); writeExceptionData(fileId, metadataA, EXCEPTION_TABLE_A); int matchIndex = getMatch(i, sharedDigestKey, emptyDigest, handledB, metadataListA, metadataListB); @@ -218,6 +221,7 @@ public class ExtractComparer extends ProfilerBase { contentTagsB = getContent(fpsB, metadataB); writeTagData(fileId, contentTagsB, TAGS_TABLE_B); writeProfileData(fpsB, i, contentTagsB, metadataB, fileId, containerID, numAttachmentsB, PROFILES_B); + writeEncodingData(fileId, metadataB, ENCODINGS_TABLE_B); writeExceptionData(fileId, metadataB, EXCEPTION_TABLE_B); } writeEmbeddedFilePathData(i, fileId, metadataA, metadataB); @@ -263,6 +267,7 @@ public class ExtractComparer extends ProfilerBase { String fileId = (i == 0) ? containerID : Integer.toString(ID.getAndIncrement()); writeTagData(fileId, contentTagsB, TAGS_TABLE_B); writeProfileData(fpsB, i, contentTagsB, metadataB, fileId, containerID, numAttachmentsB, PROFILES_B); + writeEncodingData(fileId, metadataB, ENCODINGS_TABLE_B); writeEmbeddedFilePathData(i, fileId, null, metadataB); writeExceptionData(fileId, metadataB, EXCEPTION_TABLE_B); diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java index 3535574233..3ffcc5a31d 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java @@ -343,6 +343,7 @@ public class ExtractComparerRunner { tableInfosA.add(ExtractComparer.EXCEPTION_TABLE_A); tableInfosA.add(ExtractComparer.TAGS_TABLE_A); tableInfosA.add(ExtractComparer.CONTENTS_TABLE_A); + tableInfosA.add(ExtractComparer.ENCODINGS_TABLE_A); tableInfosA.add(ExtractComparer.EXTRACT_EXCEPTION_TABLE_A); 
tableInfosA.add(ExtractComparer.EMBEDDED_FILE_PATH_TABLE_A); @@ -351,6 +352,7 @@ public class ExtractComparerRunner { tableInfosB.add(ExtractComparer.EXTRACT_EXCEPTION_TABLE_B); tableInfosB.add(ExtractComparer.TAGS_TABLE_B); tableInfosB.add(ExtractComparer.CONTENTS_TABLE_B); + tableInfosB.add(ExtractComparer.ENCODINGS_TABLE_B); tableInfosB.add(ExtractComparer.EMBEDDED_FILE_PATH_TABLE_B); tableInfosAandB.add(ExtractComparer.COMPARISON_CONTAINERS); diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java index b7acb7c684..fca4c61473 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java @@ -256,6 +256,7 @@ public class ExtractProfileRunner { tableInfos.add(ExtractProfiler.EXTRACT_EXCEPTION_TABLE); tableInfos.add(ExtractProfiler.EXCEPTION_TABLE); tableInfos.add(ExtractProfiler.CONTENTS_TABLE); + tableInfos.add(ExtractProfiler.ENCODINGS_TABLE); tableInfos.add(ExtractProfiler.TAGS_TABLE); tableInfos.add(ExtractProfiler.EMBEDDED_FILE_PATH_TABLE); this.tableInfos = Collections.unmodifiableList(tableInfos); diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java index 0073f9ddbb..551f92d11a 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java @@ -54,9 +54,17 @@ public class ExtractProfiler extends ProfilerBase { new ColInfo(Cols.ATTACHMENT_TYPE, Types.VARCHAR, 32), new ColInfo(Cols.FILE_EXTENSION, Types.VARCHAR, 12), new ColInfo(Cols.MIME_ID, Types.INTEGER), new ColInfo(Cols.ELAPSED_TIME_MILLIS, Types.INTEGER), new ColInfo(Cols.NUM_ATTACHMENTS, 
Types.INTEGER), new ColInfo(Cols.NUM_METADATA_VALUES, Types.INTEGER), new ColInfo(Cols.NUM_PAGES, Types.INTEGER), new ColInfo(Cols.NUM_OCR_PAGES, Types.INTEGER), new ColInfo(Cols.HAS_CONTENT, Types.BOOLEAN)); + /** Charset detection per file (one row only when detection ran): final pick, + * winning detector, declared charset from metadata (Content-Type-Hint). */ + public static TableInfo ENCODINGS_TABLE = new TableInfo("encodings", + new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"), + new ColInfo(Cols.DETECTED_ENCODING, Types.VARCHAR, 64), + new ColInfo(Cols.ENCODING_DETECTOR, Types.VARCHAR, 64), + new ColInfo(Cols.DECLARED_METADATA, Types.VARCHAR, 128)); public static TableInfo EMBEDDED_FILE_PATH_TABLE = new TableInfo("emb_file_names", new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"), new ColInfo(Cols.EMBEDDED_FILE_PATH, Types.VARCHAR, 1024)); public static TableInfo CONTENTS_TABLE = new TableInfo("contents", new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"), new ColInfo(Cols.CONTENT_LENGTH, Types.INTEGER), + new ColInfo(Cols.NUM_REPLACEMENT, Types.INTEGER), new ColInfo(Cols.NUM_NON_ASCII, Types.INTEGER), new ColInfo(Cols.NUM_UNIQUE_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_TOKENS, Types.INTEGER), new ColInfo(Cols.COMMON_TOKENS_LANG, Types.VARCHAR, 12), new ColInfo(Cols.NUM_UNIQUE_COMMON_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_COMMON_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_UNIQUE_ALPHABETIC_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_ALPHABETIC_TOKENS, Types.INTEGER), new ColInfo(Cols.OOV, Types.DOUBLE), @@ -146,6 +154,7 @@ public class ExtractProfiler extends ProfilerBase { String fileId = (i == 0) ? 
containerIdString : Integer.toString(ID.incrementAndGet()); writeTagData(fileId, contentTags, TAGS_TABLE); writeProfileData(fps, i, contentTags, m, fileId, containerIdString, numAttachments, PROFILE_TABLE); + writeEncodingData(fileId, m, ENCODINGS_TABLE); writeEmbeddedPathData(i, fileId, m, EMBEDDED_FILE_PATH_TABLE); writeExceptionData(fileId, m, EXCEPTION_TABLE); try { diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java index 9c273b80db..6942c3b5a8 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java @@ -51,6 +51,8 @@ import org.apache.tika.eval.core.textstats.BasicTokenCountStatsCalculator; import org.apache.tika.eval.core.textstats.CommonTokens; import org.apache.tika.eval.core.textstats.CompositeTextStatsCalculator; import org.apache.tika.eval.core.textstats.ContentLengthCalculator; +import org.apache.tika.eval.core.textstats.NonAsciiCharCounter; +import org.apache.tika.eval.core.textstats.ReplacementCharCounter; import org.apache.tika.eval.core.textstats.TextStatsCalculator; import org.apache.tika.eval.core.textstats.TokenEntropy; import org.apache.tika.eval.core.textstats.TokenLengths; @@ -324,6 +326,8 @@ public abstract class ProfilerBase { calculators.add(new TopNTokens(10)); calculators.add(new BasicTokenCountStatsCalculator()); calculators.add(new ContentLengthCalculator()); + calculators.add(new ReplacementCharCounter()); + calculators.add(new NonAsciiCharCounter()); calculators.add(new UnicodeBlockCounter(maxContentLengthForLangId)); return new CompositeTextStatsCalculator(calculators, analyzerManager, langIder); @@ -497,6 +501,19 @@ public abstract class ProfilerBase { return; } data.put(Cols.CONTENT_LENGTH, Integer.toString(length)); + Integer numReplacement = (Integer) 
textStats.get(ReplacementCharCounter.class); + if (numReplacement != null) { + data.put(Cols.NUM_REPLACEMENT, Integer.toString(numReplacement)); + } + // Store raw counts only; derive the FFFD rate in SQL. Decode failures + // come only from high bytes, so num_replacement/num_non_ascii is the + // un-diluted rate (num_replacement/content_length dilutes to ~0 on + // ASCII-dominated docs). U+FFFD is itself >= 0x80, so it is counted in + // num_non_ascii and both ratios stay in [0,1]. + Integer numNonAscii = (Integer) textStats.get(NonAsciiCharCounter.class); + if (numNonAscii != null) { + data.put(Cols.NUM_NON_ASCII, Integer.toString(numNonAscii)); + } } langid(textStats, data); @@ -541,6 +558,35 @@ public abstract class ProfilerBase { } } + /** + * Per-file charset-detection record: the final detected encoding, the + * detector that won, and the declared charset from metadata + * (Content-Type-Hint). Writes a row only when a detected encoding is + * present, so the table holds only files that ran charset detection. 
+ */ + protected void writeEncodingData(String fileId, Metadata m, TableInfo encodingsTable) { + String detected = m.get(TikaCoreProperties.DETECTED_ENCODING); + if (detected == null) { + return; + } + Map<Cols, String> data = new HashMap<>(); + data.put(Cols.ID, fileId); + data.put(Cols.DETECTED_ENCODING, detected); + String detector = m.get(TikaCoreProperties.ENCODING_DETECTOR); + if (detector != null) { + data.put(Cols.ENCODING_DETECTOR, detector); + } + String declared = m.get(TikaCoreProperties.CONTENT_TYPE_HINT); + if (declared != null) { + data.put(Cols.DECLARED_METADATA, declared); + } + try { + writer.writeRow(encodingsTable, data); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + void writeTagData(String fileId, ContentTags contentTags, TableInfo tagsTable) { Map<String, Integer> tags = contentTags.getTags(); if (tags.size() == 0 && contentTags.getParseException() == false) { diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java index 0724f7f16e..6aa5f22259 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java @@ -26,9 +26,13 @@ public enum Cols { //profile table ID, LENGTH, FILE_NAME, FILE_EXTENSION, ELAPSED_TIME_MILLIS, NUM_METADATA_VALUES, IS_EMBEDDED, EMBEDDED_FILE_PATH, MIME_ID, TIKA_MIME_ID, FILE_MIME_ID, SHA256, MD5, NUM_ATTACHMENTS, ATTACHMENT_TYPE, EMBEDDED_DEPTH, HAS_CONTENT, + //charset detection (encodings table): final pick, winning detector, declared-via-metadata (Content-Type-Hint) + DETECTED_ENCODING, ENCODING_DETECTOR, DECLARED_METADATA, //content - CONTENT_LENGTH, NUM_UNIQUE_TOKENS, NUM_TOKENS, NUM_UNIQUE_ALPHABETIC_TOKENS, NUM_ALPHABETIC_TOKENS, //alphabetic or ideographic tokens + CONTENT_LENGTH, NUM_REPLACEMENT, NUM_NON_ASCII, //U+FFFD + non-ASCII (>=0x80) counts; FFFD rate via SQL: 
num_replacement/num_non_ascii + + NUM_UNIQUE_TOKENS, NUM_TOKENS, NUM_UNIQUE_ALPHABETIC_TOKENS, NUM_ALPHABETIC_TOKENS, //alphabetic or ideographic tokens COMMON_TOKENS_LANG, //which language was used for the common tokens metric? NUM_UNIQUE_COMMON_TOKENS, NUM_COMMON_TOKENS, TOP_N_TOKENS, LANG_ID_1, LANG_ID_PROB_1, LANG_ID_2, OOV, LANGUAGENESS, LANG_ID_PROB_2, TOKEN_ENTROPY_RATE, TOKEN_LENGTH_SUM, TOKEN_LENGTH_MEAN, diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java index dfc2f5f8cb..75bc7d4d4b 100644 --- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java +++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java @@ -478,7 +478,9 @@ public class MarkdownSummaryWriter { "round(avg(cb.oov) - avg(ca.oov), 4) as OOV_DELTA, " + "round(avg(ca.languageness), 2) as MEAN_LANG_A, " + "round(avg(cb.languageness), 2) as MEAN_LANG_B, " + - "round(avg(cb.languageness) - avg(ca.languageness), 2) as LANG_DELTA " + + "round(avg(cb.languageness) - avg(ca.languageness), 2) as LANG_DELTA, " + + "round(avg(ca.num_replacement), 1) as MEAN_FFFD_A, " + + "round(avg(cb.num_replacement), 1) as MEAN_FFFD_B " + "from contents_a ca " + "join contents_b cb on ca.id = cb.id " + "join profiles_a pa on ca.id = pa.id " + @@ -498,6 +500,8 @@ public class MarkdownSummaryWriter { "round(cb.oov - ca.oov, 4) as OOV_DELTA, " + "round(ca.languageness, 2) as LANG_A, " + "round(cb.languageness, 2) as LANG_B, " + + "ca.num_replacement as FFFD_A, " + + "cb.num_replacement as FFFD_B, " + "ca.lang_id_1 as LANG_ID_A, " + "cb.lang_id_1 as LANG_ID_B " + "from contents_a ca " + @@ -523,6 +527,8 @@ public class MarkdownSummaryWriter { "round(cb.languageness - ca.languageness, 2) as LANG_DELTA, " + "round(ca.oov, 4) as OOV_A, " + "round(cb.oov, 4) as OOV_B, " + + 
"ca.num_replacement as FFFD_A, " + + "cb.num_replacement as FFFD_B, " + "ca.lang_id_1 as LANG_ID_A, " + "cb.lang_id_1 as LANG_ID_B " + "from contents_a ca " + diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java index c17e6b01f2..2b2e995cf7 100644 --- a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java +++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java @@ -38,10 +38,24 @@ public class LanguageIDWrapper implements StringStatsCalculator<List<LanguageRes public List<LanguageResult> calculate(String txt) { CharSoupLanguageDetector detector = new CharSoupLanguageDetector(); detector.setMaxLength(MAX_TEXT_LENGTH); - detector.addText(txt); + detector.addText(normalizeWhitespace(txt)); return detector.detectAll(); } + /** + * Collapse whitespace runs and trim before langid: the truncation window + * counts whitespace, so extracts differing only in whitespace can flip the + * detected language and pick different common-token dictionaries in an A/B + * eval. CharSoup features are whitespace-invariant, so this only stabilizes + * the window, not the scoring. 
+ */ + static String normalizeWhitespace(String txt) { + if (txt == null) { + return ""; + } + return txt.replaceAll("\\s+", " ").trim(); + } + public static Set<String> getSupportedLanguages() { return CharSoupLanguageDetector.getSupportedLanguages(); diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java new file mode 100644 index 0000000000..7cd87dd810 --- /dev/null +++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.eval.core.textstats; + +/** + * Counts non-ASCII characters (code units ≥ U+0080) in the extracted text. + * + * <p>Used as the denominator for the U+FFFD rate (see {@link ReplacementCharCounter}): + * decode failures only arise from high bytes, so FFFD as a fraction of non-ASCII + * chars is the un-diluted signal, whereas FFFD over total length collapses to ~0 + * on COMMON / ASCII-dominated documents. 
U+FFFD itself is ≥ 0x80, so it is + * included in this count, keeping the rate in [0, 1].</p> + */ +public class NonAsciiCharCounter implements StringStatsCalculator<Integer> { + @Override + public Integer calculate(String txt) { + int n = 0; + for (int i = 0; i < txt.length(); i++) { + if (txt.charAt(i) >= 0x80) { + n++; + } + } + return n; + } +} diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java new file mode 100644 index 0000000000..a55dbdc65c --- /dev/null +++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.eval.core.textstats; + +/** + * Counts U+FFFD (REPLACEMENT CHARACTER) occurrences in the extracted text. + * + * <p>A high replacement-char count signals a decode failure: the charset used + * to decode the bytes couldn't map them, producing U+FFFD.
Unlike OOV, this is + * a structural correctness signal that does not depend on the per-language + * vocabulary, so it does not mis-rank CJK decodes (real CJK is OOV-heavy but + * has zero U+FFFD; mojibake under the wrong charset has many).</p> + */ +public class ReplacementCharCounter implements StringStatsCalculator<Integer> { + @Override + public Integer calculate(String txt) { + int n = 0; + for (int i = 0; i < txt.length(); i++) { + if (txt.charAt(i) == 0xFFFD) { + n++; + } + } + return n; + } +} diff --git a/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java b/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java index 056c768a65..f85011c07b 100644 --- a/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java +++ b/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java @@ -18,9 +18,13 @@ package org.apache.tika.ml.junkdetect; import java.io.IOException; import java.nio.charset.Charset; +import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; import java.util.LinkedHashMap; +import java.util.LinkedHashSet; import java.util.List; import java.util.Map; import java.util.Set; @@ -29,8 +33,10 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.tika.config.TikaComponent; +import org.apache.tika.detect.CharsetSupersets; import org.apache.tika.detect.EncodingDetectorContext; import org.apache.tika.detect.EncodingResult; +import org.apache.tika.detect.HighByteLetterStats; import org.apache.tika.detect.MetaEncodingDetector; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; @@ -80,6 +86,16 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { * anchor instead of arbitrating near-identical decodes by quality. 
*/ private static final float NO_INFO_CONFIDENCE = 0.1f; + // Adaptive candidate band (TIKA speed lever). The tournament only needs + // NB's top-2 statistical candidates plus any lower-ranked candidate still + // within MIN_TAIL_CONFIDENCE of the top; deeper, low-confidence candidates + // are clearly dominated and almost never win (measured: δ=0.5 retains + // ~98-99% of selected winners, ~20% smaller pool). Anchors (DECLARATIVE, + // STRUCTURAL) are always kept regardless of confidence. Quality impact is + // validated by a full common-token/OOV eval, NOT assumed. + private static final int ALWAYS_KEEP_TOP_N = 2; + private static final float MIN_TAIL_CONFIDENCE = 0.5f; + /** Cached quality detector. {@code null} if none is on the classpath. */ private final TextQualityDetector qualityDetector; @@ -152,24 +168,25 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { // become codepoints whose cross-script transitions expose mojibake // under a wrong decoding (AIT5 case). Map<Charset, String> candidates = new LinkedHashMap<>(); - for (Charset cs : uniqueCharsets) { - String decoded = safeDecode(bytes, cs); - if (decoded != null && !decoded.isEmpty()) { - decoded = HtmlContentCleaner.clean(decoded); + // Dedup: charsets that decode the raw probe to the identical string + // (e.g. GB18030/GBK, x-windows-949/EUC-KR on non-extension content) + // share one clean() call — the cleaned result is identical by + // construction, so this is quality-neutral, purely a work saving. 
+ Map<String, String> cleanedByRaw = new HashMap<>(); + Set<Charset> candidateCharsets = bandFilter(context, uniqueCharsets); + for (Charset cs : candidateCharsets) { + String raw = safeDecode(bytes, cs); + if (raw == null || raw.isEmpty()) { + LOG.trace("junk-filter decode {} -> null/empty", cs.name()); + continue; + } + String decoded = cleanedByRaw.get(raw); + if (decoded == null) { + decoded = HtmlContentCleaner.clean(raw); + cleanedByRaw.put(raw, decoded); } if (decoded != null && !decoded.isEmpty()) { candidates.put(cs, decoded); - if (LOG.isTraceEnabled()) { - int sampleLen = Math.min(400, decoded.length()); - String sample = decoded.substring(0, sampleLen) - .replace('\n', ' ').replace('\r', ' '); - LOG.trace("junk-filter decoded {}: '{}{}' (len={})", - cs.name(), sample, - decoded.length() > sampleLen ? "…" : "", - decoded.length()); - } - } else { - LOG.trace("junk-filter decode {} -> null/empty", cs.name()); } } if (candidates.size() <= 1) { @@ -228,17 +245,31 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { Charset champion = null; double championZ = Double.NEGATIVE_INFINITY; Map<Charset, Double> scoreByCharset = new LinkedHashMap<>(); + Map<Charset, Double> diffByCharset = new LinkedHashMap<>(); + // Dedup by text: [0] = whole-text z (the champion + anchor metric, kept + // exactly as before); [1] = script-letter "diff" z (codepoints >= 0x80 + // that are letters/ideographs — the high bytes where the candidate + // decodes actually differ), used ONLY for the family gate below. + Map<String, float[]> zByText = new HashMap<>(); for (Map.Entry<Charset, String> entry : candidates.entrySet()) { - org.apache.tika.quality.TextQualityScore sc = - qualityDetector.score(entry.getValue()); - float rawZ = sc.isUnknown() ? 
Float.NEGATIVE_INFINITY : sc.getZScore(); - scoreByCharset.put(entry.getKey(), (double) rawZ); - LOG.trace("junk-filter score {} z={} script={}", - entry.getKey().name(), - String.format(java.util.Locale.ROOT, "%.3f", rawZ), - sc.isUnknown() ? "UNKNOWN" : sc.getDominantScript()); - if (rawZ > championZ) { - championZ = rawZ; + String text = entry.getValue(); + float[] zs = zByText.get(text); + if (zs == null) { + org.apache.tika.quality.TextQualityScore sc = qualityDetector.score(text); + float wholeZ = sc.isUnknown() ? Float.NEGATIVE_INFINITY : sc.getZScore(); + String diff = scriptLetters(text); + float diffZ = Float.NEGATIVE_INFINITY; + if (!diff.isEmpty()) { + org.apache.tika.quality.TextQualityScore d = qualityDetector.score(diff); + diffZ = d.isUnknown() ? Float.NEGATIVE_INFINITY : d.getZScore(); + } + zs = new float[]{wholeZ, diffZ}; + zByText.put(text, zs); + } + scoreByCharset.put(entry.getKey(), (double) zs[0]); + diffByCharset.put(entry.getKey(), (double) zs[1]); + if (zs[0] > championZ) { + championZ = zs[0]; champion = entry.getKey(); } } @@ -248,6 +279,48 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { champion = candidates.keySet().iterator().next(); } + // CJK-vs-non-CJK family gate. The whole-text z coin-flips on the + // CJK/non-CJK BOUNDARY for COMMON-dominated docs (markup/digits/punct + // decode identically and swamp the few discriminating high bytes), + // producing false-CJK and real-CJK demotion. The script-letter "diff" z + // reads that boundary cleanly (coherent CJK vs garbage), so use it to + // decide ONLY the family; within a family the whole-text champion stands + // (Latin-vs-Latin etc. untouched — a blanket diff-score regressed there). + // Override only on a clear diff margin. 
+ double bestCjkDiff = Double.NEGATIVE_INFINITY; + double bestNonCjkDiff = Double.NEGATIVE_INFINITY; + for (Map.Entry<Charset, Double> e : diffByCharset.entrySet()) { + if (isCjkCharset(e.getKey().name())) { + bestCjkDiff = Math.max(bestCjkDiff, e.getValue()); + } else { + bestNonCjkDiff = Math.max(bestNonCjkDiff, e.getValue()); + } + } + // DEMOTE-ONLY: fire only to demote a CJK champion to non-CJK when the + // diff z clearly prefers non-CJK (the false-CJK fix). The reverse + // (promote non-CJK -> CJK) is NOT done: measured at 29k, the diff z + // reliably says "this CJK pick is really non-CJK" (OOV improves on every + // such flip) but UNreliably says "this non-CJK pick is really CJK" (the + // junk model over-rates ideograph mojibake vs sparse Latin letters — OOV + // worsened on every promote flip). The promote direction is also + // unnecessary: genuine CJK is html-meta-declared upstream. + if (isCjkCharset(champion.name()) + && bestNonCjkDiff > bestCjkDiff + FAMILY_DIFF_MARGIN) { + Charset reFam = bestInFamily(scoreByCharset, false); + if (reFam != null) { + LOG.trace("junk-filter family gate: {} (CJK) -> {} (non-CJK by diff z)", + champion.name(), reFam.name()); + champion = reFam; + } + } + + // Within-Latin letter gate (demote-only). Sibling to the CJK gate, + // for the other boundary the whole-text z can't see: a DOS-OEM / Mac + // pick whose high bytes decode to box-drawing/symbols beating the + // windows-1252 truth under COMMON-dilution. Cased-letter count reads + // it where typicality ties. See {@link #applyLatinLetterGate}. 
+ champion = applyLatinLetterGate(bytes, champion, candidates.keySet()); + // "No-info" guard: if the statistical layer produced no confident // answer — no STRUCTURAL proof, and its best STATISTICAL candidate is // no better than Mojibuster's windows-1252 "I don't know" fallback @@ -274,6 +347,153 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { return List.of(new EncodingResult(champion, confidence)); } + /** Minimum diff-z margin by which the other family must beat the champion's + * family before the family gate overrides. Large enough to ignore the + * noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */ + private static final double FAMILY_DIFF_MARGIN = 2.0; + + private static boolean isCjkCharset(String name) { + String n = name.toLowerCase(java.util.Locale.ROOT); + return n.contains("gb") || n.contains("big5") || n.contains("euc") + || n.contains("shift") || n.contains("jis") || n.contains("2022") + || n.contains("949"); + } + + /** Highest whole-text-z candidate within the requested family (CJK or not). */ + private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean cjk) { + Charset best = null; + double bz = Double.NEGATIVE_INFINITY; + for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) { + if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) { + bz = e.getValue(); + best = e.getKey(); + } + } + return best; + } + + /** Script-letter "diff" content: codepoints ≥ 0x80 that are letters/ + * ideographs — the high bytes where candidate decodes differ. Shared ASCII + * and non-ASCII punctuation/symbols are dropped (they dilute toward a + * COMMON-dominated tie). Used only for the CJK-vs-non-CJK family gate. 
*/ + private static String scriptLetters(String s) { + StringBuilder b = new StringBuilder(); + s.codePoints().forEach(c -> { + if (c >= 0x80 && Character.isLetter(c)) { + b.appendCodePoint(c); + } + }); + return b.toString(); + } + + /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. */ + private static final String WIN1252 = "windows-1252"; + + /** Latin single-byte charsets the within-Latin letter gate may arbitrate. + * EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek + * -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased + * letters would pollute the count, and all multi-byte CJK (the family + * gate's territory). */ + private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList( + "windows-1252", "windows-1250", "windows-1254", "windows-1257", "windows-1258", + "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-9", + "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", + "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", "IBM863", "IBM865", + "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", "x-MacIceland")); + + /** Probe must have at least this many high bytes for the gate to act — + * below it the letter gap is noise (most over-picks are sparse). */ + private static final int LATIN_GATE_MIN_HIGH_BYTES = 16; + /** windows-1252 must win the cased-letter count by > max(FLOOR, FRACTION + * * highBytes). The margin lets the gate cover Central-European / DOS + * siblings safely — genuine CE text wins MORE letters under its true + * charset so the gate stays silent — without the tie-flip that forces the + * mojibuster Western-Latin fallback to scope itself out of those families. */ + private static final double LATIN_GATE_MARGIN_FLOOR = 6.0; + private static final double LATIN_GATE_MARGIN_FRACTION = 0.20; + + /** + * Within-Latin letter-plausibility gate (demote-only). 
Demotes {@code + * champion} to windows-1252 only when windows-1252 is a live candidate, both + * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes + * clearly MORE cased high-byte letters than the champion — the box-drawing + * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to + * symbols. The compare is directional: a genuine Central-European / DOS doc + * wins MORE letters under its true charset, so the gate leaves it untouched. + * Latin-scoped so it never crosses the CJK boundary (the family gate above) + * or touches non-Latin SBCS. Returns the (possibly demoted) charset. + */ + static Charset applyLatinLetterGate(byte[] probe, Charset champion, + Set<Charset> candidates) { + String name = champion.name(); + if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) { + return champion; + } + Charset win = null; + for (Charset c : candidates) { + if (WIN1252.equals(c.name())) { + win = c; + break; + } + } + if (win == null) { + return champion; + } + int high = HighByteLetterStats.countHighBytes(probe); + if (high < LATIN_GATE_MIN_HIGH_BYTES) { + return champion; + } + int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win); + int champLetters = HighByteLetterStats.countCasedHighByteLetters(probe, champion); + double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high); + if (winLetters > champLetters + margin) { + LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased high-byte " + + "letters {} vs {}, high={})", name, champLetters, winLetters, high); + return win; + } + return champion; + } + + /** + * Restrict the candidate set the tournament will decode+clean+score: keep + * every DECLARATIVE/STRUCTURAL anchor (author intent / byte-grammar proof), + * plus the top {@link #ALWAYS_KEEP_TOP_N} STATISTICAL candidates by + * confidence, plus any deeper STATISTICAL candidate still within + * {@link #MIN_TAIL_CONFIDENCE}. 
Drops the dominated low-confidence tail — + * the speed lever — without removing any anchor or NB's real contenders. + * Returns a subset of {@code all}, preserving its iteration order. + */ + private static Set<Charset> bandFilter(EncodingDetectorContext context, Set<Charset> all) { + Set<Charset> anchors = new HashSet<>(); + List<EncodingResult> stats = new ArrayList<>(); + for (EncodingDetectorContext.Result r : context.getResults()) { + for (EncodingResult er : r.getEncodingResults()) { + EncodingResult.ResultType t = er.getResultType(); + if (t == EncodingResult.ResultType.DECLARATIVE + || t == EncodingResult.ResultType.STRUCTURAL) { + anchors.add(er.getCharset()); + } else if (t == EncodingResult.ResultType.STATISTICAL) { + stats.add(er); + } + } + } + stats.sort((a, b) -> Float.compare(b.getConfidence(), a.getConfidence())); + Set<Charset> keepStat = new HashSet<>(); + for (int i = 0; i < stats.size(); i++) { + if (i < ALWAYS_KEEP_TOP_N + || stats.get(i).getConfidence() >= MIN_TAIL_CONFIDENCE) { + keepStat.add(stats.get(i).getCharset()); + } + } + Set<Charset> kept = new LinkedHashSet<>(); + for (Charset cs : all) { + if (anchors.contains(cs) || keepStat.contains(cs)) { + kept.add(cs); + } + } + return kept; + } + /** * True if some detector produced a confident non-declarative signal: any * STRUCTURAL result (byte-grammar proof), or any STATISTICAL result above @@ -369,10 +589,17 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector { } private static String safeDecode(byte[] bytes, Charset charset) { + // Score CJK candidates on their vendor superset, not the strict base + // (which U+FFFDs vendor-extension chars and unfairly penalizes real + // CJK). AutoDetectReader re-applies the same superset for content. 
+ Charset decodeAs = CharsetSupersets.supersetOf(charset); + if (decodeAs == null) { + decodeAs = charset; + } try { - return new String(bytes, charset); + return new String(bytes, decodeAs); } catch (Exception e) { - LOG.debug("Decode failed for {}: {}", charset.name(), e.toString()); + LOG.debug("Decode failed for {}: {}", decodeAs.name(), e.toString()); return null; } } diff --git a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java index 705cbbe99f..cd29df2616 100644 --- a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java +++ b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java @@ -67,6 +67,79 @@ public class JunkFilterEncodingDetectorTest { } } + /** + * Functional stub for the CJK-vs-non-CJK family gate. Returns one of four + * controlled z-scores per scored string, keyed on whether the string + * contains Han ideographs (CJK family) and whether it is a "diff" string + * (script-letters only, i.e. every codepoint ≥ 0x80) vs whole text. + * Lets us drive {@code JunkFilterEncodingDetector}'s gate deterministically + * without the real model: the detector scores both the whole decoded text + * (champion metric) and its script-letter diff (family-gate metric) for + * each candidate, so the four cells fully determine the gate's decision. 
+ */ + private static final class ZStub implements TextQualityDetector { + private final double wholeCjk; + private final double wholeNonCjk; + private final double diffCjk; + private final double diffNonCjk; + + ZStub(double wholeCjk, double wholeNonCjk, double diffCjk, double diffNonCjk) { + this.wholeCjk = wholeCjk; + this.wholeNonCjk = wholeNonCjk; + this.diffCjk = diffCjk; + this.diffNonCjk = diffNonCjk; + } + + private static boolean isCjk(String s) { + return s.codePoints().anyMatch(c -> c >= 0x4E00 && c <= 0x9FFF); + } + + /** Diff string = script-letters only: non-empty, every codepoint ≥ 0x80. */ + private static boolean isDiff(String s) { + return !s.isEmpty() && s.codePoints().allMatch(c -> c >= 0x80); + } + + @Override + public TextQualityScore score(String text) { + boolean cjk = isCjk(text); + double z = isDiff(text) + ? (cjk ? diffCjk : diffNonCjk) + : (cjk ? wholeCjk : wholeNonCjk); + return new TextQualityScore((float) z, Float.NaN, Float.NaN, Float.NaN, + cjk ? "HAN" : "LATIN"); + } + + @Override + public TextQualityComparison compare(String labelA, String candidateA, + String labelB, String candidateB) { + // Not exercised by the gate path (which uses score()); provided only + // to satisfy the interface. + return new TextQualityComparison("A", 0.0f, + score(candidateA), score(candidateB), labelA, labelB); + } + } + + /** + * ASCII filler + 20 copies of the byte pair {@code {0xC4, 0xE3}}: decodes to + * Han ideographs (你…) under GB18030 but accented Latin (Ä ã…) under + * windows-1252. A clean false-CJK vs real-CJK probe — the ASCII keeps the + * whole-text strings out of the "diff" bucket, while the high bytes are the + * only place the two decodes disagree. 
+ */ + private static byte[] cjkAmbiguousBytes() { + byte[] ascii = "the quick brown fox jumps over the lazy dog " + .getBytes(StandardCharsets.US_ASCII); + byte[] hi = new byte[40]; + for (int i = 0; i < 20; i++) { + hi[2 * i] = (byte) 0xC4; + hi[2 * i + 1] = (byte) 0xE3; + } + byte[] out = new byte[ascii.length + hi.length]; + System.arraycopy(ascii, 0, out, 0, ascii.length); + System.arraycopy(hi, 0, out, ascii.length, hi.length); + return out; + } + private static ParseContext contextWith(EncodingResult... results) { EncodingDetectorContext ctx = new EncodingDetectorContext(); ctx.addResult(List.of(results), "stub"); @@ -254,4 +327,110 @@ public class JunkFilterEncodingDetectorTest { String expected = "ത്ര"; assertEquals(expected, JunkFilterEncodingDetector.expandHtmlEntities(input)); } + + // ----- CJK-vs-non-CJK family gate (the demote-only false-CJK fix) ----- + // + // The whole-text z coin-flips on the CJK/non-CJK boundary for + // COMMON-dominated docs: markup/digits/punctuation decode identically under + // every candidate and swamp the few discriminating high bytes, so the junk + // model's whole-text argmax sometimes crowns a garbage CJK decode over the + // correct single-byte one (false-CJK), and sometimes the reverse. The + // script-letter "diff" z reads that boundary cleanly (coherent CJK vs + // ideograph mojibake), so the gate uses it to decide ONLY the family. + // Measured at 29k, the diff z reliably DEMOTES (CJK champion -> non-CJK; OOV + // improves on every flip) but UNreliably promotes, so the gate is + // demote-only and fires only past FAMILY_DIFF_MARGIN. These four tests lock + // each arm of that decision against the {@link ZStub}. + + @Test + public void familyGate_demotesFalseCjkToNonCjk() throws Exception { + // Whole-text champion is the CJK pick (the coin-flip), but the diff z + // clearly prefers the non-CJK decode (coherent Latin >> ideograph + // mojibake, margin 7.0 > 2.0) -> gate must demote to windows-1252. 
+ Charset gb = Charset.forName("GB18030"); + Charset win1252 = Charset.forName("windows-1252"); + ParseContext pc = contextWith( + new EncodingResult(gb, 0.8f, "GB18030", + EncodingResult.ResultType.STATISTICAL), + new EncodingResult(win1252, 0.7f, "windows-1252", + EncodingResult.ResultType.STATISTICAL)); + // wholeCjk(-1.0) > wholeNonCjk(-1.5); diffNonCjk(-1.0) >> diffCjk(-8.0) + JunkFilterEncodingDetector detector = + new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -8.0, -1.0)); + try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) { + List<EncodingResult> out = detector.detect(tis, new Metadata(), pc); + assertEquals(1, out.size()); + assertEquals(win1252, out.get(0).getCharset(), + "diff z prefers non-CJK by > FAMILY_DIFF_MARGIN -> CJK " + + "champion must be demoted to windows-1252"); + } + } + + @Test + public void familyGate_keepsRealCjkWhenDiffAgrees() throws Exception { + // Whole-text champion is CJK and the diff z AGREES (ideographs coherent, + // Latin garbage) -> gate must NOT fire; real CJK stays CJK. 
+ Charset gb = Charset.forName("GB18030"); + Charset win1252 = Charset.forName("windows-1252"); + ParseContext pc = contextWith( + new EncodingResult(gb, 0.8f, "GB18030", + EncodingResult.ResultType.STATISTICAL), + new EncodingResult(win1252, 0.7f, "windows-1252", + EncodingResult.ResultType.STATISTICAL)); + // diffCjk(-1.0) >> diffNonCjk(-8.0): non-CJK does not beat CJK -> no demote + JunkFilterEncodingDetector detector = + new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -1.0, -8.0)); + try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) { + List<EncodingResult> out = detector.detect(tis, new Metadata(), pc); + assertEquals(1, out.size()); + assertEquals(gb, out.get(0).getCharset(), + "diff z agrees with the CJK champion -> must not demote"); + } + } + + @Test + public void familyGate_isDemoteOnly_neverPromotesNonCjkToCjk() throws Exception { + // Whole-text champion is NON-CJK; even though the diff z would prefer + // CJK, the gate is demote-only (the promote direction regressed at 29k), + // so the non-CJK champion must stand. 
+ Charset gb = Charset.forName("GB18030"); + Charset win1252 = Charset.forName("windows-1252"); + ParseContext pc = contextWith( + new EncodingResult(gb, 0.7f, "GB18030", + EncodingResult.ResultType.STATISTICAL), + new EncodingResult(win1252, 0.8f, "windows-1252", + EncodingResult.ResultType.STATISTICAL)); + // wholeNonCjk(-1.0) > wholeCjk(-1.5) -> champion non-CJK; diffCjk strong but ignored + JunkFilterEncodingDetector detector = + new JunkFilterEncodingDetector(new ZStub(-1.5, -1.0, -1.0, -8.0)); + try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) { + List<EncodingResult> out = detector.detect(tis, new Metadata(), pc); + assertEquals(1, out.size()); + assertEquals(win1252, out.get(0).getCharset(), + "gate is demote-only: a non-CJK champion is never promoted to CJK"); + } + } + + @Test + public void familyGate_respectsDiffMargin() throws Exception { + // Non-CJK diff z beats CJK diff z, but by LESS than FAMILY_DIFF_MARGIN + // (2.0): a boundary-noise tie, not a clear signal -> no demote. 
+ Charset gb = Charset.forName("GB18030"); + Charset win1252 = Charset.forName("windows-1252"); + ParseContext pc = contextWith( + new EncodingResult(gb, 0.8f, "GB18030", + EncodingResult.ResultType.STATISTICAL), + new EncodingResult(win1252, 0.7f, "windows-1252", + EncodingResult.ResultType.STATISTICAL)); + // diffNonCjk(-1.0) - diffCjk(-2.0) = 1.0 < margin 2.0 -> no demote + JunkFilterEncodingDetector detector = + new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -2.0, -1.0)); + try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) { + List<EncodingResult> out = detector.detect(tis, new Metadata(), pc); + assertEquals(1, out.size()); + assertEquals(gb, out.get(0).getCharset(), + "diff margin below FAMILY_DIFF_MARGIN -> no demote " + + "(boundary-noise guard)"); + } + } } diff --git a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java new file mode 100644 index 0000000000..441677ad65 --- /dev/null +++ b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.tika.ml.junkdetect; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import java.nio.charset.Charset; +import java.util.LinkedHashSet; +import java.util.Set; + +import org.junit.jupiter.api.Test; + +/** + * Unit tests for the within-Latin letter-plausibility gate + * ({@link JunkFilterEncodingDetector#applyLatinLetterGate}) in isolation. + */ +public class LatinLetterGateTest { + + private static final Charset WIN1252 = Charset.forName("windows-1252"); + private static final Charset IBM850 = Charset.forName("IBM850"); + private static final Charset ISO_8859_2 = Charset.forName("ISO-8859-2"); + private static final Charset WIN1251 = Charset.forName("windows-1251"); + private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS"); + + private static Set<Charset> candidates(Charset... cs) { + Set<Charset> s = new LinkedHashSet<>(); + for (Charset c : cs) { + s.add(c); + } + return s; + } + + /** 0xC0-0xCF: À-Ï (letters) in windows-1252, box-drawing in IBM850 → + * windows-1252 wins the letter count decisively → demote. */ + private static byte[] boxDrawingProbe(int repeats) { + byte[] probe = new byte[16 * repeats]; + for (int r = 0; r < repeats; r++) { + for (int i = 0; i < 16; i++) { + probe[r * 16 + i] = (byte) (0xC0 + i); + } + } + return probe; + } + + @Test + void demotesBoxDrawingIbm850ToWindows1252() { + Charset out = JunkFilterEncodingDetector.applyLatinLetterGate( + boxDrawingProbe(3), IBM850, candidates(IBM850, WIN1252)); + assertEquals(WIN1252, out, "box-drawing IBM850 pick should demote to windows-1252"); + } + + /** Bytes that are Central-European letters in ISO-8859-2 (Ą Ł Ś Š Ž ...) but + * symbols (¡ £ ¦ © ...) in windows-1252. ISO-8859-2 wins the letter count, + * so the directional gate must NOT flip genuine CE text. 
+     */
+    @Test
+    void keepsGenuineCentralEuropean() {
+        int[] ceLetters = {0xA1, 0xA3, 0xA5, 0xA6, 0xA9, 0xAB, 0xAC, 0xAE, 0xAF,
+                0xB1, 0xB3, 0xB6, 0xB9, 0xBB, 0xBC, 0xBE, 0xBF};
+        byte[] probe = new byte[ceLetters.length];
+        for (int i = 0; i < ceLetters.length; i++) {
+            probe[i] = (byte) ceLetters[i];
+        }
+        Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+                probe, ISO_8859_2, candidates(ISO_8859_2, WIN1252));
+        assertEquals(ISO_8859_2, out, "genuine CE text wins letters under its true charset");
+    }
+
+    @Test
+    void silentBelowHighByteFloor() {
+        byte[] sparse = {(byte) 0xC0, (byte) 0xC1, (byte) 0xC2};
+        Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+                sparse, IBM850, candidates(IBM850, WIN1252));
+        assertEquals(IBM850, out, "below the high-byte floor the gate must not act");
+    }
+
+    @Test
+    void silentOnNonLatinChampion() {
+        Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+                boxDrawingProbe(3), WIN1251, candidates(WIN1251, WIN1252));
+        assertEquals(WIN1251, out, "Cyrillic champion is outside the Latin allowlist");
+    }
+
+    @Test
+    void silentOnCjkChampion() {
+        Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+                boxDrawingProbe(3), SHIFT_JIS, candidates(SHIFT_JIS, WIN1252));
+        assertEquals(SHIFT_JIS, out, "CJK champion is the family gate's territory, not this one");
+    }
+
+    @Test
+    void silentWhenWindows1252NotACandidate() {
+        Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+                boxDrawingProbe(3), IBM850, candidates(IBM850, ISO_8859_2));
+        assertEquals(IBM850, out, "nothing canonical to demote to without a windows-1252 candidate");
+    }
+}
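
Reviewer note (illustration only, not part of the commit): the margin rule in applyLatinLetterGate may be easier to follow with the arithmetic spelled out. The sketch below is a standalone approximation. The class name LatinGateSketch and its helper methods are hypothetical. The patch calls HighByteLetterStats, whose body is not shown in this message, so the definition of "cased letter" used here (Character.isUpperCase or Character.isLowerCase on each high byte decoded individually) is an assumption. The three thresholds are copied from the patch. It also assumes the JVM ships the IBM850 charset.

```java
import java.nio.charset.Charset;

public class LatinGateSketch {
    // Thresholds copied from the patch.
    static final int MIN_HIGH_BYTES = 16;
    static final double MARGIN_FLOOR = 6.0;
    static final double MARGIN_FRACTION = 0.20;

    /** Counts bytes with the high bit set (value >= 0x80). */
    public static int countHighBytes(byte[] probe) {
        int n = 0;
        for (byte b : probe) {
            if ((b & 0xFF) >= 0x80) {
                n++;
            }
        }
        return n;
    }

    /**
     * ASSUMPTION: a "cased letter" is a high byte that, decoded on its own
     * under the given single-byte charset, is an upper- or lower-case code
     * point. The real HighByteLetterStats may define this differently.
     */
    public static int countCasedHighByteLetters(byte[] probe, Charset cs) {
        int n = 0;
        for (byte b : probe) {
            if ((b & 0xFF) < 0x80) {
                continue;
            }
            String s = new String(new byte[] {b}, cs);
            if (s.isEmpty()) {
                continue;
            }
            int cp = s.codePointAt(0);
            if (Character.isUpperCase(cp) || Character.isLowerCase(cp)) {
                n++;
            }
        }
        return n;
    }

    /** The margin rule alone: the high-byte floor, then the directional compare. */
    public static boolean shouldDemote(byte[] probe, Charset champion, Charset win1252) {
        int high = countHighBytes(probe);
        if (high < MIN_HIGH_BYTES) {
            return false; // too sparse: the letter gap is noise
        }
        int winLetters = countCasedHighByteLetters(probe, win1252);
        int champLetters = countCasedHighByteLetters(probe, champion);
        double margin = Math.max(MARGIN_FLOOR, MARGIN_FRACTION * high);
        return winLetters > champLetters + margin;
    }

    /** Same probe shape as LatinLetterGateTest: bytes 0xC0..0xCF, repeated. */
    public static byte[] boxDrawingProbe(int repeats) {
        byte[] probe = new byte[16 * repeats];
        for (int r = 0; r < repeats; r++) {
            for (int i = 0; i < 16; i++) {
                probe[r * 16 + i] = (byte) (0xC0 + i);
            }
        }
        return probe;
    }

    public static void main(String[] args) {
        Charset win = Charset.forName("windows-1252");
        Charset ibm = Charset.forName("IBM850");
        byte[] probe = boxDrawingProbe(3);
        int high = countHighBytes(probe);
        System.out.println("high bytes       = " + high);
        System.out.println("win-1252 letters = " + countCasedHighByteLetters(probe, win));
        System.out.println("IBM850 letters   = " + countCasedHighByteLetters(probe, ibm));
        System.out.println("margin           = " + Math.max(MARGIN_FLOOR, MARGIN_FRACTION * high));
        System.out.println("demote           = " + shouldDemote(probe, ibm, win));
    }
}
```

With the 48-byte probe the margin is max(6.0, 0.20 * 48) = 9.6, so windows-1252 has to lead the champion by more than 9.6 cased letters before a demotion happens. The three-byte probe from silentBelowHighByteFloor never reaches the comparison because it is below the 16-byte floor.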
