(tika) 01/06: add html stripper to lang detect and fix charset aliases

tallison Fri, 17 Apr 2026 10:59:40 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch charset-ship-today
in repository https://gitbox.apache.org/repos/asf/tika.git


commit b17c6f5e4a85c934430a40eb5b9a47638e3d9589
Author: tallison <[email protected]>
AuthorDate: Wed Apr 15 17:15:10 2026 -0400

    add html stripper to lang detect and fix charset aliases
---
 .../pages/advanced/charset-detection-design.adoc   |  80 +++++-
 .../charsoup/CharSoupEncodingDetector.java         |  22 +-
 .../tika/langdetect/charsoup/HtmlStripper.java     | 288 +++++++++++++++++++++
 .../charsoup/CharSoupEncodingDetectorTest.java     |  11 -
 .../html/charsetdetector/CharsetAliases.java       |  10 +-
 5 files changed, 376 insertions(+), 35 deletions(-)
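
For context on why the tag stripper was replaced, the sketch below copies the
removed `stripTags` method from this commit into a standalone class. The class
name `StripTagsDemo` and the sample input are illustrative assumptions, not
part of the commit. It suggests how a script body and a numeric character
reference pass through the old approach untouched:

```java
public class StripTagsDemo {

    // Verbatim logic of the removed CharSoupEncodingDetector.stripTags:
    // drops everything between '<' and '>' and keeps the rest.
    public static String stripTags(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean inTag = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: a script block followed by content that
        // carries a numeric character reference.
        String html = "<script>var total = 1;</script><p>caf&#233;</p>";
        // The JavaScript source and the undecoded "&#233;" both survive.
        System.out.println(stripTags(html)); // prints: var total = 1;caf&#233;
    }
}
```

The new `HtmlStripper` in this commit is described as skipping the script
contents and decoding the numeric reference instead.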

diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index 291971925b..64e2bd4c66 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -91,8 +91,23 @@ Each `EncodingResult` carries:
 
 | `DECLARATIVE`
 | Explicit charset declaration: BOM, HTML `<meta>` tag, HTTP Content-Type
-  header, or metadata hint.  Should be respected over statistical inferences
-  unless structurally impossible.
+  header, or metadata hint.
++
+*Important — declared charsets are NOT trusted by default.* When
+`CharSoupEncodingDetector` is in the chain (the default configuration),
+DECLARATIVE candidates are treated as one input among several and are
+arbitrated by language signal alongside STATISTICAL and STRUCTURAL
+candidates.  This is deliberate: real-world declarations are notoriously
+unreliable — sites serve `windows-1252` and declare `ISO-8859-1`, serve
+`UTF-8` and declare `ASCII`, copy-paste templates from other regions
+without updating the meta tag, and so on.  Tika's stance is that
+language signal over the actual decoded bytes is more trustworthy than
+a declaration on the wire.
++
+If you want declared charsets to be authoritative (e.g. you trust your
+input pipeline, or you specifically want HTML5-spec-compliant behaviour),
+configure your detector chain *without* `CharSoupEncodingDetector` —
+see <<opting-out-of-arbitration>>.
 
 | `STRUCTURAL`
 | Derived from byte-level structure (UTF-8 validity, EBCDIC space distribution).
@@ -271,6 +286,25 @@ chain switches `CompositeEncodingDetector` into collect-all mode.  After all
 other detectors run, CharSoup receives the full `EncodingDetectorContext` and
 arbitrates.
 
+[IMPORTANT]
+====
+*CharSoup intentionally arbitrates over ALL candidates, including
+DECLARATIVE ones.*  A `<meta charset>` tag, HTTP `Content-Type` charset
+parameter, or other declared charset is treated as one input among many
+— not as authoritative.  Real-world declarations on the legacy web are
+notoriously unreliable (sites declare ASCII while serving UTF-8, declare
+ISO-8859-1 while serving windows-1252, copy-paste templates from other
+regions and forget to update the meta tag, etc.).  CharSoup's stance:
+language signal over the actual decoded bytes is more trustworthy than
+the wire declaration.
+
+If you want declared charsets to be authoritative — for example because
+you trust your input pipeline, or you specifically need HTML5
+spec-compliant behaviour — *opt out of CharSoup* (see
+<<opting-out-of-arbitration>>).  This is a configuration choice, not a
+limitation.
+====
+
 Before any charset decoding, CharSoup strips leading BOM bytes from the raw
 probe.  This ensures every candidate charset decodes the same content bytes,
 preventing the BOM itself from skewing language scores.
@@ -306,6 +340,48 @@ false positives from truly lying BOMs or wrong `<meta charset>` tags.
   statistical winner; otherwise it returns the first candidate from the
   highest-confidence statistical detector.
 
+[[opting-out-of-arbitration]]
+=== Opting out — strict declared-charset honoring
+
+If your application needs declared charsets to be authoritative, omit
+`CharSoupEncodingDetector` from the encoding-detector chain.  Without
+CharSoup, `CompositeEncodingDetector` runs in classic
+"first-detector-with-a-result wins" mode.  A typical declared-charset-honoring
+configuration:
+
+[source,json]
+----
+{
+  "encoding-detectors": [
+    { "bom-detector": {} },
+    { "metadata-charset-detector": {} },
+    { "standard-html-encoding-detector": {} },
+    { "mojibuster-encoding-detector": {} }
+  ]
+}
+----
+
+In this chain:
+
+* `BOMDetector` returns DECLARATIVE on a recognised BOM.
+* `MetadataCharsetDetector` returns DECLARATIVE from HTTP/MIME headers.
+* `StandardHtmlEncodingDetector` returns DECLARATIVE from `<meta charset>` /
+  `<meta http-equiv>` tags.
+* `MojibusterEncodingDetector` runs only when none of the above produced a
+  declaration, and its STATISTICAL result is final (no language-signal
+  arbitration to second-guess it).
+
+This is HTML5-spec-compliant for the declaration cases and matches the
+behaviour callers familiar with Tika 2.x and earlier expect.  The
+trade-off is that lying declarations propagate unfiltered: a misdeclared
+page that Mojibuster's statistical output would have rescued is instead
+decoded with the charset it declares.
+
+Conversely, the default chain (with CharSoup) tolerates lying
+declarations at the cost of occasionally overriding a correct one when
+the language signal is ambiguous.  Pick the trade-off that matches your
+deployment.
+
 [[thai-gbk-case-study]]
 === Case study: why top-N limiting and the generative model matter
 
diff --git a/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetector.java
index f68e6720d2..e9fc080c29 100644
--- a/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetector.java
+++ b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetector.java
@@ -190,7 +190,7 @@ public class CharSoupEncodingDetector implements MetaEncodingDetector {
 
         Map<Charset, String> candidates = new LinkedHashMap<>();
         for (Charset candidate : uniqueCharsets) {
-            candidates.put(candidate, stripTags(decode(bytes, candidate)));
+            candidates.put(candidate, HtmlStripper.strip(decode(bytes, candidate)));
         }
 
         CharSoupLanguageDetector langDetector = new CharSoupLanguageDetector();
@@ -449,26 +449,6 @@ public class CharSoupEncodingDetector implements MetaEncodingDetector {
         return cb.toString();
     }
 
-    /**
-     * Simple tag stripping: removes &lt;...&gt; sequences so that
-     * HTML/XML tag names and attributes don't pollute language scoring.
-     */
-    static String stripTags(String text) {
-        StringBuilder sb = new StringBuilder(text.length());
-        boolean inTag = false;
-        for (int i = 0; i < text.length(); i++) {
-            char c = text.charAt(i);
-            if (c == '<') {
-                inTag = true;
-            } else if (c == '>') {
-                inTag = false;
-            } else if (!inTag) {
-                sb.append(c);
-            }
-        }
-        return sb.toString();
-    }
-
     public int getReadLimit() {
         return readLimit;
     }
diff --git a/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/HtmlStripper.java b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/HtmlStripper.java
new file mode 100644
index 0000000000..f36aa635c1
--- /dev/null
+++ b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/main/java/org/apache/tika/langdetect/charsoup/HtmlStripper.java
@@ -0,0 +1,288 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.langdetect.charsoup;
+
+/**
+ * HTML/XML markup stripping tuned for language scoring.  Not a full HTML
+ * parser — purpose-built to feed character-bigram language detectors a
+ * markup-free string that still carries the page's content language.
+ *
+ * <p>Real-world HTML probes are routinely 95-99% markup by byte count.
+ * Without this pass, a language detector sees the markup as its primary
+ * input — which on any HTML page looks predominantly like ASCII English
+ * regardless of the page's actual content language.  Stripping markup
+ * (and decoding numeric entities, which can carry content) lets the
+ * detector see the actual content.
+ *
+ * <h3>What it does, in one linear pass</h3>
+ * <ul>
+ *   <li>Removes {@code <script>...</script>} and {@code <style>...</style>}
+ *       block contents — JavaScript identifiers / CSS property names look
+ *       strongly like English and would skew language scoring on any
+ *       page.</li>
+ *   <li>Removes {@code <!-- ... -->} comments.</li>
+ *   <li>Removes {@code <...>} tag markup (element names, attribute names,
+ *       attribute values).</li>
+ *   <li><em>Decodes</em> numeric character references ({@code &#1234;},
+ *       {@code &#xABCD;}) to their actual code points — these can carry
+ *       the page's primary content (e.g. Korean-charset pages that emit
+ *       simplified-Chinese-only ideographs via numeric entities for
+ *       cross-charset compatibility).</li>
+ *   <li>Replaces named entity references ({@code &amp;}, {@code &nbsp;},
+ *       {@code &copy;}) with a space — these are nearly always
+ *       punctuation/typography with low language signal, and a full
+ *       named-entity table would be heavyweight.</li>
+ * </ul>
+ *
+ * <h3>What it doesn't do</h3>
+ * <ul>
+ *   <li>Validate HTML structure.  Malformed input, unclosed
+ *       {@code <script>} blocks, and CDATA sections are handled
+ *       defensively: unclosed brackets and unfound matching tags fall
+ *       through to end-of-input.</li>
+ *   <li>Resolve named entities.  A short list of common ones
+ *       ({@code &amp;}, {@code &lt;}, etc.) might be worth adding later
+ *       if some downstream consumer needs them; current users only score
+ *       language and do not.</li>
+ *   <li>Preserve element-content semantics ({@code <title>} vs body text,
+ *       {@code <pre>} whitespace).  All content is treated equivalently.</li>
+ * </ul>
+ */
+public final class HtmlStripper {
+
+    private HtmlStripper() {
+    }
+
+    /**
+     * Strip markup from {@code text} and return the content with numeric
+     * entities decoded.  See class javadoc for details.
+     *
+     * @param text input string (HTML/XML or plain text); {@code null} or empty
+     *             returns the input unchanged
+     * @return content with markup removed and numeric entities decoded
+     */
+    public static String strip(String text) {
+        if (text == null || text.isEmpty()) {
+            return text;
+        }
+        StringBuilder out = new StringBuilder(text.length());
+        int n = text.length();
+        int i = 0;
+        while (i < n) {
+            char c = text.charAt(i);
+            if (c == '<') {
+                i = handleOpenAngle(text, i, n, out);
+            } else if (c == '&') {
+                i = handleAmpersand(text, i, n, out);
+            } else {
+                out.append(c);
+                i++;
+            }
+        }
+        return out.toString();
+    }
+
+    /** Handle a {@code <} — element tag, comment, or script/style block. */
+    private static int handleOpenAngle(String s, int i, int n, StringBuilder out) {
+        if (startsWithIgnoreCase(s, i, "<!--")) {
+            int end = s.indexOf("-->", i + 4);
+            return end < 0 ? n : end + 3;
+        }
+        if (startsRawElementBlock(s, i, "script")) {
+            return skipPastClosing(s, i, n, "</script", out);
+        }
+        if (startsRawElementBlock(s, i, "style")) {
+            return skipPastClosing(s, i, n, "</style", out);
+        }
+        // Generic tag.  Skip to matching `>`; if none, swallow rest of input
+        // (defensive — malformed `<` shouldn't dump uninterpreted bytes back).
+        int end = s.indexOf('>', i + 1);
+        return end < 0 ? n : end + 1;
+    }
+
+    /** Handle a {@code &}: numeric entity (decode), named entity (drop), or literal. */
+    private static int handleAmpersand(String s, int i, int n, StringBuilder out) {
+        // Look for ; within a small window — entity references are short.
+        int max = Math.min(n, i + 12);
+        int semi = -1;
+        for (int j = i + 1; j < max; j++) {
+            char c = s.charAt(j);
+            if (c == ';') {
+                semi = j;
+                break;
+            }
+            if (c == '<' || c == '&' || Character.isWhitespace(c)) {
+                break;  // not an entity
+            }
+        }
+        if (semi < 0) {
+            out.append('&');
+            return i + 1;
+        }
+        // Numeric entity?
+        if (semi >= i + 3 && s.charAt(i + 1) == '#') {
+            int cp = parseNumericEntity(s, i + 2, semi);
+            if (cp >= 0) {
+                appendCodePointSafe(out, cp);
+                return semi + 1;
+            }
+            // Unparseable numeric entity: treat as space (it's not literal text).
+            out.append(' ');
+            return semi + 1;
+        }
+        // Named entity? Drop to space (low-signal punctuation).
+        if (isNamedEntity(s, i + 1, semi)) {
+            out.append(' ');
+            return semi + 1;
+        }
+        // Otherwise treat as literal.
+        out.append('&');
+        return i + 1;
+    }
+
+    /**
+     * {@code true} if {@code s} starts with {@code <name} followed by a
+     * tag-name boundary character.  We require the boundary to actually be
+     * present (not just end-of-string) so the truncated input {@code "<script"}
+     * is treated as a malformed tag rather than a real script-block opener:
+     * no boundary, no contents to skip, and crucially no
+     * StringIndexOutOfBoundsException on the lookahead.
+     */
+    private static boolean startsRawElementBlock(String s, int i, String name) {
+        int after = i + 1 + name.length();
+        if (after >= s.length()) {
+            return false;
+        }
+        if (!startsWithIgnoreCase(s, i + 1, name)) {
+            return false;
+        }
+        char c = s.charAt(after);
+        return c == '>' || c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '/';
+    }
+
+    /**
+     * Skip past the closing tag for a raw-text element (script/style),
+     * returning the position immediately after {@code closing>}.  If no
+     * matching closer is found, swallows to end-of-input.
+     */
+    private static int skipPastClosing(String s, int i, int n, String closing, StringBuilder out) {
+        out.append(' ');  // preserve a word boundary in the output
+        int from = i + 1;
+        while (from < n) {
+            int p = indexOfIgnoreCase(s, closing, from);
+            if (p < 0) {
+                return n;
+            }
+            // Verify it's a tag boundary, then skip to the next `>`.
+            int after = p + closing.length();
+            if (after >= n) {
+                return n;
+            }
+            char c = s.charAt(after);
+            if (c == '>' || c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '/') {
+                int gt = s.indexOf('>', after);
+                return gt < 0 ? n : gt + 1;
+            }
+            from = p + 1;
+        }
+        return n;
+    }
+
+    /** Parse a numeric entity body ({@code 1234} or {@code xABCD}, i.e. the part after {@code &#}) starting at {@code from}. */
+    private static int parseNumericEntity(String s, int from, int semiExclusive) {
+        if (from >= semiExclusive) {
+            return -1;
+        }
+        int hex = (s.charAt(from) == 'x' || s.charAt(from) == 'X') ? 1 : 0;
+        int start = from + hex;
+        if (start >= semiExclusive || semiExclusive - start > 7) {
+            return -1;
+        }
+        int cp = 0;
+        for (int j = start; j < semiExclusive; j++) {
+            int d = Character.digit(s.charAt(j), hex == 1 ? 16 : 10);
+            if (d < 0) {
+                return -1;
+            }
+            cp = cp * (hex == 1 ? 16 : 10) + d;
+            if (cp > 0x10FFFF) {
+                return -1;
+            }
+        }
+        return cp;
+    }
+
+    /** Append a code point, replacing controls and surrogate halves with a space. */
+    private static void appendCodePointSafe(StringBuilder out, int cp) {
+        if (cp <= 0 || cp > 0x10FFFF
+                || Character.isISOControl(cp)
+                || (cp >= 0xD800 && cp <= 0xDFFF)) {
+            out.append(' ');
+            return;
+        }
+        out.appendCodePoint(cp);
+    }
+
+    /** {@code true} if the body of a {@code &…;} reference is a plausible named entity. */
+    private static boolean isNamedEntity(String s, int from, int semiExclusive) {
+        int len = semiExclusive - from;
+        if (len < 2 || len > 8) {
+            return false;
+        }
+        for (int j = from; j < semiExclusive; j++) {
+            char c = s.charAt(j);
+            if ((c < 'a' || c > 'z') && (c < 'A' || c > 'Z')) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    /**
+     * ASCII-only case-insensitive prefix match.  HTML element names are ASCII
+     * by spec, so we avoid {@link Character#toLowerCase} entirely — that
+     * method is Unicode-aware (which we don't need) and behaves differently
+     * in some locales for non-ASCII characters (the Turkish dotted-I being
+     * the canonical example).  An ASCII-only fold is faster, locale-
+     * independent, and exactly matches the HTML spec.
+     */
+    private static boolean startsWithIgnoreCase(String s, int i, String prefix) {
+        if (i + prefix.length() > s.length()) {
+            return false;
+        }
+        for (int j = 0; j < prefix.length(); j++) {
+            if (asciiLower(s.charAt(i + j)) != asciiLower(prefix.charAt(j))) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    private static char asciiLower(char c) {
+        return (c >= 'A' && c <= 'Z') ? (char) (c + 32) : c;
+    }
+
+    private static int indexOfIgnoreCase(String s, String needle, int from) {
+        int last = s.length() - needle.length();
+        for (int i = from; i <= last; i++) {
+            if (startsWithIgnoreCase(s, i, needle)) {
+                return i;
+            }
+        }
+        return -1;
+    }
+}
diff --git a/tika-encoding-detectors/tika-encoding-detector-charsoup/src/test/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetectorTest.java b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/test/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetectorTest.java
index f4f24307cf..ebeceade08 100644
--- a/tika-encoding-detectors/tika-encoding-detector-charsoup/src/test/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetectorTest.java
+++ b/tika-encoding-detectors/tika-encoding-detector-charsoup/src/test/java/org/apache/tika/langdetect/charsoup/CharSoupEncodingDetectorTest.java
@@ -151,17 +151,6 @@ public class CharSoupEncodingDetectorTest {
         }
     }
 
-    @Test
-    public void testStripTags() {
-        assertEquals("Hello world",
-                CharSoupEncodingDetector.stripTags(
-                        "<html><body>Hello world</body></html>"));
-        assertEquals("no tags here",
-                CharSoupEncodingDetector.stripTags("no tags here"));
-        assertEquals("",
-                CharSoupEncodingDetector.stripTags("<empty/>"));
-    }
-
     @Test
     public void testDecode() {
         byte[] utf8Bytes = "caf\u00e9".getBytes(UTF_8);
diff --git a/tika-encoding-detectors/tika-encoding-detector-html/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java b/tika-encoding-detectors/tika-encoding-detector-html/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
index fa055bda23..dfbc957d90 100644
--- a/tika-encoding-detectors/tika-encoding-detector-html/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
+++ b/tika-encoding-detectors/tika-encoding-detector-html/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
@@ -61,7 +61,15 @@ final class CharsetAliases {
         addCharset(charset("Big5"), "big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5");
         addCharset(charset("EUC-JP"), "cseucpkdfmtjapanese", "euc-jp", "x-euc-jp");
         addCharset(charset("EUC-KR"), "cseuckr", "csksc56011987", "euc-kr", "iso-ir-149", "korean",
-                "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601", "windows-949");
+                "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601");
+        // windows-949 / MS949 / CP949: the WHATWG encoding spec lists these as
+        // labels for EUC-KR, but MS949 is a *strict superset* of EUC-KR (Unified
+        // Hangul Code adds 8,822 syllables outside EUC-KR's Wansung range).
+        // Honoring the spec alias would be data-destructive on any file that
+        // genuinely uses MS949-extension bytes (lead 0x81-0xA0): EUC-KR's decoder
+        // emits U+FFFD where MS949 emits the correct Hangul syllable.  Resolve
+        // these labels to Java's x-windows-949 (MS949) for byte-correct decoding.
+        addCharset(charset("x-windows-949"), "windows-949", "ms949", "cp949");
         addCharset(charset("GBK"), "chinese", "csgb2312", "csiso58gb231280", "gb2312", "gb_2312",
                 "gb_2312-80", "gbk", "iso-ir-58", "x-gbk");
         addCharset(charset("IBM866"), "866", "cp866", "csibm866", "ibm866");
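
The alias change above rests on the claim that Java's EUC-KR decoder cannot
represent syllables that exist only in MS949. A minimal sketch of that
difference follows, assuming a JDK that ships the extended charsets
(`x-windows-949` and `EUC-KR`). The class name `Ms949Demo`, the helper
`roundTrip`, and the sample syllable U+B620 (a Hangul syllable that is not in
KS X 1001) are illustrative choices, not part of the commit:

```java
import java.nio.charset.Charset;

public class Ms949Demo {

    // Both lookups throw UnsupportedCharsetException on a runtime that
    // lacks the extended charsets (assumption: a standard full JDK).
    static final Charset MS949 = Charset.forName("x-windows-949");
    static final Charset EUC_KR = Charset.forName("EUC-KR");

    // Encode the text as MS949 bytes, then decode those bytes with the
    // given charset, as a detector would after resolving a label.
    public static String roundTrip(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(MS949);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String syllable = "\uB620"; // Hangul syllable outside KS X 1001

        // Decoding with MS949 recovers the original syllable.
        System.out.println(syllable.equals(roundTrip(syllable, MS949)));
        // Decoding with EUC-KR cannot, so the text comes back damaged.
        System.out.println(syllable.equals(roundTrip(syllable, EUC_KR)));
    }
}
```

Resolving the `windows-949`, `ms949`, and `cp949` labels to `x-windows-949`
corresponds to the first decode path in this sketch.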
