Copilot commented on code in PR #2882:
URL: https://github.com/apache/tika/pull/2882#discussion_r3383112380
##########
docs/modules/ROOT/pages/configuration/encoding-detectors.adoc:
##########
@@ -18,44 +18,61 @@
= Configuring Encoding Detectors
Tika uses a chain of _encoding detectors_ to determine the character encoding
-of plain text and HTML content. `DefaultEncodingDetector` loads detectors via
-the Java service-provider interface (SPI) and runs them in registration order;
-the first non-null result wins.
+of plain text and HTML content. `DefaultEncodingDetector` discovers detectors
+via the Java service-provider interface (SPI, `META-INF/services`).
-The default chain is `html-encoding-detector`, `universal-encoding-detector`,
-and `icu4j-encoding-detector`.
+The chain runs in one of two modes:
+
+* *collect-all* — when a `MetaEncodingDetector` is present (the 4.x default
+ includes one), every base detector runs and contributes candidate encodings,
+ then the meta detector picks the best one by decode quality. Registration
+ order does not matter.
+* *first-match-wins* — otherwise, detectors run in registration order and the
+ first non-null result is used.
Review Comment:
In collect-all mode, registration order is not always irrelevant: if the
MetaEncodingDetector abstains (returns an empty result),
CompositeEncodingDetector falls back to the first base detector that emitted a
candidate, so ordering still affects results. The docs should mention this
fallback to avoid misleading users configuring explicit chains or running
without the junk-filter model.
##########
docs/modules/ROOT/pages/configuration/encoding-detectors.adoc:
##########
@@ -124,83 +137,102 @@ auto-registered detectors:
"encoding-detectors": [
{
"default-encoding-detector": {
- "exclude": ["icu4j-encoding-detector"]
+ "exclude": ["html-encoding-detector"]
}
}
]
}
----
+NOTE: Do not combine `default-encoding-detector` with other explicit detector
+entries in the same list. When combined, the loader wraps everything in an
+outer composite that has no `MetaEncodingDetector` at its top level, so
+collect-all arbitration is silently lost and the explicit detectors are never
+reached. Use an explicit chain (see below) when you need to configure
+individual detectors.
+
=== Specify the chain explicitly
-To replace the SPI-discovered chain with an explicit ordered list:
+To replace the SPI-discovered chain with an explicit ordered list. Include
+`junk-filter-encoding-detector` (last) to keep collect-all arbitration; omit it
+for first-match-wins:
[source,json]
----
{
"encoding-detectors": [
{"html-encoding-detector": {}},
- {"universal-encoding-detector": {}}
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
]
}
----
=== Configure the HTML detector's read limit
-`html-encoding-detector` reads up to 65 536 bytes by default when scanning
-for the `<meta charset>` tag. Raise it if your documents embed large
-`<script>` blocks before the meta tag (TIKA-2485):
+`html-encoding-detector` reads up to 65 536 bytes by default when scanning for
+the `<meta charset>` tag. Raise it if your documents embed large `<script>`
+blocks before the meta tag (TIKA-2485). (`mojibuster-encoding-detector` reads
a
Review Comment:
The documented default read limit for `html-encoding-detector` is incorrect.
`HtmlEncodingDetector` defaults to 8192 bytes (DEFAULT_MARK_LIMIT), not 65536;
65536 is the default for `standard-html-encoding-detector`. This matters for
users deciding whether they need to raise `markLimit`.
##########
tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java:
##########
@@ -17,74 +17,41 @@
package org.apache.tika.detect;
import java.util.Collection;
-import java.util.Comparator;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
import javax.imageio.spi.ServiceRegistry;
import org.apache.tika.config.ServiceLoader;
/**
- * A composite encoding detector based on all the {@link EncodingDetector}
+ * A composite encoding detector over all {@link EncodingDetector}
* implementations available through the
* {@link ServiceRegistry service provider mechanism}.
*
- * <p>The default chain (Tika 3.x style) runs three detectors in order, with
- * the first non-empty result winning:
- * <ol>
- * <li>{@code org.apache.tika.parser.html.HtmlEncodingDetector}</li>
- * <li>{@code org.apache.tika.parser.txt.UniversalEncodingDetector}</li>
- * <li>{@code org.apache.tika.parser.txt.Icu4jEncodingDetector}</li>
- * </ol>
- * Any other {@link EncodingDetector} discovered via SPI (e.g.,
- * user-supplied detectors) runs after the three blessed detectors,
- * preserving back-compat for callers who add their own.</p>
+ * <p>The 4.x default chain (via {@code META-INF/services}): BOM and
+ * metadata-charset detectors (tika-core), the HTML {@code <meta>} detector,
+ * MojibusterEncodingDetector, and JunkFilterEncodingDetector — a
+ * {@link MetaEncodingDetector} that arbitrates the candidates by decode
quality
+ * and always runs last, so detector order is irrelevant.</p>
Review Comment:
This Javadoc states that detector order is irrelevant, but
CompositeEncodingDetector only guarantees that the MetaEncodingDetector runs
after the base detectors. If the meta detector abstains (e.g., no junk model
available or identical decodes), CompositeEncodingDetector falls back to the
first base detector's candidate, making SPI/registration order observable.
Please soften the claim to reflect the actual fallback behavior.
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/ZipFilenameDetectionTest.java:
##########
@@ -59,7 +58,6 @@ private static byte[] hexToBytes(String hex) {
* sequentially on two entries differing only in byte 5 (0x31 vs 0x32),
simulating
Review Comment:
The "Full pipeline" comment hard-codes a specific detector sequence ("BOM →
Metadata → Mojibuster → HtmlEncodingDetector"), which is out of date for the
new default chain and can also vary with SPI discovery. Either update it to
match the documented 4.x default chain
(BOM/Metadata/HTML/Mojibuster/JunkFilter) or avoid specifying an exact order
here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]