Copilot commented on code in PR #2882:
URL: https://github.com/apache/tika/pull/2882#discussion_r3369783481


##########
docs/modules/ROOT/pages/configuration/encoding-detectors.adoc:
##########
@@ -133,74 +146,84 @@ auto-registered detectors:
 
 === Specify the chain explicitly
 
-To replace the SPI-discovered chain with an explicit ordered list:
+To replace the SPI-discovered chain with an explicit ordered list.  Include
+`junk-filter-encoding-detector` (last) to keep collect-all arbitration; omit it
+for first-match-wins:
 
 [source,json]
 ----
 {
   "encoding-detectors": [
     {"html-encoding-detector": {}},
-    {"universal-encoding-detector": {}}
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
   ]
 }
 ----
 
 === Configure the HTML detector's read limit
 
-`html-encoding-detector` reads up to 65 536 bytes by default when scanning
-for the `<meta charset>` tag.  Raise it if your documents embed large
-`<script>` blocks before the meta tag (TIKA-2485):
+`html-encoding-detector` reads up to 65 536 bytes by default when scanning for
+the `<meta charset>` tag.  Raise it if your documents embed large `<script>`
+blocks before the meta tag (TIKA-2485).  (`mojibuster-encoding-detector` reads a
+larger content probe, so in the default chain this limit matters mainly for very
+large preambles.)
 
 [source,json]
 ----
 {
   "encoding-detectors": [
     {
-      "html-encoding-detector": {
-        "markLimit": 131072
+      "default-encoding-detector": {
+        "exclude": ["html-encoding-detector"]
       }
     },
-    {"universal-encoding-detector": {}},
-    {"icu4j-encoding-detector": {}}
+    {"html-encoding-detector": {"markLimit": 131072}}
   ]
 }

Review Comment:
   The JSON example for raising `html-encoding-detector`'s `markLimit` won't 
actually apply the new `markLimit`. With the current ordering, the top-level 
composite runs `default-encoding-detector` first and stops at the first 
non-empty result, so the configured `html-encoding-detector` entry is never 
invoked. To change `markLimit` while keeping the 4.x default chain, the example 
should specify the full chain explicitly (or otherwise ensure the configured 
HTML detector is part of the active composite that includes the meta arbiter).
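
A minimal sketch of the explicit-chain form, assuming the chain to keep is the three-detector list shown in the "Specify the chain explicitly" example earlier in this diff. If the actual 4.x default chain contains other detectors, they would need to be listed as well, in their default order:

```json
{
  "encoding-detectors": [
    {"html-encoding-detector": {"markLimit": 131072}},
    {"mojibuster-encoding-detector": {}},
    {"junk-filter-encoding-detector": {}}
  ]
}
```

Here the configured `html-encoding-detector` is a direct member of the active chain, so its `markLimit` is the one used, and `junk-filter-encoding-detector` stays last to keep collect-all arbitration as the doc text describes.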



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
