Re: [PR] TIKA-4752-follow-up [tika]

via GitHub Fri, 05 Jun 2026 07:58:09 -0700


Copilot commented on code in PR #2871:
URL: https://github.com/apache/tika/pull/2871#discussion_r3363492001



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -549,22 +550,37 @@ private void parseStreamEntry(ZipArchiveInputStream zis, ZipArchiveEntry entry,
         }
     }
 
-    private String detectEntryName(ZipArchiveEntry entry, Metadata parentMetadata,
-                                    ParseContext context, ZipParserConfig config) throws IOException {
+    private String detectEntryName(ZipArchiveEntry entry, ParseContext context,
+                                    ZipParserConfig config) throws IOException {
         // If user specified an encoding, decode raw bytes with that charset
         // This avoids needing to reopen the ZipFile with a different charset
         if (config.getEntryEncoding() != null) {
             return new String(entry.getRawName(), config.getEntryEncoding());
         }
 
+        // A zip only ever declares a name as UTF-8 (it can't name a legacy charset),
+        // two ways. The Unicode extra field carries a CRC-validated UTF-8 name -- that
+        // CRC check is the evaluation, so trust commons-compress's getName().
+        if (entry.getNameSource() == ZipArchiveEntry.NameSource.UNICODE_EXTRA_FIELD) {
+            return entry.getName();
+        }
+
         // If charset detection is enabled, try to detect and decode.
         // Mojibuster handles short inputs natively (zip filenames are often
         // 9-30 bytes); no byte-extension trick needed.
         if (config.isDetectCharsetsInEntryNames()) {
             byte[] entryName = entry.getRawName();
+            // The EFS flag (general purpose bit 11) also declares UTF-8, but is
+            // unvalidated. Record it as a content-type hint for the detector to
+            // evaluate against the bytes, not trust outright.
+            Metadata nameMetadata = new Metadata();
+            if (entry.getNameSource() == ZipArchiveEntry.NameSource.NAME_WITH_EFS_FLAG) {
+                nameMetadata.set(TikaCoreProperties.CONTENT_TYPE_HINT,
+                        new MediaType(MediaType.TEXT_PLAIN, StandardCharsets.UTF_8).toString());
+            }
             try (TikaInputStream detectStream = TikaInputStream.get(entryName)) {
                 List<EncodingResult> encResults =
-                        getEncodingDetector().detect(detectStream, parentMetadata, context);
+                        getEncodingDetector(context).detect(detectStream, nameMetadata, context);

Review Comment:
   `CONTENT_TYPE_HINT` is set on `nameMetadata`, but the encoding detector you 
call (`getEncodingDetector(context)`) is typically the default chain 
(Html/Universal/Icu4j) which does not consult 
`TikaCoreProperties.CONTENT_TYPE_HINT`. As a result, the EFS flag hint won’t 
actually influence detection and a configured detector (e.g., Universal/Icu4j) 
can still mis-detect and garble UTF-8 names when `detectCharsetsInEntryNames` 
is enabled (the regression this change is trying to avoid). Consider explicitly 
running `MetadataCharsetDetector` ahead of the configured detector (or 
otherwise ensuring the detector chain includes a component that reads 
`CONTENT_TYPE_HINT`).
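
   A minimal, JDK-only sketch of the "consult the hint before the configured
   detector" idea suggested above. It does not use Tika's API: the class
   `HintFirstNameDecoder`, its methods, and the `fallbackDetector` function are
   hypothetical stand-ins for the real detector chain. It also assumes one
   possible policy, namely that the hinted charset is accepted only if the raw
   name bytes decode strictly under it; otherwise the fallback detector decides.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of Tika or commons-compress.
class HintFirstNameDecoder {

    /** Decodes strictly: returns empty if the bytes are malformed in the charset. */
    static Optional<String> strictDecode(byte[] raw, Charset charset) {
        try {
            CharBuffer chars = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return Optional.of(chars.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    /**
     * Tries the hinted charset first (for example UTF-8 when the EFS flag is set).
     * If there is no hint, or the bytes contradict it, asks the fallback detector,
     * which stands in for the configured encoding detector.
     */
    static String decodeName(byte[] raw, Charset hint,
                             Function<byte[], Charset> fallbackDetector) {
        if (hint != null) {
            Optional<String> hinted = strictDecode(raw, hint);
            if (hinted.isPresent()) {
                return hinted.get();
            }
        }
        return new String(raw, fallbackDetector.apply(raw));
    }

    public static void main(String[] args) {
        // Stand-in detector that always answers ISO-8859-1 (an assumption for the demo).
        Function<byte[], Charset> fallback = bytes -> StandardCharsets.ISO_8859_1;

        byte[] utf8Name = "r\u00e9sum\u00e9.txt".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Name = "\u00e9.txt".getBytes(StandardCharsets.ISO_8859_1);

        // Hint agrees with the bytes: the fallback is never consulted.
        System.out.println(decodeName(utf8Name, StandardCharsets.UTF_8, fallback));
        // Hint contradicts the bytes: the fallback decides.
        System.out.println(decodeName(latin1Name, StandardCharsets.UTF_8, fallback));
    }
}
```

   In the real change, the equivalent step would be whichever component reads
   `CONTENT_TYPE_HINT`; the point of the sketch is only the ordering, with the
   hint evaluated before a statistical detector gets a chance to override it.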



##########
tika-core/src/main/java/org/apache/tika/detect/MetadataCharsetDetector.java:
##########
@@ -33,10 +34,13 @@
  * reading any bytes from the stream.  Returns a single
  * {@link EncodingResult.ResultType#DECLARATIVE} result when a charset is found.
  *
- * <p>Two metadata keys are consulted in order:
+ * <p>Three metadata keys are consulted in order:
  * <ol>
  *   <li>{@link Metadata#CONTENT_TYPE} — the {@code charset} parameter of the
  *       HTTP/MIME Content-Type header (e.g. {@code text/html; charset=UTF-8}).</li>
+ *   <li>{@link TikaCoreProperties#CONTENT_TYPE_HINT} — the {@code charset} parameter
+ *       of a content-type a source <em>claimed</em> for the bytes (e.g. an HTML
+ *       {@code <meta>} tag, or a zip entry's UTF-8 (EFS) flag). A hint, not a verdict.</li>

Review Comment:
   The Javadoc describes `CONTENT_TYPE_HINT` as a “hint, not a verdict”, but it 
doesn’t mention that this metadata key is only consulted when 
`MetadataCharsetDetector` is actually part of the active `EncodingDetector` 
chain (it is not in the default detector list). Adding a brief note here would 
prevent readers from assuming `CONTENT_TYPE_HINT` is always honored 
automatically.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

