[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed TIKA-4298.
---------------------------------
> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.9.2
> Reporter: Mingchun Zhao
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from the attached zip file
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The entry names are
> encoded in Shift_JIS, but the detect() method in the PackageParser class
> did not detect that charset correctly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお
> かきくけこ
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ
> たちつてと
> </p></div>
> </body></html>
> {code}
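
For readers unfamiliar with the failure mode described above, here is a minimal sketch of how the charset used to decode zip entry names changes the result. It uses only java.util.zip from the JDK and is not Tika's PackageParser code or the attached patch. The class and method names (ZipNameCharsetDemo, zipWithName, entryNames) are hypothetical, and it assumes the JDK provides the Shift_JIS charset, as full JDK distributions normally do. It writes an in-memory zip whose entry name is encoded in Shift_JIS, then reads the name back with Shift_JIS and with ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, for illustration only.
public class ZipNameCharsetDemo {

    // Build an in-memory zip with one entry whose name is encoded with `charset`.
    static byte[] zipWithName(String name, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, charset)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write("dummy".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Read all entry names, decoding the raw name bytes with `charset`.
    static List<String> entryNames(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source independent of the compiler's file encoding.
        String name = "\u6587\u7AE0" + "1.txt";
        Charset shiftJis = Charset.forName("Shift_JIS");
        byte[] zip = zipWithName(name, shiftJis);

        // Decoding with the charset the name was written in recovers it.
        System.out.println(entryNames(zip, shiftJis));
        // Decoding with a wrong single-byte charset yields a different, garbled name.
        System.out.println(entryNames(zip, StandardCharsets.ISO_8859_1));
    }
}
```

The name in this sketch has only four non-ASCII bytes, which hints at why short entry names give a statistical charset detector very little evidence to work with.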