[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4298:
------------------------------
    Fix Version/s: 3.0.1 (was: 3.0.0)

> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
>                 Key: TIKA-4298
>                 URL: https://issues.apache.org/jira/browse/TIKA-4298
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>            Reporter: Mingchun Zhao
>            Priority: Major
>             Fix For: 2.9.3, 3.0.1
>
>         Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file
> name is Shift_JIS, but the detect() method within the PackageParser class was
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお
> かきくけこ
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ
> たちつてと
> </p></div>
> </body></html>% {code}
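
The garbling described in the report can be reproduced outside Tika with a minimal sketch like the one below. It is not the PackageParser detection logic and not the attached patch: the class `ZipEntryNameDecodingSketch` and the helper `decodeName` are hypothetical names used only for illustration, and UTF-8 is picked as an example of a wrong charset (the report does not state which charset was actually applied to the entry names). It also assumes the running JDK ships the `Shift_JIS` charset, which full JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ZipEntryNameDecodingSketch {

    // Hypothetical helper: decode the raw bytes of an entry name
    // with an explicitly chosen charset.
    static String decodeName(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // Second file name from the report, written with Unicode escapes
        // so the source file does not depend on the compiler's encoding.
        String original = "\u6587\u7AE01.txt";

        // Assumption: the archive stored the name as Shift_JIS bytes.
        Charset shiftJis = Charset.forName("Shift_JIS");
        byte[] rawName = original.getBytes(shiftJis);

        // Decoding with the wrong charset loses the Japanese characters.
        String wrong = decodeName(rawName, StandardCharsets.UTF_8);
        // Decoding with the charset that was used for writing restores them.
        String right = decodeName(rawName, shiftJis);

        System.out.println("round-trips as Shift_JIS: " + right.equals(original));
        System.out.println("round-trips as UTF-8:     " + wrong.equals(original));
        System.out.println("contains U+FFFD:          " + (wrong.indexOf('\uFFFD') >= 0));
    }
}
```

The ASCII part of the name (`1.txt`) survives either way, which matches the output in the report: only the Japanese characters are replaced. With so few non-ASCII bytes in a short name, a statistical detector has very little evidence to work with, which is consistent with the issue title.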