[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873948#comment-17873948
 ] 

Mingchun Zhao commented on TIKA-4298:
-------------------------------------

Hi [~tilman], just in case, I'd like to remove the PNG file from the ZIP file 
and resubmit the PR soon. The PNG file is not essential to testing this issue, 
so I'll recreate the ZIP with just the text files.
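
For anyone who wants to build a similar text-only fixture locally, here is a 
minimal sketch using only java.util.zip. It is not necessarily how the attached 
file was produced: the class name ShiftJisZipFixture, its helper methods, and 
the placeholder file content are all assumptions made for illustration. It 
relies on the JDK shipping the Shift_JIS charset, which standard builds do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika or of the attached patch.
class ShiftJisZipFixture {

    // Assumption: the running JDK provides the Shift_JIS charset.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    /** Builds an in-memory ZIP whose entry names are stored as Shift_JIS bytes. */
    public static byte[] build(String... names) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, SJIS)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                // Placeholder content; the real fixture holds Japanese text.
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    /** Reads the entry names back, decoding them explicitly as Shift_JIS. */
    public static List<String> listNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), SJIS)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source independent of the file encoding.
        byte[] zipBytes = build("\u6587\u7AE01.txt", "\u6587\u7AE02.txt");
        System.out.println(listNames(zipBytes));
    }
}
```

Passing the charset to both streams is what makes the round trip work. A reader 
that is not told the charset has to guess it from a handful of name bytes, 
which is the situation this issue describes.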

> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
>                 Key: TIKA-4298
>                 URL: https://issues.apache.org/jira/browse/TIKA-4298
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>            Reporter: Mingchun Zhao
>            Priority: Major
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from the zip file 
> [^testZipEntryNameCharsetShiftSJIS.zip] are garbled. The file names are 
> encoded in Shift_JIS, but the detect() method within the PackageParser class 
> was not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお&#13;
> かきくけこ&#13;
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ&#13;
> たちつてと&#13;
> </p></div>
> </body></html>% {code}
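
The garbling reported above can be reproduced in isolation. The following is a 
minimal sketch, not Tika's code: it assumes the entry name is stored as 
Shift_JIS bytes and that a decoder picks a different charset (UTF-8 is used 
here purely for illustration). The class EntryNameDecodingDemo and its methods 
are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class EntryNameDecodingDemo {

    /** Returns the bytes a Shift_JIS-based archiver would store for the name. */
    public static byte[] rawNameBytes(String name) {
        // Assumption: the running JDK provides the Shift_JIS charset.
        return name.getBytes(Charset.forName("Shift_JIS"));
    }

    /** Decodes raw entry-name bytes with the given charset. */
    public static String decode(byte[] raw, Charset charset) {
        // Malformed input is replaced with U+FFFD rather than raising an error.
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Same name as in the report, written with Unicode escapes.
        String original = "\u6587\u7AE01.txt";
        byte[] raw = rawNameBytes(original);

        String wrong = decode(raw, StandardCharsets.UTF_8);
        String right = decode(raw, Charset.forName("Shift_JIS"));

        System.out.println("decoded as UTF-8:     " + wrong);
        System.out.println("decoded as Shift_JIS: " + right);
        System.out.println("round trip intact:    " + right.equals(original));
    }
}
```

Only four of the nine name bytes are non-ASCII, so a statistical detector has 
very little signal to work with, which matches the "short non-Unicode file 
name" wording in the issue title.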



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
