Shinichiro Abe created TIKA-936:
-----------------------------------

             Summary: encoding of ZipArchiveInputStream
                 Key: TIKA-936
                 URL: https://issues.apache.org/jira/browse/TIKA-936
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 1.1
            Reporter: Shinichiro Abe


When extracting zip files that were created on a Japanese Windows OS,
the file names of the extracted entries are garbled.

ZipArchiveInputStream has three constructors.
With the modification below, the file names were no longer garbled.
I specified the encoding explicitly as SJIS.

{code:title=PackageExtractor|borderStyle=solid}
public void parse(InputStream stream)
 :
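 // Original call: single-argument constructor, no encoding given.
 //unpack(new ZipArchiveInputStream(stream), xhtml);
 // Modified call: entry-name encoding passed explicitly.
 unpack(new ZipArchiveInputStream(stream, "SJIS", true), xhtml);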
 :
{code}

The first constructor uses the platform's default encoding.
In my case the default encoding of my computer is UTF-8 and the file names
in the zip file are encoded in SJIS, so the file names were garbled.
File names will be garbled whenever the platform encoding differs from
the encoding used in the zip file.
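
The mismatch can be reproduced without any zip file. The following is a
minimal sketch using only the JDK; the class name, the helper decodeName and
the sample file name are hypothetical and are not part of Tika or
Commons Compress. It assumes the JDK provides the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntryNameEncodingDemo {

    // Hypothetical helper: decode raw entry-name bytes with a given charset,
    // roughly what a zip reader does with the name field of an entry.
    static String decodeName(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");

        // Sample Japanese file name ("nihongo.txt" written in kanji).
        String original = "\u65e5\u672c\u8a9e.txt";

        // Bytes as a Japanese Windows zip tool would store them.
        byte[] raw = original.getBytes(sjis);

        // Decoding with the matching charset restores the name.
        System.out.println(decodeName(raw, sjis).equals(original));

        // Decoding the same bytes as UTF-8 does not restore the name.
        System.out.println(decodeName(raw, StandardCharsets.UTF_8).equals(original));
    }
}
```

The first comparison succeeds and the second fails, which matches the
behaviour described above: the bytes are intact, only the charset used to
decode them is wrong.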

I would like Tika to accept an encoding parameter per file when parsing
zip files. Where should the encoding be given: somewhere in Metadata
or in ParseContext? Please support this.
I am using Tika via Solr (SolrCell), so when posting a zip file to Solr
I want to add an encoding parameter to the request.
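
To illustrate the kind of per-file parameter I mean, here is a rough sketch.
It is only an assumption of how this could look: it uses the JDK's
java.util.zip.ZipInputStream instead of Commons Compress'
ZipArchiveInputStream, and the class and method names (ZipNameCharsetSketch,
listEntryNames, buildZip) are hypothetical, not existing Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameCharsetSketch {

    // Hypothetical helper: the caller chooses the entry-name charset per file
    // instead of relying on a default.
    static List<String> listEntryNames(InputStream stream, Charset nameCharset)
            throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(stream, nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a small in-memory zip whose entry name is stored in nameCharset,
    // standing in for a zip file created on a Japanese Windows machine.
    static byte[] buildZip(String entryName, Charset nameCharset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Charset sjis = Charset.forName("Shift_JIS");
        String original = "\u65e5\u672c\u8a9e.txt";
        byte[] zipBytes = buildZip(original, sjis);

        // The encoding travels with the request for this particular file.
        List<String> names = listEntryNames(new ByteArrayInputStream(zipBytes), sjis);
        System.out.println(names.get(0).equals(original));
    }
}
```

In Tika the charset would have to come from wherever the caller can set it
per document, which is why I am asking about Metadata or ParseContext.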

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
