On Mon, Sep 3, 2012 at 10:48 AM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Sun, Sep 2, 2012 at 2:01 PM, Benson Margulies <[email protected]> wrote:
>> It has been working fine on many inputs, but I get no text in the
>> content handler when I feed it a file in the Shift-JIS encoding.
>
> The text detector in Tika doesn't have a reliable way to detect
> Shift-JIS, which is why you're seeing the default
> application/octet-stream type. AFAIK there is no good way to reliably
> detect Shift-JIS by looking only at the incoming byte stream.
>
> If you already know that you're dealing with text, you can give Tika a
> media type hint of "text/plain" or even
> "text/plain; charset=Shift_JIS" as input metadata along with the
> document to be parsed. That should help Tika determine how to parse
> the document.

thanks, that did it.

Apropos of nothing, I'd offer some patches to the front page and maybe
even some more doc, but I'm a little confused about how you are using
the site plugin, particularly for the front page. For example, no link
points to the SCM page.

>
> For example, using the Shift-JIS file from
> https://issues.alfresco.com/jira/browse/ALF-15233 we get the
> following:
>
> $ java -jar tika-app.jar --detect < shiftjs.txt # look only at the byte stream
> application/octet-stream
>
> $ java -jar tika-app.jar --detect shiftjs.txt # Give the file name with .txt ending as a type hint
> text/plain
>
> $ java -jar tika-app.jar --text shiftjs.txt # Check that the encoding is correctly detected
> 電子商取引(エレクトロニックコマース)、オンライン [...]
>
> Yes!
>
> BR,
>
> Jukka Zitting
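
To illustrate why a charset hint matters here, the sketch below decodes Shift_JIS bytes once with a hint of the form quoted above and once without. It uses only the JDK and is not Tika's code: `CharsetHintDemo`, `charsetFromHint`, and `decode` are hypothetical names, and the UTF-8 fallback is an assumption made for the example, not a statement about Tika's default. It also assumes a JDK that ships the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; this is not part of Tika.
final class CharsetHintDemo {

    // Extract the charset parameter from a hint such as
    // "text/plain; charset=Shift_JIS"; return the fallback if none is given.
    static Charset charsetFromHint(String mediaTypeHint, Charset fallback) {
        for (String part : mediaTypeHint.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return Charset.forName(p.substring("charset=".length()).trim());
            }
        }
        return fallback;
    }

    // Decode bytes using the hinted charset (assumed fallback: UTF-8).
    static String decode(byte[] bytes, String mediaTypeHint) {
        return new String(bytes, charsetFromHint(mediaTypeHint, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // First five characters of the sample output in the thread,
        // written as Unicode escapes so the source stays ASCII.
        String original = "\u96fb\u5b50\u5546\u53d6\u5f15";
        byte[] sjis = original.getBytes(Charset.forName("Shift_JIS"));

        // With the charset hint, the bytes decode back to the original text.
        System.out.println(decode(sjis, "text/plain; charset=Shift_JIS").equals(original));

        // Without it, the assumed UTF-8 fallback does not recover the text.
        System.out.println(decode(sjis, "text/plain").equals(original));
    }
}
```

In Tika itself the hint goes into the input metadata, as Jukka describes; the helper here only mimics the idea of taking the charset from the caller instead of guessing it from the bytes.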
