Hi,

I have some simple code that calls Tika.
With Tika 1.0/1.1 it prints "€100", but with Tika 1.2 it prints nothing:

      InputStream input = new ByteArrayInputStream("€100".getBytes("UTF-8"));
      Parser parser = new AutoDetectParser();
      ContentHandler contentHandler = new BodyContentHandler();
      ParseContext parseContext = new ParseContext();
      Metadata metadata = new Metadata();
      metadata.add("Content-Type", "application/octet-stream");
      parser.parse(input, contentHandler, metadata, parseContext);
      System.out.println(contentHandler.toString());

On Tika 1.2, if I replace the Content-Type hint with the following, I do get "€100":

      metadata.add("Content-Type", "text/plain");

This happens not only when the input contains the euro symbol
but whenever it contains any non-ASCII Unicode character.
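
To make the non-ASCII case concrete, here is a minimal plain-Java sketch that shows the bytes the parser receives for "€100" in UTF-8. It does not call Tika and does not reproduce its detection logic; the class and method names (NonAsciiCheck, hasNonAsciiBytes, toHex) are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Plain-Java helper, not part of Tika. All names here are hypothetical.
class NonAsciiCheck {

    // Returns true if any byte of the UTF-8 encoding is outside 7-bit ASCII.
    static boolean hasNonAsciiBytes(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    // Formats the UTF-8 bytes as space-separated uppercase hex pairs.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u20AC is the euro symbol, written as an escape to avoid
        // depending on the source file encoding.
        System.out.println(toHex("\u20AC100"));            // E2 82 AC 31 30 30
        System.out.println(hasNonAsciiBytes("\u20AC100")); // true
        System.out.println(hasNonAsciiBytes("100"));       // false
    }
}
```

The euro symbol becomes the three bytes E2 82 AC, so the input is no longer pure ASCII. A check like this may help narrow a failing input down to the characters that trigger the problem.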

I expected Tika 1.2 to behave the same as Tika 1.0/1.1,
which extract the text with the euro symbol even when Content-Type is
application/octet-stream.

My questions are: is this behavior of Tika 1.2 correct,
why did the behavior change, and do I always need to supply a correct
Content-Type hint?

Thanks in advance,
Shinichiro Abe


