Hi,
I have some simple code that calls Tika.
Tika 1.0 and 1.1 return "€100", but Tika 1.2 returns nothing for the same input:
InputStream input = new ByteArrayInputStream("€100".getBytes("UTF-8"));
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.add("Content-Type", "application/octet-stream");
parser.parse(input, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
On Tika 1.2, if I replace the metadata line with the following, I do get "€100":
metadata.add("Content-Type", "text/plain");
This happens not only with the euro symbol but with any text that contains
a non-ASCII Unicode character.
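For reference, here is a small sketch of what the parser actually receives
as input in the failing case. It uses only the JDK, not Tika; the class and
method names (EuroBytes, utf8Hex, isPureAscii) are hypothetical helpers I made
up for illustration. It only shows that "€100" encoded as UTF-8 contains
bytes above 0x7F, and makes no claim about how Tika's detection works
internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Tika.
final class EuroBytes {

    // Returns the UTF-8 encoding of s as space-separated uppercase hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true if every byte of the UTF-8 encoding of s is below 0x80.
    static boolean isPureAscii(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "\u20AC" is the euro sign, written as an escape to avoid
        // depending on the source file encoding.
        System.out.println(utf8Hex("\u20AC100"));     // E2 82 AC 31 30 30
        System.out.println(isPureAscii("\u20AC100")); // false
        System.out.println(isPureAscii("100"));       // true
    }
}
```

The euro sign (U+20AC) takes three bytes in UTF-8, so the six-byte stream
starts with E2 82 AC, followed by the ASCII digits.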
I expected Tika 1.2 to behave the same as Tika 1.0/1.1, which extract the
text containing the euro symbol even when Content-Type is
application/octet-stream.
My questions are whether this behavior of Tika 1.2 is correct,
why the behavior changed in Tika 1.2, and whether I always need to supply
a correct Content-Type hint.
Thanks in advance,
Shinichiro Abe