Hi devs,

I ran into an issue where a test file containing UTF-8 text was being displayed in Eclipse as US-ASCII.
I had thought that Tika would use UTF-8 everywhere for file encodings, but…

Currently the tika-parent/pom.xml has:

  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <project.reporting.outputEncoding>${project.build.sourceEncoding}</project.reporting.outputEncoding>
    <commons.compress.version>1.10</commons.compress.version>
    <commons.io.version>2.4</commons.io.version>
    <slf4j.version>1.7.12</slf4j.version>
    <pax.exam.version>4.4.0</pax.exam.version>
  </properties>

Note that project.reporting.outputEncoding is set to project.build.sourceEncoding, but that's not specified anywhere.

Is there a reason for this? I can go ahead and switch it to be explicitly UTF-8, in the 2.x branch.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
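
P.S. A minimal sketch of the change I'm proposing, assuming we follow the usual Maven convention of defining project.build.sourceEncoding in the same <properties> block. The placement is my assumption; all other properties would stay as they are.

  <properties>
    <!-- Proposed addition: define the source encoding explicitly -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- Existing line, which would then resolve to UTF-8 -->
    <project.reporting.outputEncoding>${project.build.sourceEncoding}</project.reporting.outputEncoding>
    <!-- remaining existing properties unchanged -->
  </properties>

Without an explicit value, Maven falls back to the platform default encoding, so the build can behave differently from one machine to the next.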