Hi devs, I’m trying to remember how Tika’s mime-type detection has evolved with respect to handling plain text files.
Currently, if I run a Shift-JIS encoded file (with the suffix “.env”) through Tika, it gets returned as application/octet-stream. I thought that previously we had something that would check whether the only bytes in the 0x00-0x1F range were tab/LF/CR (so no other control chars besides these), and whether there was a reasonable number of line ending chars, and if so we’d return text/plain instead of application/octet-stream.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
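P.S. To make the heuristic I'm describing concrete, here is a rough sketch of the kind of check I have in mind. It is not Tika's actual code: the class name `PlainTextHeuristic`, the method `looksLikePlainText`, and the line-ending threshold are all made up for illustration.

```java
public class PlainTextHeuristic {

    // Hypothetical threshold: expect at least one CR or LF byte per this many bytes.
    private static final int MAX_BYTES_PER_LINE_ENDING = 1024;

    /**
     * Returns true if the given bytes (e.g. the first few KB of a file) contain no
     * control bytes in 0x00-0x1F other than tab/LF/CR, and have a reasonable
     * density of line-ending bytes.
     */
    public static boolean looksLikePlainText(byte[] prefix) {
        if (prefix.length == 0) {
            return false;
        }
        int lineEndings = 0;
        for (byte b : prefix) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x20) {
                if (v == '\n' || v == '\r') {
                    lineEndings++;
                } else if (v != '\t') {
                    return false; // some other control char, so assume binary
                }
            }
        }
        // "Reasonable number" of line endings, relative to the amount of data seen.
        return (long) lineEndings * MAX_BYTES_PER_LINE_ENDING >= prefix.length;
    }

    public static void main(String[] args) {
        // Shift-JIS bytes for the Japanese word "nihongo", followed by CRLF.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B,
                       (byte) 0x8C, (byte) 0xEA, '\r', '\n'};
        // Contains NUL and SOH, which are not allowed control bytes.
        byte[] binary = {0x00, 0x01, 'a', '\n'};

        System.out.println(looksLikePlainText(sjis));
        System.out.println(looksLikePlainText(binary));
    }
}
```

With a check like this, a Shift-JIS file would pass because its lead and trail bytes all fall at or above 0x40, so the result would be text/plain regardless of the ".env" suffix.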