Gregory Lepore created TIKA-4053:
------------------------------------
Summary: Improve detection of text files
Key: TIKA-4053
URL: https://issues.apache.org/jira/browse/TIKA-4053
Project: Tika
Issue Type: Sub-task
Reporter: Gregory Lepore
Attachments: 1990-01.etc, 2008-09.3, 20220708 YouTube1-1.kif,
bub0336d.007, pacman.nas, shab3_36.qbp, wots.diz
Common Crawl data shows many text files that are being detected as
application/octet-stream. Some of these misdetections appear to occur because
the text is in a language other than English.
Various sample files attached.
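
To illustrate how non-English text could end up classified as binary, here is a minimal sketch of two byte-level heuristics. This is not Tika's actual detection logic; the class name TextSniffer and both method names are hypothetical, and the sample assumes UTF-8 input, which may not match the encodings of the attached files. An ASCII-only check rejects any file with high bytes, while a strict UTF-8 decode accepts the same bytes as text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
public class TextSniffer {

    // Naive heuristic: every byte must be printable ASCII or tab/LF/CR.
    // Any non-English text encoded with bytes >= 0x80 fails this check.
    public static boolean looksLikeAsciiText(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            boolean printable = v >= 0x20 && v < 0x7F;
            if (!whitespace && !printable) {
                return false;
            }
        }
        return true;
    }

    // Alternative heuristic: the bytes must decode as strict UTF-8 and
    // contain no control characters other than tab/LF/CR.
    public static boolean looksLikeUtf8Text(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        String text;
        try {
            text = decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean allowed = c == '\t' || c == '\n' || c == '\r';
            if (Character.isISOControl(c) && !allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Japanese sample text, written with Unicode escapes so the
        // source file itself stays pure ASCII.
        byte[] sample = "\u3053\u3093\u306b\u3061\u306f\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("ASCII-only check: " + looksLikeAsciiText(sample));
        System.out.println("UTF-8 check:      " + looksLikeUtf8Text(sample));
    }
}
```

A strict UTF-8 decode would still miss text in legacy encodings, so this only shows one way such a gap can arise.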