Gregory Lepore created TIKA-4053:
------------------------------------

             Summary: Improve detection of text files
                 Key: TIKA-4053
                 URL: https://issues.apache.org/jira/browse/TIKA-4053
             Project: Tika
          Issue Type: Sub-task
            Reporter: Gregory Lepore
         Attachments: 1990-01.etc, 2008-09.3, 20220708 YouTube1-1.kif, 
bub0336d.007, pacman.nas, shab3_36.qbp, wots.diz

Common Crawl data shows many text files that are being detected as
application/octet-stream. Some of these misdetections appear to occur because
the text is in a language other than English.

 

Various sample files are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
