[ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426550#comment-13426550 ]
Jukka Zitting commented on TIKA-965:
------------------------------------

I see where you're going, but it's a really tricky path. I tried doing something like that earlier on, but I found no easy way to keep down the number of false positives. The ICU4J classes are written with the assumption that the data you're working on is always text and they just figure out which character encoding is most likely. They fail to take into account the possibility of the document being in some unknown binary format.

That's why we currently run the full ICU4J encoding detection (using the {{o.a.t.parser.txt.Icu4jEncodingDetector}} and {{o.a.t.detect.AutoDetectReader}} classes, see TIKA-322 and TIKA-471) only once we already know by other means that we're dealing with textual data.

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8
> characters the TextDetector and TextStatistics classes fail to detect it as
> text.
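
The ordering described in the comment (first decide by other means that the data is textual, and only then try to work out its encoding) can be illustrated with a minimal sketch. This is not Tika's TextDetector, not the attached patch, and not the ICU4J API: the class TextGateSketch, its method names, and the control-byte rule are all hypothetical, and a strict JDK UTF-8 decode stands in for real encoding detection. It only shows why a gate that looks for binary control bytes, rather than at the share of ASCII characters, would not reject a mostly non-ASCII UTF-8 file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; none of these names exist in Tika or ICU4J. */
final class TextGateSketch {

    /**
     * Stage 1 (assumed rule): treat the data as binary if it contains control
     * bytes that plain text rarely uses. Bytes >= 0x80 are deliberately not
     * counted against the input, so non-ASCII text is not penalised here.
     */
    static boolean hasNoBinaryControlBytes(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;
            boolean allowed = v == '\t' || v == '\n' || v == '\r' || v == '\f' || v == 0x1B;
            if (v < 0x20 && !allowed) {
                return false;
            }
        }
        return true;
    }

    /**
     * Stage 2 (stand-in for encoding detection): strict UTF-8 decoding.
     * Note that a sample cut off in the middle of a multi-byte character
     * would also be reported as invalid by this check.
     */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Runs the encoding check only after the gate has accepted the data as text. */
    static String classify(byte[] data) {
        if (!hasNoBinaryControlBytes(data)) {
            return "binary";
        }
        return isValidUtf8(data) ? "text/utf-8" : "text/unknown-encoding";
    }
}
```

Calling TextGateSketch.classify on the first few kilobytes of a file would give a rough answer; the false-positive concern raised in the comment still applies, since an unknown binary format that happens to avoid low control bytes would pass the stage 1 gate.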