[ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426541#comment-13426541 ]
Ray Gauss II commented on TIKA-965:
-----------------------------------

I have a test file that I've gotten permission to include:
[http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/test-resources/quick/quick.txt]

Other encodings/charsets are part of what I was trying to address with the {{Charset}} solution. If we add more {{CharsetRecognizer}} implementations, we can easily plug them into the {{TextDetector}} by adding them to {{VALID_TEXT_CHARSETS}}.

The charset detection only kicks in if magic detection has failed and {{TextDetector}} comes up with {{isMostlyASCII=false}}, which should happen only in rare cases, so I don't think we need to be too concerned with performance.

Here's what the relevant section in {{TextDetector}} ends up looking like:
{code}
if (stats.isMostlyAscii()) {
    return MediaType.TEXT_PLAIN;
} else {
    // Try detecting a valid text charset
    input.reset();
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(input);
    CharsetMatch match = charsetDetector.detect();
    if (match != null
            && match.getConfidence() >= MINIMUM_CHARSET_MATCH_CONFIDENCE
            && VALID_TEXT_CHARSETS.contains(match.getName())) {
        return MediaType.TEXT_PLAIN;
    }
    return MediaType.OCTET_STREAM;
}
{code}

It seems simple enough, but I'm happy to pursue whatever solution people want.

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8
> characters the TextDetector and TextStatistics classes fail to detect it as
> text.
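
For anyone who wants to try the idea without a Tika checkout, the fallback described in the comment can be approximated with the JDK alone. This is only a sketch under stated assumptions, not the patch itself: the class {{TextFallbackSketch}}, the 90% ASCII threshold, and the string return values are hypothetical, and strict decoding with {{java.nio.charset.CharsetDecoder}} stands in for the confidence-based {{CharsetDetector}} used in the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for the TextDetector fallback; not Tika code.
final class TextFallbackSketch {
    static final String TEXT_PLAIN = "text/plain";
    static final String OCTET_STREAM = "application/octet-stream";

    // Assumed whitelist, mirroring the role of VALID_TEXT_CHARSETS.
    static final List<Charset> VALID_TEXT_CHARSETS = List.of(StandardCharsets.UTF_8);

    // Simplified heuristic (assumption): at least 90% of the bytes are
    // printable ASCII or common whitespace.
    static boolean isMostlyAscii(byte[] data) {
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v < 0x7F) || v == '\t' || v == '\n' || v == '\r') {
                printable++;
            }
        }
        return printable * 100L >= data.length * 90L;
    }

    // Strict decode: any malformed or unmappable sequence rejects the charset.
    // Control characters other than tab, LF and CR also count as "not text".
    static boolean looksLikeText(byte[] data, Charset charset) {
        CharBuffer chars;
        try {
            chars = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
        } catch (CharacterCodingException e) {
            return false;
        }
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }

    static String detect(byte[] data) {
        if (data.length == 0) {
            return OCTET_STREAM; // nothing to inspect (choice made for this sketch)
        }
        if (isMostlyAscii(data)) {
            return TEXT_PLAIN;
        }
        // Fallback: only reached when the ASCII heuristic fails.
        for (Charset charset : VALID_TEXT_CHARSETS) {
            if (looksLikeText(data, charset)) {
                return TEXT_PLAIN;
            }
        }
        return OCTET_STREAM;
    }
}
```

Calling {{TextFallbackSketch.detect(bytes)}} on the full contents of a mostly non-ASCII UTF-8 file should yield {{text/plain}}, while arbitrary binary data should fall through to {{application/octet-stream}}. One caveat: a detector that inspects only a prefix of the stream may cut a multi-byte sequence in half, which strict decoding would reject, so this sketch assumes it is given complete content.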