[ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426550#comment-13426550
 ] 

Jukka Zitting commented on TIKA-965:
------------------------------------

I see where you're going, but it's a really tricky path. I tried something 
similar earlier on, but found no easy way to keep the number of false 
positives down.

The ICU4J classes are written with the assumption that the data you're working 
on is always text, and they just figure out which character encoding is most 
likely. They don't take into account the possibility that the document is in 
some unknown binary format.

That's why we currently run the full ICU4J encoding detection (using the 
{{o.a.t.parser.txt.Icu4jEncodingDetector}} and 
{{o.a.t.detect.AutoDetectReader}} classes, see TIKA-322 and TIKA-471) only once 
we already know by other means that we're dealing with textual data.
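
To illustrate the ordering described above, here is a minimal sketch of a 
two-stage check. It is not Tika's or ICU4J's actual code: the class name 
TextGate, the 2% control-byte threshold, and the result labels are all 
hypothetical, and a strict UTF-8 decode from the JDK stands in for the full 
ICU4J encoding detection. The point is only that the encoding step runs after 
a separate "is this text at all?" gate.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika or ICU4J.
class TextGate {

    // Assumed threshold: tolerate at most 2% control bytes.
    static final double MAX_CONTROL_RATIO = 0.02;

    // Stage 1: cheap gate that decides whether the bytes are plausibly text.
    // It counts C0 control bytes, ignoring tab, LF, FF, CR and ESC.
    // Bytes >= 0x80 are deliberately NOT treated as suspicious here.
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int control = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            boolean allowed = v == '\t' || v == '\n' || v == '\r'
                    || v == 0x0C || v == 0x1B;
            if (v < 0x20 && !allowed) {
                control++;
            }
        }
        return (double) control / data.length <= MAX_CONTROL_RATIO;
    }

    // Stage 2: stand-in for real encoding detection. A strict decoder
    // reports malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // The encoding step is only reached once stage 1 has accepted the data.
    static String classify(byte[] data) {
        if (!looksLikeText(data)) {
            return "binary";
        }
        return isValidUtf8(data) ? "text/utf-8" : "text/unknown-encoding";
    }
}
```

With this shape, a buffer full of NUL and other control bytes is rejected 
before any encoding guess is made, while a mostly non-ASCII UTF-8 buffer still 
passes the gate. How strict the gate can be without raising false positives is 
exactly the hard part.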
                
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 
> 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and mostly multi-byte 
> (8-bit) UTF-8 characters, the TextDetector and TextStatistics classes fail to 
> detect it as text.


        
