[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880341#comment-17880341 ]
Tim Allison commented on TIKA-4305:
-----------------------------------

Thank you for raising this issue.

For the following, I'm running a unit test in Tika's main branch in the {{tika-parsers-standard-package}} module:

* UTF-8 works.
* UCS-4 is detected correctly by the ICU4j detector as UTF-32BE.
* UCS-2 is detected correctly by the ICU4j detector as UTF-16LE.
* UTF-7 is incorrectly detected as windows-1252 by the UniversalCharsetDetector. If I turn off the UniversalCharsetDetector, the ICU4j detector incorrectly detects charset=ISO-8859-1.

The fork of UniversalCharsetDetector that we use (https://github.com/albfernandez/juniversalchardet) does not claim to detect utf-7. ICU4j also does not detect utf-7 (https://unicode-org.github.io/icu/userguide/conversion/detection.html#detected-encodings).

So, if you can open a ticket in one of those projects and/or identify another charset detector that has a friendly license and can detect utf-7, we should look into adding that to Tika.

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as
> UTF-8
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-4305
>                 URL: https://issues.apache.org/jira/browse/TIKA-4305
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-core
>    Affects Versions: 2.9.2
>         Environment: Ubuntu 22.04 LTS
>            Reporter: Manish S N
>            Priority: Minor
>         Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, tika_UTF-7_output.txt
>
>
> Tika producing empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors just an empty string.
> Other formats are okay except UTF-7 files are parsed as UTF-8 which wreaks
> havoc with non ascii characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open
> dialog of gedit and found the outputs similar.
>
> I am attaching all four encoded files, along with Tika's output from parsing
> the UTF-7 file, for reference.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
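
A side note on the UTF-7 result in the comment above: a UTF-7 file consists entirely of 7-bit ASCII bytes, so every ASCII-compatible charset decodes it without error and to the same text. That is presumably why detectors without UTF-7 support settle on windows-1252 or ISO-8859-1, and why the file looks "similar" when opened as UTF-8. The following is a minimal, JDK-only sketch of that ambiguity. It does not use Tika, ICU4j, or juniversalchardet, and the class and method names (Utf7Ambiguity, isPureAscii, decodesIdentically) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or any detector library.
public class Utf7Ambiguity {

    // True if every byte is in the 7-bit ASCII range (0x00-0x7F).
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // Java bytes are signed, so 0x80-0xFF are negative
                return false;
            }
        }
        return true;
    }

    // True if all given charsets decode the bytes to the same string.
    static boolean decodesIdentically(byte[] data, Charset... charsets) {
        if (charsets.length == 0) {
            return true;
        }
        String first = new String(data, charsets[0]);
        for (Charset cs : charsets) {
            if (!first.equals(new String(data, cs))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "café" in UTF-7: U+00E9 is written as the shifted sequence "+AOk-".
        byte[] utf7 = "caf+AOk-".getBytes(StandardCharsets.US_ASCII);

        System.out.println("pure ASCII: " + isPureAscii(utf7));
        System.out.println("same text in UTF-8, ISO-8859-1, windows-1252: "
                + decodesIdentically(utf7,
                        StandardCharsets.UTF_8,
                        StandardCharsets.ISO_8859_1,
                        Charset.forName("windows-1252")));
    }
}
```

Both checks hold for this sample, so nothing in the bytes themselves rules out a single-byte charset; a detector has to recognize the "+...-" shifted sequences explicitly to report UTF-7.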