[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880604#comment-17880604 ]
Manish S N commented on TIKA-4305:
----------------------------------

This is from running the Tika 3.0.0-beta app jar with the {{java -jar <file>}} command.

Note that UCS-2 and UCS-4 produce the same result: unlike UTF-8 (which is correctly detected as plain text and parsed by the default parser), the UCS content is detected as octet-stream and parsed by the empty parser.

(This time I tested with both the 2.9.2 and 3.0.0-beta runnable jars, and the results were identical. I had also suspected that I was missing some parser dependencies in my [^pom.xml], but the problem shows up with the bundled runnable jar too.)
{code:java}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
<meta name="resourceName" content="multilingual_test_new_UCS-2.txt"/>
<meta name="Content-Length" content="10282"/>
<meta name="Content-Type" content="application/octet-stream"/>
<title/>
</head>
<body/></html>
{code}
(P.S.: I used gedit's built-in +_Save As_+ feature to write the files in the different encodings.)

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as
> UTF-8
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-4305
>                 URL: https://issues.apache.org/jira/browse/TIKA-4305
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-core
>    Affects Versions: 2.9.2
>         Environment: Ubuntu 22.04 LTS
>            Reporter: Manish S N
>            Priority: Minor
>         Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> There are no logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which wreaks
> havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using gedit's open
> dialog and found the output similar to Tika's.
>
> I am attaching all four encoded files, along with Tika's output from parsing
> the UTF-7 file, for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
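
When comparing the attached files, it may help to check whether each one starts with a byte order mark (BOM), since the report does not say whether gedit wrote one. The following is a minimal, stdlib-only Java sketch for that check. The class name {{BomSniffer}} and the method {{detectBom}} are hypothetical names chosen for illustration; this is a diagnostic aid only and makes no claim about how Tika's own detector works.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of Tika): reports which Unicode BOM,
// if any, appears at the start of a file.
class BomSniffer {

    // Returns the encoding implied by a leading BOM, or "none".
    // UTF-32LE is tested before UTF-16LE because its BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
    // character is U+0000 would therefore be reported as UTF-32LE.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Usage: java BomSniffer.java file1.txt file2.txt ...
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] head;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                head = in.readNBytes(4); // requires Java 11 or later
            }
            System.out.println(name + ": BOM = " + detectBom(head));
        }
    }
}
```

Running it against the four attachments would show which of them, if any, carry a BOM. A result of "none" for the UCS-2 and UCS-4 files would be one possible lead to follow up on, not a confirmed cause.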