[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

Manish S N (Jira) Tue, 10 Sep 2024 04:36:10 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880604#comment-17880604
 ]


Manish S N commented on TIKA-4305:
----------------------------------

This is the output of running the Tika 3.0.0-beta app jar with 
{{java -jar <file>}}. UCS-2 and UCS-4 both produce the same result: unlike 
UTF-8 (which is correctly detected as plain text and handled by the default 
parser), the UCS content is detected as application/octet-stream and handled 
by EmptyParser.

(This time I tested both the 2.9.2 and the 3.0.0-beta runnable jars, and the 
results were identical. I had suspected that my [^pom.xml] was missing some 
parser dependencies, but the problem also occurs with the bundled runnable jar.)
{code:java}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
<meta name="resourceName" content="multilingual_test_new_UCS-2.txt"/>
<meta name="Content-Length" content="10282"/>
<meta name="Content-Type" content="application/octet-stream"/>
<title/>
</head>
<body/></html>
{code}
(P.S.: I used gedit's built-in +_Save As_+ feature to produce the different encodings.)
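
One thing that may help narrow down the octet-stream result is whether the saved files start with a byte-order mark (BOM). The snippet below is only a diagnostic sketch using the plain JDK, not Tika code: the class name {{BomSniffer}} and its methods are hypothetical, and whether gedit writes a BOM for these encodings is an assumption I have not verified.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Tika): names the Unicode BOM, if any, at the start of a file. */
final class BomSniffer {

    /** Returns the encoding implied by a leading byte-order mark, or "none" if there is no BOM. */
    static String sniff(byte[] head) {
        // Check the 4-byte UTF-32 marks first: the UTF-32LE BOM begins with the UTF-16LE BOM.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BomSniffer.java multilingual_test_new_UCS-2.txt ...
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Path.of(name))) {
                // Only the first four bytes are needed to recognise any of the marks above.
                System.out.println(name + ": BOM = " + sniff(in.readNBytes(4)));
            }
        }
    }
}
```

Note that the bytes FF FE 00 00 are ambiguous in principle (a UTF-16LE BOM followed by a NUL character), so the result is a hint rather than proof. If a file reports "none", any detector has to guess the encoding from the content alone.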

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-4305
>                 URL: https://issues.apache.org/jira/browse/TIKA-4305
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-core
>    Affects Versions: 2.9.2
>         Environment: Ubuntu 22.04 LTS
>            Reporter: Manish S N
>            Priority: Minor
>         Attachments: multilingual_test_new_UCS-2.txt, 
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt, 
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> There are no logs or errors, just an empty string.
> Other encodings are fine, except that UTF-7 files are parsed as UTF-8, which 
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog 
> of gedit and found the output similar.
>  
> I am attaching all four encoded files, along with Tika's output from parsing 
> the UTF-7 file, for reference.
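
To illustrate why reading UTF-7 as UTF-8 gives no error but garbles non-ASCII text: UTF-7 output consists only of ASCII bytes, so it is always valid UTF-8, and the {{+...-}} escape blocks simply come through literally. The sketch below uses the well-known example "£1", whose UTF-7 form is {{+AKM-1}}. The class {{Utf7Demo}} and its method are hypothetical, the decoder handles only a single shifted block, and none of this is taken from the attached files or from Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical demo (not Tika code): what a UTF-7 escape looks like when read as UTF-8. */
final class Utf7Demo {

    /**
     * Decodes the body of one UTF-7 shifted block (the text between '+' and '-').
     * The body is unpadded base64 of UTF-16BE code units.
     */
    static String decodeShiftedBlock(String base64Body) {
        byte[] utf16 = Base64.getDecoder().decode(base64Body);
        // Ignore a trailing odd byte so that only whole 16-bit code units are decoded.
        int usable = utf16.length - (utf16.length % 2);
        return new String(utf16, 0, usable, StandardCharsets.UTF_16BE);
    }

    /** Reads the given ASCII text as UTF-8, the way a parser assuming UTF-8 would. */
    static String readAsUtf8(String utf7Text) {
        byte[] bytes = utf7Text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Read as UTF-8, the escape is passed through unchanged.
        System.out.println(readAsUtf8("+AKM-1"));
        // Decoding the shifted block recovers the pound sign (U+00A3).
        System.out.println(decodeShiftedBlock("AKM") + "1");
    }
}
```

This matches the symptom described above: ASCII characters survive, while every non-ASCII character shows up as a {{+...-}} sequence.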



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
