[jira] [Comment Edited] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

Tim Allison (Jira) Tue, 10 Sep 2024 06:22:39 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641 ]


Tim Allison edited comment on TIKA-4305 at 9/10/24 1:18 PM:
------------------------------------------------------------

OK, so there are two different issues.

1) Above, I was focusing on the difference in detection between UCS-2 vs UTF-16 
and UCS-4 vs UTF-32. There's not much we can do about that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the file name is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these works for communicating the file name to Tika:

a)    System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));
b)    Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
            System.out.println(new Tika().parseToString(is, metadata));
        }
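
On the first issue, here is a minimal sketch of why UCS-2 and UTF-16 are hard to tell apart: for text that stays inside the Basic Multilingual Plane, the two encodings yield the same bytes, so the content alone carries no distinguishing signal. The class and method names below (Ucs2VsUtf16, encodeUcs2BigEndian, sameBytesAsUtf16) are made up for illustration and are not part of Tika. The sketch uses only the JDK and assumes big-endian byte order without a BOM.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Tika.
public class Ucs2VsUtf16 {

    // UCS-2 stores each code point directly as one 16-bit unit,
    // so it can only represent code points up to U+FFFF.
    static byte[] encodeUcs2BigEndian(String text) {
        int[] codePoints = text.codePoints().toArray();
        byte[] out = new byte[codePoints.length * 2];
        for (int i = 0; i < codePoints.length; i++) {
            int cp = codePoints[i];
            if (cp > 0xFFFF) {
                throw new IllegalArgumentException(
                        "Not representable in UCS-2: U+" + Integer.toHexString(cp).toUpperCase());
            }
            out[2 * i] = (byte) (cp >>> 8); // high byte first (big-endian)
            out[2 * i + 1] = (byte) cp;     // low byte
        }
        return out;
    }

    // True when the hand-rolled UCS-2 bytes match the JDK's UTF-16BE bytes.
    static boolean sameBytesAsUtf16(String text) {
        return Arrays.equals(encodeUcs2BigEndian(text),
                text.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        // Mixed Latin and Cyrillic sample, all inside the BMP.
        System.out.println(sameBytesAsUtf16("Tika \u0442\u0435\u0441\u0442"));
    }
}
```

The two encodings only diverge once the text contains code points above U+FFFF, which UTF-16 writes as surrogate pairs and UCS-2 cannot represent at all (the sketch throws in that case).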


was (Author: talli...@mitre.org):
K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the filename is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a)    System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));
b)    Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(Paths.get("/home/tallison/Downloads/multilingual_test_new_UCS-2.txt"), metadata)) {
            System.out.println(new Tika().parseToString(is, metadata));
        }

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-4305
>                 URL: https://issues.apache.org/jira/browse/TIKA-4305
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-core
>    Affects Versions: 2.9.2
>         Environment: Ubuntu 22.04 LTS
>            Reporter: Manish S N
>            Priority: Minor
>         Attachments: multilingual_test_new_UCS-2.txt, 
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt, 
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which 
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open 
> dialog of gedit and found the output similar.
>  
> I am attaching all four encoded files, along with Tika's output from parsing 
> the UTF-7 file, for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
