[jira] [Created] (TIKA-4303) Unable to extract Chinese content in onenote

lqangi (Jira) Thu, 29 Aug 2024 01:35:23 -0700

lqangi created TIKA-4303:
----------------------------

             Summary: Unable to extract Chinese content in onenote
                 Key: TIKA-4303
                 URL: https://issues.apache.org/jira/browse/TIKA-4303
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.8.0
            Reporter: lqangi
         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png


When I used Tika to extract the contents of a OneNote file containing Chinese 
text, the Chinese portions of the file were not extracted; only the 
non-Chinese content was extracted.

In addition, some of the extracted content is duplicated. As described in 
[TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], Tika seems to 
extract historical versions of the data along with the current content. I 
don't know whether TIKA-3970 has been fixed: I see that code has been 
committed on GitHub, but it does not seem to have completely solved the 
problem yet.


The software versions I use are as follows:

Tika: 2.8.0

Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)

 

To reproduce this problem, open the attachment "Chinese-notes.one" with the 
2.8.0 version of Tika App and check whether the Chinese content in the file 
is extracted.
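
The check in the reproduction step could also be done programmatically 
instead of by eye. The following is only a minimal sketch, not part of Tika: 
it assumes the extracted text has already been saved to a UTF-8 file (for 
example by redirecting the output of something like 
`java -jar tika-app-2.8.0.jar --text Chinese-notes.one`, where the jar name 
and the output encoding are assumptions), and the class name `HanCheck` and 
its method `countHan` are hypothetical. It counts how many characters in the 
text belong to the Han script, so a count of zero would match the behaviour 
reported here.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: counts Han characters in extracted text.
public class HanCheck {

    /** Returns the number of code points whose Unicode script is Han. */
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java HanCheck <extracted-text-file>");
            System.exit(2);
        }
        // Assumption: the file holds Tika's plain-text output, encoded as UTF-8.
        String text = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        long han = countHan(text);
        System.out.println("Han characters found: " + han);
        if (han == 0) {
            System.out.println("No Chinese text was extracted.");
        }
    }
}
```

Note that CJK punctuation is classified as a common script rather than Han, 
so only ideographs are counted.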



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
