[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916735#comment-17916735 ]

ASF GitHub Bot commented on TIKA-4303:
--------------------------------------

nddipiazza commented on PR #2098:
URL: https://github.com/apache/tika/pull/2098#issuecomment-2612666198

   @sunluman can you produce an example or unit test?


> Unable to extract Chinese content in onenote
> --------------------------------------------
>
>                 Key: TIKA-4303
>                 URL: https://issues.apache.org/jira/browse/TIKA-4303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: lqangi
>            Priority: Major
>         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese
> text using Tika, the Chinese part of the file could not be extracted; only
> the non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated. As described in
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], Tika seems to
> extract historical versions of the data along with the current content. I
> don't know whether TIKA-3970 has been fixed (I see that code has been
> committed on GitHub, but it doesn't seem to have completely solved the
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>
> To reproduce this problem, open the attachment "Chinese-notes.one" with
> version 2.8.0 of the Tika App and check whether the Chinese content in the
> file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)