lqangi created TIKA-4303:
----------------------------

             Summary: Unable to extract Chinese content in onenote
                 Key: TIKA-4303
                 URL: https://issues.apache.org/jira/browse/TIKA-4303
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.8.0
            Reporter: lqangi
         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png

When I use Tika to extract the contents of a OneNote file that contains Chinese text, the Chinese text is not extracted; only the non-Chinese content comes through.

In addition, some of the extracted content is duplicated, as described in [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: Tika appears to extract historical versions of the data along with the current content. I don't know whether TIKA-3970 has been fixed. I can see that code has been committed on GitHub, but it does not seem to have completely solved the problem yet.

The software versions I use are as follows:
Tika: 2.8.0
OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)

To reproduce the problem, open the attachment "Chinese-notes.one" with version 2.8.0 of the Tika App and check whether the Chinese content in the file is extracted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
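
The manual check in the reproduction step (whether any Chinese content was extracted) could also be done mechanically. The following is only a sketch using the Java standard library, not part of Tika: the class name `HanTextCheck` and its methods are hypothetical, and it assumes the extracted text is supplied on standard input as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: counts Han (Chinese) characters
// in text that has already been extracted by some other tool.
final class HanTextCheck {

    /** Returns the number of code points in the text that belong to the Han script. */
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    /** Returns true if the text contains at least one Han character. */
    static boolean containsHan(String text) {
        return countHan(text) > 0;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the extracted text arrives on stdin encoded as UTF-8.
        String text = new String(System.in.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println("Han code points found: " + countHan(text));
    }
}
```

One possible use is to pipe the plain-text output of the Tika App (for example, its `--text` mode run on "Chinese-notes.one") into this program. A count of zero would match the behaviour reported here, provided the output encoding really is UTF-8.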