[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916931#comment-17916931 ]
ASF GitHub Bot commented on TIKA-4303:
--------------------------------------
tballison commented on PR #2098:
URL: https://github.com/apache/tika/pull/2098#issuecomment-2613969450
@nddipiazza wdyt?
> Unable to extract Chinese content in onenote
> --------------------------------------------
>
> Key: TIKA-4303
> URL: https://issues.apache.org/jira/browse/TIKA-4303
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.8.0, 2.9.2
> Reporter: lqangi
> Priority: Major
> Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese text
> with Tika, the Chinese part of the file was not extracted; only the
> non-Chinese content was.
> In addition, some of the extracted content is duplicated, as described in
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: Tika seems to
> extract historical versions of the data along with the current content. I
> don't know whether TIKA-3970 has been fixed (I see that code has been
> committed on GitHub, but it does not seem to have completely solved the
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>
> To reproduce the problem, open the attachment "Chinese-notes.one" with the
> 2.8.0 version of the Tika App and check whether the Chinese content in the
> file is extracted.
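
The check in the reproduction step can also be done without reading the output by eye. The following is only a minimal sketch: `HanCheck` and `countHan` are hypothetical names, not part of Tika, and the sketch assumes the extracted text reaches it as UTF-8 on standard input. It counts the code points that Java classifies as Han script, so a count of zero suggests that no Chinese characters were extracted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: counts Han (Chinese) characters in text.
final class HanCheck {

    // Counts the code points whose Unicode script is HAN.
    // Punctuation such as the full-width comma belongs to script COMMON and is not counted.
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the extracted text is piped in on stdin, encoded as UTF-8.
        String text = new String(System.in.readAllBytes(), StandardCharsets.UTF_8);
        long han = countHan(text);
        System.out.println("Han characters found: " + han);
        // Exit with a non-zero status when nothing was found, which is convenient in scripts.
        System.exit(han > 0 ? 0 : 1);
    }
}
```

Assuming Java 11 or later and the 2.8.0 Tika App jar in the current directory, one possible invocation is `java -jar tika-app-2.8.0.jar --text --encoding=UTF-8 Chinese-notes.one | java HanCheck.java`.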