[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote

ASF GitHub Bot (Jira) Sat, 25 Jan 2025 05:47:37 -0800


    [ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916931#comment-17916931 ]


ASF GitHub Bot commented on TIKA-4303:
--------------------------------------

tballison commented on PR #2098:
URL: https://github.com/apache/tika/pull/2098#issuecomment-2613969450

   @nddipiazza wdyt?




> Unable to extract Chinese content in onenote
> --------------------------------------------
>
>                 Key: TIKA-4303
>                 URL: https://issues.apache.org/jira/browse/TIKA-4303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: lqangi
>            Priority: Major
>         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese
> text using Tika, the Chinese part of the file was not extracted; only the
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: Tika seems to
> extract historical versions of the data along with the current content. I
> don't know whether TIKA-3970 has been fixed (I see that code has been
> committed on GitHub, but it doesn't seem to have completely solved the
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> To reproduce this problem, open the attachment "Chinese-notes.one" with the
> 2.8.0 version of the Tika App and check whether the Chinese content in the
> file is extracted.
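
The check described in the reproduction step can be automated. The following is a minimal sketch, not part of the original report. It assumes the extracted text has already been saved to a UTF-8 file, for example by running the Tika App with its `-t` (plain text) option and redirecting the output. The class name `HanCheck` and the file argument are hypothetical. It uses only the JDK and counts code points in the Unicode Han script, so a count of zero for "Chinese-notes.one" would be consistent with the behaviour reported here.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reports whether extracted text contains Han characters.
final class HanCheck {

    // Counts code points whose Unicode script is HAN (CJK ideographs).
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    // True if at least one Han character is present.
    static boolean containsHan(String text) {
        return countHan(text) > 0;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be a UTF-8 text file holding the extraction output.
        String text = args.length > 0
                ? Files.readString(Path.of(args[0]), StandardCharsets.UTF_8)
                : "";
        System.out.println("Han characters found: " + countHan(text));
    }
}
```

Counting by Unicode script rather than by a fixed code point range also covers the CJK extension blocks, at the cost of not distinguishing Chinese from Japanese kanji.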



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
