[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote

Tilman Hausherr (Jira) Thu, 29 Aug 2024 01:47:38 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877631#comment-17877631 ]


Tilman Hausherr commented on TIKA-4303:
---------------------------------------

I tried with the 3.0 beta, and there I get more output:
====  
中文标题�
�
中文标题�
中文标题�
zhongwen�
中文标题�
中文标题�
中文标题�
中文标题�
�
14:08
zhongwen
zhongwen�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
====
So maybe changes were made in 3.0 but never backported to 2.9.

> Unable to extract Chinese content in onenote
> --------------------------------------------
>
>                 Key: TIKA-4303
>                 URL: https://issues.apache.org/jira/browse/TIKA-4303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.8.0
>            Reporter: lqangi
>            Priority: Major
>         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese 
> text using Tika, the Chinese part of the file could not be extracted; only 
> the non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]. It seems that 
> historical versions of the data are extracted along with the current 
> version. I don't know whether TIKA-3970 has been fixed (I see that code has 
> been committed on GitHub, but it doesn't seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> To reproduce this problem, open the attachment "Chinese-notes.one" with 
> version 2.8.0 of the Tika App and check whether the Chinese content in the 
> file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
