[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote

ASF GitHub Bot (Jira) Mon, 13 Jan 2025 06:37:09 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912556#comment-17912556
 ]


ASF GitHub Bot commented on TIKA-4303:
--------------------------------------

sunluman opened a new pull request, #2098:
URL: https://github.com/apache/tika/pull/2098

   Fixes https://issues.apache.org/jira/browse/TIKA-4303
   
   The issue of garbled text is caused by 
`OneNotePropertyEnum.CachedTitleString` not being correctly parsed. It should 
be parsed using `handleRichEditTextUnicode`.
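   
   To illustrate the symptom, here is a minimal sketch, not Tika code. It 
assumes the title bytes are stored as UTF-16LE, as the name 
`handleRichEditTextUnicode` suggests. The class and method names are 
hypothetical. Decoding the same bytes one byte per character produces garbled 
output, while decoding them as UTF-16LE recovers the Chinese text:
   ```java
   import java.nio.charset.StandardCharsets;

   // Hypothetical illustration only; none of these names exist in Tika.
   class TitleDecodingSketch {

       // Decode raw property bytes as UTF-16LE (assumed storage encoding).
       static String decodeUtf16Le(byte[] raw) {
           return new String(raw, StandardCharsets.UTF_16LE);
       }

       // Decode the same bytes as a single-byte charset, one char per byte.
       static String decodeSingleByte(byte[] raw) {
           return new String(raw, StandardCharsets.ISO_8859_1);
       }

       public static void main(String[] args) {
           // Two Chinese characters, written as escapes to avoid
           // source-encoding problems.
           byte[] raw = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_16LE);
           System.out.println(decodeUtf16Le(raw));             // original text
           System.out.println(decodeSingleByte(raw).length()); // 4 chars, garbled
       }
   }
   ```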
   
   As for why versions `2.7.0` and earlier did not encounter garbled text, I 
believe it was due to a previously erroneous line of code:
   ```java
   if (options.getUtf16PropertiesToPrint().contains(propertyValue.propertyId))
   ```
   
   Because of this line, the OneNote parser never appended the parsed content 
of `OneNotePropertyEnum.ImageFilename`, `OneNotePropertyEnum.Author`, and 
`OneNotePropertyEnum.CachedTitleString` to the XHTML output.
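   
   This description does not spell out what was wrong with that line. One way 
such a check can silently never match, shown here purely as an assumption with 
hypothetical types (`PropEnum`, `PropId`), is a type mismatch: 
`Set.contains(Object)` accepts any object, so passing a wrapper instead of the 
enum value compiles but always returns `false`:
   ```java
   import java.util.EnumSet;
   import java.util.Set;

   // Hypothetical illustration only; these types are stand-ins, not Tika's.
   class ContainsMismatchSketch {

       enum PropEnum { IMAGE_FILENAME, AUTHOR, CACHED_TITLE_STRING }

       // Wrapper that carries the enum value, standing in for a property id.
       static final class PropId {
           final PropEnum propertyEnum;
           PropId(PropEnum propertyEnum) { this.propertyEnum = propertyEnum; }
       }

       static final Set<PropEnum> TO_PRINT =
               EnumSet.of(PropEnum.AUTHOR, PropEnum.CACHED_TITLE_STRING);

       // Compiles, but a PropId is never an element of a Set<PropEnum>.
       static boolean wrapperCheck(PropId id) {
           return TO_PRINT.contains(id);
       }

       // Compares like with like, so configured properties are found.
       static boolean enumCheck(PropId id) {
           return TO_PRINT.contains(id.propertyEnum);
       }

       public static void main(String[] args) {
           PropId title = new PropId(PropEnum.CACHED_TITLE_STRING);
           System.out.println(wrapperCheck(title)); // false
           System.out.println(enumCheck(title));    // true
       }
   }
   ```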
   
   However, when parsing `OneNotePropertyEnum.RichEditTextUnicode`, the logic 
for only parsing the latest version’s content was not added. As a result, the 
files appeared to be successfully parsed and without garbled text, but in 
reality, CachedTitleString was never parsed.
   
   This PR fixes only the bug reported in the issue: the title in the uploaded 
file was not parsed. While testing, I also found the following problems:
   - Non-rich text content is not checked for the latest version, so when the 
content is `TextExtendedAscii`, it is still parsed repeatedly.
   - Dates are not parsed.
   - Chinese characters (and possibly other non-ASCII characters; I'm not 
sure) in the content are not parsed.
   
   I am not sure whether to create a new issue before proceeding with these 
fixes, so these issues have not been addressed in this PR.




> Unable to extract Chinese content in onenote
> --------------------------------------------
>
>                 Key: TIKA-4303
>                 URL: https://issues.apache.org/jira/browse/TIKA-4303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: lqangi
>            Priority: Major
>         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese 
> text using Tika, the Chinese part of the file could not be extracted; only 
> the non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: it seems that 
> historical versions of the data are extracted along with the current one. I 
> don't know whether TIKA-3970 has been fixed (I see that code has been 
> committed on GitHub, but it does not seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> To reproduce this problem, open the attachment "Chinese-notes.one" with 
> version 2.8.0 of the Tika App and check whether the Chinese content in the 
> file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
