[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912556#comment-17912556 ]
ASF GitHub Bot commented on TIKA-4303:
--------------------------------------

sunluman opened a new pull request, #2098:
URL: https://github.com/apache/tika/pull/2098

Fixes https://issues.apache.org/jira/browse/TIKA-4303

The garbled text is caused by `OneNotePropertyEnum.CachedTitleString` not being parsed correctly. It should be parsed using `handleRichEditTextUnicode`.

As for why versions `2.7.0` and earlier did not produce garbled text, I believe this was due to a line of code that was previously erroneous:

```java
if (options.getUtf16PropertiesToPrint().contains(propertyValue.propertyId))
```

Because of this line, the OneNote parser never appended the parsed content of `OneNotePropertyEnum.ImageFilename`, `OneNotePropertyEnum.Author`, and `OneNotePropertyEnum.CachedTitleString` to the XHTML. However, when parsing `OneNotePropertyEnum.RichEditTextUnicode`, the logic for parsing only the latest version's content was not added. As a result, the files appeared to be parsed successfully and without garbled text, but in reality `CachedTitleString` was never parsed.

This PR fixes only the bug from the issue, where the title in the uploaded file was not parsed. During testing, I also discovered the following issues:

- Non-rich text content is not checked for the latest version, so when the content is `TextExtendedAscii`, it is still parsed repeatedly.
- Dates are not parsed.
- Chinese characters (or possibly other non-ASCII characters; I'm not sure) in the content are not parsed.

I am not sure whether I should create a new issue before proceeding with these fixes, so they have not been addressed in this PR.
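To illustrate why decoding a title with the wrong handler yields garbled text, here is a minimal sketch. It assumes the property bytes are UTF-16LE, as the `Unicode` suffix of the rich-text property suggests. The class and method names are hypothetical and are not Tika's `handleRichEditTextUnicode` implementation.

```java
import java.nio.charset.StandardCharsets;

public class TitleDecodingSketch {

    // Hypothetical helper: decode the raw bytes as UTF-16LE (an assumption
    // about the on-disk encoding) and drop any trailing NUL terminators.
    static String decodeUtf16Le(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_16LE);
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    // The failure mode: the same bytes read as one character per byte.
    static String decodeSingleByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        byte[] raw = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_16LE);

        // Round-trips to the original two characters.
        System.out.println(decodeUtf16Le(raw));

        // Each two-byte code unit is split into two unrelated characters.
        System.out.println(decodeSingleByte(raw).length());
    }
}
```

Under that assumption, any multi-byte title decoded one byte at a time comes out as unrelated characters, which matches the symptom described in the issue.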

> Unable to extract Chinese content in onenote
> --------------------------------------------
>
>                 Key: TIKA-4303
>                 URL: https://issues.apache.org/jira/browse/TIKA-4303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: lqangi
>            Priority: Major
>         Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese
> using Tika, the Chinese part of the file could not be extracted; only the
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated. As described in
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], it seems to
> extract historical versions of the data along with the current one. I don't
> know whether that issue (TIKA-3970) has been fixed (I see that the code has
> been committed on GitHub, but it doesn't seem to have completely solved the
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> OneNote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>
> To reproduce this problem, use version 2.8.0 of the Tika App to open the
> attachment "Chinese-Notes.one" and check whether the Chinese content in the
> file is extracted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)