[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877638#comment-17877638 ]
lqangi commented on TIKA-3970:
------------------------------

Hi all, has this issue been resolved? Versions 2.8.0 and later still produce duplicate content, as mentioned in [TIKA-4303|https://issues.apache.org/jira/browse/TIKA-4303].

> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However, in the
> displayed document, it occurs only once.
> The original text in this document is only about 12K characters, but the
> text extracted by Tika is over 300K.
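
For anyone re-checking this against a newer release, the duplicate count described above can be verified without reading the output by eye. The following is only a minimal sketch: it assumes the plain text has already been extracted and saved to a file (for example the attached lyrics.txt, assumed here to be UTF-8), and the script name and the `count_phrase` helper are hypothetical, not part of Tika.

```python
import sys
from pathlib import Path


def count_phrase(text: str, phrase: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of `phrase` in `text`.

    Runs of whitespace (including line breaks) are collapsed to a single
    space first, so a phrase wrapped across two lines is still counted.
    """
    needle = " ".join(phrase.split())
    if not needle:
        raise ValueError("phrase must contain at least one non-blank character")
    normalized = " ".join(text.split())
    return normalized.count(needle)


if __name__ == "__main__":
    # Hypothetical usage: python count_phrase.py lyrics.txt "Sunday Morning"
    path, phrase = sys.argv[1], sys.argv[2]
    # Assumption: the extracted text is UTF-8; undecodable bytes are replaced.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    print(f"{phrase!r}: {count_phrase(text, phrase)} occurrence(s), "
          f"{len(text)} characters of extracted text")
```

Running this on the text extracted by two different Tika versions gives a quick way to compare both the phrase count and the overall output size reported in the description.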