[
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500066#comment-17500066
]
Tim Allison edited comment on TIKA-3684 at 3/2/22, 11:29 AM:
-------------------------------------------------------------
Thank you for sharing a triggering document. That is critical.
If you use the /rmeta endpoint (attached), you can see that the document
contains an embedded thumbnail.emf, which also contains the text, and that
file in turn contains another attachment, a .wmf file, that contains the
text as well. That accounts for the three copies you get from /tika.
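To see which embedded file contributes which copy of the text, the /rmeta
output can be summarized with a few lines of Python. This is only a sketch:
it assumes the output was saved to a file (for example with
curl -T example.docx http://localhost:9998/rmeta/text > example.json) and
that each metadata object uses the X-TIKA:embedded_resource_path and
X-TIKA:content keys. The function name summarize_rmeta is made up for
illustration.

```python
import json


def summarize_rmeta(rmeta_json):
    """Return a (embedded path, extracted text length) pair per metadata object.

    Assumes /rmeta-style output: a JSON list with one object per document,
    the container first, followed by its embedded files.
    """
    summary = []
    for doc in json.loads(rmeta_json):
        # Assumption: the container document carries no embedded path key.
        path = doc.get("X-TIKA:embedded_resource_path", "(container)")
        content = doc.get("X-TIKA:content") or ""
        summary.append((path, len(content.strip())))
    return summary


# Usage (hypothetical file name):
#   with open("example.json", encoding="utf-8") as f:
#       for path, length in summarize_rmeta(f.read()):
#           print(path, length)
```

Entries with a similar non-zero length for the container, the .emf and the
.wmf would point to the same text being extracted three times.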
We didn't have parsers for emf/wmf back in 1.14, which is why that version
returns the text only once. You can turn off those parsers via
tika-config.xml (I'm happy to give an example if needed). The risk is that
emf files can contain attachments, so with those parsers off you may miss
information.
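As a rough sketch of what such a tika-config.xml might look like: the class
names below (org.apache.tika.parser.microsoft.EMFParser and
org.apache.tika.parser.microsoft.WMFParser) are assumptions about where the
emf/wmf parsers live in the 2.x line, so please verify them against the Tika
version you are running before relying on this.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parser set, minus the two parsers excluded below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Assumed class names; check them against your Tika version. -->
      <parser-exclude class="org.apache.tika.parser.microsoft.EMFParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
    </parser>
  </parsers>
</properties>
```

With a config along these lines, anything embedded inside an emf file would
no longer be extracted, which is the trade-off mentioned above.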
> Extract text returns the text multiple times
> --------------------------------------------
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json
>
>
> We are using tika docker container as a linux service, when I want to extract
> text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Notice: We also have tika server v1.14, and this version returns the text
> just as expected.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)