[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

Tim Allison (Jira) Wed, 02 Mar 2022 06:04:10 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500168#comment-17500168 ]


Tim Allison commented on TIKA-3684:
-----------------------------------

We could also parameterize the WMF and EMF parsers to turn off text extraction.
It _feels_ like WMF and EMF used to carry new information (images, etc.)
within a page, but more recently I'm seeing them used as thumbnails.

Another option for improvement would be to allow configuration of the embedded
parser to skip thumbnails.  This EMF is correctly identified as a thumbnail in
its metadata in the /rmeta output.  I cannot guarantee that all EMF/WMF files
will be identified as such, though.
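
Until such an option exists, a client could approximate "skip thumbnails" by
filtering the /rmeta JSON itself.  The following is only a rough sketch, not
the proposed parser configuration: the function name is hypothetical, and the
two metadata key names are assumptions that should be checked against your own
/rmeta output before relying on them.

```python
import json

# Assumed key names; verify both against real /rmeta output.
CONTENT_KEY = "X-TIKA:content"
RESOURCE_TYPE_KEY = "X-TIKA:embedded_resource_type"


def text_without_thumbnails(rmeta_json, type_key=RESOURCE_TYPE_KEY,
                            content_key=CONTENT_KEY):
    """Join the extracted text of every /rmeta entry not flagged as a thumbnail.

    rmeta_json is the JSON string returned by /rmeta: a list with one
    metadata object per container or embedded document.
    """
    parts = []
    for doc in json.loads(rmeta_json):
        value = doc.get(type_key, "")
        # A metadata value may be a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        if any(str(v).upper() == "THUMBNAIL" for v in values):
            continue  # skip the thumbnail's (duplicated) text
        text = doc.get(content_key)
        if text:
            parts.append(text.strip())
    return "\n".join(parts)
```

As noted above, this only helps for embedded files that are actually flagged
as thumbnails in their metadata; unflagged EMF/WMF files would still contribute
their text.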

> Extract text returns the text multiple times
> --------------------------------------------
>
>                 Key: TIKA-3684
>                 URL: https://issues.apache.org/jira/browse/TIKA-3684
>             Project: Tika
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 2.1.0
>            Reporter: Naama Hophstatder
>            Priority: Major
>         Attachments: example.docx, example.json
>
>
> We are running the Tika Docker container as a Linux service. When I extract
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Note: We also have Tika server v1.14, and that version returns the text
> just as expected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
