[jira] [Commented] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Barrett (Jira) Fri, 28 Jan 2022 03:44:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483721#comment-17483721
 ]


Tim Barrett commented on TIKA-3657:
-----------------------------------

[^scenario traces.txt]

The attachment contains traces of three scenarios:

1) Running under Tomcat in Eclipse. This works.
2) Running under Tomcat outside Eclipse. From the moment the AutoDetectParser 
is created, its list of detectors is minimal.
3) Running under Tomcat outside Eclipse, this time without adding an 
EmbeddedDocumentExtractor to the context. Although the list of detectors 
remains minimal throughout, the document is still fully parsed. In this 
scenario, however, any embedded documents are parsed inline, which is not the 
behaviour we require from having an EmbeddedDocumentExtractor in place.
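
One possible way to narrow down why the detector list differs between the 
scenarios is to check which service registration files each class loader can 
see. This is only a sketch. It assumes the detectors are discovered through 
META-INF/services/org.apache.tika.detect.Detector files on the classpath, 
which is my understanding of Tika's default loading and is not confirmed by 
the traces. ServiceFileProbe and findServiceFiles are hypothetical names, and 
only JDK calls are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ServiceFileProbe {

    // Returns every META-INF/services registration file for serviceName
    // that the given class loader can see (empty list if there are none).
    public static List<URL> findServiceFiles(ClassLoader loader, String serviceName) {
        // A null loader denotes the bootstrap loader; fall back to the system loader.
        ClassLoader effective = (loader != null) ? loader : ClassLoader.getSystemClassLoader();
        try {
            return Collections.list(effective.getResources("META-INF/services/" + serviceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prints how many registration files a loader sees, then each file's URL.
    private static void report(String label, ClassLoader loader, String serviceName) {
        List<URL> files = findServiceFiles(loader, serviceName);
        System.out.println(label + " (" + loader + "): " + files.size() + " file(s)");
        for (URL url : files) {
            System.out.println("  " + url);
        }
    }

    public static void main(String[] args) {
        // Assumption: this is the service name under which detectors are registered.
        String service = "org.apache.tika.detect.Detector";
        report("context class loader", Thread.currentThread().getContextClassLoader(), service);
        report("probe's own class loader", ServiceFileProbe.class.getClassLoader(), service);
    }
}
```

Calling the two report lines from inside the web application in scenario 1 and 
scenario 2, and comparing the output, should show whether the registration 
files are visible to one class loader but not the other.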

> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, dependency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: scenario traces.txt, tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker, the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set, the same problem occurs. The same problem also occurs if the 
> standard Tika EmbeddedDocumentExtractor is used. Our Docker image contains 
> our application's code, which uses Tika, as well as Apache DS. The problem 
> occurs when running Docker on Ubuntu, macOS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
