[ https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485914#comment-17485914 ]
Tim Allison commented on TIKA-3657:
-----------------------------------
I tried throwing a SecurityException from a couple of points in the code, on
the theory that the SecurityManager might somehow be turned on in one version
of Tomcat but not another. The SecurityExceptions percolated up and prevented
the TikaConfig from loading entirely, so you should see some catastrophic
messages if that's happening.
My other theory was that a ClassNotFoundException (for example, if you
were somehow excluding the apple.BPListDetector class from loading) might
prevent the other detectors from loading. I think it did at one point, but I
thought we had fixed that.
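To illustrate the second theory, here is a minimal sketch of the behavior I
would expect after such a fix: one class that fails to load is recorded and
skipped, and the remaining classes still load. This is not Tika's actual
loading code. TolerantLoader, loadAll and the class name
example.missing.Detector are hypothetical names made up for the illustration,
and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; NOT Tika's actual service-loading code.
final class TolerantLoader {

    // Tries to load every class name. A failure is recorded in `failures`
    // and skipped, so it cannot stop the remaining classes from loading.
    static List<Class<?>> loadAll(List<String> classNames, List<String> failures) {
        List<Class<?>> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                loaded.add(Class.forName(name));
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError and similar.
                failures.add(name + ": " + e);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        List<Class<?>> loaded = loadAll(
                Arrays.asList(
                        "java.lang.String",
                        "example.missing.Detector", // assumed not on the classpath
                        "java.util.ArrayList"),
                failures);
        // Prints: loaded=2 failed=1
        System.out.println("loaded=" + loaded.size() + " failed=" + failures.size());
        for (String failure : failures) {
            System.out.println("skipped " + failure);
        }
    }
}
```

If the loader instead let the first exception propagate, nothing after the
missing class would be loaded, which is the failure mode I had in mind.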
> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
> Key: TIKA-3657
> URL: https://issues.apache.org/jira/browse/TIKA-3657
> Project: Tika
> Issue Type: Bug
> Components: config, core, depedency
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Tim Barrett
> Priority: Major
> Fix For: 2.2.2
>
> Attachments: POIFSContainerDetector.java, scenario traces.txt,
> tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years.
> This also works under Tika 2.2.0 when running in development environments
> (Eclipse, Apache Tomcat). However, when running under Docker the text
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under
> Docker, the Microsoft documents are fully parsed, so this problem was
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via
> context.set the same problem occurs. Also, if the standard Tika Embedded
> Document Extractor is used the same problem occurs. Our Docker image contains
> our application's code which uses Tika, as well as Apache DS. The problem
> occurs running Docker on Ubuntu, Mac OS and Windows.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)