[ https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484789#comment-17484789 ]
Tim Allison commented on TIKA-3657:
-----------------------------------
Attempt #5 to reproduce failed.
Take a look at this repo:
https://github.com/tballison/tika-addons/tree/main/tika-tomcat
This is built with Java 11.
To package: {{mvn package}}
To build+run: from the tika-tomcat/docker directory: {{cp ../target/SampleServlet.war . && docker build -t sample . && docker run -itd -p 8085:8080 --name sample sample}}
To send a file: {{curl -T testWORD.docx http://localhost:8085/SampleServlet/MyServlet/}}
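If curl is not available, the same upload can be sketched with the JDK's built-in HTTP client (Java 11+). This is only an illustration, not part of the repo: the class name {{SendFile}} and the helper {{buildPut}} are hypothetical. It assumes the servlet accepts a PUT with the raw file bytes as the body, and it appends the file name to the URL because {{curl -T}} does that when the URL ends in a slash.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SendFile {

    // Hypothetical helper: builds a PUT request whose body is the file's bytes.
    // The file name is appended to the base URL, mirroring what `curl -T` does
    // when the URL ends with '/'. Assumes the file name needs no URL escaping.
    static HttpRequest buildPut(URI base, Path file) throws IOException {
        URI target = base.resolve(file.getFileName().toString());
        return HttpRequest.newBuilder(target)
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Endpoint taken from the curl command above; the port matches `-p 8085:8080`.
        URI base = URI.create("http://localhost:8085/SampleServlet/MyServlet/");
        Path file = Path.of(args.length > 0 ? args[0] : "testWORD.docx");

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildPut(base, file), HttpResponse.BodyHandlers.ofString());

        // Print the status and whatever the servlet returns (detector list, extracted text).
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Run it with {{java SendFile.java testWORD.docx}} while the container is up.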
Output includes:
{noformat}
detector: org.apache.tika.detect.CompositeDetector@79be458
detector: org.apache.tika.detect.OverrideDetector@56f5a53d
detector: org.gagravarr.tika.OggDetector@396d99d7
detector: org.apache.tika.detect.apple.BPListDetector@3d2fe3c2
detector: org.apache.tika.detect.microsoft.POIFSContainerDetector@65ce691b
detector: org.apache.tika.detect.ole.MiscOLEDetector@1ee9e4d2
detector: org.apache.tika.detect.zip.DefaultZipContainerDetector@4b9f65f9
detector: org.apache.tika.mime.MimeTypes@96cbebb
{noformat}
Then the content is correctly extracted.
Are you able to modify my repo to reproduce the error? Again, sorry and thank
you!
> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
> Key: TIKA-3657
> URL: https://issues.apache.org/jira/browse/TIKA-3657
> Project: Tika
> Issue Type: Bug
> Components: config, core, depedency
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Tim Barrett
> Priority: Major
> Fix For: 2.2.2
>
> Attachments: scenario traces.txt, tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years.
> This also works under Tika 2.2.0 when running in development environments
> (Eclipse, Apache Tomcat). However, when running under Docker, the text
> within Microsoft documents (Word, etc.) is not parsed. Under Tika 2.1.0, under
> Docker, the Microsoft documents are fully parsed, so this problem was
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via
> context.set the same problem occurs. Also, if the standard Tika Embedded
> Document Extractor is used the same problem occurs. Our Docker image contains
> our application's code which uses Tika, as well as Apache DS. The problem
> occurs running Docker on Ubuntu, Mac OS and Windows.
>