Flushot opened a new pull request, #27: URL: https://github.com/apache/tika-docker/pull/27
The Tesseract OCR image preprocessor is broken in the current Docker image because ImageMagick is not installed. To reproduce, set `enableImagePreprocessing` to `true` in tika-config.xml:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="enableImagePreprocessing" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

When you then try to process a document, you get this error, and the parser falls back to the original file:

```
org.apache.tika.parser.ocr.TesseractOCRParser User has selected to preprocess images, but I can't find ImageMagick.Backing off to original file.
```
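To confirm the cause independently of Tika, one can check whether an ImageMagick executable is visible on the `PATH`. The following is a minimal sketch, not Tika's own detection logic: it assumes ImageMagick is exposed as a `convert` binary and that Python 3 is available in the environment being checked (the Tika image itself may not ship it). The helper names `find_imagemagick` and `imagemagick_version` are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def find_imagemagick(prog: str = "convert") -> Optional[str]:
    """Return the full path of the executable named `prog`, or None if it is not on PATH.

    Assumption: ImageMagick is installed under the name `convert`.
    """
    return shutil.which(prog)


def imagemagick_version(prog: str = "convert") -> Optional[str]:
    """Return the first line printed by `<prog> -version`, or None if `prog` is missing."""
    path = find_imagemagick(prog)
    if path is None:
        return None
    # check=False: a non-zero exit code should not raise; we only inspect stdout.
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = imagemagick_version()
    if version is None:
        print("ImageMagick not found on PATH; image preprocessing would be skipped")
    else:
        print(version)
```

If the script reports that ImageMagick is not found, that matches the error above, and installing ImageMagick in the image should be the fix to verify.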