[
https://issues.apache.org/jira/browse/TIKA-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085768#comment-18085768
]
Adrian Bird commented on TIKA-4747:
-----------------------------------
There is another thing.
9. Exception when extracting images from a PDF file with Tesseract
Note that ImageMagick is not defined in my configuration for this issue.
This is the content I see inside the output file when an image is encountered
(I can provide the full stack trace if needed):
{noformat}
"Content-Type" : "image/png",
"Content-Type-Magic-Detected" : "image/png",
"X-TIKA:EXCEPTION:embedded_exception" :
"org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
err msg: Error in findFileFormatStream: truncated file\r\nError during
processing.\r\n\r\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:564)\r\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:517)\r\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:392)\r\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:296)\r\n\tat
...
"X-TIKA:EXCEPTION:embedded_parser" : "org.apache.tika.parser.AutoDetectParser",
"X-TIKA:Parsed-By" : [ "org.apache.tika.parser.CompositeParser",
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.image.ImageParser",
"org.apache.tika.parser.ocr.TesseractOCRParser" ],
"X-TIKA:embedded_depth" : "1",
"X-TIKA:embedded_id" : "1",
"X-TIKA:embedded_id_path" : "/1",
"X-TIKA:embedded_resource_path" : "/image-0.png",
"X-TIKA:final_embedded_resource_path" : "/image-0.png",
"X-TIKA:parse_time_millis" : "190",
"X-TIKA:resourceName" : "image-0.png",
"X-TIKA:resourceNameExtensionInferred" : "true",
"embeddedResourceType" : "INLINE",
"imagereader:NumImages" : "1",
"pdf:hasXMP" : "false",
"tiff:ImageLength" : "282",
"tiff:ImageWidth" : "290",
"tika_pg:page_number" : "3" }
{noformat}
> tika-4.0.0-alpha1 - PDF and Tesseract Parser Comments
> -----------------------------------------------------
>
> Key: TIKA-4747
> URL: https://issues.apache.org/jira/browse/TIKA-4747
> Project: Tika
> Issue Type: Bug
> Affects Versions: 4.0.0
> Environment: Windows 11
> Reporter: Adrian Bird
> Priority: Major
>
> I've tried the PDF and Tesseract parsers, independently and together, and here
> are some comments.
> 1. pdf-parser Full Configuration example
> The [Full Configuration
> Example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/pdf-parser.html#_full_configuration]
> has an unknown property "maxPages".
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to parse PDFParserConfig
> configuration: Unrecognized field "maxPages" (class
> org.apache.tika.parser.pdf.PDFParserConfig), not marked as ignorable (39
> known properties: "ocrStrategyAuto", "imageGraphicsEngineFactory",
> "detectAngles", "ocrMaxPagesToOcr", "ignoreContentStreamSpaceGlyphs",
> "accessCheckMode", "extractBookmarksText", "spacingTolerance",
> "extractUniqueInlineImagesOnly", "suppressDuplicateOverlappingText",
> "extractInlineImages", "enableAutoSpace", "imageGraphicsEngineFactoryClass",
> "extractInlineImageMetadataOnly", "extractAnnotationText", "sortByPosition",
> "ocrDPI", "setKCMS", "ocrRenderingStrategy", "ocrMaxImagePixels",
> "parseIncrementalUpdates", "extractMarkedContent", "maxMainMemoryBytes",
> "imageStrategy", "throwOnEncryptedPayload", "ocrStrategy", "ocrImageFormat",
> "extractAcroFormContent", "ocrImageType", "extractFontNames",
> "averageCharTolerance", "dropThreshold", "extractIncrementalUpdateInfo",
> "maxIncrementalUpdates", "ifXFAExtractOnlyXFA", "ocr", "ocrImageQuality",
> "extractActions", "catchIntermediateIOExceptions")
> at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION`
> disabled); line: 1, column: 626] (through reference chain:
> org.apache.tika.parser.pdf.PDFParserConfig["maxPages"])
> {code}
> 2. Refers to 1 above
> In the list of 39 known properties the following do not appear in the full
> configuration example:
> {noformat}
> "imageGraphicsEngineFactory",
> "imageGraphicsEngineFactoryClass",
> "ocrMaxImagePixels",
> "ocrMaxPagesToOcr",
> {noformat}
> 3. Refers to 1 above
> Is there a description of the properties somewhere?
> 4. Tesseract OCR Full Configuration example
> The [Full Configuration
> example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/tesseract-ocr-parser.html#_full_configuration]
> didn't work for me.
> I saw the following message:
> {noformat}
> DEBUG [main] 10:48:58,549
> org.apache.tika.config.loader.AbstractSpiComponentLoader Skipping SPI parsers
> - 'default-parser' not in config
> {noformat}
> and decided to add the following:
> {code:java}
> {
> "default-parser": {}
> }
> {code}
> That fixed the problem.
> The pdf-parser worked without this 'default-parser' entry.
> 5. Refers to 4 above
> Is there a description of the properties somewhere?
> Also, is there any documentation that says ImageMagick is an optional component?
> 6. Disabling Tesseract
> A message is output that refers to the old XML way of disabling Tesseract:
> {noformat}
> INFO [main] 09:32:13,811 org.apache.tika.parser.ocr.TesseractOCRParser
> Tesseract is installed and is being invoked. This can add greatly to
> processing time. If you do not want tesseract to be applied to your files
> see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> {noformat}
> 7. ImageMagick and Tesseract locations in Windows
> If I use a Windows path for the ImageMagick or Tesseract locations I get an
> exception (using / on Windows works ok):
> {noformat}
> "imageMagickPath": "C:\ImageMagick",
> "tessdataPath": "C:\Tesseract-OCR\tessdata",
> "tesseractPath": "C:\Tesseract-OCR",
> {noformat}
> gives the following for an invalid Tesseract location:
> {noformat}
> Exception in thread "main" java.io.IOException:
> com.fasterxml.jackson.core.JsonParseException: Unrecognized character escape
> 'T' (code 84)
> at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION`
> disabled); line: 30, column: 29]
> at
> org.apache.tika.async.cli.PluginsWriter.write(PluginsWriter.java:167)
> at
> org.apache.tika.async.cli.TikaAsyncCLI.processCommandLine(TikaAsyncCLI.java:117)
> at org.apache.tika.async.cli.TikaAsyncCLI.main(TikaAsyncCLI.java:93)
> at org.apache.tika.cli.TikaCLI.async(TikaCLI.java:301)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:261)
> {noformat}
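A note on item 7: in JSON, a backslash inside a string starts an escape sequence, and \T is not one of the defined escapes, which is consistent with the "Unrecognized character escape 'T'" message above. The sketch below uses Python's standard json module (not Tika or Jackson, so it only illustrates the JSON rule) to show that doubling the backslashes or using forward slashes both parse. The helper name try_parse is made up for this example.

```python
import json

# Config fragment as typed in the report: single backslashes inside a JSON string.
bad = r'{"tesseractPath": "C:\Tesseract-OCR"}'
# Two valid spellings of the same location.
escaped = r'{"tesseractPath": "C:\\Tesseract-OCR"}'
forward = '{"tesseractPath": "C:/Tesseract-OCR"}'


def try_parse(text):
    """Return the parsed path, or None if the text is not valid JSON."""
    try:
        return json.loads(text)["tesseractPath"]
    except json.JSONDecodeError:
        return None


print(try_parse(bad))      # None, because \T is not a legal JSON escape
print(try_parse(escaped))  # C:\Tesseract-OCR
print(try_parse(forward))  # C:/Tesseract-OCR
```

So either "C:\\Tesseract-OCR" or "C:/Tesseract-OCR" should get past the JSON parser; whether Tika then resolves the location correctly is a separate question.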
> 8. ImageMagick Failures
> Whenever Tika ran ImageMagick, it returned an exit code of 1.
> With ImageMagick on the path and no "imageMagickPath" key set, I got these messages:
> {noformat}
> Use "magick" instead of the deprecated command "magick convert".
> WARN [main] 09:39:35,333 org.apache.tika.parser.ocr.ImagePreprocessor
> ImageMagick failed (commandline: [magick, convert, -density, 300, -depth, 4,
> -colorspace, gray, -filter, triangle, -resize, 200%,
> C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp,
> C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp])
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1
> (Exit value: 1)
> {noformat}
> With ImageMagick not on the path and the "imageMagickPath" key set, I got these messages:
> {noformat}
> magick: no decode delegate for this image format `' @
> error/constitute.c/ReadImage/746.
> WARN [main] 10:09:59,780 org.apache.tika.parser.ocr.ImagePreprocessor
> ImageMagick failed (commandline: [C:\ImageMagick\magick, convert, -density,
> 300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%,
> C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp,
> C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp])
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1
> (Exit value: 1)
> {noformat}
> Is the fact that the same filename is used twice at the end a cause for
> concern?
> I know very little about ImageMagick but could reproduce the error by running
> this outside of Tika:
> {noformat}
> %IMAGEMAGICK_HOME%\magick convert -density 300 -depth 4 -colorspace gray
> -filter triangle -resize 200% image.jpg image.png
> {noformat}
> I get the error:
> {noformat}
> magick: no decode delegate for this image format `' @
> error/constitute.c/ReadImage/746.
> {noformat}
> If I change it by removing the 'convert' and putting the source image at the
> start:
> {noformat}
> %IMAGEMAGICK_HOME%\magick image.jpg -density 300 -depth 4 -colorspace gray
> -filter triangle -resize 200% image.png
> {noformat}
> it runs successfully.
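The working invocation in item 8 differs from the failing one in two ways: there is no 'convert', and the source image comes before the options. As a minimal sketch of that ordering (plain Python, not Tika's ImagePreprocessor code; build_magick_command is a hypothetical helper, and the file names are the ones from the manual test), the argument list could be assembled like this:

```python
def build_magick_command(magick, source, target):
    """Build the argument list in the order that ran successfully in the report:
    executable, input file, options, output file, with no 'convert' subcommand."""
    return [
        magick, source,
        "-density", "300",
        "-depth", "4",
        "-colorspace", "gray",
        "-filter", "triangle",
        "-resize", "200%",
        target,
    ]


cmd = build_magick_command("magick", "image.jpg", "image.png")
# Prints: magick image.jpg -density 300 -depth 4 -colorspace gray -filter triangle -resize 200% image.png
print(" ".join(cmd))
```

The sketch only builds the list and does not run ImageMagick. It also uses distinct input and output names, unlike the temp-file command lines in the log, where the same path appears twice.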