[jira] [Commented] (TIKA-4747) tika-4.0.0-alpha1 - PDF and Tesseract Parser Comments

ASF GitHub Bot (Jira) Wed, 03 Jun 2026 21:02:22 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085975#comment-18085975 ]


ASF GitHub Bot commented on TIKA-4747:
--------------------------------------

Copilot commented on code in PR #2865:
URL: https://github.com/apache/tika/pull/2865#discussion_r3353346876


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/AndroidBinaryXMLTest.java:
##########
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.parser;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
+
+import java.io.ByteArrayOutputStream;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.charset.StandardCharsets;
+import java.util.List;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipOutputStream;
+
+import org.junit.jupiter.api.Test;
+
+import org.apache.tika.TikaTest;
+import org.apache.tika.io.TikaInputStream;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.metadata.TikaCoreProperties;
+
+/**
+ * Android Binary XML (AXML) is the compiled binary form of AndroidManifest.xml and the
+ * res/*.xml resources packed inside an APK. Those entries keep a .xml extension and live
+ * inside the (zip) APK, so before TIKA-4747 the *.xml glob caused them to be detected as
+ * application/xml and handed to the XML parser, which failed on the binary header with
+ * "Invalid byte 1 of 1-byte UTF-8 sequence". This was a large source of exceptions in
+ * regression runs over APK-heavy corpora.
+ *
+ * <p>Real corpus APKs can't be committed, so this builds an equivalent zip in memory:
+ * two compiled (AXML) entries plus one genuine text-XML entry under assets/ as a control,
+ * and asserts the AXML entries are detected as application/vnd.android.axml and produce no
+ * exception, while the text-XML entry is still application/xml.
+ */
+public class AndroidBinaryXMLTest extends TikaTest {
+
+    private static final String AXML = "application/vnd.android.axml";
+
+    /**
+     * Minimal but structurally-plausible Android Binary XML header:
+     * ResChunk_header {type=RES_XML_TYPE(0x0003), headerSize=0x0008, size=&lt;total&gt;}
+     * followed by a zeroed ResStringPool_header. Only the leading 4 bytes (0x00080003 LE)
+     * are the detection signature; the following 4 bytes are the per-file size.
+     */
+    private static byte[] axmlBytes() {
+        ByteBuffer bb = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
+        bb.putShort((short) 0x0003);   // RES_XML_TYPE
+        bb.putShort((short) 0x0008);   // headerSize
+        bb.putInt(0x00000038);         // string pool chunk size
+        // remaining bytes (string/style counts, flags, offsets) left zero

Review Comment:
   The inline comment describes the 32-bit field as a "string pool chunk size", 
but at this position it is the `ResChunk_header.size` (total chunk size) for 
the `RES_XML_TYPE` chunk. This mismatch makes the test data harder to 
understand/maintain.
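
   A minimal sketch of how the header could be written with comments that match the `ResChunk_header` field names. The class name `AxmlHeaderSketch` and the `totalSize` parameter are hypothetical, and the sketch assumes the chunk spans the whole buffer, which is not necessarily what the test in this PR intends:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the code in this PR.
final class AxmlHeaderSketch {

    /**
     * Builds a zero-padded buffer that starts with a ResChunk_header for an
     * RES_XML_TYPE chunk. Assumption: the chunk covers the whole buffer, so
     * ResChunk_header.size is set to totalSize.
     */
    static byte[] axmlBytes(int totalSize) {
        if (totalSize < 8) {
            throw new IllegalArgumentException("need at least 8 bytes for ResChunk_header");
        }
        ByteBuffer bb = ByteBuffer.allocate(totalSize).order(ByteOrder.LITTLE_ENDIAN);
        bb.putShort((short) 0x0003);   // ResChunk_header.type = RES_XML_TYPE
        bb.putShort((short) 0x0008);   // ResChunk_header.headerSize
        bb.putInt(totalSize);          // ResChunk_header.size = total chunk size, header included
        // remaining bytes are left zero
        return bb.array();
    }

    public static void main(String[] args) {
        byte[] bytes = axmlBytes(64);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            hex.append(String.format("%02x ", bytes[i]));
        }
        // Prints the eight header bytes in file order (little-endian).
        System.out.println(hex.toString().trim());
    }
}
```

   Naming the fields after the struct members keeps the 4-byte detection signature (type plus headerSize) visibly separate from the per-file size that follows it.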





> tika-4.0.0-alpha1 - PDF and Tesseract Parser Comments
> -----------------------------------------------------
>
>                 Key: TIKA-4747
>                 URL: https://issues.apache.org/jira/browse/TIKA-4747
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>         Environment: Windows 11
>            Reporter: Adrian Bird
>            Priority: Major
>
> I've tried the PDF and Tesseract parsers, independently and together, and
> here are some comments.
> 1. pdf-parser Full Configuration example
> The [Full Configuration Example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/pdf-parser.html#_full_configuration] has an unknown property "maxPages".
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to parse PDFParserConfig 
> configuration: Unrecognized field "maxPages" (class 
> org.apache.tika.parser.pdf.PDFParserConfig), not marked as ignorable (39 
> known properties: "ocrStrategyAuto", "imageGraphicsEngineFactory", 
> "detectAngles", "ocrMaxPagesToOcr", "ignoreContentStreamSpaceGlyphs", 
> "accessCheckMode", "extractBookmarksText", "spacingTolerance", 
> "extractUniqueInlineImagesOnly", "suppressDuplicateOverlappingText", 
> "extractInlineImages", "enableAutoSpace", "imageGraphicsEngineFactoryClass", 
> "extractInlineImageMetadataOnly", "extractAnnotationText", "sortByPosition", 
> "ocrDPI", "setKCMS", "ocrRenderingStrategy", "ocrMaxImagePixels", 
> "parseIncrementalUpdates", "extractMarkedContent", "maxMainMemoryBytes", 
> "imageStrategy", "throwOnEncryptedPayload", "ocrStrategy", "ocrImageFormat", 
> "extractAcroFormContent", "ocrImageType", "extractFontNames", 
> "averageCharTolerance", "dropThreshold", "extractIncrementalUpdateInfo", 
> "maxIncrementalUpdates", "ifXFAExtractOnlyXFA", "ocr", "ocrImageQuality", 
> "extractActions", "catchIntermediateIOExceptions")
>  at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` 
> disabled); line: 1, column: 626] (through reference chain: 
> org.apache.tika.parser.pdf.PDFParserConfig["maxPages"])
> {code}
> 2. Refers to 1 above
> In the list of 39 known properties the following do not appear in the full 
> configuration example:
> {noformat}
> "imageGraphicsEngineFactory",
> "imageGraphicsEngineFactoryClass",
> "ocrMaxImagePixels",
> "ocrMaxPagesToOcr",
> {noformat}
> 3. Refers to 1 above
> Is there a description of the properties somewhere?
> 4. Tesseract OCR Full Configuration example
> The [Full Configuration example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/tesseract-ocr-parser.html#_full_configuration] didn't work for me.
> I saw the following message:
> {noformat}
> DEBUG [main] 10:48:58,549 
> org.apache.tika.config.loader.AbstractSpiComponentLoader Skipping SPI parsers 
> - 'default-parser' not in config
> {noformat}
> and decided to add the following:
> {code:java}
>     {
>       "default-parser": {}
>     }
> {code}
> That fixed the problem.
> The pdf-parser worked without this 'default-parser' entry.
> 5. Refers to 4 above
> Is there a description of the properties somewhere?
> Also, is there any documentation that says ImageMagick is an optional component?
> 6. Disabling Tesseract
> A message is output that refers to the old XML way of disabling Tesseract:
> {noformat}
> INFO  [main] 09:32:13,811 org.apache.tika.parser.ocr.TesseractOCRParser 
> Tesseract is installed and is being invoked. This can add greatly to 
> processing time.  If you do not want tesseract to be applied to your files 
> see: 
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> {noformat}
> 7. ImageMagick and Tesseract locations in Windows
> If I use a Windows path for the ImageMagick or Tesseract locations I get an 
> exception (using / on Windows works ok):
> {noformat}
>         "imageMagickPath": "C:\ImageMagick",
>         "tessdataPath": "C:\Tesseract-OCR\tessdata",
>         "tesseractPath": "C:\Tesseract-OCR",
> {noformat}
> gives the following for an invalid Tesseract location:
> {noformat}
> Exception in thread "main" java.io.IOException: 
> com.fasterxml.jackson.core.JsonParseException: Unrecognized character escape 
> 'T' (code 84)
>  at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` 
> disabled); line: 30, column: 29]
>         at 
> org.apache.tika.async.cli.PluginsWriter.write(PluginsWriter.java:167)
>         at 
> org.apache.tika.async.cli.TikaAsyncCLI.processCommandLine(TikaAsyncCLI.java:117)
>         at org.apache.tika.async.cli.TikaAsyncCLI.main(TikaAsyncCLI.java:93)
>         at org.apache.tika.cli.TikaCLI.async(TikaCLI.java:301)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:261)
> {noformat}
> 8. ImageMagick Failures
> When Tika ran ImageMagick it always returned an error code of 1.
> ImageMagick on path and no "imageMagickPath" key set gave these messages:
> {noformat}
> Use "magick" instead of the deprecated command "magick convert".
> WARN  [main] 09:39:35,333 org.apache.tika.parser.ocr.ImagePreprocessor 
> ImageMagick failed (commandline: [magick, convert, -density, 300, -depth, 4, 
> -colorspace, gray, -filter, triangle, -resize, 200%, 
> C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp, 
> C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp])
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> {noformat}
> ImageMagick not on path and "imageMagickPath" key set gave these messages:
> {noformat}
> magick: no decode delegate for this image format `' @ 
> error/constitute.c/ReadImage/746.
> WARN  [main] 10:09:59,780 org.apache.tika.parser.ocr.ImagePreprocessor 
> ImageMagick failed (commandline: [C:\ImageMagick\magick, convert, -density, 
> 300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%, 
> C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp, 
> C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp])
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> {noformat}
> Is the fact that the same filename is used twice at the end a cause for 
> concern?
> I know very little about ImageMagick but could reproduce the error by running 
> this outside of Tika:
> {noformat}
> %IMAGEMAGICK_HOME%\magick convert -density 300 -depth 4 -colorspace gray 
> -filter triangle -resize 200% image.jpg image.png
> {noformat}
> I get the error:
> {noformat}
> magick: no decode delegate for this image format `' @ 
> error/constitute.c/ReadImage/746.
> {noformat}
> If I change it by removing the 'convert' and putting the source image at the 
> start:
> {noformat}
> %IMAGEMAGICK_HOME%\magick image.jpg -density 300 -depth 4 -colorspace gray 
> -filter triangle -resize 200% image.png
> {noformat}
> it runs successfully.
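
The reordering in the last command above (no `convert`, source image first, options next, output last) can be sketched in Java as a small command builder. This is a hypothetical illustration based only on the commands quoted in the report; `MagickCommandSketch` and `buildCommand` are made-up names and this is not Tika's `ImagePreprocessor` code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Tika's ImagePreprocessor.
final class MagickCommandSketch {

    /**
     * Builds an argument list in the order the reporter found to work:
     * executable, input file, processing options, output file.
     * The option values are copied from the command line in the report.
     */
    static List<String> buildCommand(String magickExecutable, String input, String output) {
        List<String> cmd = new ArrayList<>();
        cmd.add(magickExecutable);      // e.g. "magick", with no "convert" sub-command
        cmd.add(input);                 // source image comes before the options
        cmd.addAll(List.of(
                "-density", "300",
                "-depth", "4",
                "-colorspace", "gray",
                "-filter", "triangle",
                "-resize", "200%"));
        cmd.add(output);                // destination image comes last
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the assembled command line for inspection; nothing is executed.
        System.out.println(String.join(" ", buildCommand("magick", "image.jpg", "image.png")));
    }
}
```

The sketch takes separate input and output paths; whether reusing one temporary file for both contributes to the failure is not established by the report.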



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
