[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145420#comment-14145420 ]
Tim Allison commented on TIKA-1396:
-----------------------------------

When I run your file through a modified version of a test case:
{noformat}
    @Test
    public void testEmbeddedFilesInChildren2() throws Exception {
        RecursiveMetadataParser p = new RecursiveMetadataParser(new AutoDetectParser(), false);
        TikaInputStream tis = null;
        ParseContext context = new ParseContext();

        // Turn on inline image extraction and keep every occurrence, not only unique images.
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(true);
        config.setExtractUniqueInlineImagesOnly(false);
        context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);

        // Register the wrapper as the parser to use for embedded documents.
        context.set(org.apache.tika.parser.Parser.class, p);

        try {
            tis = TikaInputStream.get(
                    getResourceAsStream("/test-documents/tika_images.pdf"));
            // A write limit of -1 disables the BodyContentHandler character limit.
            p.parse(tis, new BodyContentHandler(-1), new Metadata(), context);
        } finally {
            if (tis != null) {
                tis.close();
            }
        }

        // Dump every metadata entry collected for the container and its embedded files.
        List<Metadata> metadatas = p.getAllMetadata();
        int i = 0;
        for (Metadata m : metadatas) {
            for (String n : m.names()) {
                for (String v : m.getValues(n)) {
                    System.out.println("metadata #" + i + ": " + n + " : " + v);
                }
            }
            i++;
        }
    }
{noformat}

I get this:
{noformat}
metadata #0: Dimension VerticalPixelSize : 0.35273367
metadata #0: Data BitsPerSample : 8 8 8
metadata #0: Compression Lossless : true
metadata #0: tiff:BitsPerSample : 8 8 8
metadata #0: width : 482
metadata #0: Dimension ImageOrientation : Normal
metadata #0: Dimension PixelAspectRatio : 1.0
metadata #0: Compression CompressionTypeName : deflate
metadata #0: X-Parsed-By : org.apache.tika.parser.DefaultParser
metadata #0: X-Parsed-By : org.apache.tika.parser.image.ImageParser
metadata #0: tiff:ImageLength : 424
metadata #0: Data SampleFormat : UnsignedIntegral
metadata #0: Dimension HorizontalPixelSize : 0.35273367
metadata #0: Transparency Alpha : none
metadata #0: height : 424
metadata #0: pHYs : pixelsPerUnitXAxis=2835, pixelsPerUnitYAxis=2835, unitSpecifier=meter
metadata #0: Chroma NumChannels : 3
metadata #0: Compression NumProgressiveScans : 1
metadata #0: Chroma ColorSpaceType : RGB
metadata #0: Data PlanarConfiguration : PixelInterleaved
metadata #0: embeddedResourceType : INLINE
metadata #0: tiff:ImageWidth : 482
metadata #0: IHDR : width=482, height=424, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none
metadata #0: Chroma BlackIsZero : true
metadata #0: Content-Type : image/png
metadata #1: dcterms:modified : 2014-09-23T18:53:17Z
metadata #1: meta:creation-date : 2014-09-23T18:53:17Z
metadata #1: meta:save-date : 2014-09-23T18:53:17Z
metadata #1: pdf:PDFVersion : 1.4
metadata #1: dcterms:created : 2014-09-23T18:53:17Z
metadata #1: Last-Modified : 2014-09-23T18:53:17Z
metadata #1: date : 2014-09-23T18:53:17Z
metadata #1: X-Parsed-By : org.apache.tika.parser.DefaultParser
metadata #1: X-Parsed-By : org.apache.tika.parser.pdf.PDFParser
metadata #1: modified : 2014-09-23T18:53:17Z
metadata #1: xmpTPg:NPages : 1
metadata #1: Creation-Date : 2014-09-23T18:53:17Z
metadata #1: pdf:encrypted : false
metadata #1: title : tika_images
metadata #1: created : Tue Sep 23 14:53:17 EDT 2014
metadata #1: dc:format : application/pdf; version=1.4
metadata #1: producer : Mac OS X 10.9.5 Quartz PDFContext
metadata #1: Content-Type : application/pdf
metadata #1: xmp:CreatorTool : Pages
metadata #1: Last-Save-Date : 2014-09-23T18:53:17Z
metadata #1: dc:title : tika_images
{noformat}

How are you calling your code? How are you setting the config?

> Embedded images in PDF documents
> --------------------------------
>
>                 Key: TIKA-1396
>                 URL: https://issues.apache.org/jira/browse/TIKA-1396
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: *OS:*
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>            Reporter: Damiano
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? Because if I do the same operation with a DOC
> document, Tika finds the image correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)