Re: pdfobjectstreamparser fail to parse content

mountain the blue Thu, 20 Feb 2025 01:08:17 -0800

Hi Andreas,

I forgot to tell ...


Looking at 13679_stream.dat from the zip, the first object number / offset
pairs are ...

5550 0 5551 5 5552 *7* 5553 *11* 5554 16

The actual data starts at offset 1990 (see /First 1990 in
13679_objstm.raw).
Looking at the content at offset 1990 in 13679_stream.dat, we have ...

offset - 1990: 0       7         11
data  :        ... 3505 4 248*03*505 ...

As we can see, the bold '0' is the end of object 5552, while object 5553
starts at the bold '3': there is no separator between the two tokens.
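
To illustrate the idea of not reading past the next object's offset, here is
a minimal, PDFBox-free sketch. The class and method names (ObjStmSlicer,
slice) are hypothetical and are not part of PDFBox. It assumes the stream is
already decoded, that the header offsets are in ascending order, and it
returns each object's raw text instead of a parsed COS object.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not PDFBox API: cuts a decoded object stream into
// one slice per object, using only the offsets from the stream header.
final class ObjStmSlicer {

    // decoded: the decoded stream bytes; n: value of /N; first: value of /First
    static Map<Long, String> slice(byte[] decoded, int n, int first) {
        // The header holds n pairs "objectNumber offset" in the first `first` bytes.
        String header = new String(decoded, 0, first, StandardCharsets.ISO_8859_1).trim();
        String[] tokens = header.split("\\s+");
        if (tokens.length < 2 * n) {
            throw new IllegalArgumentException("header has fewer than " + n + " pairs");
        }
        long[] numbers = new long[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++) {
            numbers[i] = Long.parseLong(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }

        Map<Long, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            // Object i ends where object i + 1 starts (assumes ascending offsets);
            // the last object runs to the end of the stream.
            int start = first + offsets[i];
            int end = (i + 1 < n) ? first + offsets[i + 1] : decoded.length;
            if (start > end || end > decoded.length) {
                throw new IllegalArgumentException("invalid offset for object " + numbers[i]);
            }
            String raw = new String(decoded, start, end - start, StandardCharsets.ISO_8859_1);
            result.put(numbers[i], raw.trim());
        }
        return result;
    }
}
```

For the input "6 0 4 1 12" with N = 2 and First = 8, this gives "1" for
object 6 and "2" for object 4, because the slice for object 6 stops at
offset 1 even though no white space follows it.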


On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <[email protected]>
wrote:

> Hi Andreas,
>
> Sorry for this delayed response.
>
> re: "Of course. it is better than nothing"
>
> I have uploaded a zip file containing:
> 1- the raw extract of the /ObjStm object and its related stream content
> 2- the decoded content stream
>
> it is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj
>
> re: "Old version of what major version?..."
> Yes, this is a 2.x base
> ... that borrowed some of the 3.x parsing you were working on, as it
> solved numerous issues we were encountering at the time
>
>
> On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler
> <[email protected]> wrote:
>
>>
>>
>> On 17.02.25 at 22:16, mountain the blue wrote:
>> > hi Andreas,
>> >
>> > re: 'is there any chance ...'
>> > I would have to ask the owner (a company) for authorisation, and I doubt
>> > I could have it sent quickly.
>> > I can, though, share the actual /ObjStm content, decompressed; let me know
>> > if this would help you.
>> Of course. it is better than nothing
>>
>> > re: "which version ..."
>> > I am using an old version ... (that I am patching myself) ...
>> > I can, however, reproduce it with current code on the trunk branch ...
>> > (therefore, the 2 unit tests to exhibit the current behavior)
>> Old version of what major version? 2.x or 3.x?
>>
>>
>> > On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> is there any chance to get a hand on the pdf in question?
>> >>
>> >> Which version of PDFBox are you using?
>> >>
>> >> Andreas
>> >>
>> >> On 17.02.25 at 17:16, mountain the blue wrote:
>> >>> hi,
>> >>>
>> >>> first of all, many thanks to the contributors of the pdfbox project, which
>> >>> I've been using for a long time for anything relating to pdf in java.
>> >>>
>> >>> I am using pdfbox to process various pdf files.
>> >>> lately, I received a file whose parsing failed:
>> >>> ie:
>> >>> ...
>> >>> Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
>> >>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
>> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
>> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
>> >>> ...
>> >>>
>> >>> Looking further into this error, the cause lies in the parsing of
>> >>> /ObjStm, which expects each object serialised in the stream to be
>> >>> followed by a separator (i.e. white space), while this pdf has some
>> >>> COS objects serialised without any such separation.
>> >>> In the current code base, accessible on GitHub, the following test passes:
>> >>>
>> >>> @Test
>> >>> void testParse2NumberObjects() throws IOException
>> >>> {
>> >>>     COSStream stream = new COSStream();
>> >>>     stream.setItem(COSName.N, COSInteger.TWO);
>> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
>> >>>     OutputStream outputStream = stream.createOutputStream();
>> >>>     outputStream.write("6 0 4 2 1 2".getBytes());
>> >>>     outputStream.close();
>> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
>> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
>> >>>     assertEquals(2, objectNumbers.size());
>> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
>> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
>> >>> }
>> >>>
>> >>>
>> >>> while this one fails:
>> >>>
>> >>> @Test
>> >>> void testParse2NumberObjectsNoSpace() throws IOException
>> >>> {
>> >>>     COSStream stream = new COSStream();
>> >>>     stream.setItem(COSName.N, COSInteger.TWO);
>> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
>> >>>     OutputStream outputStream = stream.createOutputStream();
>> >>>     // header "6 0 4 1 " (8 bytes), then the data "12" with no separator:
>> >>>     // object 6 is "1" at offset 0, object 4 is "2" at offset 1
>> >>>     outputStream.write("6 0 4 1 12".getBytes());
>> >>>     outputStream.close();
>> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
>> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
>> >>>     assertEquals(2, objectNumbers.size());
>> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
>> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
>> >>> }
>> >>>
>> >>> with error:
>> >>> org.opentest4j.AssertionFailedError:
>> >>> Expected :COSInt{1}
>> >>> Actual   :COSInt{12}
>> >>>
>> >>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>> >>> ...
>> >>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
>> >>> ...
>> >>> notes:
>> >>>
>> >>> a- the second object (number = 4) now indicates 1 as its offset, and
>> >>> '1' and '2' are now 'joined'.
>> >>>
>> >>> b- the file was created in November last year and converted from
>> >>> Word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do
>> >>> expect to see such (valid) pdf constructions more often in the (near) future.
>> >>>
>> >>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
>> >>> PDFObjectStreamParser implementation: I rewrote the
>> >>> privateReadObjectOffsets() method to return an array, and I use a parser
>> >>> that does not read beyond the implicit limit given by the next object's
>> >>> offset. Let me know if you want access to this change.
>> >>>
>> >>> thank you,
>> >>>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>>
>>
