Hi Andreas, I forgot to tell ...
Looking at the contained 13679_stream.dat, we see the first offsets for the objects ...

    5550 0 5551 5 5552 *7* 5553 *11* 5554 16

with the actual data starting at offset 1990 (see the /First 1990 in 13679_objstm.raw).

Looking at the content at offset 1990 in 13679_stream.dat, we have ...

    offset - 1990:     0      7    11
    data         : ... 3505 4 248*03*505 ...

As we see, the bold '0' is the end of object 5552 ... while object 5553 starts at the bold '3': there is no separator between the two tokens.

On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <thebluemount...@gmail.com> wrote:

> Hi Andreas,
>
> Sorry for this delayed response.
>
> re: "Of course. it is better than nothing"
>
> I have uploaded a zip file containing:
> 1- the raw extract of the /ObjStm object and related stream content
> 2- the decoded content stream
>
> It is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj
>
> re: "Old version of what major version?..."
> Yes, this is a 2.x base
> ... that borrowed some of the 3.x parsing you were working on, as it was
> solving numerous issues we were encountering then
>
>
> On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler
> <andr...@lehmi.de.invalid> wrote:
>
>>
>>
>> On 17.02.25 at 22:16, mountain the blue wrote:
>> > hi Andreas,
>> >
>> > re: 'is there any chance ...'
>> > I would have to ask the owner (a company) for authorisation, and I doubt
>> > I could have it sent quickly.
>> > I can, though, share the actual /ObjStm content, decompressed; let me know
>> > if this would help you.
>> Of course. it is better than nothing
>>
>> > re: "which version ..."
>> > I am using an old version ... (that I am patching myself) ...
>> > I can, however, reproduce it with current code on the trunk branch ...
>> > (therefore, the 2 unit tests to exhibit the current behavior)
>> Old version of what major version? 2.x or 3.x?
>>
>>
>> > On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> is there any chance to get hold of the pdf in question?
>> >>
>> >> Which version of PDFBox are you using?
>> >>
>> >> Andreas
>> >>
>> >> On 17.02.25 at 17:16, mountain the blue wrote:
>> >>> hi,
>> >>>
>> >>> first of all, many thanks to the contributors of the pdfbox project, which
>> >>> I've been using for a long time for anything relating to pdf in java.
>> >>>
>> >>> I am using pdfbox to process various pdf files.
>> >>> lately, I received a file whose parsing failed:
>> >>> ie:
>> >>> ...
>> >>> Exception in thread "main" java.io.IOException: Error: Unknown annotation
>> >>> type COSInt{49633506}
>> >>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
>> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
>> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
>> >>> ...
>> >>>
>> >>> Looking further into this error, the cause was the parsing of
>> >>> /ObjStm ... that expects each object serialised in the stream to have a
>> >>> separator (ie: white space), while the
>> >>> pdf had some COS objects serialised without such separation.
>> >>>
>> >>> in the current code base, accessible on GitHub, the following test passes:
>> >>>
>> >>> @Test
>> >>> void testParse2NumberObjects() throws IOException
>> >>> {
>> >>>     COSStream stream = new COSStream();
>> >>>     stream.setItem(COSName.N, COSInteger.TWO);
>> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
>> >>>     OutputStream outputStream = stream.createOutputStream();
>> >>>     // header "6 0 4 2 " (8 bytes), then objects "1" and "2" separated by a space
>> >>>     outputStream.write("6 0 4 2 1 2".getBytes());
>> >>>     outputStream.close();
>> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
>> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
>> >>>     assertEquals(2, objectNumbers.size());
>> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
>> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
>> >>> }
>> >>>
>> >>>
>> >>> while this one fails:
>> >>>
>> >>> @Test
>> >>> void testParse2NumberObjectsNoSpace() throws IOException
>> >>> {
>> >>>     COSStream stream = new COSStream();
>> >>>     stream.setItem(COSName.N, COSInteger.TWO);
>> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
>> >>>     OutputStream outputStream = stream.createOutputStream();
>> >>>     // header "6 0 4 1 " (8 bytes): object 4 now starts at offset 1,
>> >>>     // directly after object 6, so the data is "12" with no separator
>> >>>     outputStream.write("6 0 4 1 12".getBytes());
>> >>>     outputStream.close();
>> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
>> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
>> >>>     assertEquals(2, objectNumbers.size());
>> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
>> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
>> >>> }
>> >>>
>> >>> with error:
>> >>>
>> >>> org.opentest4j.AssertionFailedError:
>> >>> Expected :COSInt{1}
>> >>> Actual   :COSInt{12}
>> >>>
>> >>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>> >>> ...
>> >>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
>> >>> ...
>> >>> notes:
>> >>>
>> >>> a- the second object (number = 4) now indicates 1 as its offset, and both
>> >>> '1' and '2' are now 'joined'.
>> >>>
>> >>> b- the file was created in November last year and converted from
>> >>> word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do
>> >>> expect to see such (valid) pdf constructions more often in the (near) future.
>> >>>
>> >>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
>> >>> PDFObjectStreamParser implementation, rewriting the
>> >>> privateReadObjectOffsets() method to return an array and using a parser
>> >>> that does not parse beyond the implicit limit given by the next object's
>> >>> offset. let me know if you want to access this change.
>> >>>
>> >>> thank you,
>> >>>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
>> >>
>> >>
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>
>>
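
For illustration only, here is a minimal standalone sketch of the idea discussed in this thread: each object is cut out of the decoded stream using its own offset and the next object's offset as bounds, so no separator between tokens is needed. This is not the PDFBox implementation and not the actual patch mentioned above; the class ObjStmSlicer and its slice() method are hypothetical names, and the sketch assumes the stream is already decoded and that /N and /First are known. It only returns the raw text of each object and does not parse it into COS objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class ObjStmSlicer {

    /** Returns object number -> raw object text, bounded by the next object's offset. */
    public static Map<Long, String> slice(byte[] decoded, int n, int first) {
        // The header holds n pairs "objectNumber offset" in the first `first` bytes.
        String header = new String(decoded, 0, first, StandardCharsets.ISO_8859_1).trim();
        String[] fields = header.isEmpty() ? new String[0] : header.split("\\s+");
        if (fields.length < 2 * n) {
            throw new IllegalArgumentException("header has fewer than " + n + " pairs");
        }
        List<long[]> entries = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            long objectNumber = Long.parseLong(fields[2 * i]);
            long offset = Long.parseLong(fields[2 * i + 1]);
            entries.add(new long[] { objectNumber, offset });
        }
        // Sort by offset defensively so "next object" really is the next one in the data.
        entries.sort(Comparator.comparingLong(e -> e[1]));

        Map<Long, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            int start = first + (int) entries.get(i)[1];
            // The last object runs to the end of the decoded stream.
            int end = (i + 1 < n) ? first + (int) entries.get(i + 1)[1] : decoded.length;
            String raw = new String(decoded, start, end - start, StandardCharsets.ISO_8859_1);
            result.put(entries.get(i)[0], raw.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same bytes as testParse2NumberObjectsNoSpace: header "6 0 4 1 ", data "12".
        byte[] noSpace = "6 0 4 1 12".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(slice(noSpace, 2, 8)); // prints {6=1, 4=2}
    }
}
```

With the bytes from the failing test, object 6 is limited to the single byte before offset 1, so '1' and '2' are no longer joined. A real fix would still have to run the regular COS parser on each slice.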