@Andreas- I have downloaded the latest trunk and came close (it got much further) before failing. However, I think I may have a fix for that failure:
The code returns 0 when the XRefStm fixedOffset is not found, but it still tries to load and parse an xref stream from offset 0, which results in a null reference exception later in parser.parse(). Thinking about this, I came up with the following:

    // check for a XRef stream, it may contain some object ids of compressed objects
    if (trailer.containsKey(COSName.XREF_STM))
    {
        int streamOffset = trailer.getInt(COSName.XREF_STM);
        // check the xref stream reference
        fixedOffset = checkXRefStreamOffset(streamOffset, false); // <== fixedOffset comes back as 0 => not found
        if (fixedOffset > -1 && fixedOffset != streamOffset)
        {
            streamOffset = (int)fixedOffset; // <== streamOffset gets set to 0 here
            trailer.setInt(COSName.XREF_STM, streamOffset);
        }
        if (streamOffset > 0) // <==== I added this test because an xref stream starting at
                              // offset 0 can never happen, so we should simply skip it
        {
            pdfSource.seek(streamOffset);
            skipSpaces();
            parseXrefObjStream(prev, false); // <== this call ultimately throws a null ref exception
                                             // if streamOffset == 0 on entry
        }
    }

With that test added, the file parses successfully.

Also, I put a proposal up on GitHub, in a repo that I forked directly from pdfbox (it is the only change). It relaxes the looping a bit to allow limited recursion. I would appreciate your thoughts on it.

https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd

Thank you so much! You have been tremendously helpful. I wish I could have given you the files, but unfortunately they are proprietary and we cannot release them. :-(

Best regards-
Steve

________________________________________
From: Andreas Lehmkühler <andr...@lehmi.de>
Sent: Monday, February 23, 2015 3:43 AM
To: users@pdfbox.apache.org
Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

Hi,

I've improved the self repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try.
Does the issue still persist?

BR
Andreas Lehmkühler

> Steve Antoch <sant...@yuzu.com> wrote on 17 February 2015 at 00:05:
>
> Andreas-
> Thanks for the response.
> Sorry for sending directly.
>
> Yes, it tries to read from offset 112085940, but does not find the xrefstm
> there, so that's when it goes searching. It seems to be landing in the
> middle of something else (perhaps an image?)
>
> I tried running the preflight command on the file, and this is what it found
> there. This is in the middle of a whole series of repetitive byte patterns
> like these, which are interspersed with other sections of content that are
> also binary only.
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <preflight name="file.pdf">
>   <executionTimeMS>2646</executionTimeMS>
>   <isValid type="">false</isValid>
>   <errors count="1">
>     <error count="1">
>       <code>1.0</code>
>       <details>Syntax error, Error: Expected a long type at offset 112085940, instead got
> '6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
>     </error>
>   </errors>
> </preflight>
>
> The patterns seem to be:
>
> lots of these: 6lÙ³fÍ›
> interspersed between blocks that are similar to this:
> ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
>
> It just so happens that the offset 112085940 falls right in the middle of a
> big block of those repetitive 6lÙ³fÍ› patterns.
>
> Not sure if that's any help.
>
> Steve
>
> ________________________________________
> From: Andreas Lehmkühler <andr...@lehmi.de>
> Sent: Monday, February 16, 2015 3:34 AM
> To: users@pdfbox.apache.org
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> > Steve Antoch <sant...@yuzu.com> wrote on 13 February 2015 at 23:34:
> >
> > Hi Tilman and Andreas--
> Please don't contact developers directly, use our mailing lists instead. I've
> put the users list back into the boat...
>
> > I am working with Krasimir on this issue.
> >
> > Although we asked, we were denied permission to send the document out.
> :-(
>
> > The failure is being triggered when we attempt to use the Encrypt() class to
> > password protect the pdf.
> > We end up with the "Expected a long type at offset 113884174, instead got
> > 'xref'" failure.
> >
> > I have debugged into the PDFBox code and found the offending parts.
> >
> > PdfBox is trying to parse an xref table located at 113884174.
> >
> > The problem we are seeing is that inside the trailer it finds the /XRefStm
> > label, and its offset value is returned as 112085940 (which is what is given
> > in the file).
> > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > stream there, so it goes searching and ends up returning the closest xref
> > offset it can find, which happens to be its own offset at 113884174.
> >
> > I believe that there is an error in PdfBox with respect to this fixup logic,
> > even if it had found the 'correct' xref stream.
> > That is because the fixup offset can NEVER work. Every time it fixes up the
> > location, it lands on a section which begins with "xref".
> > The next call is to skip the whitespace, but since there is never any there
> > (it's already proven to be 'xref'), it does not advance the input stream.
> > Then, the first call to parse that xrefstm always calls readObjectID(),
> > which will always throw the exception because the bytes are always 'xref'.
> >
> > So, my questions are:
> >
> > 1) Are these docs fixable or are they truly corrupt?
> Without having a hand on the pdf itself it's hard to give a 100% answer. But I
> guess there has to be a fix, as Adobe is able to open that pdf.
> I'll try to find one, following your description of the pdf.
>
> > 2) Is this xref issue a known issue with PdfBox? I would try to create a
> > document that displays the error but I honestly don't know how to do so
> > (beyond sending the ones that we have that DO display it).
> Not until now
>
> > 3) Do you have any idea how these documents end up in this state if they are
> > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > can do to identify them?
> There are a lot of more or less corrupt files in the wild. Those are created
> using different tools.
>
> > 4) If this is a truly corrupted document, why would Acrobat be able to open
> > these files but pdfBox cannot? Are these streams somehow ignorable? I ask
> > this because I saw this statement on a web page
> > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > when I was searching for answers on this:
> Adobe implements a lot of self healing mechanisms to repair broken files and
> we try to do so too, but it's complicated.
>
> > - /XrefStm [integer]: specifies the offset from the beginning of the file to
> > the cross-reference stream in the decoded stream. This is only present in
> > hybrid-reference files, which is specified if we would also like to open
> > documents even if the applications don't support compressed reference
> > streams.
> >
> > Any light you can shed on this is appreciated.
> >
> > Thanks-
> > Steve
> >
> >
> > See below for the pertinent data and the code which is marked with the
> > values as I traced through.
> >
> > I have confirmed that the byte offset of the word xref below is exactly at
> > 113884174.
>
> Does the xref stream start at 112085940 (stream offset from the trailer
> dictionary) or what did you find at that offset?
> >
> > xref
> > 0 53641
> > 0000000000 65535 f
> > 0000000017 00000 n
> >
> > <massive snip/>
> >
> >
> > trailer
> > <<
> > /Size 53641
> > /Root 1 0 R
> > /XRefStm 112085940
> > /Info 8 0 R
> > /ID [<19790A83488211E283B50017F203355C>
> > <E3DF7097A16969B08238787F19E7E219>]
> > >>
> > startxref
> > 113884174
> > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > endobj
> >
> >
> > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > {
> >     pdfSource.seek(startXRefOffset);
> >     long startXrefOffset = parseStartXref();
> >     // check the startxref offset
> >     long fixedOffset = checkXRefOffset(startXrefOffset);
> >     if (fixedOffset > -1)
> >     {
> >         startXrefOffset = fixedOffset;
> >     }
> >     document.setStartXref(startXrefOffset);
> >     long prev = startXrefOffset;
> >     // ---- parse whole chain of xref tables/object streams using PREV reference
> >     while (prev > -1) // <== prev here is 113884174.
> >     {
> >         // seek to xref table
> >         pdfSource.seek(prev);
> >
> >         // skip white spaces
> >         skipSpaces();
> >         // -- parse xref
> >         if (pdfSource.peek() == X)
> >         {
> >             // xref table and trailer
> >             // use existing parser to parse xref table
> >             parseXrefTable(prev);
> >             // parse the last trailer.
> >             trailerOffset = pdfSource.getOffset();
> >             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> >             while (isLenient && pdfSource.peek() != 't')
> >             {
> >                 if (pdfSource.getOffset() == trailerOffset)
> >                 {
> >                     // warn only the first time
> >                     LOG.warn("Expected trailer object at position " + trailerOffset
> >                             + ", keep trying");
> >                 }
> >                 readLine();
> >             }
> >             if (!parseTrailer())
> >             {
> >                 throw new IOException("Expected trailer object at position: "
> >                         + pdfSource.getOffset());
> >             }
> >             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >             // check for a XRef stream, it may contain some object ids of compressed objects
> >             if (trailer.containsKey(COSName.XREF_STM)) // <== YES - but falue
> >             {
> >                 int streamOffset = trailer.getInt(COSName.XREF_STM);
> >                 // <== This returns 112085940, which is the value from the trailer
> >                 // check the xref stream reference
> >                 fixedOffset = checkXRefOffset(streamOffset);
> >                 // <== checks it and returns 113884174 instead
> >                 if (fixedOffset > -1 && fixedOffset != streamOffset)
> >                 {
> >                     streamOffset = (int)fixedOffset;
> >                     trailer.setInt(COSName.XREF_STM, streamOffset);
> >                 }
> >                 pdfSource.seek(streamOffset); // <== Seeks to 113884174
> >                 //readExpectedString(XREF_TABLE, false);
> >                 skipSpaces(); // <=== It's ON "xref", so it doesn't skip anything
> >                 parseXrefObjStream(prev, false);
> >                 // <== goes in here, first thing it tries to do is readObjectNumber(),
> >                 // which can't work because it's 'xref' -- BOOM
> >             }
> >             prev = trailer.getInt(COSName.PREV);
> >             if (prev > -1)
> >             {
> >                 // check the xref table reference
> >                 fixedOffset = checkXRefOffset(prev);
> >                 if (fixedOffset > -1 && fixedOffset != prev)
> >                 {
> >                     prev = fixedOffset;
> >                     trailer.setLong(COSName.PREV, prev);
> >                 }
> >             }
> >         }
> >         else
> >         {
> >             // parse xref stream
> >             prev = parseXrefObjStream(prev, true);
> >             if (prev > -1)
> >             {
> >                 // check the xref table reference
> >                 fixedOffset = checkXRefOffset(prev);
> >                 if (fixedOffset > -1 && fixedOffset != prev)
> >                 {
> >                     prev = fixedOffset;
> >                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >                     trailer.setLong(COSName.PREV, prev);
> >                 }
> >             }
> >         }
> >     }
> >     // ---- build valid xrefs out of the xref chain
> >     xrefTrailerResolver.setStartxref(startXrefOffset);
> >     COSDictionary trailer = xrefTrailerResolver.getTrailer();
> >     document.setTrailer(trailer);
> >     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> >     // check the offsets of all referenced objects
> >     checkXrefOffsets();
> >     // copy xref table
> >     document.addXRefTable(xrefTrailerResolver.getXrefTable());
> >     return trailer;
> > }
> >
>
> BR
> Andreas Lehmkühler
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
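
For readers following the thread, the guard proposed in the top message can be looked at in isolation. The following is only a minimal sketch and is not PDFBox code: the class XRefStmGuardSketch, the method resolveXRefStmOffset and the NOT_FOUND sentinel are hypothetical names made up for illustration, and the result of the offset check is passed in as a plain number instead of being computed by checkXRefStreamOffset(). It assumes, as reported above, that the check returns 0 when no xref stream could be located.

```java
/**
 * Illustration only: decides whether the /XRefStm offset from a trailer
 * should be followed. All names here are hypothetical, not PDFBox API.
 */
final class XRefStmGuardSketch {

    /** Hypothetical sentinel meaning "skip the xref stream". */
    static final long NOT_FOUND = -1L;

    /**
     * @param trailerOffset the /XRefStm value as stored in the trailer
     * @param fixedOffset   what the offset check returned (assumed: 0 when
     *                      nothing was found, a negative value when no fix applies)
     * @return the offset to seek to, or NOT_FOUND if the stream should be skipped
     */
    static long resolveXRefStmOffset(long trailerOffset, long fixedOffset) {
        long streamOffset = trailerOffset;
        // Same fix-up as in the quoted parser code: prefer the corrected offset.
        if (fixedOffset > -1 && fixedOffset != streamOffset) {
            streamOffset = fixedOffset;
        }
        // The added test: an xref stream can never start at offset 0,
        // because a PDF file begins with its %PDF header there.
        return streamOffset > 0 ? streamOffset : NOT_FOUND;
    }

    public static void main(String[] args) {
        // The case from this thread: the check found nothing and returned 0.
        System.out.println(resolveXRefStmOffset(112085940L, 0L));         // prints -1
        // The trailer value was already correct.
        System.out.println(resolveXRefStmOffset(112085940L, 112085940L)); // prints 112085940
        // The check found the stream at a different, non-zero offset.
        System.out.println(resolveXRefStmOffset(100L, 250L));             // prints 250
    }
}
```

A caller would only seek and parse when the result is not NOT_FOUND, which corresponds to the "if (streamOffset > 0)" block in the proposed change.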