Hi Steve,

> Steve Antoch <sant...@yuzu.com> wrote on 23 February 2015 at 19:42:
>
>
> @Andreas-
>
> I have downloaded the latest trunk and came close (it got much further) before
> failing.
> However, I think I may have a fix for that failure:

Thanks for the test.
> The code is returning 0 when the xrefstm fixedOffset is not found. However,
> the code still tries to load and parse from xref 0, resulting in a null
> reference exception later in parser.parse().

Your analysis is correct, but I hope that my latest improvements eliminate such
cases; see PDFBOX-2572 for details. Could you give the latest trunk (r1661747) a
try?

> However, thinking about this, I came up with this:
>
>     // check for a XRef stream, it may contain some object ids of compressed objects
>     if (trailer.containsKey(COSName.XREF_STM))
>     {
>         int streamOffset = trailer.getInt(COSName.XREF_STM);
>         // check the xref stream reference
>         fixedOffset = checkXRefStreamOffset(streamOffset, false);
>         // <== fixedOffset comes back as 0 => not found
>         if (fixedOffset > -1 && fixedOffset != streamOffset)
>         {
>             streamOffset = (int)fixedOffset;
>             // <== streamOffset gets set to 0 here
>             trailer.setInt(COSName.XREF_STM, streamOffset);
>         }
>
>         // <==== I added this test because an xref stream starting at
>         // offset 0 can never happen, so we should simply skip it
>         if (streamOffset > 0)
>         {
>             pdfSource.seek(streamOffset);
>             skipSpaces();
>             // <== this call ultimately throws a null ref exception
>             // if streamOffset == 0 on entry
>             parseXrefObjStream(prev, false);
>         }
>     }
>
> Adding that, the file successfully parses.
>
> Also, there was this proposal that I put up on github in a repo that I
> directly forked from pdfbox (it is the only change)
> It relaxes the looping a bit to allow limited recursion. I would appreciate
> your thoughts on it.

Is this change related to the issue discussed above?

> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
>
> Thank you so much! You have been tremendously helpful. I wish I could have
> given you the files, but unfortunately, they are proprietary and we cannot
> release them. :-(

No need to worry, you are not the only one who is not allowed to share a
specific pdf.
> Best regards-
> Steve

BR
Andreas Lehmkühler

>
> ________________________________________
> From: Andreas Lehmkühler <andr...@lehmi.de>
> Sent: Monday, February 23, 2015 3:43 AM
> To: users@pdfbox.apache.org
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> I've improved the self-repair mechanism of the trunk based on Steve's report.
>
> @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
> still persist?
>
> BR
> Andreas Lehmkühler
>
> > Steve Antoch <sant...@yuzu.com> wrote on 17 February 2015 at 00:05:
> >
> >
> > Andreas-
> > Thanks for the response.
> > Sorry for sending directly.
> >
> > Yes, it tries to read from offset 112085940, but does not find the xrefstm
> > there, so that's when it goes searching. It seems to be landing in the
> > middle of something else (perhaps an image?)
> >
> > I tried running the preflight command on the file, and this is what it found
> > there.
> > This is in the middle of a whole series of repetitive byte patterns like
> > these, which is interspersed with other sections of content that is also
> > binary only.
> >
> > <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> > <preflight name="file.pdf">
> >   <executionTimeMS>2646</executionTimeMS>
> >   <isValid type="">false</isValid>
> >   <errors count="1">
> >     <error count="1">
> >       <code>1.0</code>
> >       <details>Syntax error, Error: Expected a long type at offset
> > 112085940, instead got
> > '6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
> >     </error>
> >   </errors>
> > </preflight>
> >
> > The patterns seem to be:
> >
> > lots of these: 6lÙ³fÍ›
> > interspersed between blocks that are similar to this:
> > ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
> >
> > It just so happens that the offset 112085940 falls right in the middle of a
> > big block of those 6lÙ³fÍ› repetitive blocks.
> >
> > Not sure if that's any help.
> >
> > Steve
> >
> > ________________________________________
> > From: Andreas Lehmkühler <andr...@lehmi.de>
> > Sent: Monday, February 16, 2015 3:34 AM
> > To: users@pdfbox.apache.org
> > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> > (or variation of it still present)
> >
> > Hi,
> >
> > > Steve Antoch <sant...@yuzu.com> wrote on 13 February 2015 at 23:34:
> > >
> > >
> > > Hi Tilman and Andreas--
> >
> > Please don't contact developers directly, use our mailing lists instead.
> > I've put the users list back into the boat...
> >
> > > I am working with Krasimir on this issue.
> > >
> > > Although we asked, we were denied permission to send the document out.
> > > :-(
> > >
> > > The failure is being triggered when we attempt to use the Encrypt() class
> > > to password protect the pdf.
> > > We end up with the "Expected a long type at offset 113884174, instead got
> > > 'xref'" failure.
> > >
> > > I have debugged into the PDFBox code and found the offending parts.
> > >
> > > PdfBox is trying to parse an xref table located at 113884174.
> > >
> > > The problem we are seeing is that inside the trailer it finds the
> > > /XRefStm label, and its offset value is returned as 112085940 (which is
> > > what is given in the file).
> > > However, the checkXRefOffset() call made to verify it doesn't find the
> > > xref stream there, so it goes searching and ends up returning the closest
> > > xref offset it can find, which happens to be its own offset at
> > > 113884174.
> > >
> > >
> > > I believe that there is an error in PdfBox with respect to this fixup
> > > logic, even if it had found the 'correct' xref stream.
> > > That is because the fixup offset can NEVER work. Every time it fixes up
> > > the location, it lands on a section which begins with "xref".
> > > The next call is to skip the whitespace, but since there is never any
> > > there (it's already proven to be 'xref'), it does not advance the input
> > > stream.
> > > Then, the first call to parse that xrefstm always calls readObjectID(),
> > > which will always throw the exception because the bytes are always 'xref'.
> > >
> > > So, my questions are:
> > >
> > > 1) Are these docs fixable or are they truly corrupt?
> >
> > Without having the pdf itself at hand it's hard to give a 100% answer. But
> > I guess there has to be a fix, as Adobe is able to open that pdf. I'll try
> > to find one, following your description of the pdf.
> >
> > > 2) Is this xref issue a known issue with PdfBox? I would try to create a
> > > document that displays the error but I honestly don't know how to do so
> > > (beyond sending the ones that we have that DO display it).
> >
> > Not until now.
> >
> > > 3) Do you have any idea how these documents end up in this state if they
> > > are being edited by tools such as InDesign, Acrobat, etc? Is there
> > > something I can do to identify them?
> >
> > There are a lot of more or less corrupt files in the wild. Those are created
> > using different tools.
> >
> > > 4) If this is a truly corrupted document, why would Acrobat be able to
> > > open these files but pdfBox cannot? Are these streams somehow ignorable?
> > > I ask this because I saw this statement on a web page
> > > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > > when I was searching for answers on this:
> >
> > Adobe implements a lot of self-healing mechanisms to repair broken files and
> > we try to do so too, but it's complicated.
> >
> > > – /XrefStm [integer]: specifies the offset from the beginning of the file
> > > to the cross-reference stream in the decoded stream. This is only present
> > > in hybrid-reference files, which is specified if we would also like to
> > > open documents even if the applications don’t support compressed
> > > reference streams.
> > >
> > > Any light you can shed on this is appreciated.
> > >
> > > Thanks-
> > > Steve
> > >
> > >
> > > See below for the pertinent data and the code which is marked with the
> > > values as I traced through.
> > >
> > > I have confirmed that the byte offset of the word xref below is exactly at
> > > 113884174.
> >
> > Does the xref stream start at 112085940 (stream offset from the trailer
> > dictionary) or what did you find at that offset?
> >
> > > xref
> > > 0 53641
> > > 0000000000 65535 f
> > > 0000000017 00000 n
> > >
> > > <massive snip/>
> > >
> > >
> > > trailer
> > > <<
> > > /Size 53641
> > > /Root 1 0 R
> > > /XRefStm 112085940
> > > /Info 8 0 R
> > > /ID [<19790A83488211E283B50017F203355C>
> > > <E3DF7097A16969B08238787F19E7E219>]
> > > >>
> > > startxref
> > > 113884174
> > > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5
> > > 0 R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > > endobj
> > >
> > >
> > > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > > {
> > >     pdfSource.seek(startXRefOffset);
> > >     long startXrefOffset = parseStartXref();
> > >     // check the startxref offset
> > >     long fixedOffset = checkXRefOffset(startXrefOffset);
> > >     if (fixedOffset > -1)
> > >     {
> > >         startXrefOffset = fixedOffset;
> > >     }
> > >     document.setStartXref(startXrefOffset);
> > >     long prev = startXrefOffset;
> > >     // ---- parse whole chain of xref tables/object streams using PREV reference
> > >     while (prev > -1)    // <== prev here is 113884174.
> > >     {
> > >         // seek to xref table
> > >         pdfSource.seek(prev);
> > >
> > >         // skip white spaces
> > >         skipSpaces();
> > >         // -- parse xref
> > >         if (pdfSource.peek() == X)
> > >         {
> > >             // xref table and trailer
> > >             // use existing parser to parse xref table
> > >             parseXrefTable(prev);
> > >             // parse the last trailer.
> > >             trailerOffset = pdfSource.getOffset();
> > >             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> > >             while (isLenient && pdfSource.peek() != 't')
> > >             {
> > >                 if (pdfSource.getOffset() == trailerOffset)
> > >                 {
> > >                     // warn only the first time
> > >                     LOG.warn("Expected trailer object at position " + trailerOffset
> > >                             + ", keep trying");
> > >                 }
> > >                 readLine();
> > >             }
> > >             if (!parseTrailer())
> > >             {
> > >                 throw new IOException("Expected trailer object at position: "
> > >                         + pdfSource.getOffset());
> > >             }
> > >             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > >             // check for a XRef stream, it may contain some object ids of compressed objects
> > >             if (trailer.containsKey(COSName.XREF_STM))    // <== YES - but falue
> > >             {
> > >                 // <== This returns 112085940, which is the value from the trailer
> > >                 int streamOffset = trailer.getInt(COSName.XREF_STM);
> > >                 // check the xref stream reference
> > >                 // <== checks it and returns 113884174 instead
> > >                 fixedOffset = checkXRefOffset(streamOffset);
> > >                 if (fixedOffset > -1 && fixedOffset != streamOffset)
> > >                 {
> > >                     streamOffset = (int)fixedOffset;
> > >                     trailer.setInt(COSName.XREF_STM, streamOffset);
> > >                 }
> > >                 pdfSource.seek(streamOffset);    // <== Seeks to 113884174
> > >                 //readExpectedString(XREF_TABLE, false);
> > >                 skipSpaces();    // <=== It's ON "xref", so it doesn't skip anything
> > >                 // <== goes in here, first thing it tries to do is readObjectNumber(),
> > >                 // which can't work because it's 'xref' -- BOOM
> > >                 parseXrefObjStream(prev, false);
> > >             }
> > >             prev = trailer.getInt(COSName.PREV);
> > >             if (prev > -1)
> > >             {
> > >                 // check the xref table reference
> > >                 fixedOffset = checkXRefOffset(prev);
> > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > >                 {
> > >                     prev = fixedOffset;
> > >                     trailer.setLong(COSName.PREV, prev);
> > >                 }
> > >             }
> > >         }
> > >         else
> > >         {
> > >             // parse xref stream
> > >             prev = parseXrefObjStream(prev, true);
> > >             if (prev > -1)
> > >             {
> > >                 // check the xref table reference
> > >                 fixedOffset = checkXRefOffset(prev);
> > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > >                 {
> > >                     prev = fixedOffset;
> > >                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > >                     trailer.setLong(COSName.PREV, prev);
> > >                 }
> > >             }
> > >         }
> > >     }
> > >     // ---- build valid xrefs out of the xref chain
> > >     xrefTrailerResolver.setStartxref(startXrefOffset);
> > >     COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > >     document.setTrailer(trailer);
> > >     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> > >     // check the offsets of all referenced objects
> > >     checkXrefOffsets();
> > >     // copy xref table
> > >     document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > >     return trailer;
> > > }
> > >
> >
> > BR
> > Andreas Lehmkühler
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail: users-h...@pdfbox.apache.org
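
For readers following the thread, the failure mode described above (a "repaired" /XRefStm offset that lands on the `xref` keyword of a classic table, or on offset 0) can be illustrated outside of PDFBox. The following is only a minimal sketch under stated assumptions: `XrefOffsetProbe`, `classify` and `shouldParseXrefStream` are hypothetical names invented for this illustration, not PDFBox API, and the whole file is assumed to be available as a byte array. It checks whether an offset points at an indirect object header (`<num> <gen> obj`), which is how a cross-reference stream begins, before any attempt to parse it as one.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not part of PDFBox): reports what kind of structure
 * starts at a given byte offset of a PDF held in memory.
 */
public class XrefOffsetProbe {

    public enum Kind { XREF_TABLE, OBJECT_HEADER, UNKNOWN }

    /** Classifies the bytes at offset, after skipping leading PDF whitespace. */
    public static Kind classify(byte[] data, long offset) {
        // Offset 0 holds the "%PDF-" header, so no xref data can start there.
        if (offset <= 0 || offset >= data.length) {
            return Kind.UNKNOWN;
        }
        int pos = skipWhitespace(data, (int) offset);
        if (startsWith(data, pos, "xref")) {
            return Kind.XREF_TABLE; // classic table, not parseable as a stream
        }
        // A cross-reference stream is an indirect object: "<num> <gen> obj".
        int afterNum = skipDigits(data, pos);
        int genStart = skipWhitespace(data, afterNum);
        int afterGen = skipDigits(data, genStart);
        int objStart = skipWhitespace(data, afterGen);
        boolean wellFormed = afterNum > pos && genStart > afterNum
                && afterGen > genStart && objStart > afterGen
                && startsWith(data, objStart, "obj");
        return wellFormed ? Kind.OBJECT_HEADER : Kind.UNKNOWN;
    }

    /** True only if the offset points at something that could be a stream object. */
    public static boolean shouldParseXrefStream(byte[] data, long offset) {
        return classify(data, offset) == Kind.OBJECT_HEADER;
    }

    private static int skipWhitespace(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos])) {
            pos++;
        }
        return pos;
    }

    private static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    // The six whitespace characters defined by the PDF syntax.
    private static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + k.length > data.length) {
            return false;
        }
        for (int i = 0; i < k.length; i++) {
            if (data[pos + i] != k[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tiny made-up sample, only to exercise the three cases.
        String pdf = "%PDF-1.5\n12 0 obj\n<</Type/XRef>>\nendobj\nxref\n0 1\n";
        byte[] data = pdf.getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(data, pdf.indexOf("12 0 obj")));
        System.out.println(classify(data, pdf.indexOf("xref")));
        System.out.println(shouldParseXrefStream(data, 0));
    }
}
```

A caller in the spirit of the `streamOffset > 0` test proposed in this thread would skip the xref stream whenever `shouldParseXrefStream` returns false, rather than seeking to the offset and failing on the first object-number read. Whether the actual fix in trunk works this way is not something this sketch claims.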