Re: [Podofo-users] Patch for pdfParser - findToken function

Michal Sudolsky Wed, 27 Apr 2022 11:23:03 -0700

On Wed, Apr 27, 2022 at 7:56 PM Francesco Pretto <cez...@gmail.com> wrote:


> My report on pdfmm:
>
> 512.pdf -> OK
> 513.pdf -> OK
> 514.pdf -> OK
> rev.pdf -> FAIL
> big.pdf -> OK
> false.pdf -> OK
>
> I also created a big2.pdf (attached) that also fails on pdfmm but
> opens on Adobe, where the garbage is put just in between the numeric
> offset and the %%EOF. As you say, a better backward function should be
> created to handle such edge cases.
>

Yes, I can see why (though you may have forgotten to actually attach it).
Something like false.pdf, with garbage containing the string "startxref"
after the numeric offset, should also fail.


> I think for PoDoFo 0.9.8 we could focus on just handling the specific
> issue reported by Dennis, if possible with few lines of code and not
> breaking other pdfs.
>

Just to note: although it may fix podofo on 512.pdf, 513.pdf, 514.pdf and
big.pdf, it breaks rev.pdf, which currently works. As a quick workaround,
the patch could use max(file_size - xref_offset, lRange) as the buffer size.
That matches the old behaviour in that the buffer size is never smaller than
512. One possible problem is that there is then no functional safeguard any
more: for my pdf with 1 GB of garbage it would have to load all of that
garbage into memory. Also, when I tested this pdf in Acrobat it loaded
really fast, but it was slower in other viewers, so maybe they loaded the
whole file while Acrobat did something smarter.
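
A minimal sketch of that buffer-size rule, in Python for brevity (PoDoFo
itself is C++). The function name tail_buffer_size, the parameter names and
the default of 512 are assumptions for illustration; only the
max(file_size - xref_offset, lRange) expression comes from the suggestion
above.

```python
def tail_buffer_size(file_size: int, xref_offset: int, l_range: int = 512) -> int:
    """Number of bytes to read from the end of the file for the backward search."""
    # Cover everything from the xref section to the end of the file,
    # but never less than the old fixed range.
    wanted = max(file_size - xref_offset, l_range)
    # Never ask for more bytes than the file actually contains.
    return min(wanted, file_size)
```

With a small tail the result stays at 512, as before. With 1 GB of garbage
after the xref section the result is roughly 1 GB, which is the missing
safeguard mentioned above.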


> I'm sorry but I'm not available to work on PoDoFo 0.9.x codebase, but
> I will create test cases using the pdfs you created and fix it in
> pdfmm (which is candidate for merging to PoDoFo).
>

512.pdf could also be made with garbage between the numeric offset and
%%EOF, and then it should trigger that internal logic error in pdfmm as
well. I wrote "That will be that "if( !i )" and it will probably throw such
an error also in pdfmm." but for a moment I forgot that pdfmm does not
search for the trailer backwards. I made the files specifically for podofo,
but each can be changed to target startxref instead of trailer.
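
To illustrate why an "if( !i )" style check is suspicious, here is a
simplified model in Python. It is not the PoDoFo source, and all function
names are hypothetical. It only assumes what the quoted discussion below
states: i is 0 when the token sits at the very start of the buffer and -1
when the token is not found.

```python
def rfind_token(buf: bytes, token: bytes) -> int:
    """Scan backwards; return the index of the match, or -1 if there is none."""
    for i in range(len(buf) - len(token), -1, -1):
        if buf[i:i + len(token)] == token:
            return i
    return -1


def accepted_by_not_i(i: int) -> bool:
    """Models 'if( !i ) raise': only index 0 is rejected."""
    return i != 0


def accepted_by_sign_check(i: int) -> bool:
    """Models 'if( i < 0 ) raise': only 'not found' is rejected."""
    return i >= 0
```

Under this model a token at index 0 (the 512.pdf case) is rejected by the
first check although it was found, and a missing token (the 513.pdf case)
is accepted although it was not.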


> Regards,
> Francesco
>
>
> On Wed, 27 Apr 2022 at 18:27, Michal Sudolsky <sudols...@gmail.com> wrote:
> >
> > Attached are 6 PDF files and all of them open well in 3 pdf viewers I
> > tested.
> >
> >>
> >> so the backward search is correct, but it's better to limit it to
> >> "startxref".
> >>
> >> > Seems you are searching for a trailer right after xref (if I read
> >> > that part well).
> >> >
> >>
> >> Yes, correct, that was a cleaner solution: in my case it was useful to
> >> fix some spurious warnings as the commit message says. It also
> >> improved parsing performance.
> >
> >
> > Btw I noticed some typo here "Ooffset read position to the EOF marker if
> > it is not the last thing in the file".
> >
> >>
> >>
> >> > So is there actually some reason that for "i == 0" it is internal
> >> > logic? What if startxref is precisely PDF_XREF_BUF bytes before the
> >> > last EOF offset (m_LastEOFOffset)?
> >> >
> >>
> >> I didn't modify that code but I believe this was kind of an intended
> >> safeguard since the backward search is slow. Assuming one put a big
> >> amount of garbage also between "startxref" and "%%EOF" yes, what you
> >> say is true.
> >
> >
> > Yes, searching backward may be slow unless the whole file is loaded into
> > memory (which is not really good), but this can also be done in parts
> > (see the bottom of this message). It can also search for both the
> > trailer and startxref at once.
> >
> > 512.pdf gives error:
> >
> > PoDoFo encountered an error. Error: 8 ePdfError_InternalLogic
> > Error Description: An internal error occurred.
> >
> > That will be that "if( !i )" and it will probably throw such an error
> > also in pdfmm. I still do not believe this is really intentional (rather
> > it is just a bug).
> >
> > 513.pdf surprisingly works in podofo (the trailer is not found by
> > FindToken, but i is -1, so it seeks 513 bytes backwards, where the
> > trailer is subsequently found by IsNextToken after the call to FindToken
> > in ReadTrailer).
> >
> > 514.pdf same error as big.pdf.
> >
> >> We should test if Adobe handles arbitrary amount of
> >> garbage.
> >>
> >
> > big.pdf gives error (it has 1 MB of garbage so it is zipped):
> >
> > PoDoFo encountered an error. Error: 15 ePdfError_NoNumber
> > Error Description: A number was expected but not found.
> >
> > At the bottom of the call stack there is "Information: Unable to find
> > trailer in file."
> >
> > I also tested 1 GB of garbage (comments), and this also worked fine in
> > the 3 viewers mentioned.
> >
> >> Going back to the reporter's issue: I don't know how to fix it in PoDoFo
> >> with a few-line patch, but if you don't think anything is safe enough, a
> >> better fix is doing as I did in pdfmm: not reading "trailer"
> >> backward. Of course such a change won't need to be merged to pdfmm.
> >
> >
> > rev.pdf works fine in podofo, but when the patch from this email thread
> > is applied it gives this error:
> >
> > PoDoFo encountered an error. Error: 15 ePdfError_NoNumber
> > Error Description: A number was expected but not found.
> >
> > It cannot find the trailer.
> >
> > Also, I suppose rev.pdf cannot be opened in pdfmm. It has the xref and
> > trailer reordered. Note that nothing in the pdf specification says that
> > the trailer and xref must be in a particular order, just that the trailer
> > comes before startxref. It also does not say how far from the end the
> > trailer or startxref can be (only that %%EOF must be within 1024 bytes).
> >
> > Maybe the best approach would be to load the file into memory in chunks,
> > starting from the end. Let's say it first loads the last 16 kB and
> > searches for a token; if it is not found, it discards this chunk and
> > loads the preceding 16 kB, and so on. That way, even with GBs of garbage
> > it will not exhaust memory (of course these chunks should overlap
> > somewhat, because "trai" can be at the end of one chunk and "ler" at the
> > start of the adjacent one).
> >
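
The chunked backward search described in the quoted paragraph above could
look roughly like this. It is a sketch in Python for brevity, not PoDoFo or
pdfmm code; the name rfind_in_stream and its parameters are hypothetical.
Instead of overlapping the reads, it carries over the first len(token) - 1
bytes of the chunk that follows, which has the same effect for tokens split
across a boundary.

```python
import io
from typing import BinaryIO


def rfind_in_stream(f: BinaryIO, token: bytes, chunk_size: int = 16 * 1024) -> int:
    """Return the absolute offset of the last occurrence of token, or -1."""
    if not token:
        raise ValueError("token must not be empty")
    f.seek(0, io.SEEK_END)
    end = f.tell()   # exclusive end of the region not searched yet
    carry = b""      # first len(token) - 1 bytes of the chunk after this one
    while end > 0:
        start = max(0, end - chunk_size)
        f.seek(start)
        # Append the carried bytes so a token split across the boundary is seen.
        window = f.read(end - start) + carry
        hit = window.rfind(token)
        if hit != -1:
            return start + hit
        carry = window[:len(token) - 1]
        end = start   # discard this chunk and move towards the file start
    return -1
```

At any time only about chunk_size + len(token) bytes are held in memory, so
gigabytes of garbage cost read time but not memory.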
> > But there is another case:
> > false.pdf gives error on podofo:
> >
> > PoDoFo encountered an error. Error: 20 ePdfError_InvalidDataType
> >
> > There is a "false" trailer in a comment. This means that it is not
> > enough to just search for a specific string; the search needs to be aware
> > of context, that is, whether the string is in a comment or not (this is
> > the case for both trailer and startxref).
> >
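
A rough sketch of a comment-aware search for the false.pdf case, again in
Python and with a hypothetical function name. It makes two simplifying
assumptions: the buffer starts outside a comment, and every "%" starts a
comment that runs to the end of the line. The second does not hold inside
string literals or stream data, so a real parser would need more context
than this.

```python
def rfind_outside_comments(buf: bytes, token: bytes) -> int:
    """Offset of the last occurrence of token outside a % comment, or -1."""
    if not token:
        raise ValueError("token must not be empty")
    last = -1
    i = 0
    n = len(buf)
    while i < n:
        if buf[i:i + 1] == b"%":
            # Skip the comment: it ends at the next CR or LF.
            while i < n and buf[i] not in (0x0A, 0x0D):
                i += 1
        elif buf.startswith(token, i):
            last = i
            i += len(token)
        else:
            i += 1
    return last
```

For a buffer with a real trailer followed by a comment containing the word
"trailer", a plain rfind returns the offset inside the comment, while this
function returns the offset of the real keyword.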
> >
> >>
> >> Cheers,
> >> Francesco
>

_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
