Hi Everyone,

There is a problem with PdfPagesTree::GetPageNode() which yields NULL for valid 
PDFs.

For example, GetPageNode() with nPageNum=1 fails for this 3-page PDF:
https://eur-lex.europa.eu/legal-content/DE/TXT/PDF/?uri=CELEX:52018XC0810(05)&from=DE

This PDF is an example of an unusual but valid page tree containing 
"/Pages" nodes with "/Count 0" and "/Kids [ ]".
According to the PDF spec, "Section 7.7.3 Page Tree / 7.7.3.1 General", such a 
tree must be handled:

> Conforming products shall be prepared to handle any form of tree structure 
> built of such nodes.

In fact, Adobe products have no problems with the PDF and Preflight checks show 
no problem either. However, PoDoFo cannot handle this tree:

> 372 0 obj
> <<
> /Type /Pages
> /Count 3
> /Kids [ 373 0 R 374 0 R 375 0 R ]
> >>
> endobj
> 373 0 obj
> <<
> /Type /Pages
> /Count 3
> /Kids [ 380 0 R 1 0 R 6 0 R ]
> /Parent 372 0 R
> >>
> endobj
> 374 0 obj
> <<
> /Type /Pages
> /Count 0
> /Kids [ ]
> /Parent 372 0 R
> >>
> endobj
> 375 0 obj
> <<
> /Type /Pages
> /Count 0
> /Kids [ ]
> /Parent 372 0 R
> >>
> endobj
> ...
> 379 0 obj
> <<
> /Type /Catalog
> /Lang (de)
> /MarkInfo <<
> /Marked true
> >>
> /Metadata 21 0 R
> /OpenAction [ 380 0 R /XYZ null null null ]
> /OutputIntents [ 376 0 R ]
> /Pages 372 0 R
> /StructTreeRoot 39 0 R
> >>
> endobj

The problem stems from this part of GetPageNode() where it calls 
GetPageNodeFromArray():

>     if( numDirectKids == numKids && static_cast<size_t>(nPageNum) < numDirectKids )
>     {
>         // This node has only page nodes as kids,
>         // so we can access the array directly
>         rLstParents.push_back( pParent );
>         return GetPageNodeFromArray( nPageNum, rKidsArray, rLstParents );
>     } 

The condition of the if-statement is true for this tree. However, 
GetPageNodeFromArray() cannot handle the tree layout in rKidsArray correctly.
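
To illustrate how the shortcut can misfire on this tree, here is a simplified 
model. It is not PoDoFo code: TreeNode, shortcutApplies(), pageFromArray() and 
makeRoot372() are made-up names. It assumes that numKids is the /Count value, 
that numDirectKids is the size of the /Kids array, and that the page index is 
zero-based. The real function has additional special cases that this model 
leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a page tree node; not PoDoFo's PdfObject.
struct TreeNode {
    bool isPage;                 // true for a /Page leaf, false for a /Pages node
    std::size_t count;           // the /Count entry (unused for leaves)
    std::vector<TreeNode> kids;  // the /Kids array
};

// Assumed reading of the shortcut condition: /Kids has as many entries as
// /Count claims pages, so the kids are taken to be the pages themselves.
bool shortcutApplies(const TreeNode& parent, std::size_t pageIndex) {
    return parent.kids.size() == parent.count && pageIndex < parent.kids.size();
}

// Simplified direct array access: returns the kid only if it really is a
// page, otherwise nullptr.
const TreeNode* pageFromArray(const TreeNode& parent, std::size_t pageIndex) {
    assert(pageIndex < parent.kids.size());
    const TreeNode& kid = parent.kids[pageIndex];
    return kid.isPage ? &kid : nullptr;
}

// Mirrors object 372 from the dump above: one /Pages kid holding the three
// pages (assumed to be /Page leaves) and two empty /Pages kids.
TreeNode makeRoot372() {
    TreeNode page{true, 0, {}};
    TreeNode node373{false, 3, {page, page, page}};
    TreeNode emptyNode{false, 0, {}};
    return TreeNode{false, 3, {node373, emptyNode, emptyNode}};
}
```

In this model the condition holds for object 372, because /Count is 3 and 
/Kids has 3 entries, yet entry 1 of the array is the empty node 374 and not a 
page, so the lookup yields no page.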

Closer inspection of GetPageNode() and GetPageNodeFromArray() shows 
considerable code duplication and many special cases, some of them for 
malformed PDFs. I would like to propose removing GetPageNodeFromArray() 
completely: it is not needed, the condition for calling it is currently wrong 
and not easy to correct, and it makes the code harder to follow. There is 
another call to GetPageNodeFromArray() that apparently cannot rely on its 
result either and at least checks it for NULL.

Instead, the full tree traversal in GetPageNode() should be sufficient and 
correct for all cases. This clearly needs further review by a PoDoFo expert.
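
As a rough sketch of what I mean by a plain traversal, independent of PoDoFo's 
actual classes: Node, countPages(), findPage(), makeNode() and 
makeExampleTree() are hypothetical names, the index is taken to be zero-based, 
and objects 380, 1 and 6 are assumed to be /Page leaves. The sketch counts 
leaves itself instead of trusting /Count, which is simple but not optimized.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a page tree node; not PoDoFo's PdfObject.
struct Node {
    int objNum;                               // object number, for identification
    bool isPage;                              // true for /Page, false for /Pages
    std::vector<std::unique_ptr<Node>> kids;  // empty for leaves and empty /Pages
};

// Number of leaf pages below a node, computed from the tree itself.
std::size_t countPages(const Node& n) {
    if (n.isPage) return 1;
    std::size_t total = 0;
    for (const auto& k : n.kids) total += countPages(*k);
    return total;
}

// Returns the page with zero-based index `index`, or nullptr if out of range.
// Empty /Pages nodes contribute zero pages and are skipped naturally.
const Node* findPage(const Node& n, std::size_t index) {
    if (n.isPage) return index == 0 ? &n : nullptr;
    for (const auto& k : n.kids) {
        assert(k != nullptr);
        const std::size_t c = countPages(*k);
        if (index < c) return findPage(*k, index);
        index -= c;  // skip all pages below this kid
    }
    return nullptr;
}

std::unique_ptr<Node> makeNode(int objNum, bool isPage) {
    auto n = std::make_unique<Node>();
    n->objNum = objNum;
    n->isPage = isPage;
    return n;
}

// Rebuilds the tree from the dump above: 372 -> [373, 374, 375],
// 373 -> [380, 1, 6], with 374 and 375 empty.
std::unique_ptr<Node> makeExampleTree() {
    auto root = makeNode(372, false);
    auto inner = makeNode(373, false);
    for (int num : {380, 1, 6}) inner->kids.push_back(makeNode(num, true));
    root->kids.push_back(std::move(inner));
    root->kids.push_back(makeNode(374, false));
    root->kids.push_back(makeNode(375, false));
    return root;
}
```

With this approach no separate array shortcut is needed: for the example tree, 
indices 0 to 2 resolve to objects 380, 1 and 6, and any larger index yields a 
null result.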

Best regards,
Amin

