Hi everyone,

There is a problem with PdfPagesTree::GetPageNode(): it returns NULL for some valid PDFs.

For example, GetPageNode() fails for nPageNum=1 on this 3-page PDF:
https://eur-lex.europa.eu/legal-content/DE/TXT/PDF/?uri=CELEX:52018XC0810(05)&from=DE

This PDF is an example of a strange but valid page tree that contains "/Pages" nodes with "/Count 0" and "/Kids [ ]". According to the PDF spec, section 7.7.3 "Page Tree" / 7.7.3.1 "General", such a tree must be handled:

> Conforming products shall be prepared to handle any form of tree structure
> built of such nodes.

In fact, Adobe products have no problems with the PDF, and Preflight checks report no problems either. However, PoDoFo cannot handle this tree:

> 372 0 obj
> <<
>   /Type /Pages
>   /Count 3
>   /Kids [ 373 0 R 374 0 R 375 0 R ]
> >>
> endobj
> 373 0 obj
> <<
>   /Type /Pages
>   /Count 3
>   /Kids [ 380 0 R 1 0 R 6 0 R ]
>   /Parent 372 0 R
> >>
> endobj
> 374 0 obj
> <<
>   /Type /Pages
>   /Count 0
>   /Kids [ ]
>   /Parent 372 0 R
> >>
> endobj
> 375 0 obj
> <<
>   /Type /Pages
>   /Count 0
>   /Kids [ ]
>   /Parent 372 0 R
> >>
> endobj
> ...
> 379 0 obj
> <<
>   /Type /Catalog
>   /Lang (de)
>   /MarkInfo <<
>     /Marked true
>   >>
>   /Metadata 21 0 R
>   /OpenAction [ 380 0 R /XYZ null null null ]
>   /OutputIntents [ 376 0 R ]
>   /Pages 372 0 R
>   /StructTreeRoot 39 0 R
> >>
> endobj

The problem stems from this part of GetPageNode(), where it calls GetPageNodeFromArray():

> if( numDirectKids == numKids && static_cast<size_t>(nPageNum) < numDirectKids )
> {
>     // This node has only page nodes as kids,
>     // so we can access the array directly
>     rLstParents.push_back( pParent );
>     return GetPageNodeFromArray( nPageNum, rKidsArray, rLstParents );
> }

The condition of the if-statement is true for this tree, presumably because the root node (372) has /Count 3 and exactly three entries in /Kids, even though those entries are /Pages nodes rather than pages. GetPageNodeFromArray(), however, cannot handle the tree layout in rKidsArray correctly.

Closer inspection of GetPageNode() and GetPageNodeFromArray() shows considerable code duplication and a lot of special cases, even for malformed PDFs.
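
To make the failure mode concrete, here is a minimal, self-contained sketch that models the tree above with a simplified node type. Everything in it (Node, FindPage, DirectLookup, ExampleTree) is hypothetical and invented for illustration; none of it is PoDoFo's API, and DirectLookup is only my guess at the kind of shortcut the comment "we can access the array directly" describes, not the actual GetPageNodeFromArray() code. It also assumes the /Count values in the file are correct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a page-tree node (not PoDoFo's API).
struct Node {
    int objNum;                     // object number, used only to identify the node
    bool isPages;                   // true for /Type /Pages, false for a /Page leaf
    int count;                      // /Count value; unused (0) for leaves
    std::vector<const Node*> kids;  // /Kids entries; empty for leaves
};

// Full traversal: returns the leaf at zero-based pageIndex below `node`,
// or nullptr if there is no such page. Assumes every /Count is correct.
const Node* FindPage(const Node& node, int pageIndex) {
    assert(node.isPages);
    if (pageIndex < 0) return nullptr;
    for (const Node* kid : node.kids) {
        if (!kid->isPages) {
            if (pageIndex == 0) return kid;    // this leaf is the requested page
            --pageIndex;                       // skip one page
        } else if (pageIndex < kid->count) {
            return FindPage(*kid, pageIndex);  // the page lies inside this subtree
        } else {
            pageIndex -= kid->count;           // skip the subtree (also if /Count is 0)
        }
    }
    return nullptr;
}

// Naive shortcut: index /Kids directly, which is only valid when every kid
// is a page leaf. Returns nullptr when the indexed kid is not a leaf.
const Node* DirectLookup(const Node& node, int pageIndex) {
    if (pageIndex < 0 || static_cast<std::size_t>(pageIndex) >= node.kids.size())
        return nullptr;
    const Node* kid = node.kids[static_cast<std::size_t>(pageIndex)];
    return kid->isPages ? nullptr : kid;
}

// The tree from the PDF above: root 372 -> [373, 374, 375], 373 -> [380, 1, 6].
// Do not copy this struct; the nodes point at each other by address.
struct ExampleTree {
    Node p380{380, false, 0, {}};
    Node p1{1, false, 0, {}};
    Node p6{6, false, 0, {}};
    Node n373{373, true, 3, {&p380, &p1, &p6}};
    Node n374{374, true, 0, {}};
    Node n375{375, true, 0, {}};
    Node root{372, true, 3, {&n373, &n374, &n375}};
};
```

With an ExampleTree t, FindPage(t.root, 1) descends into node 373 and returns the leaf with object number 1, whereas DirectLookup(t.root, 1) lands on the empty /Pages node 374 and yields nullptr. The point of the sketch is only that "/Count equals the number of /Kids entries" does not imply that all kids are page leaves.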

In fact, I would like to propose removing GetPageNodeFromArray() completely: it is not needed, the condition for calling it is currently wrong and not easy to correct, and it makes the code less clean. There is another call to GetPageNodeFromArray() that is also unsure about its result and at least tries to compensate by checking the result for NULL. The full tree traversal in GetPageNode() would instead be sufficient and correct for all cases.

This clearly needs further inspection by a PoDoFo expert.

Best regards,
Amin
