Hello,
When I use the CLI of the latest build of the Tika application jar
with the -h option to parse testAnnotations.pdf (a test document added
in TIKA-738, in the parsers' test documents folder), the output is
unbalanced: it has two "<p>" start tags but three "</p>" end tags.
Attempting to open this file in the GUI also crashes the GUI with an
NPE, the same one described in
TIKA-778. I see in issue PDFBOX-1143 that the code introduced for
TIKA-738 will go away once that PDFBox issue is resolved, but perhaps
in the meantime PDF2XHTML.java should be modified so that the number
of "</p>" end tags matches the number of "<p>" start tags: should one
of the "handler.endElement("p");" lines be removed from the endPage
method?
Thanks,
John Mastarone