[jira] [Comment Edited] (PDFBOX-4800) Parsing of numbers does not always terminate

Eckhart Pedersen (Jira) Thu, 19 Mar 2020 23:51:15 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063121#comment-17063121 ]


Eckhart Pedersen edited comment on PDFBOX-4800 at 3/20/20, 6:50 AM:
--------------------------------------------------------------------

Ah yes, I see now how my description can be misleading, sorry.

I am well aware that our solution is most likely not the best or the complete 
fix, but it got our customer up and running, and that had the highest 
priority for us right now. I provided our solution mostly to help you better 
understand the issue, in the hope that you could determine the correct fix, 
which we could then eventually incorporate.

That being said, from my limited understanding of the PDF format it does seem 
like the list of characters that can signal the end of a number should be 
expanded, and/or the entire approach needs to be rethought a bit.
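
To illustrate what an expanded terminator list could look like, here is a minimal sketch that stops reading a number at any white-space or delimiter character listed in ISO 32000-1 section 7.2.2. This is not PDFBox code: the class NumberTokenSketch and the methods endsNumber and readNumberToken are hypothetical names, and a plain java.io.PushbackInputStream stands in for the parser's own source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of PDFBox.
class NumberTokenSketch {
    // White-space characters per ISO 32000-1, 7.2.2: NUL, HT, LF, FF, CR, SP.
    private static final String WHITESPACE = "\0\t\n\f\r ";
    // Delimiter characters per ISO 32000-1, 7.2.2.
    private static final String DELIMITERS = "()<>[]{}/%";

    // True if the byte (or end of stream, -1) cannot be part of a number.
    static boolean endsNumber(int b) {
        return b == -1 || WHITESPACE.indexOf(b) >= 0 || DELIMITERS.indexOf(b) >= 0;
    }

    // Reads bytes up to, but not including, the first terminating byte.
    static String readNumberToken(PushbackInputStream in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int b;
        while (!endsNumber(b = in.read())) {
            buffer.append((char) b);
        }
        if (b != -1) {
            in.unread(b); // leave the terminator for the next token
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "612/Type".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        System.out.println(readNumberToken(in)); // prints 612
    }
}
```

Testing against a fixed set of terminators, rather than adding one character at a time, is one way to rethink the approach. Whether it fits the rest of the parser is for you to judge.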

I am also a bit concerned about the parser swallowing all exceptions and then 
later complaining about a missing Page Tree without mentioning that there 
actually was a Page Tree and that there were errors while parsing it. This 
makes debugging a bit more difficult than necessary :)
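
As a rough illustration of what I mean, a lenient parser could keep the exceptions it swallows and attach them to the later error. The sketch below is purely hypothetical (LenientParseSketch, parsePageTree and getPageTree are made-up names and do not reflect how PDFBox is structured); it relies only on the standard Throwable.addSuppressed method.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
class LenientParseSketch {
    private final List<Exception> recorded = new ArrayList<>();
    private Object pageTree; // stays null if parsing fails

    // Lenient parsing step: errors are recorded instead of silently dropped.
    void parsePageTree(boolean simulateFailure) {
        try {
            if (simulateFailure) {
                throw new IOException("simulated error while parsing the page tree");
            }
            pageTree = new Object();
        } catch (IOException e) {
            recorded.add(e);
        }
    }

    // Reports a missing page tree together with the earlier parse errors.
    Object getPageTree() throws IOException {
        if (pageTree == null) {
            IOException ex = new IOException(
                    "Page tree not available; " + recorded.size() + " earlier parse error(s)");
            for (Exception e : recorded) {
                ex.addSuppressed(e);
            }
            throw ex;
        }
        return pageTree;
    }
}
```

With something along these lines, the stack trace of the final error would also list the original parsing failures.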

Thank you for looking into this!


was (Author: cryptomathic_epe):
Ah yes, I see now how my description can be misleading, sorry.

I am well aware that our solution is most likely not the best fix/the entire 
solution, but it got our customer up and running and that had the highest 
priority for us right now. I provided our solution mostly for you to better 
understand the issue and hoped that you could determine the correct fix, which 
we then eventually could incorporate.

That being said, from my limited understanding of the PDF format it does seem 
like the list of characters that can signal the end of a number should be 
expanded, and/or the entire approach needs to be rethought a bit.

I am also a bit concerned about the parser swallowing all exceptions and then 
later complaining about a missing PageTree without stating that there have 
indeed been exceptions that could explain the missing PageTree. This makes 
debugging a bit more difficult than necessary. 

Thank you for looking into this!

> Parsing of numbers does not always terminate
> --------------------------------------------
>
>                 Key: PDFBOX-4800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.12, 2.0.15, 2.0.19
>            Reporter: Eckhart Pedersen
>            Priority: Major
>         Attachments: 1584634522723.txt, demobank_case_error_doc1.pdf, 
> demobank_case_ok_doc1.pdf
>
>
> *Short description:*
> The method *readStringNumber* in *BaseParser.java* fails to terminate parsing 
> of numbers for certain documents. We have fixed the issue internally by 
> adding the check for '/' (the line commented "added" below):
> while ((lastByte = seqSource.read()) != ASCII_SPACE &&
>         lastByte != ASCII_LF &&
>         lastByte != ASCII_CR &&
>         lastByte != 60 && // see sourceforge bug 1714707
>         lastByte != '[' && // PDFBOX-1845
>         lastByte != '(' && // PDFBOX-2579
>         lastByte != 0 && // See sourceforge bug 853328
>         lastByte != '/' && // added
>         lastByte != -1)
> {
>  
> *Background:*
> Our customer ran into an issue with certain documents that were converted to 
> PDF/A2 format with Qoppa jPDFPreflight 
> ([https://www.qoppa.com/pdfpreflight/]). In some instances pdfbox would 
> afterwards fail to open the document.
> (It is possible that the Qoppa conversion tool does something wrong and that 
> the resulting PDF is invalid somehow, but all other tools seem to open the 
> converted documents without any problems. We are not PDF experts, so this is 
> difficult for us to judge. If you determine that the problematic PDF document 
> is incorrect somehow, please notify us so that we can create a bug report at 
> Qoppa also.)
> I am attaching both an original version of the document (which pdfbox can 
> open just fine) and the converted version (which pdfbox cannot parse 
> correctly).
> *Additional information:*
> My colleague refers to ISO 32000-1 section 7.2.2, which describes all valid 
> white-space and delimiter characters for PDF.
> According to the list of delimiter/white-space characters, the following 
> characters should also be handled in the readStringNumber method: '%', '{', 
> ')', ']', '}', '>', FORM FEED, and HORIZONTAL TAB.
> Though again, as we are not experts on the PDF standard we recommend that you 
> check the mentioned standard documents yourself and determine what kind of 
> solution you want to implement (if any).
> *Final Note:*
> We are filing this bug report in the hope that you find it helpful. I have 
> tried to include all relevant information as well as I can. If you have 
> further questions, I would be happy to answer them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
