Hi team, thank you for your great work on PDFBox!

I want to report an issue with PDF parsing/rendering.

In production, we encountered a PDF file that PDFBox does not render
properly: the output looks cut off in the middle. Acrobat and pdf.js, on
the other hand, render it without any problem.

I investigated the issue. PDFBox reports a warning at a specific offset,
which falls in the middle of a string passed to a TJ operator.
Interestingly, the string contains the byte sequence "\\)\n>" (hex:
5C 29 0A 3E) around that offset. I found that PDFBox has special handling
<https://github.com/apache/pdfbox/blob/2.0.28/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java#L480>
for this byte sequence, which seems to explain our issue.
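
In case it helps with reproduction, here is a minimal sketch of how the
byte sequence can be located in a file. It does not use any PDFBox API;
the class name SequenceScan and the method findAll are hypothetical names
chosen for illustration. It assumes the input is either an uncompressed PDF
or an already decoded content stream, since the sequence will not be
visible in a Flate-compressed stream.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class SequenceScan {
    // Backslash, closing parenthesis, line feed, greater-than sign.
    static final byte[] PATTERN = {0x5C, 0x29, 0x0A, 0x3E};

    // Returns every offset at which PATTERN starts in data.
    static List<Integer> findAll(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + PATTERN.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < PATTERN.length; j++) {
                if (data[i + j] != PATTERN[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SequenceScan <file>");
            return;
        }
        // args[0]: path to an uncompressed PDF or a decoded content stream.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (int offset : findAll(data)) {
            System.out.println("sequence found at offset " + offset);
        }
    }
}
```

Comparing the reported offsets with the offset in the PDFBox warning
should show whether the warning coincides with this sequence.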

Looking at the comment
<https://github.com/apache/pdfbox/blob/2.0.28/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java#L365>,
I understand that it is working around a PDF producer bug. However, it
now causes a rendering error for properly generated PDF files. Is there
anything we can do to get our PDFs rendered correctly?

--
Yuxiao Zeng
*Staff Engineering Manager*

*Healthcare Information Technologist*

Flatiron Health K.K.
https://flatiron.co.jp
