Oliver Schmidtmer created PDFBOX-6046:
-----------------------------------------

             Summary: PDFTextStripper: Sorting issue with overlaying text
                 Key: PDFBOX-6046
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6046
             Project: PDFBox
          Issue Type: Bug
            Reporter: Oliver Schmidtmer
         Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf, 
image-2025-07-28-20-24-32-787.png

We found an issue with PDFTextStripper when text is "layered", in this case 
with some spaces acting as placeholders.

The PDFs in question are templates for orders, which are filled with data in a 
second step.

If the text is ordered by occurrence in the PDF source, the first half consists 
of the field labels and the second half of the field values. So we need sorting 
by rendered position, using PDFTextStripper#setSortByPosition(true).
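
To illustrate why source order is not usable here, below is a minimal sketch in 
plain Java. It does not use PDFBox; the classes Run and SortOrderSketch, the 
coordinates, and the second label/value pair are made up for illustration. It 
only models "labels first, values second" versus ordering by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a positioned text run; this is not a PDFBox class.
final class Run {
    final String text;
    final double x;
    final double y;

    Run(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class SortOrderSketch {
    // Joins the runs in the order given, separated by " | ".
    static String join(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(r.text);
        }
        return sb.toString();
    }

    // Returns a copy sorted top-to-bottom, then left-to-right
    // (in this sketch y grows downward).
    static List<Run> byPosition(List<Run> runs) {
        List<Run> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble((Run r) -> r.y)
                .thenComparingDouble(r -> r.x));
        return sorted;
    }

    // Template labels are written first, the field values in a second step.
    static List<Run> sample() {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Auftraggeber:", 10, 10));
        runs.add(new Run("Second label:", 10, 30)); // made-up label
        runs.add(new Run("NAGEL-GROUP", 100, 10));
        runs.add(new Run("second value", 100, 30)); // made-up value
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(join(sample()));             // source order
        System.out.println(join(byPosition(sample()))); // position order
    }
}
```

In source order the labels and values come out as two separate halves; only 
the position order puts each value next to its label.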

Taking the first example in the file, what should be

"Auftraggeber: NAGEL-GROUP"

is extracted as

"Auftraggeber: N AGEL-GROUP", with a spurious space after the "N".

!image-2025-07-28-20-24-32-787.png|width=440,height=62!

This is caused by the placeholder spaces in "Auftraggeber:  " in the template, 
which overlap with the first glyph of the field value.
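
The effect can be reproduced with a simplified model in plain Java. This is 
only a sketch of the suspected mechanism, not PDFBox's actual sorting code: 
the classes Glyph and OverlapSketch, the glyph advance of 1.0 and the start 
offset 13.5 are assumptions chosen so that the value overlaps the placeholder 
spaces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a single glyph and its left edge; not a PDFBox class.
final class Glyph {
    final char ch;
    final double x;

    Glyph(char ch, double x) {
        this.ch = ch;
        this.x = x;
    }
}

final class OverlapSketch {
    // Places the characters of text on one line, starting at startX,
    // each one advance units wide.
    static List<Glyph> place(String text, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * advance));
        }
        return glyphs;
    }

    // Sorts all glyphs by their left edge and concatenates them.
    static String sortedByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    static String extract() {
        List<Glyph> all = new ArrayList<>();
        // Template label followed by two placeholder spaces (x = 13 and 14).
        all.addAll(place("Auftraggeber:  ", 0.0, 1.0));
        // Field value written in a second step, starting inside the
        // placeholder area so that "N" lies between the two spaces.
        all.addAll(place("NAGEL-GROUP", 13.5, 1.0));
        return sortedByX(all);
    }

    public static void main(String[] args) {
        System.out.println(extract());
    }
}
```

In this model the second placeholder space sorts between "N" and "A", which 
matches the extracted text shown above.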

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

