Oliver Schmidtmer created PDFBOX-6046:
-----------------------------------------
Summary: PDFTextStripper: Sorting issue with overlaying text
Key: PDFBOX-6046
URL: https://issues.apache.org/jira/browse/PDFBOX-6046
Project: PDFBox
Issue Type: Bug
Reporter: Oliver Schmidtmer
Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf, image-2025-07-28-20-24-32-787.png

We found an issue with PDFTextStripper when text is "layered", in this case with some spaces used as placeholders.

The PDFs in question are order templates that are filled with data in a second step. If the text is extracted in the order in which it occurs in the PDF source, the first half consists of the field labels and the second half of the field values. We therefore need to sort by rendered position with PDFTextStripper#setSortByPosition(true).

Taking the first field in the attached file as an example, what should be "Auftraggeber: NAGEL-GROUP" is extracted as "Auftraggeber: N AGEL-GROUP", with a stray space after the "N".

!image-2025-07-28-20-24-32-787.png|width=440,height=62!

This is caused by spaces after "Auftraggeber: " that serve as a placeholder in the template and overlap the first glyph of the field value.
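
The ordering effect can be illustrated without PDFBox. The following is a minimal sketch, not PDFBox's actual sorting code: the Glyph class, the fixed 10-unit advance, and all coordinates are hypothetical and chosen only so that one placeholder blank from the template starts inside the extent of the value's first glyph. Sorting purely by x position then places that blank between "N" and "A". The dropOverlappedBlanks helper shows one conceivable mitigation (discarding blanks that overlap a visible glyph); it is an assumption for illustration, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OverlapSortDemo {

    // Simplified stand-in for a positioned glyph; NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final double x;      // left edge
        final double width;  // horizontal advance

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        double endX() { return x + width; }

        boolean isBlank() { return text.trim().isEmpty(); }
    }

    // Lay out a string left to right with a fixed advance (hypothetical metrics).
    static List<Glyph> layout(String s, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), startX + i * advance, advance));
        }
        return glyphs;
    }

    // Concatenate glyph texts after a stable sort by left edge.
    static String sortedText(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Hypothetical mitigation: drop blanks whose extent overlaps a non-blank glyph.
    static List<Glyph> dropOverlappedBlanks(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean overlapped = false;
            if (g.isBlank()) {
                for (Glyph other : glyphs) {
                    if (!other.isBlank() && g.x < other.endX() && other.x < g.endX()) {
                        overlapped = true;
                        break;
                    }
                }
            }
            if (!overlapped) {
                kept.add(g);
            }
        }
        return kept;
    }

    // One line in source order: template label, placeholder blank, then the value.
    static List<Glyph> sampleLine() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.addAll(layout("Auftraggeber: ", 0.0, 10.0)); // label occupies x = 0..140
        glyphs.addAll(layout(" ", 145.0, 10.0));            // placeholder blank, x = 145..155
        glyphs.addAll(layout("NAGEL-GROUP", 140.0, 10.0));  // value drawn later, "N" at x = 140..150
        return glyphs;
    }

    public static void main(String[] args) {
        List<Glyph> line = sampleLine();
        System.out.println(sortedText(line));                       // Auftraggeber: N AGEL-GROUP
        System.out.println(sortedText(dropOverlappedBlanks(line))); // Auftraggeber: NAGEL-GROUP
    }
}
```

In this model the label's own trailing blank survives the filter because it only touches its neighbours' edges, while the placeholder blank is removed because it starts inside the "N". Whether a comparable overlap check would be appropriate inside PDFTextStripper is an open question for the issue.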