[ https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888292#comment-17888292 ]
Lapo Luchini commented on PDFBOX-5411:
--------------------------------------

I was trying a patch to implement this… but while integrating the example PDF into the automated tests I noticed that they were separated into "sorted" and "unsorted". Only then did I understand (and verify) that it was this line that produced the mix of letters:

{{stripper.setSortByPosition(true);}}

and that by switching it to false I get both overlapping texts "one by one"… I guess in the order they were printed in the PDF.

This solved my problem, but I wonder which corner cases were detected correctly before and no longer would be. E.g. when the string I'm searching for is produced by two separate "chunks" in the PDF, maybe sorting by position would join them together, and now it no longer would?

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return a mix simply based on "leftmost x coordinate of the glyph", which makes sense, but it could make use of glyph size to disambiguate "easy" cases like this one:
> !image-2022-04-08-16-13-17-334.png!
> currently this is the first parameter of PDFTextStripper.writeString(String string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
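
To make the interleaving described in the issue concrete, here is a minimal, self-contained Java sketch. It does not use PDFBox at all: {{Glyph}}, {{layOut}}, {{sortedByX}} and {{groupedBySize}} are hypothetical names invented for this illustration, and the coordinates and font sizes are made up. Grouping by exact font size is only an assumption about how "easy" cases might be disambiguated, not a description of what {{PDFTextStripper}} does or of any proposed patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OverlapSketch {
    // Hypothetical stand-in for a positioned glyph: character, left x, font size.
    static final class Glyph {
        final char ch;
        final double x;
        final double fontSize;

        Glyph(char ch, double x, double fontSize) {
            this.ch = ch;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    // Lay out a string on one line, starting at startX with a fixed advance per glyph.
    static List<Glyph> layOut(String s, double startX, double advance, double fontSize) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), startX + i * advance, fontSize));
        }
        return glyphs;
    }

    // Order glyphs by leftmost x only, ignoring which string they came from.
    static String sortedByX(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : copy) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    // Assumed heuristic: split by font size first (in order of first appearance),
    // then order each group by x.
    static List<String> groupedBySize(List<Glyph> glyphs) {
        Map<Double, List<Glyph>> groups = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            groups.computeIfAbsent(g.fontSize, k -> new ArrayList<>()).add(g);
        }
        List<String> out = new ArrayList<>();
        for (List<Glyph> group : groups.values()) {
            out.add(sortedByX(group));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.addAll(layOut("TEST", 0.0, 10.0, 20.0)); // large text at x = 0, 10, 20, 30
        line.addAll(layOut("0510", 5.0, 10.0, 8.0));  // small text at x = 5, 15, 25, 35
        System.out.println(sortedByX(line));     // prints T0E5S1T0
        System.out.println(groupedBySize(line)); // prints [TEST, 0510]
    }
}
```

In this toy model the x-only ordering reproduces the kind of letter mix reported above, while splitting on font size first yields the two strings separately. Real PDFs would need a tolerance on sizes and would still have to handle same-size overlaps, which this sketch ignores.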