[ 
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888292#comment-17888292
 ] 

Lapo Luchini commented on PDFBOX-5411:
--------------------------------------

I was trying a patch to implement this… but while integrating the example PDF 
into the automated tests I noticed that they were separated into "sorted" and 
"unsorted".

Only then did I understand (and verify) that it was this line that produced the 
mix of letters:

{{stripper.setSortByPosition(true);}}
and that by switching that to {{false}} I get the two overlapping texts one 
after the other… I guess in the order they were drawn in the PDF.
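
To make the effect concrete, here is a minimal sketch that does not use PDFBox at all. The {{Glyph}} class, the coordinates and the {{join}} helper are hypothetical stand-ins for what happens to real {{TextPosition}} objects, and the real comparator is more involved than a plain sort on x. It only shows how ordering glyphs by their left x coordinate interleaves two overlapping strings, while keeping content-stream order returns them one after the other.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {
    // Hypothetical stand-in for a glyph: its character and its left x coordinate.
    static final class Glyph {
        final char ch;
        final double x;
        Glyph(char ch, double x) { this.ch = ch; this.x = x; }
    }

    // Lays out a string starting at startX, advancing a fixed step per glyph.
    static List<Glyph> layout(String text, double startX, double step) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * step));
        }
        return glyphs;
    }

    // Concatenates glyphs either in the given (stream) order or sorted by x.
    static String join(List<Glyph> streamOrder, boolean sortByPosition) {
        List<Glyph> glyphs = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two overlapping texts on the same line, drawn one after the other.
        List<Glyph> stream = new ArrayList<>();
        stream.addAll(layout("AB", 0.0, 10.0)); // glyphs at x = 0 and x = 10
        stream.addAll(layout("12", 5.0, 10.0)); // glyphs at x = 5 and x = 15
        System.out.println(join(stream, false)); // AB12
        System.out.println(join(stream, true));  // A1B2
    }
}
```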
 
This solved my problem, but I wonder which corner cases were handled correctly 
before and would no longer be. For example, when the string I'm searching for 
is produced by two separate "chunks" in the PDF, maybe sorting by position would 
join them together, and now it no longer would?
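
The kind of corner case I have in mind can be sketched like this, again without PDFBox and with made-up names ({{Chunk}}, {{extract}}) and coordinates. It assumes a producer that draws the right-hand part of a line before the left-hand part; whether real PDFs do that often is something I have not measured.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChunkOrderSketch {
    // Hypothetical stand-in for a run of text drawn by a single operator.
    static final class Chunk {
        final String text;
        final double x; // left edge of the chunk on the line
        Chunk(String text, double x) { this.text = text; this.x = x; }
    }

    // Concatenates chunks either in stream order or sorted by their left edge.
    static String extract(List<Chunk> streamOrder, boolean sortByPosition) {
        List<Chunk> chunks = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            chunks.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> stream = new ArrayList<>();
        stream.add(new Chunk("LINE", 50.0)); // drawn first, but placed on the right
        stream.add(new Chunk("TEST ", 0.0)); // drawn second, but placed on the left
        // With sorting the searched string is found; in stream order it is not.
        System.out.println(extract(stream, true).contains("TEST LINE"));
        System.out.println(extract(stream, false).contains("TEST LINE"));
    }
}
```

In this toy setup a search for "TEST LINE" only succeeds when the chunks are sorted by position, which is the behaviour I would lose by switching the flag off.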

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, 
> image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return 
> a mix simply based on "leftmost x coordinate of the glyph", which makes 
> sense, but it could make use of glyph size to disambiguate "easy" cases like 
> this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently this is the first parameter passed to 
> {{PDFTextStripper.writeString(String string, List<TextPosition> textPositions)}}:
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
