[
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888352#comment-17888352
]
Michael Klink commented on PDFBOX-5411:
---------------------------------------
You guessed right: with {{SortByPosition}} set to {{false}}, text is extracted
in the order in which the content stream instructions draw it. Concerning
your question, therefore -
{quote}I wonder which corner cases were correctly detected before and would be
no longer{quote}
- the cases that require sorting are those in which the text is _not_ drawn in
reading order. In theory, text in a PDF can be drawn in any order, so the
need to sort can arise for arbitrary PDFs. In practice, text is often drawn in
reading order, as that is the natural thing to do. But there are exceptions.
And because programs usually cannot tell which PDFs draw their text in reading
order and which don't, many of them always sort, just in case.
In particular, if forms are prefilled (or filled in and then flattened), you
usually get content streams that first draw all the labels and flavor texts
and only afterwards all the filled-in values. Sorting the text of such PDFs
allows for sensible text extraction.
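To make the flattened-form case concrete, here is a minimal sketch in plain
Java that does not depend on PDFBox. The {{Chunk}} class, the coordinates and
the sample texts are made up for illustration, and the comparison (exact y,
then x, with y growing downwards) is a simplifying assumption, not the logic
PDFBox actually uses. In PDFBox itself the switch is
{{PDFTextStripper.setSortByPosition(true)}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Chunk {
        final String text;
        final double x;
        final double y; // assumption of this sketch: y grows downwards

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in the order given, i.e. the order in which they were drawn. */
    static String inDrawingOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins the result. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Simplification: chunks count as being on the same line only if y is identical.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return inDrawingOrder(sorted);
    }

    /** A flattened form: both labels are drawn first, both values afterwards. */
    static List<Chunk> flattenedForm() {
        return Arrays.asList(
                new Chunk("Name:", 50, 100),
                new Chunk("City:", 50, 120),
                new Chunk("Alice", 120, 100),
                new Chunk("Berlin", 120, 120));
    }

    public static void main(String[] args) {
        List<Chunk> form = flattenedForm();
        System.out.println(inDrawingOrder(form));   // Name: City: Alice Berlin
        System.out.println(sortedByPosition(form)); // Name: Alice City: Berlin
    }
}
```

Without sorting, the labels and the values come out as two separate runs.
With sorting, each value ends up next to its label.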
> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
> Key: PDFBOX-5411
> URL: https://issues.apache.org/jira/browse/PDFBOX-5411
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.25, 3.0.0 PDFBox
> Reporter: Lapo Luchini
> Priority: Minor
> Attachments: image-2022-04-08-16-13-17-334.png,
> image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return
> a mix simply based on "leftmost x coordinate of the glyph", which makes
> sense, but it could make use of glyph size to disambiguate "easy" cases like
> this one:
> !image-2022-04-08-16-13-17-334.png!
> currently this is the first parameter of PDFTextStripper.writeString(String
> string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}