On 13.12.2024 15:41, Kevin Day wrote:
Ok, this is very interesting.

So we have one text render that adds a bunch of spaces.

Then we have another text render that puts visible characters on top of
those spaces.

Then we positional-sort the text positions, and they wind up interleaving,
which kills the text extraction accuracy.

Would a reasonable refinement to the algorithm be to look for overlapping
glyphs? And if there is a space that overlaps a non-space, then ignore the
space?
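
A minimal sketch of the overlap check proposed above, in plain Java and independent of PDFBox. The `Glyph` class, its bounding-box fields, and the `dropCoveredSpaces` and `textOf` helpers are hypothetical names used only for illustration, not PDFBox API. A real patch would presumably work on the stripper's own text positions, and would likely need a tolerance rather than a strict box test.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingSpaceFilter {

    /** Hypothetical stand-in for a text position: text plus a bounding box in page units. */
    static final class Glyph {
        final String text;
        final double x, y, width, height;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean isSpace() {
            return text.trim().isEmpty();
        }

        /** True when the two boxes share a non-zero area (touching edges do not count). */
        boolean overlaps(Glyph other) {
            return x < other.x + other.width && other.x < x + width
                && y < other.y + other.height && other.y < y + height;
        }
    }

    /** Keeps the original order, dropping every space that a visible glyph overlaps. */
    static List<Glyph> dropCoveredSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = g.isSpace()
                && glyphs.stream().anyMatch(o -> !o.isSpace() && g.overlaps(o));
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Concatenates the glyph texts in list order. */
    static String textOf(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        // A run of spaces laid down first, as in the first text render.
        for (int i = 0; i < 5; i++) {
            line.add(new Glyph(" ", 100 + 2 * i, 50, 2, 7));
        }
        // Visible characters drawn later on top of those spaces.
        String visible = "(999)";
        for (int i = 0; i < visible.length(); i++) {
            line.add(new Glyph(String.valueOf(visible.charAt(i)), 100 + 2 * i, 50, 2, 7));
        }
        System.out.println(textOf(dropCoveredSpaces(line))); // prints "(999)"
    }
}
```

The pairwise comparison is quadratic in the number of glyphs, which is fine for a sketch. Running the filter before the positional sort would keep the covered spaces from interleaving with the visible characters.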

I know that text extraction is a black art (disclosure: I wrote the
original text extraction modules for iText, so lots of hard-earned
experience here), but I think the above would be appropriate for all
situations...

Would you be up for reviewing a patch if I implement this?

PS I also think that having an option to ignore spaces in rendering (the
issue you linked to) would be a good idea. But that should be optional. I'm
happy to include that in my patch if you would like.

Yes to both!

Note that there is a pending patch about a corner case:

https://github.com/apache/pdfbox/pull/155

But don't use code from iText unless you are allowed to, i.e. check the papers that you signed. If your patch is more than just a small bugfix, you may have to sign something with us as well.

https://www.apache.org/licenses/icla.pdf

Tilman




K

On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:

Hi,

These spaces are really in the PDF:

     BT
     /Content <</MCID 224 >>BDC
     1 i
     /T1_4 1 Tf
     7 0 0 7 195.4 110.502 Tm
     (\(                                                     \))Tj
     EMC
     /Content <</MCID 225 >>BDC
     /T1_0 1 Tf
     -19.686 6.786 Td
     (Beginning capital account)Tj
     /T1_4 1 Tf
     ( )Tj
     14.057 0 Td
     (.)Tj
     1.714 0 Td
     (.)Tj
     1.714 0 Td
     (.)Tj
     EMC
     ET


And *195.4 110.502* is really the position, you can move your mouse
there in PDFDebugger.
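
As a cross-check of the coordinates in the stream above, here is a small sketch of how a `Td` operand pair moves the text line matrix. It assumes the standard definition of `Td` from the PDF specification (a translation pre-multiplied onto the current text line matrix). The class and method names are hypothetical and not part of PDFBox.

```java
import java.util.Locale;

public class TextLineMatrix {

    /** Applies "tx ty Td" to a text line matrix {a, b, c, d, e, f} and returns the new matrix. */
    static double[] translate(double[] m, double tx, double ty) {
        return new double[] {
            m[0], m[1], m[2], m[3],
            tx * m[0] + ty * m[2] + m[4],
            tx * m[1] + ty * m[3] + m[5]
        };
    }

    public static void main(String[] args) {
        // "7 0 0 7 195.4 110.502 Tm" from the content stream above.
        double[] tm = {7, 0, 0, 7, 195.4, 110.502};
        // "-19.686 6.786 Td", the first move inside MCID 225.
        double[] next = translate(tm, -19.686, 6.786);
        System.out.println(String.format(Locale.ROOT, "%.3f %.3f", next[4], next[5]));
        // prints "57.598 158.004"
    }
}
```

Under that assumption, "Beginning capital account" starts at roughly (57.598, 158.004), so the run of spaces in MCID 224 sits on a different baseline from the text in MCID 225.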

The font messages are not important here.
There is a way to get rid of such spaces, but it requires a source code
change. It is described here:
https://issues.apache.org/jira/browse/PDFBOX-3774

However, it is possible that other files would then extract badly.

Tilman

On 13.12.2024 03:29, Kevin Day wrote:
Hello-

We are using PDFTextStripper, and have found some cases where there are a
*lot* of extraneous spaces being added to the output.  It almost acts like
the stripper is thinking that the space width of the font is super tiny.

I managed to get a document that exhibits the behavior:


https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
Text Positions.

Note in the lower left corner of the document, there is text "999".  The
text above and below that is fine, but the line with 999 has a *ton* of
extra space rectangles displayed.

The extract text function in PDFDebugger doesn't sort, so that one comes
out fine, but if you use PDFTextStripper with sorting enabled (), the line
renders like this:

Withdrawals and distributions . . . $ ( 9 9 9 )

Note the many space characters, and that there are even spaces between
each 9.

I also observe that the PDF has warning messages about fonts (not sure if
this might be involved):

[main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
font ArialMT for HelveticaLTStd-Roman

[main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
not supported and will be ignored



It almost acts like the parentheses on the line are triggering some
different detection mode where the font's space width is computed to be
much smaller than it should be.

Any ideas on what is going on or if it is fixable?

Thanks!

- K



