Hello-

We are using PDFTextStripper, and have found some cases where there are a
*lot* of extraneous spaces being added to the output.  It almost acts like
the stripper is thinking that the space width of the font is super tiny.

I managed to get a document that exhibits the behavior:

https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing

The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
Text Positions.

Note that in the lower left corner of the document there is the text "999".
The text above and below it is fine, but the line with 999 has a *ton* of
extra space rectangles displayed.

The extract text function in PDFDebugger doesn't sort, so that one comes
out fine, but if you use PDFTextStripper with sorting enabled, the line
renders like this:

Withdrawals and distributions . . . $ ( 9 9 9 )

Note the many space characters, and that there are even spaces between each
9.

I also see warning messages about fonts when processing this PDF (not sure
if they are involved):

[main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
font ArialMT for HelveticaLTStd-Roman

[main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
not supported and will be ignored



It almost acts like the parentheses on the line are triggering some
different detection mode where the font's space width is being computed as
much smaller than it should be.
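
To illustrate what I mean, here is a simplified sketch of a gap-threshold
heuristic. This is my assumption about the general idea, not PDFBox's actual
code; the class name, method name, and all numbers are made up for
illustration. A space is emitted whenever the gap between two glyphs exceeds
some fraction of the font's space width, so an underestimated space width
turns ordinary letter spacing into spaces.

```java
public class SpaceHeuristicSketch {

    /**
     * Joins glyphs on one line, inserting a space whenever the horizontal gap
     * between consecutive glyphs exceeds spaceWidth * tolerance.
     * Hypothetical helper for illustration only; not part of PDFBox.
     */
    static String join(String[] text, float[] x, float[] width,
                       float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length; i++) {
            if (i > 0) {
                float prevEnd = x[i - 1] + width[i - 1];
                float gap = x[i] - prevEnd;
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(text[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: every glyph is 0.5 units from its neighbor.
        String[] text  = {"(", "9", "9", "9", ")"};
        float[]  x     = {0f, 3.5f, 9f, 14.5f, 20f};
        float[]  width = {3f, 5f, 5f, 5f, 3f};

        // Plausible space width: the 0.5 gaps stay below the 1.25 threshold.
        System.out.println(join(text, x, width, 2.5f, 0.5f)); // prints "(999)"

        // Tiny space width: the same gaps now exceed the 0.05 threshold.
        System.out.println(join(text, x, width, 0.1f, 0.5f)); // prints "( 9 9 9 )"
    }
}
```

The second call reproduces the kind of output I am seeing, which is why I
suspect the space width for that line.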

Any ideas on what is going on or if it is fixable?

Thanks!

- K
