Re: Text extraction adding lots of strange spaces

Kevin Day Sat, 14 Dec 2024 09:00:45 -0800

Great, I will put that together in the next couple of days.

I haven't looked at the iText code in several years - this will be
greenfield development work.
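
For reference, here is a minimal sketch of the overlap idea from my quoted
message below: drop a space when its box overlaps a non-space glyph, then
sort. Glyph, dropCoveredSpaces and sortedText are hypothetical names made up
for illustration. This is not PDFBox's TextPosition or PDFTextStripper code,
and it assumes unrotated, axis-aligned boxes on a single line. A real patch
would also have to handle rotation, font metrics and multi-line sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: Glyph is a hypothetical stand-in, not PDFBox's TextPosition. */
final class SpaceOverlapFilter {

    /** Hypothetical glyph box: text plus an axis-aligned rectangle in page units. */
    static final class Glyph {
        final String text;
        final double x, y, width, height;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True when the glyph text is empty or whitespace only. */
        boolean isSpace() {
            return text.trim().isEmpty();
        }

        /** True when the two rectangles share a non-zero area. */
        boolean overlaps(Glyph o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    /** Returns the glyphs in input order, minus any space that overlaps a non-space glyph. */
    static List<Glyph> dropCoveredSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = false;
            if (g.isSpace()) {
                for (Glyph other : glyphs) {
                    if (!other.isSpace() && g.overlaps(other)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Sorts left to right (single line assumed) and concatenates the glyph text. */
    static String sortedText(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

The nested loop is quadratic in the number of glyphs, which is fine for a
sketch. Spaces that do not overlap any visible glyph are kept, so ordinary
word spacing is not affected.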


Take care,

K

On Fri, Dec 13, 2024, 9:21 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> On 13.12.2024 15:41, Kevin Day wrote:
> > Ok, this is very interesting.
> >
> > So we have one text render that adds a bunch of spaces.
> >
> > Then we have another text render that puts visible characters on top of
> > those spaces.
> >
> > Then we positional-sort the text positions, and they wind up interleaving,
> > which kills the text extraction accuracy.
> >
> > Would a reasonable refinement to the algorithm be to look for overlapping
> > glyphs? And if there is a space that overlaps a non-space, then ignore the
> > space?
> >
> > I know that text extraction is a black art (disclosure: I wrote the
> > original text extraction modules for iText, so lots of hard-earned
> > experience here), but I think the above would be appropriate for all
> > situations...
> >
> > Would you be up for reviewing a patch if I implement this?
> >
> > PS I also think that having an option to ignore spaces in rendering (the
> > issue you linked to) would be a good idea. But that should be optional. I'm
> > happy to include that in my patch if you would like.
>
> Yes to both!
>
> Note that there is a pending patch about a corner case:
>
> https://github.com/apache/pdfbox/pull/155
>
> But don't use code from itext unless you are allowed to, i.e. check the
> papers that you signed. If your patch is more than just a small bugfix
> you may have to sign something with us as well.
>
> https://www.apache.org/licenses/icla.pdf
>
> Tilman
>
>
>
> >
> > K
> >
> > On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:
> >
> >> Hi,
> >>
> >> These spaces are really in the PDF:
> >>
> >>      BT
> >>      /Content <</MCID 224 >>BDC
> >>      1 i
> >>      /T1_4 1 Tf
> >>      7 0 0 7 *195.4 110.502* Tm
> >>      *(\(                                                     \))Tj*
> >>      EMC
> >>      /Content <</MCID 225 >>BDC
> >>      /T1_0 1 Tf
> >>      -19.686 6.786 Td
> >>      (Beginning capital account)Tj
> >>      /T1_4 1 Tf
> >>      ( )Tj
> >>      14.057 0 Td
> >>      (.)Tj
> >>      1.714 0 Td
> >>      (.)Tj
> >>      1.714 0 Td
> >>      (.)Tj
> >>      EMC
> >>      ET
> >>
> >>
> >> And *195.4 110.502* is really the position, you can move your mouse
> >> there in PDFDebugger.
> >>
> >> The font messages are not important here.
> >> There is a way to get rid of such spaces, but it requires a source code
> >> change, it is described here:
> >> https://issues.apache.org/jira/browse/PDFBOX-3774
> >>
> >> However it's possible that other files would have a bad text extraction.
> >>
> >> Tilman
> >>
> >> On 13.12.2024 03:29, Kevin Day wrote:
> >>> Hello-
> >>>
> >>> We are using PDFTextStripper, and have found some cases where there are a
> >>> *lot* of extraneous spaces being added to the output.  It almost acts like
> >>> the stripper is thinking that the space width of the font is super tiny.
> >>>
> >>> I managed to get a document that exhibits the behavior:
> >>>
> >>>
> >>> https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
> >>>
> >>> The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
> >>> Text Positions.
> >>>
> >>> Note in the lower left corner of the document, there is text "999". The
> >>> text above and below that is fine, but the line with 999 has a *ton* of
> >>> extra space rectangles displayed.
> >>>
> >>> The extract text function in PDFDebugger doesn't sort, so that one comes
> >>> out fine, but if you use PDFTextStripper with sorting enabled (), the line
> >>> renders like this:
> >>>
> >>> Withdrawals and distributions . . . $ ( 9 9 9 )
> >>>
> >>> Note the many space characters, and that there are even spaces between each
> >>> 9.
> >>>
> >>> I also observe that the PDF has warning messages about fonts (not sure if
> >>> this might be involved):
> >>>
> >>> [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback font ArialMT for HelveticaLTStd-Roman
> >>>
> >>> [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is not supported and will be ignored
> >>>
> >>>
> >>>
> >>> It almost acts like the parentheses on the line are triggering some
> >>> different detection mode where the font's space width is computed to be
> >>> much smaller than it should be.
> >>>
> >>> Any ideas on what is going on or if it is fixable?
> >>>
> >>> Thanks!
> >>>
> >>> - K
> >>>
