Great, I will put that together in the next couple of days. I haven't looked at the iText code in several years - this will be greenfield development work.
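For reference, here is a rough sketch of the overlap rule proposed in the quoted message below: if a space glyph's bounding box overlaps a non-space glyph, drop the space before sorting. It is plain Java with no PDFBox dependency. `Glyph`, `SpaceOverlapFilter` and `dropOverlappedSpaces` are hypothetical names made up for illustration, and a real patch would presumably operate on PDFBox's own text positions rather than on this stand-in class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned glyph; this is NOT a PDFBox class. */
final class Glyph {
    final String text;
    final float x, y, width, height; // lower-left corner plus size, in page units

    Glyph(String text, float x, float y, float width, float height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Treats empty or whitespace-only text as a space. */
    boolean isSpace() {
        return text.trim().isEmpty();
    }

    /** True when the two bounding boxes share a non-zero area. */
    boolean overlaps(Glyph other) {
        return x < other.x + other.width && other.x < x + width
            && y < other.y + other.height && other.y < y + height;
    }
}

final class SpaceOverlapFilter {
    /** Returns the glyphs in their original order, minus any space that overlaps a visible glyph. */
    static List<Glyph> dropOverlappedSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (g.isSpace() && overlapsVisible(g, glyphs)) {
                continue; // space hidden under a visible character: ignore it
            }
            kept.add(g);
        }
        return kept;
    }

    private static boolean overlapsVisible(Glyph space, List<Glyph> glyphs) {
        for (Glyph other : glyphs) {
            if (!other.isSpace() && space.overlaps(other)) {
                return true;
            }
        }
        return false;
    }
}
```

The pairwise check is quadratic in the number of glyphs, which is fine for a sketch but would probably need a per-line or sorted sweep in a real implementation.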
Take care,

K

On Fri, Dec 13, 2024, 9:21 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> On 13.12.2024 15:41, Kevin Day wrote:
> > Ok, this is very interesting.
> >
> > So we have one text render that adds a bunch of spaces.
> >
> > Then we have another text render that puts visible characters on top of
> > those spaces.
> >
> > Then we positional-sort the text positions, and they wind up interleaving,
> > which kills the text extraction accuracy.
> >
> > Would a reasonable refinement to the algorithm be to look for overlapping
> > glyphs? And if there is a space that overlaps a non-space, then ignore the
> > space?
> >
> > I know that text extraction is a black art (disclosure: I wrote the
> > original text extraction modules for iText, so lots of hard-earned
> > experience here), but I think the above would be appropriate for all
> > situations...
> >
> > Would you be up for reviewing a patch if I implement this?
> >
> > PS I also think that having an option to ignore spaces in rendering (the
> > issue you linked to) would be a good idea. But that should be optional. I'm
> > happy to include that in my patch if you would like.
>
> Yes to both!
>
> Note that there is a pending patch about a corner case:
>
> https://github.com/apache/pdfbox/pull/155
>
> But don't use code from itext unless you are allowed to, i.e. check the
> papers that you signed. If your patch is more than just a small bugfix
> you may have to sign something with us as well.
>
> https://www.apache.org/licenses/icla.pdf
>
> Tilman
>
> >
> > K
> >
> > On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:
> >
> >> Hi,
> >>
> >> These spaces are really in the PDF:
> >>
> >> BT
> >> /Content <</MCID 224 >>BDC
> >> 1 i
> >> /T1_4 1 Tf
> >> 7 0 0 7 *195.4 110.502* Tm
> >> *(\( \))Tj*
> >> EMC
> >> /Content <</MCID 225 >>BDC
> >> /T1_0 1 Tf
> >> -19.686 6.786 Td
> >> (Beginning capital account)Tj
> >> /T1_4 1 Tf
> >> ( )Tj
> >> 14.057 0 Td
> >> (.)Tj
> >> 1.714 0 Td
> >> (.)Tj
> >> 1.714 0 Td
> >> (.)Tj
> >> EMC
> >> ET
> >>
> >>
> >> And *195.4 110.502* is really the position, you can move your mouse
> >> there in PDFDebugger.
> >>
> >> The font messages are not important here.
> >> There is a way to get rid of such spaces, but it requires a source code
> >> change, it is described here:
> >> https://issues.apache.org/jira/browse/PDFBOX-3774
> >>
> >> However it's possible that other files would have a bad text extraction.
> >>
> >> Tilman
> >>
> >> On 13.12.2024 03:29, Kevin Day wrote:
> >>> Hello-
> >>>
> >>> We are using PDFTextStripper, and have found some cases where there are a
> >>> *lot* of extraneous spaces being added to the output. It almost acts like
> >>> the stripper is thinking that the space width of the font is super tiny.
> >>>
> >>> I managed to get a document that exhibits the behavior:
> >>>
> >>> https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
> >>>
> >>> The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
> >>> Text Positions.
> >>>
> >>> Note in the lower left corner of the document, there is text "999". The
> >>> text above and below that is fine, but the line with 999 has a *ton* of
> >>> extra space rectangles displayed.
> >>>
> >>> The extract text function in PDFDebugger doesn't sort, so that one comes
> >>> out fine, but if you use PDFTextStripper with sorting enabled (), the line
> >>> renders like this:
> >>>
> >>> Withdrawals and distributions . . . $ ( 9 9 9 )
> >>>
> >>> Note the many space characters, and that there are even spaces between each
> >>> 9.
> >>>
> >>> I also observe that the PDF has warning messages about fonts (not sure if
> >>> this might be involved):
> >>>
> >>> [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback font ArialMT for HelveticaLTStd-Roman
> >>>
> >>> [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is not supported and will be ignored
> >>>
> >>> It almost acts like the parenthesis on the line are triggering some
> >>> different detection mode where the font's space width is computing to be
> >>> much smaller than it should be.
> >>>
> >>> Any ideas on what is going on or if it is fixable?
> >>>
> >>> Thanks!
> >>>
> >>> - K
> >>>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
>