[jira] [Created] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

Sebastian Holzki (Jira) Thu, 23 Mar 2023 04:05:05 -0700

Sebastian Holzki created PDFBOX-5580:
----------------------------------------


             Summary: PDFTextStripperByArea ignores text for overlapping areas 
(regions) when suppressing duplicate overlapping text
                 Key: PDFBOX-5580
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5580
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.27
            Reporter: Sebastian Holzki
         Attachments: test.pdf

h3. Problem

Recently we encountered duplicate text in our clients' PDF documents. Such 
text is typically produced by applications that simulate bold text when no 
bold variant of a font is available. Fortunately, PDFBox's 
PDFTextStripperByArea has logic to ignore exact duplicates at the same 
position (inherited from the normal PDFTextStripper), so we switched from 
setSuppressDuplicateOverlappingText(false) to true.

However, with this setting the text of multiple regions is not extracted 
correctly under the following condition:

When several regions overlap and would yield exactly the same text, the text 
of the first region is extracted correctly, but every following region with 
the same text remains empty.

We believe this is a bug: duplicate suppression is not handled correctly in 
PDFTextStripperByArea.
h3. Possible cause

While investigating this problem we found that PDFTextStripperByArea swaps 
charactersByArticle for multiple regions and interprets a single page multiple 
times (once for each region). In PDFTextStripper, a private HashMap named 
characterListMapping keeps track of possible duplicate symbols together with 
their positions. The HashMap is not reset after each region is extracted, so 
characters already recorded for one region are treated as duplicates and 
ignored for subsequent regions.

Since the HashMap is private, we were not able to subclass PDFTextStripperByArea 
and adjust its behavior to test this finding.
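
The suspected mechanism can be illustrated with a small standalone model. 
This is only a sketch and not PDFBox code: the class 
DuplicateSuppressionModel, the Glyph type, the "seen" map (standing in for 
characterListMapping), the 0.5 position tolerance and the resetPerRegion flag 
are all hypothetical names and values chosen for illustration. It also assumes 
that both regions cover the same glyphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the suspected cause; NOT PDFBox's actual implementation. */
final class DuplicateSuppressionModel {

    /** Hypothetical stand-in for a text position: a character plus its coordinates. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Plays the role of characterListMapping: text -> glyphs already seen.
    private final Map<String, List<Glyph>> seen = new HashMap<>();

    /** True if the same text was already recorded at (almost) the same position. */
    private boolean isDuplicate(Glyph g) {
        List<Glyph> sameText = seen.computeIfAbsent(g.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            if (Math.abs(other.x - g.x) < 0.5 && Math.abs(other.y - g.y) < 0.5) {
                return true;
            }
        }
        sameText.add(g);
        return false;
    }

    /** Walks the page once per region; the map is cleared per region only on request. */
    Map<String, String> extract(List<Glyph> page, List<String> regions, boolean resetPerRegion) {
        Map<String, String> result = new LinkedHashMap<>();
        seen.clear(); // start of page
        for (String region : regions) {
            if (resetPerRegion) {
                seen.clear(); // the reset that appears to be missing
            }
            StringBuilder sb = new StringBuilder();
            for (Glyph g : page) { // assumption: every region covers all glyphs
                if (!isDuplicate(g)) {
                    sb.append(g.text);
                }
            }
            result.put(region, sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(new Glyph("H", 45, 319), new Glyph("i", 52, 319));
        List<String> regions = Arrays.asList("A", "B");
        DuplicateSuppressionModel model = new DuplicateSuppressionModel();
        System.out.println(model.extract(page, regions, false)); // {A=Hi, B=}
        System.out.println(model.extract(page, regions, true));  // {A=Hi, B=Hi}
    }
}
```

Without the per-region reset, region B finds every glyph already recorded by 
region A and drops it, which matches the empty result we observe.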
h3. Workaround

Extracting the regions one at a time for each page works correctly. So far we 
have not seen any performance penalty from doing so.
h3. Reproduction

The attached PDF file does not contain duplicate overlapping text; none is 
needed to reproduce the issue.

 
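{code:java}
import java.awt.geom.Rectangle2D;
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Place inside a method that declares "throws IOException".
try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
    final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    // Duplicate suppression must be enabled to trigger the problem.
    stripper.setSuppressDuplicateOverlappingText(true);
    stripper.setPageEnd("");

    // Two overlapping regions; B fully contains A.
    final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
    final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);

    stripper.addRegion("A", areaA);
    stripper.addRegion("B", areaB);

    // Both regions are extracted in a single call.
    stripper.extractRegions(doc.getPage(0));

    // Observed: A contains the text, B is empty.
    System.out.println("A: " + stripper.getTextForRegion("A"));
    System.out.println("B: " + stripper.getTextForRegion("B"));
}
{code}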
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

