[ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ilija Pavlic updated PDFBOX-1201:
---------------------------------
Priority: Major (was: Minor)
Description: The text stripper region doesn't capture text starting and
finishing outside the capture region but flowing through the capture region.
(was: The text stripper region seems to be shifted up from the given
coordinates, causing lines at the bottom of the region to be missed and lines
above the defined region to be included.
...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);
// overlay the region with a cyan rectangle to check if I got the coordinates
// and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height);
contentStream.close();
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...);
...
The cyan rectangle overlays the desired region exactly when viewing the saved
output document. The stripper, however, misses a couple of lines at the
bottom of the rectangle and includes a couple of lines above the rectangle.)
Summary: PDFTextStripperByArea doesn't capture text that flows inside
the capture region (was: PDFTextStripperByArea y coordinate shifted "up")
> PDFTextStripperByArea doesn't capture text that flows inside the capture
> region
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-1201
> URL: https://issues.apache.org/jira/browse/PDFBOX-1201
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Reporter: Ilija Pavlic
>
> The text stripper region doesn't capture text starting and finishing outside
> the capture region but flowing through the capture region.
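
To make the reported behaviour concrete, here is a minimal sketch that uses only java.awt.geom, not PDFBox. It assumes, without confirmation from the report, that a run of text is assigned to a region only when its start point lies inside the region. The class RegionFlowSketch and the methods startsInside and flowsThrough are hypothetical names chosen for illustration, and the coordinates are made up.

```java
import java.awt.geom.Rectangle2D;

public class RegionFlowSketch {

    // Hypothetical check: the run counts only if its start point
    // (the rectangle's x/y origin) lies inside the region.
    static boolean startsInside(Rectangle2D region, Rectangle2D run) {
        return region.contains(run.getX(), run.getY());
    }

    // Hypothetical alternative: the run counts if any part of its
    // bounding box overlaps the region.
    static boolean flowsThrough(Rectangle2D region, Rectangle2D run) {
        return region.intersects(run);
    }

    public static void main(String[] args) {
        // Made-up capture region: x from 100 to 300, y from 100 to 150.
        Rectangle2D region = new Rectangle2D.Float(100, 100, 200, 50);
        // Made-up text run that starts left of the region, ends right of it,
        // and crosses it: x from 50 to 450, y from 110 to 122.
        Rectangle2D run = new Rectangle2D.Float(50, 110, 400, 12);

        System.out.println("start point inside region: " + startsInside(region, run));
        System.out.println("run overlaps region:       " + flowsThrough(region, run));
    }
}
```

Under these assumptions the start-point test rejects the run while the overlap test accepts it, which matches the symptom described: text that begins and ends outside the region but passes through it is not returned.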