[
https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ilija Pavlic updated PDFBOX-1202:
---------------------------------
Comment: was deleted
(was: I am also able to extract the text from the entire document without
removing security, by just invoking decrypt in a minimal working example.
The difference between the MWE and the buggy code is that I have been appending
to the PDPageContentStream, so perhaps that was the reason for the error. I
don't have the code at hand now; I'll look into it later in the day.)
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt
> stream
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-1202
> URL: https://issues.apache.org/jira/browse/PDFBOX-1202
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Mac OS X 10.7.2
> Reporter: Ilija Pavlic
> Priority: Critical
> Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading
> corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i <
> allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages
> from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages
> from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an
> error. Is that an indication of a memory leak of some sort?
> Here is the full code:
> package transhotel.pdf.iata;
>
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
>
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
>
> public class Main {
>     public static void main(String[] args) throws IOException,
>             COSVisitorException, CryptographyException {
>
>         PDDocument document = null;
>         try {
>             document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
>             if (document.isEncrypted()) {
>                 try {
>                     // Try the empty password first.
>                     document.decrypt("");
>                 } catch (InvalidPasswordException e) {
>                     System.err.println("Error: Document is encrypted with a password.");
>                     System.exit(1);
>                 }
>             }
>             // Geometry of the extraction region.
>             float x = 55f;
>             float y = 40f;
>             float width = 168.5f;
>             float height = 689f;
>             // Horizontal shift applied when the page index i is odd.
>             float evenOffset = -10f;
>             List allPages = document.getDocumentCatalog().getAllPages();
>             for (int i = 0; i < allPages.size(); i++) {
>                 System.out.println("Page " + i);
>                 PDPage page = (PDPage) allPages.get(i);
>                 // A new stripper is created for every page.
>                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                 stripper.setSortByPosition(true);
>                 // Note: this registers the same region, under the same name,
>                 // three times per page.
>                 for (int j = 0; j < 3; j++) {
>                     if (i % 2 == 0) {
>                         Rectangle2D.Float region = new Rectangle2D.Float(
>                                 x, y, width * 3, height);
>                         stripper.addRegion("region", region);
>                     } else {
>                         Rectangle2D.Float region = new Rectangle2D.Float(
>                                 x + evenOffset, y, width * 3, height);
>                         stripper.addRegion("region", region);
>                     }
>                 }
>                 stripper.extractRegions(page);
>                 for (String regionName : stripper.getRegions()) {
>                     // The extracted text is discarded; only the error matters here.
>                     stripper.getTextForRegion(regionName);
>                 }
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         } finally {
>             if (document != null) {
>                 document.close();
>             }
>         }
>     }
> }
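
The report suggests the failure depends on how many pages one run covers, not on which pages they are. One way to pin down that number without editing loop bounds by hand is a binary search over the span length. The following is only a sketch under stated assumptions: SpanBisect, smallestFailingSpan and the probe are hypothetical names that are not part of PDFBox, the probe in main is a stand-in instead of a real extraction run, and the search assumes that if a run over n pages fails, a run over n + 1 pages fails as well.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper: finds the smallest page span for which a run fails. */
final class SpanBisect {

    /**
     * Binary search over span lengths in [1, maxSpan].
     * failsForSpan.test(n) must return true when a run over n consecutive
     * pages fails. Assumes monotonicity: if n pages fail, n + 1 pages fail too.
     * Returns -1 when even a run over maxSpan pages succeeds.
     */
    static int smallestFailingSpan(int maxSpan, IntPredicate failsForSpan) {
        if (maxSpan < 1 || !failsForSpan.test(maxSpan)) {
            return -1;
        }
        int lo = 1;        // smallest candidate span
        int hi = maxSpan;  // known to fail
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (failsForSpan.test(mid)) {
                hi = mid;      // mid fails, so the answer is mid or smaller
            } else {
                lo = mid + 1;  // mid succeeds, so the answer is larger
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Stand-in probe (assumption): pretends any run of 397 or more pages
        // fails. Replace it with a real run of the extraction loop over
        // 'span' pages that reports whether the error appeared.
        IntPredicate fakeProbe = span -> span >= 397;
        System.out.println(smallestFailingSpan(1000, fakeProbe)); // prints 397
    }
}
```

With a real probe, each call would load the document, run the page loop over the requested number of pages, and report whether the "Stop reading corrupt stream" message appeared. Repeating the search from several start pages would show whether the threshold really is independent of the starting page.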