[
https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183733#comment-13183733
]
Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------
Here's the stack trace from the latest pdfbox built from svn.
11.01.2012. 01:10:42 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to an OutOfMemoryError
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
    at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
    at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:117)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:262)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
    at pdf.test.Main.main(Main.java:61)
You were right that it is a java.lang.OutOfMemoryError. What does that mean?
Somewhat amusingly, a larger document of a similar type (947 pages long)
can be read without the exception being thrown.
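
For context, "Java heap space" in the trace above means the JVM could not
allocate more heap memory while RandomAccessBuffer was growing its buffer.
One way to see whether heap usage climbs steadily across the page loop is to
print it once per page. The following is only a minimal sketch using the
standard java.lang.Runtime API; the HeapProbe class and its method names are
hypothetical helpers, not part of PDFBox.

```java
// Hypothetical helper (not part of PDFBox): reports JVM heap usage so that
// growth across the page loop can be observed.
class HeapProbe {

    /** Bytes of heap currently in use (allocated heap minus free heap). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper limit the JVM will try to use for the heap (set with -Xmx). */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One summary line per page, meant to be printed inside the page loop. */
    static String report(int page) {
        long mb = 1024L * 1024L;
        return "Page " + page + ": heap used " + (usedBytes() / mb)
                + " MB of max " + (maxBytes() / mb) + " MB";
    }

    public static void main(String[] args) {
        // Stand-in for the real page loop: print one report line per "page".
        for (int i = 0; i < 3; i++) {
            System.out.println(report(i));
        }
    }
}
```

Calling System.out.println(HeapProbe.report(i)) next to the existing
"Page " + i output would show whether usage approaches the maximum shortly
before the failing page. If it does, running with a larger heap (for example
java -Xmx512m pdf.test.Main) may at least confirm that the limit is what is
being hit, although it would not explain why memory is retained.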
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt
> stream
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-1202
> URL: https://issues.apache.org/jira/browse/PDFBOX-1202
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Mac OS X 10.7.2
> Reporter: Ilija Pavlic
> Priority: Critical
> Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading
> corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i <
> allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages
> from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages
> from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an
> error. Is that an indication of a memory leak of some sort?
> Full code is below. Note that the result is the same when instantiating a
> single PDFTextStripperByArea outside the page loop and invoking resetEngine()
> on the stripper inside the page loop.
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
>
> public class Main {
>     public static void main(String[] args) throws IOException,
>             COSVisitorException, CryptographyException {
>
>         PDDocument document = null;
>         try {
>             document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
>             if (document.isEncrypted()) {
>                 try {
>                     document.decrypt("");
>                 } catch (InvalidPasswordException e) {
>                     System.err.println("Error: Document is encrypted with a password.");
>                     System.exit(1);
>                 }
>             }
>             float x = 55f;
>             float y = 40f;
>             float width = 168.5f;
>             float height = 689f;
>             float evenOffset = -10f;
>             List allPages = document.getDocumentCatalog().getAllPages();
>             for (int i = 0; i < allPages.size(); i++) {
>                 System.out.println("Page " + i);
>                 PDPage page = (PDPage) allPages.get(i);
>                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                 stripper.setSortByPosition(true);
>                 for (int j = 0; j < 3; j++) {
>                     if (i % 2 == 0) {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     } else {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     }
>                 }
>                 stripper.extractRegions(page);
>                 for (String regionName : stripper.getRegions()) {
>                     stripper.getTextForRegion(regionName);
>                 }
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         } finally {
>             if (document != null) {
>                 document.close();
>             }
>         }
>     }
> }