[ 
https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724332#comment-17724332
 ] 

Tilman Hausherr edited comment on PDFBOX-5606 at 5/19/23 4:07 PM:
------------------------------------------------------------------

I'm able to run ExtractText on the command line with -Xmx15m on both 2.0.27 and
2.0.28 using JDK 20; a 15 MB heap limit is pretty low. (I could retest with
JDK 8 if needed.)
What really counts is whether the memory usage differs between the two versions.
PDFTextStripperByArea hasn't been changed, but its parent class had a minor
change.
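
One way to compare the two releases is to sample heap use after each round of extraction inside the same JVM. The following is only a minimal sketch under stated assumptions: `HeapProbe`, `usedHeapBytes` and `sample` are hypothetical names that are not part of PDFBox, the workload in `main` is a placeholder for the real per-page extraction, and `Runtime.gc()` is only a hint to the JVM, so the figures are approximate.

```java
public class HeapProbe {

    /** Approximate heap in use, in bytes, after asking the JVM to collect garbage. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task the given number of times and records used heap after each run. */
    static long[] sample(Runnable task, int rounds) {
        long[] samples = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            task.run();
            samples[i] = usedHeapBytes();
        }
        return samples;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace with the region extraction for one page.
        Runnable work = () -> {
            byte[] scratch = new byte[1 << 20];
            scratch[0] = 1;
        };
        long[] samples = sample(work, 5);
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("round %d: %d KiB used%n", i, samples[i] / 1024);
        }
    }
}
```

Running this once against each PDFBox version with the same -Xmx setting, and checking whether the per-round figures level off or keep climbing, would show whether the memory usage really differs.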



> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Priority: Major
>              Labels: memory-bug
>         Attachments: pdfbox-2.0.27.png, pdfbox-2.0.28.png
>
>
> Given the following simplified Groovy code (for succinctness over Java):
>  
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) /GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) /GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each {rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(
>                     colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }{code}
>  
>  
> PDFBox version 2.0.28 uses ever-increasing memory, but version 2.0.27 does
> not.
> The test.pdf file I am using is an `8-K` filing downloaded from the Apple SEC
> filings page, [https://investor.apple.com/sec-filings/default.aspx], but any
> PDF with 10+ pages and a lot of text will work.
> I have attached profiler screenshots of the difference. 
> Thanks in advance for your help. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
