[ https://issues.apache.org/jira/browse/PDFBOX-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946052#comment-17946052 ]

Tilman Hausherr commented on PDFBOX-4668:
-----------------------------------------

This happens with 2.0:

Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.equals(Object)" because the return value of "org.apache.pdfbox.pdmodel.font.PDFont.getName()" is null
        at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:571)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:382)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:308)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:256)
        at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:403)
        at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:300)
        at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
        at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)



> Add ResourceCacheFactory as global setting
> ------------------------------------------
>
>                 Key: PDFBOX-4668
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4668
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Rendering
>    Affects Versions: 3.0.4 PDFBox, 4.0.0
>            Reporter: Ben Manes
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.5 PDFBox, 4.0.0
>
>         Attachments: 002145.pdf, Screenshot 2023-03-20 at 18.57.40.png, 
> memory.png, threads.png
>
>
> Image rendering is cached by {{DefaultResourceCache}} per-document using soft 
> references. As described in the [FAQ|https://pdfbox.apache.org/2.0/faq.html], 
> this can lead to an {{OutOfMemoryError}} when processing, e.g. many documents 
> in parallel. The configuration of this cache is per-document and it is 
> initialized with the default.
> {code}
> // document-wide cached resources
> private ResourceCache resourceCache = new DefaultResourceCache();
> {code}
> This requires all call sites to be modified to disable it, some of which may 
> be in 3rd party code. The ask is to add a static factory to configure the 
> default globally, which would return a new {{DefaultResourceCache}} when 
> called. This would let a user specify a new static factory, e.g. one that 
> returns a custom cache or {{null}} if disabled.
> Soft references are a problematic caching scheme that degrades poorly. It 
> is very likely that the many large image fragments cause GC promotion 
> (eden=>young=>old), which requires a full GC to collect. Under memory/CPU 
> pressure, the GC can devolve into a death spiral of collecting the minimal 
> heap space to match its pause time constraints, leading to repeated GCs due 
> to soft reference pollution and an eventual OOME. If caching were enabled, it 
> might be preferable for it to be size-based (by rough byte size) and perhaps 
> tied into the {{MemoryUsageSetting}} main memory configuration.
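
A minimal sketch of what the requested global setting could look like. Everything
here is an assumption made for illustration: the holder class
ResourceCacheFactory, its setFactory/create methods, and the reduced
ResourceCache, DefaultResourceCache and Document types are stand-ins so the
example compiles on its own. It is not the API that was added to PDFBox.

```java
import java.util.Objects;
import java.util.function.Supplier;

class ResourceCacheFactoryDemo {

    /** Stand-in for PDFBox's ResourceCache, reduced to a marker interface. */
    interface ResourceCache { }

    /** Stand-in for PDFBox's DefaultResourceCache. */
    static final class DefaultResourceCache implements ResourceCache { }

    /** Hypothetical global setting: holds the supplier used for new documents. */
    static final class ResourceCacheFactory {
        // volatile so a factory installed at startup is visible to worker threads
        private static volatile Supplier<ResourceCache> factory = DefaultResourceCache::new;

        private ResourceCacheFactory() { }

        /** Installs a new factory; the supplier itself may return null. */
        static void setFactory(Supplier<ResourceCache> newFactory) {
            factory = Objects.requireNonNull(newFactory, "newFactory");
        }

        /** Returns a cache for a new document, or null if caching is disabled. */
        static ResourceCache create() {
            return factory.get();
        }
    }

    /** Stand-in for a document that asks the global factory for its cache. */
    static final class Document {
        // replaces "new DefaultResourceCache()" at the field initializer
        private final ResourceCache resourceCache = ResourceCacheFactory.create();

        ResourceCache getResourceCache() {
            return resourceCache;
        }
    }

    public static void main(String[] args) {
        // Default behaviour is unchanged: every document gets its own cache.
        Document first = new Document();
        System.out.println(first.getResourceCache() instanceof DefaultResourceCache);

        // Disable caching once, globally, without touching any call site.
        ResourceCacheFactory.setFactory(() -> null);
        Document second = new Document();
        System.out.println(second.getResourceCache() == null);
    }
}
```

With this shape, code that creates documents (including 3rd party code) needs
no changes; only the application's startup path calls setFactory.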



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
