[jira] [Updated] (PDFBOX-6010) PDF Image Extraction resulting in an infinite recursion

Kabir Soneja (Jira) Thu, 15 May 2025 16:18:05 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kabir Soneja updated PDFBOX-6010:
---------------------------------
    Description: 
Hi,

I am working on extracting images from a PDF using pdfbox version 2.0.34. While 
doing so we have our own recursive logic to recurse through all PDResources for 
each page and within each page we check for all the objects to filter out 
images. This recursive logic has a max depth of 25 to avoid infinite recursion.

When trying out the image extraction for the same PDF using the CLI, the image 
is extracted within a second indicating that the image extraction logic within 
the pdfbox source code is handling image extraction using an 
ImageGraphicsEngine defined within the source code.

Can you help me understand:
 * To handle image extraction, is there any API directly provided by 
PDFBox?
 * Is there any way to reuse the image extraction logic within the source code, 
i.e., is it exposed as a public API?
 * Any other suggestions to handle image extraction gracefully with/without 
recursion?
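
For the last question, one common alternative to a fixed depth limit is to walk the resources iteratively and remember which containers have already been visited. The sketch below is only an illustration of that technique and does not use PDFBox itself: `Node` and `collectImages` are hypothetical names. In PDFBox 2.0 terms, a `Node` would roughly correspond to a `PDResources` dictionary, its image entries to `PDImageXObject` instances, and its children to the resources of nested `PDFormXObject` instances.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class CycleSafeWalk {

    /** Hypothetical stand-in for a resource dictionary; not a PDFBox class. */
    static final class Node {
        final List<String> images = new ArrayList<>(); // image names held directly
        final List<Node> children = new ArrayList<>(); // nested containers (may form cycles)
    }

    /** Collects image names reachable from root, visiting each node at most once. */
    static List<String> collectImages(Node root) {
        List<String> found = new ArrayList<>();
        // Identity-based set: nodes are tracked by reference, not by content.
        Set<Node> visited =
                Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node current = pending.pop();
            if (!visited.add(current)) {
                continue; // already processed: shared or self-referencing container
            }
            found.addAll(current.images);
            for (Node child : current.children) {
                pending.push(child);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node page = new Node();
        Node form = new Node();
        page.images.add("Im0");
        form.images.add("Im1");
        page.children.add(form);
        form.children.add(page); // cycle back to the page
        System.out.println(collectImages(page)); // prints [Im0, Im1]
    }
}
```

Because every container is processed at most once, the walk terminates even when a form refers back to one of its ancestors, and shared resources are not scanned repeatedly. An explicit stack also avoids deep call chains on heavily nested documents.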

  was:
Hi,

I am working on extracting images from a PDF using pdfbox version 2.0.34. While 
doing so we have our own recursive logic to recurse through all PDResources for 
each page and within each page we check for all the objects to filter out 
images. This recursive logic has a max depth of 25 to avoid infinite recursion.

When trying out the image extraction for the same PDF using the CLI, the image 
is extracted within a second indicating that the image extraction logic within 
the pdfbox source code is handling image extraction using an 
ImageGraphicsEngine defined within the source code.


 * To handle image extraction, is there any API directly provided by 
PDFBox?
 * Is there any way to reuse the image extraction logic within the source code, 
i.e., is it exposed as a public API?
 * Any other suggestions to handle image extraction gracefully with/without 
recursion?


> PDF Image Extraction resulting in an infinite recursion
> -------------------------------------------------------
>
>                 Key: PDFBOX-6010
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6010
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Kabir Soneja
>            Priority: Major
>
> Hi,
> I am working on extracting images from a PDF using pdfbox version 2.0.34. 
> While doing so we have our own recursive logic to recurse through all 
> PDResources for each page and within each page we check for all the objects 
> to filter out images. This recursive logic has a max depth of 25 to avoid 
> infinite recursion.
> When trying out the image extraction for the same PDF using the CLI, the 
> image is extracted within a second indicating that the image extraction logic 
> within the pdfbox source code is handling image extraction using an 
> ImageGraphicsEngine defined within the source code.
> Can you help me understand:
>  * To handle image extraction, is there any API directly provided by 
> PDFBox?
>  * Is there any way to reuse the image extraction logic within the source 
> code, i.e., is it exposed as a public API?
>  * Any other suggestions to handle image extraction gracefully with/without 
> recursion?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
