[ 
https://issues.apache.org/jira/browse/PDFBOX-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-6031.
-----------------------------------
    Resolution: Won't Fix

> PDFStreamEngine: inconsistent processPage behaviour in multithreading
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-6031
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6031
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Zer Jun Eng
>            Priority: Blocker
>         Attachments: Catalogo_Egitto_2025.pdf, image-2025-07-07-22-35-15-823.png
>
>
> Dear PDFBox developers,
> I modified the
> [PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java]
> example to count the number of unique images in a PDF document. The minimal
> reproducible example is below:
> {code:java}
> import java.io.File;
> import java.io.IOException;
> import java.util.List;
> import java.util.Set;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.contentstream.PDFStreamEngine;
> import org.apache.pdfbox.contentstream.operator.DrawObject;
> import org.apache.pdfbox.contentstream.operator.Operator;
> import org.apache.pdfbox.contentstream.operator.OperatorName;
> import org.apache.pdfbox.contentstream.operator.state.Concatenate;
> import org.apache.pdfbox.contentstream.operator.state.Restore;
> import org.apache.pdfbox.contentstream.operator.state.Save;
> import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
> import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.cos.COSObjectKey;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> /**
>  * Adapted from
>  * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
>  */
> public class CountUniqueImages {
>   private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();
>   public int countUniqueImages(File file, int nThreads)
>       throws IOException, InterruptedException {
>     try (PDDocument document = Loader.loadPDF(file);
>         ExecutorService executor = Executors.newFixedThreadPool(nThreads)) {
>       for (PDPage page : document.getPages()) {
>         ImageEngine imageEngine = new ImageEngine(page);
>         executor.submit(imageEngine);
>       }
>       executor.shutdown();
>       executor.awaitTermination(1, TimeUnit.MINUTES);
>       return uniqueImageKeys.size();
>     }
>   }
>   final class ImageEngine extends PDFStreamEngine implements Callable<Object> {
>     private static final Object DONE = new Object();
>     private final PDPage page;
>     public ImageEngine(PDPage page) {
>       this.page = page;
>       addOperator(new Concatenate(this));
>       addOperator(new DrawObject(this));
>       addOperator(new SetGraphicsStateParameters(this));
>       addOperator(new Save(this));
>       addOperator(new Restore(this));
>       addOperator(new SetMatrix(this));
>     }
>     @Override
>     protected void processOperator(Operator operator, List<COSBase> operands)
>         throws IOException {
>       String operation = operator.getName();
>       if (OperatorName.DRAW_OBJECT.equals(operation)) {
>         COSName objectName = (COSName) operands.get(0);
>         PDXObject xobject = getResources().getXObject(objectName);
>         if (xobject instanceof PDImageXObject) {
>           PDImageXObject imageXObj = (PDImageXObject) xobject;
>           COSObjectKey key = imageXObj.getCOSObject().getKey();
>           uniqueImageKeys.add(key);
>         } else if (xobject instanceof PDFormXObject) {
>           PDFormXObject form = (PDFormXObject) xobject;
>           showForm(form);
>         }
>       } else {
>         super.processOperator(operator, operands);
>       }
>     }
>     @Override
>     public Object call() throws Exception {
>       processPage(page);
>       return DONE;
>     }
>   }
> }
> {code}
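
In the reproducer above, the Future returned by executor.submit(...) is discarded, so any exception thrown inside a worker stays in that Future and is never reported. This is standard java.util.concurrent behaviour, not something specific to PDFBox. Below is a minimal JDK-only sketch of how such failures could be surfaced. The class name SurfaceWorkerFailures, the helper countFailures and the deliberately failing task are hypothetical stand-ins for the per-page ImageEngine tasks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SurfaceWorkerFailures {
  /** Runs all tasks on nThreads threads and returns how many of them threw. */
  static int countFailures(List<Callable<Object>> tasks, int nThreads)
      throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(nThreads);
    try {
      // Keep every Future instead of dropping the return value of submit().
      List<Future<Object>> futures = new ArrayList<>();
      for (Callable<Object> task : tasks) {
        futures.add(executor.submit(task));
      }
      int failures = 0;
      for (Future<Object> future : futures) {
        try {
          // get() blocks until the task is done and rethrows a task
          // exception wrapped in an ExecutionException.
          future.get();
        } catch (ExecutionException e) {
          failures++;
        }
      }
      return failures;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    List<Callable<Object>> tasks = new ArrayList<>();
    tasks.add(() -> "ok");
    // Hypothetical failure standing in for an exception from processPage.
    tasks.add(() -> { throw new IllegalStateException("hypothetical page failure"); });
    System.out.println("failed tasks: " + countFailures(tasks, 2));
  }
}
```

Applied to the reproducer, a check like this would show whether a low image count coincides with exceptions in the worker threads; the issue text itself does not say.
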
> Below is the JUnit test to verify the correctness of the multithreaded 
> implementation. I have also attached the PDF file used for testing:
> {code:java}
> import static org.junit.jupiter.api.Assertions.*;
> import java.io.File;
> import java.io.IOException;
> import org.junit.jupiter.api.Test;
> class CountUniqueImagesTest {
>   @Test
>   void testSingleThreaded() throws IOException, InterruptedException {
>     CountUniqueImages counter = new CountUniqueImages();
>     int count =
>         counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
>     assertEquals(122, count);
>   }
>   @Test
>   void testMultiThreaded() throws IOException, InterruptedException {
>     CountUniqueImages counter = new CountUniqueImages();
>     int count =
>         counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
>     assertEquals(122, count);
>   }
> }
> {code}
> I am getting inconsistent results when using multithreading. The PDF file is 
> expected to contain 122 unique images. Out of 100 test runs, the 
> multithreaded test case fails 19 times. In those cases, the code does not 
> correctly count the number of unique images.
> !image-2025-07-07-22-35-15-823.png!
> I have read the 
> [FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I 
> understand that PDFBox is not thread-safe. Therefore, this issue might be 
> related to or a duplicate of 
> https://issues.apache.org/jira/browse/PDFBOX-5541 or 
> https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still 
> wondering if this might be a bug, since my code only performs read-only 
> operations.
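
On the read-only question: a call that looks like a read can still write to shared state if the object loads or caches data on first access. Whether that is what happens in this reproducer is an assumption; the issue text does not confirm the cause. Below is a generic JDK-only sketch of the pattern. LazyReadDemo and LazyValue are hypothetical names, not PDFBox classes:

```java
import java.util.function.Supplier;

class LazyReadDemo {
  /** Hypothetical lazily loaded value; the first get() mutates the object. */
  static final class LazyValue<T> {
    private final Supplier<T> loader;
    private T value;       // written on the first get()
    private int loadCount; // how many times the loader ran

    LazyValue(Supplier<T> loader) {
      this.loader = loader;
    }

    /** Looks like a read, but the first call writes value and loadCount. */
    T get() {
      // Unsynchronized check-then-act: two threads arriving here together
      // may both see null and both run the loader.
      if (value == null) {
        value = loader.get();
        loadCount++;
      }
      return value;
    }

    int loadCount() {
      return loadCount;
    }
  }

  public static void main(String[] args) {
    LazyValue<String> lazy = new LazyValue<>(() -> "image-stream");
    System.out.println("loads before any read: " + lazy.loadCount());
    lazy.get();
    lazy.get();
    System.out.println("loads after two reads: " + lazy.loadCount());
  }
}
```

Run on a single thread, the loader executes once and later reads reuse the cached value. Without synchronization, that guarantee no longer holds when several threads call get() on the same instance.
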



