[ https://issues.apache.org/jira/browse/PDFBOX-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-6031.
-----------------------------------
    Resolution: Won't Fix

> PDFStreamEngine: inconsistent processPage behaviour in multithreading
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-6031
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6031
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Zer Jun Eng
>            Priority: Blocker
>         Attachments: Catalogo_Egitto_2025.pdf, image-2025-07-07-22-35-15-823.png
>
>
> Dear PDFBox developers,
>
> I modified the
> [PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java]
> example to count the number of unique images in a PDF document. The minimal
> reproducible code is below:
> {code:java}
> import java.io.File;
> import java.io.IOException;
> import java.util.List;
> import java.util.Set;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.contentstream.PDFStreamEngine;
> import org.apache.pdfbox.contentstream.operator.DrawObject;
> import org.apache.pdfbox.contentstream.operator.Operator;
> import org.apache.pdfbox.contentstream.operator.OperatorName;
> import org.apache.pdfbox.contentstream.operator.state.Concatenate;
> import org.apache.pdfbox.contentstream.operator.state.Restore;
> import org.apache.pdfbox.contentstream.operator.state.Save;
> import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
> import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.cos.COSObjectKey;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>
> /**
>  * Adapted from
>  * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
>  */
> public class CountUniqueImages {
>     private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();
>
>     public int countUniqueImages(File file, int nThreads) throws IOException, InterruptedException {
>         try (PDDocument document = Loader.loadPDF(file);
>              ExecutorService executor = Executors.newFixedThreadPool(nThreads)) {
>             for (PDPage page : document.getPages()) {
>                 ImageEngine imageEngine = new ImageEngine(page);
>                 executor.submit(imageEngine);
>             }
>             executor.shutdown();
>             executor.awaitTermination(1, TimeUnit.MINUTES);
>             return uniqueImageKeys.size();
>         }
>     }
>
>     final class ImageEngine extends PDFStreamEngine implements Callable<Object> {
>         private static final Object DONE = new Object();
>         private final PDPage page;
>
>         public ImageEngine(PDPage page) {
>             this.page = page;
>             addOperator(new Concatenate(this));
>             addOperator(new DrawObject(this));
>             addOperator(new SetGraphicsStateParameters(this));
>             addOperator(new Save(this));
>             addOperator(new Restore(this));
>             addOperator(new SetMatrix(this));
>         }
>
>         @Override
>         protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
>             String operation = operator.getName();
>             if (OperatorName.DRAW_OBJECT.equals(operation)) {
>                 COSName objectName = (COSName) operands.get(0);
>                 PDXObject xobject = getResources().getXObject(objectName);
>                 if (xobject instanceof PDImageXObject) {
>                     PDImageXObject imageXObj = (PDImageXObject) xobject;
>                     COSObjectKey key = imageXObj.getCOSObject().getKey();
>                     uniqueImageKeys.add(key);
>                 } else if (xobject instanceof PDFormXObject) {
>                     PDFormXObject form = (PDFormXObject) xobject;
>                     showForm(form);
>                 }
>             } else {
>                 super.processOperator(operator, operands);
>             }
>         }
>
>         @Override
>         public Object call() throws Exception {
>             processPage(page);
>             return DONE;
>         }
>     }
> }
> {code}
> Below is the JUnit test to verify the correctness of the multithreaded
> implementation. I have also attached the PDF file used for testing:
> {code:java}
> import static org.junit.jupiter.api.Assertions.*;
> import java.io.File;
> import java.io.IOException;
> import org.junit.jupiter.api.Test;
>
> class CountUniqueImagesTest {
>     @Test
>     void testSingleThreaded() throws IOException, InterruptedException {
>         CountUniqueImages counter = new CountUniqueImages();
>         int count =
>             counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
>         assertEquals(122, count);
>     }
>
>     @Test
>     void testMultiThreaded() throws IOException, InterruptedException {
>         CountUniqueImages counter = new CountUniqueImages();
>         int count =
>             counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
>         assertEquals(122, count);
>     }
> }
> {code}
> I am getting inconsistent results when using multithreading. The PDF file is
> expected to contain 122 unique images. Out of 100 test runs, the
> multithreaded test case fails 19 times. In those cases, the code does not
> correctly count the number of unique images.
>
> !image-2025-07-07-22-35-15-823.png!
>
> I have read the
> [FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I
> understand that PDFBox is not thread-safe. Therefore, this issue might be
> related to or a duplicate of
> https://issues.apache.org/jira/browse/PDFBOX-5541 or
> https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still
> wondering if this might be a bug, since my code only performs read-only
> operations.
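
A side note on the reproducer above: the Future returned by executor.submit(...) is discarded and the boolean returned by awaitTermination(...) is ignored, so an exception thrown inside a worker thread, or a timeout, would not show up in the test output. The following is a minimal sketch of how such failures could be surfaced. It uses only java.util.concurrent; the class name SurfaceWorkerFailures, the method countFailures, and the stand-in tasks are hypothetical and are not part of PDFBox or of the reporter's code. It does not claim that a swallowed exception is the cause of the reported behaviour, it only makes one visible if it occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SurfaceWorkerFailures {

    /**
     * Runs all tasks on a fixed-size pool and returns how many of them
     * ended with an exception. Hypothetical helper for illustration only.
     */
    static int countFailures(List<Callable<Object>> tasks, int nThreads)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            // invokeAll blocks until every task has completed or failed.
            List<Future<Object>> futures = executor.invokeAll(tasks);
            int failures = 0;
            for (Future<Object> future : futures) {
                try {
                    // get() rethrows a worker exception wrapped in ExecutionException.
                    future.get();
                } catch (ExecutionException e) {
                    failures++;
                    System.err.println("task failed: " + e.getCause());
                }
            }
            return failures;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in tasks; in the reproducer these would be the per-page engines.
        List<Callable<Object>> tasks = new ArrayList<>();
        tasks.add(() -> "ok");
        tasks.add(() -> { throw new java.io.IOException("simulated page failure"); });
        tasks.add(() -> "ok");
        System.out.println("failed tasks: " + countFailures(tasks, 2));
    }
}
```

In the reproducer, the same idea would mean keeping the futures returned for each ImageEngine and calling get() on them before reading uniqueImageKeys.size(), so that a failing page is reported instead of silently lowering the count.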