Hello All, I have been using PDFBox 2.0.27 to generate accessible PDFs. The PDF contains a table with several hundred rows and 6 columns, and each cell in the table is added as marked content.
As the number of rows increases, I noticed a spike in response times. Profiling the process showed that the main consumer of CPU time was the call to begin marked content at
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDPageContentStream.java#L2302

Looking into this a bit further, the time is spent when the property list is added to the resources object, where we first check whether the property already exists in the map before adding it. As the number of properties in the map grows, this check takes more and more CPU time.
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java#L701

I also noticed that 2.0.27 updated the resources object to use a LinkedHashMap instead of a SmallMap, which has greatly improved performance in this area in our case. However, we are still looking to reduce response times further.

Since we add every table cell as marked content, we need a new property list for each marked content sequence, so in our case the check for an existing property list offers very little value for the resources it consumes.

To work around this, I looked at ways of bypassing the check and noticed a deprecated method
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDPageContentStream.java#L2287
which allows us to pass in a COSName and manually add the property list to the resources tree.

What would be the best way forward for us in this case?

1. Request that the beginMarkedContentSequence method be un-deprecated, and that some methods be exposed in PDResources to simplify adding resources to the map, or
2. Request a new overloaded beginMarkedContent method that lets us pass in a boolean flag to skip the check for whether a property list already exists in the resource tree.

As we work under stringent data controls, I am unable to share the PDF or the profiling details, but if you need any further information please let me know. Alternatively, I can raise a ticket on the JIRA tracker; I just wanted to check here first before doing that.

Many thanks for your help.
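
P.S. To make the cost I am describing concrete, here is a small stand-alone model of the pattern. It is not PDFBox code: the class ResourceMapSketch, the methods addChecked and addUnchecked, and the "Prop" key prefix are all hypothetical names I made up for illustration. It only assumes what I described above, namely that the checked path looks through the existing entries before inserting, while an unchecked path would insert under a fresh key straight away.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Simplified model (NOT PDFBox code) of adding property lists to a
 * resource map. All names here are hypothetical.
 */
public class ResourceMapSketch {
    private final Map<String, Object> properties = new LinkedHashMap<>();
    private int counter = 0;

    /**
     * Checked add: scans every existing value for the same object before
     * inserting. Each call costs O(n), so adding n distinct property lists
     * costs O(n^2) overall.
     */
    public String addChecked(Object propertyList) {
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getValue() == propertyList) {
                return entry.getKey(); // already registered, reuse its key
            }
        }
        return addUnchecked(propertyList);
    }

    /**
     * Unchecked add: the caller guarantees the property list is new, so we
     * only generate an unused key and insert. No scan over the values.
     */
    public String addUnchecked(Object propertyList) {
        String key;
        do {
            counter++;
            key = "Prop" + counter;
        } while (properties.containsKey(key)); // hash lookup, not a scan
        properties.put(key, propertyList);
        return key;
    }

    public int size() {
        return properties.size();
    }
}
```

In our table case every cell brings its own property list, so the scan in the checked path never finds a match, and the unchecked path would give the same result with much less work.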