Axel Howind created PDFBOX-6002:
-----------------------------------

             Summary: change parse methods to take CharSequence argument
                 Key: PDFBOX-6002
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6002
             Project: PDFBox
          Issue Type: Improvement
            Reporter: Axel Howind
         Attachments: image-2025-05-02-07-00-52-161.png

PDFBox parsing works on Strings in almost all places. Often, StringBuilder 
instances are created to prepare a fragment to parse, and then another parse 
method is called using the result of calling toString() on the StringBuilder. 
If the parse methods were changed to take CharSequence instead, the 
StringBuilder instance could be passed on without creating a temporary String 
instance. This would reduce memory consumption and load on the GC.
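
To illustrate the idea, here is a minimal sketch. It is not PDFBox code: the class and method names are hypothetical, and it assumes Java 9 or later for Long.parseLong(CharSequence, int, int, int). The first method forces the caller to materialize a String, while the second accepts the StringBuilder directly.

```java
public class CharSequenceParseSketch {
    // Hypothetical String-based parser: a caller holding a StringBuilder
    // has to call toString() first, which allocates a temporary String.
    public static long parseLongFromString(String s) {
        return Long.parseLong(s);
    }

    // Hypothetical CharSequence-based parser: accepts String, StringBuilder,
    // CharBuffer and so on without an intermediate copy.
    public static long parseLongFromChars(CharSequence cs) {
        // Long.parseLong(CharSequence, int, int, int) is available since Java 9.
        return Long.parseLong(cs, 0, cs.length(), 10);
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder();
        buf.append('-').append("12").append('3');

        long viaString = parseLongFromString(buf.toString()); // temporary String
        long viaChars = parseLongFromChars(buf);              // no temporary String

        System.out.println(viaString + " " + viaChars); // prints "-123 -123"
    }
}
```

Both calls return the same value; only the second avoids the extra allocation.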

I did some profiling with async-profiler. In BaseParser.parseCOSNumber(), for 
example, about 25% of the runtime is spent in StringBuilder.toString(), which 
would be eliminated completely if the parse methods worked on CharSequences 
instead of Strings (see image):


!image-2025-05-02-07-00-52-161.png!

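A consequence is that user code would need to be recompiled against the new 
version, without any source changes on the user side: widening a parameter from 
String to CharSequence is source compatible but not binary compatible, because 
the method signature changes.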
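
The recompilation requirement can be demonstrated with a small sketch. The class NewApi and its parse method are hypothetical stand-ins, not PDFBox classes. Compiled call sites refer to a method by its exact parameter types, and the reflective lookup below mimics that: after the change, no method taking exactly a String exists any more, even though a String argument still compiles.

```java
public class SignatureChangeSketch {
    // Hypothetical "new version" of an API whose parameter was widened
    // from String to CharSequence.
    public static class NewApi {
        public static int parse(CharSequence cs) {
            return Integer.parseInt(cs, 0, cs.length(), 10);
        }
    }

    // Returns true if a public method with exactly this parameter type exists.
    // Class.getMethod matches formal parameter types exactly, similar to how
    // already compiled callers are linked by method descriptor.
    public static boolean hasExactMethod(Class<?> owner, String name, Class<?> paramType) {
        try {
            owner.getMethod(name, paramType);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Source level: passing a String still compiles against parse(CharSequence).
        System.out.println(NewApi.parse("17"));                                      // prints 17
        // Binary level: the old descriptor parse(String) is gone.
        System.out.println(hasExactMethod(NewApi.class, "parse", String.class));       // prints false
        System.out.println(hasExactMethod(NewApi.class, "parse", CharSequence.class)); // prints true
    }
}
```

A caller compiled against the old parse(String) would fail at link time with NoSuchMethodError until it is recompiled.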

An alternative approach is to introduce new methods with the suffix CS, like 
parseCOSNumberCS(), and to have parseCOSNumber() delegate to the new method. 
This would be a PDFBox 3 compatible change.
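
A rough sketch of the delegation approach, using simplified hypothetical methods rather than the real PDFBox signatures (Java 9 or later is assumed for the CharSequence overload of Long.parseLong):

```java
public class DelegatingParserSketch {
    // New method, named with the proposed CS suffix. Hypothetical signature.
    public static long parseNumberCS(CharSequence cs) {
        return Long.parseLong(cs, 0, cs.length(), 10);
    }

    // Existing String-based method keeps its signature, so callers compiled
    // against the old version continue to link. It only forwards the call.
    public static long parseNumber(String s) {
        return parseNumberCS(s);
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("2025");
        System.out.println(parseNumber("2025"));  // old entry point, prints 2025
        System.out.println(parseNumberCS(buf));   // new entry point, prints 2025
    }
}
```

Existing callers are unaffected, while internal code that already holds a StringBuilder can switch to the CS variant step by step.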

Please let me know whether you would accept such a patch and, if so, which of 
the two variants. I would then create incremental patches to provide this 
functionality.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

