Hi,

No opinion re release schedule but a comment on the PDFBox update:

tl;dr: ignore the PDF differences this time.

The new version includes the /ActualText support:
https://issues.apache.org/jira/browse/PDFBOX-5868

It is always enabled. In most cases the extraction is better. But sometimes content is lost because the feature is used for obfuscation (see example in the issue above).

Another major change is the detection of the space width:
https://issues.apache.org/jira/browse/PDFBOX-5920
It has been improved; however, this will result in many differences with angled text if angle detection isn't enabled. Some scientific texts with a superscript prefix will also look different: "1 Coupled" will extract as "1Coupled". This is because these fonts don't have a space and the fallback we are using sucks.
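
To illustrate why a poor fallback swallows the space, here is a toy sketch. It is not the PDFBox code; the names (Run, join_runs, gap_factor) and all numbers are made up for illustration. It assumes a space is inserted whenever the gap between two text runs exceeds some fraction of the font's space width, and that a fixed fallback width is used when the font has no space.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    """A piece of text with its horizontal position (hypothetical units)."""
    text: str
    x: float      # left edge of the run
    width: float  # horizontal extent of the run


def join_runs(runs: List[Run],
              space_width: Optional[float],
              fallback_space_width: float,
              gap_factor: float = 0.5) -> str:
    """Join runs, adding a space where the gap exceeds gap_factor * space width.

    If the font reports no usable space width, the fallback width is used.
    """
    has_space = space_width is not None and space_width > 0
    effective = space_width if has_space else fallback_space_width
    threshold = gap_factor * effective

    out = []
    prev_end = None
    for run in runs:
        if prev_end is not None and run.x - prev_end > threshold:
            out.append(" ")
        out.append(run.text)
        prev_end = run.x + run.width
    return "".join(out)


# A small superscript "1" followed by "Coupled", with a gap of 2.0 units.
runs = [Run("1", 0.0, 3.0), Run("Coupled", 5.0, 40.0)]

# The font has a space of width 3.0: threshold is 1.5, so the gap counts.
with_space = join_runs(runs, space_width=3.0, fallback_space_width=6.0)

# The font has no space: the fallback of 6.0 gives a threshold of 3.0,
# which is larger than the gap, so the space is dropped.
without_space = join_runs(runs, space_width=None, fallback_space_width=6.0)
```

With these made-up numbers the first call keeps the space and the second one loses it, which is the kind of difference you will see in the reports.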

Tilman

On 16.01.2025 14:20, Tim Allison wrote:
Sorry, on second thought, a small tweak:

I propose that we release 3.1.0 after PDFBox 3.x is released. I further
propose that we make a 2.9.3 release at some point after the 3.1.0 release
IF we get requests for a 2.x release...otherwise we'll do a final 2.x EOL
release in April, 2025.

On Thu, Jan 16, 2025 at 8:15 AM Tim Allison <talli...@apache.org> wrote:

All,
   It has been a while since we last released 2.x (April 2024) and 3.x
(October 2024). We've had a number of dependency updates. PDFBox is on the
cusp of a new 3.x release.
   I propose that we release 3.1.0 after PDFBox 3.x is released and that we
make a 2.9.3 release the following week.
   WDYT?

             Best,

                  Tim

