I'm going to rerun the eval on 2.9.3-rc1 after I cherry-picked the CSV fixes today.

On Mon, Jan 27, 2025 at 1:25 PM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>
> that's wonderful. thanks for that.
> i'm concentrating on finishing up tika-pipes so i can get the removal PR
> started.
> getting very close - maybe set up a zoom sometime to chat
>
> On Mon, Jan 27, 2025 at 9:50 AM Tim Allison <talli...@apache.org> wrote:
>
> > I'm kicking off the regression tests for 3.x.
> >
> > Nicholas, I merged TIKA-4303 and cherry-picked it back to 3.x. I hope
> > that's ok.
> >
> > On Fri, Jan 24, 2025 at 2:25 AM Tilman Hausherr <thaush...@t-online.de> wrote:
> >
> > > Hi,
> > >
> > > No opinion re release schedule but a comment on the PDFBox update:
> > >
> > > tl;dr: ignore the PDF differences this time.
> > >
> > > The new version includes the /ActualText support:
> > > https://issues.apache.org/jira/browse/PDFBOX-5868
> > >
> > > It is always enabled. In most cases the extraction is better. But
> > > sometimes content is lost because the feature is used for obfuscation
> > > (see example in the issue above).
> > >
> > > Another major change is the detection of the space width:
> > > https://issues.apache.org/jira/browse/PDFBOX-5920
> > > It has been improved, however this will result in many differences with
> > > angled texts if angle detection isn't enabled. Some scientific texts
> > > with superscript prefix will also look different, "1 Coupled" will
> > > extract as "1Coupled". This is because these fonts don't have a space
> > > and the fallback we are using sucks.
> > >
> > > Tilman
> > >
> > > On 16.01.2025 14:20, Tim Allison wrote:
> > > > Sorry, on second thought, a small tweak:
> > > >
> > > > I propose that we release 3.1.0 after PDFBox 3.x is released. I further
> > > > propose that we make a 2.9.3 release at some point after the 3.1.0 release
> > > > IF we get requests for a 2.x release...otherwise we'll do a final 2.x EOL
> > > > release in April, 2025.
> > > >
> > > > On Thu, Jan 16, 2025 at 8:15 AM Tim Allison <talli...@apache.org> wrote:
> > > >
> > > >> All,
> > > >> It has been a while since we last released 2.x (April 2024) and 3.x
> > > >> (October 2024). We've had a number of dependency updates. PDFBox is on the
> > > >> cusp of a new 3.x release.
> > > >> I propose that we release 3.1.0 after PDFBox 3.x is released and that we
> > > >> make a 2.9.3 release the following week.
> > > >> WDYT?
> > > >>
> > > >> Best,
> > > >>
> > > >> Tim