Re: Same line calculation of PDFTextStripper

2025-05-16 Thread Kevin Day
Tilman- Yes, please - I would like to have these 180 files. Here is a secure upload link: https://trumpet.sharefile.com/r-rc2276979734447478d58b9ae23549d3e Would you prefer to continue this conversation in JIRA, or is this the most appropriate place? I am almost certainly going to need to initi

Re: Same line calculation of PDFTextStripper

2025-05-16 Thread Tilman Hausherr
On 12.05.2025 21:37, Kevin Day wrote: Are there test files that exercise the superscript/subscript correction that the non-transitive comparator is supposed to address? Besides the 25 test files I have a test set of about 180 text files with their results that I use to check changes. Some

Re: Same line calculation of PDFTextStripper

2025-05-13 Thread sahy...@fileaffairs.de
There is no public test suite other than the files attached to Jira tickets. There is a suite of some 1 files that we use for regression tests before new releases and when making larger changes, but it cannot be shared due to data privacy, licensing ... (it has been public but with new act

Re: Same line calculation of PDFTextStripper

2025-05-12 Thread Kevin Day
Are there test files that exercise the superscript/subscript correction that the non-transitive comparator is supposed to address? And is there some way that I can get access to the test suite that includes 2991? I can copy the file down from the Jira ticket, but I hate to do a ton of development

Re: Same line calculation of PDFTextStripper

2025-04-10 Thread Tilman Hausherr
On 09.04.2025 16:36, Kevin Day wrote: Understood. My biggest comment is that having a non-transitive comparator in a sort algorithm is a really bad idea. It produces all sorts of non-deterministic behavior. So I'm in agreement that a better solution is needed. Do you have any history of why t

Re: Same line calculation of PDFTextStripper

2025-04-10 Thread Kevin Day
I had one other thought on this. Without question, the ordering of the TextPositions after the JRE sort completes is not consistent with the comparator. It should be easy to just loop the sorted TPs and check to ensure the comparator always returns <=0. I'm wondering if the slower fallback sort w
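
The check proposed in this message (loop over the sorted TPs and confirm the comparator never reports an earlier element as greater than a later one) could look roughly like the sketch below. It is generic Java and assumes nothing about PDFBox's classes. `SortConsistencyCheck`, `firstAdjacentViolation` and `consistentForAllPairs` are hypothetical names used only for illustration. The adjacent-pair loop is the cheap check. Because a non-transitive comparator can pass it and still disagree on non-adjacent pairs, a stricter O(n^2) variant is included as well.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class SortConsistencyCheck {

    // Returns the index i of the first adjacent pair (i, i+1) that the
    // comparator reports as out of order, or -1 if there is none.
    static <T> int firstAdjacentViolation(List<T> sorted, Comparator<? super T> cmp) {
        for (int i = 0; i + 1 < sorted.size(); i++) {
            if (cmp.compare(sorted.get(i), sorted.get(i + 1)) > 0) {
                return i;
            }
        }
        return -1;
    }

    // Stricter O(n^2) check: every earlier element must compare <= 0
    // against every later element, not just its direct neighbour.
    static <T> boolean consistentForAllPairs(List<T> sorted, Comparator<? super T> cmp) {
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                if (cmp.compare(sorted.get(i), sorted.get(j)) > 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Deliberately mis-ordered input standing in for a "sorted" list.
        List<Integer> values = Arrays.asList(1, 3, 2, 5);
        Comparator<Integer> natural = Comparator.naturalOrder();
        System.out.println(firstAdjacentViolation(values, natural));
        System.out.println(consistentForAllPairs(values, natural));
    }
}
```

For the sample list in `main`, the pair at index 1 (3 before 2) is the first violation, so the all-pairs check fails too.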

Re: Same line calculation of PDFTextStripper

2025-04-09 Thread Kevin Day
Thank you for directing me to the discussion. This is pretty much what I expected (the reason for the fuzzy logic is superscript/subscript handling). I am pretty confident that the problem is not with the comparator. The problem is that we are trying to use a simple sort algorithm to do somethin

Re: Same line calculation of PDFTextStripper

2025-04-09 Thread Kevin Day
Understood. My biggest comment is that having a non-transitive comparator in a sort algorithm is a really bad idea. It produces all sorts of non-deterministic behavior. So I'm in agreement that a better solution is needed. Do you have any history of why the fuzzy logic is in that comparator? S
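
To make the non-transitivity point concrete, here is a minimal sketch of a fuzzy comparator of the kind under discussion. It is an assumption-laden toy and not the comparator PDFBox uses: it compares plain `double` values (think of them as vertical positions) and treats any two values within a tolerance as equal. `FuzzyComparatorDemo`, `TOLERANCE` and `violatesContract` are hypothetical names. The `Comparator` contract requires that `compare(x, y) == 0` implies `compare(x, z)` and `compare(y, z)` have the same sign for every `z`, and the toy breaks exactly that rule.

```java
import java.util.Comparator;

// Toy illustration only; not PDFBox's comparator.
final class FuzzyComparatorDemo {

    // Assumed tolerance: values closer than this count as "the same line".
    static final double TOLERANCE = 1.0;

    // Fuzzy ordering: equal within the tolerance, otherwise numeric order.
    static final Comparator<Double> FUZZY = (a, b) ->
            Math.abs(a - b) <= TOLERANCE ? 0 : Double.compare(a, b);

    // True if x and y compare as equal but disagree about how they relate to z,
    // which the Comparator contract forbids.
    static boolean violatesContract(double x, double y, double z) {
        return FUZZY.compare(x, y) == 0
                && Integer.signum(FUZZY.compare(x, z)) != Integer.signum(FUZZY.compare(y, z));
    }

    public static void main(String[] args) {
        // 0.0 ~ 0.75 and 0.75 ~ 1.5, yet 0.0 < 1.5: equality is not transitive.
        System.out.println(violatesContract(0.0, 0.75, 1.5));
    }
}
```

With inputs like these, the result of a sort can depend on the order in which the algorithm happens to compare elements, which is one way the kind of inconsistent output described in this message can arise.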

Re: Same line calculation of PDFTextStripper

2025-04-08 Thread Tilman Hausherr
Oops, the one from PDFBOX-3019 is no longer available. The one from PDFBOX-2991 is here, you can test it yourself: https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf The original extraction is Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com I don't have any th

Re: Same line calculation of PDFTextStripper

2025-04-08 Thread Kevin Day
Hmmm. Do you know what the extracted text was for those two examples under the original sort algorithm? Were those text chunks properly extracted with the expected space between them? I'm not very clear on why the examples you show would be missing a word break detection after changing the sort.

Re: Same line calculation of PDFTextStripper

2025-04-08 Thread Tilman Hausherr
I tried this and get lots of differences, obviously. I looked at two files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but there's a new problem: the segments are not separated. PDFBOX-2991: Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin PDFBOX-3019: Originally