Tilman-
Yes, please - I would like to have these 180 files. Here is a secure
upload link:
https://trumpet.sharefile.com/r-rc2276979734447478d58b9ae23549d3e
Would you prefer to continue this conversation in JIRA, or is this the most
appropriate place? I am almost certainly going to need to initi
On 12.05.2025 21:37, Kevin Day wrote:
Are there test files that exercise the superscript/subscript correction
that the non-transitive comparator is supposed to address?
Besides the 25 test files I have a test set of about 180 text files
with their results that I use to check changes. Some
There is no public test suite other than files attached to Jira
tickets. There is a suite of some 1 files which we use for
regression tests prior to new releases but also when doing larger
changes but that cannot be shared due to data privacy, licensing ...
(it has been public but with new act
Are there test files that exercise the superscript/subscript correction
that the non-transitive comparator is supposed to address? And is there
some way that I can get access to the test suite that includes 2991? I can
copy the file down from the Jira ticket, but I hate to do a ton of
development
On 09.04.2025 16:36, Kevin Day wrote:
Understood.
My biggest comment is that having a non-transitive comparator in a sort
algorithm is a really bad idea. It produces all sorts of non-deterministic
behavior.
So I'm in agreement that a better solution is needed.
Do you have any history of why t
I had one other thought on this.
Without question, the ordering of the TextPositions after the JRE sort
completes is not consistent with the comparator. It should be easy to just
loop over the sorted TPs and check that the comparator always returns <= 0.
I'm wondering if the slower fallback sort w
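
A minimal sketch of the check described above, assuming a generic list and comparator rather than the actual PDFBox TextPosition classes. The class and method names (SortConsistencyCheck, firstOutOfOrderIndex) are hypothetical and not part of PDFBox. It only tests adjacent pairs, so with a non-transitive comparator it can confirm an inconsistency but cannot prove that the whole ordering is consistent.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class SortConsistencyCheck {

    /**
     * Walks an already-sorted list and returns the index i of the first
     * adjacent pair (i, i + 1) for which the comparator returns a value > 0,
     * or -1 if every adjacent pair compares as <= 0.
     */
    static <T> int firstOutOfOrderIndex(List<T> sorted, Comparator<? super T> comparator) {
        for (int i = 0; i + 1 < sorted.size(); i++) {
            if (comparator.compare(sorted.get(i), sorted.get(i + 1)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Comparator<Integer> natural = Comparator.naturalOrder();
        // Consistent ordering: no adjacent pair is out of order.
        System.out.println(firstOutOfOrderIndex(Arrays.asList(1, 2, 2, 5), natural));
        // 3 precedes 2, so the pair starting at index 1 is reported.
        System.out.println(firstOutOfOrderIndex(Arrays.asList(1, 3, 2, 5), natural));
    }
}
```

In the text-extraction case the list would be the sorted TextPositions and the comparator would be the one handed to the sort; a result other than -1 would indicate that the sort output disagrees with the comparator.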
Thank you for directing me to the discussion. This is pretty much what I
expected (the reason for the fuzzy logic is superscript/subscript handling).
I am pretty confident that the problem is not with the comparator. The
problem is that we are trying to use a simple sort algorithm to do
somethin
Understood.
My biggest comment is that having a non-transitive comparator in a sort
algorithm is a really bad idea. It produces all sorts of non-deterministic
behavior.
So I'm in agreement that a better solution is needed.
Do you have any history of why the fuzzy logic is in that comparator? S
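
To illustrate why a fuzzy comparator is non-transitive, here is a small sketch. It does not reproduce the PDFBox comparator, which is not shown in this thread; it assumes a stand-in that treats two numbers as equal when they lie within a tolerance. The names FuzzyComparatorDemo, fuzzy and equalityIsTransitive and the tolerance value are hypothetical.

```java
import java.util.Comparator;

// Hypothetical stand-in for a tolerance-based comparator; not PDFBox code.
final class FuzzyComparatorDemo {

    /** Treats values within the tolerance as equal, otherwise orders them numerically. */
    static Comparator<Double> fuzzy(double tolerance) {
        return (a, b) -> Math.abs(a - b) <= tolerance ? 0 : Double.compare(a, b);
    }

    /**
     * Checks one consequence of the Comparator contract for a single triple:
     * if x equals y and y equals z, then x must equal z.
     */
    static <T> boolean equalityIsTransitive(Comparator<T> c, T x, T y, T z) {
        if (c.compare(x, y) == 0 && c.compare(y, z) == 0) {
            return c.compare(x, z) == 0;
        }
        return true; // premise not met, nothing to check for this triple
    }

    public static void main(String[] args) {
        Comparator<Double> c = fuzzy(1.0);
        // 0.0 ~ 0.8 and 0.8 ~ 1.6, but 0.0 and 1.6 differ by more than 1.0.
        System.out.println(equalityIsTransitive(c, 0.0, 0.8, 1.6)); // prints false
    }
}
```

A sort that receives such a comparator can legitimately return different orders depending on which pairs it happens to compare, which is one way the non-deterministic behavior mentioned above can arise.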
Oops, the one from PDFBOX-3019 is no longer available. The one from
PDFBOX-2991 is here, you can test it yourself:
https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
The original extraction is
Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
I don't have any th
Hmmm.
Do you know what the extracted text was for those two examples under the
original sort algorithm? Were those text chunks properly extracted with the
expected space between them?
I'm not very clear on why the examples you show would be missing a word
break detection after changing the sort.
I tried this and get lots of differences, obviously. I looked at two
files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
there's a new problem: the segments are not separated.
PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
PDFBOX-3019:
Originally
I've got an interesting problem.
We are running into scenarios where parsing fails to treat consecutive
words as being on the same line.
Here is an example:
https://drive.google.com/file/d/1XRd6itkNzXCd9CbPuGSmB6lYZMMMHzvH/view?usp=drive_link
If you extract the text, it comes out:
B checked . .