On 26.04.2022 at 21:45, Tim Allison wrote:
I should clarify that I fixed the two regressions that I had
identified in the release candidate.  The regression results that I
shared were run with 1.x before those fixes.

Ah, OK, but then the tests should be run again after the fixes, in case the fixes themselves broke something (this has happened in the PDFBox project).  If nothing got broken, then there's still the satisfaction of having very small result files :-)

Also suspicious:

bug_trackers/TIKA/TIKA-2215-0.ppt


Tilman



Still, let's fix the dependency convergence, and please let me know if
there's anything else you find in the regression reports!

On Tue, Apr 26, 2022 at 3:40 PM Tim Allison <[email protected]> wrote:
Hi Tilman,

   Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
related to TIKA-3734.  The updated junrar (7.5.0) is swallowing a
(new) exception on this file and stopping the parse without throwing
an exception.  The earlier version of junrar (7.4.1) did not find a
problem with the file.

   My Ubuntu package utility throws an exception on this file, and I think
the file is just kind of wonky.

   I'm going to fix the dependency convergence issues.  Is there anything else?

       Best,

                  Tim

On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr <[email protected]> wrote:
On 26.04.2022 at 13:07, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz

I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
think both are related to the underlying parsers being stricter (which
is good), but we need to change our code to handle these cases more
robustly.

Let me know if you see anything else.
What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4? This is
also a RAR file and the last entry in content_diffs_no_exceptions.xlsx.
Is that related to TIKA-3734?

Tilman

