[ 
https://issues.apache.org/jira/browse/TIKA-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921740#comment-17921740
 ] 

Tim Allison edited comment on TIKA-4373 at 1/28/25 1:54 PM:
------------------------------------------------------------

A couple of observations:

1) LibreOffice 24.2 is now complaining about all the xlsx reports, although it
is able to repair them. This was widely discussed on the POI lists.
2) We've lost quite a few "common words" in files that used to be detected as
colon-delimited "csv" files.
3) PDF extraction has seen quite good improvements.
4) Zip extraction has improved in several handfuls of documents, with more
attachments extracted.
5) We're getting a bunch more files identified as json.
6) A handful of new exceptions in RTF (zip bomb?!) and xps.
7) Improved text extraction in xps.

I want to manually sample some files for 2), 5) and 6) to see if these are 
serious problems.

We updated commons-codec after running these regression tests. I propose that,
unless problems are identified in the report, we move forward with a
3.1.0-rc1 vote and concurrently rerun the regression tests to pick up any
surprises from the updated commons-codec.

Let me know if you find anything.

Many, many thanks again to [~msahyoun] for his ongoing support of the 
regression server.


> Regression tests for 3.1.0 release
> ----------------------------------
>
>                 Key: TIKA-4373
>                 URL: https://issues.apache.org/jira/browse/TIKA-4373
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports_tika-3.0-vs-3.1.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)