[ https://issues.apache.org/jira/browse/TIKA-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921740#comment-17921740 ]
Tim Allison edited comment on TIKA-4373 at 1/28/25 1:54 PM:
------------------------------------------------------------

Couple of observations.

1) LibreOffice 24.2 is complaining about all the xlsx reports now. It is able to repair them. This was widely discussed on the POI lists.
2) We've lost quite a few "common words" in files that used to be detected as colon-delimited "csv" files.
3) PDF extraction has seen quite good improvements
4) zip extraction has improved in several handfuls of documents -- more attachments
5) We're getting a bunch more files identified as json.
6) handful of new exceptions in RTF (zip bomb?!) and xps
7) improved text extraction in xps

I want to manually sample some files for 2), 5) and 6) to see if these are serious problems.

We updated commons-codec after running these regression tests. I propose that unless there are problems identified in the report, we move forth with a 3.1.0-rc1 vote and concurrently rerun the regression tests to pick up any surprises with the updated commons-codec. Let me know if you find anything.

Many, many thanks again to [~msahyoun] for his ongoing support of the regression server.


was (Author: talli...@mitre.org):
Couple of observations.

1) LibreOffice 24.2 is complaining about all the xlsx reports now. It is able to repair them. This was widely discussed on the POI lists.
2) We've lost quite a few "common words" in files that used to be detected as colon-delimited "csv" files.
3) PDF extraction has seen quite good improvements
4) zip extraction has improved in several handfuls of documents -- more attachments
5) We're getting a bunch more files identified as json.
6) handful of new exceptions in RTF (zip bomb?!) and xps

I want to manually sample some files for 2), 5) and 6) to see if these are serious problems.

We updated commons-codec after running these regression tests.
I propose that unless there are problems identified in the report, we move forth with a 3.1.0-rc1 vote and concurrently rerun the regression tests to pick up any surprises with the updated commons-codec. Let me know if you find anything.

Many, many thanks again to [~msahyoun] for his ongoing support of the regression server.

> Regression tests for 3.1.0 release
> ----------------------------------
>
>                 Key: TIKA-4373
>                 URL: https://issues.apache.org/jira/browse/TIKA-4373
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports_tika-3.0-vs-3.1.tgz
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)