[ https://issues.apache.org/jira/browse/TIKA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922738#comment-17922738 ]
Tim Allison edited comment on TIKA-4375 at 1/31/25 3:22 PM:
------------------------------------------------------------

A few observations:

1) fewer exceptions. The small handful of new exceptions are not surprising
2) attachment diffs look good
3) Files now identified as {{model/x.stl.ascii}} have no text extracted -- I think this is ok? This is the behavior in 3.x. Should we have this format extend text/plain so that the text is still parsed?
4) A number of files are now being identified as GB18030 instead of UTF-8. From the common words scores, this looks like a regression on some but an improvement on others. I'm not sure there's much we can do.

was (Author: talli...@mitre.org):
    A few observations:

1) fewer exceptions. The small handful of new exceptions are not surprising
2) attachment diffs look good
3) Files now identified as {{model/x.stl.ascii}} have no text extracted -- I think this is ok? Should we have this format extend text/plain so that the text is still parsed?
4) A number of files are now being identified as GB18030 instead of UTF-8. From the common words scores, this looks like a regression, but I'm not sure there's much we can do. There are also a number of files now identified as GB18030 that appear to have improved extraction.

> Regression tests for 2.9.3 release
> ----------------------------------
>
>                 Key: TIKA-4375
>                 URL: https://issues.apache.org/jira/browse/TIKA-4375
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-2.9.2-v-tika-2.9.3-reports.tgz
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)