[ 
https://issues.apache.org/jira/browse/TIKA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922738#comment-17922738
 ] 

Tim Allison commented on TIKA-4375:
-----------------------------------

A few observations:
1) Fewer exceptions. The small handful of new exceptions is not surprising.
2) Attachment diffs look good.
3) Files now identified as {{model/x.stl.ascii}} have no text extracted. I 
think this is ok? Should we have this format extend {{text/plain}} so that the 
text is still parsed?
4) A number of files are now being identified as GB18030 instead of UTF-8. 
Judging from the common words scores, this looks like a regression for those 
files, but I'm not sure there's much we can do. There are also a number of 
files now identified as GB18030 that appear to have improved extraction.
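
As a rough illustration of why the GB18030/UTF-8 confusion in point 4 is hard to rule out: many byte sequences are valid in both encodings, so a strict decode cannot tell them apart and a detector has to fall back on statistics. The sketch below uses only Python's built-in codecs; it does not reproduce Tika's detection logic, and the helper name {{decodes_cleanly}} and the sample bytes are made up for the example.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a structurally valid byte sequence in `encoding`."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: the UTF-8 bytes for "café".
# 0xC3 0xA9 is "é" in UTF-8, and it is also a valid two-byte GB18030 sequence.
sample = "café".encode("utf-8")

for enc in ("utf-8", "gb18030"):
    ok = decodes_cleanly(sample, enc)
    text = sample.decode(enc) if ok else None
    print(f"{enc}: valid={ok} text={text!r}")
```

Both decodes succeed but yield different text, which is why a content-based score such as the common words count is needed to judge which reading is more plausible.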

> Regression tests for 2.9.3 release
> ----------------------------------
>
>                 Key: TIKA-4375
>                 URL: https://issues.apache.org/jira/browse/TIKA-4375
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-2.9.2-v-tika-2.9.3-reports.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
