[ https://issues.apache.org/jira/browse/TIKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087668#comment-18087668 ]
Tim Allison commented on TIKA-4730:
-----------------------------------
That's a rough view of improvements vs. losses per charset flip. There are
losses (157k common tokens), but the wins are bigger (889k).
The detector chain I ran for beta is not actually the default yet. Making it
the default is what the open draft PR would do.
Users can configure the 3.x chain if they want, no matter what we choose to do.
If the overall improvement is not sufficient, we can back off and return to the
3.x chain as we ran in 4.0.0-alpha-1.
My thought was to go with this for beta and continue to improve for the 4.0.0
release. wdyt?
> Prep for 4.0.0-beta-1 release
> -----------------------------
>
> Key: TIKA-4730
> URL: https://issues.apache.org/jira/browse/TIKA-4730
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: reports-2020609.tgz, reports.tar.gz
>
>
> We made a number of important fixes to the published artifacts in ASF's dist
> repo, Maven Central, and Docker.
> I think we're set on changing APIs for 4.x generally.
> Is there anything else we need for this beta release?
> I propose starting the 4.0.0-beta-1 release in two weeks. WDYT?