[
https://issues.apache.org/jira/browse/TIKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085477#comment-18085477
]
Tim Allison edited comment on TIKA-4730 at 6/2/26 12:17 PM:
------------------------------------------------------------
Yes, will do. I'll also turn off the XHTML validator and align the PDF access permissions.
On charset detection, ~2700 files have an improved OOV (out-of-vocabulary) rate and ~1200 have a worse one.
One example where the OOV change is likely very misleading: FAWJEKDFIRKFN7JKLZ6Y7LOZ5BVJXAHI,
detected as win-1252 (English) in 3.x and win-1250 (Polish) in 4.x, appears to
lose 1613 common tokens. However, it really is Polish.
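
For anyone following along, here is a rough sketch of the mechanics behind that comparison: the same bytes decoded as win-1250 vs. win-1252, then scored against a common-words list. The word list, the sample sentence, and the `oov` helper are all made up for illustration, and the formula (1 minus common tokens over alphabetic tokens) is my assumption about how the metric is defined, not a copy of the tika-eval code. It shows the expected direction only; it does not reproduce the anomaly in the file above.

```python
import re

# Hypothetical, tiny stand-in for a per-language common-words list.
COMMON_PL = {"się", "że", "nie", "jest", "to", "na"}


def tokens(text):
    # Lowercased runs of word characters, excluding digits and underscores.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def oov(text, common):
    # Assumed definition: share of tokens NOT found in the common-words list.
    toks = tokens(text)
    if not toks:
        return 0.0
    hits = sum(1 for t in toks if t in common)
    return 1.0 - hits / len(toks)


# Polish text stored as windows-1250 bytes (illustrative sample).
raw = "To jest źródło, że się nie da".encode("cp1250")

as_1250 = raw.decode("cp1250")                    # correct charset
as_1252 = raw.decode("cp1252", errors="replace")  # wrong charset: mojibake

print(as_1250, oov(as_1250, COMMON_PL))
print(as_1252, oov(as_1252, COMMON_PL))
```

With a fixed word list, the wrong charset mangles words like "że" and "się", so fewer tokens match and OOV rises. If the common-token count is instead taken against the list for whatever language was detected in each run (again an assumption), then counts from an English run and a Polish run are not directly comparable, which is one way a correct detection could still look like a loss.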
> Prep for 4.0.0-beta-1 release
> -----------------------------
>
> Key: TIKA-4730
> URL: https://issues.apache.org/jira/browse/TIKA-4730
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: reports.tar.gz
>
>
> We made a number of important fixes to the published artifacts in ASF's dist
> repo, Maven Central and Docker.
> I think we're set on changing APIs for 4.x generally.
> Is there anything else we need for this beta release?
> I propose starting the 4.0.0-beta-1 release in two weeks. WDYT?