[ 
https://issues.apache.org/jira/browse/TIKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085477#comment-18085477
 ] 

Tim Allison edited comment on TIKA-4730 at 6/2/26 12:17 PM:
------------------------------------------------------------

Yes, will do. I'll also turn off the XHTML validator and align the PDF access permissions.

On charset detection, ~2700 files have an improved OOV (out-of-vocabulary) rate and ~1200 have a worse one.

One example where the OOV result is likely very wrong: FAWJEKDFIRKFN7JKLZ6Y7LOZ5BVJXAHI, 
detected as win-1252 (English) in 3.x and win-1250 (Polish) in 4.x, appears to 
lose 1613 common tokens. However, the file really is Polish.
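
To illustrate why an OOV comparison across two different detected languages can
mislead, here is a minimal sketch. It is not tika-eval's implementation: the
function name oov_rate, the tiny word lists COMMON_EN and COMMON_PL, and the
sample text are all hypothetical. It only shows that the same bytes decode
differently under cp1250 and cp1252, and that an OOV rate depends on which
common-token list it is measured against.

```python
def oov_rate(tokens, common_tokens):
    """Return the fraction of tokens that are absent from common_tokens."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok.lower() not in common_tokens)
    return missing / len(tokens)


# Hypothetical, tiny word lists; real common-token lists are far larger.
COMMON_EN = {"the", "and", "of", "to", "in"}
COMMON_PL = {"i", "w", "nie", "na", "to"}

# The same bytes produce different text under the two charsets.
raw = "żółw".encode("cp1250")
as_polish = raw.decode("cp1250")   # round-trips to the original word
as_western = raw.decode("cp1252")  # mojibake: same bytes, wrong charset

# The same tokens score differently against each language's list.
tokens = "to nie jest na sprzedaż".split()
rate_pl = oov_rate(tokens, COMMON_PL)
rate_en = oov_rate(tokens, COMMON_EN)
print(rate_pl, rate_en)
```

Because the reference list changes with the detected language, a drop in common
tokens between two runs does not by itself show that the newer detection is
worse.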



> Prep for 4.0.0-beta-1 release
> -----------------------------
>
>                 Key: TIKA-4730
>                 URL: https://issues.apache.org/jira/browse/TIKA-4730
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports.tar.gz
>
>
> We made a number of important fixes to the published artifacts in ASF's dist 
> repo, maven central and docker.
> I think we're set on changing APIs for 4.x generally.
> Is there anything else we need for this beta release?
> I propose starting the 4.0.0-beta-1 release in two weeks. WDYT?



