[ https://issues.apache.org/jira/browse/TIKA-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086481#comment-18086481 ]
Tim Allison commented on TIKA-4754:
-----------------------------------
I tested this on 50k HTML files from Common Crawl to select the error
rate/compression trade-off. The diff in stats was tiny. There may still be
pathological data in the wild that will skew OOV (out-of-vocabulary) results.
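
For readers unfamiliar with the trade-off, here is a minimal sketch of why a
bloom filter can nudge OOV. It is not the tika-eval implementation: the class
name, the hashing scheme (seeded FNV-1a with double hashing), and the sizes
are all assumptions made for illustration. A bloom filter never misses a token
that was added, but it can report an unseen token as present, so OOV computed
against it can only stay the same or come out slightly lower than the exact
value.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only; not the tika-eval code.
public class CommonTokenBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public CommonTokenBloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a, with a seed mixed into the offset basis (assumed scheme).
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th probe position is derived from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x9E3779B9) | 1; // force odd so the step is never 0
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String token) {
        byte[] data = token.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    // True for every added token; may also be true for a token never added.
    public boolean mightContain(String token) {
        byte[] data = token.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    // Fraction of tokens the filter does not recognize as common.
    public static double oov(CommonTokenBloomSketch filter, String[] tokens) {
        if (tokens.length == 0) {
            return 0.0;
        }
        int missing = 0;
        for (String t : tokens) {
            if (!filter.mightContain(t)) {
                missing++;
            }
        }
        return (double) missing / tokens.length;
    }

    public static void main(String[] args) {
        CommonTokenBloomSketch filter = new CommonTokenBloomSketch(1 << 16, 5);
        for (String t : new String[] {"the", "and", "of", "to"}) {
            filter.add(t);
        }
        String[] doc = {"the", "xqzv", "and", "of"};
        System.out.println("oov = " + oov(filter, doc));
    }
}
```

Fewer bits per token means a smaller file but more false positives, which is
the error rate/compression choice described above.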
> Switch to bloom filters for common tokens in tika-eval
> ------------------------------------------------------
>
> Key: TIKA-4754
> URL: https://issues.apache.org/jira/browse/TIKA-4754
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> This can bring the tika-eval jar from 22 MB to 8.5 MB without much of a
> change in stats. We could go lower, but then the diff grows because of
> expected bloom filter limitations (false positives).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)