[
https://issues.apache.org/jira/browse/TIKA-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727655#comment-17727655
]
Gregory Lepore edited comment on TIKA-4053 at 6/5/23 3:05 PM:
--------------------------------------------------------------
ref:
https://issues.apache.org/jira/browse/TIKA-2484?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu
https://issues.apache.org/jira/browse/TIKA-40?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu
https://issues.apache.org/jira/browse/TIKA-2038?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu
Possible libraries for improved encoding and language detection:
Decodetect
https://github.com/ethteck/decodetect
MIT Licensed, last update 2021
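To illustrate the kind of check an encoding detector could perform on these files, here is a minimal sketch using only the JDK. It is not taken from Tika or Decodetect. The class name TextSniffer, the method firstPlausibleCharset, and the 2% control-character threshold are all assumptions made for this example. It tries each candidate charset with strict decoding and accepts the first one that decodes without errors and yields few control characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Tika.
final class TextSniffer {

    // Assumed threshold: at most 2% of decoded chars may be control characters.
    static final double MAX_CONTROL_RATIO = 0.02;

    // Returns the first candidate charset under which the bytes look like text.
    static Optional<Charset> firstPlausibleCharset(byte[] data, List<Charset> candidates) {
        if (data.length == 0) {
            return Optional.empty(); // nothing to judge
        }
        for (Charset cs : candidates) {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            CharBuffer chars;
            try {
                chars = decoder.decode(ByteBuffer.wrap(data));
            } catch (CharacterCodingException e) {
                continue; // bytes are not valid in this charset
            }
            if (chars.length() > 0 && controlRatio(chars) <= MAX_CONTROL_RATIO) {
                return Optional.of(cs);
            }
        }
        return Optional.empty();
    }

    // Fraction of decoded chars that are control characters,
    // not counting tab, line feed, and carriage return.
    static double controlRatio(CharSequence chars) {
        int control = 0;
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                control++;
            }
        }
        return (double) control / chars.length();
    }
}
```

With a check like this, Japanese text encoded as UTF-8 is accepted when UTF-8 is among the candidates, while a buffer dominated by NUL bytes is rejected for every candidate. Single-byte charsets such as ISO-8859-1 decode any byte sequence, so strict decoding alone cannot tell them apart; that gap is what statistical detectors aim to fill.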
> Improve detection of text files
> -------------------------------
>
> Key: TIKA-4053
> URL: https://issues.apache.org/jira/browse/TIKA-4053
> Project: Tika
> Issue Type: Sub-task
> Reporter: Gregory Lepore
> Priority: Major
> Attachments: 1990-01.etc, 2008-09.3, 20220708 YouTube1-1.kif,
> bub0336d.007, pacman.nas, shab3_36.qbp, wots.diz
>
>
> Common Crawl data shows many text files that are being detected as
> application/octet-stream. Some of these misdetections appear to occur
> because the text is in a language other than English.
>
> Various sample files attached.