[ 
https://issues.apache.org/jira/browse/TIKA-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727655#comment-17727655
 ] 

Gregory Lepore edited comment on TIKA-4053 at 6/5/23 3:05 PM:
--------------------------------------------------------------

ref:

https://issues.apache.org/jira/browse/TIKA-2484?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu

https://issues.apache.org/jira/browse/TIKA-40?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu

https://issues.apache.org/jira/browse/TIKA-2038?jql=project%20%3D%20TIKA%20AND%20text%20~%20icu

 

Possible libraries for improved encoding and language detection:

Decodetect

https://github.com/ethteck/decodetect
MIT licensed, last updated 2021
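
For context, here is a minimal sketch of the kind of check an encoding detector has to improve on. It is not Tika's or Decodetect's implementation. It assumes a short, hypothetical candidate list and a hypothetical threshold for control characters, and only asks whether the bytes decode cleanly under any candidate charset.

```python
import unicodedata
from typing import Optional, Sequence

# Hypothetical candidate list for illustration; a real detector would
# consider many more charsets and rank them statistically.
CANDIDATE_ENCODINGS = ("utf-8", "cp1251", "shift_jis")

# Control characters that are normal in plain text.
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f"}


def control_ratio(text: str) -> float:
    """Return the fraction of characters that are unexpected control codes."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    )
    return bad / len(text)


def guess_text_encoding(
    data: bytes,
    candidates: Sequence[str] = CANDIDATE_ENCODINGS,
    max_control_ratio: float = 0.02,  # assumed threshold, not from the issue
) -> Optional[str]:
    """Return the first candidate that decodes the data into mostly
    printable text, or None if the data looks binary.

    The returned name is only the first clean decode, not necessarily
    the true encoding of the file.
    """
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        if control_ratio(text) <= max_control_ratio:
            return name
    return None
```

A check limited to ASCII or UTF-8 would reject Cyrillic text saved as windows-1251, which is one plausible way non-English text ends up as application/octet-stream. Trying further charsets accepts it, at the cost of sometimes naming the wrong one.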

 

 



> Improve detection of text files
> -------------------------------
>
>                 Key: TIKA-4053
>                 URL: https://issues.apache.org/jira/browse/TIKA-4053
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Major
>         Attachments: 1990-01.etc, 2008-09.3, 20220708 YouTube1-1.kif, 
> bub0336d.007, pacman.nas, shab3_36.qbp, wots.diz
>
>
> Common Crawl data shows many text files that are being detected as 
> application/octet-stream. Some of these misdetections appear to occur 
> because the text is in a language other than English.
>  
> Various sample files attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
