[ 
https://issues.apache.org/jira/browse/TIKA-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932887#comment-17932887
 ] 

Subbu edited comment on TIKA-4370 at 3/6/25 8:25 AM:
-----------------------------------------------------

I understand that CharsetDetector could be run before MimeTypes so that it can 
figure out that the incoming file is SJIS, and if it does, MimeTypes could 
return a text type. But I think CharsetDetector lives in the parser module; can 
we have core depend on parser?

_Another thought is to hardcode the text detector after MimeTypes...which I 
don't like, but I'm not beyond. :D_

I couldn't follow this clearly: even if we hardcode TextDetector after 
MimeTypes, without it being able to detect SJIS, wouldn't the result still be 
octet-stream? Let me know if I misunderstood, or if you are thinking of a 
better way. 
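
To make the question concrete, here is a rough sketch of what a fallback after 
MimeTypes would have to do to help with SJIS. It uses only the JDK, not Tika's 
API; the class and method names (SjisFallbackSketch, decodesAsShiftJis, refine) 
are hypothetical. It also assumes that a strict Shift_JIS decode is an 
acceptable stand-in for real charset detection, which it is not in general: 
plain ASCII and some binary data also decode cleanly, so a real detector would 
need statistics on top.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class SjisFallbackSketch {

    // Hypothetical helper, not part of Tika: true if the bytes decode
    // strictly as Shift_JIS, with no malformed or unmappable sequences.
    static boolean decodesAsShiftJis(byte[] data) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS")
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical fallback: only consulted when the earlier detection step
    // has already given up with application/octet-stream.
    static String refine(String detectedType, byte[] data) {
        if ("application/octet-stream".equals(detectedType)
                && data.length > 0
                && decodesAsShiftJis(data)) {
            return "text/plain";
        }
        return detectedType;
    }
}
```

Without a check of this kind somewhere in the chain, I don't see how the order 
of the detectors alone would change the result for SJIS input.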
 


> SJIS Encoded Files Can't be Detected
> ------------------------------------
>
>                 Key: TIKA-4370
>                 URL: https://issues.apache.org/jira/browse/TIKA-4370
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Subbu
>            Priority: Major
>
> When the character encoding of a file is SJIS and there is no file name in 
> the metadata, the content type of most files is detected as 
> application/octet-stream. Is there zero support for SJIS? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
