[ 
https://issues.apache.org/jira/browse/TIKA-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924076#comment-17924076
 ] 

Subbu edited comment on TIKA-4370 at 2/5/25 1:52 PM:
-----------------------------------------------------

Appreciate your responses here.

_Perhaps, if the file is application/octet-stream, run the charset detector on 
the bytes and if the charset is detected as Shift-JIS then return text/plain._
1) Could I add this logic to TextDetector and open a PR? I believe tika-core 
does not depend on tika-parsers, though, so I would not have the charset 
detector available there. Or are you suggesting I do this in my application 
layer? 

If you prefer the latter, would you be open to a change where the application 
passes the character encoding in the metadata (detected beforehand by running 
TXTParser from its own classpath), and TextDetector, when it detects 
application/octet-stream and the metadata says Shift_JIS, falls back to 
text/plain? 

2) Even if I do that (either in tika-core or in my application logic), I think 
the best I can do without a file name is this: if the returned type is 
application/octet-stream, run character encoding detection and, if it reports 
Shift_JIS, treat the file as text/plain. But won't the same problem exist for 
other textual files, such as CSV files that are always Shift-JIS encoded? Or 
do you think the problem is specific to text/plain and all other SJIS files 
should be detected correctly (without a file name)? 

[~tallison] 


> SJIS Encoded Files Can't be Detected
> ------------------------------------
>
>                 Key: TIKA-4370
>                 URL: https://issues.apache.org/jira/browse/TIKA-4370
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Subbu
>            Priority: Major
>
> When the character encoding of a file is SJIS and there is no file name in 
> the metadata, the content type of most files is detected as 
> application/octet-stream. Is there no support for SJIS at all? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
