[ https://issues.apache.org/jira/browse/TIKA-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924064#comment-17924064 ]
Tim Allison edited comment on TIKA-4370 at 2/5/25 1:23 PM:
-----------------------------------------------------------

Not sure precisely what your proposal is.

If you're getting better detection with TXTParser than with TextAndCSVParser, you can definitely swap in the TXTParser instead of the TextAndCSVParser. However, that's a bug, and we'd like you to share an example file if possible so that we can fix it. :D

What do you think of turning the TXTParser on for {{application/octet-stream}}? As in: https://issues.apache.org/jira/browse/TIKA-4370?focusedCommentId=17921798&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17921798

You could then apply tika-eval to the extracted text to measure the "junkness" that you're getting out of the files.

> SJIS Encoded Files Can't be Detected
> ------------------------------------
>
>                 Key: TIKA-4370
>                 URL: https://issues.apache.org/jira/browse/TIKA-4370
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Subbu
>            Priority: Major
>
> When character encoding of file is SJIS, without file name in the metadata,
> most files content-type detected as application/octet-stream. Is there zero
> support for SJIS?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
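
As a rough illustration of the "junkness" check suggested in the comment above, here is a minimal sketch that uses only the JDK, not tika-eval. The class name {{JunkRatio}} and the method {{replacementRatio}} are hypothetical, and the metric (the share of U+FFFD replacement characters after decoding) is only a crude stand-in for whatever tika-eval actually reports. It assumes the JDK in use ships the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JunkRatio {

    // Hypothetical helper: fraction of decoded characters that are U+FFFD.
    // new String(bytes, charset) replaces malformed input with U+FFFD,
    // so a high ratio hints that the wrong charset was assumed.
    static double replacementRatio(byte[] bytes, Charset charset) {
        String text = new String(bytes, charset);
        if (text.isEmpty()) {
            return 0.0;
        }
        long bad = text.chars().filter(c -> c == 0xFFFD).count();
        return (double) bad / text.length();
    }

    public static void main(String[] args) {
        // The three characters of "Nihongo" (Japanese language) encoded as Shift_JIS.
        byte[] sjisBytes = {
            (byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B, (byte) 0x8C, (byte) 0xEA
        };
        Charset sjis = Charset.forName("Shift_JIS");

        // Decoding with the right charset should yield no replacement characters.
        System.out.println("as Shift_JIS: " + replacementRatio(sjisBytes, sjis));
        // Decoding the same bytes as UTF-8 should yield mostly replacement characters.
        System.out.println("as UTF-8:     " + replacementRatio(sjisBytes, StandardCharsets.UTF_8));
    }
}
```

Running the same comparison over text extracted with different parser settings would give a quick first signal before turning to tika-eval for a proper measurement.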