[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147 ]
Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:24 PM:
----------------------------------------------------------------

I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. We can still change it after the "big" regression tests.


was (Author: tilman):
I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now.

> TextAndCSVParser doesn't detect semicolon separated file
> --------------------------------------------------------
>
>                 Key: TIKA-4278
>                 URL: https://issues.apache.org/jira/browse/TIKA-4278
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: csv, csvparser
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz,
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters
> (pipe, colon and semicolon) aren't in that array, although they are in
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that
> I missed?
> Proposed change would change CSVSniffer so that delimiters is a set
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
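
For illustration, the row-zero rule from the comment and the set of candidate delimiters from the proposed change can be sketched roughly as follows. This is not the Tika {{CSVSniffer}} code: the class {{DelimiterSketch}}, the {{CANDIDATES}} set, the {{sniff}} and {{count}} methods, and the "same count in every row" consistency test are all hypothetical simplifications, and quoted fields are ignored entirely.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch only; NOT the Tika CSVSniffer implementation.
class DelimiterSketch {

    // Hypothetical candidate set holding the delimiters named in the issue
    // (comma, tab, semicolon, pipe, colon). Insertion order is kept so the
    // result is deterministic.
    static final Set<Character> CANDIDATES = candidates();

    private static Set<Character> candidates() {
        Set<Character> set = new LinkedHashSet<>();
        for (char c : new char[] {',', '\t', ';', '|', ':'}) {
            set.add(c);
        }
        return set;
    }

    // Counts occurrences of the delimiter in one row (no quote handling).
    static int count(String row, char delimiter) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == delimiter) {
                n++;
            }
        }
        return n;
    }

    // Returns the first candidate that appears in row zero and appears the
    // same number of times in every later non-empty row.
    static Optional<Character> sniff(String text, Set<Character> delimiters) {
        String[] rows = text.split("\r?\n");
        if (rows.length == 0 || rows[0].isEmpty()) {
            return Optional.empty();
        }
        for (char d : delimiters) {
            int expected = count(rows[0], d);
            if (expected == 0) {
                // Row-zero rule: hits in later rows do not count.
                continue;
            }
            boolean consistent = true;
            for (int i = 1; i < rows.length; i++) {
                if (!rows[i].isEmpty() && count(rows[i], d) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return Optional.of(d);
            }
        }
        return Optional.empty();
    }
}
```

Under these assumptions, calling {{DelimiterSketch.sniff("a;b;c\n1;2;3\n", DelimiterSketch.CANDIDATES)}} picks the semicolon, while text whose first row contains none of the candidates is not treated as CSV even if later rows contain colons.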