[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888632#comment-17888632 ]
ASF GitHub Bot commented on TIKA-4278:
--------------------------------------

THausherr commented on PR #1976:
URL: https://github.com/apache/tika/pull/1976#issuecomment-2407325929

   I see you removed the "colon isn't reliable" code part. Did you test what would happen with the file mentioned in that code segment (242970.txt)? IMHO the colon should still be "discriminated against" if other delimiters have the same confidence.

   I'm currently trying to build a modified version and will then run a regression test on the CSV fileset (this is faster than running the build itself).

> TextAndCSVParser doesn't detect semicolon separated file
> --------------------------------------------------------
>
>                 Key: TIKA-4278
>                 URL: https://issues.apache.org/jira/browse/TIKA-4278
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: csv, csvparser
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz,
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes, it doesn't detect
> semicolon separated files. The reason is this line in
> {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters
> (pipe, colon and semicolon) aren't in that array, although they are in
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now
> it works for semicolon.
> Can I change this by adding the missing delimiters, or was there a reason
> that I missed? The proposed change would modify {{CSVSniffer}} so that its
> delimiters are a set, and then pass
> {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
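
For illustration only, the idea discussed above (take the candidate delimiters as a set from the keys of a delimiter map, and let the colon lose ties against other candidates) could be sketched roughly as follows. This is not Tika's CSVSniffer: the class DelimiterSketch, the map contents, and the scoring rule (a delimiter must occur the same nonzero number of times on every non-empty line) are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and scoring are assumptions, not Tika's code.
class DelimiterSketch {

    // Assumed contents, modeled on the delimiters named in the issue.
    static final Map<Character, String> CHAR_TO_STRING_DELIMITER_MAP = new LinkedHashMap<>();
    static {
        CHAR_TO_STRING_DELIMITER_MAP.put(',', "comma");
        CHAR_TO_STRING_DELIMITER_MAP.put('\t', "tab");
        CHAR_TO_STRING_DELIMITER_MAP.put('|', "pipe");
        CHAR_TO_STRING_DELIMITER_MAP.put(';', "semicolon");
        CHAR_TO_STRING_DELIMITER_MAP.put(':', "colon");
    }

    // Counts how often delimiter d occurs in a single line.
    static int count(String line, char d) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == d) {
                n++;
            }
        }
        return n;
    }

    // Score is the per-line count if it is identical on all non-empty lines, else 0.
    static int score(String[] lines, char d) {
        int expected = -1;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int n = count(line, d);
            if (expected == -1) {
                expected = n;
            } else if (n != expected) {
                return 0;
            }
        }
        return Math.max(expected, 0);
    }

    // Returns the best-scoring delimiter, or null if no candidate fits.
    // On equal scores, a colon is replaced by any other candidate.
    static Character sniff(String text, Set<Character> delimiters) {
        String[] lines = text.split("\r?\n");
        Character best = null;
        int bestScore = 0;
        for (char d : delimiters) {
            int s = score(lines, d);
            if (s == 0) {
                continue;
            }
            boolean colonLosesTie = s == bestScore && best != null && best == ':';
            if (s > bestScore || colonLosesTie) {
                best = d;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<Character> all = CHAR_TO_STRING_DELIMITER_MAP.keySet();
        System.out.println(sniff("a;b;c\n1;2;3\n", all));
        System.out.println(sniff("a;b;c\n1;2;3\n", Set.of(',', '\t')));
    }
}
```

With only comma and tab as candidates, the semicolon-separated sample yields no delimiter at all, which mirrors the behavior reported in the issue. Passing the full key set lets the semicolon be found, and the tie rule keeps a time-like column such as 12:30 from making the colon win.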