[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651 ]

Tilman Hausherr edited comment on TIKA-4278 at 10/11/24 1:17 PM:
-----------------------------------------------------------------

Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No
surprises here. However, the test runs only on .csv files, so it misses some of
the files mentioned in the previous report.

(This run does not yet contain the latest change, and it did not include the
colon as a delimiter.)


was (Author: tilman):
Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No
surprises here. However, the test runs only on .csv files, so it misses some of
the files mentioned in the previous report.

(This does not yet contain the latest change)

> TextAndCSVParser doesn't detect semicolon separated file
> --------------------------------------------------------
>
>                 Key: TIKA-4278
>                 URL: https://issues.apache.org/jira/browse/TIKA-4278
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: csv, csvparser
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and, yes, it doesn't detect
> semicolon-separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This array is later used by {{CSVSniffer}}. For some reason the other
> delimiters (pipe, colon and semicolon) aren't in that array, although they are
> in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}}, and now
> detection works for semicolons.
> Can I change this by adding the missing delimiters, or is there a reason for
> the omission that I missed? The proposed change would make {{CSVSniffer}} take
> its delimiters as a set and then pass it
> {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
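
To illustrate the idea in the description, here is a minimal standalone sketch. It is not Tika's code: the class {{DelimiterSketch}}, the map {{DELIMITER_NAMES}} (a hypothetical stand-in for {{CHAR_TO_STRING_DELIMITER_MAP}}) and the naive {{sniff}} method are all assumptions made for this example, and the real {{CSVSniffer}} scores candidates differently. The sketch only shows why a sniffer cannot report a delimiter that is missing from its candidate set, and how passing the map's key set keeps the two in sync.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DelimiterSketch {

    // Hypothetical stand-in for CHAR_TO_STRING_DELIMITER_MAP; the real map's
    // contents and value strings may differ.
    public static final Map<Character, String> DELIMITER_NAMES = new LinkedHashMap<>();
    static {
        DELIMITER_NAMES.put(',', "comma");
        DELIMITER_NAMES.put('\t', "tab");
        DELIMITER_NAMES.put('|', "pipe");
        DELIMITER_NAMES.put(';', "semicolon");
        DELIMITER_NAMES.put(':', "colon");
    }

    // Naive sniffer (an assumption, not Tika's algorithm): a candidate wins if
    // it occurs the same non-zero number of times on every non-empty line;
    // ties are broken in favour of the higher per-line count.
    // Returns null when no candidate qualifies.
    public static Character sniff(String text, Set<Character> candidates) {
        String[] lines = text.split("\\R");
        Character best = null;
        int bestCount = 0;
        for (char candidate : candidates) {
            int perLine = -1;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = count(line, candidate);
                if (perLine < 0) {
                    perLine = n;
                } else if (n != perLine) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && perLine > bestCount) {
                best = candidate;
                bestCount = perLine;
            }
        }
        return best;
    }

    // Counts occurrences of c in line.
    private static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String sample = "a;b;c\n1;2;3\n";

        // Candidate set equivalent to the current two-element array.
        Set<Character> commaAndTab = new HashSet<>(Arrays.asList(',', '\t'));
        System.out.println(sniff(sample, commaAndTab));

        // Candidate set taken from the map's keys, as proposed.
        System.out.println(sniff(sample, DELIMITER_NAMES.keySet()));
    }
}
```

With comma and tab as the only candidates, the sample yields no delimiter; with the map's key set, the semicolon is found. Deriving the candidates from the map also means a delimiter added to the map later would not need to be added in a second place.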



