[ https://issues.apache.org/jira/browse/SPARK-51199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932691#comment-17932691 ]
Felix Wollschläger commented on SPARK-51199:
--------------------------------------------

This is probably an issue with the underlying CSV parsing library:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L25

The library appears to have been unmaintained since 2021 (see: https://github.com/uniVocity/univocity-parsers/issues/534).

JUnit is looking into replacing the same library with [FastCSV|https://fastcsv.org/] (see: https://github.com/junit-team/junit5/issues/4339).

> Valid CSV records considered malformed
> --------------------------------------
>
>                 Key: SPARK-51199
>                 URL: https://issues.apache.org/jira/browse/SPARK-51199
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.4
>         Environment: SparkContext: Running Spark version 3.5.4
> SparkContext: OS info Mac OS X, 15.3, aarch64
> SparkContext: Java version 17.0.14 2025-01-21 LTS
> OpenJDK Runtime Environment Corretto-17.0.14.7.1 (build 17.0.14+7-LTS)
> OpenJDK 64-Bit Server VM Corretto-17.0.14.7.1 (build 17.0.14+7-LTS, mixed mode, sharing)
>            Reporter: Andreas Franz
>            Priority: Major
>
> There is an issue parsing CSV files with a combination of escaped double quotes and commas in a field.
> I've created a small example that demonstrates the issue:
> {code:java}
> package com.example
>
> import org.apache.spark.sql.SparkSession
>
> object Example {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("CSV Example")
>       .master("local[*]")
>       .config("spark.driver.host", "localhost")
>       .config("spark.ui.enabled", "false")
>       .getOrCreate()
>
>     val csv = spark
>       .read
>       .option("header", "true")
>       .option("mode", "FAILFAST")
>       .csv("./src/main/scala/com/example/example.csv")
>
>     csv.show(2, truncate = false)
>
>     spark.stop()
>   }
> }
> {code}
> {code:java}
> id,region_name,gp_id,gp_name,gp_group_id,gp_group_name,gp_group_region_name
> 111234567,east,1122723,"Test 1",,,
> 001234567,east,1122723,"Foo ""Bar"", New York, US",,,
> {code}
> According to [https://www.ietf.org/rfc/rfc4180.txt|http://example.com/] this is a valid CSV record.
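
To make the RFC 4180 claim in the quoted report concrete, here is a minimal sketch in plain Scala (no Spark dependency) that splits the second record using the RFC's quoting rules, where a doubled quote inside a quoted field stands for one literal quote. The object name Rfc4180Check and the helper splitRecord are hypothetical and exist only for this illustration; this is not how Spark or the underlying library parses CSV, and it does not handle records that span multiple lines.

```scala
object Rfc4180Check {
  /** Splits a single CSV record into fields following RFC 4180 quoting:
    * a field may be wrapped in double quotes, and "" inside a quoted
    * field is an escaped literal quote. Hypothetical helper for
    * illustration only; multi-line records are not supported.
    */
  def splitRecord(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            current.append('"') // doubled quote: emit one literal quote
            i += 1              // and skip the second quote character
          } else {
            inQuotes = false    // a lone quote closes the quoted field
          }
        } else {
          current.append(c)     // commas inside quotes are plain data
        }
      } else {
        if (c == '"') inQuotes = true
        else if (c == ',') { fields += current.toString; current.clear() }
        else current.append(c)
      }
      i += 1
    }
    fields += current.toString  // the last field has no trailing comma
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // Second data record from the report, written as a Scala string literal.
    val record = "001234567,east,1122723,\"Foo \"\"Bar\"\", New York, US\",,,"
    val fields = splitRecord(record)
    println(fields.length) // prints 7, matching the seven header columns
    println(fields(3))     // prints Foo "Bar", New York, US
  }
}
```

Under these rules the record yields seven fields, the same count as the header row, which is consistent with the reporter's reading of the RFC. As a separate and unverified note: Spark's CSV reader defaults its "escape" option to a backslash, so setting .option("escape", "\"") may be worth trying for files that use doubled quotes. Whether that avoids the failure described in this issue has not been confirmed here.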