changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT

Koert Kuipers Thu, 26 May 2016 15:36:32 -0700

in spark 1.6.1 we used:
 sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "~")
      .option("quote", null)


this effectively turned off quoting, which is a necessity for certain data
formats where quoting is not supported and "\"" is a valid character itself
in the data.

in spark 2.0.0-SNAPSHOT we did the same thing:
 sqlContext.read
      .format("csv")
      .option("delimiter", "~")
      .option("quote", null)

but this did not work: we got weird blowups where spark was trying to parse
thousands of lines as if they were one record. the reason was that a (valid)
quote character ("\"") was present in the data. for example:
a~b"c~d

as it turns out, setting quote to null does not turn off quoting anymore.
instead it means to use the default quote character.
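
a rough sketch of why a single stray quote makes many lines collapse into one
record. this is plain scala with no spark dependency; splitRecords is a
hypothetical toy splitter written for illustration, not the parser spark
actually uses, and the sample rows after the first one are made up.

```scala
// toy record splitter, NOT spark's csv parser: it only illustrates the effect
// of treating '"' as a quote character. quote = None means quoting is off.
object QuoteDemo {
  def splitRecords(input: String,
                   delimiter: Char,
                   quote: Option[Char]): Vector[Vector[String]] = {
    val records = Vector.newBuilder[Vector[String]]
    var fields = Vector.empty[String]
    val field = new StringBuilder
    var inQuotes = false
    for (c <- input) {
      if (quote.contains(c)) {
        // a quote character toggles quoted mode and is not kept in the field
        inQuotes = !inQuotes
      } else if (c == delimiter && !inQuotes) {
        fields :+= field.toString
        field.clear()
      } else if (c == '\n' && !inQuotes) {
        records += (fields :+ field.toString)
        fields = Vector.empty
        field.clear()
      } else {
        // inside quotes, delimiters and newlines are ordinary characters
        field += c
      }
    }
    // flush a trailing record that was never terminated by a newline
    if (field.nonEmpty || fields.nonEmpty) records += (fields :+ field.toString)
    records.result()
  }

  def main(args: Array[String]): Unit = {
    val data = "a~b\"c~d\ne~f\ng~h\n"
    println(splitRecords(data, '~', Some('"')).size) // prints 1
    println(splitRecords(data, '~', None).size)      // prints 3
  }
}
```

with quoting on, the unmatched quote in the first line is never closed, so
every following delimiter and newline is swallowed into one field. with
quoting off, the same input splits into three records as intended.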

does anyone know how to turn off quoting now?

our current workaround is:
 sqlContext.read
      .format("csv")
      .option("delimiter", "~")
      .option("quote", "☃")

(we assume there are no unicode snowmen in our data...)
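
that assumption could be checked on a sample of the input before relying on
the workaround. a minimal sketch in plain scala; QuoteSentinel and isUnused
are hypothetical names for illustration, and the sample lines would come from
wherever the raw data can be read as text.

```scala
object QuoteSentinel {
  // true if `candidate` never occurs in any of the given lines,
  // i.e. it is safe to use as a stand-in quote character for them
  def isUnused(candidate: Char, lines: Iterator[String]): Boolean =
    lines.forall(line => line.indexOf(candidate) < 0)
}
```

for example, QuoteSentinel.isUnused('☃', scala.io.Source.fromFile(path).getLines())
scans one local file, where path is a placeholder. it only covers the data it
is given, so it reduces the risk rather than removing it.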
