I tried spark-csv with a file having a newline inside a field value - it does not work either.
$ cat /tmp/cars.csv
1,"Hello1
world"
2,"Hello2"
3,"Hello3"

scala> val df = sqlContext.read.
     |   format("com.databricks.spark.csv").
     |   load("/tmp/cars.csv")
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
  at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
  at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
  at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
  at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
  at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:223)
  at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:72)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:157)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)

What you can do instead:

1. Read the list of files into an Array[String]:
   val files = getListOfFiles(dir)

2. Create an RDD out of it and repartition by files.length, so each task gets one file:
   val filesRdd = sc.parallelize(files, files.length)

3. Un-bzip2 and parse, 1 file = 1 task:
   val lines: RDD[Array[String]] = filesRdd.flatMap(file => unbzip2AndCsvToListOfArrays(file))

   unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] can use a CSV parser
   which understands newlines inside field values, e.g. Super CSV.

4. Create an RDD of Rows:
   val rows = lines.map(line => Row.fromSeq(line.toSeq))

5. Create a DataFrame (schema describes the column names and types):
   val df = getSqlContext.createDataFrame(rows, schema)

6. Save df as ORC:
   df.repartition(outputFilesCount).write.format("orc").save(outputPath)

On Jan 12, 2016 9:58 AM, "Gerber, Bryan W" <bryan.ger...@pnnl.gov> wrote:

> From that wiki:
> "This SerDe works for most CSV data, but does not handle embedded newlines."
>
> The Hive SerDe interface is all downstream of the TextInputFormat, which
> has already split records by newlines.
> In theory you can give it a different line delimiter, but Hive 1.2.1 does
> not support it: "FAILED: SemanticException 3:20 LINES TERMINATED BY only
> supports newline '\n' right now."
>
> *From:* Alexander Pivovarov [mailto:apivova...@gmail.com]
> *Sent:* Tuesday, January 12, 2016 9:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
> Try the CSV SerDe. It should correctly parse a quoted field value having a
> newline inside:
> https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
>
> Hadoop should automatically read bz2 files.
>
> On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables. Data volume is ~1TB/day; we are really trying to avoid unpacking
> them to condition the data.
>
> A few days of research has us ready to implement custom input/output
> formats to handle the ingest. Any other suggestions that may be less
> effort with low impact to load times?
>
> Thanks,
> Bryan G.
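The workaround at the top of the thread hinges on step 3: parsing each whole file with a CSV parser that treats a newline inside a quoted field as part of the value, which is exactly what line-based splitting (TextInputFormat, spark-csv) cannot do. As a rough, dependency-free sketch of what such a parser does - this is illustrative, not Super CSV's actual API; in the pipeline above it would sit inside unbzip2AndCsvToListOfArrays after decompression:

```scala
// Minimal quote-aware CSV record splitter. Unlike a line-based split,
// a '\n' inside a double-quoted field is kept as part of the field value.
def splitCsvRecords(text: String): List[List[String]] = {
  val records = scala.collection.mutable.ListBuffer[List[String]]()
  val fields  = scala.collection.mutable.ListBuffer[String]()
  val field   = new StringBuilder
  var inQuotes = false
  var i = 0
  while (i < text.length) {
    val c = text.charAt(i)
    if (inQuotes) {
      if (c == '"' && i + 1 < text.length && text.charAt(i + 1) == '"') {
        field += '"'; i += 1               // "" is an escaped quote
      } else if (c == '"') inQuotes = false // closing quote
      else field += c                       // includes embedded newlines
    } else c match {
      case '"'  => inQuotes = true          // opening quote
      case ','  => fields += field.toString; field.clear()
      case '\n' =>                          // record boundary (unquoted only)
        fields += field.toString; field.clear()
        records += fields.toList; fields.clear()
      case '\r' => ()                       // ignore CR in CRLF
      case other => field += other
    }
    i += 1
  }
  if (field.nonEmpty || fields.nonEmpty) {  // last record without trailing newline
    fields += field.toString
    records += fields.toList
  }
  records.toList
}
```

Run against the cars.csv content from above, the multi-line field comes back as a single record: splitCsvRecords("1,\"Hello1\nworld\"\n2,\"Hello2\"\n") yields two records, the first being List("1", "Hello1\nworld"). In production, a maintained parser such as Super CSV is still the better choice (it handles configurable quote chars, delimiters, and malformed input).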