I tried spark-csv with a file having a newline inside a field value - it does not work either.
$ cat /tmp/cars.csv
1,"Hello1
world"
2,"Hello2"
3,"Hello3"

scala> val df = sqlContext.read.
     |   format("com.databricks.spark.csv").
     |   load("/tmp/cars.csv")
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
  at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
  at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
  at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
  at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
  at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:223)
  at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:72)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:157)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)

What you can do instead:

1. Read the list of files into an Array[String]:
   val files = getListOfFiles(dir)

2. Create an RDD out of it and repartition by files.length, so each task gets one file:
   val filesRdd = sc.parallelize(files, files.length)

3. Un-bzip2 and parse, 1 file = 1 task:
   val lines: RDD[Array[String]] = filesRdd.flatMap(file => unbzip2AndCsvToListOfArrays(file))

   unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] can use a CSV parser
   which understands newlines inside field values, e.g. Super CSV.

4. Create an RDD of Rows:
   val rows = lines.map(line => Row.fromSeq(line.toSeq))

5. Create a DataFrame (schema describes the column names and types):
   val df = getSqlContext.createDataFrame(rows, schema)

6. Save df as ORC:
   df.repartition(outputFilesCount).write.format("orc").save(outputPath)

On Jan 12, 2016 9:58 AM, "Gerber, Bryan W" <bryan.ger...@pnnl.gov> wrote:

> From that wiki:
> "This SerDe works for most CSV data, but does not handle embedded newlines."
>
> The Hive SerDe interface is all downstream of the TextInputFormat, which
> has already split records by newlines.
> In theory you can give it a different line delimiter, but Hive 1.2.1 does
> not support it: "FAILED: SemanticException 3:20 LINES TERMINATED BY only
> supports newline '\n' right now."
>
> *From:* Alexander Pivovarov [mailto:apivova...@gmail.com]
> *Sent:* Tuesday, January 12, 2016 9:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
> Try the CSV SerDe. It should correctly parse a quoted field value having a
> newline inside:
> https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
>
> Hadoop should automatically read bz2 files.
>
> On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables. Data volume is ~1TB/day; we are really trying to avoid unpacking
> them to condition the data.
>
> A few days of research has us ready to implement custom input/output
> formats to handle the ingest. Any other suggestions that may be less
> effort with low impact to load times?
>
> Thanks,
> Bryan G.
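The workaround at the top of the thread hinges on step 3: parsing each whole file with a CSV parser that treats a newline inside a quoted field as part of the value, which is exactly what line-based splitting (TextInputFormat, spark-csv) cannot do. As a rough, dependency-free sketch of what such a parser does - this is illustrative, not Super CSV's actual API; in the pipeline above it would sit inside unbzip2AndCsvToListOfArrays after decompression:

```scala
// Minimal quote-aware CSV record splitter. Unlike a line-based split,
// a '\n' inside a double-quoted field is kept as part of the field value.
def splitCsvRecords(text: String): List[List[String]] = {
  val records = scala.collection.mutable.ListBuffer[List[String]]()
  val fields  = scala.collection.mutable.ListBuffer[String]()
  val field   = new StringBuilder
  var inQuotes = false
  var i = 0
  while (i < text.length) {
    val c = text.charAt(i)
    if (inQuotes) {
      if (c == '"' && i + 1 < text.length && text.charAt(i + 1) == '"') {
        field += '"'; i += 1               // "" is an escaped quote
      } else if (c == '"') inQuotes = false // closing quote
      else field += c                       // includes embedded newlines
    } else c match {
      case '"'  => inQuotes = true          // opening quote
      case ','  => fields += field.toString; field.clear()
      case '\n' =>                          // record boundary (unquoted only)
        fields += field.toString; field.clear()
        records += fields.toList; fields.clear()
      case '\r' => ()                       // ignore CR in CRLF
      case other => field += other
    }
    i += 1
  }
  if (field.nonEmpty || fields.nonEmpty) {  // last record without trailing newline
    fields += field.toString
    records += fields.toList
  }
  records.toList
}
```

Run against the cars.csv content from above, the multi-line field comes back as a single record: splitCsvRecords("1,\"Hello1\nworld\"\n2,\"Hello2\"\n") yields two records, the first being List("1", "Hello1\nworld"). In production, a maintained parser such as Super CSV is still the better choice (it handles configurable quote chars, delimiters, and malformed input).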