Hello,

I want to save a Spark job's result as LZO-compressed CSV files partitioned by
one or more columns.
Given that partitionBy is not supported by spark-csv, is there any
recommendation for achieving this in user code?
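
For reference, this is roughly the write I was hoping for (the codec option
name and the hadoop-lzo LzopCodec class are my assumptions; partitionBy is the
part spark-csv does not support):

   rawDF.write
     .format("com.databricks.spark.csv")
     .option("codec", "com.hadoop.compression.lzo.LzopCodec")
     .partitionBy("ID")
     .save("hdfs:///output/")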

One quick option is to
  i) cache the result DataFrame
  ii) get the unique partition keys
  iii) iterate over the keys, filter the result for each key, and write it out

   rawDF.cache()  // cache so each per-key filter below doesn't recompute the result

   // collect the distinct partition keys to the driver
   val idList = rawDF.select($"ID").distinct.collect.toList.map(_.getLong(0))

   // one filter + write per key, each under its own id=<value> directory
   idList.foreach { id =>
     val rows = rawDF.filter($"ID" === id)
     rows.write
       .format("com.databricks.spark.csv")
       .save(s"hdfs:///output/id=$id/")
   }

This approach doesn't scale well, especially since the number of unique IDs can
be between 500 and 700. Adding a second partition column would make it even worse.
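
For example, with a hypothetical second partition column REGION, the naive
extension does one filter and write per (ID, REGION) combination, so the number
of passes over rawDF grows multiplicatively:

   // "REGION" is a made-up second column, just to illustrate the blow-up
   val regionList = rawDF.select($"REGION").distinct.collect.toList.map(_.getString(0))
   idList.foreach { id =>
     regionList.foreach { region =>
       val rows = rawDF.filter($"ID" === id && $"REGION" === region)
       rows.write
         .format("com.databricks.spark.csv")
         .save(s"hdfs:///output/id=$id/region=$region/")
     }
   }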

Wondering if anyone has a more efficient workaround?

Srikanth
