Hello, I want to save a Spark job's result as LZO-compressed CSV files, partitioned by one or more columns. Given that partitionBy is not supported by spark-csv, is there any recommendation for achieving this in user code?
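For reference, the plain, non-partitioned write is roughly what I have in mind below; this is just a sketch, and the "codec" write option together with hadoop-lzo's LzopCodec class are my assumptions about how the LZO part would be handled:

    // Minimal sketch of a non-partitioned LZO-compressed CSV write.
    // Assumes hadoop-lzo's LzopCodec is on the classpath and that the
    // spark-csv version in use supports the "codec" write option.
    rawDF.write
      .format("com.databricks.spark.csv")
      .option("codec", "com.hadoop.compression.lzo.LzopCodec")
      .save("hdfs:///output/")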
One quick option is to i) cache the result dataframe, ii) get the unique partition keys, and iii) iterate over the keys and filter the result for each key:

    rawDF.cache
    val idList = rawDF.select($"ID").distinct.collect.toList.map(_.getLong(0))
    idList.foreach { id =>
      val rows = rawDF.filter($"ID" === id)
      rows.write.format("com.databricks.spark.csv").save(s"hdfs:///output/id=$id/")
    }

This approach doesn't scale well, especially since the number of unique IDs can be between 500 and 700, and adding a second partition column will make it even worse (see the P.S. below for an illustration). Wondering if anyone has an efficient workaround?

Srikanth
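P.S. To make the second-column concern concrete, here is roughly what the same loop would look like with a hypothetical second partition column DATE (the column name and types are only for illustration): one filter-and-write job per (ID, DATE) combination.

    // Hypothetical extension of the loop above to two partition columns.
    // With 500-700 IDs times the number of distinct dates, this becomes
    // thousands of small filter + write jobs over the cached dataframe.
    val keyPairs = rawDF.select($"ID", $"DATE").distinct.collect
      .map(r => (r.getLong(0), r.getString(1)))
    keyPairs.foreach { case (id, date) =>
      rawDF.filter($"ID" === id && $"DATE" === date)
        .write.format("com.databricks.spark.csv")
        .save(s"hdfs:///output/id=$id/date=$date/")
    }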