Hi ,
I have a set of data, I need to group by specific key and then save as
parquet. Refer to the code snippet below. I am querying trade and then
grouping by date
val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.seq.indexOf("date")
//group by date
val groupedByPartitionKey = df.rdd.groupBy { row =>
row.getString(partitionKeyIndex) }
val failure = groupedByPartitionKey.map(row => {
val rowDF = sqlContext.createDataFrame(sc.parallelize(row._2.toSeq),
dfSchema)
val fileName = config.getTempFileName(row._1)
try {
val dest = new Path(fileName)
if(DefaultFileSystem.getFS.exists(dest)) {
DefaultFileSystem.getFS.delete(dest, true)
}
rowDF.saveAsParquetFile(fileName)
} catch {
case e : Throwable => {
logError("Failed to save parquet
file")
}
failure = true
}
This code doesn't work well because of NestedRDD , what is the best way to
solve this problem?
Regards,
Gaurav
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Group-by-specific-key-and-save-as-parquet-tp24527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]