Starting from Spark 1.4, you can do this via dynamic partitioning:

sqlContext.table("trade").write.partitionBy("date").parquet("/tmp/path")
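partitionBy writes one subdirectory per distinct value of the partition column, and reading the root path back reconstructs "date" as a regular column, so filters on it can prune whole directories. A quick sketch (the dates shown are illustrative):

// Resulting layout, e.g.:
//   /tmp/path/date=2015-08-31/part-*.parquet
//   /tmp/path/date=2015-09-01/part-*.parquet

val trades = sqlContext.read.parquet("/tmp/path")
trades.filter(trades("date") === "2015-09-01").show()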

Cheng

On 9/1/15 8:27 AM, gtinside wrote:
Hi,

I have a set of data that I need to group by a specific key and then save as
Parquet. Refer to the code snippet below: I query the trade table and then
group by date.

val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.indexOf("date")

// group by date
val groupedByPartitionKey = df.rdd.groupBy { row =>
  row.getString(partitionKeyIndex)
}

val failure = groupedByPartitionKey.map { case (key, rows) =>
  // Building a DataFrame here calls sc.parallelize inside the closure
  // of another RDD's transformation
  val rowDF = sqlContext.createDataFrame(sc.parallelize(rows.toSeq), dfSchema)
  val fileName = config.getTempFileName(key)
  try {
    val dest = new Path(fileName)
    if (DefaultFileSystem.getFS.exists(dest)) {
      DefaultFileSystem.getFS.delete(dest, true)
    }
    rowDF.saveAsParquetFile(fileName)
    false
  } catch {
    case e: Throwable =>
      logError("Failed to save parquet file")
      true // flag the failure for this key
  }
}

This code doesn't work because it nests RDD operations: sc.parallelize and
sqlContext.createDataFrame are called inside the closure of another RDD's
transformation, which Spark doesn't support. What is the best way to
solve this problem?
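
One workaround that avoids nesting RDDs (on releases before 1.4) is to
collect the distinct keys on the driver and write one filtered DataFrame
per key. A rough sketch, reusing the config.getTempFileName and logError
helpers from the snippet above and assuming the number of distinct dates
is modest:

val df = sqlContext.sql("SELECT * FROM trade")
df.cache() // each key below triggers a pass over the data

// Collect the distinct partition keys to the driver
val dates = df.select("date").distinct().collect().map(_.getString(0))

var failure = false
for (d <- dates) {
  val fileName = config.getTempFileName(d)
  try {
    // saveAsParquetFile is the pre-1.4 API; on 1.4+ this would be
    // df.filter(...).write.parquet(fileName)
    df.filter(df("date") === d).saveAsParquetFile(fileName)
  } catch {
    case e: Throwable =>
      logError("Failed to save parquet file for date " + d)
      failure = true
  }
}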

Regards,
Gaurav


