I think you could create a DataFrame with the schema (mykey, value1, value2) and then partition it by mykey when saving as Parquet:
    val r2 = rdd.map { case (k, v) => Row(k, v._1, v._2) }
    val df = sqlContext.createDataFrame(r2, schema)
    df.write.partitionBy("mykey").parquet(path)

On Tue, Mar 15, 2016 at 10:33 AM, Mohamed Nadjib MAMI <m...@iai.uni-bonn.de> wrote:
> Hi,
>
> I have a pair RDD of the form: (mykey, (value1, value2))
>
> How can I create a DataFrame having the schema [V1 String, V2 String] to
> store [value1, value2] and save it into a Parquet table named "mykey"?
>
> The createDataFrame() method takes an RDD and a schema (StructType) as
> parameters. The schema is known up front ([V1 String, V2 String]), but
> getting an RDD by partitioning the original RDD based on the key is what I
> can't get my head around so far.
>
> Similar questions have come up before (e.g.
> http://stackoverflow.com/questions/25046199/apache-spark-splitting-pair-rdd-into-multiple-rdds-by-key-to-save-values),
> but they do not use DataFrames.
>
> Thanks in advance!
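For completeness, a minimal self-contained sketch of the same approach. This assumes a Spark 1.x `sc`/`sqlContext` pair; the example data, the column names ("mykey", "V1", "V2"), and the output path "/tmp/out" are illustrative assumptions, not from the original thread:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Illustrative pair RDD of the form (mykey, (value1, value2)).
val rdd = sc.parallelize(Seq(("k1", ("a", "b")), ("k2", ("c", "d"))))

// Schema includes the key column so partitionBy can use it.
val schema = StructType(Seq(
  StructField("mykey", StringType),
  StructField("V1", StringType),
  StructField("V2", StringType)))

// Flatten each (key, (v1, v2)) pair into a three-column Row.
val rows = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
val df = sqlContext.createDataFrame(rows, schema)

// Each distinct key becomes its own directory under the output path,
// e.g. /tmp/out/mykey=k1/ and /tmp/out/mykey=k2/.
df.write.partitionBy("mykey").parquet("/tmp/out")
```

Note that with partitioned Parquet output you get one dataset with key-named subdirectories rather than separate tables per key; reading the whole path back recovers the "mykey" column via partition discovery.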