I will try this on Monday. Thanks for the tip. On Fri, 15 Jan 2016, 18:58 Cheng Lian <lian.cs....@gmail.com> wrote:
> You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all
> data belonging to a single (data) partition into a single (RDD) partition:
>
> df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")
>
> (Unfortunately the naming here can be quite confusing.)
>
> Cheng
>
> On 1/14/16 11:48 PM, Patrick McGloin wrote:
>
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because
> there is only one partition to work on in the dataset and all the
> partitioning, compression and saving of files has to be done by one CPU
> core.
>
> I could rewrite this to do the partitioning manually (using filter with
> the distinct partition values for example) before calling coalesce.
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
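For reference, a minimal self-contained sketch of the repartition-plus-partitionBy approach Cheng describes, assuming a Spark 1.6 SQLContext; the application name, input path, and output location below are placeholders, not anything from the thread. Note that repartition(partitionExprs: Column*) takes Column arguments, so col(...) (or $"...") is used rather than bare strings:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.sql.functions.col

    object RepartitionThenPartitionBy {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-partitionBy"))
        val sqlContext = new SQLContext(sc)

        // Assumed placeholder paths; substitute your own.
        val location = "/tmp/partitioned-output"
        val df = sqlContext.read.parquet("/tmp/input")

        // Shuffle rows so that each (entity, year, month, day, status)
        // combination lands in a single shuffle partition; partitionBy then
        // writes a single Parquet file per output directory, while the
        // shuffle itself still runs in parallel across cores.
        df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
          .write
          .partitionBy("entity", "year", "month", "day", "status")
          .mode(SaveMode.Append)
          .parquet(s"$location")
      }
    }

Compared with coalesce(1).write.partitionBy(...), this keeps the write distributed: each partition value is handled by whichever task owns it after the shuffle, rather than funnelling the whole dataset through a single core.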