I will try this on Monday. Thanks for the tip. On Fri, 15 Jan 2016, 18:58 Cheng Lian <lian.cs....@gmail.com> wrote:
> You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all
> data belonging to a single (data) partition into a single (RDD) partition:
>
> df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")
>
> (Unfortunately the naming here can be quite confusing.)
>
> Cheng
>
> On 1/14/16 11:48 PM, Patrick McGloin wrote:
>
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because
> there is only one partition to work on in the dataset and all the
> partitioning, compression and saving of files has to be done by one CPU
> core.
>
> I could rewrite this to do the partitioning manually (using filter with
> the distinct partition values for example) before calling coalesce.
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
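For reference, a minimal self-contained sketch of the repartition-plus-partitionBy approach Cheng describes, assuming a Spark 1.6 SQLContext; the application name, input path, and output location below are placeholders, not anything from the thread. Note that repartition(partitionExprs: Column*) takes Column arguments, so col(...) (or $"...") is used rather than bare strings:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.sql.functions.col

    object RepartitionThenPartitionBy {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-partitionBy"))
        val sqlContext = new SQLContext(sc)

        // Assumed placeholder paths; substitute your own.
        val location = "/tmp/partitioned-output"
        val df = sqlContext.read.parquet("/tmp/input")

        // Shuffle rows so that each (entity, year, month, day, status)
        // combination lands in a single shuffle partition; partitionBy then
        // writes a single Parquet file per output directory, while the
        // shuffle itself still runs in parallel across cores.
        df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
          .write
          .partitionBy("entity", "year", "month", "day", "status")
          .mode(SaveMode.Append)
          .parquet(s"$location")
      }
    }

Compared with coalesce(1).write.partitionBy(...), this keeps the write distributed: each partition value is handled by whichever task owns it after the shuffle, rather than funnelling the whole dataset through a single core.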