Hi,
I already had the following set:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Will add the other setting too. But my question is: am I correct in assuming that Append mode shuffles all data to one node before writing? And do the other modes do the same, or do all executors write to the folder in parallel?

Thank You,
Anu

On Thu, Oct 6, 2016 at 11:36 AM, Chanh Le <giaosu...@gmail.com> wrote:
> Hi Anubhav,
> The best way to store Parquet is to partition it by time, or by a specific
> field that you will use to mark data for deletion after some time.
> In my case I partition my data by time, so it is easy to delete the data
> after 30 days.
> Use Append mode and disable the summary metadata:
>
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
>
>
> Regards,
> Chanh
>
>
> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <anubha...@gmail.com> wrote:
>
> Hi all,
> I have searched a bit before posting this query.
>
> Using Spark 1.6.1:
> dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
>
> Note: the data in that folder can be deleted, and most of the time that
> folder doesn't even exist.
>
> Which SaveMode is best, if one is necessary at all?
>
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle,
> as only one executor is doing the actual write. (I may be wrong.)
>
> Would using Overwrite cause all the executors to write to that folder at
> once, or would this also send data to a single executor before writing?
>
>
> Thank You,
> Anu
>
>
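For reference, a minimal sketch (Spark 1.6, Scala) of the approach Chanh describes: partition by a date column, write with Append mode, and disable the Parquet summary metadata and _SUCCESS markers. The input/output paths, the `df` DataFrame, and its `date` column are illustrative assumptions, not from the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setAppName("parquet-append"))
val sqlContext = new SQLContext(sc)

// Disable the summary metadata files and the _SUCCESS marker,
// as suggested in Chanh's reply above.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Hypothetical input; `df` is assumed to have a `date` column
// with values like "2016-10-06".
val df = sqlContext.read.json("hdfs:///input/events")

df.write
  .format("parquet")
  .partitionBy("date")           // one sub-directory per date value, e.g. date=2016-10-06
  .mode(SaveMode.Append)         // adds new part-files, keeps existing ones
  .save("hdfs:///output/events") // hypothetical output path

With this layout, each task writes its own part-files under the matching date=... sub-directory rather than funnelling data through a single node, and the 30-day retention Chanh mentions becomes a matter of deleting the oldest date=... directories.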