Hi Anubhav,
The best way to store Parquet is to partition it by time, or by whatever field you will later use to decide what to delete. In my case I partition the data by date, so dropping anything older than 30 days is just a matter of deleting the old partition directories.
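
As a rough sketch (the DataFrame df, the partition column event_date, and the output path here are placeholders, not from your job), the partitioned write could look like this:

import org.apache.spark.sql.SaveMode

// Each day lands in its own directory, e.g. .../event_date=2016-10-06/,
// so dropping old data is just deleting those directories.
df.write
  .format("parquet")
  .partitionBy("event_date")
  .mode(SaveMode.Append)
  .save("hdfs:///data/events")
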
Also write with Append mode and disable the summary metadata and _SUCCESS files:

// Skip the Parquet summary files (_metadata / _common_metadata)
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// Skip the _SUCCESS marker file on job commit
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Regards,
Chanh


> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <anubha...@gmail.com> wrote:
> 
> Hi all,
> I have searched a bit before posting this query.
> 
> Using Spark 1.6.1
> Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
> 
> Note: The data in that folder can be deleted, and most of the time that 
> folder doesn't even exist.
> 
> Which SaveMode is best, if one is necessary at all?
> 
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle, as 
> only one executor is doing the actual write. (I may be wrong.)
> 
> Would using Overwrite cause all the executors to write to that folder at once, or 
> would this also send the data to a single executor before writing?
> 
> Or should I not use any of the modes at all and just do a write?
> 
> 
> Thank You,
> Anu
