Hi Anubhav,
The best way to store Parquet is to partition it by time, or by whatever field you will later use to decide what to delete. In my case I partition the data by date, so dropping anything older than 30 days is just a matter of deleting the old partition directories.
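
As a rough sketch (the DataFrame df, the partition column event_date, and the output path here are placeholders, not from your job), the partitioned write could look like this:

import org.apache.spark.sql.SaveMode

// Each day lands in its own directory, e.g. .../event_date=2016-10-06/,
// so dropping old data is just deleting those directories.
df.write
  .format("parquet")
  .partitionBy("event_date")
  .mode(SaveMode.Append)
  .save("hdfs:///data/events")
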
Also write with Append mode and disable the summary metadata and _SUCCESS files:

// Skip the Parquet summary files (_metadata / _common_metadata)
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// Skip the _SUCCESS marker file on job commit
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Regards,
Chanh


> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <anubha...@gmail.com> wrote:
> 
> Hi all,
> I have searched a bit before posting this query.
> 
> Using Spark 1.6.1
> Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
> 
> Note: The data in that folder can be deleted, and most of the time that 
> folder doesn't even exist.
> 
> Which SaveMode is best, if one is necessary at all?
> 
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle, as 
> only one executor is doing the actual write. (I may be wrong.)
> 
> Would using Overwrite cause all the executors to write to that folder at once, or 
> would this also send the data to a single executor before writing?
> 
> Or should I not use any of the modes at all and just do a write?
> 
> 
> Thank You,
> Anu
