Hi,
I already had the following set:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Will add the other setting too. But my question is: am I correct in assuming that Append mode shuffles all data to one node before writing? And do the other modes do the same, or do all executors write to the folder in parallel?

Thank You,
Anu

On Thu, Oct 6, 2016 at 11:36 AM, Chanh Le <giaosu...@gmail.com> wrote:
> Hi Anubhav,
> The best way to store Parquet is to partition it by time, or by a specific
> field that you will use to mark data for deletion after some time.
> In my case I partition my data by time, so it is easy to delete the data
> after 30 days.
> Use Append mode and disable the summary metadata:
>
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
>
>
> Regards,
> Chanh
>
>
> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <anubha...@gmail.com> wrote:
>
> Hi all,
> I have searched a bit before posting this query.
>
> Using Spark 1.6.1:
> dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
>
> Note: the data in that folder can be deleted, and most of the time that
> folder doesn't even exist.
>
> Which SaveMode is best, if one is necessary at all?
>
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle,
> as only one executor is doing the actual write. (I may be wrong.)
>
> Would using Overwrite cause all the executors to write to that folder at
> once, or would this also send data to a single executor before writing?
>
>
> Thank You,
> Anu
>
>
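For reference, a minimal sketch (Spark 1.6, Scala) of the approach Chanh describes: partition by a date column, write with Append mode, and disable the Parquet summary metadata and _SUCCESS markers. The input/output paths, the `df` DataFrame, and its `date` column are illustrative assumptions, not from the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setAppName("parquet-append"))
val sqlContext = new SQLContext(sc)

// Disable the summary metadata files and the _SUCCESS marker,
// as suggested in Chanh's reply above.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Hypothetical input; `df` is assumed to have a `date` column
// with values like "2016-10-06".
val df = sqlContext.read.json("hdfs:///input/events")

df.write
  .format("parquet")
  .partitionBy("date")           // one sub-directory per date value, e.g. date=2016-10-06
  .mode(SaveMode.Append)         // adds new part-files, keeps existing ones
  .save("hdfs:///output/events") // hypothetical output path

With this layout, each task writes its own part-files under the matching date=... sub-directory rather than funnelling data through a single node, and the 30-day retention Chanh mentions becomes a matter of deleting the oldest date=... directories.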