Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi, It depends on your case, but a shuffle is an expensive operation; it is only worth it if you want to reduce the number of files, and it is not parallel, so it can cost you a lot of time to write the data. Regards, Chanh
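
A minimal sketch of the trade-off Chanh describes, assuming a DataFrame named df and an illustrative output path (coalesce narrows partitions without a full shuffle; repartition would shuffle everything):

    import org.apache.spark.sql.SaveMode

    // Fewer partitions means fewer output files, but also fewer concurrent
    // write tasks, so the write itself takes longer.
    df.coalesce(1)            // illustrative target partition/file count
      .write
      .format("parquet")
      .mode(SaveMode.Append)
      .save("/tmp/output")    // illustrative path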

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi, I already had the following set: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") Will add the other setting too. But my question is: am I correct in assuming that Append mode shuffles all data to one node before writing? And do the other modes do the same, or do all executors write in parallel?
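
If it helps, one way to sanity-check the write parallelism is to compare the partition count against the part files produced (a rough sketch, assuming Spark 1.6 and illustrative paths):

    val df = sqlContext.read.parquet("/data/input")   // illustrative input
    println(s"partitions before write: ${df.rdd.getNumPartitions}")

    // Each task writes its own part-* file and SaveMode does not repartition,
    // so the write should produce roughly one part file per partition
    // (empty partitions aside), i.e. no shuffle to a single node.
    df.write.format("parquet")
      .mode(org.apache.spark.sql.SaveMode.Append)
      .save("/tmp/check")                             // illustrative output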

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi Anubhav, The best way to store Parquet is to partition it by time, or by a specific field that you are going to use to mark data for deletion later. In my case I partition my data by time, so I can easily delete the data after 30 days. Use Append mode and disable the summary metadata: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
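
A minimal sketch of that layout, assuming a DataFrame named df with a date column (the column name and path are illustrative):

    import org.apache.spark.sql.SaveMode

    // Summary metadata is costly to maintain when appending, so turn it off.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Partitioning by a time column writes one directory per value
    // (e.g. date=2016-09-06/), so expiring data after 30 days is just
    // a filesystem delete of the old directories, not a Spark job.
    df.write
      .format("parquet")
      .partitionBy("date")    // illustrative time column
      .mode(SaveMode.Append)
      .save("/data/events")   // illustrative path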

Best Savemode option to write Parquet file

2016-10-06 Thread morfious902002
[...] Or should I not use any of the modes at all and just do a write? Thank You, Anu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-Savemode-option-to-write-Parquet-file-tp27852.html

Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi all, I have searched a bit before posting this query. Using Spark 1.6.1: Dataframe.write().format("parquet").mode(SaveMode.Append).save("location") Note: the data in that folder can be deleted, and most of the time that folder doesn't even exist. Which SaveMode is the best, if necessary at all [...]
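
For reference, SaveMode only matters when the target location already exists; a minimal sketch of the four options against the Spark 1.6 API (path is illustrative):

    import org.apache.spark.sql.SaveMode

    // ErrorIfExists (default) - throw an exception if the location exists
    // Append                  - add new part files alongside existing data
    // Overwrite               - delete existing data, then write
    // Ignore                  - silently skip the write if data exists
    df.write.format("parquet").mode(SaveMode.Append).save("/tmp/out")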