Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi, it depends on your case, but a shuffle is an expensive operation, worth doing only if you want to reduce the number of files. It is also not parallel, so writing the data might cost you a lot of time. Regards, Chanh > On Oct 7, 2016, at 1:25 AM, Anubhav Agarwal wrote: > > Hi, > I already had the f

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi, I already had the following set: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") I will add the other setting too. But my question is: am I correct in assuming that Append mode shuffles all data to one node before writing? And do other modes do the same, or all executors write

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi Anubhav, The best way to store Parquet is to partition it by time, or by a specific field that you are going to use to mark data for deletion later. In my case I partition my data by time, so I can easily delete the data after 30 days. Use it with mode Append and disable the summary information sc.hadoopCon
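
The advice in this message (partition by time, write with Append, disable Parquet summary metadata) could look roughly like the sketch below. It assumes Spark 2.x with a `SparkSession`; the column name `day`, the sample rows, and the output path `/tmp/events_parquet` are hypothetical and do not come from the thread. Only the `parquet.enable.summary-metadata` setting is quoted from the messages above.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionedParquetAppend {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would get its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-append")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Setting quoted in the thread: do not write Parquet summary metadata files.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")

    // Hypothetical sample data; "day" is an assumed time-based partition column.
    val events = Seq(
      ("2016-10-05", "a", 1),
      ("2016-10-06", "b", 2)
    ).toDF("day", "key", "value")

    // Hypothetical output location.
    val out = "/tmp/events_parquet"

    // Append new files under one sub-directory per distinct value of "day".
    events.write
      .mode(SaveMode.Append)
      .partitionBy("day")
      .parquet(out)

    spark.stop()
  }
}
```

With this layout each day's rows land in their own `day=...` directory under the output path, so expiring old data amounts to deleting the directories for the days that are no longer needed.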