Hi,
It depends on your case, but a shuffle is an expensive operation. Unless you
need to reduce the number of output files, avoid it: the write is no longer
parallel, so it can cost you a lot of time to write the data.
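A minimal sketch of that trade-off in Scala (df, numFiles and the output path
are placeholders, not from this thread):
import org.apache.spark.sql.SaveMode
// coalesce avoids a full shuffle but concentrates the write on few tasks;
// repartition shuffles everything but keeps the write spread across executors.
val singleFile = df.coalesce(1)            // one task writes one big file
val spreadOut  = df.repartition(numFiles)  // full shuffle, many tasks write
spreadOut.write.format("parquet").mode(SaveMode.Append).save("/path/to/output")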
Regards,
Chanh
> On Oct 7, 2016, at 1:25 AM, Anubhav Agarwal wrote:
>
> Hi,
> I already had the following set […]
Hi,
I already had the following set:-
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Will add the other setting too.
But my question is: am I correct in assuming that Append mode shuffles all data
to one node before writing?
And do other modes do the same, or do all executors write?
Or should I not use any of the modes at all and just do a write?
Thank You,
Anu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Best-Savemode-option-to-write-Parquet-file-tp27852.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Hi Anubhav,
The best way to store Parquet data is to partition it by time, or by a specific
field that you are going to mark for deletion after some time.
In my case I partition my data by time, so I can easily delete the data after
30 days.
Use Append mode and disable the summary metadata:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
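A minimal sketch of the setup Chanh describes (the DataFrame df, the partition
column "date" and the output path are assumptions):
import org.apache.spark.sql.SaveMode
// Skip the Parquet summary files, then append new rows into per-date
// directories; old data can be dropped by deleting directories older than 30 days.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
df.write
  .format("parquet")
  .partitionBy("date")
  .mode(SaveMode.Append)
  .save("/data/events")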
Hi all,
I have searched a bit before posting this query.
Using Spark 1.6.1
Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
Note:- The data in that folder can be deleted, and most of the time that
folder doesn't even exist.
Which SaveMode is the best, if necessary at all?
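For reference, a minimal sketch of the four SaveMode values in Spark 1.6, using
the same call shape as above (dataframe and "location" are placeholders); they
only differ in what happens when data already exists at the target path:
import org.apache.spark.sql.SaveMode
dataframe.write.format("parquet").mode(SaveMode.Append).save("location")        // add new files to existing data
dataframe.write.format("parquet").mode(SaveMode.Overwrite).save("location")     // replace whatever is there
dataframe.write.format("parquet").mode(SaveMode.ErrorIfExists).save("location") // default: fail if the path exists
dataframe.write.format("parquet").mode(SaveMode.Ignore).save("location")        // skip the write if data exists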