Re: Spark Small file issue

2020-06-29 Thread Hichki
All 800 files (in a partition folder) are tiny; each file's size is measured in bytes. Together they sum up to about 200 MB, which is the input size of each partition folder. I am using ORC format; I have never used Parquet.

Re: Spark Small file issue

2020-06-29 Thread Bobby Evans
So, I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each < 128 MB. If they were all 128 MB, that would be about 100 GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle about 100 GB to a single thread and write…

Re: Spark Small file issue

2020-06-29 Thread Hichki
Hi, I am doing the repartition at the end, i.e., just before the insert overwrite into the table. I see that this last step (the repartition) is taking most of the time.
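
For illustration, a rough sketch of the pattern described above (a repartition immediately before the insert overwrite); the source and target table names and the partition column are placeholders, not taken from the thread:

    import org.apache.spark.sql.functions.col

    // Placeholder for the DataFrame produced by the converted Hive query.
    val result = spark.table("source_db.source_table")

    // One shuffle right before the write, so each partition of the target table
    // is written by a small number of tasks instead of by every task.
    result
      .repartition(col("part_col"))
      .write
      .mode("overwrite")
      .insertInto("target_db.target_table")

That final shuffle is the extra cost being observed: it trades time in the last stage for fewer, larger output files.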

Re: Spark Small file issue

2020-06-24 Thread Koert Kuipers
I second that. We have gotten bitten too many times by coalesce impacting upstream stages in unintended ways, so I avoid coalesce on write altogether. I prefer to use repartition (and take the shuffle hit) before writing (especially if you are writing out partitioned data), or, if possible, use adaptive query execution…
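
As an illustration of that approach, a minimal sketch for a partitioned write; the input table, partition column, and output path are assumptions, not from the thread:

    import org.apache.spark.sql.functions.col

    // Placeholder input.
    val df = spark.table("source_db.events")

    // Repartition on the output partition column so each partition directory is
    // written by a small number of tasks, keeping the file count per directory low.
    df.repartition(col("event_date"))
      .write
      .partitionBy("event_date")
      .format("orc")
      .mode("overwrite")
      .save("/warehouse/example_output")   // hypothetical output path

On Spark 3.x, adaptive query execution can reduce the number of post-shuffle partitions automatically via spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled.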

Re: Spark Small file issue

2020-06-24 Thread Bobby Evans
First, you need to be careful with coalesce. It will impact upstream processing: if you are doing a lot of computation in the last stage before the coalesce, then coalesce will make the problem worse, because all of that computation will happen in a single thread instead of being spread out. M…
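
To make the difference concrete, a small sketch; the computation-heavy step and output paths are stand-ins for the real job:

    import org.apache.spark.sql.functions.{col, sha2}

    // Stand-in for a computation-heavy narrow transformation (no shuffle of its own).
    val expensive = spark.range(0L, 10000000L)
      .withColumn("digest", sha2(col("id").cast("string"), 256))

    // coalesce(1) is folded into the same stage, so the sha2 work above runs in one task.
    expensive.coalesce(1).write.mode("overwrite").orc("/tmp/out_coalesce")

    // repartition(1) inserts a shuffle boundary: the sha2 work keeps its parallelism,
    // and only the single output file is written by one task.
    expensive.repartition(1).write.mode("overwrite").orc("/tmp/out_repartition")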

Re: Spark Small file issue

2020-06-23 Thread German SM
Hi, when reducing partitions it is better to use coalesce, because it doesn't need to shuffle the data: dataframe.coalesce(1)

On Tue, 23 Jun 2020 at 23:54, Hichki wrote:
> Hello Team,
> I am new to the Spark environment. I have converted a Hive query to Spark Scala.
> Now I am loading data and d…
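
A minimal, self-contained sketch of that call (the numbers here are only illustrative):

    // Start with whatever parallelism the cluster gives, then narrow to one partition.
    val df = spark.range(0L, 1000000L)
    println(df.rdd.getNumPartitions)        // e.g. 8, depends on default parallelism

    val single = df.coalesce(1)             // merges partitions without a full shuffle
    println(single.rdd.getNumPartitions)    // 1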