All 800 files in a partition folder are tiny, with sizes in the bytes range; together they sum to about 200 MB, which is the input size of each partition folder. I am using ORC format and have never used Parquet.
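If the goal is to avoid writing ~800 tiny files per partition folder, one common fix is to compact each partition on write. A minimal sketch, assuming hypothetical paths and that ~200 MB of input packs into two ~100 MB output files:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-orc").getOrCreate()

// Hypothetical partition folder holding ~800 small ORC files.
val df = spark.read.orc("/warehouse/mytable/load_date=2020-06-23")

// ~200 MB of input fits in two ~100 MB output files.
df.repartition(2)
  .write
  .mode("overwrite")
  .orc("/warehouse/mytable_compacted/load_date=2020-06-23")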
So I should have done some back-of-the-napkin math before all of this. You
are writing out 800 files, each < 128 MB. If they were each 128 MB, that
would be 800 × 128 MB ≈ 100 GB of data being written. I'm not sure how much
hardware you have, but the fact that you can shuffle about 100 GB to a
single thread
and w
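The estimate above can also be checked against the actual data. A short sketch using the Hadoop FileSystem API, assuming a spark-shell session where spark is in scope and a hypothetical partition path:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// getContentSummary walks the directory tree and totals the file lengths.
val bytes = fs.getContentSummary(new Path("/warehouse/mytable/load_date=2020-06-23")).getLength
println(f"${bytes / 1e6}%.1f MB total in this partition folder")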
Hi,
I am doing the repartition at the end, i.e., just before insert-overwriting
the table. I can see that this last step (the repartition) is what takes the
most time.
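For reference, the pattern described above looks roughly like this; the table names, the DataFrame, and the partition count of 200 are all hypothetical:

val transformed = spark.table("db.staging_view")  // hypothetical upstream result

// Repartition as the very last step, just before the insert overwrite.
val result = transformed.repartition(200)

result.write
  .mode("overwrite")
  .insertInto("db.target_table")

The shuffle triggered by repartition shows up as the slow "last step" in the UI, even when much of the cost comes from the stages feeding it.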
i second that. we have gotten bitten so many times by coalesce impacting
upstream stages in unintended ways that i avoid coalesce on write altogether.
i prefer to use repartition (and take the shuffle hit) before writing
(especially if you are writing out partitioned), or if possible use
adaptive query execution; a sketch of both follows.
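a minimal sketch of both options, with a hypothetical table and partition column load_date; the AQE settings are the standard Spark 3.x ones:

import org.apache.spark.sql.functions.col

val df = spark.table("db.source")  // hypothetical input

// option 1: explicit repartition on the output partition column, so each
// file-system partition is written by exactly one task (one file each).
df.repartition(col("load_date"))
  .write
  .partitionBy("load_date")
  .mode("overwrite")
  .orc("/warehouse/mytable")

// option 2: let adaptive query execution coalesce small shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")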
First, you need to be careful with coalesce. It will impact upstream
processing: if you are doing a lot of computation in the last stage before
the write, then coalesce will make the problem worse, because all of that
computation will happen in a single thread (for coalesce(1)) instead of
being spread out across the cluster.
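To make that concrete: in the sketch below (df and the UDF are stand-ins, not the original poster's job), coalesce(1) collapses the whole last stage, including the expensive function, into one task, while repartition(1) keeps the expensive function at full parallelism and only writes through a single post-shuffle task.

import org.apache.spark.sql.functions.{col, udf}

val df = spark.table("db.source")  // hypothetical input with a string column "in"

// Stand-in for any heavy per-row computation.
val expensiveUdf = udf((s: String) => s.reverse)

// coalesce(1): the map and the write run together in a single task.
df.withColumn("out", expensiveUdf(col("in"))).coalesce(1).write.orc("/tmp/out1")

// repartition(1): the map runs across all partitions; only the
// post-shuffle write is single-threaded.
df.withColumn("out", expensiveUdf(col("in"))).repartition(1).write.orc("/tmp/out2")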
Hi,
When reducing the number of partitions it is better to use coalesce, because
it doesn't need to shuffle the data:
dataframe.coalesce(1)
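One way to see the difference, assuming any DataFrame df in a spark-shell: coalesce appears in the physical plan as a Coalesce node with no exchange, whereas repartition introduces a shuffle Exchange.

df.coalesce(1).explain()    // expect a Coalesce node, no Exchange (no shuffle)
df.repartition(1).explain() // expect an Exchange node (a full shuffle)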
On Tue., Jun. 23, 2020, 23:54, Hichki wrote:
> Hello Team,
>
> I am new to the Spark environment. I have converted a Hive query to Spark Scala.
> Now I am loading data and d