In the case of a sort-merge join, where a shuffle (exchange) will be
performed, I have the following questions (please correct me if my
understanding is wrong):

Let's say that relation A is a JSONRelation (640 MB) on HDFS, where the
block size is 64 MB. This will produce a Scan JSONRelation() of
640 / 64 = 10 partitions, where each partition will contain |A| / 10 rows.
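
For concreteness, here is a minimal sketch of that first step (the path and
the 1.x-era read API are assumptions on my side):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // Hypothetical ~640 MB JSON file on HDFS with a 64 MB block size
    val a = sqlContext.read.json("hdfs:///data/relationA.json")

    // One input partition per HDFS block: 640 / 64 = 10
    println(a.rdd.partitions.length)  // expected: 10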

The second step will be a hashPartitioning(#key, 200), where #key is the
equi-join key and 200 is the default number of shuffle partitions
(spark.sql.shuffle.partitions). Each partition will be computed in an
individual task, in which every row will be hashed on #key and then
written to the corresponding chunk (of the 200 resulting chunks) directly
on disk.
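
As a sanity check of my reading of this step, a rough sketch of how the
target chunk would be chosen; it mirrors the RDD-level
org.apache.spark.HashPartitioner (Spark SQL's HashPartitioning hashes the
join expressions in the same spirit):

    // The shuffle partition count is configurable; 200 is the default
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    // Target chunk for a row, given its join key (illustrative only)
    def targetChunk(key: Any, numPartitions: Int = 200): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h  // non-negative mod, as in HashPartitioner
    }

    // Each map task writes a row to chunk targetChunk(row.key), so rows
    // with equal join keys always end up in the same chunk index.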

Q1: What happens if a hashed row produced on executor A must be written to
a chunk that is stored on executor B? Does Spark use the
HashShuffleManager to transfer it over the network?

Q2: After the sort (3rd) step there will be 200 resulting partitions/chunks
for each of relations A and B, which will be combined into 200
SortMergeJoin tasks, each containing (|A|/200 + |B|/200) rows. For each
pair (chunkOfA, chunkOfB), will chunkOfA and chunkOfB contain rows for the
same hash key?
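
To make the premise behind Q2 concrete, a tiny self-contained check (again
assuming hashCode-based partitioning; the keys are made up):

    def targetChunk(key: Any, numPartitions: Int = 200): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }

    // Both relations use the same function and the same count (200),
    // so a given key lands in the same chunk index on both sides.
    Seq("user1", "user2", "user3").foreach { k =>
      println(s"$k -> chunkOfA ${targetChunk(k)}, chunkOfB ${targetChunk(k)}")
    }
    // By construction, chunk i holds every key k with targetChunk(k) == i.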


