This usually happens when one of the workers is stuck in a GC pause and the connection times out. Set the following configurations on the SparkConf used to create the SparkContext:
conf.set("spark.rdd.compress", "true")
conf.set("spark.storage.memoryFraction", "1")
conf.set("spark.core.connection.ack.wait.timeout", "6000")
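For completeness, a minimal, self-contained sketch of wiring this up (the application name is a placeholder; the configuration keys are the ones from the answer above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-join-app")                           // placeholder app name
  .set("spark.rdd.compress", "true")                      // compress serialized RDD partitions
  .set("spark.storage.memoryFraction", "1")               // fraction of heap for the storage cache (here: all of it)
  .set("spark.core.connection.ack.wait.timeout", "6000")  // wait longer for acks so GC pauses don't cause timeouts
val sc = new SparkContext(conf)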
Hello all. I have been running a Spark job that eventually needs to do a large join:
24 million x 150 million.
A broadcast join is clearly infeasible at this scale, so I am instead attempting the join with hash partitioning, defining a custom partitioner as:
import org.apache.spark.HashPartitioner

class RDD2Partitioner(partitions: Int) extends HashPartitioner(partitions) {
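  // The original post is cut off here, so the override below is a sketch rather than the
  // poster's actual code: it assumes the RDD keys are (Int, String) tuples and hashes only
  // the Int join key, so matching keys from both RDDs land in the same partition.
  override def getPartition(key: Any): Int = key match {
    case (id: Int, _) => super.getPartition(id)   // partition by the join key only
    case _            => super.getPartition(key)  // fall back to plain hash partitioning
  }
}

// Hypothetical usage (RDD names and partition count are placeholders): partition both
// sides with the same partitioner so the join itself does not trigger another shuffle.
// val partitioner = new RDD2Partitioner(1000)
// val joined = rdd1.partitionBy(partitioner).join(rdd2.partitionBy(partitioner))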