Thanks a lot for the explanation, Matei.
As a matter of fact, I was just reading up on narrow and wide dependencies
in the paper and saw that groupByKey is indeed a wide dependency, which,
as you explained, is the problem.
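
To make the difference concrete, here is a minimal sketch. It assumes the per-key work is something like averaging the points, which may not match the actual job. The Point case class, the sample data, and the object name are hypothetical. It also assumes a Spark version where the pair-RDD methods are available without extra imports (older releases need `import org.apache.spark.SparkContext._`). Note that reduceByKey still shuffles, but it combines values within each partition first, so only one partial aggregate per key per partition crosses the network.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical Point type, for illustration only.
case class Point(x: Double, y: Double)

object WideDependencySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wide-dependency-sketch")
      .setMaster("local[2]") // local mode, just to try it out
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: (key, point) pairs.
      val points = sc.parallelize(Seq(
        ("a", Point(1.0, 2.0)),
        ("a", Point(3.0, 4.0)),
        ("b", Point(5.0, 6.0))
      ))

      // groupByKey: every Point with the same key is shuffled to a single
      // task and held there as one collection before we can look at it.
      val viaGroup = points.groupByKey().mapValues { ps =>
        val n = ps.size
        Point(ps.map(_.x).sum / n, ps.map(_.y).sum / n)
      }

      // reduceByKey: partial sums are combined inside each partition first,
      // so only (sumX, sumY, count) per key per partition is shuffled.
      val viaReduce = points
        .mapValues(p => (p.x, p.y, 1L))
        .reduceByKey { case ((x1, y1, n1), (x2, y2, n2)) =>
          (x1 + x2, y1 + y2, n1 + n2)
        }
        .mapValues { case (sx, sy, n) => Point(sx / n, sy / n) }

      viaGroup.collect().foreach(println)
      viaReduce.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Both versions should give the same per-key averages on this data. The second only helps when the per-key computation can be expressed as an associative, commutative combine; if the whole Seq[Point] is truly needed in one place, groupByKey is still what has to happen.
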
Maybe it wouldn't be a bad thing to have a section in the docs on the
wi
The problem is that groupByKey means “bring all the points with this same key
to the same JVM”. Your input is a Seq[Point], so you have to have all the
points there. This means that a) all points will be sent across the network in
a cluster, which is slow (and Spark goes through this sending cod