Hi, I'm having trouble with the size of the serialized tasks for this code:
    val rddC = (rddA join rddB)
      .map { case (x, (y, z)) => z -> y }
      .reduceByKey({ (y1, y2) => Semigroup.plus(y1, y2) }, 1000)

Somehow, when running on a small data set, the size of the serialized task is about 650KB, which is very big, and when running on a big data set it's about 1.3MB, which is huge. I can't find what's causing it.

The only reference to an object outside the scope of the closure is to a static method on Semigroup, which should serialize fine. The evidence that it's okay to call Semigroup like that is that another operation in the same job uses it, and that one serializes to a reasonable size (~30KB).

The part I really don't get is why the size of the serialized task depends on the size of the RDD. I haven't used "parallelize" (which would embed the partition data in the tasks) to create rddA and rddB; they are derived from transformations of Hive tables.

Thanks,
- Sebastien
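
P.S. In case it helps, this is roughly how I'm measuring the closure on its own from the driver (a sketch: the Semigroup object here is a compilable stand-in for the real one in my project, and Long is just a placeholder for my actual value type):

    import org.apache.spark.SparkEnv

    // Stand-in for the real Semigroup in my project, so this compiles on
    // its own (hypothetical: my actual value type is not Long).
    object Semigroup {
      def plus(y1: Long, y2: Long): Long = y1 + y2
    }

    // Serialize just the reduce function with the same closure serializer
    // Spark uses for tasks, and print its size. Runs on the driver, where
    // SparkEnv.get is available.
    val reduceFn = (y1: Long, y2: Long) => Semigroup.plus(y1, y2)
    val bytes = SparkEnv.get.closureSerializer.newInstance().serialize(reduceFn)
    println(s"closure size: ${bytes.limit()} bytes")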
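
P.P.S. One thing I still plan to check is whether each task is dragging the whole lineage from the Hive tables along with it. A sketch of that experiment (the checkpoint directory is just an example path, and sc is my SparkContext):

    // Print the dependency chains to see how deep the lineage is; the task
    // binary includes the serialized RDD graph for the stage, so a deep
    // chain of transformations could inflate it.
    println(rddA.toDebugString)
    println(rddC.toDebugString)

    // Checkpointing truncates the lineage: once materialized, downstream
    // tasks should no longer carry the RDD's ancestry.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // example path only
    rddA.checkpoint()
    rddB.checkpoint()
    rddA.count() // actions force the checkpoints to materialize
    rddB.count()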