Hi,

I'm having trouble serializing tasks for this code:

val rddC = (rddA join rddB)
  .map { case (x, (y, z)) => z -> y }
  .reduceByKey((y1, y2) => Semigroup.plus(y1, y2), 1000)

Somehow, when running on a small data set the serialized task is about
650 KB, which is very big, and when running on a big data set it's about
1.3 MB, which is huge. I can't find what's causing it. The only reference
the closure makes to anything outside its own scope is a static method on
Semigroup, which should serialize fine. As evidence that it's okay to call
Semigroup like that: another operation in the same job uses it, and that
one serializes at a normal size (~30 KB).

The part I really don't get is why the size of the serialized task depends
on the size of the RDD. I haven't used "parallelize" to create rddA and
rddB; they're derived from transformations of Hive tables.
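The only explanation I can think of is lineage: since rddA and rddB come
from a chain of Hive-table transformations, maybe the serialized task
carries the RDDs' dependency information, which could grow with the number
of partitions in the inputs. Would checkpointing to truncate the lineage
be a reasonable way to test that? A sketch (the checkpoint path is made
up):

// Truncate lineage so tasks no longer carry the full dependency chain.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical path
rddA.checkpoint()
rddB.checkpoint()
rddA.count()  // an action forces the checkpoint to actually be written
rddB.count()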

Thanks,

- Sebastien
