Hi,

I've got a huge list of key-value pairs, where the key is an integer and
the value is a long string (around 1 KB). I want to concatenate the strings
that share the same key.

Initially I did something like: pairs.reduceByKey((a, b) => a+" "+b)

Then I tried to save the result to HDFS, but it was extremely slow, and in
the end I had to kill the job.

I guess it's because the values are too big, which slows down the shuffle
phase. So I tried to use sortByKey before doing reduceByKey.
sortByKey is very fast, and it's also fast when writing the result back to
HDFS. But when I did reduceByKey, it was as slow as before.
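
In case it helps, here is a minimal sketch of the two variants described
above. It is not my actual job: the object and method names, the
command-line paths, and the input layout (one tab-separated key and value
per line) are placeholders assumed for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark releases
import org.apache.spark.rdd.RDD

// Hypothetical object name, used only for this sketch.
object ConcatByKey {
  // Variant 1: reduceByKey directly on the pairs.
  def reduceOnly(pairs: RDD[(Int, String)]): RDD[(Int, String)] =
    pairs.reduceByKey((a, b) => a + " " + b)

  // Variant 2: sortByKey first, then the same reduceByKey.
  def sortThenReduce(pairs: RDD[(Int, String)]): RDD[(Int, String)] =
    pairs.sortByKey().reduceByKey((a, b) => a + " " + b)

  def main(args: Array[String]): Unit = {
    // Placeholder paths, passed on the command line.
    val Array(inputPath, outputPath) = args
    val sc = new SparkContext(new SparkConf().setAppName("ConcatByKey"))

    // Assumed input layout: one "key<TAB>value" record per line.
    val pairs = sc.textFile(inputPath).map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k.toInt, v)
    }

    reduceOnly(pairs).saveAsTextFile(outputPath)
    sc.stop()
  }
}
```

Both variants end in the same reduceByKey call, and that is the step that
is slow for me.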

How can I make this simple operation faster?

Thanks,
Fan
