Hi, I've got a huge list of key-value pairs, where the key is an integer and the value is a long string (around 1 KB). I want to concatenate the strings that share the same key.
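
To make the goal concrete, here is a minimal sketch of the intended result on plain Scala collections rather than Spark. The object name ConcatSketch and the sample data are made up for illustration; the real values are much longer.

```scala
object ConcatSketch {
  // Hypothetical sample data: Int keys with short string values
  // (stand-ins for the real ~1 KB strings).
  val samplePairs: Seq[(Int, String)] = Seq(
    (1, "alpha"),
    (2, "beta"),
    (1, "gamma"),
    (2, "delta"),
    (3, "epsilon")
  )

  // Group the pairs by key, then join each key's values with a single
  // space, keeping the values in the order they appear in the input.
  val concatenated: Map[Int, String] =
    samplePairs
      .groupBy { case (key, _) => key }
      .map { case (key, group) => key -> group.map(_._2).mkString(" ") }
}
```

With this sample input, key 1 maps to "alpha gamma", key 2 to "beta delta", and key 3 to "epsilon".
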
Initially I did something like:

    pairs.reduceByKey((a, b) => a + " " + b)

and then tried to save the result to HDFS. But it was extremely slow, and in the end I had to kill the job. My guess is that the values are too big and that this slows down the shuffle phase.

So I tried calling sortByKey before reduceByKey. sortByKey on its own is very fast, and writing its result back to HDFS is fast too. But as soon as I added reduceByKey, it was as slow as before.

How can I make this simple operation faster?

Thanks,
Fan