Is your data skewed? Could it be that there are a few keys with a huge
number of records? You might consider outputting
(recordA, count)
(recordB, count)
instead of
recordA
recordA
recordA
...
You could do this with:

val input = sc.textFile(inputPath)   // inputPath: path to your HDFS input
val pairsCounts = input.map{x => (x, 1)}.reduceByKey{_ + _}
This is more of an aside, but why repartition this data instead of letting
the input define the partitions naturally? You will end up with a similar
number of partitions either way.
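If it helps, here is a minimal sketch of what I mean (inputPath and
parseToKeyValue are placeholders for your own path and parsing logic):

val input = sc.textFile(inputPath)            // one partition per HDFS block by default
println(input.partitions.length)              // often already a reasonable count for 160GB
val sorted = input.map(parseToKeyValue).sortByKey()   // sortByKey defaults to the parent RDD's partition count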
On Mar 9, 2015 5:32 PM, "mingweili0x" wrote:
> I'm basically running a sorting using spark. The spark program will read from
> HDFS, sort on composit
Don't you think 1000 partitions is too few for 160GB of data? You could also
try using KryoSerializer and enabling RDD compression.
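For reference, a minimal sketch of those two settings in the driver (the app
name is just an example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-sort")   // example name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")   // compress serialized RDD partitions
val sc = new SparkContext(conf)

And when you do the sort, try passing a larger numPartitions than 1000.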
Thanks
Best Regards
On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x wrote:
> I'm basically running a sorting using spark. The spark program will read from
> HDFS, sort on compos