Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
is your data skewed? Could it be that there are a few keys with a huge number of records? You might consider outputting (recordA, count) (recordB, count) instead of recordA recordA recordA ... you could do this with: input = sc.textFile pairsCounts = input.map{x => (x,1)}.reduceByKey{_ + _}

Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Sean Owen
This is more of an aside, but why repartition this data instead of letting it define partitions naturally? You will end up with a similar number. On Mar 9, 2015 5:32 PM, "mingweili0x" wrote: > I'm basically running a sorting using spark. The spark program will read > from > HDFS, sort on composit

Re: saveAsTextFile extremely slow near finish

2015-03-09 Thread Akhil Das
Don't you think 1000 is too less for 160GB of data? Also you could try using KryoSerializer, Enabling RDD Compression. Thanks Best Regards On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x wrote: > I'm basically running a sorting using spark. The spark program will read > from > HDFS, sort on compos