subject:"PySpark on Yarn \- how group by data properly"

Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets

I expanded my data set and am running PySpark on YARN. I noticed that only 2 processes handled the data:

14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7
32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7

*Question:* *how to configure

Re: PySpark on Yarn - how group by data properly

2014-09-09 Thread Davies Liu

On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets wrote:
> Hi,
>
> I come from a map/reduce background and am trying to do something quite trivial:
>
> I have a lot of files (on HDFS) in this format:
>
> 1, 2, 3
> 2, 3, 5
> 1, 3, 5
> 2, 3, 4
> 2, 5, 1
>
> I actually need to grou

PySpark on Yarn - how group by data properly

2014-09-09 Thread Oleg Ruchovets

Hi,

I come from a map/reduce background and am trying to do something quite trivial:

I have a lot of files (on HDFS) in this format:

1, 2, 3
2, 3, 5
1, 3, 5
2, 3, 4
2, 5, 1

I actually need to group by key (first column):

key values
1 --> (2,3),(3,5)
2 --> (3,5),(3
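For readers following the thread, the grouping asked about above could be sketched roughly as follows. This is only an illustration under assumptions, not the answer given on the list: `parse_line`, `group_locally`, and `group_with_spark` are hypothetical helper names, and `path` stands in for whatever HDFS location holds the files. The Spark variant relies on the standard RDD calls `textFile`, `map`, `groupByKey`, and `mapValues`, and expects an already created SparkContext to be passed in.

```python
from collections import defaultdict


def parse_line(line):
    """Split '1 , 2 , 3' into (1, (2, 3)): the first column is the key."""
    fields = [int(field.strip()) for field in line.split(",")]
    return fields[0], tuple(fields[1:])


def group_locally(lines):
    """Pure-Python reference for the grouping, handy for checking results."""
    groups = defaultdict(list)
    for line in lines:
        key, values = parse_line(line)
        groups[key].append(values)
    return dict(groups)


def group_with_spark(sc, path):
    """Same grouping as an RDD pipeline.

    `sc` is an existing SparkContext and `path` is a hypothetical
    HDFS location; both are assumptions for this sketch.
    """
    return (sc.textFile(path)
              .map(parse_line)      # (key, (v1, v2)) pairs
              .groupByKey()         # key -> iterable of value tuples
              .mapValues(list))     # materialize each group as a list


# The five sample rows from the message above.
sample = ["1 , 2 , 3", "2 , 3 , 5", "1 , 3, 5", " 2, 3 , 4", " 2 , 5, 1"]
grouped = group_locally(sample)
```

On a cluster, something like `group_with_spark(sc, path).collect()` would return the grouped pairs to the driver. The order of values within each group is not guaranteed after a shuffle, so sort them if order matters.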