I am expanding my data set and running pyspark on YARN:
I noticed that only 2 processes are processing the data:
14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7
32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7
*Question:*
*how to configure
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets wrote:
> Hi,
>
> I come from a map/reduce background and am trying to do something quite trivial:
>
> I have a lot of files (on HDFS) in this format:
>
> 1 , 2 , 3
> 2 , 3 , 5
> 1 , 3, 5
> 2, 3 , 4
> 2 , 5, 1
>
> I actually need to grou
Hi,
I come from a map/reduce background and am trying to do something quite trivial:
I have a lot of files (on HDFS) in this format:
1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1
I actually need to group by key (the first column):
key values
1 --> (2,3),(3,5)
2 --> (3,5),(3
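
To make the grouping above concrete, here is a minimal sketch of how it might look in Python. The names `parse_line`, `group_locally` and `group_with_spark` are hypothetical helpers, not part of any library, and the sketch assumes every line holds exactly three comma-separated integers. The local version only checks the logic on the five sample rows; the Spark version assumes an existing SparkContext `sc` and an HDFS path of your own.

```python
from collections import defaultdict


def parse_line(line):
    """Turn 'k , a , b' into (k, (a, b)); int() tolerates the uneven spacing."""
    key, a, b = (int(field) for field in line.split(","))
    return key, (a, b)


def group_locally(lines):
    """Plain-Python grouping by first column, for checking the logic without a cluster."""
    groups = defaultdict(list)
    for line in lines:
        key, pair = parse_line(line)
        groups[key].append(pair)
    return dict(groups)


def group_with_spark(sc, path):
    """Same grouping on an RDD. Assumes `sc` is a live SparkContext and `path`
    points at the input files (both are supplied by the caller)."""
    return sc.textFile(path).map(parse_line).groupByKey().mapValues(list)


# The five sample rows from the message, spacing kept as written.
sample = ["1 , 2 , 3", "2 , 3 , 5", "1 , 3, 5", " 2, 3 , 4", " 2 , 5, 1"]
grouped = group_locally(sample)
```

On a cluster, the order of values within each key is not guaranteed after `groupByKey()`, so the result should be compared as a set of pairs per key rather than as an ordered list.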