RE: Scalability of group by

2015-04-28 Thread Ulanov, Alexander
@spark.apache.org Subject: Re: Scalability of group by Hi, I can offer a few ideas to investigate in regards to your issue here. I've run into resource issues doing shuffle operations with a much smaller dataset than 2B. The data is going to be saved to disk by the BlockManager as part o

Re: Scalability of group by

2015-04-28 Thread Richard Marscher
> > > *From:* ayan guha [mailto:guha.a...@gmail.com] > *Sent:* Monday, April 27, 2015 6:58 PM > *To:* Ulanov, Alexander > *Cc:* user@spark.apache.org > *Subject:* Re: Scalability of group by > > > > Hi > > Can you test on a smaller dataset to identify if it is cluster issue

RE: Scalability of group by

2015-04-27 Thread Ulanov, Alexander
@spark.apache.org Subject: Re: Scalability of group by Hi, can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark? On 28 Apr 2015 11:30, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote: Hi, I am running a group by on a dataset of 2

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi, can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark? On 28 Apr 2015 11:30, "Ulanov, Alexander" wrote: > Hi, > > > > I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] > in Spark 1.3 as follows: > > “select id, time, first(value)

Scalability of group by

2015-04-27 Thread Ulanov, Alexander
Hi, I am running a group by on a dataset of 2B of RDD[Row[id, time, value]] in Spark 1.3 as follows: "select id, time, first(value) from data group by id, time". My cluster is 8 nodes with 16GB RAM and one worker per node. Each executor is allocated 5GB of memory. However, all executors ar
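
For readers unfamiliar with the query in the original post, here is a minimal sketch of what "select id, time, first(value) from data group by id, time" computes, written in plain Python rather than Spark. The function name `first_value_per_key` and the sample rows are hypothetical, and "first" here means first in iteration order, which is an assumption: in a distributed engine the order in which rows reach the aggregation is generally not guaranteed.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical row layout mirroring the post: (id, time, value).
Row = Tuple[int, int, float]


def first_value_per_key(rows: Iterable[Row]) -> Dict[Tuple[int, int], float]:
    """Keep the first value seen for each (id, time) key.

    Only one value per key is retained, so memory grows with the
    number of distinct keys, not with the number of input rows.
    """
    result: Dict[Tuple[int, int], float] = {}
    for id_, time, value in rows:
        # setdefault stores the value only if the key is not present yet.
        result.setdefault((id_, time), value)
    return result


# Made-up sample data for illustration only.
rows = [(1, 10, 0.5), (1, 10, 0.7), (2, 10, 1.5), (1, 20, 2.5)]
firsts = first_value_per_key(rows)
for (id_, time), value in sorted(firsts.items()):
    print(id_, time, value)
```

Running the same function on a small sample of the real data is one way to act on the suggestion in the thread to test on a smaller dataset first.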