Re: Save RDD with partition information

2015-01-13 Thread lihu
By the way, I am not sure whether the same shuffle key is guaranteed to go to the same container.

Re: Save RDD with partition information

2015-01-13 Thread lihu
There is no way to avoid a shuffle if you use combineByKey, even if your data is cached in memory, because the shuffle write must write the data to disk. And it seems that Spark cannot guarantee that the same key (K1) goes to Container_X. You can use tmpfs for your shuffle dir, this ca

Re: Save RDD with partition information

2015-01-13 Thread Raghavendra Pandey
I believe the default hash partitioner logic in Spark will send all the same keys to the same machine. On Wed, Jan 14, 2015, 03:03 Puneet Kapoor wrote: > Hi, > > I have a use case wherein I have an hourly Spark job which creates hourly > RDDs, which are partitioned by keys. > > At the end of the day I
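
To illustrate the hash-partitioning idea mentioned in this reply, here is a minimal sketch in plain Python (not Spark code). It assumes a partition is chosen as a stable hash of the key modulo the partition count; `zlib.crc32`, the partition count of 4, and the names `partition_for`, `partition_records` and `combine_day` are all hypothetical choices for this example. It only shows that the same key lands in the same partition index for every hourly dataset, so a daily combine can work index by index. It does not show that a given partition index is scheduled on the same machine or container, which is the open question in this thread.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, chosen only for this example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stable hash of the key, reduced modulo the partition count.
    # zlib.crc32 is a stand-in for the key's hash code (an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions=NUM_PARTITIONS):
    # Place each (key, value) pair into the partition chosen by its key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def combine_day(hourly_partitions, num_partitions=NUM_PARTITIONS):
    # Sum values per key, one partition index at a time. Every hour used the
    # same partition function, so a key never crosses partition indices.
    daily = []
    for idx in range(num_partitions):
        totals = defaultdict(int)
        for hour in hourly_partitions:
            for key, value in hour[idx]:
                totals[key] += value
        daily.append(dict(totals))
    return daily


# Hypothetical hourly data: K1 appears in both hours.
hour0 = partition_records([("K1", 1), ("K2", 5)])
hour1 = partition_records([("K1", 3), ("K3", 7)])
daily = combine_day([hour0, hour1])
```

Here `daily[partition_for("K1")]` holds the combined value for K1, and no other partition index contains that key.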

Save RDD with partition information

2015-01-13 Thread Puneet Kapoor
Hi, I have a use case wherein I have an hourly Spark job which creates hourly RDDs, which are partitioned by keys. At the end of the day I need to access all of these RDDs and combine the key/value pairs over the day. If there is a key K1 in RDD0 (1st hour of day), RDD1 ... RDD23 (last hour of the d