By the way, I am not sure whether the same shuffle key will go to the
same container.
There is no way to avoid a shuffle if you use combineByKey, no matter
whether your data is cached in memory, because the shuffle write must
write the data to disk. And it seems that Spark cannot guarantee that the
same key (K1) goes to Container_X.
You can use tmpfs for your shuffle dir, this ca
I believe the default hash partitioner logic in Spark will send all the
same keys to the same machine.
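The idea behind the default hash partitioner can be sketched in a few lines of plain Python. This is only an illustration of the scheme (partition = hash of key modulo number of partitions), not Spark's actual implementation: Spark's HashPartitioner uses the key's JVM `hashCode`, and PySpark a portable hash, while the built-in `hash()` stands in here.

```python
# Sketch of hash partitioning: a key is always routed to
# hash(key) mod numPartitions, so every record with the same
# key lands in the same partition (and hence the same machine)
# after the shuffle. Plain Python hash() stands in for Spark's
# key hashing; the scheme, not the hash function, is the point.

def partition_for(key, num_partitions):
    """Deterministically map a key to a partition index."""
    return hash(key) % num_partitions

# Within one run, "K1" always maps to the same partition.
assert partition_for("K1", 8) == partition_for("K1", 8)
assert 0 <= partition_for("K1", 8) < 8
```

Because the mapping depends only on the key and the partition count, two RDDs partitioned with the same partitioner and the same number of partitions place equal keys in equally numbered partitions.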
On Wed, Jan 14, 2015, 03:03 Puneet Kapoor wrote:
> Hi,
>
> I have a use case wherein an hourly Spark job creates hourly
> RDDs, which are partitioned by keys.
>
> At the end of the day I
Hi,
I have a use case wherein an hourly Spark job creates hourly
RDDs, which are partitioned by keys.
At the end of the day I need to access all of these RDDs and combine the
Key/Value pairs over the day.
If there is a key K1 in RDD0 (1st hour of the day), RDD1 ... RDD23 (last
hour of the day
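The daily combine described above can be sketched without Spark: take the 24 hourly batches of (key, value) pairs and fold them into one daily value per key, which is the same shape as `sc.union(hourly_rdds).reduceByKey(add)` would produce. The hourly data below is made up purely for illustration.

```python
# Plain-Python stand-in for unioning the hourly RDDs and
# reducing by key: every (key, value) pair from every hour is
# folded into a single daily total per key.

def combine_daily(hourly_batches):
    """Merge hourly (key, value) pairs into per-key daily sums."""
    totals = {}
    for batch in hourly_batches:
        for key, value in batch:
            totals[key] = totals.get(key, 0) + value
    return totals

hourly = [
    [("K1", 1), ("K2", 2)],   # hour 0
    [("K1", 3)],              # hour 1
    [("K2", 5), ("K3", 7)],   # hour 2
]
daily = combine_daily(hourly)
# K1 appears in hours 0 and 1, so its daily value is 1 + 3 = 4.
assert daily == {"K1": 4, "K2": 7, "K3": 7}
```

In Spark the reduce side of this is exactly the shuffle discussed above: if each hourly RDD was built with the same partitioner and partition count, K1's pairs already sit in the same-numbered partition of every RDD.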