Anis,
If your partitions are smaller than your smallest machine, and you
request executors for your Spark jobs no larger than your smallest machine,
then Spark's cluster manager will automatically assign many executors to your
larger machines.
As long as you request small executors, you will get good utilization even on
a cluster of mixed machine sizes.
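For example, a rough sketch of that sizing (the 2-core / 4g numbers here are
placeholders, not a recommendation; fit them to your smallest node):

import org.apache.spark.sql.SparkSession

// Size each executor to fit the smallest machine; with dynamic
// allocation enabled, larger machines simply host more of these
// small executors, so the cluster balances itself.
val spark = SparkSession.builder()
  .appName("heterogeneous-cluster-example")
  .config("spark.executor.cores", "2")               // <= cores on the smallest node
  .config("spark.executor.memory", "4g")             // <= memory on the smallest node
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")   // needed for dynamic allocation
  .getOrCreate()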
Thank you very much for your reply.
I guess this approach balances the load across the cluster of machines.
However, I am looking for something that works on a heterogeneous cluster for
which the distribution is not known a priori.
Cheers,
Anis
On Tue, 14 Feb 2017 at 20:19, Galen Marchetti wrote:
Anis,
I've typically seen people handle skew by seeding the keys corresponding to
high volumes with random values, then partitioning the dataset based on the
original key *and* the random value, then reducing.
Ex: (key, value) -> (key, random_salt, value)
This transformation reduces the size of the huge partition, making the load
much more even across tasks.
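A minimal sketch of that two-stage reduce in Scala (the salt count and the
sum aggregation are illustrative assumptions, not from this thread):

import scala.util.Random
import org.apache.spark.rdd.RDD

def skewAwareSum(pairs: RDD[(String, Long)], numSalts: Int): RDD[(String, Long)] = {
  pairs
    // Stage 1: append a random salt so each hot key is split
    // across up to numSalts sub-keys before the first reduce.
    .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
    .reduceByKey(_ + _)
    // Stage 2: drop the salt and finish the reduction on the original key.
    .map { case ((k, _), v) => (k, v) }
    .reduceByKey(_ + _)
}

Note this only works as-is for associative, commutative aggregations; for a
join you would instead replicate the salt values on the other side.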
Dear All,
I have a few use cases for Spark Streaming where the Spark cluster consists of
heterogeneous machines.
Additionally, there is skew present in both the input distribution (e.g.,
each tuple is drawn from a Zipf distribution) and the service time (e.g., the
service time required for each tuple comes from a skewed distribution as well).