Syed,
Thanks for the tip. I'm not sure coalesce does what I intend, which is,
in effect, to subdivide the RDD into N parts (by calling coalesce and then
operating on the resulting partitions). It sounds like, however,
this won't bottleneck my processing power. If this sets off any ala
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy when you have sparse data and want to compact it
(e.g. data left over after applying a strict filter), or when you already
know the number of partitions that is optimal for your cluster.
For instance, I need to work with an RDD in terms of N parts. Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?
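
To make the question concrete, here is a toy model in plain Python (not
Spark code) of what coalescing to N partitions does. It assumes the RDD is
represented as a list of lists, one inner list per partition; the function
name and the merge strategy are illustrative simplifications, since real
Spark also takes data locality into account. The one Spark fact it relies
on is that coalesce without a shuffle only merges existing partitions, so
it can reduce the partition count but not raise it, while coalesce with
shuffle=True (which is what repartition does) redistributes the records.

```python
def coalesce(partitions, n, shuffle=False):
    """Toy model of coalescing a list of partitions down (or up) to n."""
    if shuffle:
        # With a shuffle, every record is redistributed over n new
        # partitions (round-robin here, as a simplification).
        out = [[] for _ in range(n)]
        records = [r for part in partitions for r in part]
        for i, record in enumerate(records):
            out[i % n].append(record)
        return out

    # Without a shuffle, whole partitions are merged into fewer ones,
    # so the partition count can only stay the same or go down.
    n = min(n, len(partitions))
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out


# Five partitions left sparse by a strict filter.
parts = [[1], [], [2, 3], [], [4]]

compacted = coalesce(parts, 2)                  # merge 5 partitions into 2
unchanged = coalesce(parts, 8)                  # cannot grow without a shuffle
spread = coalesce(parts, 8, shuffle=True)       # grows, at the cost of a shuffle
```

In this model, asking for more partitions than the RDD already has is a
no-op unless a shuffle is requested, and merged partitions can end up
uneven, which is where a bottleneck would come from if one large partition
lands on a single task.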
On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:
> Hi,
>
> Quick question about partitions. If my RDD is partitioned into 5
> partitions, does that mean that I