Hi Matt,
Is there a reason you need to call coalesce on every loop iteration? Most
likely it forces Spark to do lots of unnecessary shuffles. Also, for a
really large number of inputs this approach can lead to failures (typically
a stack overflow) due to too many nested RDD.union calls. A safer approach
is to call union from SparkContext once, as soon as you have all RDDs
ready. In Python it looks like this:
rdds = []
for i in range(cnt):
    rdd = ...  # load the i-th input here
    rdds.append(rdd)
# a single union over the whole list instead of one union per iteration
finalRDD = sparkContext.union(rdds)
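To make the nesting point concrete, here is a toy model in plain Python. It is only a sketch and does not touch Spark at all: Leaf, Union, chained_union and flat_union are made-up names for illustration, and the only assumption is that each union node keeps references to its parents, so a union-per-iteration loop builds a chain about as deep as the number of inputs.

```python
# Toy model only: these classes are hypothetical stand-ins, not Spark's API.
class Leaf:
    """Stands in for one loaded input RDD."""

    def depth(self):
        return 1


class Union:
    """Stands in for a union node that keeps references to its parents."""

    def __init__(self, parents):
        self.parents = list(parents)

    def depth(self):
        # Depth of the lineage below this node, counting the node itself.
        return 1 + max(p.depth() for p in self.parents)


def chained_union(leaves):
    # Mirrors rdd = rdd.union(nextRdd) inside a loop: each step wraps
    # the previous result in one more union node.
    result = leaves[0]
    for leaf in leaves[1:]:
        result = Union([result, leaf])
    return result


def flat_union(leaves):
    # Mirrors one union call over the whole list: a single node whose
    # parents are all the inputs.
    return Union(leaves)


leaves = [Leaf() for _ in range(60)]
print(chained_union(leaves).depth())  # 60
print(flat_union(leaves).depth())     # 2
```

With 60 inputs the chained version is 60 levels deep in this model, while the single union stays at 2 no matter how many inputs there are.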
HTH,
Tomasz
On 18.06.2015 at 02:53, Matt Forbes wrote:
I have multiple input paths which each contain data that need to be
mapped in a slightly different way into a common data structure. My
approach boils down to:
RDD<T> rdd = null;
for (Configuration conf : configurations) {
    RDD<T> nextRdd = loadFromConfiguration(conf);
    rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
    rdd = rdd.coalesce(nextRdd.partitions().size());
}
Now, for a small number of inputs there doesn't seem to be a problem,
but the full set, which is about 60 sub-RDDs totalling around 500MM
records, takes a very long time to construct. Even for a simple
load-then-count example job, it takes 13 minutes in total, of which the
count() task accounts for only 2 minutes.
Is there something I should be doing differently here? If you can't
tell, this is in Java, so my RDD is probably some mess of nested wrapped
RDDs, but I'm not sure if that would be the real issue.
Thanks,
Matt