Hello,
I have a task that runs on a week's worth of data (let's say) and
produces a Set of tuples, e.g. Set[(String, Long)] (essentially the
output of countByValue.toMap).
I want to produce 4 such sets, one for each of 4 different weeks, and
then take the intersection of the 4 sets.
I have the sequential approach working, but the 4 weeks are obviously
independent of each other in how they produce their sets (each works on
its own data), so the same job that produces the Set for one week could
just be run as 4 jobs in parallel, each with a different week start date.
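For reference, my sequential version looks roughly like this (a
simplified sketch; the weekSet helper, the HDFS path, and the week-start
dates are just placeholders, and sc is the usual SparkContext):

import org.apache.spark.SparkContext

// Produces the per-week set: count occurrences of each record in one
// week's worth of data and collect the counts to the driver.
def weekSet(sc: SparkContext, weekStart: String): Set[(String, Long)] =
  sc.textFile(s"hdfs:///data/$weekStart/*")   // one week of input data
    .countByValue()                           // Map[String, Long] on the driver
    .toSet

val weeks = Seq("2014-03-03", "2014-03-10", "2014-03-17", "2014-03-24")

// Runs the 4 jobs one after another, then intersects the results.
val common = weeks.map(w => weekSet(sc, w)).reduce(_ intersect _)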
How is this done in Spark? Is it the runJob() method on SparkContext?
Any example code anywhere?
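Concretely, is something along these lines the intended pattern, i.e.
submitting the jobs from separate threads/Futures and letting the
scheduler overlap them? (Again just a sketch; weekSet and the week dates
are the same placeholders as in the sequential snippet above.)

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val weeks = Seq("2014-03-03", "2014-03-10", "2014-03-17", "2014-03-24")

// Each Future submits an independent Spark job from its own thread,
// so the 4 weekly jobs can run concurrently on the cluster.
val futures = weeks.map(w => Future { weekSet(sc, w) })

// Wait for all 4 result sets, then intersect them on the driver.
val common = futures
  .map(f => Await.result(f, Duration.Inf))
  .reduce(_ intersect _)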
Thanks!
Ognen