Hi all,

I’m busy tuning a workflow (defined with Cascading, planned with Flink) that runs on a 5-slave EMR cluster.

The default parallelism (from the Flink planner) is set to 40, since I’ve got 5 task managers (one per node) and 8 slots per task manager. But this seems to jam things up: I see simultaneous GroupReduce subtasks that appear to be competing for resources.

Any insight into how to tune this? And what’s the right way to set the parallelism on a per-sub-task basis?

With Cascading Flows planned for M-R, I can set the number of reducers via a Hadoop JobConf configuration setting, on a per-step (to use Cascading lingo) basis. But with a Flow planned for Flink, there’s only a single “step”.

Thanks,

-- Ken