Hi Qin,

Thanks for bringing up this issue. AFAIK, Flink has no mechanism for dynamic task re-assignment at runtime: operator state would have to be correctly redistributed across the nodes, which is highly error-prone and not well suited to the current computation model.
However, if the data skew pattern of those jobs can be predicted, FLIP-56: Dynamic Slot Allocation <https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation> and FLIP-53: Fine Grained Operator Resource Management <https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management> might help alleviate the issue.

Also, you can try tuning techniques such as *Local-Global Aggregation* to reduce data skew in some scenarios.

Best,
Weike

On Mon, Jan 10, 2022 at 2:44 PM Chen Qin <qinnc...@gmail.com> wrote:

> Hi there,
>
> We ran multiple large scale applications YARN clusters, one observation
> were those jobs often CPU skewed due to topology or data skew on subtasks.
> And for better or worse, the skew leads to a few task managers consuming
> large vcores while majority task managers consume much less. Our goal is to
> save the total infra budget while keeping the job running smoothly.
>
> Any ongoing discussions in this area? Naively, if we know for sure a few
> tasks (uuids) use higher vcore from previous runs, could we request one
> last batch of containers with high vcore resource profile and reassign
> those tasks?
>
> Thanks,
> Chen
>
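
P.S. To illustrate the Local-Global Aggregation tip from the reply: the idea is that each upstream subtask pre-aggregates its own records per key, so the subtask that owns a hot key receives a handful of partial results instead of every raw record. Below is a minimal, framework-free sketch in plain Python. It is not Flink code; the function names and the sample data are hypothetical and only show the two-phase shape.

```python
from collections import Counter


def local_aggregate(records):
    """Phase 1: pre-aggregate (key, value) pairs inside one upstream subtask."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return partial


def global_aggregate(partials):
    """Phase 2: merge the per-subtask partial sums into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


# Hypothetical skewed input: the key "hot" dominates both upstream subtasks.
subtask_a = [("hot", 1)] * 5 + [("cold", 1)]
subtask_b = [("hot", 1)] * 4 + [("warm", 1)]

partials = [local_aggregate(subtask_a), local_aggregate(subtask_b)]
result = global_aggregate(partials)

# Records shuffled to the global stage, with and without the local phase.
shuffled_with_local = sum(len(p) for p in partials)
shuffled_without_local = len(subtask_a) + len(subtask_b)
```

Here the global stage sees only two partial records for "hot" rather than nine raw ones. In Flink SQL the equivalent behaviour is, as far as I know, enabled through the mini-batch options (`table.exec.mini-batch.enabled`, `table.exec.mini-batch.allow-latency`, `table.exec.mini-batch.size`) together with `table.optimizer.agg-phase-strategy` set to `TWO_PHASE`; please check the docs for your Flink version.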