Hi Qin,

Thanks for bringing up this issue. AFAIK, Flink has no mechanism for
dynamically re-assigning tasks at runtime: operator state would have to be
correctly re-distributed across the nodes, which is highly error-prone and
not well suited to the current computation model.

However, if the data-skew pattern of those jobs can be predicted, maybe
FLIP-56: Dynamic Slot Allocation
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation>
and FLIP-53: Fine Grained Operator Resource Management
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management>
could help alleviate this issue. You can also try tuning techniques such as
*Local-Global Aggregation* to reduce data skew in some scenarios.
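
To make the Local-Global idea concrete, here is a minimal plain-Python
sketch (not Flink code, and not how Flink implements it internally). The
key names, record counts, and the choice of 4 upstream subtasks are all
made up for illustration. Each upstream subtask pre-aggregates its own
records, so the downstream subtask that owns a hot key receives one
partial result per upstream subtask instead of every raw record:

```python
from collections import defaultdict


def local_aggregate(records):
    """Local phase: sum values per key within one upstream subtask."""
    partial = defaultdict(int)
    for key, value in records:
        partial[key] += value
    return dict(partial)


def global_aggregate(partials):
    """Global phase: merge the partial sums coming from all upstream subtasks."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Hypothetical skewed input: the key "hot" dominates the stream.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Assumption: records are spread round-robin over 4 upstream subtasks.
upstream = [records[i::4] for i in range(4)]

partials = [local_aggregate(part) for part in upstream]
result = global_aggregate(partials)

# Messages the subtask owning "hot" would receive in each setup.
hot_messages_without_local = sum(1 for key, _ in records if key == "hot")
hot_messages_with_local = sum(1 for partial in partials if "hot" in partial)

print(result)
print(hot_messages_without_local, hot_messages_with_local)
```

The final result is the same either way; only the number of records
shuffled to the hot subtask changes. This only helps for aggregations that
can be merged from partial results (sum, count, min, max and similar).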

Best,
Weike

On Mon, Jan 10, 2022 at 2:44 PM Chen Qin <qinnc...@gmail.com> wrote:

> Hi there,
>
> We run multiple large-scale applications on YARN clusters. One observation
> is that those jobs are often CPU-skewed due to topology or data skew on subtasks.
> For better or worse, the skew leads to a few task managers consuming
> many vcores while the majority of task managers consume far fewer. Our goal is to
> save on the total infra budget while keeping the jobs running smoothly.
>
> Are there any ongoing discussions in this area? Naively, if we know for sure
> from previous runs that a few tasks (uuids) use more vcores, could we request
> one last batch of containers with a high-vcore resource profile and reassign
> those tasks to them?
>
> Thanks,
> Chen
>
