Zhanghao Chen created FLINK-31769: ------------------------------------- Summary: Add percentiles to aggregated metrics Key: FLINK-31769 URL: https://issues.apache.org/jira/browse/FLINK-31769 Project: Flink Issue Type: Improvement Components: Autoscaler, Runtime / Metrics Reporter: Zhanghao Chen Attachments: image-2023-04-11-15-11-51-471.png
*Background* Currently only min/avg/max of metrics are exposed via REST API. Flink Autoscaler relies on these aggregated metrics to make predictions, and the type of aggregation plays an import role. [FLINK-30652] Use max busytime instead of average to compute true processing rate - ASF JIRA (apache.org) suggests that using max aggregator instead of avg of busy time can handle data skew more robustly. However, we found that for large-scale jobs, using max aggregation may be too sensitive. As a result, the true processing rate is underestimated with severe turbulence. The graph below is the true processing rate estimated with different aggregators of a real production data transmission job with a parallelism of 750. !image-2023-04-11-15-11-51-471.png! *Proposal* Add percentiles (p50, p90, p99) to aggregated metrics. Apache common maths can be used for computing that. A follow up would be making Flink autoscaler make use of the new aggregators. -- This message was sent by Atlassian Jira (v8.20.10#820010)