Anurag Kyal created FLINK-36983:
-----------------------------------

             Summary: Observing unreliable IO metrics
                 Key: FLINK-36983
                 URL: https://issues.apache.org/jira/browse/FLINK-36983
             Project: Flink
          Issue Type: Bug
          Components: Autoscaler
    Affects Versions: 1.18.1
            Reporter: Anurag Kyal
         Attachments: Screenshot 2024-12-31 at 2.01.53 PM.png

<Not sure yet whether this is a bug or just an issue with my setup>

I have been trying to enable the autoscaler for our Flink jobs, and it has not been working as expected. I started digging into the source code and found that the algorithm relies heavily on the IO metrics of the job's vertices in the DAG. However, the IO metrics for my job are quite inconsistent, and with inputs like these the autoscaling algorithm cannot work correctly.

I had noticed inconsistent IO metrics in the UI before, but it did not concern me until I found out that they are used as inputs to the autoscaling algorithm. The screenshot below shows some of the discrepancies for a sample job.

!Screenshot 2024-12-31 at 2.01.53 PM.png|width=643,height=257!

I have also verified from business metrics that the job is healthy and not doing anything unexpected. A consistently healthy amount of data is flowing into the job and out to the sink.

Since so many people use the autoscaler successfully, I wonder whether this is an issue with my setup. I would love to hear whether anyone else is seeing this issue, or any other insights on how to resolve it.

--
This message was sent by Atlassian Jira (v8.20.10#820010)