Anurag Kyal created FLINK-36983:
-----------------------------------

             Summary: Observing unreliable IO metrics
                 Key: FLINK-36983
                 URL: https://issues.apache.org/jira/browse/FLINK-36983
             Project: Flink
          Issue Type: Bug
          Components: Autoscaler
    Affects Versions: 1.18.1
            Reporter: Anurag Kyal
         Attachments: Screenshot 2024-12-31 at 2.01.53 PM.png

<Not sure yet if it's a bug or just an issue with my setup>

I have been trying to enable the autoscaler for our Flink jobs, and it hasn't 
been working as expected. I started digging into the source code and found 
that the algorithm relies heavily on the IO metrics of the job's vertices 
in the DAG. However, the IO metrics look inconsistent for my job, in which 
case the autoscaling algorithm cannot work correctly.
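
To make "inconsistent" concrete, here is a minimal sketch of the kind of check I 
have in mind. It takes a dict shaped like the job details response from the 
Flink REST API (GET /jobs/:jobid) and flags edges where the upstream vertex 
reports written records but the downstream vertex reports none read. The field 
names ("vertices", "metrics", "write-records", "read-records", "plan", "nodes", 
"inputs") are my assumption of that response's layout and have not been checked 
against 1.18.1. The function name is made up for illustration.

```python
def find_io_discrepancies(job_details):
    """Return (upstream_id, downstream_id, written, read) for suspicious edges.

    An edge is flagged when the upstream vertex reports records written
    but the downstream vertex reports zero records read.
    Field names are assumed from the REST job details response.
    """
    # Map each vertex id to its aggregated IO metrics.
    metrics = {
        v["id"]: v.get("metrics", {}) for v in job_details.get("vertices", [])
    }
    suspicious = []
    # Each plan node lists its upstream vertices under "inputs".
    for node in job_details.get("plan", {}).get("nodes", []):
        for edge in node.get("inputs", []):
            upstream, downstream = edge["id"], node["id"]
            written = metrics.get(upstream, {}).get("write-records", 0)
            read = metrics.get(downstream, {}).get("read-records", 0)
            if written > 0 and read == 0:
                suspicious.append((upstream, downstream, written, read))
    return suspicious


# Hand-written sample input, not real job output.
sample = {
    "vertices": [
        {"id": "a", "metrics": {"read-records": 0, "write-records": 100}},
        {"id": "b", "metrics": {"read-records": 0, "write-records": 50}},
        {"id": "c", "metrics": {"read-records": 50, "write-records": 0}},
    ],
    "plan": {
        "nodes": [
            {"id": "a", "inputs": []},
            {"id": "b", "inputs": [{"id": "a"}]},
            {"id": "c", "inputs": [{"id": "b"}]},
        ]
    },
}
flagged = find_io_discrepancies(sample)
```

This is only a rough heuristic: a vertex with several downstream consumers, or 
a job that has just started, could be flagged without anything being wrong.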

I had noticed inconsistent IO metrics in the UI before, but was never 
bothered by it until I found out that they are actually used as inputs to 
the autoscaling algorithm.

The screenshot below shows some of the discrepancies for a sample job.

!Screenshot 2024-12-31 at 2.01.53 PM.png|width=643,height=257!


I also want to add that I have verified, from business metrics, that the job 
is healthy and not doing anything unexpected. A consistently healthy amount 
of data is flowing in and out to the sink.

Since so many people are using autoscaling successfully, I wonder whether 
this is an issue with my setup. I would love to hear whether anyone else is 
seeing this issue, or any other insights on how to resolve it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
