Hello,

Set up
I am running my Flink streaming jobs (upgradeMode = stateless) on an AWS EKS cluster. The node type used by the streaming-job pods belongs to a node group backed by an AWS ASG (auto scaling group). The streaming jobs are FlinkDeployments managed by the flink-k8s-operator (1.8), and I have enabled the job autoscaler.

Scenario
When the Flink autoscaler scales up a streaming job, new Flink TMs are first added onto any existing nodes that have available resources. If those resources are not enough to schedule all the TM pods, the ASG adds new nodes to the EKS cluster and the remaining TM pods are scheduled on these new nodes.

Issue
After the scale-up, the TM pods scheduled on the existing nodes with available resources successfully read the checkpoint from S3. However, the TM pods scheduled on the new nodes added by the ASG run into 403 (access denied) while reading the same checkpoint file from the checkpoint location in S3.

Just FYI: I have disabled memory auto-tuning, so the auto-scaling events are in place.

What I have observed/verified so far:
1. The IAM role associated with the service account used by the FlinkDeployment is as expected for the new pods (a quick credentials check I plan to run from a new TM pod is sketched in the P.S. below).
2. I am able to reproduce this issue every single time a scale-up requires the ASG to add new nodes to the cluster.
3. If I delete the FlinkDeployment and let the operator restart it, the job comes back up and stops throwing 403.
4. I am also observing some 404 (not found) errors reported by certain newly added TM pods. They are looking for an older checkpoint (for example, looking for chk10 while chk11 has already been created in S3 and chk10 would have been subsumed by chk11).

I would appreciate any pointers on how to debug this further. Let me know if you need more information.

Thank you,
Chetas
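P.S. For reference, below is roughly the sanity check I intend to run from inside one of the newly added TM pods, to confirm whether the pod's service-account (IRSA) credentials themselves can read the checkpoint object, independent of Flink. This is only a sketch; the bucket name and object key are placeholders, not my real checkpoint paths.

import boto3
from botocore.exceptions import ClientError

def check_checkpoint_access(bucket, key):
    # boto3 resolves the IRSA web-identity credentials from the pod environment
    # (AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE), i.e. the same role the
    # Flink S3 filesystem should be using.
    sts = boto3.client("sts")
    print("Assumed identity:", sts.get_caller_identity()["Arn"])

    s3 = boto3.client("s3")
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        print("HeadObject OK, size:", head["ContentLength"])
    except ClientError as e:
        err = e.response["Error"]
        print("S3 call failed:", err["Code"], err["Message"])

if __name__ == "__main__":
    # Placeholder values -- replace with the real checkpoint bucket and path.
    check_checkpoint_access(
        "my-checkpoint-bucket",
        "flink-checkpoints/<job-id>/chk-11/_metadata",
    )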