andrew-stein-sp opened a new issue, #42923: URL: https://github.com/apache/airflow/issues/42923
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.10.1

### What happened?

We have a task that spawns a pod which is scheduled on its own onto a GPU node. Occasionally, the k8s pod reaches a status of "Completed" before `pod_manager.py` finishes streaming that pod's logs to Airflow's logging system. This task produces a lot of logs, and it can take 2 minutes or more for `pod_manager.py` to catch up.

The problem occurs when AWS Karpenter reclaims the GPU node before the pod logs have finished streaming to Airflow. The Kubernetes API then returns a 404 error like the one below, and the Airflow task is marked as failed.

```
kubernetes.client.exceptions.ApiException: (404)
```

We're already using the `karpenter.sh/do-not-disrupt:true` annotation, but it does not prevent node reclamation once the pod reaches a state of "Completed".

We've worked around this for now by setting `get_logs=False` for that particular task, but we shouldn't have to do that. The other option would be to increase the amount of time before Karpenter can reclaim a node, but these are GPU nodes, so they're expensive, and we run 300k tasks a day in just 1 of the 8 regions we have Airflow deployed to.

### What you think should happen instead?

`pod_manager.py` should be able to set the Airflow task to "success" once the pod reaches a state of "Completed", and then continue streaming logs to Airflow on a best-effort basis. In other words, if a Kubernetes API error is received while fetching pod logs after the pod has reached a "Completed" state, that error should be ignored.
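
To make the proposal concrete, here is a minimal sketch of the best-effort behavior. It is not the actual `pod_manager.py` code: `ApiError`, `stream_logs_best_effort`, `read_logs`, `pod_completed`, and `emit` are hypothetical names used only for illustration, and the sketch assumes the caller already tracks whether the pod was last seen as "Completed".

```python
from typing import Callable, Iterable, Iterator, List


class ApiError(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status: int) -> None:
        super().__init__(f"({status})")
        self.status = status


def stream_logs_best_effort(
    read_logs: Callable[[], Iterable[str]],
    pod_completed: Callable[[], bool],
    emit: Callable[[str], None],
) -> bool:
    """Forward log lines to `emit`; return True if the stream ended cleanly.

    An API error raised after the pod has completed is swallowed, so the
    log stream is simply truncated. An API error raised while the pod is
    still running propagates unchanged.
    """
    try:
        for line in read_logs():
            emit(line)
    except ApiError:
        if pod_completed():
            return False  # logs are incomplete, but the pod already finished
        raise
    return True


def flaky_logs() -> Iterator[str]:
    """Simulated log stream that fails with a 404 partway through."""
    yield "step 1"
    yield "step 2"
    raise ApiError(404)  # simulates the node being reclaimed mid-stream


collected: List[str] = []
clean = stream_logs_best_effort(flaky_logs, lambda: True, collected.append)
print(collected, clean)
```

In this sketch the caller can still tell that the logs were cut short (the return value is `False`) and could log a warning, while the task outcome is decided only by the pod's final state.
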

### How to reproduce

Create a pod that produces a lot of logs, then kill the EKS node within a minute or two of the pod reaching a state of "Completed".

### Operating System

Official Airflow image on Python 3.10 (Debian)

### Versions of Apache Airflow Providers

We don't have any providers pinned, so whatever versions ship with 2.10.1.

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Deployed via ArgoCD.

### Anything else?

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)