andrew-stein-sp opened a new issue, #42923:
URL: https://github.com/apache/airflow/issues/42923

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.1
   
   ### What happened?
   
   We have a particular task that spawns a pod that is scheduled by itself to a 
GPU node. What I've observed is that occasionally, the k8s pod reaches a status 
of "Completed" before `pod_manager.py` finishes streaming logs from said pod to 
Airflow's logging system. Yes, this task produces A LOT of logs. 
   
   Sometimes it can take 2 minutes or more for `pod_manager.py` to actually 
catch up. The problem happens when AWS Karpenter reclaims the GPU node before 
`pod_manager.py` finishes streaming the pod logs to Airflow, resulting in a 404 
error from the Kubernetes API like the one below and the Airflow task being 
marked as failed.
   
   ```
   kubernetes.client.exceptions.ApiException: (404)
   ```
   
   We're already using the `karpenter.sh/do-not-disrupt:true` annotation, but 
that isn't effective at preventing node reclamation once the pod reaches a 
state of "Completed".
   
   We've been able to get around this for now by setting `get_logs=False` for 
that particular task, but we shouldn't have to do that. 
   
   The other possibility would be to increase the amount of time before 
Karpenter can reclaim a node, but again, these are GPU nodes, so they're 
expensive, and we run 300k tasks a day in just 1 of the 8 regions we have 
Airflow deployed to.
   
   ### What you think should happen instead?
   
   `pod_manager.py` should be able to set the Airflow task to "success" once 
the pod reaches a state of "Completed", and then continue to stream logs to 
Airflow on a "Best Effort" basis. In other words, if a Kubernetes API error is 
received while getting pod logs AFTER the pod has reached a "Completed" 
state, that error should be ignored.
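
A minimal sketch of the "Best Effort" behaviour described above, not the actual `pod_manager.py` code. All names here (`ApiError`, `stream_logs_best_effort`, `read_chunks`, `pod_completed`, `sink`) are hypothetical; `ApiError` is only a stand-in for `kubernetes.client.exceptions.ApiException`, which also exposes a `status` attribute.

```python
class ApiError(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


def stream_logs_best_effort(read_chunks, pod_completed, sink):
    """Forward log chunks to `sink`, tolerating API errors once the pod is done.

    read_chunks:   callable returning an iterator of log chunks (assumed)
    pod_completed: callable returning True once the pod has completed (assumed)
    sink:          callable that receives each chunk, e.g. a logger method

    Returns True if every chunk was streamed, False if streaming was cut
    short by an API error after the pod had already completed.
    """
    try:
        for chunk in read_chunks():
            sink(chunk)
    except ApiError as err:
        if not pod_completed():
            # The pod was still running, so the error is a real failure.
            raise
        # The pod already finished: note the gap and carry on.
        sink(f"[best effort] log streaming stopped early: API returned {err.status}")
        return False
    return True
```

With this shape, a 404 raised after the node has been reclaimed only truncates the logs; the caller can still mark the task as successful based on the pod's final state.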
   
   ### How to reproduce
   
   Create a pod that produces a lot of logs, then kill the EKS node within a 
minute or 2 of the pod reaching a state of "Completed".
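
For the first step, something like the following could serve as the pod's workload. It is only an illustrative log generator; the function name `emit_log_lines` and its parameters are made up for this sketch, and the line count needed to make log streaming lag behind is an assumption that will depend on the cluster.

```python
import sys


def emit_log_lines(count, width=200, stream=sys.stdout):
    """Write `count` numbered lines of `width` filler characters to `stream`."""
    payload = "x" * width
    for i in range(count):
        # Zero-padded counter makes it easy to see where streaming stopped.
        stream.write(f"{i:09d} {payload}\n")
    stream.flush()
    return count
```

Calling `emit_log_lines(5_000_000)` as the container's entry point would make the pod exit quickly while leaving a large log backlog to stream.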
   
   ### Operating System
   
   Official Airflow Image on python3.10 (debian)
   
   ### Versions of Apache Airflow Providers
   
   We don't have any providers pinned, so whatever versions ship with 2.10.1.
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Deployed via ArgoCD.
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
