mhaure-touze opened a new issue, #55368:
URL: https://github.com/apache/airflow/issues/55368

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes, amazon
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==9.12.0
   apache-airflow-providers-cncf-kubernetes==10.7.0
   
   ### Apache Airflow version
   
   2.10.3
   
   ### Operating System
   
   Amazon Linux
   
   ### Deployment
   
   Amazon (AWS) MWAA
   
   ### Deployment details
   
   An EksPodOperator that launches a pod on an EKS cluster (v1.32)
   
   ### What happened
   
   1. The operator launches the pod.
   2. The task is paused as "DEFERRED" and handed to the triggerer.
   3. The triggerer sends a "running" event.
   4. The pod operator runs its `trigger_reentry` method.
   5. Somehow the task then waits for pod completion instead of deferring again.
   6. The task stays alive until a heartbeat timeout kills it.
   
   ```
   ip-172-29-129-41.eu-west-1.compute.internal
   *** Reading remote log from Cloudwatch log_group: 
airflow-data-eng-mwaa-env-Task log_stream: 
dag_id=waititng/run_id=manual__2025-09-05T09_23_19.615897+00_00/task_id=waititng/attempt=10.log
   2025-09-05T16:22:04.599378194Z 
   2025-09-05T16:22:04.653984267Z 
   2025-09-05T16:22:04.654174133Z 
   ...
   2025-09-05T23:50:51.065801031Z   
   2025-09-05T23:50:51.078800975Z 
   2025-09-05T23:50:51.125942120Z 
   [Invalid date] {local_task_job_runner.py:123} ▶ Pre task execution logs
   [Invalid date] {base.py:84} INFO - Retrieving connection 'aws_eks_role'
   [Invalid date] {baseoperator.py:416} WARNING - EksPodOperator.execute cannot 
be called outside TaskInstance!
   [Invalid date] {pod.py:1280} INFO - Building pod 
waititng-9c5b6348-8893-405e-b769-5f0ffe3ee776-n48xxaw6 with labels: {'dag_id': 
'waititng', 'task_id': 'waititng', 'run_id': 
'manual__2025-09-05T092319.6158970000-3503ec696', 'kubernetes_pod_operator': 
'True', 'try_number': '10'}
   [Invalid date] {pod.py:572} INFO - Found matching pod 
waititng-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5 with labels 
{'airflow_kpo_in_cluster': 'False', 'airflow_version': '2.10.3', 'component': 
'singleuser-server', 'dag_id': 'waititng', 'kubernetes_pod_operator': 'True', 
'run_id': 'manual__2025-09-05T092319.6158970000-3503ec696', 'task_id': 
'waititng', 'try_number': '7'}
   [Invalid date] {pod.py:573} INFO - `try_number` of task_instance: 10
   [Invalid date] {pod.py:574} INFO - `try_number` of pod: 7
   [Invalid date] {pod.py:584} INFO - Reusing existing pod 
'waititng-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5' (phase=Running, 
reason=) since it is not terminated or evicted.
   [Invalid date] {taskinstance.py:288} INFO - Pausing task as DEFERRED. 
dag_id=waititng, task_id=waititng, 
run_id=manual__2025-09-05T09:23:19.615897+00:00, 
execution_date=20250905T092319, start_date=20250906T001827
   [Invalid date] {taskinstance.py:340} ▶ Post task execution logs
   [Invalid date] {pod.py:146} INFO - Checking pod 
'waititng-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5' in namespace 
'namespace'.
   [Invalid date] {triggerer_job_runner.py:631} INFO - Trigger 
waititng/manual__2025-09-05T09:23:19.615897+00:00/waititng/-1/10 (ID 20) fired: 
TriggerEvent<{'status': 'running', 'last_log_time': None, 'namespace': 
'namespace', 'name': 'waititng-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5', 
'eks_cluster_name': 'cluster'}>
   [Invalid date] {local_task_job_runner.py:123} ▶ Pre task execution logs
   [Invalid date] {base.py:84} INFO - Retrieving connection 'aws_eks_role'
   [Invalid date] {pod_manager.py:713} INFO - Pod 
waiting-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5 has phase Running
   [Invalid date] {pod_manager.py:713} INFO - Pod 
waiting-9c5b6348-8893-405e-b769-5f0ffe3ee776-e0pnddo5 has phase Running
   [Invalid date] {job.py:229} INFO - Heartbeat recovered after 71.80 seconds
   [Invalid date] {local_task_job_runner.py:266} INFO - Task exited with return 
code -9. For more information, see 
https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#LocalTaskJob-killed
   [Invalid date] {local_task_job_runner.py:245} ▲▲▲ Log group end
   ```
   
   ### What you think should happen instead
   
   I expect the task to alternate between the running and deferred states 
until the pod completes or fails.
   - The operator runs with `deferrable=True`
   - `logging_interval` is set to 600 seconds
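
To make the expected behaviour concrete, here is a minimal pure-Python sketch of the state sequence I would expect. It does not use Airflow or the provider code: `FakePod` and `expected_task_states` are hypothetical names invented for this illustration, and the assumption is that each re-entry only checks the pod once and then defers again.

```python
from dataclasses import dataclass


@dataclass
class FakePod:
    """Hypothetical stand-in for a pod; not part of Airflow or Kubernetes."""

    polls_until_done: int  # number of checks that still report "Running"
    polls: int = 0

    def phase(self) -> str:
        # Each call models one short re-entry of the task that checks the pod.
        self.polls += 1
        return "Succeeded" if self.polls > self.polls_until_done else "Running"


def expected_task_states(pod: FakePod) -> list[str]:
    """Return the task states expected when the task defers between checks."""
    states = ["running"]  # initial execute(): the pod is launched
    while True:
        states.append("deferred")  # wait in the triggerer (e.g. logging_interval)
        states.append("running")   # short re-entry: fetch logs, check the pod
        if pod.phase() != "Running":
            break                  # pod finished, so no further deferral
    states.append("success")
    return states


if __name__ == "__main__":
    print(expected_task_states(FakePod(polls_until_done=2)))
```

In this sketch every "running" step is short. The behaviour reported above instead looks like a single re-entry that blocks until the pod terminates.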
   
   ### How to reproduce
   
   ```
   import datetime
   
   from airflow.decorators import dag
   from airflow.providers.amazon.aws.operators.eks import EksPodOperator
   
   
   @dag(
       dag_id="wait",
       start_date=datetime.datetime(2025, 8, 4),
       schedule=None,
       catchup=False,
   )
   def wait() -> None:
       EksPodOperator(
           task_id="wait",
           aws_conn_id="aws_eks_role",
           cluster_name="cluster",
           deferrable=True,
           namespace="namespace",
           region="eu-west-1",
           # Placeholder name: the original DAG used
           # f"chromium-{pipeline_config.pipeline_id}", but pipeline_config
           # is not defined in this snippet.
           pod_name="chromium-wait",
           cmds=["/bin/sh", "-c"],
           # Loop forever so the pod never completes on its own.
           arguments=["while true; do echo 'sleeping...'; sleep 2; done"],
           image="alpine:3.22.1",
           on_finish_action="delete_pod",
           poll_interval=60,
           logging_interval=600,
       )
   
   
   wait()
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   