[GitHub] [airflow] sweco commented on issue #19329: Kubernetes Executor: non-existent upstream failed

GitBox Fri, 12 Nov 2021 07:23:06 -0800


sweco commented on issue #19329:
URL: https://github.com/apache/airflow/issues/19329#issuecomment-967199405



   # Steps to reproduce
   
   1. Prepare a relatively small cluster, e.g. 1 node with 8 GB of RAM.
   2. Deploy Airflow using Kubernetes executor there.
   3. Prepare a DAG according to the following specification
       - It tries to run 5 tasks concurrently
       - Each task specifies 2 GB of RAM in their pod requests - i.e. the sum 
of requested RAM is greater than the total available RAM (5 * 2 GB = 10 GB > 8 
GB)
       - Each task is taking more than 300 seconds (the scheduler does not 
crash immediately after 300 seconds, so be sure to put some reserve, e.g. 600 
seconds)
   
       ```python
       t1 = []
       for i in range(5):
           t = BashOperator(
               task_id=f't1_{i}',
               bash_command='sleep 600 && echo Hello World!',
               executor_config={
                   'pod_override': k8s.V1Pod(
                       spec=k8s.V1PodSpec(
                           containers=[
                               k8s.V1Container(
                                   name="base",
                                   resources=k8s.V1ResourceRequirements(
                                       requests={
                                           'memory': '2000Mi',
                                           'cpu': '1000m'
                                       }
                                   )
                               )
                           ]
                       )
                   )
               }
           )
           t1.append(t)
       
       t2 = BashOperator(
           task_id='t2',
           bash_command='sleep 20 && date',
       )
   
       t1 >> t2
       ```
   
   4. Run the DAG, you should see that at least one of the tasks' pods will not 
be able to be run and will thus be in `Pending` state.
   
       
![image](https://user-images.githubusercontent.com/11132999/141490482-c2f5e424-1c0c-48f6-861e-8c95c3854572.png)
   
   5. After ~6 minutes the pods will be deleted by the scheduler, marked as 
`up_for_retry` and at the same time `t2` will be marked as `upstream_failed`. 
The scheduler reports this error log for each task:
   
       ```
       ERROR - Pod "internaltestt1..." has been pending for longer than 300 
seconds.It will be deleted and set to failed.
       ```
   
       In the next screenshot we can see the upstream task, whose pod was 
deleted, marked as `up_for_retry` and `t2` marked as `upstream_failed` even 
though the upstream task can still succeed running on the second try.
   
       
![image](https://user-images.githubusercontent.com/11132999/141489993-752fe86b-2a3b-49ee-b14b-4066d9ea0c23.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] sweco commented on issue #19329: Kubernetes Executor: non-existent upstream failed

Reply via email to