woodywuuu commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1095231809

   Airflow 2.2.2 with MySQL 8, HA scheduler, Celery executor (Redis backend)
   
   From the logs, the task instances (ti) that reported the error
   `killed externally (status: success)` had been rescheduled:
   1. The scheduler finds a ti to schedule (ti goes from None to scheduled).
   2. The scheduler queues the ti (ti goes from scheduled to queued).
   3. The scheduler sends the ti to Celery.
   4. A worker picks up the ti.
   5. The worker finds that the ti's state in MySQL is scheduled:
https://github.com/apache/airflow/blob/2.2.2/airflow/models/taskinstance.py#L1224
   6. The worker sets the ti to None.
   7. The scheduler reschedules the ti.
   8. The scheduler cannot queue the ti again, finds that it already succeeded
      (in Celery), and therefore sets it to failed.
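
The steps above can be replayed as a small consistency check. This is only a toy sketch in plain Python, not Airflow code: the tuple layout and the helper `find_inconsistent_reads` are made up for illustration, steps 3 and 4 are left out because they involve no database read or write, and the state seen in step 8 is assumed to still be scheduled. It flags any step where an actor reads a state that differs from the last state written.

```python
NO_STATE = "none"  # stands in for a NULL state column

# (step number, actor, state the actor saw in the DB, state left in the DB afterwards)
# Hypothetical encoding of the steps listed above; step 8's "saw" value is an assumption.
STEPS = [
    (1, "scheduler", NO_STATE, "scheduled"),
    (2, "scheduler", "scheduled", "queued"),
    (5, "worker", "scheduled", "scheduled"),  # read only, nothing written
    (6, "worker", "scheduled", NO_STATE),
    (7, "scheduler", NO_STATE, "scheduled"),
    (8, "scheduler", "scheduled", "failed"),
]


def find_inconsistent_reads(steps, initial=NO_STATE):
    """Return (step, actor, last_written, saw) for every read that disagrees
    with the most recent write in the replayed sequence."""
    last_written = initial
    mismatches = []
    for number, actor, saw, left in steps:
        if saw != last_written:
            mismatches.append((number, actor, last_written, saw))
        last_written = left
    return mismatches


mismatches = find_inconsistent_reads(STEPS)
for number, actor, expected, saw in mismatches:
    print(f"step {number}: {actor} saw {saw!r}, last write was {expected!r}")
```

In this replay the only mismatch is step 5: the worker reads scheduled although step 2 last wrote queued. Which transaction wrote scheduled in between is not stated in the logs quoted here.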
   
   In MySQL we see that none of the failed tasks has an external_executor_id.
   
   We tested with 5000 DAGs, each with 50 dummy tasks, and found that the
   problem becomes much more likely when the following two conditions are
   both met:
   
   1. No external_executor_id was set on a ti that is queued in Celery:
https://github.com/apache/airflow/blob/2.2.2/airflow/jobs/scheduler_job.py#L537
      * The SQL at that link uses skip_locked, so some ti queued in Celery may
        never get their external_executor_id.
   2. A scheduler loop takes very long (more than 60s), so
      `adopt_or_reset_orphaned_tasks` judges that the SchedulerJob has failed
      and tries to adopt the orphaned ti:
https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442
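
How the two conditions could combine can be sketched with a toy model. Nothing below is Airflow code: rows are plain dicts, all function names are hypothetical, the 60s threshold is simply the figure quoted above, and the rule that a ti without an external_executor_id cannot be adopted is an assumption based on the observation that every failed task lacked one.

```python
from datetime import datetime, timedelta


def set_executor_ids_skip_locked(rows, locked_keys, ids_from_executor):
    """Mimic an update over rows selected with SKIP LOCKED: rows locked by
    another transaction are silently skipped and keep their old values."""
    skipped = []
    for row in rows:
        if row["key"] in locked_keys:
            skipped.append(row["key"])
            continue
        row["external_executor_id"] = ids_from_executor[row["key"]]
    return skipped


def job_looks_dead(latest_heartbeat, now, threshold=timedelta(seconds=60)):
    """A job whose last heartbeat is older than the threshold is treated as failed."""
    return now - latest_heartbeat > threshold


def split_for_adoption(rows):
    """Assumption: only rows that carry an external_executor_id can be adopted."""
    adoptable = [r["key"] for r in rows if r["external_executor_id"] is not None]
    not_adoptable = [r["key"] for r in rows if r["external_executor_id"] is None]
    return adoptable, not_adoptable


rows = [
    {"key": "a", "external_executor_id": None},
    {"key": "b", "external_executor_id": None},
]
# Row "b" is locked by another scheduler while the ids are being written.
skipped = set_executor_ids_skip_locked(rows, {"b"}, {"a": "celery-1", "b": "celery-2"})

t0 = datetime(2022, 4, 11, 12, 0, 0)
slow_loop = job_looks_dead(t0, t0 + timedelta(seconds=61))

print(skipped, slow_loop, split_for_adoption(rows))
```

In this model a ti is only at risk when both things happen: its row was skipped during the id update, and the scheduler loop was slow enough for the job to be judged dead.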
   
   We ran these tests:
   1. Patch `SchedulerJob._process_executor_events` so that it does not set
      external_executor_id on queued ti.
      * 300+ DAGs failed with `killed externally (status: success)`; normally
        fewer than 10 fail.
   2. Patch `adopt_or_reset_orphaned_tasks` so that it does not adopt orphaned ti.
      * No DAG failed.
   
   I read the notes
[here](https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442),
   but I still don't understand the following:
   1. Why should we handle ti that are queued in Celery and set this external id?


