dstandish commented on code in PR #42082: URL: https://github.com/apache/airflow/pull/42082#discussion_r1801177274
########## airflow/models/taskinstance.py: ########## @@ -1760,16 +1758,16 @@ def _handle_reschedule( ti.end_date = timezone.utcnow() ti.set_duration() - # Lock DAG run to be sure not to get into a deadlock situation when trying to insert - # TaskReschedule which apparently also creates lock on corresponding DagRun entity - with_row_locks( - session.query(DagRun).filter_by( - dag_id=ti.dag_id, - run_id=ti.run_id, - ), - session=session, - ).one() - # Log reschedule request + # set state + ti.state = TaskInstanceState.UP_FOR_RESCHEDULE + + ti.clear_next_method_args() + + session.merge(ti) + session.commit() Review Comment: Yeah I thought about that. To me it seems a good tradeoff. Let's discuss. First, the probability would seem to be quite low. Because we're talking about a very small time window. ``` 1. session.commit 2. session.add 3. session.commit ``` the time is between 1 and 3. If 2 is stuck in lock wait that could become less insignifigant (i.e. more frequent). but i think either way it seems we arrive at a better outcome. why? because let's imagine the worker blows up between 1 and 3. What is the consequence? I believe the consequence would be that no TR record is inserted although the state is set properly. That scenario looks to be handled here in the relevant dep: https://github.com/apache/airflow/blob/85fd0e1102d664337d6bb08e590979867372f61d/airflow/ti_deps/deps/ready_to_reschedule.py#L85 So the sensor would be resumed probably more quickly than it thought it would. But with the before code, it's actually worse what happens. With the old code if we imagine worker blows up at that same time, then the task does not report state correctly, it would just be an OOM task death which would be a zombie that would be detected by scheduler and the sensor would not resume poking until after retries if configured. So commiting there that seems better to me anyway, if you assume that it blows up immediately after. And, commiting there means no deadlock concern, and it means we don't need to hold the table with a select for update, which holds the table for significantly longer, and if you have lots of other sensors in same dag, it can cause trouble. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org