dstandish commented on code in PR #42082:
URL: https://github.com/apache/airflow/pull/42082#discussion_r1801177274


##########
airflow/models/taskinstance.py:
##########
@@ -1760,16 +1758,16 @@ def _handle_reschedule(
     ti.end_date = timezone.utcnow()
     ti.set_duration()
 
-    # Lock DAG run to be sure not to get into a deadlock situation when trying to insert
-    # TaskReschedule which apparently also creates lock on corresponding DagRun entity
-    with_row_locks(
-        session.query(DagRun).filter_by(
-            dag_id=ti.dag_id,
-            run_id=ti.run_id,
-        ),
-        session=session,
-    ).one()
-    # Log reschedule request
+    # set state
+    ti.state = TaskInstanceState.UP_FOR_RESCHEDULE
+
+    ti.clear_next_method_args()
+
+    session.merge(ti)
+    session.commit()

Review Comment:
   Yeah I thought about that.  To me it seems a good tradeoff.  Let's discuss.
   
   First, the probability would seem to be quite low, because we're talking about a very small time window.
   
   ```
   1. session.commit
   2. session.add
   3. session.commit
   ```
   The window is between steps 1 and 3.
   
   If step 2 is stuck in a lock wait, that window could become less insignificant (i.e. the failure more frequent).
   
   But I think either way we arrive at a better outcome.
   
   Why?
   
   Because let's imagine the worker blows up between steps 1 and 3. What is the consequence?
   
   I believe the consequence would be that no TR record is inserted although the state is set properly.
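
   To make that crash window concrete, here is a minimal sketch of the two-commit sequence against an in-memory SQLite database. `TIStub` and `TRStub` are hypothetical stand-ins for `TaskInstance` and `TaskReschedule`, not Airflow's real models:

   ```python
# Sketch only: stand-in models, not Airflow's TaskInstance / TaskReschedule.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class TIStub(Base):  # hypothetical stand-in for TaskInstance
    __tablename__ = "ti_stub"
    id = Column(Integer, primary_key=True)
    state = Column(String)

class TRStub(Base):  # hypothetical stand-in for TaskReschedule
    __tablename__ = "tr_stub"
    id = Column(Integer, primary_key=True)
    ti_id = Column(Integer)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(TIStub(id=1, state="running"))
    session.commit()

    # step 1: commit the state change in its own transaction
    ti = session.get(TIStub, 1)
    ti.state = "up_for_reschedule"
    session.commit()

    # -- if the worker dies here, the state is persisted
    #    but no TR row exists yet --

    # steps 2-3: insert the reschedule record in a second transaction
    session.add(TRStub(ti_id=1))
    session.commit()
   ```

   The point is that a crash between the two commits leaves a consistent, recoverable database state rather than an unreported task death.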
   
   That scenario looks to be handled in the relevant dep: https://github.com/apache/airflow/blob/85fd0e1102d664337d6bb08e590979867372f61d/airflow/ti_deps/deps/ready_to_reschedule.py#L85
   
   So the sensor would probably be resumed more quickly than it expected.
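
   A hedged sketch of the behavior the linked dep implements (simplified, not Airflow's actual `ReadyToRescheduleDep`; the function name and parameters here are hypothetical): when the task is up_for_reschedule but no TaskReschedule record exists, it is treated as ready immediately.

   ```python
from datetime import datetime, timedelta
from typing import Optional

def ready_to_reschedule(state: str,
                        last_reschedule_end: Optional[datetime],
                        interval: timedelta,
                        now: datetime) -> bool:
    """Simplified sketch of the dep check, not the real implementation."""
    if state != "up_for_reschedule":
        return True  # the dep only constrains rescheduled sensors
    if last_reschedule_end is None:
        # No TR record was written (e.g. the worker died between the two
        # commits): resume the sensor right away instead of waiting.
        return True
    # Otherwise wait until the reschedule interval has elapsed.
    return now >= last_reschedule_end + interval
   ```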
   
   But with the old code, what happens is actually worse. If we imagine the worker blows up at that same moment, the task does not report its state correctly: it is just an OOM-style task death, which becomes a zombie detected by the scheduler, and the sensor would not resume poking until after retries, if configured.
   
   So committing there seems better to me anyway, if you assume the worker blows up immediately after.
   
   And committing there means no deadlock concern, and it means we don't need to hold the DagRun row with a SELECT FOR UPDATE, which holds the lock for significantly longer; if you have lots of other sensors in the same dag, that can cause trouble.
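
   For illustration, assuming `with_row_locks` ultimately emits a SELECT ... FOR UPDATE (the `DagRunStub` model below is a hypothetical stand-in, not Airflow's `DagRun`), the lock the removed code took can be seen in the compiled SQL:

   ```python
from sqlalchemy import Column, Integer, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DagRunStub(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "dag_run_stub"
    id = Column(Integer, primary_key=True)

# Compile a row-locking select against the Postgres dialect.
stmt = select(DagRunStub).with_for_update()
sql = str(stmt.compile(dialect=postgresql.dialect()))
# The compiled statement carries a FOR UPDATE clause: the row stays
# locked until the transaction ends, and any other transaction touching
# the same DagRun row has to wait on it.
   ```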
   