Well, my email is not about a single PR or a follow-up on that PR. It is
referring to an issue in the core logic that results in a DAG hash change.

Jigar

> On Aug 6, 2025, at 12:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> Yep. Thanks for the heads up.
> 
> We saw both the PR and the issue, and it is scheduled for 3.0.5 - it did not
> make it into 3.0.4. I think it would be good if you confirmed in your PR that
> you applied the patch and showed some evidence of what happened - before and
> after, not only a "word" explanation - words and textual descriptions are
> often open to interpretation - but if you show what happens before and after
> you apply the patch, that would make it much easier to confirm that it
> works as expected.
> 
> And then just patiently post reminders in your PR if things are not reviewed -
> as with other PRs and fixes.
> 
> Also, just for your information (and as an educational message for anyone
> reading along): we should avoid sending such single-PR,
> relatively obscure, issue-related messages to the devlist.
> 
> We try to reserve devlist communication for important information that
> affects Airflow decisions, all contributors, and the way we do development,
> as well as discussions about the future of our development and about
> important features.
> We rarely (if at all) use it to discuss individual bug fixes and PRs
> (unless those are absolutely critical fixes that need to be addressed
> immediately), because it adds a lot of noise to our inboxes. Devlist
> discussions are the ones we should really focus on - most people in the
> community should read and at least think about the things we post on the
> devlist - so posting about a single bug and PR adds a lot of cognitive
> overload for everyone. It's better to keep such messages in the PRs and
> issues on GitHub.
> 
> Empathy towards all the people in the community is an important part of
> playing "well" in the community, so I hope we all understand and follow
> that.
> 
> J.
> 
> 
> 
> 
>> On Wed, Aug 6, 2025 at 8:53 AM Jigar Parekh <ji...@vizeit.com> wrote:
>> 
>> I have been looking into Airflow metadata database level bottlenecks. In
>> my analysis so far, I observed that a change of the DAG hash at run time, for
>> any reason, has a significant negative impact on the database because it
>> blocks DAG run updates for the last scheduling decision, resulting in higher
>> lock waits and, in many instances, lock wait timeouts. I recently opened
>> issue #53957 showing one instance where the DAG hash changes just because the
>> template field order is different, and I also suggested a fix with PR #54041.
>> Troubleshooting the lock waits further, I have come across a scenario that is
>> rare but results in an unnecessary DAG hash change. This, in my opinion,
>> needs community experts' attention and review. The details are below.
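>>
>> For anyone who wants to observe this on their own deployment, here is a
>> rough sketch of how one might watch the stored hash via SerializedDagModel
>> (attribute names taken from 2.10.x; they may differ in other versions):
>>
>> # Sketch: read the stored serialized-DAG hash so it can be compared over time.
>> from typing import Optional
>>
>> from airflow.models.serialized_dag import SerializedDagModel
>> from airflow.utils.session import create_session
>>
>>
>> def current_dag_hash(dag_id: str) -> Optional[str]:
>>     with create_session() as session:
>>         row = (
>>             session.query(SerializedDagModel)
>>             .filter(SerializedDagModel.dag_id == dag_id)
>>             .one_or_none()
>>         )
>>         return row.dag_hash if row else None
>>
>> # Record the hash before triggering the run and again after the zombie task
>> # failure; a different value means the scheduler re-serialized the DAG at run time.
>>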
>> Airflow version: 2.x (and also 3.x, based on the code)
>> Airflow config:
>> Executor: Kubernetes
>> AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: False
>> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK: 0
>> AIRFLOW__CORE__PARALLELISM: 250
>> DAG: any DAG with DAG-level Params and tasks that have multiple retries and a
>> retry callback (a minimal example is sketched below)
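>>
>> For reference, a minimal DAG along these lines might look like the sketch
>> below (the dag_id, param name, and callback body are illustrative
>> placeholders, not taken from the affected environment):
>>
>> from datetime import datetime, timedelta
>>
>> from airflow import DAG
>> from airflow.models.param import Param
>> from airflow.operators.python import PythonOperator
>>
>>
>> def _on_retry(context):
>>     # Illustrative retry callback; the real DAG would do cleanup or notification here.
>>     print(f"retrying {context['task_instance'].task_id}")
>>
>>
>> def _work(**context):
>>     # Long-running task; killing its executor pod mid-run produces the zombie in step 2.
>>     print(context["params"]["target"])
>>
>>
>> with DAG(
>>     dag_id="param_hash_repro",  # hypothetical name
>>     start_date=datetime(2025, 1, 1),
>>     schedule=None,  # triggered manually, overriding the param default via conf
>>     params={"target": Param("default_value", type="string")},
>> ):
>>     PythonOperator(
>>         task_id="work",
>>         python_callable=_work,
>>         retries=3,
>>         retry_delay=timedelta(minutes=1),
>>         on_retry_callback=_on_retry,
>>     )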
>> 
>> Steps:
>> 1. Trigger the DAG, overriding the param default value
>> 2. Create a zombie task in the run, e.g. remove the executor pod while the
>> task is running
>> 3. Observe the scheduler log (enable debug if possible) and the serialized_dag
>> table: the dag_hash is updated with a new value. If you compare with the old
>> serialized value in the data column, you will see that the difference is that
>> the new serialized value now contains the param values from the run that had
>> the zombie task failure
>> 4. This results in an additional DAG run update statement alongside the last
>> scheduling decision update statement, and it takes longer to execute when you
>> have multiple tasks executing simultaneously. This multiplies further if a DAG
>> has multiple zombie task failures at the same time from different runs with
>> different Param values.
>>
>> Code analysis: (I have looked at the code for tag 2.10.5 because I am using
>> that version in production, but the latest code appears to be similar in
>> logic.) Based on the code analysis, I see that the DAG processor in the
>> scheduler executes callbacks before serialization of the DAG: processor.py ->
>> process_file calls taskinstance.py -> handle_failure, which ends up calling
>> get_template_context, whose process_params call updates the param values to
>> the values from the DAG run conf. This causes the param default values to
>> change in the serialized DAG and, in turn, the DAG hash value to change.
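>>
>> To illustrate the effect (a simplified sketch of the behavior described
>> above, not the actual Airflow code), the merge performed on the params
>> amounts to roughly this:
>>
>> # Conceptual illustration only: the run's conf overrides the DAG-level param
>> # defaults on the in-memory DAG object while the failure callback builds its
>> # template context.
>> dag_level_params = {"target": "default_value"}   # what dag.params holds after parsing
>> dag_run_conf = {"target": "value_from_trigger"}  # conf supplied when the run was triggered
>>
>> # handle_failure -> get_template_context -> process_params effectively applies:
>> dag_level_params.update(dag_run_conf)
>>
>> # If the DAG processor then re-serializes the DAG (it serializes after running
>> # callbacks), the stored params carry the run-specific values and the computed
>> # dag_hash no longer matches the previously stored one.
>>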
>> It appears that handle_failure is called in other scenarios where updating
>> the param values to the ones from the DAG run conf may be required, but in
>> this scenario it does not seem to be required. So far I have been unable to
>> find a way to resolve this problem.
>> I hope this information helps to understand the problem.
>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
