Well, my email is not about the single PR or a follow-up on that PR. It refers to an issue in the core logic that results in a DAG hash change.
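To make the impact concrete: the stored DAG hash is essentially a digest of the serialized DAG, so any field that drifts between serializations - including a param default picked up from a run's conf - flips the hash and forces a re-serialization, with the extra DAG run updates described in my original message below. A minimal, self-contained illustration of the idea (not Airflow's actual serialization code, just a model of it):

    import hashlib
    import json

    def dag_hash(serialized_dag: dict) -> str:
        # Same idea as the stored dag_hash: a digest of the canonical JSON
        # form of the serialized DAG.
        return hashlib.md5(
            json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
        ).hexdigest()

    # The same DAG serialized before and after a run whose conf overrode the
    # param default ("foo" -> "bar"); only the params differ.
    before = {"dag_id": "example", "params": {"my_param": "foo"}}
    after = {"dag_id": "example", "params": {"my_param": "bar"}}

    print(dag_hash(before) == dag_hash(after))  # False -> treated as a DAG change
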
Jigar

> On Aug 6, 2025, at 12:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Yep. Thanks for the heads up.
>
> We saw both the PR and the issue, and it is scheduled for 3.0.5 - it did
> not make it into 3.0.4. I think it would be good if you confirm in your PR
> that you applied the patch and show some evidence of what happened -
> before and after, not only a "word" explanation - words and textual
> descriptions are often prone to interpretation - but if you show what
> happens before you applied the patch and after, that would make it way
> easier to confirm that it works as expected.
>
> And then just patiently remind in your PR if things are not reviewed -
> like in other PRs and fixes.
>
> Also just for your information (and anyone looking here, as an educational
> message): we should avoid sending such single-PR, relatively obscure
> issue-related messages to the devlist.
>
> We try to reserve devlist communication for important information that
> affects Airflow decisions, all contributors, the way we do development,
> discussions about the future of our development, and important feature
> discussions. We rarely (if at all) use it to discuss individual bug fixes
> and PRs (unless those are absolutely critical fixes that need to be fixed
> immediately), because it adds a lot of noise to our inboxes. Devlist
> discussions are the ones that we should really focus on - most people in
> the community should read and at least think about the things we post on
> the devlist, so posting about a single bug and PR adds a lot of cognitive
> overload for everyone. It's better to keep such messages to the PRs and
> issues in GitHub.
>
> Empathy towards all the people in the community is an important part of
> playing "well" in the community, so I hope we all understand and follow
> that.
>
> J.
>
>
>> On Wed, Aug 6, 2025 at 8:53 AM Jigar Parekh <ji...@vizeit.com> wrote:
>>
>> I have been looking into Airflow metadata database level bottlenecks. In
>> my analysis so far, I observed that a change of the DAG hash at run time,
>> for any reason, has a significant negative impact on the database because
>> it blocks DAG run updates for the last scheduling decision, resulting in
>> higher lock waits and, in many instances, lock wait timeouts. I recently
>> opened an issue, #53957, showing one instance where the DAG hash changes
>> just because the template field order is different, and I also suggested
>> a fix with a PR, #54041. Troubleshooting the lock waits further, I have
>> come across a scenario that is rare but results in an unnecessary DAG
>> hash change. This, in my opinion, needs the community experts' attention
>> and review. The details are below.
>>
>> Airflow version: 2.x (also 3.x, based on the code)
>> Airflow config:
>>   Executor: Kubernetes (k8s)
>>   AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: False
>>   AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK: 0
>>   AIRFLOW__CORE__PARALLELISM: 250
>> DAG: any DAG with DAG-level Params and multiple retries for tasks with a
>> retry callback
>>
>> Steps:
>> 1. Trigger the DAG, overriding the param default value.
>> 2. Create a zombie task in the run, e.g. remove the executor pod while
>> the task is running.
>> 3. Observe the scheduler log (enable debug logging if possible) and the
>> serialized_dag table: the DAG hash is updated with a new value. If you
>> compare with the old serialized value in the data column, you will see
>> that the difference is that the new serialized value now has param values
>> from the run that had the zombie task failure.
>> 4. This results in an additional DAG run UPDATE statement along with the
>> last scheduling UPDATE statement, and it takes longer to execute when you
>> have multiple tasks executing simultaneously. This multiplies further if
>> there are multiple zombie task failures at the same time from different
>> runs with different Param values.
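One way to observe step 3 above without diffing the serialized_dag rows by hand is to read the stored hash and the serialized params before and after the zombie failure. A rough sketch, assuming Airflow 2.x internals and an environment with access to the metadata database ("my_dag" is a placeholder dag_id):

    from airflow.models.serialized_dag import SerializedDagModel
    from airflow.utils.session import create_session

    with create_session() as session:
        # Fetch the stored serialized-DAG row (None if the DAG has not been
        # serialized yet).
        sdm = SerializedDagModel.get("my_dag", session=session)
        if sdm:
            # dag_hash is what the scheduler compares on each re-serialization;
            # the serialized "params" section is what changes in this scenario.
            print(sdm.dag_hash, sdm.last_updated)
            print(sdm.data["dag"].get("params"))

Running this before and after reproducing the zombie failure should show the hash and the params section changing together.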
>>
>> Code analysis: (I have looked at the code for tag 2.10.5 because I am
>> using that version in production, but the latest code appears to be
>> similar in logic.)
>>
>> Based on the code analysis, I see that the DAG processor in the scheduler
>> executes callbacks before serialization of the DAG in processor.py ->
>> process_file, which calls taskinstance.py -> handle_failure, which ends
>> up calling get_template_context; that in turn calls process_params,
>> updating the params values to the values from the DAG run conf. This
>> causes the param default value to change in the serialized DAG, and hence
>> a change in the DAG hash value.
>>
>> It appears that handle_failure is called in other scenarios where
>> updating params values to the ones from the DAG run conf may be required,
>> but in this scenario it does not seem to be required. So far I am unable
>> to find any way to resolve this problem.
>>
>> I hope this information helps to understand the problem.
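For anyone skimming, the interaction described in the code analysis above can be boiled down to roughly the following. This is a deliberately simplified sketch of the described side effect, not the real processor/taskinstance/param code; the function, names, and the explicit write-back are illustrative only:

    # Simplified model of the described interaction (not Airflow's actual code).

    def process_params(dag_level_params: dict, dag_run_conf: dict | None) -> dict:
        # Build the task's template-context params: DAG defaults overlaid
        # with the triggering run's conf.
        merged = dict(dag_level_params)
        merged.update(dag_run_conf or {})
        return merged

    # State held by the DAG processor while it handles the zombie's failure
    # callback and then serializes the DAG from the same in-memory objects.
    dag_params = {"my_param": "default-from-dag-file"}
    zombie_run_conf = {"my_param": "value-passed-at-trigger-time"}

    # handle_failure -> get_template_context -> process_params
    context_params = process_params(dag_params, zombie_run_conf)

    # If the merged values end up back on the object that is serialized next
    # (rather than staying confined to the task's context), the serialized
    # param "default" now carries the run-specific value and the hash flips.
    dag_params.update(context_params)
    print(dag_params)  # {'my_param': 'value-passed-at-trigger-time'}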