Hey there, long time reader, first time poster here :)

*tl;dr:*

*As part of the 3.0 release, I would like to propose changing the default
for `catchup_by_default` from True to False.*

*I'm asking for input on the change, and on whether it can go through lazy
consensus or should be a vote.*

Time things are hard. Especially for new Airflow users. When I first started
using Airflow, it took me a while (and one or two napkin sketches) to
understand how to set the start_date and trigger the DAG runs I wanted. To
this day, I still often pick a date a couple of days in the past and set
catchup to False so I don’t have to do the math on schedules that aren’t
straightforward.

As part of the Astronomer DevRel team, I teach users about Airflow. This
“gotcha” is especially common for new users to run into. Imagine that
you’re a new user writing a DAG with a start date of Jan 1st. You unpause
your DAG, and you unexpectedly see a large number of DAG runs kicking off.
When we talk to practitioners in Airflow 101 webinars, many share that they
have accidentally flooded their Airflow deployment because they didn’t
understand the relationship between the start_date and DAG runs, didn’t
know about catchup, or forgot to add the line when writing new DAGs.
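
To make the scale of the surprise concrete, here is a rough sketch in plain Python (no Airflow import). It is a simplified model, not the scheduler's actual logic: `pending_runs` is a hypothetical helper, the year 2025 and the unpause date of Mar 1st are assumptions for illustration, and real Airflow schedules go through timetables and data intervals.

```python
from datetime import datetime, timedelta


def pending_runs(start_date, now, interval, catchup):
    """Return the interval starts a simplified scheduler would run on unpause.

    Hypothetical helper for illustration only; not an Airflow API.
    """
    runs = []
    current = start_date
    # A run is only created once its whole interval has elapsed.
    while current + interval <= now:
        runs.append(current)
        current += interval
    if not catchup:
        # Without catchup, only the most recent completed interval is run.
        return runs[-1:]
    return runs


start = datetime(2025, 1, 1)   # assumed year for "Jan 1st"
now = datetime(2025, 3, 1)     # assumed moment the DAG is unpaused
daily = timedelta(days=1)

print(len(pending_runs(start, now, daily, catchup=True)))   # 59
print(len(pending_runs(start, now, daily, catchup=False)))  # 1
```

With a daily schedule and only two months between the start date and the unpause, that is already 59 runs versus 1.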

This is why I propose changing the config catchup_by_default from True to
False.

Pro:

   - Fewer accidental DAG runs from beginners and from people who forget
     catchup=False. This is especially confusing for beginners.
   - One parameter fewer for beginners to learn when they write their first
     DAG, and one line fewer to write for most DAGs in the future.


Con:

   - This is a breaking change, but a minor one: since it is a config value,
     users who want the old behavior can easily set it back. We can add
     something to the config linter to highlight this change and prompt
     users to set the value back to True if they prefer the current
     behaviour.



Elad pointed out that there has been previous discussion on this including:


   - The suggestion to move away from a binary option to an enum, to allow
     more fine-grained control over when to catch up (only when the DAG is
     first turned on, only when the DAG is not first turned on, always,
     never…): #35392
     <https://github.com/apache/airflow/pull/35392#issuecomment-1792254428>


This is a good idea, but there is more to figure out. As others have
pointed out in the PR, going this route means more configuration options.
I don’t think changing the default blocks us from going this route in the
future.

When the time comes, we could turn this into an enum. For migration
purposes and to avoid DAG code changes, we could add more options including
“always” and “never”, and map True to “always” and False to “never”. For
this feature, what we do at the global level should match what’s available
at the DAG level, meaning the DAG parameter will also need to be adjusted
accordingly. Even in this new model, defaulting to "False"/“never” is the
right way forward.
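
As a rough sketch of what that mapping could look like (purely hypothetical: `CatchupPolicy` and `normalize_catchup` are names I made up for illustration, they are not part of Airflow or of either PR):

```python
from enum import Enum


class CatchupPolicy(Enum):
    """Hypothetical enum; further options could be added later."""
    ALWAYS = "always"
    NEVER = "never"


def normalize_catchup(value):
    """Map legacy booleans and strings onto the hypothetical enum."""
    if isinstance(value, CatchupPolicy):
        return value
    if isinstance(value, bool):
        # Existing DAG code keeps working: True -> always, False -> never.
        return CatchupPolicy.ALWAYS if value else CatchupPolicy.NEVER
    # Raises ValueError for anything that is not a known option.
    return CatchupPolicy(str(value).lower())


print(normalize_catchup(True))     # CatchupPolicy.ALWAYS
print(normalize_catchup("never"))  # CatchupPolicy.NEVER
```

The same normalization could be applied to both the global config value and the DAG-level parameter, so that the two stay in sync.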


   - #38168 <https://github.com/apache/airflow/pull/38168> discussed/proposed
     the possibility of an option to disable the “catch up of the latest DAG
     run” behavior when unpausing a DAG with catchup=False.

     While it is closely related, I think this is a separate issue that
     merits its own discussion. That proposal is not about changing a
     default value; it is about fundamentally changing what catchup=False
     means. It’s a lot less alarming for users to accidentally trigger one
     DAG run because they didn’t understand catchup behaviour than to
     trigger a large number of DAG runs. The large number of runs is the
     confusing behaviour, and it is what I’m hoping to prevent with the
     default change.



I started a PR here for the most basic option for this change, just
changing the config variable from True to False:
https://github.com/apache/airflow/pull/47354

If there is general alignment I’d try for a lazy consensus, otherwise a
vote 🙂
