I agree with Collin; he said it a lot better than what I was in the process of writing. Execution Date -> Data Intervals was an improvement, but even with this change, it's still difficult to understand. I think doing different things depending on whether the start_date exists or not will add to that complexity. I do like the idea of it being optional, and I think if you were to go that route, the plan you proposed is the right one, but I prefer the explicitness of the args in this case.
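To make the "still difficult to understand" point concrete, here is a minimal sketch in plain Python of the end-of-interval behaviour Daniel describes. It is a simplified model, not Airflow's timetable code: it imports nothing from Airflow, the helper names are made up for illustration, and it assumes the first data interval starts at the first midnight (or first of the month) at or after start_date, with the run firing once that interval ends.

```python
from datetime import datetime, timedelta


def next_midnight(ts: datetime) -> datetime:
    """First midnight at or after ts (hypothetical helper)."""
    floor = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return floor if floor == ts else floor + timedelta(days=1)


def next_month_start(ts: datetime) -> datetime:
    """First 'day 1, 00:00' at or after ts (hypothetical helper)."""
    floor = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if floor == ts:
        return floor
    if floor.month == 12:
        return floor.replace(year=floor.year + 1, month=1)
    return floor.replace(month=floor.month + 1)


def first_run_daily(start: datetime) -> datetime:
    """The run fires when the first full daily interval ends."""
    interval_start = next_midnight(start)
    return interval_start + timedelta(days=1)


def first_run_monthly(start: datetime) -> datetime:
    """The run fires when the first full monthly interval ends."""
    interval_start = next_month_start(start)
    # Step one day into the interval, then find the next month boundary.
    return next_month_start(interval_start + timedelta(days=1))


# Assumed scenario: start_date defaults to the moment of deployment.
deployed = datetime(2022, 3, 22, 15, 11)
print(first_run_daily(deployed))    # 2022-03-24 00:00:00
print(first_run_monthly(deployed))  # 2022-05-01 00:00:00
```

Under these assumptions, a daily DAG deployed on the afternoon of March 22 first runs at 00:00 on March 24, and a monthly DAG first runs on May 1, which is more than a month after deployment.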
What is making me uncomfortable is that DAG authors do not necessarily have a software engineering background, nor are all of them Airflow experts. Also, lots of people who are not DAG authors interact with Airflow directly and indirectly, and they also need to be able to easily reason about when DAGs are supposed to run, and what period that run encompasses. I'm picturing a support engineer getting a PagerDuty alert for a missed SLA on a new report, and trying to reason this behaviour out from the DAG code.

On Tue, Mar 22, 2022 at 3:11 PM Collin McNulty <col...@astronomer.io.invalid> wrote:

> I like the idea of supporting start_date=None, but that absolutely should
> not mean that we interpret start_date as “now”. start_date=now is one of
> the most common ways to shoot yourself in the foot writing DAGs. I think
> interpreting start_date=None as “don’t do any sort of catchup and run the
> next time you’re able” makes some amount of sense, but I like Philippe’s
> idea a little more. Specifically, it seems like bool is simply not a
> correct type for catchup, as we can describe at least 3 behaviors that make
> sense. What if we change the default type to string, and support bool as a
> legacy at least until 3.0?
>
> Catchup="all" (or True): run all intervals. Make "all" the default.
> Catchup="none" : do not run any past interval
> Catchup="last" (or False) run only the most recent interval
>
> On Tue, Mar 22, 2022 at 1:15 PM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
>> There's some wiggliness here because of Airflow's behavior of actually
>> *running* the dag at the end of the interval rather than the start. So
>> if we have start_date=None, then we default the start date to *now,* then
>> maybe to be consistent, the first run needs to be not 00:00 tomorrow but
>> 00:00 the next day. The oddness is amplified when you consider a monthly
>> dag, where if you deploy now, start date is now, first schedulable run is
>> next month, therefore first run _more_ than a month away. To fix this I
>> think we need to add support in our timetables for running at the start of
>> the interval instead of the end -- and I think this is something that
>> timetables were introduced to support anyway.

-- 
Constance Martineau
Product Manager
Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
<https://www.astronomer.io/>