I agree with Jarek in the sense that the DAG developer **should** know when the DAG should start. However, in practice, for time-based scheduling, maintaining start_date can be cumbersome, especially:
- every time the job evolves and gets updated / a new version is deployed
- when developers have to maintain hundreds or thousands of independent jobs, keeping track of start_date for each of them can be difficult
Not to mention that many companies do not have state-of-the-art CI/CD processes that would allow them to dynamically change the start date. In many cases, when a change is made to a job, the developers simply want to update the job and have the next run take it into account.

I also agree with Collin and Constance that "run last interval" is a valid use case, and therefore this parameter could accept three values to handle all of these cases. However, I would suggest:

Catchup=True: run all past intervals
Catchup=False: do not run any past interval
Catchup="LastInterval" (or any better name :)): run only the most recent past interval

I know that DAG authors who relied on Catchup=False to run the last interval would need to adjust their DAGs, but if we instead added a third option meaning "do not trigger any run", then the authors who relied on catchup=False plus a past start date would also need to update their DAGs to have the proper value. And in my opinion, when I read Catchup=False, the natural way of reading it is "no catchup", therefore it would be better to fix it in that direction.

What is the next step here? Who can decide / approve such a new feature request?

Thanks,
Philippe

On Mon, Mar 21, 2022 at 4:05 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:

> I've had a variation of this debate a few times, and the behaviour you find intuitive in my opinion comes down to your background (software engineer vs data engineer vs BI developer vs data analyst), industry standards, and the scope of responsibility DAG authors have at your organization. My vote is to extend the catchup setting to either run all intervals (catchup=True today), run the most recent interval (catchup=False today), or schedule the next interval. I have seen organizations where both would be beneficial depending on the data pipeline in question.
>
> All of Alex's points are why I think we at least need the option.
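The three-valued proposal above can be sketched in plain Python. This is purely illustrative (the `runs_to_schedule` helper and the policy names are made up for this sketch, not an actual Airflow API): given the list of data-interval starts that have already fully elapsed when a DAG is enabled, each policy selects which of them get a scheduled run.

```python
from datetime import date, timedelta
from typing import List


def runs_to_schedule(past_intervals: List[date], policy: str) -> List[date]:
    """Select which already-elapsed intervals get a DagRun on enable.

    policy is one of:
      "all"  - catchup=True today: backfill every missed interval
      "last" - catchup=False today: run only the most recent interval
      "none" - the proposed third value: schedule nothing from the past
    """
    if not past_intervals:
        return []
    if policy == "all":
        return list(past_intervals)
    if policy == "last":
        return [past_intervals[-1]]
    if policy == "none":
        return []
    raise ValueError(f"unknown policy: {policy}")


# A daily DAG enabled four days after its start_date:
start = date(2022, 3, 1)
past = [start + timedelta(days=i) for i in range(4)]  # 2022-03-01 .. 2022-03-04

print(runs_to_schedule(past, "all"))   # all four missed runs
print(runs_to_schedule(past, "last"))  # only 2022-03-04
print(runs_to_schedule(past, "none"))  # no past runs at all
```

The backwards-compatibility debate in this thread is then just about which of these selections the literal `False` should map to.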
> I came from an institutional investor, and we had plenty of DAGs that ran daily, weekly, monthly, quarterly and yearly.
>
> Many financial analysts - who were not DAG authors themselves - would have access to the Airflow Webserver in order to rerun tasks. They did not have the ability to adjust the start_date. During audit season, it was common to see yearly DAGs being rerun for earlier years. Supporting this meant implementing a start_date in an earlier year. I saw DAG authors deal with this in two ways: set the start_date to the first day of the prior year to get the DAG out and let it run, then modify the start_date to something earlier; or set the start_date to something earlier, watch the DAG, and quickly update the state of the DAG runs to success (or failed). One is better than the other (no fun explaining to an executive why reports were accidentally sent externally), but neither is great. Option 3 - setting the start_date inside the intended data interval and leaving it - always caused confusion with other stakeholders.
>
> A global default, plus a DAG-level option, would have been amazing.
>
> On Sun, Mar 20, 2022 at 5:15 PM Collin McNulty <col...@astronomer.io.invalid> wrote:
>
>> While that’s true, I think there are often stakeholders who expect a DAG to run only on the day for which it is scheduled. It’s pretty straightforward for me to explain to non-technical stakeholders that “aw shucks, we deployed just a little too late for this week’s run, we’ll run it manually to fix it”. On the contrary, explaining why a DAG that I said would run on Tuesdays sent out an alert on a Friday to a VP of Finance is … rough. I understand that Airflow does not make guarantees about when tasks will execute, but I try to scale such that when a task can start and when it does start are close enough not to have to explain the difference to other stakeholders.
>>
>> Editing start_date can also be tough in some conditions.
>> If I’m baking a DAG into an image, using build-once-deploy-to-many CI/CD, and testing in a lower environment for longer than the interval between runs, I’m toast on setting the start_date to avoid what I consider a spurious run. That’s a lot of “ands”, but I think it’s a fairly common set of circumstances we should support.
>>
>> Collin McNulty
>>
>> On Sun, Mar 20, 2022 at 3:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Good. Love some mental stretching :).
>>>
>>> I believe you should **not** base the time of your run on the time it is released. Shouldn't the DAG author know when there is a "start date" planned for the DAG? Should the decision on when the DAG's intervals start really be made based on a combination of the start date in the DAG **and** not only the time when it's merged, but actually the time when Airflow first **parses** the DAG? Not even mentioning the time zone issues.
>>>
>>> Imagine the case where a DAG is merged 5 minutes before the Mon/Tue midnight and you have many DAGs. So many that parsing all the DAGs can take 20 minutes. Then whether your DAG runs for this interval or that one depends not only on the decision of when it is merged but also on how long it takes Airflow to get to parse your DAG for the first time.
>>>
>>> Sounds pretty crazy :).
>>>
>>> J.
>>>
>>> On Sun, Mar 20, 2022 at 9:02 PM Collin McNulty <col...@astronomer.io.invalid> wrote:
>>>
>>>> Jarek,
>>>>
>>>> I tend to agree with you on this, but let me play devil’s advocate. If I have a DAG that runs a report every Tuesday, I might want it to run every Tuesday starting whenever I am able to release the DAG. But if I release on a Friday, I don’t want it to try to run “for” last Tuesday. In this case, the correct start_date for the DAG is the day I release it, but I don’t know this date ahead of time and it differs per environment.
>>>> Doing this properly seems doable with a CD process that edits the DAG to insert the start_date, but that’s fairly sophisticated tooling for a scenario that I imagine is quite common.
>>>>
>>>> Collin McNulty
>>>>
>>>> On Sun, Mar 20, 2022 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Once again - why is it bad to set a start_date in the future, when - well - you **actually** want to run the first interval in the future? What prevents you from setting the start_date to a fixed time in the future, where the start date is within the interval you want to run first? Is it just "I do not want to specify conveniently whatever past date will be easy to type"?
>>>>> If this is the only reason, then it has a big drawback - because "start_date" is **actually** supposed to be the piece of metadata for the DAG that tells you what the DAG writer's intention was on when to start it. And precisely the one that allows you to start things in the future.
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> On Sun, Mar 20, 2022 at 7:42 PM Larry Komenda <avoicelikerunningwa...@gmail.com> wrote:
>>>>> >
>>>>> > Alex, that's a good point regarding the need to run a DAG for the most recent schedule interval right away. I hadn't thought of that scenario, as I haven't needed to build a DAG with that large of a scheduling gap. In that case I agree with you - it seems like it would make more sense to make this configurable.
>>>>> >
>>>>> > Perhaps there could be an additional DAG-level parameter that could be set alongside "catchup" to control this behavior. Or there could be a new parameter that could eventually replace "catchup" and support 3 options - "catchup", "run most recent interval only", and "run next interval only".
>>>>> >
>>>>> > On Sat, Mar 19, 2022 at 1:02 PM Alex Begg <alex.b...@gmail.com> wrote:
>>>>> >>
>>>>> >> I would not consider it a bug to have the latest data interval run when you enable a DAG that is set to catchup=False.
>>>>> >>
>>>>> >> I have a legitimate use for that feature: my production environment has catchup_by_default=True but my lower environments use catchup_by_default=False, meaning that if I want to test the DAG behavior as scheduled in a lower environment I can just enable the DAG.
>>>>> >>
>>>>> >> For example, in a staging environment, if I need to test the functionality of a DAG scheduled @monthly and there were no way to run the most recent data interval, then it could be many days, even weeks, until a true data interval of the DAG occurs.
>>>>> >>
>>>>> >> Triggering a DAG won’t run the latest data interval; it will use the current time as the logical_date, right? So that won’t let me test a single as-scheduled data interval. In that @monthly scenario it would be impossible for me to test the functionality of a single data interval unless I wait multiple weeks.
>>>>> >>
>>>>> >> I see there could be a desire to not run the latest data interval and just start with whatever full interval follows the DAG being turned on. However, I think that should be configurable, not fixed permanently.
>>>>> >>
>>>>> >> Alternatively, it could be ideal to have a way to trigger a specific run for a catchup=False DAG that just got enabled, by adding a 3rd option to the trigger button dropdown to trigger a past scheduled run. Then in that dialog the form could default to the most recent full data interval but also let you specify a specific past interval based on the DAG's schedule.
>>>>> >> I often had to debug a DAG in production and wanted to trigger a specific past data interval, not just the most recent.
>>>>> >>
>>>>> >> Alex Begg
>>>>> >>
>>>>> >> On Thu, Mar 17, 2022 at 4:58 PM Larry Komenda <avoicelikerunningwa...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> I agree with this. I'd much rather have to trigger a single manual run the first time I enable a DAG than either wait to enable it until after I want it to run or edit the start_date of the DAG itself.
>>>>> >>>
>>>>> >>> I'd be in favor of adjusting this behavior either permanently or via a configuration option.
>>>>> >>>
>>>>> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe <pla...@cloudera.com.invalid> wrote:
>>>>> >>>>
>>>>> >>>> Hello Daniel,
>>>>> >>>>
>>>>> >>>> Thank you for your answer. In your example, as I experienced, the first run would not be 2010-01-01 but 2022-03-03 00:00:00 (it is currently March 4, 21:00 here), which is the execution date corresponding to the start of the previous data interval - but the result is the same: an undesired DAG run. (For instance, with the cron schedule '00 22 * * *', one DAG run would be started immediately with an execution date of 2022-03-02 22:00:00.)
>>>>> >>>>
>>>>> >>>> I also agree with you that it could be categorized as a bug and I would also vote for a fix.
>>>>> >>>>
>>>>> >>>> Would be great to have the feedback of others on this.
>>>>> >>>>
>>>>> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish <daniel.stand...@astronomer.io.invalid> wrote:
>>>>> >>>>>
>>>>> >>>>> You are saying that when you turn on, for the first time, a DAG with e.g. an @daily schedule and catchup=False, if the start date is 2010-01-01, then it would first run the 2010-01-01 run, then the current run (whatever yesterday is)? That sounds familiar.
>>>>> >>>>>
>>>>> >>>>> Yeah, I don't like that behavior.
>>>>> >>>>> I agree that, as you say, it's not the intuitive behavior. Seems it could reasonably be categorized as a bug. I'd prefer we just "fix" it rather than making it configurable. But some might have concerns re backcompat.
>>>>> >>>>>
>>>>> >>>>> What do others think?
>>>>
>>>> --
>>>>
>>>> Collin McNulty
>>>> Lead Airflow Engineer
>>>>
>>>> Email: col...@astronomer.io
>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>
>>>> <https://www.astronomer.io/>
>>>>
>> --
>>
>> Collin McNulty
>> Lead Airflow Engineer
>>
>> Email: col...@astronomer.io
>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>
>> <https://www.astronomer.io/>
>>
> --
>
> Constance Martineau
> Product Manager
>
> Email: consta...@astronomer.io
> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>
> <https://www.astronomer.io/>
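A footnote on Philippe Lanoe's cron example earlier in the thread ('00 22 * * *', DAG enabled on 2022-03-04 around 21:00): the run that fires immediately under today's catchup=False carries the execution date of the start of the most recently *completed* data interval. That arithmetic can be checked with a small sketch; it hard-codes a daily 22:00 schedule with plain datetimes instead of parsing cron, so it is an illustration of the interval logic, not Airflow's actual timetable code.

```python
from datetime import datetime, timedelta


def last_completed_interval_start(now: datetime, hour: int = 22) -> datetime:
    """For a daily schedule firing at `hour`:00, return the start of the
    most recent data interval that has already fully elapsed at `now`."""
    # Most recent schedule tick at or before `now`:
    tick = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if tick > now:
        tick -= timedelta(days=1)
    # The interval *ending* at that tick started one period earlier,
    # and its start is the run's execution/logical date.
    return tick - timedelta(days=1)


# Enabling the DAG on 2022-03-04 at 21:00 with schedule '00 22 * * *':
now = datetime(2022, 3, 4, 21, 0)
print(last_completed_interval_start(now))  # 2022-03-02 22:00:00
```

This matches the behavior Philippe observed: the 2022-03-02/2022-03-03 interval is the last one fully in the past, so that is the run the scheduler creates immediately, which is exactly the run the "do not run any past interval" option would suppress.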