Re: Make first dag run optional when catchup is False

Collin McNulty Sun, 20 Mar 2022 14:15:14 -0700

While that’s true, I think there are often stakeholders that expect a DAG
to run only on the day for which it is scheduled. It’s pretty
straightforward for me to explain to non-technical stakeholders that “aw
shucks we deployed just a little too late for this week’s run, we’ll run it
manually to fix it”. By contrast, explaining why a DAG that I said
would run on Tuesdays sent out an alert on a Friday to a VP of Finance is …
rough. I understand that Airflow does not make guarantees about when tasks
will execute, but I try to scale such that when a task can start and when
it does start are close enough to not have to explain the difference to
other stakeholders.


Editing start_date can also be tough in some conditions. If I’m baking a
DAG into an image, using build-once-deploy-to-many CI/CD, and testing in a
lower environment for longer than the interval between runs, I’m toast on
setting the start_date to avoid what I consider a spurious run. That’s a
lot of “ands” but I think it’s a fairly common set of circumstances we
should support.
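
A minimal sketch of one lighter-weight workaround, assuming the deploy step
can set an environment variable per environment. The variable name
DAG_START_DATE and the helper resolve_start_date are hypothetical, not
Airflow features:

```python
import os
from datetime import datetime, timezone

# Fallback used when the (hypothetical) variable is not set.
DEFAULT_START = datetime(2022, 1, 1, tzinfo=timezone.utc)


def resolve_start_date(env=os.environ, default=DEFAULT_START):
    """Return the start_date configured for this environment.

    Reads an ISO-8601 date or datetime from DAG_START_DATE, a made-up
    variable that a deploy step would set once per environment.
    """
    raw = env.get("DAG_START_DATE")
    if not raw:
        return default
    parsed = datetime.fromisoformat(raw)
    # Treat values without an offset as UTC so the result is always aware.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

The result could be passed as start_date when constructing the DAG, so the
same image works in every environment. The value is fixed per deployment
rather than computed from the current time, so it does not drift between
parses, but someone still has to pick the date for each environment.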

Collin McNulty



On Sun, Mar 20, 2022 at 3:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Good. Love some mental stretching :).
>
> I believe you should **not** base the time of your run on the time it is
> released. Should not the DAG author know when there is a "start date"
> planned for the DAG? Should the decision on when the DAG interval starts be
> made on a combination of both the start date in the DAG **and** the time,
> not only of when it's merged, but of when Airflow first **parses** the DAG?
> Not to mention the time zone issues.
>
> Imagine the case where your DAG is merged 5 minutes before midnight Mon/Tue
> and you have many DAGs. So many that parsing all the DAGs can take 20
> minutes. Then whether your DAG runs for this interval or that one depends
> not only on when it is merged but also on how long it takes
> Airflow to parse your DAG for the first time.
>
> Sounds pretty crazy :).
>
> J.
>
>
> On Sun, Mar 20, 2022 at 9:02 PM Collin McNulty
> <col...@astronomer.io.invalid> wrote:
>
>> Jarek,
>>
>> I tend to agree with you on this, but let me play devil’s advocate. If I
>> have a DAG that runs a report every Tuesday, I might want it to run every
>> Tuesday starting whenever I am able to release the DAG. But if I release on
>> a Friday, I don’t want it to try to run “for” last Tuesday. In this case,
>> the correct start_date for the dag is the day I release the DAG, but I
>> don’t know this date ahead of time and it differs per environment. Doing
>> this properly seems doable with a CD process that edits the DAG to insert
>> the start_date, but that’s fairly sophisticated tooling for a scenario that
>> I imagine is quite common.
>>
>> Collin McNulty
>>
>> On Sun, Mar 20, 2022 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Once again - why is it bad to set a start_date in the future, when -
>>> well - you **actually** want to run the first interval in the future?
>>> What prevents you from setting the start_date to a fixed time in
>>> the future, where the start date is within the interval you want to
>>> start first? Is it just "I would rather conveniently specify
>>> whatever past date is easy to type"?
>>> If this is the only reason, then it has a big drawback, because
>>> "start_date" is **actually** supposed to be the piece of metadata for
>>> the DAG that tells you what the DAG writer intended about
>>> when to start it. And it is precisely the one that allows you to start
>>> things in the future.
>>>
>>> Am I missing something?
>>>
>>> On Sun, Mar 20, 2022 at 7:42 PM Larry Komenda
>>> <avoicelikerunningwa...@gmail.com> wrote:
>>> >
>>> > Alex, that's a good point regarding the need to run a DAG for the most
>>> recent schedule interval right away. I hadn't thought of that scenario as I
>>> haven't needed to build a DAG with that large of a scheduling gap. In that
>>> case I agree with you - it seems like it would make more sense to make this
>>> configurable.
>>> >
>>> > Perhaps there could be an additional DAG-level parameter that could be
>>> set alongside "catchup" to control this behavior. Or there could be a new
>>> parameter that could eventually replace "catchup" that supported 3 options
>>> - "catchup", "run most recent interval only", and "run next interval only".
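
To make the three options concrete, here is a rough sketch of which data
interval would run first under each one. Everything in it is hypothetical:
the FirstRunPolicy enum and the first_interval_start helper are not Airflow
APIs, intervals are assumed to be fixed-length and aligned to start_date,
and "run next interval only" is read as "first run the interval that is
still in progress when the DAG is enabled", which is only one possible
interpretation.

```python
from datetime import datetime, timedelta
from enum import Enum


class FirstRunPolicy(Enum):
    """Hypothetical replacement for the boolean catchup flag."""
    CATCHUP = "catchup"
    MOST_RECENT_ONLY = "run most recent interval only"
    NEXT_ONLY = "run next interval only"


def first_interval_start(policy, start_date, enabled_at, interval):
    """Return the start of the first data interval that would be run."""
    # Number of whole intervals that have already ended when the DAG is enabled.
    elapsed = max(enabled_at - start_date, timedelta(0))
    complete = elapsed // interval

    if policy is FirstRunPolicy.CATCHUP:
        # Backfill everything, beginning with the very first interval.
        return start_date
    if policy is FirstRunPolicy.MOST_RECENT_ONLY:
        # Run the last interval that has already ended (if there is one).
        return start_date + max(complete - 1, 0) * interval
    # NEXT_ONLY: skip finished intervals and wait for the one in progress.
    return start_date + complete * interval


# A weekly Tuesday report, enabled on Friday 2022-03-18.
start = datetime(2022, 1, 4)          # a Tuesday
enabled = datetime(2022, 3, 18, 12)   # a Friday
week = timedelta(days=7)
most_recent = first_interval_start(FirstRunPolicy.MOST_RECENT_ONLY, start, enabled, week)
next_only = first_interval_start(FirstRunPolicy.NEXT_ONLY, start, enabled, week)
```

In this example most_recent is Tuesday 2022-03-08 (an interval that has
already ended, so the run would fire as soon as the DAG is enabled), while
next_only is Tuesday 2022-03-15 (an interval that ends, and so would run,
on 2022-03-22).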
>>> >
>>> > On Sat, Mar 19, 2022 at 1:02 PM Alex Begg <alex.b...@gmail.com> wrote:
>>> >>
>>> >> I would not consider it a bug to have the latest data interval run
>>> when you enable a DAG that is set to catchup=False.
>>> >>
>>> >> I have legitimate use for that feature by having my production
>>> environment have catchup_by_default=True but my lower environments are
>>> using catchup_by_default=False, meaning if I want to test the DAG behavior
>>> as scheduled in a lower environment I can just enable the DAG.
>>> >>
>>> >> For example, in a staging environment, if I need to test out the
>>> functionality of a DAG that is scheduled @monthly and there is no way
>>> to test the most recent data interval, then it could be many days, even
>>> weeks, until a true data interval of the DAG occurs.
>>> >>
>>> >> Triggering a DAG won’t run the latest data interval, it will use the
>>> current time as the logical_date, right? So that won’t let me test a
>>> single as-scheduled data interval. So in that @monthly scenario it will be
>>> impossible for me to test the functionality of a single data interval
>>> unless I wait multiple weeks.
>>> >>
>>> >> I see there could be a desire to not run the latest data interval and
>>> just start with whatever full interval follows the DAG being turned on.
>>> However I think that should be configurable, not fixed permanently.
>>> >>
>>> >> Alternatively it could be ideal to have a way to trigger a specific
>>> run for a catchup=False DAG that just got enabled by adding a 3rd option to
>>> the trigger button drop down to trigger a past scheduled run. Then in that
>>> dialog the form can default to the most recent full data interval but then
>>> let you also specify a specific past interval based on the DAG's schedule.
>>> I often had to debug a DAG in production and I wanted to trigger a specific
>>> past data interval, not just the most recent.
>>> >>
>>> >> Alex Begg
>>> >>
>>> >> On Thu, Mar 17, 2022 at 4:58 PM Larry Komenda <
>>> avoicelikerunningwa...@gmail.com> wrote:
>>> >>>
>>> >>> I agree with this. I'd much rather have to trigger a single manual
>>> run the first time I enable a DAG than either wait to enable it until after
>>> I want it to run or edit the start_date of the DAG itself.
>>> >>>
>>> >>> I'd be in favor of adjusting this behavior either permanently or by
>>> a configuration.
>>> >>>
>>> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe
>>> <pla...@cloudera.com.invalid> wrote:
>>> >>>>
>>> >>>> Hello Daniel,
>>> >>>>
>>> >>>> Thank you for your answer. In your example, as I experienced, the
>>> first run would not be 2010-01-01 but 2022-03-03, 00:00:00 (it is currently
>>> March 4 - 21:00 here), which is the execution date corresponding to the
>>> start of the previous data interval, but the result is the same: an
>>> undesired dag run. (For instance, in case of cron schedule '00 22 * * *',
>>> one dagrun would be started immediately with execution date of 2022-03-02,
>>> 22:00:00)
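
The two execution dates above can be reproduced with a small calculation.
This is a sketch only, assuming a once-a-day schedule and naive datetimes
(no time zone handling); latest_complete_daily_interval is a made-up helper,
not part of Airflow:

```python
from datetime import datetime, time, timedelta


def latest_complete_daily_interval(now, run_at):
    """Return (start, end) of the most recent daily interval ending by `now`.

    `run_at` is the time of day the schedule fires, e.g. time(22, 0) for
    the cron expression '00 22 * * *'.
    """
    end = datetime.combine(now.date(), run_at)
    if end > now:
        # Today's tick has not happened yet, so use yesterday's.
        end -= timedelta(days=1)
    return end - timedelta(days=1), end


now = datetime(2022, 3, 4, 21, 0)  # "March 4 - 21:00"

# @daily (midnight): execution date 2022-03-03 00:00:00
daily_start, daily_end = latest_complete_daily_interval(now, time(0, 0))

# '00 22 * * *': execution date 2022-03-02 22:00:00
cron_start, cron_end = latest_complete_daily_interval(now, time(22, 0))
```

In both cases the interval has already ended by the time the DAG is enabled,
which is why a run starts immediately even though catchup is False.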
>>> >>>>
>>> >>>> I also agree with you that it could be categorized as a bug and I
>>> would also vote for a fix.
>>> >>>>
>>> >>>> Would be great to have the feedback of others on this.
>>> >>>>
>>> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish
>>> <daniel.stand...@astronomer.io.invalid> wrote:
>>> >>>>>
>>> >>>>> You are saying, when you turn on for the first time a dag with
>>> e.g. @daily schedule, and catchup = False, if start date is 2010-01-01,
>>> then it would run first the 2010-01-01 run, then the current run (whatever
>>> yesterday is)?  That sounds familiar.
>>> >>>>>
>>> >>>>> Yeah I don't like that behavior.  I agree that, as you say, it's
>>> not the intuitive behavior.  Seems it could reasonably be categorized as a
>>> bug.  I'd prefer we just "fix" it rather than making it configurable.  But
>>> some might have concerns re backcompat.
>>> >>>>>
>>> >>>>> What do others think?
>>> >>>>>
>>> >>>>>
>>>
>> --
>>
>> Collin McNulty
>> Lead Airflow Engineer
>>
>> Email: col...@astronomer.io <john....@astronomer.io>
>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>
>>
>> <https://www.astronomer.io/>
>>
