When there are differing opinions but seems that there is a favourable option someone (actually anyone) might call for a vote https://www.apache.org/foundation/voting.html#votes-on-code-modification
For such votes, committers have binding votes. -1 is a veto (usually needs to be justified) and kills the proposal unless the person who vetoed will change their mind. J. On Tue, Mar 22, 2022 at 4:53 PM Philippe Lanoe <pla...@cloudera.com.invalid> wrote: > I agree with Jarek in the sense that the DAG developer **should** know > when the DAG should start, however in practice for time-based scheduling it > can be cumbersome to maintain especially: > - everytime the jobs evolves and gets updated / new version > - when developers have to maintain hundreds/thousands of independent jobs, > keeping track of start_date for each of them can be difficult > > Not to mention that many companies do not have state-of-the-art CI/CD > processes which could allow them to dynamically change the start date. In > many cases when a change is made to a job, the developers simply want to > update the job and the next run will take it into account. > > I also agree with Collin and Constance that the "run last interval" is a > valid use case and therefore this parameter could accept three values to > handle all of these cases. > However I would suggest: > > Catchup=True : run all intervals > Catchup=False: do not run any past interval > Catchup="Last Interval" (or any better name :)) > > I know that the DAG authors who relied on Catchup=False to run the last > interval will need to adjust their DAG but if added a third option not to > trigger any run then the DAG authors who relied on catchup=False + set > start date will also need to update their DAG to have the proper value. And > in my opinion when I read Catchup=False the natural way of reading it is > "no catchup", therefore it would be better to fix it in the right direction. > > What is the next step here? Who can decide / approve such a new feature > request? > > Thanks, > Philippe > > On Mon, Mar 21, 2022 at 4:05 PM Constance Martineau > <consta...@astronomer.io.invalid> wrote: > >> I've had a variation of this debate a few times, and the behaviour you >> find intuitive in my opinion comes down to your background (software >> engineer vs data engineer vs BI developer vs data analyst), industry >> standards, and the scope of responsibility DAG authors have at your >> organization. My vote is to extend the catchup setting to either run all >> intervals (catchup=True today), run the most recent interval (catchup=False >> today) or schedule the next interval. I have seen organizations where both >> would be beneficial depending on the data pipeline in question. >> >> All of Alex's points are why I think we at least need the option. >> >> I came from an institutional investor, and we had plenty of DAGs that ran >> daily, weekly, monthly, quarterly and yearly. >> >> Many financial analysts - who were not DAG authors themselves - would >> have access to the Airflow Webserver in order to rerun tasks. They do not >> have the ability to adjust the start_date. During Audit season, it was >> common to see yearly dags being run for earlier years. To support this, >> means we needed to implement a start date for an earlier year. Saw DAG >> authors deal with this in two ways: Set the start_date to first day of >> prior year to get the DAG out and let it run, then modify the start_date to >> something earlier or set the start_date to something earlier, watch the DAG >> and quickly update the state of the dag to success (or fail). One is better >> than the other (no fun explaining to an executive why reports were >> accidentally sent externally), but neither are great. Option 3 - setting >> the start_date between the data interval period and leaving it - always >> caused confusion with other stakeholders. >> >> A global default, and DAG-level option would have been amazing. >> >> >> >> On Sun, Mar 20, 2022 at 5:15 PM Collin McNulty >> <col...@astronomer.io.invalid> wrote: >> >>> While that’s true, I think there are often stakeholders that expect a >>> DAG to run only on the day for which it is scheduled. It’s pretty >>> straightforward for me to explain to non-technical stakeholders that “aw >>> shucks we deployed just a little too late for this week’s run, we’ll run it >>> manually to fix it”. On the contrary, explaining why a DAG that I said >>> would run on Tuesdays sent out an alert on a Friday to a VP of Finance is … >>> rough. I understand that Airflow does not make guarantees about when tasks >>> will execute, but I try to scale such that when a task can start and when >>> it does start are close enough to not have to explain the difference to >>> other stakeholders. >>> >>> Editing start_date can also be tough in some conditions. If I’m baking a >>> DAG into an image, using build-once-deploy-to-many CI/CD, and testing in a >>> lower environment for longer than the interval between runs, I’m toast on >>> setting the start_date to avoid what I consider a spurious run. That’s a >>> lot of “ands” but I think it’s a fairly common set of circumstances we >>> should support. >>> >>> Collin McNulty >>> >>> >>> >>> On Sun, Mar 20, 2022 at 3:12 PM Jarek Potiuk <ja...@potiuk.com> wrote: >>> >>>> Good. Love some mental stretching :). >>>> >>>> I believe you should **not** base the time of your run on the time it >>>> is released. Should not the DAG author know when there is a "start date" >>>> planned for the DAG? Should the decision on when the DAG interval start be >>>> made on combination of both start date in the dag **and** the time of not >>>> only when it's merged, but actually when airflow first **parses** the DAG. >>>> Not even mentioning the time zone issues. >>>> >>>> Imagine you case when DAG is merged 5 minutes between the midnight >>>> Mon/Tue and you have many DAGs. So many that parsing all the DAGs can take >>>> 20 minutes. Then the fact if your DAG runs this interval or that depends >>>> not even on the decision of when it is merged but also how long it takes >>>> Airflow to get to parse your DAG for the first time. >>>> >>>> Sounds pretty crazy :). >>>> >>>> J. >>>> >>>> >>>> On Sun, Mar 20, 2022 at 9:02 PM Collin McNulty >>>> <col...@astronomer.io.invalid> wrote: >>>> >>>>> Jarek, >>>>> >>>>> I tend to agree with you on this, but let me play devil’s advocate. If >>>>> I have a DAG that runs a report every Tuesday, I might want it to run >>>>> every >>>>> Tuesday starting whenever I am able to release the DAG. But if I release >>>>> on >>>>> a Friday, I don’t want it to try to run “for” last Tuesday. In this case, >>>>> the correct start_date for the dag is the day I release the DAG, but I >>>>> don’t know this date ahead of time and it differs per environment. Doing >>>>> this properly seems doable with a CD process that edits the DAG to insert >>>>> the start_date, but that’s fairly sophisticated tooling for a scenario >>>>> that >>>>> I imagine is quite common. >>>>> >>>>> Collin McNulty >>>>> >>>>> On Sun, Mar 20, 2022 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote: >>>>> >>>>>> Once again - why is it bad to set a start_date in the future, when - >>>>>> well - you **actually** want to run the first interval in the future ? >>>>>> What prevents you from setting the start-date to be a fixed time in >>>>>> the future, where the start date is within the interval you want to >>>>>> start first? Is it just "I do not want to specify conveniently >>>>>> whatever past date will be easy to type?" >>>>>> If this is the only reason, then it has a big drawback - because >>>>>> "start_date" is **actually** supposed to be the piece of metadata for >>>>>> the DAG that will tell you what was the intention of the DAG writer on >>>>>> when to start it. And precisely one that allows you to start things in >>>>>> the future. >>>>>> >>>>>> Am I missing something? >>>>>> >>>>>> On Sun, Mar 20, 2022 at 7:42 PM Larry Komenda >>>>>> <avoicelikerunningwa...@gmail.com> wrote: >>>>>> > >>>>>> > Alex, that's a good point regarding the need to run a DAG for the >>>>>> most recent schedule interval right away. I hadn't thought of that >>>>>> scenario >>>>>> as I haven't needed to build a DAG with that large of a scheduling gap. >>>>>> In >>>>>> that case I agree with you - it seems like it would make more sense to >>>>>> make >>>>>> this configurable. >>>>>> > >>>>>> > Perhaps there could be an additional DAG-level parameter that could >>>>>> be set alongside "catchup" to control this behavior. Or there could be a >>>>>> new parameter that could eventually replace "catchup" that supported 3 >>>>>> options - "catchup", "run most recent interval only", and "run next >>>>>> interval only". >>>>>> > >>>>>> > On Sat, Mar 19, 2022 at 1:02 PM Alex Begg <alex.b...@gmail.com> >>>>>> wrote: >>>>>> >> >>>>>> >> I would not consider it a bug to have the latest data interval run >>>>>> when you enable a DAG that is set to catchup=False. >>>>>> >> >>>>>> >> I have legitimate use for that feature by having my production >>>>>> environment have catchup_by_default=True but my lower environments are >>>>>> using catchup_by_default=False, meaning if I want to test the DAG >>>>>> behavior >>>>>> as scheduled in a lower environment I can just enable the DAG. >>>>>> >> >>>>>> >> For example, in a staging environment if I need to test out the >>>>>> functionality of a DAG that was scheduled for @monthly and there was no >>>>>> way >>>>>> to test the most recent data interval, than to test a true data interval >>>>>> of >>>>>> the DAG it could be many days, even weeks until they will occur. >>>>>> >> >>>>>> >> Triggering a DAG won’t run the latest data interval, it will use >>>>>> the current time as the logical_date, right? So that will won’t let me >>>>>> test >>>>>> a single as scheduled data interval. So in that @monthly senecio it will >>>>>> be >>>>>> impossible for me to test the functionality of a single data interval >>>>>> unless I wait multiple weeks. >>>>>> >> >>>>>> >> I see there could be a desire to not run the latest data interval >>>>>> and just start with whatever full interval follows the DAG being turned >>>>>> on. >>>>>> However I think that should be configurable, not fixed permanently. >>>>>> >> >>>>>> >> Alternatively it could be ideal to have a way to trigger a >>>>>> specific run for a catchup=False DAG that just got enabled by adding a 3d >>>>>> option to the trigger button drop down to trigger a past scheduled run. >>>>>> Then in that dialog the form can default to the most recent full data >>>>>> interval but then let you also specify a specific past interval based on >>>>>> the DAG's schedule. I often had to debug a DAG in production and I wanted >>>>>> to trigger a specific past data interval, not just the most recent. >>>>>> >> >>>>>> >> Alex Begg >>>>>> >> >>>>>> >> On Thu, Mar 17, 2022 at 4:58 PM Larry Komenda < >>>>>> avoicelikerunningwa...@gmail.com> wrote: >>>>>> >>> >>>>>> >>> I agree with this. I'd much rather have to trigger a single >>>>>> manual run the first time I enable a DAG than to either wait to enable >>>>>> until after I want it to run or by editing the start_date of the DAG >>>>>> itself. >>>>>> >>> >>>>>> >>> I'd be in favor of adjusting this behavior either permanently or >>>>>> by a configuration. >>>>>> >>> >>>>>> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe >>>>>> <pla...@cloudera.com.invalid> wrote: >>>>>> >>>> >>>>>> >>>> Hello Daniel, >>>>>> >>>> >>>>>> >>>> Thank you for your answer. In your example, as I experienced, >>>>>> the first run would not be 2010-01-01 but 2022-03-03, 00:00:00 (it is >>>>>> currently March 4 - 21:00 here), which is the execution date >>>>>> corresponding >>>>>> to the start of the previous data interval, but the result is the same: >>>>>> an >>>>>> undesired dag run. (For instance, in case of cron schedule '00 22 * * *', >>>>>> one dagrun would be started immediately with execution date of >>>>>> 2022-03-02, >>>>>> 22:00:00) >>>>>> >>>> >>>>>> >>>> I also agree with you that it could be categorized as a bug and >>>>>> I would also vote for a fix. >>>>>> >>>> >>>>>> >>>> Would be great to have the feedback of others on this. >>>>>> >>>> >>>>>> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish >>>>>> <daniel.stand...@astronomer.io.invalid> wrote: >>>>>> >>>>> >>>>>> >>>>> You are saying, when you turn on for the first time a dag with >>>>>> e.g. @daily schedule, and catchup = False, if start date is 2010-01-01, >>>>>> then it would run first the 2010-01-01 run, then the current run >>>>>> (whatever >>>>>> yesterday is)? That sounds familiar. >>>>>> >>>>> >>>>>> >>>>> Yeah I don't like that behavior. I agree that, as you say, >>>>>> it's not the intuitive behavior. Seems it could reasonably be >>>>>> categorized >>>>>> as a bug. I'd prefer we just "fix" it rather than making it >>>>>> configurable. >>>>>> But some might have concerns re backcompat. >>>>>> >>>>> >>>>>> >>>>> What do others think? >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> -- >>>>> >>>>> Collin McNulty >>>>> Lead Airflow Engineer >>>>> >>>>> Email: col...@astronomer.io <john....@astronomer.io> >>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5) >>>>> >>>>> >>>>> <https://www.astronomer.io/> >>>>> >>>> -- >>> >>> Collin McNulty >>> Lead Airflow Engineer >>> >>> Email: col...@astronomer.io <john....@astronomer.io> >>> Time zone: US Central (CST UTC-6 / CDT UTC-5) >>> >>> >>> <https://www.astronomer.io/> >>> >> >> >> -- >> >> Constance Martineau >> Product Manager >> >> Email: consta...@astronomer.io >> Time zone: US Eastern (EST UTC-5 / EDT UTC-4) >> >> >> <https://www.astronomer.io/> >> >>