I wanted to add to this discussion that I had my own confusion about logical_date and data_interval_end (I talked about this on Slack with Jed Cunningham a month ago: https://apache-airflow.slack.com/archives/CSS36QQS1/p1641410896219600 ).
I wanted to thank Jarek for clearing this up a bit in this thread. Your explanation of the following makes the most sense to me and would be how I think about which to use in my DAGs:

* If your task is about "data_interval" - by all means use the data_interval_start and end.
* if your task is not about "interval" - use the "logical_date".

It will help immensely to have this clarified in the documentation, because I am sure a lot of others are a bit confused about this but are just staying quiet.

Thanks,
Alex

On Mon, Feb 7, 2022 at 8:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> You have my axe :)
>
> On Mon, Feb 7, 2022 at 5:00 PM Howard Yoo <howard...@gmail.com> wrote:
>
>> Sure, I could try! But I definitely need Jarek's help (and the others) on
>> it - so would like to work with Jarek for him to review any changes that I
>> make (and make sure the wordings, definitions, are correct to the intended
>> design).
>>
>> - Howard
>>
>> On Mon, Feb 7, 2022 at 9:38 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>> Agreed!
>>>
>>> Howard: do you fancy trying to create a PR to capture this discussion/the
>>> reasoning in our docs?
>>>
>>> It probably belongs on one of these three pages
>>>
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/scheduler.rst
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/dags.rst
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/timetable.rst
>>>
>>> Cheers,
>>>
>>> Ash
>>>
>>> On Mon, Feb 7 2022 at 09:13:28 +0100, Jarek Potiuk <ja...@potiuk.com>
>>> wrote:
>>>
>>> Yeah. That discussion actually made me think that probably we need to
>>> explain it better :)
>>>
>>> On Sun, Feb 6, 2022 at 11:10 PM Howard Yoo <howard...@gmail.com> wrote:
>>>
>>>> As we discuss this topic, I get to understand more and more of the
>>>> reasons behind all those philosophies, so I appreciate the knowledge
>>>> that I gained.
>>>>
>>>> As long as those terms and principles are well described and explained
>>>> without confusion, I believe we are moving in the right direction and
>>>> that’s what matters.
>>>>
>>>> - Howard
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 6, 2022, at 3:24 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>
>>>> IMHO It does not really matter if they are the same or not and which
>>>> one is the same. This is actually the beauty of the "abstract" and "vague"
>>>> logical_date. Those are different "concepts" that you use in different
>>>> cases.
>>>>
>>>> The logical date **might** be the same as one of the interval_dates.
>>>> It's just an "abstract" representation of the particular "run_id" - and you
>>>> should not care, because "logical_date" makes sense for some cases, but
>>>> "data_interval_start/end" for other cases.
>>>>
>>>> * If your task is about "data_interval" - by all means use the
>>>> data_interval_start and end.
>>>> * if your task is not about "interval" - use the "logical_date".
>>>>
>>>> That is how I see it at least. By using a different approach for
>>>> different cases, the users might free their "mental-mapping" - they do
>>>> not have to map the "logical_date" to either "start" or "end". It does not
>>>> matter. But if they process a data interval, they have very clear
>>>> boundaries of ("start" <-> "end") range that they can use without even
>>>> thinking about how "logical_date" maps to it.
>>>>
>>>> For me - those are completely different cases and they are orthogonal
>>>> to each other (even if some of those values are the same).
>>>>
>>>> J.
>>>>
>>>> On Sun, Feb 6, 2022 at 7:00 PM Howard Yoo <howard...@gmail.com> wrote:
>>>>
>>>>> I see, thank you for the info.
>>>>> I didn’t know about the existence of the data_interval_start and end
>>>>> dates. I briefly looked at those definitions, and was wondering… wouldn’t
>>>>> they be equal to the logical dates?
>>>>> I do see those variables mentioned in
>>>>> https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html,
>>>>> and also see the ds and ts meaning logical dates. In practice, are those
>>>>> dates and timestamps supposed to be the same?
>>>>>
>>>>> I also wonder whether the ‘data_’ prefix would be necessary if Airflow
>>>>> is used to orchestrate far more things in the future (perhaps this may be
>>>>> another thread), but in general, we should have continuous discussions
>>>>> to define all those dates more clearly for the improved usage of Airflow.
>>>>>
>>>>> Howard
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Feb 6, 2022, at 11:15 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>
>>>>> We already have `data_interval_start` and `data_interval_end` as
>>>>> fields, and we need something else that can have a more "abstract"
>>>>> meaning that applies to the whole run as a "single thing". Using
>>>>> interval_date would be a bit ambiguous.
>>>>>
>>>>> "Did you mean start or end actually when you mentioned interval date?"
>>>>> - is the question that I anticipate happening a lot if we mix those.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Feb 6, 2022 at 6:04 PM Howard Yoo <howard...@gmail.com> wrote:
>>>>>
>>>>>> Now I can understand why the data_date may not be a perfect fit to
>>>>>> describe the term.
>>>>>>
>>>>>> This is not to be against the logical_date, but what about
>>>>>> ‘interval_date’? We have the schedule interval, which defines the
>>>>>> duration of the interval (e.g. 1 day), so wouldn’t interval start and
>>>>>> end date be a better representation of it rather than the logical date?
>>>>>>
>>>>>> Just want to hear whether that has been brought up already or not.
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Feb 6, 2022, at 10:25 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>
>>>>>>
>>>>>> I wholeheartedly agree with TP on that one.
>>>>>> I think while some time
>>>>>> ago "data date" could make sense, Airflow's future is much more than just
>>>>>> processing data intervals.
>>>>>> This is the primary use case and this is where Airflow shines of
>>>>>> course, but there are other good examples of how Airflow is used out
>>>>>> there. While we are not really encouraging them, they are not only
>>>>>> legitimate, but also something that I hope Airflow will treat as
>>>>>> first-class citizens soon (and it kind of already does with custom
>>>>>> timetables).
>>>>>>
>>>>>> Just an example here - for me, one of the most eye-opening talks at
>>>>>> last year's Airflow Summit was
>>>>>> https://airflowsummit.org/sessions/2021/provision-as-a-service/
>>>>>> In this talk Cloudflare engineers explain how they manage the
>>>>>> Cloudflare infrastructure using Airflow.
>>>>>>
>>>>>> The "Data date" has no meaning in this case. But the "logical Date"
>>>>>> (which is the vaguest-possible one as TP explained) continues to have
>>>>>> one.
>>>>>> This is the "logical date of the infrastructure provisioning". Thanks
>>>>>> to Airflow (as I understand it) Cloudflare is able to re-provision their
>>>>>> services to "yesterday's logical date infrastructure" today - for
>>>>>> example.
>>>>>>
>>>>>> That would not fly with "data date".
>>>>>>
>>>>>> J,
>>>>>>
>>>>>>
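
The rule of thumb from this thread (use data_interval_start/end when the task processes an interval, use logical_date when it does not) could be sketched roughly as below. This is plain Python, not the Airflow API: `RunContext`, `build_extract_query`, `build_snapshot_name`, the `events` table and the snapshot naming are all hypothetical, made up only to illustrate the distinction. Only the three field names come from the discussion above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunContext:
    """Hypothetical stand-in for the per-run values discussed in the thread."""

    logical_date: datetime
    data_interval_start: datetime
    data_interval_end: datetime


def build_extract_query(ctx: RunContext) -> str:
    """Interval-based task: relies only on the explicit start/end boundaries."""
    return (
        "SELECT * FROM events "
        f"WHERE created_at >= '{ctx.data_interval_start.isoformat()}' "
        f"AND created_at < '{ctx.data_interval_end.isoformat()}'"
    )


def build_snapshot_name(ctx: RunContext) -> str:
    """Non-interval task: only needs a label that identifies the run."""
    return f"infra-snapshot-{ctx.logical_date:%Y-%m-%d}"


# In this made-up run the logical date happens to equal the interval start,
# but neither function depends on that being true.
start = datetime(2022, 2, 6, tzinfo=timezone.utc)
ctx = RunContext(
    logical_date=start,
    data_interval_start=start,
    data_interval_end=start + timedelta(days=1),
)

print(build_extract_query(ctx))
print(build_snapshot_name(ctx))
```

Neither function has to know how logical_date maps onto the interval, which is the "free the mental-mapping" point made above.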