I wanted to add to this discussion that I had my own
confusion about logical_date and data_interval_end (I talked about this on
Slack with Jed Cunningham a month ago:
https://apache-airflow.slack.com/archives/CSS36QQS1/p1641410896219600 ).

I wanted to thank Jarek for clearing this up a bit in this thread. Your
explanation of the following makes the most sense to me and would be how I
think about which to use in my DAGs:

* If your task is about "data_interval" - by all means use the
data_interval_start and end.
* if your task is not about "interval" - use the "logical_date".
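
To make the two bullets above concrete, here is a minimal sketch in plain
Python. It assumes Airflow 2.2+, where data_interval_start,
data_interval_end and logical_date are available in the task context and
are passed to a Python callable whose keyword arguments use those names.
The function names, the "events" table and the snapshot label format are
hypothetical, and the context dict is built by hand so that the sketch
runs without Airflow installed.

```python
from datetime import datetime, timedelta, timezone


def build_extract_query(data_interval_start: datetime, data_interval_end: datetime) -> str:
    """Interval-oriented task: select rows in the half-open range [start, end)."""
    # The table and column names are hypothetical, for illustration only.
    return (
        "SELECT * FROM events "
        f"WHERE created_at >= '{data_interval_start.isoformat()}' "
        f"AND created_at < '{data_interval_end.isoformat()}'"
    )


def build_snapshot_label(logical_date: datetime) -> str:
    """Non-interval task: the run only needs a single identifying date."""
    # The label format is hypothetical; it never looks at the interval bounds.
    return f"infra-snapshot-{logical_date:%Y%m%dT%H%M%S}"


# Hand-built stand-in for the context Airflow would supply for a daily run.
# The values are assumptions chosen for illustration.
start = datetime(2022, 2, 6, tzinfo=timezone.utc)
context = {
    "data_interval_start": start,
    "data_interval_end": start + timedelta(days=1),
    "logical_date": start,
}

query = build_extract_query(
    context["data_interval_start"], context["data_interval_end"]
)
label = build_snapshot_label(context["logical_date"])
print(query)
print(label)
```

Neither function has to know whether logical_date happens to equal the
start or the end of the interval, which is the point of keeping the two
concepts separate.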

It will help immensely to have this clarified in the documentation because
I am sure a lot of others are a bit confused about this but are just
staying quiet.

Thanks,

Alex


On Mon, Feb 7, 2022 at 8:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> You have my axe :)
>
> On Mon, Feb 7, 2022 at 5:00 PM Howard Yoo <howard...@gmail.com> wrote:
>
>> Sure, I could try! But I definitely need Jarek's help (and the others') on
>> it - so I would like to work with Jarek and have him review any changes
>> that I make (and make sure the wording and definitions match the intended
>> design).
>>
>> - Howard
>>
>> On Mon, Feb 7, 2022 at 9:38 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>> Agreed!
>>>
>>> Howard: do you fancy trying to create a PR to capture this discussion/the
>>> reasoning in our docs?
>>>
>>> It probably belongs on one of these three pages
>>>
>>>
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/scheduler.rst
>>>
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/dags.rst
>>>
>>> https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/timetable.rst
>>>
>>> Cheers,
>>>
>>> Ash
>>>
>>> On Mon, Feb 7 2022 at 09:13:28 +0100, Jarek Potiuk <ja...@potiuk.com>
>>> wrote:
>>>
>>> Yeah. That discussion actually made me think that probably we need to
>>> explain it better :)
>>>
>>> On Sun, Feb 6, 2022 at 11:10 PM Howard Yoo <howard...@gmail.com> wrote:
>>>
>>>> The more we discuss this topic, the better I understand the reasoning
>>>> behind these design philosophies, so I appreciate the knowledge that I
>>>> have gained.
>>>>
>>>> As long as those terms and principles are well described and explained
>>>> without confusion, I believe we are moving in the right direction and
>>>> that’s what matters.
>>>>
>>>> - Howard
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 6, 2022, at 3:24 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>> 
>>>> IMHO It does not really matter if they are the same or not and which
>>>> one is the same. This is actually the beauty of the "abstract" and "vague"
>>>> logical_date. Those are different "concepts" that you use in different
>>>> cases.
>>>>
>>>> The logical date **might** be the same as one of the interval_dates.
>>>> It's just an "abstract" representation of the particular "run_id" - and you
>>>> should not care, because "logical_date" makes sense for some cases, but
>>>> "data_interval_start/end" for other cases.
>>>>
>>>> * If your task is about "data_interval" - by all means use the
>>>> data_interval_start and end.
>>>> * if your task is not about "interval" - use the "logical_date".
>>>>
>>>> That is how I see it at least. By using a different value for each kind
>>>> of case, users can free themselves from "mental mapping" - they do not
>>>> have to map the "logical_date" to either "start" or "end". It does not
>>>> matter. But if they process a data interval, they have very clear
>>>> boundaries (the "start" <-> "end" range) that they can use without even
>>>> thinking about how "logical_date" maps to it.
>>>>
>>>> For me - those are completely different cases and they are orthogonal
>>>> to each other (even if some of those values are the same).
>>>>
>>>> J.
>>>>
>>>> On Sun, Feb 6, 2022 at 7:00 PM Howard Yoo <howard...@gmail.com> wrote:
>>>>
>>>>> I see, thank you for the info.
>>>>> I didn’t know about the existence of the data_interval_start and end
>>>>> dates. I briefly looked at those definitions, and was wondering… wouldn’t
>>>>> they be equal to the logical dates? I do see those variables mentioned in
>>>>> https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html,
>>>>> and also see the ds and ts meaning logical dates. In practice, are those
>>>>> dates and timestamps supposed to be the same?
>>>>>
>>>>> Wonder also, if the ‘data_’ prefix would be necessary if Airflow would
>>>>> be used to orchestrate far more things in the future (perhaps this may be
>>>>> another thread), but in general, we should have a continuing discussion
>>>>> to define all those dates more clearly and improve the usage of Airflow.
>>>>>
>>>>> Howard
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Feb 6, 2022, at 11:15 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>> 
>>>>> We already have `data_interval_start` and `data_interval_end` as
>>>>> fields, and we need something else that can have a more "abstract"
>>>>> meaning to apply to the whole run as a "single thing". Using
>>>>> interval_date would be a bit ambiguous.
>>>>>
>>>>> "Did you mean start or end actually when you mentioned interval date?"
>>>>> - is the question that I anticipate happening a lot if we mix those.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Feb 6, 2022 at 6:04 PM Howard Yoo <howard...@gmail.com> wrote:
>>>>>
>>>>>> Now I can understand why the data_date may not be a perfect fit to
>>>>>> describe the term.
>>>>>>
>>>>>> This is not to be against the logical_date, but what about
>>>>>> ‘interval_date’? We have the schedule interval, which defines the
>>>>>> duration of the interval (e.g. 1 day), so wouldn’t interval start and
>>>>>> end date be a better representation of it rather than the logical date?
>>>>>>
>>>>>> Just want to hear whether that has been brought up already or not.
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Feb 6, 2022, at 10:25 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>
>>>>>> 
>>>>>> I wholeheartedly agree with TP on that one. I think while some time
>>>>>> ago "data date" could make sense, Airflow's future is much more than just
>>>>>> processing data intervals.
>>>>>> This is the primary use case and this is where Airflow shines, of
>>>>>> course, but there are other good examples of how Airflow is used out
>>>>>> there. While we are not really encouraging them, they are not only
>>>>>> legitimate, but also something that I hope Airflow will treat as
>>>>>> first-class citizens soon (and it kind of already does with custom
>>>>>> timetables).
>>>>>>
>>>>>> Just an example here - for me, one of the most eye-opening talks at
>>>>>> last year's Airflow Summit was:
>>>>>> https://airflowsummit.org/sessions/2021/provision-as-a-service/
>>>>>> In this talk, Cloudflare engineers explain how they manage the
>>>>>> Cloudflare infrastructure using Airflow.
>>>>>>
>>>>>> The "data date" has no meaning in this case. But the "logical date"
>>>>>> (which is the vaguest-possible one as TP explained) continues to have
>>>>>> one. This is the "logical date of the infrastructure provisioning".
>>>>>> Thanks to Airflow (as I understand it) Cloudflare is able to re-provision
>>>>>> their services to "yesterday's logical date infrastructure" today - for
>>>>>> example.
>>>>>>
>>>>>> That would not fly with "data date".
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
