Agreed!

Howard: do you fancy trying to create a PR to capture this discussion/the reasoning in our docs?

It probably belongs on one of these three pages:

<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/scheduler.rst>
<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/dags.rst>
<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/timetable.rst>

Cheers,

Ash

On Mon, Feb 7 2022 at 09:13:28 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
Yeah. That discussion actually made me think that probably we need to explain it better :)

On Sun, Feb 6, 2022 at 11:10 PM Howard Yoo <howard...@gmail.com> wrote:
The more we discuss this topic, the better I understand the reasoning behind these design philosophies, so I appreciate the knowledge I have gained.

As long as those terms and principles are well described and explained without confusion, I believe we are moving in the right direction, and that’s what matters.

- Howard

On Feb 6, 2022, at 3:24 PM, Jarek Potiuk <ja...@potiuk.com> wrote:


IMHO it does not really matter whether they are the same, or which one matches which. This is actually the beauty of the "abstract" and "vague" logical_date. Those are different "concepts" that you use in different cases.

The logical date **might** be the same as one of the interval_dates. It's just an "abstract" representation of the particular "run_id" - and you should not care, because "logical_date" makes sense for some cases, but "data_interval_start/end" for other cases.

* If your task is about "data_interval" - by all means use the data_interval_start and end.
* If your task is not about "interval" - use the "logical_date".

That is how I see it at least. By using a different value for each kind of case, users can free themselves from "mental mapping" - they do not have to map the "logical_date" to either "start" or "end". It does not matter. But if they process a data interval, they have very clear boundaries of the ("start" <-> "end") range that they can use without even thinking about how "logical_date" maps to it.
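
To make the distinction concrete, here is a minimal sketch in plain Python (it does not use Airflow's API). The `RunContext` class and the functions `rows_in_interval` and `snapshot_name` are hypothetical names made up for illustration; the only thing borrowed from Airflow is the idea that a run carries a logical date plus a data interval start and end.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunContext:
    """Hypothetical stand-in for the values a run exposes to a task."""
    logical_date: datetime
    data_interval_start: datetime
    data_interval_end: datetime


def rows_in_interval(rows, ctx):
    """Interval-style task: only the start/end boundaries matter."""
    return [
        r for r in rows
        if ctx.data_interval_start <= r["ts"] < ctx.data_interval_end
    ]


def snapshot_name(ctx):
    """Non-interval task: the logical date simply identifies the run."""
    return f"infra-snapshot-{ctx.logical_date:%Y-%m-%d}"


start = datetime(2022, 2, 6, tzinfo=timezone.utc)
ctx = RunContext(
    logical_date=start,  # assumption: happens to coincide with the interval start
    data_interval_start=start,
    data_interval_end=start + timedelta(days=1),
)
rows = [
    {"ts": datetime(2022, 2, 5, 23, 59, tzinfo=timezone.utc)},
    {"ts": datetime(2022, 2, 6, 12, 0, tzinfo=timezone.utc)},
    {"ts": datetime(2022, 2, 7, 0, 0, tzinfo=timezone.utc)},
]
selected = rows_in_interval(rows, ctx)  # uses only start/end
name = snapshot_name(ctx)               # uses only logical_date
```

Neither function needs to know how the logical date relates to the interval boundaries, which is the point: each task picks the concept that fits its case.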

For me - those are completely different cases and they are orthogonal to each other (even if some of those values are the same).

J.

On Sun, Feb 6, 2022 at 7:00 PM Howard Yoo <howard...@gmail.com> wrote:
I see, thank you for the info.
I didn’t know about the existence of the data_interval_start and end dates. I briefly looked at those definitions, and was wondering… wouldn’t they be equal to the logical dates? I do see those variables mentioned in <https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html>, and also see the ds and ts meaning logical dates. In practice, are those dates and timestamps supposed to be the same?
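
As a rough illustration of the question, the sketch below mimics in plain Python how the templates reference describes `ds` (the logical date as YYYY-MM-DD) and `ts` (the logical date as an ISO-formatted timestamp). The helpers `render_ds` and `render_ts` are hypothetical and are not Airflow functions, and the example assumes a plain daily schedule where the logical date coincides with the start of the data interval.

```python
from datetime import datetime, timedelta, timezone


def render_ds(logical_date):
    """Hypothetical helper: date-only string, like the `ds` template value."""
    return logical_date.strftime("%Y-%m-%d")


def render_ts(logical_date):
    """Hypothetical helper: ISO timestamp string, like the `ts` template value."""
    return logical_date.isoformat()


logical_date = datetime(2022, 2, 6, tzinfo=timezone.utc)
# Assumption: a plain daily schedule, so the interval starts at the logical date.
data_interval_start = logical_date
data_interval_end = logical_date + timedelta(days=1)

ds = render_ds(logical_date)
ts = render_ts(logical_date)
same_as_start = render_ds(data_interval_start) == ds
same_as_end = render_ds(data_interval_end) == ds
```

Under that assumption `ds` matches the interval start but not the interval end; other schedules need not line up this way.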

I also wonder whether the ‘data_’ prefix would still be appropriate if Airflow is used to orchestrate far more things in the future (perhaps this is a topic for another thread). In general, though, we should keep discussing and clearly defining all those dates to make Airflow easier to use.

Howard

On Feb 6, 2022, at 11:15 AM, Jarek Potiuk <ja...@potiuk.com> wrote:


We already have `data_interval_start` and `data_interval_end` as fields, and we need something else that can have a more "abstract" meaning to apply to the whole run as a "single thing". Using interval_date would be a bit ambiguous.

"Did you mean start or end actually when you mentioned interval date?" - is the question that I anticipate happening a lot if we mix those.

J.



On Sun, Feb 6, 2022 at 6:04 PM Howard Yoo <howard...@gmail.com> wrote:
Now I can understand why the data_date may not be a perfect fit to describe the term.

This is not to argue against logical_date, but what about ‘interval_date’? We have the schedule interval, which defines the duration of the interval (e.g. 1 day), so wouldn’t the interval start and end dates be a better representation of it than the logical date?

Just want to hear whether that has been brought up already or not.

Howard

On Feb 6, 2022, at 10:25 AM, Jarek Potiuk <ja...@potiuk.com> wrote:


I wholeheartedly agree with TP on that one. I think while some time ago "data date" could make sense, Airflow's future is much more than just processing data intervals. This is the primary use case and this is where Airflow shines, of course, but there are good examples of other ways Airflow is used out there. While we are not really encouraging them, they are not only legitimate but also something that I hope Airflow will treat as first-class citizens soon (and it kind of already does with custom timetables).

Just an example here - for me, one of the most eye-opening talks at last year's Airflow Summit was <https://airflowsummit.org/sessions/2021/provision-as-a-service/>. In this talk, Cloudflare engineers explain how they manage the Cloudflare infrastructure using Airflow.

The "data date" has no meaning in this case. But the "logical date" (which is the vaguest possible one, as TP explained) continues to have one. This is the "logical date of the infrastructure provisioning". Thanks to Airflow (as I understand it), Cloudflare is able to re-provision their services to "yesterday's logical date infrastructure" today - for example.

That would not fly with "data date".

J.

