Agreed!
Howard: do you fancy trying to create a PR to capture this
discussion and the reasoning in our docs?
It probably belongs on one of these three pages:
<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/scheduler.rst>
<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/dags.rst>
<https://github.com/apache/airflow/blob/main/docs/apache-airflow/concepts/timetable.rst>
Cheers,
Ash
On Mon, Feb 7 2022 at 09:13:28 +0100, Jarek Potiuk <ja...@potiuk.com>
wrote:
Yeah. That discussion actually made me think that probably we need to
explain it better :)
On Sun, Feb 6, 2022 at 11:10 PM Howard Yoo <howard...@gmail.com
<mailto:howard...@gmail.com>> wrote:
The more we discuss this topic, the better I understand the
reasons behind all those philosophies, so I appreciate the
knowledge that I have gained.
As long as those terms and principles are well described and
explained without confusion, I believe we are moving in the right
direction and that’s what matters.
- Howard
On Feb 6, 2022, at 3:24 PM, Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
IMHO it does not really matter whether they are the same or not, or
which one is the same. This is actually the beauty of the
"abstract" and "vague" logical_date. Those are different "concepts"
that you use in different cases.
The logical date **might** be the same as one of the
interval_dates. It's just an "abstract" representation of the
particular "run_id" - and you should not care, because
"logical_date" makes sense for some cases, but
"data_interval_start/end" for other cases.
* If your task is about "data_interval" - by all means use the
data_interval_start and end.
* if your task is not about "interval" - use the "logical_date".
That is how I see it at least. By using a different concept for
each case, users can free themselves from the "mental mapping"
- they do not have to map the "logical_date" to either "start" or
"end". It does not matter. But if they process a data interval,
they have very clear boundaries of the ("start" <-> "end") range that
they can use without even thinking about how "logical_date" maps to
it.
For me - those are completely different cases and they are
orthogonal to each other (even if some of those values are the
same).
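The two cases can be sketched roughly like this. It is plain Python
with no Airflow import: `context` stands in for the template context
passed to a task, and both functions are hypothetical. Only the keys
`logical_date`, `data_interval_start` and `data_interval_end` are real
context names; the `events` table, the half-open range and the label
format are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_extract_query(context: dict) -> str:
    """Interval-style task: only the interval boundaries matter."""
    start = context["data_interval_start"]
    end = context["data_interval_end"]
    # "events" is a made-up table name; the half-open range is an assumption.
    return (
        "SELECT * FROM events "
        f"WHERE ts >= '{start.isoformat()}' AND ts < '{end.isoformat()}'"
    )


def build_provision_label(context: dict) -> str:
    """Non-interval task: the run is one thing, named by its logical date."""
    return f"infra-{context['logical_date']:%Y-%m-%d}"


# Stand-in for the values a task of one daily run might receive.
day = datetime(2022, 2, 6, tzinfo=timezone.utc)
context = {
    "logical_date": day,
    "data_interval_start": day,
    "data_interval_end": day + timedelta(days=1),
}

print(build_extract_query(context))
print(build_provision_label(context))
```

Neither function needs to know how "logical_date" relates to the
interval; each one reads only the concept that fits its case.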
J.
On Sun, Feb 6, 2022 at 7:00 PM Howard Yoo <howard...@gmail.com
<mailto:howard...@gmail.com>> wrote:
I see, thank you for the info.
I didn’t know about the existence of the data_interval_start and
end dates. I briefly looked at those definitions, and was
wondering… wouldn’t they be equal to the logical dates? I do
see those variables mentioned in
<https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html>,
and also see the ds and ts meaning logical dates. In practice, are
those dates and timestamps supposed to be the same?
I also wonder whether the ‘data_’ prefix would still be necessary if
Airflow were used to orchestrate far more things in the future
(perhaps this may be another thread), but in general, we should
keep discussing this to define all those dates more clearly
and make Airflow easier to use.
Howard
On Feb 6, 2022, at 11:15 AM, Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
We already have `data_interval_start` and `data_interval_end` as
fields, and we need something else that can have more "abstract"
meaning to apply to the whole run as "single thing". Using
interval_date would be a bit ambiguous.
"Did you mean start or end actually when you mentioned interval
date?" - is the question that I anticipate happening a lot if we
mix those.
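To make that concrete, here is a minimal sketch in plain Python (no
Airflow import). `RunDates` and `daily_run` are hypothetical names used
only for illustration; the three field names match the template
variables. It assumes a scheduled run on a plain daily schedule, where
the logical date happens to coincide with the interval start; manually
triggered runs or custom timetables may behave differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunDates:
    """Hypothetical container for the three dates attached to one run."""

    logical_date: datetime         # identifies the run as a single thing
    data_interval_start: datetime  # start of the data window
    data_interval_end: datetime    # end of the data window


def daily_run(start: datetime) -> RunDates:
    """Build the dates for one scheduled run of an assumed daily schedule."""
    return RunDates(
        logical_date=start,
        data_interval_start=start,
        data_interval_end=start + timedelta(days=1),
    )


run = daily_run(datetime(2022, 2, 6, tzinfo=timezone.utc))
# A single "interval_date" would have to mean one of these two values;
# the explicit start/end fields leave no room for that question.
print(run.data_interval_start.isoformat(), run.data_interval_end.isoformat())
```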
J.
On Sun, Feb 6, 2022 at 6:04 PM Howard Yoo <howard...@gmail.com
<mailto:howard...@gmail.com>> wrote:
Now I can understand why the data_date may not be a perfect fit
to describe the term.
This is not to argue against logical_date, but what about
‘interval_date’? We have the schedule interval, which
defines the duration of the interval (e.g. 1 day), so wouldn’t
interval start and end dates be a better representation of it
than the logical date?
Just want to hear whether that has been brought up already or
not.
Howard
On Feb 6, 2022, at 10:25 AM, Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
I wholeheartedly agree with TP on that one. I think while some
time ago "data date" could make sense, Airflow's future is much
more than just processing data intervals.
This is the primary use case and this is where Airflow shines
of course, but there are good examples of Airflow being used
for other things out there. While we are not really encouraging
those uses, they are not only legitimate, but also something that
I hope Airflow will treat as first-class citizens soon (and it
kind of already does with custom timetables).
Just an example here - for me one of the most eye-opening talks
at last year's Airflow Summit:
<https://airflowsummit.org/sessions/2021/provision-as-a-service/>
In this talk Cloudflare engineers explain how they manage the
Cloudflare infrastructure using Airflow.
The "data date" has no meaning in this case. But the "logical
date" (which is the vaguest-possible one as TP explained)
continues to have one. This is the "logical date of the
infrastructure provisioning". Thanks to Airflow (as I
understand it) Cloudflare is able to re-provision their
services to "yesterday's logical date infrastructure" today -
for example.
That would not fly with "data date".
J.