Daniel:

On Sat, Jan 11, 2020 at 11:10 PM Daniel Standish <dpstand...@gmail.com>
wrote:

> To banish anything stateful seems arbitrary and unnecessary.  Airflow is
> more than its canonical task structure: hook / operator framework and
> ecosystem, scheduling, retry, alerting, distributed computing, etc etc etc
> etc.
>

I think we should be really conscious and deliberate in deciding what
Airflow does and what it does not.
It's a glorified CRON with fixed intervals to process the data. That's
about it.
I think we should not turn it into a generic DAG executor to handle more
cases. There are plenty of more or less generic DAG workflow execution
engines, and our goal is not to build another one and replace them. For me
this is a really basic assumption, and changing it would require completely
changing the direction of the project. This assumption is pretty much
foundational for Airflow. It's the kind of base that we should look back at
and ask "does the change fit that basic assumption?" whenever we make any
serious decision. I really like the idea of doing one thing very well, and
I think Airflow is that kind of tool. IMHO we should not make it easier to
use it for cases it was not designed for, even if we can.

> As long as support for the canonical task is preserved, what's the harm in
> supporting stateful usage where it makes sense?
>

The harm is that we will have to implement it, answer questions about it,
and support forever all the use cases people might come up with for such a
"state". People are creative, and once they have such a generic feature in
their hands they will use it for various things. By being opinionated, we
won't have to handle all such cases, and we can simply answer people who
want to (ab)use it: "it's not the intention of Airflow". Of course we risk
that Airflow will not be used by those people in those cases ... But I think
this is exactly what we want, in fact. I'd love people to use Apache Beam
for streaming and incremental processing. It's a fantastic tool for that.


> Airflow may not have been designed initially to support incremental
> processes.  But it is a living thing, and as it happens, it can work well
> for them.
>

I think this is a case of "if you have a hammer, everything looks like a
nail". The fact that it can does not mean it's the best tool for that, or
that you are using it properly.


> I think the two approaches can coexist harmoniously.
>

I don't think so. By adding state you lose the idempotency property, which
is, again, a foundational assumption for all operators. We have written 100s
of operators so far, and idempotency was often the difficult part. This
means that you had to work a bit harder to have a good, idempotent operator.
But by doing so, your users can simply rely on the DAG. At any point in time
they can backfill the DAG from a month ago for a given day and they do not
have to worry about it.
This is THE most important feature of Airflow, I think. You can have 100s or
1000s of DAGs in your company, written by 10s of other people, and have one
person operate all of them. As that person, you do not have to know any
details about how each operator and DAG works. What you know is that you can
re-run/backfill any portion of a DAG from the past and that it will work.
When you know you have to fix some portion of data, and you have fixed the
algorithms or reference data or cleanup process, you do not have to
understand how it all works. You simply backfill. By adding "maybes" to the
whole picture (this is what stateful tasks are about in this context: it
"may" work when you backfill, but not necessarily) we are undermining the
basic trust that person might have in backfilling tasks. Of course it's a
bit of an oversimplification, but it reflects the most important (for me)
usage and the reason why people would want to use Airflow.
I think it is really important to not have "maybes" here and to be
opinionated. This leads to trust in Airflow.
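
To make the backfill point a bit more concrete, here is a rough sketch in
plain Python. It does not use any Airflow API; SOURCE, idempotent_load and
stateful_load are made-up names purely for illustration. The first function
depends only on the date it is given, so re-running it for a past day is
always safe. The second depends on a stored watermark, so what a re-run does
depends on what ran before.

```python
from datetime import date

# Hypothetical source data, keyed by day (illustration only).
SOURCE = {
    date(2020, 1, 1): [1, 2, 3],
    date(2020, 1, 2): [10, 20],
}


def idempotent_load(target, execution_date):
    """Recompute and overwrite the partition for one given day.

    The result depends only on execution_date and the source data,
    so running it again for the same day is always safe.
    """
    target[execution_date] = sum(SOURCE.get(execution_date, []))
    return target


def stateful_load(target, state):
    """Process only the days after the stored watermark.

    The result depends on what earlier runs left in `state`, so a
    re-run will not pick up corrections to already-processed days.
    """
    watermark = state.get("watermark")
    pending = sorted(d for d in SOURCE if watermark is None or d > watermark)
    for day in pending:
        target[day] = sum(SOURCE[day])
        state["watermark"] = day
    return target
```

If the source data for 2020-01-01 is corrected later, calling
idempotent_load for that day again simply rewrites the partition, while
calling stateful_load again does nothing for that day, because the watermark
has already moved past it.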

J.



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
