Alright since I was summoned...

When I was an Airflow user, I did a lot of incremental processes.  Pretty
much everything was incremental: data warehousing, analytics shop,
e-commerce reporting, integrations, that kind of thing.

One common use case is implementing something like a Fivetran, which I did
a few times.

For me, execution date was almost entirely useless.  Execution date is
there for partition-driven workloads.

For incremental, you need to track your state somehow.

That's why I experimented with various state storage interfaces and
developed a watermark operator, which we used a lot.  I demoed a
version of it here <https://github.com/apache/airflow/pull/19051>, and
authored AIP-30
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>
.
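For the curious, the watermark pattern is simple enough to sketch in a few lines.  This is a hypothetical illustration, not the operator from the PR: `get_state`, `set_state`, and `incremental_extract` are made-up names, and the in-memory dict stands in for whatever persistent store you actually use.

```python
import datetime

_STATE = {}  # stand-in for a persistent store (database table, S3 object, etc.)

def get_state(key, default=None):
    return _STATE.get(key, default)

def set_state(key, value):
    _STATE[key] = value

def incremental_extract(source_name, fetch_rows):
    """Pull only rows newer than the last recorded watermark."""
    # Low end of the window: last committed watermark, or epoch on first run.
    low = get_state(f"{source_name}/watermark",
                    default=datetime.datetime(1970, 1, 1))
    # High end: "now", captured once so the window is well-defined.
    high = datetime.datetime.utcnow()
    # e.g. SELECT ... WHERE updated_at > :low AND updated_at <= :high
    rows = fetch_rows(low, high)
    # Advance the watermark only after the fetch succeeds, so a failed
    # run retries the same window instead of silently skipping it.
    set_state(f"{source_name}/watermark", high)
    return rows
```

The key property is that each run picks up exactly where the last successful run left off, independent of any execution date.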

I wrote AIP-30 when I was still contributing to Airflow for funsies, and
it didn't get a ton of engagement, so it sort of languished; then, when I
became a full-time Airflow dev, there were other priorities.

But to me the use case is still pretty obvious.  Nothing we have added
since then really explicitly supports incremental workflows.

To me the question is (as it was then, and I think I mentioned this in the
AIP): do you provide a generic interface where the user controls the
namespace and name of the state being persisted?  Or do you instead
provide mechanisms to store state on existing objects, so that e.g. on a
trigger, a task, or whatever, you can do `self.save_state(key...)` etc.?
In my proposal I think I leaned towards generic, and it seems Jake leans
the same way.  There are pros and cons to each.
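To make the contrast concrete, here is a rough sketch of the two API shapes.  All the names (`StateStore`, `StatefulTask`, `save_state`, `load_state`) are made up for illustration; neither shape exists in Airflow today.

```python
# Option A: generic interface -- the caller controls namespace and key.
class StateStore:
    def __init__(self):
        self._data = {}

    def set(self, namespace, key, value):
        self._data[(namespace, key)] = value

    def get(self, namespace, key, default=None):
        return self._data.get((namespace, key), default)


# Option B: state hangs off an existing object (task, trigger, ...);
# the object supplies the namespace implicitly, so users can't collide
# with each other but also can't share state across objects as easily.
class StatefulTask:
    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store

    def save_state(self, key, value):
        self._store.set(f"task/{self.task_id}", key, value)

    def load_state(self, key, default=None):
        return self._store.get(f"task/{self.task_id}", key, default)
```

The generic shape is more flexible (state can outlive or span tasks); the object-scoped shape gives you namespacing and lifecycle for free.  That's roughly the trade-off.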

In terms of the underlying storage mechanism, it seems pretty reasonable
to make this pluggable like everything else.  I used different "backends"
at different times -- S3, or the database.  Typically you don't need mega
low latency with the type of tasks Airflow is used for.
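A pluggable backend could be as simple as a small abstract interface, in the same spirit as Airflow's swappable secrets and XCom backends.  Again, these names are hypothetical; the in-memory class stands in for an S3- or database-backed implementation.

```python
from abc import ABC, abstractmethod
from typing import Optional


class StateBackend(ABC):
    """Minimal contract a state backend would need to satisfy."""

    @abstractmethod
    def write(self, key: str, value: str) -> None:
        ...

    @abstractmethod
    def read(self, key: str) -> Optional[str]:
        ...


class InMemoryStateBackend(StateBackend):
    """Stand-in for an S3- or database-backed implementation."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)
```

Since the access pattern is one read at task start and one write at task end, almost any durable store clears the latency bar.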
