Daniel:

On Sat, Jan 11, 2020 at 11:10 PM Daniel Standish <dpstand...@gmail.com> wrote:

> To banish anything stateful seems arbitrary and unnecessary. Airflow is
> more than its canonical task structure: hook / operator framework and
> ecosystem, scheduling, retry, alerting, distributed computing, etc etc etc
> etc.

I think we should be really conscious and deliberately decide what Airflow does and what it does not do. It's a glorified CRON with fixed intervals to process the data. That's about it. I think we should not turn it into a generic DAG executor to handle more cases. There are plenty of more or less generic DAG workflow execution engines, and our goal is not to build a generic DAG workflow engine and replace them.

For me it is a really basic assumption, and changing it would require completely changing the direction of the project. This assumption is pretty much foundational for Airflow. It's the kind of base that we should look back at and ask "does the change fit that basic assumption?" whenever we make any serious decision. I really like the idea of doing one thing very well, and I think Airflow is that kind of tool. IMHO we should not make it easier to use it for cases it was not designed for, even if we can.

> As long as support for the canonical task is preserved, what's the harm in
> supporting stateful usage where it makes sense?

The harm is that we will have to implement it, answer questions, and forever support all the use cases people might come up with for such a "state". People are creative, and once they have such a generic feature in their hands they will use it for various things. By being opinionated, we won't handle all such cases, and we can simply answer people who want to (ab)use it: "it's not the intention of Airflow".

Of course we risk that Airflow will not be used by those people in those cases... But I think this is in fact exactly what we want. I'd love for people to use Apache Beam for streaming and incremental processing. It's a fantastic tool for that.

> Airflow may not have been designed initially to support incremental
> processes. But it is a living thing, and as it happens, it can work well
> for them.

I think this is a case of "if you have a hammer, everything looks like a nail". The fact that it can does not mean it's the best tool for that or that you are using it properly.

> I think the two approaches can coexist harmoniously.

I don't think so. By adding state you lose the idempotency property, which is, again, a foundational assumption for all operators. We have written 100s of operators so far, and idempotency was often the difficult part. This means that you had to work a bit harder to have a good, idempotent operator. But by doing so, your users can simply rely on the DAG. At any point in time they can backfill the DAG from a month ago for a given day and they do not have to worry about it.

This is THE most important feature of Airflow, I think. You can have 100s or 1000s of DAGs in your company, written by 10s of other people, and have one person operate all of them. As an operator, you do not have to know any details about how each operator and DAG works. What you know is that you can re-run/backfill any portion of a DAG from the past and that it will work. When you know you have to fix some portion of data, and you have fixed the algorithms or reference data or cleanup process, you do not have to understand how it all works. You simply backfill.

By adding "maybes" to the whole picture (this is what stateful tasks are about in this context: it "may" work when backfilling, but not necessarily) we are undermining the basic trust the operator might have in backfilling tasks. Of course it's a bit of an oversimplification, but it reflects the most important (for me) usage and the reason why people would like to use Airflow. I think it is really important to not have "maybes" here and to be opinionated. This leads to trust in Airflow.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129