> I'm curious Jarek, does Airflow take any dependencies on popular libraries like pandas, numpy, pyarrow, scipy, etc... which users are likely to have their own dependency on? I think these dependencies are challenging in a different way than the client libraries - ideally we would support a wide version range so as not to require users to upgrade those libraries in lockstep with Beam. However in some cases our dependency is pretty tight (e.g. the DataFrame API's dependency on pandas), so we need to make sure to explicitly test with multiple different versions. Does Airflow have any similar issues?
Yes we do (all of those, I think :) ). The complete set of all our deps can be found here: https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt (continuously updated, and we have different sets for different Python versions). We took a rather interesting and unusual approach (more details in my talk) - mainly because Airflow is both an application to install (for users) and a library to use (for DAG authors), and the two have contradicting expectations (installation stability versus flexibility in upgrading/downgrading dependencies). Our approach is really smart in making sure water and fire play well with each other.

Most of those dependencies come from optional extras (list of all extras here: https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html). More often than not, the "problematic" dependencies you mention are transitive dependencies through some client libraries we use (the Apache Beam SDK, for example, is a big contributor to those :) ). Airflow "core" itself has far fewer dependencies - https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt (175 currently) - and we actively made sure that all the "pandas" of this world are only optional extra deps.

Now - the interesting thing is that we use "constraints" (the links with dependencies that I posted above are those constraints) to pin the versions of the dependencies that are "golden" - i.e. we test those continuously in our CI, and we automatically upgrade the constraints when all the unit and integration tests pass. There is a little bit of complexity and sometimes conflicts to handle (as `pip` has to find the right set of deps that will work for all our optional extras), but eventually we have exactly one "golden" set of constraints at any moment in time in main (or a v2-x branch - we have a separate set for each branch). And this is the only set of dependency versions that Airflow gets tested with. Note - these are *constraints*, not *requirements* - that makes a whole world of difference.

Then, when we release Airflow, we "freeze" the constraints with the version tag. We know they work because all our tests pass with them in CI. We then communicate to our users (and we use it in our Docker image) that the only "supported" way of installing Airflow is using `pip` with constraints: https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html. We do not support poetry or pipenv - we leave it up to users to handle those (until poetry/pipenv support constraints - which we are waiting for, and there is an issue where I explained why it is useful). It looks like this: `pip install "apache-airflow==2.3.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"` (with different constraints for the Airflow version and Python version you have).

Constraints have this nice feature that they are only used during the "pip install" phase and are thrown out immediately after the install is complete. They do not create "hard" requirements for Airflow. Airflow still has "lower-bound" limits for a number of dependencies, but we try to avoid putting upper bounds at all (only in specific cases, and we document them), and our bounds are rather relaxed.
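To make the install-time-only nature of constraints concrete, here is a minimal sketch (the upgraded package below is just an illustrative example - any dependency within Airflow's bounds behaves the same way):

```
# Install with the "golden" constraints for this Airflow/Python combination:
pip install "apache-airflow==2.3.4" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"

# The constraint file is consulted only during that install - nothing from it
# is recorded in the installed package metadata. Later you are free to move
# any dependency (as long as it stays within Airflow's own relaxed bounds):
pip install --upgrade pandas
```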
This way we achieve three things:

1) When someone does not use constraints and has a problem with a broken dependency, we tell them to use constraints - this is what we as a community commit to and support.

2) By using the constraints mechanism we do not limit our users if they want to upgrade or downgrade any dependencies. They are free to do it (as long as it fits the - rather relaxed - lower/upper bounds of Airflow). But "with great powers come great responsibilities" - if they want to do that, THEY have to make sure that Airflow will work. We make no guarantees there.

3) We are not limited by the 3rd-party libraries that come as extras - if you do not use those, their limits do not apply (see the sketch in the PS below).

I think this works really well - but it is rather complex to set up and maintain. I built a whole complex set of scripts, and I have the whole `breeze` ("It's a breeze to develop Airflow" is the theme) development/CI environment based on docker and docker-compose that allows us to automate all of that.

J.
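PS. To make point 3) concrete - a quick sketch, assuming the `pandas` extra from the extras reference linked above (the versions are illustrative):

```
# Plain "core" install: pandas is not pulled in at all, so its limits do not apply
pip install "apache-airflow==2.3.4" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"

# Opt in to the extra - only then do pandas and its version bounds enter the picture
pip install "apache-airflow[pandas]==2.3.4" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"
```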