I understand the security concerns, and generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.
I love that idea. Implementing a manual review option in conjunction with some sort of hook (similar to Airflow cluster policies) would be a good middle ground. An administrator could use that hook to run checks or security scanners against submitted DAGs, and decide whether or not to enforce a review requirement.

On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:

> In general I second what XD said. CI/CD feels better than sending DAG
> files over an API, and the security issues arising from accepting "any
> Python file" are probably quite big.
>
> However, I think this proposal can be tightly related to "declarative
> DAGs". Instead of sending a DAG file, the user would send the DAG
> definition (operators, inputs, relations) in a predefined format that is
> not code. This of course has some limitations, like the inability to
> define custom macros or callbacks on the fly, but it may be a good
> compromise.
>
> Other thought: if we implement something like "DAG via API", then we
> should consider adding an option to review DAGs (approval queue etc.) to
> reduce the security issues that are otherwise mitigated by, for example,
> deploying DAGs from git (where we have code review, security scanners,
> etc.).
>
> Cheers,
> Tomek
>
> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>
>> Hi Mocheng,
>>
>> Please allow me to share a question first: in your proposal, the API
>> is still accepting an Airflow DAG as the payload (just binarized or
>> compressed), right?
>>
>> If that's the case, I may not be fully convinced: the objectives in
>> your proposal are automation and programmatically submitting DAGs.
>> These can already be achieved efficiently through CI/CD practices plus
>> a centralized place to manage your DAGs (e.g. a Git repo hosting the
>> DAG files).
>>
>> As you are already aware, allowing this via API adds additional
>> security concerns, and I would doubt whether that "breaks even".
>>
>> Kindly let me know if I have missed anything or misunderstood your
>> proposal. Thanks.
>>
>> Regards,
>> XD
>> ----------------------------------------------------------------
>> (This is not a contribution)
>>
>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I have an enhancement proposal for the REST API service. It is based
>>> on the observation that Airflow users want to be able to access
>>> Airflow more easily as a platform service.
>>>
>>> The motivation comes from the following use cases:
>>> 1. Users like data scientists want to iterate over data quickly with
>>> interactive feedback in minutes, e.g. managing data pipelines inside a
>>> Jupyter notebook while executing them in a remote Airflow cluster.
>>> 2. Services targeting specific audiences can generate DAGs based on
>>> inputs like user commands or external triggers, and they want to be
>>> able to submit DAGs programmatically without manual intervention.
>>>
>>> I believe such use cases would improve Airflow's usability and help
>>> it gain popularity with users. The existing DAG repo brings
>>> considerable overhead for such scenarios: a shared repo requires
>>> offline processes and can be slow to roll out.
>>>
>>> The proposal aims to provide an alternative where a DAG can be
>>> transmitted online. Here are some key points:
>>> 1. A DAG is packaged individually so that it can be distributed over
>>> the network. For example, a DAG may be a serialized binary or a zip
>>> file.
>>> 2. The Airflow REST API is the ideal place to talk to the external
>>> world. The API would provide a generic interface to accept DAG
>>> artifacts and should be extensible to support different artifact
>>> formats if needed.
>>> 3. DAG persistence needs to be implemented, since these DAGs are not
>>> part of the DAG repository.
>>> 4. DAGs submitted via the API behave the same as those defined in the
>>> repo, i.e.
>>> users write DAGs in the same syntax, and their scheduling, execution,
>>> and web server UI should behave the same way.
>>>
>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>> may pose high security risks. Here are a few proposals to prevent a
>>> security breach:
>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>> pluggable authentication modules, where strong authentication such as
>>> Kerberos can be used.
>>> 2. Execute DAG code as the API identity, i.e. a DAG created through
>>> the API service will have run_as_user set to the API identity.
>>> 3. To enforce data access control on DAGs, the API identity should
>>> also be used to access the data warehouse.
>>>
>>> We shared a demo based on a prototype implementation at the summit;
>>> some details are described in this ppt
>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>> and we would love to get feedback and comments from the community
>>> about this initiative.
>>>
>>> thanks
>>> Mocheng

--
Constance Martineau
Product Manager
Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
<https://www.astronomer.io/>
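To make the review-hook idea in this thread concrete: a pre-acceptance check could statically scan an uploaded DAG file and flag it for manual review before it ever reaches the scheduler. The sketch below is purely illustrative and uses only the Python standard library; the function name and the list of banned calls are assumptions, not an existing Airflow API.

```python
# Hypothetical review hook for a DAG-upload endpoint: statically scan the
# submitted DAG source for obviously dangerous calls. Names here are
# illustrative only; a real deployment would plug in proper security scanners.
import ast

# Calls that should trigger manual review (illustrative, not exhaustive).
BANNED_CALLS = {"eval", "exec", "system", "popen"}

def needs_manual_review(dag_source: str) -> bool:
    """Return True if the DAG source should be queued for human review."""
    tree = ast.parse(dag_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both bare names (eval(...)) and attributes (os.system(...)).
            name = getattr(func, "id", getattr(func, "attr", None))
            if name in BANNED_CALLS:
                return True
    return False
```

An administrator could wire such a check into the upload API so that flagged DAGs land in an approval queue rather than being rejected outright, much as Airflow cluster policies can veto DAGs at parse time today.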