I understand the security concerns, and generally agree, but as a regular
user I always wished we could upload DAG files via an API. It opens the
door to an upload button, which would be nice. It would make Airflow a
lot more accessible to non-engineering types.

I love the idea of implementing a manual review option in conjunction with
some sort of hook (similar to Airflow cluster policies) as a middle
ground. An administrator could use that hook to run checks against
submitted DAGs or invoke security scanners, and decide whether or not to
enforce a review requirement.
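As a rough sketch of what such a hook could look like: the function name
`needs_manual_review`, the deny-list, and the idea of gating uploads on it
are all hypothetical (Airflow has no upload-policy hook today); it is only
meant to show the shape of the check, modeled loosely on cluster policies.

```python
# Hypothetical upload-review hook, loosely modeled on Airflow cluster
# policies. Nothing here is a real Airflow API; it only sketches how an
# administrator-supplied check on an uploaded DAG file could work.
import ast

# Illustrative deny-list of modules an admin might flag for review.
DISALLOWED_MODULES = {"subprocess", "socket", "ctypes"}

def needs_manual_review(dag_source: str) -> bool:
    """Return True if the uploaded DAG source should be held for review.

    Flags any import of a deny-listed module; an administrator could
    extend this with security scanners or other policy checks.
    """
    tree = ast.parse(dag_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom):
            names = {(node.module or "").split(".")[0]}
        else:
            continue
        if names & DISALLOWED_MODULES:
            return True
    return False
```

The API endpoint would call this before registering the DAG: if it returns
True, the DAG lands in an approval queue instead of being scheduled.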

On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org>
wrote:

> In general I second what XD said. CI/CD feels better than sending DAG
> files over API and the security issues arising from accepting "any python
> file" are probably quite big.
>
> However, I think this proposal can be tightly related to "declarative
> DAGs". Instead of sending a DAG file, the user would send the DAG
> definition (operators, inputs, relations) in a predefined format that is
> not code. This of course has some limitations, like the inability to
> define custom macros or callbacks on the fly, but it may be a good
> compromise.
>
> Other thought - if we implement something like "DAG via API" then we
> should consider adding an option to review DAGs (approval queue etc) to
> reduce the security issues that are otherwise mitigated by, for example,
> deploying DAGs from git (where we have code review, security scanners etc).
>
> Cheers,
> Tomek
>
> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>
>> Hi Mocheng,
>>
>> Please allow me to share a question first: so in your proposal, the API
>> in your plan is still accepting an Airflow DAG as the payload (just
>> binarized or compressed), right?
>>
>> If that's the case, I may not be fully convinced: the objectives in your
>> proposal are automation & programmatically submitting DAGs. These can
>> already be achieved in an efficient way through CI/CD practices plus a
>> centralized place to manage your DAGs (e.g. a Git repo to host the DAG
>> files).
>>
>> As you are already aware, allowing this via the API adds additional
>> security concerns, and I doubt whether that "breaks even".
>>
>> Kindly let me know if I have missed anything or misunderstood your
>> proposal. Thanks.
>>
>>
>> Regards,
>> XD
>> ----------------------------------------------------------------
>> (This is not a contribution)
>>
>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I have an enhancement proposal for the REST API service. This is based
>>> on the observation that Airflow users want to be able to access Airflow
>>> more easily as a platform service.
>>>
>>> The motivation comes from the following use cases:
>>> 1. Users like data scientists want to iterate over data quickly with
>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>> 2. Services targeting specific audiences can generate DAGs based on
>>> inputs like user commands or external triggers, and they want to be able
>>> to submit DAGs programmatically without manual intervention.
>>>
>>> I believe such use cases would improve Airflow's usability and broaden
>>> its adoption. The existing DAG repo model brings considerable overhead
>>> for such scenarios: a shared repo requires offline processes and can be
>>> slow to roll out.
>>>
>>> The proposal aims to provide an alternative where a DAG can be
>>> transmitted online. Here are some key points:
>>> 1. A DAG is packaged individually so that it can be distributed over
>>> the network. For example, a DAG may be a serialized binary or a zip file.
>>> 2. The Airflow REST API is the ideal place to talk with the external
>>> world. The API would provide a generic interface to accept DAG artifacts
>>> and should be extensible to support different artifact formats if needed.
>>> 3. DAG persistence needs to be implemented, since such DAGs are not
>>> part of the DAG repository.
>>> 4. DAGs submitted via the API behave the same as those defined in the
>>> repo, i.e. users write DAGs in the same syntax, and their scheduling,
>>> execution, and web server UI should behave the same way.
>>>
>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>> may pose high security risks. Here are a few proposals to mitigate those
>>> risks:
>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>> pluggable authentication modules where strong authentication such as
>>> Kerberos can be used.
>>> 2. Execute DAG code as the API identity, i.e. a DAG created through the
>>> API service will have run_as_user set to the API identity.
>>> 3. To enforce data access control on DAGs, the API identity should also
>>> be used to access the data warehouse.
>>>
>>> We shared a demo based on a prototype implementation at the summit, and
>>> some details are described in this ppt
>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>.
>>> We would love to get feedback and comments from the community about this
>>> initiative.
>>>
>>> thanks
>>> Mocheng
>>>
>>

-- 

Constance Martineau
Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>