Hi Kamil,

Direct database access for users is a no go. Direct queries that in anyway
manipulate the database or any state in Airfl ow (trigger dags, connections
etc) from the CLI should be prevented. Not even root should be able to do
this as root does not have any meaning over the wire to Airflow. We cannot
trust the user. Period. I agree with you that the current state of the API
is abysmal. Hence I mentioned using Flask Application Builder framework for
this which does do fine-grained permissions and simplifies what we try to
accomplish.

I fundamentally disagree with your position that remote working with
Airflow should be done through the Web UI. Unix first and foremost
interface is the cli. I can use pipes, automation with other tools etc. I
use git from the command line. I use curl from the command line. If
triggering a dag becomes impossible from the command line we loose one of
our main selling points.

On your points:

1-3: Solved by using FAB
4. You main points are solved by using FAB. On B) I don’t think it is a
good idea to have an extra layer between json_client and local_client. Both
are pretty thin wrappers around `airflow.api.common.*`.
5. Non existent ;-)
6. I do agree that there is some code duplication going on, but its
intention was there in order to have it in place until `local_client`
wasn’t required anymore and the Rest API is mature. So mature the API, get
rid of local_client.
7. Having the CLI manipulate `airflow.cfg` is fine as long as it checks if
it has the right permissions to do so, the OS will enforce that anyway.
8. They are not in the same namespace. `airflow.api.client.*` vs
`airflow.www.api`. If you mean by ‘package’ ‘Airflow’, then I suggest maybe
packaging the API/Client separately of break up the Airflow package in
“airflow-common”, “airflow-client”, “airflow-server”.
9. That’s not unexpected: the interface is documented whilst not perfectly.
The CLI *should* use the public API. As in eat you own dog food. The (json)
client api resides in its own namespace which does *not* import Airflow at
all. So interfacing clients should not use the local_client.

In my opinion you are mixing several issues. The CLI has its issues, the
API has its issues and maybe our packaging has its issues. We should
address them separately.

Cheers
Bolke


On 18 January 2020 at 16:41:21, Kamil Breguła (kamil.breg...@polidea.com)
wrote:

In my opinion, we should go completely in the opposite direction. We
should limit remote mode from the current CLI and add support only
commands that use direct queries to the database as long term
solution. If users need remote mode, we can add these options, but
this should be marked with stability and safe warnings.

I don't like the current remote mode in CLI for several reasons:
1) The default API security configuration makes all API data public
2) We do not provide sufficient security mechanisms to secure the API.
Supported mechanisms are:
a) deny_all
b) accept_all
b) kerberos
This prevents many users from using it in production systems if
Kerberos is not used. If someone even unknowingly tries to use it will
be unsafe in too large case.
3) The current API does not allow you to limit permissions so that the
user can only use the selected feature. If the user has access to the
API, he can do anything.
4) An abstraction leak between json_client and local_client.
a) The API Clients have no defined exceptions.
This seriously hinders the development of CLI. In the past I wanted to
fix CLI so that unhandled exceptions would not appear, but too
much-complicated
b) The API Clients have no defined return types
This should be resolved by introducing an additional layer between
json_client and local_client. It should serialize and deserialize
response to common type. It should not be possible to return a
partially correct value e.g. without the key.
6) This forces duplicate code. This is dangerous because one method
may not be sufficiently tested and thus behave unexpectedly.
7) Not everything should be possible via the API. e.g. Tomek Urbaszek
is now working to add new commands that allow editing of the
airflow.cfg file.
PR: https://github.com/apache/airflow/pull/7130
8) We do not have sufficient separation of the API client and API
server. The API client and API server are in the same python package.
This seriously threatens the interoperability of this API. It's now
very easy to build an API that will return data in an unexpected
format.
9) This means that people who want to use the API will only use the
specified client API. The same API client that is used by the CLI,
which is very unexpected because it has severe limitations. The
biggest problem is that it's in the airflow package. Importing this
package has a lot of side effects. I don't know about everything that
happens, but I have two main issues:
a) a lot of new modules are being loaded into memory.
>>> len(sys.modules)
85
>>> import airflow
[2020-01-18 14:22:07,030] {settings.py:179} INFO -
settings.configure_orm(): Using pool settings. pool_size=5,
max_overflow=10, pool_recycle=1800, pid=136
>>> len(sys.modules)
1635
This problem has been reduced, but completely resolved by this PR.
https://github.com/apache/airflow/pull/6594

b) Changes logger configurations:
>>> import logging
>>> logging._handlers.data
{}
>>> import airflow
[2020-01-18 14:29:36,809] {settings.py:179} INFO -
settings.configure_orm(): Using pool settings. pool_size=5,
max_overflow=10, pool_recycle=1800, pid=150
>>> logging._handlers.data
{'console': <weakref at 0x7fd26e20aa40; to 'RedirectStdHandler' at
0x7fd26e1f1b38>, 'processor': <weakref at 0x7fd26e20ac50; to
'FileProcessorHandler' at 0x7fd26e435cf8>, 'task': <weakref at
0x7fd26e20ad00; to 'FileTaskHandler' at 0x7fd26e435f60>}
Changing the configuration of the logger in an unexpected way may
cause that some event will not be logged, which is harmful to the
audibility of system. This API Client cannot be used on financial
systems.

We can solve these problems at the code level, but I think we should
solve them at the design level. This is an open source project and we
cannot require users to be aware of how their change fully affects
applications. Now if we added the code in one place it affects the
client, server and CLI. The design should prevent such a situation.
This is possible through code isolation and appropriate integration
tests.

In my opinion, using this CLI in remote mode is too dangerous to be
used by end-users. The only secure way to remotely manage Airflow is
to use Web UI. Investing time in a solution whose design is so
dangerous can bring too many unexpected problems.

I suggest that we do not develop remote mode in current CLI, and even
delete it in fear future. Then we began to build the necessary
components step by step.
a) Build an API that meets the basic requirements:
- Secure: It should provide security mechanisms such as permission
levels, secure authentication.
- Interoperable: Using the API should be possible for any client. The
API should not be technology-dependent.
- Stable: The API should not undergo a big change between versions.
b) Build an API client. It is best if the client is generated
automatically.
Example: https://github.com/kubernetes-client/gen
c) We should build a CLI that provides access to the API in a secure
manner.

Of course, we can try to build temporary solutions if the users need
it, but this should be very limited and should have a warning about
the level of stability and security. So if the user needs it can add
commands that support remote mode, but the user should be aware that
this is not a stable and secure solution.

I have plans for Q1 to build a stable API that does not have the above
issues. If God allows me, I will start working on the client and CLI
in Q2, but this is not confirmed. Only work on a stable API is planned
for now. I will provide more information (including design docs/AIP)
when I finish working on AIP-21. AIP-21 is very much expected by my
clients and I want to focus on it first.

Thank you Bolke very much for raising the discussion about it. It is
definitely a component that requires a lot of attention and I hope
that we will be able to work out a solution that will satisfy
everyone. Thank you also for adding this mode to the project, because
I know that it solves the problem of many users, but unfortunately my
clients cannot use it and want to develop a solution that can be used
by current and future users.


On Sat, Jan 18, 2020 at 1:46 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> My suggestion is to have new funtionality (or even updates maybe) to
require implementing an API in local and json. That's an intermediate step
to stop allowing direct DB access.
>
> Basically I'm saying no new and updated CLI functionality is allowed to
use "airflow.utils.session" directly and needs to use "airflow.api.common"
(or alike).
>
> This can be done now.
>
> Recent FAB has some nice stuff for Rest APIs including authentication.
This integrates well with the rest of the wee stuff we have. Although i
would love to see a grpc implementation.
>
> B.
>
> Sent from my iPhone
>
> > On 18 Jan 2020, at 13:19, Jarek Potiuk <jarek.pot...@polidea.com>
wrote:
> >
> > Absolutely yes. I think this is really what is in the current
plans/roadmap
> > :).
> >
> > It's just a matter of when and how to enforce it. The current
experimental
> > API is well .... still experimental.
> >
> > What we really need to do is implement much more complete API
> > support/approach - which is on the Airllow 2.0
> > roadmap. AFAIK - Kamil is going to start discussions and make some
> > proposals for that this week.
> > And decoupling CLI from DB is rather high on the list for the API I
believe.
> >
> > I think it's really important to solve it "well" - i.e. introduce
flexible
> > authorisation/authentication mechanism to the api,
> > have a way to decouple both client and web server from database
operations
> > and eventually decouple
> > workers from DB access. This is our ultimate goal and I think we should
> > define a broader picture/target now
> > (i.e. how to design the API so that it serves all that options in the
> > future) and have a plan on how to gradually
> > introduce it so that several contributors/commiters can take part in
this
> > process. One of the sequences
> > of introducing the API might be for example:
> >
> > 1) either graduate existing experimental API to be "official" or
introduce
> > a new "official"
> > API solution if we find it better
> > 2) reimplement all CLI commands to use the new API
> > 3) Reimplement web server to use the new API.
> > 4) (very long term) decouple workers from the database.
> >
> > I think forbidding CLI to access database should happen between 1) and
2) -
> > when we have an "official" API solution
> > in place (and it should be automatically verified - we can easily add
> > pylint plugin that can check CLI package for
> > db usage). In my opinion we cannot expect people to use API until it
goes
> > out of experimental/ we have a viable
> > stable long term alternative agreed.
> >
> > Once this is in place - no new CLI command should be allowed with the
> > direct DB access.
> >
> > J.
> >
> >
> >> On Sat, Jan 18, 2020 at 12:59 PM Bolke de Bruin <bdbr...@gmail.com>
wrote:
> >>
> >> Hi All,
> >>
> >> I’ve noticed that we are still implementing new features or are doing
> >> refactoring of CLI commands that directly interface with the database
> >> instead of using the abstractions that should be made available from
the
> >> API specification. Why is this an issue? The CLI is used by arbitrary
user
> >> to interface with Airflow operations. Airflow relies on the database
to be
> >> its single source of truth. A user that is able to read the
configuration
> >> of Airflow is currently able to manipulate the database. The CLI
requires
> >> database access hence the information to deal with the database is in
the
> >> configuration file. To improve security the CLI should use a rest API
which
> >> allows for proper authn/authz and segregation of duties.
> >>
> >> In the past I have introduced the experimental API with a
“local_client”
> >> and “json_client” implementation. The local_client still allows for
direct
> >> database access and its only function is to be there during the
transition
> >> period to have the full rest api available. After that it should be
> >> deprecated and removed.
> >>
> >> My suggestion is to disallow any new functionality in the CLI that
directly
> >> relies on “airflow.utils.session” and only allow new functionality to
go
> >> through the API client. For now that would mean 2 implementations:
local
> >> and json. Of course refactoring the current state should be on the
list in
> >> order to remove the “local_client”.
> >>
> >> The API client should be available to other packages as well. So maybe
we
> >> should package the cli and client api implementations into
> >> “airflow-client”.
> >>
> >> What are your thoughts?
> >>
> >> Thanks
> >> Bolke
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>

Reply via email to