That's not what I meant. If I apply what I meant to your example, we'd have
a single package for each hook (`airflow-hook-s3` and `airflow-hook-gcs`)
and a package for `airflow-operator-s3-to-gcs`. The operator package would
depend on both hook packages.
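
To make this concrete, here's a minimal sketch of what the operator
package's setup.py could look like (the package names and layout are
illustrative assumptions, not an agreed naming scheme):

    # airflow-operator-s3-to-gcs/setup.py (hypothetical)
    from setuptools import setup, find_packages

    setup(
        name="airflow-operator-s3-to-gcs",
        version="0.1.0",
        packages=find_packages(),
        # The operator package declares no cloud SDK dependencies itself;
        # boto3 and google-cloud-storage come in transitively through the
        # two hook packages, each of which pins its own client library.
        install_requires=[
            "airflow-hook-s3",
            "airflow-hook-gcs",
        ],
    )

A new parameter on S3Hook then lives in exactly one place (airflow-hook-s3),
and the transfer operator picks it up through a normal dependency upgrade.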

There's no code or test duplication there. If fancy mocking solutions are
defined for tests, they should be exposed in the hook packages and can be
reused in the operator packages.
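
As a sketch of that reuse (the module path and helper name here are made up
for illustration), the S3 hook package could ship its mock alongside the
hook:

    # airflow_hook_s3/testing.py (hypothetical module in airflow-hook-s3)
    from unittest import mock

    def mock_s3_hook(existing_keys=None):
        """Return a MagicMock standing in for S3Hook, preloaded with keys."""
        hook = mock.MagicMock(name="S3Hook")
        hook.list_keys.return_value = list(existing_keys or [])
        return hook

The operator package's tests would then import mock_s3_hook from
airflow_hook_s3.testing instead of redefining their own S3 mock.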

Max

On Wed, Jan 9, 2019 at 11:00 PM airflowuser
<airflowu...@protonmail.com.invalid> wrote:

> @Max I don't see how this is doable.
> Consider S3ToGoogleCloudStorageOperator.
> It uses both S3Hook and GoogleCloudStorageHook.
>
> With your suggestion we have to maintain S3Hook in each separate package
> per operator/sensor. Which means, for example, that if a new parameter is
> added to any of the hooks you have to add it in dozens of places (+tests).
> This is very inconvenient.
>
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > If there's a strict policy of having a single hook and a single operator
> > per package, then the hook package would be the only place where the
> > external dependency is defined, and the operator packages would depend on
> > hook package(s). That would follow the "micro package" philosophy and
> > could work pretty well. Every hook and operator can have its own set of
> > maintainers, test/CI and release cadence.
> >
> > There can even be packages composing common operators, as in
> > "airflow-hadoop-operators" or "airflow-common-databases-operators", and,
> > for backward compatibility during a transition phase, Airflow could
> > depend on a new "airflow-backward-compatibility-operators", though
> > ultimately we should encourage people to come up with the right
> > packages/operators for their environment instead.
> >
> > Max
> >
> > On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall felix.uellend...@gmx.de
> > wrote:
> >
> > > Regardless of how complex this implementation would be, I am +1 on this.
> > > From the developer's point of view, the fact that the CI would run so
> > > much faster is the biggest plus for me. I think it will only become
> > > worse the more dependencies we add.
> > > From the user's point of view, the fact that I am able to choose from
> > > multiple packages/repositories only the ones I really want to use, and
> > > that I know any contributor of that repo can probably answer my
> > > questions related to these hooks/operators, is the biggest plus for me.
> > > I know there would be a lot to do and this gives me headaches even now,
> > > but in the end I think it would be a great change that is necessary in
> > > the long run.
> > >
> > > -   feluelle
> > >
> > > On 08/01/2019 at 17:42, Jarek Potiuk wrote:
> > >
> > > > While splitting the monolithic Airflow architecture into pieces sounds
> > > > good, there is one problem that might be difficult to tackle (or rather
> > > > impossible unless we change the architecture of Airflow significantly)
> > > > - namely dependencies/requirements.
> > > >
> > > > The way Airflow uses operators is that its operators are already
> > > > closely coupled with the Airflow core. Airflow has to parse all the
> > > > operators within the same Python interpreter/virtual machine as the
> > > > core Airflow. This means a potentially big problem with
> > > > dependency/requirement handling if we have multiple packages. There are
> > > > enough common/shared dependencies that various operators use to cause
> > > > occasional headaches even now. We already have quite a challenge with
> > > > handling dependencies of Airflow and its operators/hooks when they are
> > > > part of the Airflow repo.
> > > > Currently the problem is that Airflow sometimes uses outdated
> > > > dependencies or that some random transitive dependencies break the
> > > > Airflow installation. But we at least have a common dependency list
> > > > that we work against for all operators. Unfortunately, if we split, the
> > > > problem will be worse - very quickly some contrib operators will
> > > > require different dependencies and will not be compatible with Airflow
> > > > or will break Airflow's behaviour.
> > > > Not to mention the problem when you want to use hooks from some other
> > > > "area" in your operator. Currently hooks are the way you can speed up
> > > > development of cross-area behaviour. You implement hooks in some "area"
> > > > and other "areas" are free or even encouraged to use them. For example,
> > > > exporting from BigQuery to all cloud storages in principle should
> > > > depend on hooks for every single cloud storage package out there
> > > > (Google, Azure, AWS). This is even worse than the MySqlToHive case
> > > > described earlier - very quickly we would end up with a totally
> > > > unmanageable mesh of cross-dependencies.
> > > > I think to really make operators independent from the Airflow core, we
> > > > would need to allow the dependencies to be fully isolated - i.e. to
> > > > allow operators to have a different set of dependencies than the core.
> > > > That's quite impossible with the current Airflow approach where the
> > > > same operator code is parsed in the core. And the same code is used
> > > > during execute in the worker. And the same code might be used by
> > > > another operator in the form of a hook. Unfortunately we are not in the
> > > > npm https://npm.github.io/how-npm-works-docs/index.html world (as Kamil
> > > > Breguła pointed out to me today) where the module loader handles
> > > > multiple versions of the same library in the same process.
> > > > One other question that bothers me - I believe (please correct me if I
> > > > am wrong) some of the operators are using some core features of Airflow
> > > > and are even more tied to the core. For example, it is perfectly fine,
> > > > I believe, for an operator to use Airflow's SQLAlchemy ORM classes and
> > > > run queries/perform updates in Airflow's metadata database. As far as I
> > > > know there is a requirement (I saw this somewhere at least) that Celery
> > > > or Kubernetes workers need to be able to open a direct database
> > > > connection to the metadata database of Airflow, and there is nothing to
> > > > prevent the operators from doing it. This in essence means that the
> > > > operator has to depend on many core dependencies/requirements
> > > > (including sqlalchemy, postgres/mysql, ...). Using Airflow's core
> > > > features from operators could be changed and "forbidden", but it might
> > > > break compatibility (if I am right about it).
> > > > We could imagine a different approach - where an operator is split into
> > > > "Proxy" and "Execute" classes: the "Proxy" runs within the Core's
> > > > interpreter with the Core's dependencies, and the "Execute" within the
> > > > worker's. Then each task could run in its own Docker image/Pod on
> > > > Kubernetes with its own dependencies. But that looks like a big,
> > > > backwards-incompatible change and it still does not solve
> > > > cross-dependencies between different "areas". For handling cross-area
> > > > operations we would somehow have to implement communication between
> > > > different containers - each having its own dependencies. That would be
> > > > possible in Kubernetes by having a single Pod with several containers
> > > > sharing common data and communicating. Seems possible.
> > > >
> > > > It's quite an entertaining idea, but it sounds like Airflow 3.0
> > > > already, and one that is not really backwards compatible ;).
> > > >
> > > > J.
> > > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast sw...@google.com.invalid
> > > > wrote:
> > > >
> > > > > > I don't see it solving any problem other than test speed (which is
> > > > > > a big one, yes) but doesn't reduce the amount of workload on the
> > > > > > committers.
> > > > >
> > > > > It's about distributed ownership. For example, I'm not a committer in
> > > > > pandas, but I am the primary maintainer of pandas-gbq. You're right
> > > > > that if the set of committers is the same for all 24 repos, there
> > > > > isn't all that much benefit beyond testing speed.
> > > > >
> > > > > > Each sub-project would still have to follow the normal Apache
> > > > > > voting process.
> > > > >
> > > > > Presumably the set of people that care about the sub-packages will
> > > > > be smaller. I don't know enough about the Apache voting process to
> > > > > know how that might affect it.
> > > > > Maybe many of the sub-packages can live outside the Apache org?
> > > > > Pandas keeps the I/O sub-packages in a different org, for example.
> > > > >
> > > > > > Google could choose to release an airflow-gcp-operators package now
> > > > > > and tell people to |from gcp.airflow.operators import
> > > > > > SomeNewOperator|.
> > > > >
> > > > > That's actually part of my motivation for this proposal. I've got
> > > > > some red tape to get through, but ideally the proposed airflow-google
> > > > > repository in AIP-8 would actually live in the GoogleCloudPlatform
> > > > > org.
> > > > > Maybe I should decrease the scope of AIP-8 to Google hooks/operators?
> > > > >
> > > > > > There is nothing stopping someone /currently/ creating their own
> > > > > > operators package.
> > > > >
> > > > > Hooks still need some support in core, so that connections can be
> > > > > configured. Also, the fact that so many operators live in the Airflow
> > > > > repo makes it seem like an operator is less supported / a hack if it
> > > > > doesn't live there.
> > > > >
> > > > > > How will we ensure that core changes don't break any
> > > > > > hooks/operators?
> > > > > Pandas does this by running tests in the I/O repos against the pandas
> > > > > master branch in addition to against supported releases.
> > > > >
> > > > > > How do we support the logging backends for s3/azure/gcp?
> > > > > I don't see any reason we can't keep doing what we're already doing:
> > > > >
> > > > > https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
> > > > >
> > > > > We'd need to adjust the import path for the hook, but so long as the
> > > > > upload / download method remains stable, it'll work the same. The
> > > > > sub-package will need to ensure it tests the logging code path in
> > > > > addition to testing DAGs that use the relevant operators.
> > > > >
> > > > > • Tim Swast
> > > > > • Software Friendliness Engineer
> > > > > • Google Cloud Developer Relations
> > > > > • Seattle, WA, USA
> > > > >
> > > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor a...@apache.org
> > > > > wrote:
> > > >
> > > > > > Can someone explain to me how having multiple packages will work
> > > > > > in practice?
> > > > > > How will we ensure that core changes don't break any
> > > > > > hooks/operators?
> > > > > > How do we support the logging backends for s3/azure/gcp?
> > > > > > What would the release process be for the "sub"-packages?
> > > > > > There is nothing stopping someone /currently/ creating their own
> > > > > > operators package. There is nothing whatsoever special about the
> > > > > > |airflow.operators| package namespace, and for example Google could
> > > > > > choose to release an airflow-gcp-operators package now and tell
> > > > > > people to |from gcp.airflow.operators import SomeNewOperator|.
> > > > > > My view on this currently is -1, as I don't see it solving any
> > > > > > problem other than test speed (which is a big one, yes) but doesn't
> > > > > > reduce the amount of workload on the committers - rather it
> > > > > > increases it by having a more complex release process (each
> > > > > > sub-project would still have to follow the normal Apache voting
> > > > > > process) and having 24 repos to check for PRs rather than just 1.
> > > > > > Am I missing something?
> > > > > > ("Core" vs "contrib" made sense when Airflow was still under
> Airbnb, we
> > > > > > should probably just move everything from contrib out to core pre
> > > > > > 2.0.0)
> > > >
> > > > > > -ash
> > > > > > airflowuser wrote on 08/01/2019 15:44:
> > > > > >
> > > > > > > I think the operator should be placed by its source.
> > > > > > > If it's MySQLToHiveOperator then it would be placed in the MySQL
> > > > > > > package.
> > > > > > > The BIG question here is whether this serves an actual
> > > > > > > improvement, like faster deployment of hook/operator bug-fixes to
> > > > > > > Airflow users (faster than an actual Airflow release), or whether
> > > > > > > this is a merely cosmetic issue.
> > > > > >
> > > > > > > I assume that this also covers the unnecessary separation of
> > > > > > > core and contrib.
> > > > > > > Sent with ProtonMail Secure Email.
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin <
> > > > > > > maximebeauche...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Something to think about is how data transfer operators like
> > > > > > > > the MysqlToHiveOperator usually rely on 2 hooks. With a
> > > > > > > > package-specific approach that may mean something like
> > > > > > > > `airflow-hive`, `airflow-mysql` and `airflow-mysql-hive`
> > > > > > > > packages, where the `airflow-mysql-hive` package depends on the
> > > > > > > > two other packages.
> > > > > > > > It's just a matter of having a clear strategy, good naming
> > > > > > > > conventions and a nice central place in the docs that
> > > > > > > > centralizes a list of approved packages.
> > > > > > > >
> > > > > > > > Max
> > > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast sw...@google.com.invalid
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > > I've created AIP-8:
> > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
> > > > > > > > >
> > > > > > > > > To follow up from the discussion about splitting
> > > > > > > > > hooks/operators out of the core Airflow package at
> > > > > > > > > http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>
> > > > > > >
> > > > > > > > > I propose packaging based on the target system, informed by
> > > > > > > > > the existing hooks in both core and contrib. This will allow
> > > > > > > > > those with the relevant expertise in each target system to
> > > > > > > > > respond to contributions / issues without having to follow the
> > > > > > > > > flood of everything Airflow-related. It will also decrease the
> > > > > > > > > surface area of the core package, helping with testability and
> > > > > > > > > long-term maintenance.
> > > > > > > > >
> > > > > > > > > • Tim Swast
> > > > > > > > > • Software Friendliness Engineer
> > > > > > > > > • Google Cloud Developer Relations
> > > > > > > > > • Seattle, WA, USA
>
>
>
