To add another DAG author perspective, I'd vote for:
#1 (from airflow ...) but without side-effects
or #3 (from airflow.sdk ...).

To compare with other X-as-code tools (a rough side-by-side sketch follows the list):
- Luigi has top-level *luigi.Task* (class-based)
- Prefect has top-level *from prefect import flow, task* (and seems to
refer to it as an SDK <https://docs-3.prefect.io/3.0/api-ref/index>)
- Dagster has top-level *from dagster import asset*
- PySpark doesn't have a top-level import, but uses specific names (e.g. *from
pyspark.sql import SparkSession*)
- Pulumi seems to be top-level, kind of: *from pulumi_<provider> import
<resource>*
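
For a rough side-by-side of those import surfaces (illustrative sketch only -
the entry points are each tool's real public names, but the snippets aren't
meant to run as one program):

```
# Prefect: flat top-level imports (documented as an SDK)
from prefect import flow, task

# Dagster: flat top-level imports
from dagster import asset

# PySpark: no flat top level - specific submodule paths instead
from pyspark.sql import SparkSession

# Pulumi: top-level, but per-provider packages
# (from pulumi_<provider> import <resource>)

# Luigi: top-level package with a class-based API
import luigi

class MyTask(luigi.Task):
    def run(self):
        ...
```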

On Tue, Sep 3, 2024 at 9:15 AM Julian LaNeve <jul...@astronomer.io.invalid> wrote:

> Chiming in here mostly from the DAG author perspective!
>
> I like `airflow.sdk` best. It makes it super clear what the user is
> supposed to interact with and what Airflow’s “public” interface is.
> Importing from `airflow.models` has always felt weird because it feels like
> you’re going into Airflow’s internals, and importing from things like
> `airflow.utils` just added to the confusion because it was always super
> unclear what a normal user is supposed to interact with vs what’s internal
> and subject to change.
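>
> To make that concrete, the contrast looks roughly like this (the first
> import exists today; the second is the proposal under discussion):
>
> ```
> # today: reads like reaching into internals - is this a DB model or an
> # authoring class?
> from airflow.models import DAG
>
> # proposed: unambiguously the public authoring surface
> from airflow.sdk import DAG
> ```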
>
> The only slight downside (imo) to `airflow.sdk` is that an SDK is
> traditionally used to manage/interact with APIs (e.g. the Stripe SDK), so
> you could make the case that an “Airflow SDK” should be a library to
> interact with Airflow’s API. We’ve run into this before with Astro, where
> we published the Astro SDK as an Airflow provider for doing ETL. Then we
> were considering releasing a separate tool for interacting with Astro’s API
> (creating deployments, etc), which we would’ve called an “Astro SDK” but
> that name was already taken. I don’t think we’ll run into that here because
> we already have the `clients` concept to interact with the API.
>
> The `airflow.definitions` pattern feels odd because it’s not something
> I’ve seen elsewhere, so a user would have to learn/remember the pattern
> just for Airflow. The top level option also feels nice but the “user” of
> Airflow is more than just a DAG author, so I wouldn’t want to restrict
> top-level imports just to one audience.
>
> --
> Julian LaNeve
> CTO
>
> Email: jul...@astronomer.io
> Mobile: 330 509 5792
>
> > On Sep 2, 2024, at 6:46 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Yep so. If we do not have side-effects from import airflow -> my vote
> > would be "airflow.sdk" :)
> >
> > On Mon, Sep 2, 2024 at 10:29 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> >> Yes, strongly agreed on the “no side-effects from `import airflow`”.
> >>
> >> To summarise the options so far:
> >>
> >> 1. `from airflow import DAG, TaskGroup` — have the imports be from the
> >> top-level airflow module
> >> 2. `from airflow.definitions import DAG, TaskGroup`
> >> 3. `from airflow.sdk import DAG, TaskGroup`
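> >>
> >> For illustration, a DAG file under option 3 might read like this (a
> >> sketch - it assumes both `DAG` and a `task` decorator would be exported
> >> from the new namespace):
> >>
> >> ```
> >> from airflow.sdk import DAG, task
> >>
> >> with DAG(dag_id="example"):
> >>
> >>     @task
> >>     def hello():
> >>         print("hello")
> >>
> >>     hello()
> >> ```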
> >>
> >>> On 31 Aug 2024, at 23:07, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>> Should be:
> >>>
> >>> ```
> >>> @configure_settings
> >>> @configure_worker_plugins
> >>> def cli_worker():
> >>>   pass
> >>> ```
> >>>
> >>> On Sun, Sep 1, 2024 at 12:05 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>>> Personally for me "airflow.sdk" is best and very straightforward. And
> >>>> we have not yet used that for other things before, so it's free to use.
> >>>>
> >>>> "Models" and similar carried more (often misleading) information - they
> >>>> were sometimes database models, sometimes they were not. This caused a
> >>>> lot of confusion.
> >>>>
> >>>> IMHO explicitly calling something "sdk" is a clear indication that
> >>>> "this is what you are expected to use". It makes it very clear what is
> >>>> and what is not a public interface. We should aim to make everything in
> >>>> "airflow.<sdk>" (or whatever we choose) "public" and everything else
> >>>> "private". That should also reduce the need to have a separate
> >>>> description of "what is public and what is not".
> >>>>
> >>>> Actually - if we continue doing import initialization as we do today -
> >>>> I would even go as far as a separate "airflow_sdk" package - unless we
> >>>> finally fix something that has been a problem for a long time: getting
> >>>> rid of the side effects of the "airflow" import.
> >>>>
> >>>> It's a bit tangential but actually related - as part of this work we
> >>>> should IMHO get rid of all the side-effects of "import airflow" that we
> >>>> currently have. If we stick to a sub-package of airflow, that is almost
> >>>> a given, since "airflow.sdk" (or whatever we choose) will be available
> >>>> to the "worker", "dag file processor" and "triggerer", but the rest of
> >>>> "airflow.whatever" will not be - and those components won't be able to
> >>>> use the DB, whereas the scheduler and api_server will.
> >>>>
> >>>> So having side effects on "import" - such as connecting to the DB,
> >>>> configuring settings, or plugin manager initialization - has caused a
> >>>> lot of pain, cyclic imports and a number of other problems.
> >>>>
> >>>> I think we should aim to make "initialization" code explicit rather
> >>>> than implicit (Python zen) - and (possibly via decorators) simply
> >>>> initialize what is needed, in the right sequence, explicitly for each
> >>>> command. If we are able to do that, "airflow.sdk" is ok; if we still
> >>>> have "import airflow" side-effects, then "airflow_sdk" (or similar) is
> >>>> better in this case, because otherwise we will have to have some ugly
> >>>> conditional code for when you do and do not have database access.
> >>>>
> >>>> As an example - if we go for "airflow.sdk" I'd love to see something
> >>>> like this:
> >>>>
> >>>> ```
> >>>> @configure_db
> >>>> @configure_settings
> >>>> def cli_db():
> >>>>   pass
> >>>>
> >>>> @configure_db
> >>>> @configure_settings
> >>>> @configure_ui_plugins
> >>>> def cli_webserver():
> >>>>   pass
> >>>>
> >>>> @configure_settings
> >>>> @configure_ui_plugins
> >>>> def cli_worker():
> >>>>   pass
> >>>> ```
> >>>>
> >>>> Rather than that:
> >>>>
> >>>> ```
> >>>> import airflow  # <-- here everything gets initialized
> >>>> ```
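> >>>>
> >>>> For what it's worth, a minimal sketch of what one of those decorators
> >>>> could look like - hypothetical: the decorator name matches the example
> >>>> above, but the lazy import and the `settings.initialize()` hook are
> >>>> illustrative assumptions, not an existing API:
> >>>>
> >>>> ```
> >>>> import functools
> >>>>
> >>>> def configure_settings(func):
> >>>>     # Run settings initialization explicitly, right before the wrapped
> >>>>     # CLI entrypoint runs - instead of at import time.
> >>>>     @functools.wraps(func)
> >>>>     def wrapper(*args, **kwargs):
> >>>>         # lazy import, so "import airflow" itself stays side-effect free
> >>>>         from airflow import settings
> >>>>         settings.initialize()  # hypothetical explicit init hook
> >>>>         return func(*args, **kwargs)
> >>>>     return wrapper
> >>>> ```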
> >>>>
> >>>> J
> >>>>
> >>>>
> >>>> On Sat, Aug 31, 2024 at 10:17 PM Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
> >>>>
> >>>>> Hi Ash,
> >>>>>
> >>>>> I was thinking hard... I set the email aside for a while and still
> >>>>> have no real _good_ ideas. I am still good with "models" and "sdk".
> >>>>>
> >>>>> Actually, what we want to define is an "execution interface", for
> >>>>> which the structural model - as an API in Python or another language -
> >>>>> gives bindings and helper methods. For the application it is centered
> >>>>> around DAGs - but naming it "DAGs" is not good, because other non-DAG
> >>>>> side objects also need to belong there.
> >>>>>
> >>>>> Other terms that came to my mind were "Schema", "System" and "Plan",
> >>>>> but none of them are as good as the previous "models" or "SDK".
> >>>>>
> >>>>> "API", by the way, is too broad and generic, and smells like something
> >>>>> remote. So it should _not_ be "API".
> >>>>>
> >>>>> The term "Definitions" is a bit too long in my view.
> >>>>>
> >>>>> So... TL;DR... this email is not much help other than saying that I'd
> >>>>> propose to use "airflow.models" or "airflow.sdk", if no other/better
> >>>>> ideas come up :-D
> >>>>>
> >>>>> Jens
> >>>>>
> >>>>> On 30.08.24 19:03, Ash Berlin-Taylor wrote:
> >>>>>>> As a side note, I wonder if we should do the user-internal
> >>>>>>> separation better for DagRun and TaskInstance
> >>>>>> Yes, that is a somewhat inevitable side effect of making it sit
> >>>>>> behind an API, and one I am looking forward to. These are almost just
> >>>>>> plain-data classes (but not using dataclasses per se), so we have two
> >>>>>> different classes — one that is the API representation, and a
> >>>>>> separate internal one, used by the scheduler etc., that will have all
> >>>>>> of the scheduling logic methods.
> >>>>>>
> >>>>>> -ash
> >>>>>>
> >>>>>>> On 30 Aug 2024, at 17:55, Tzu-ping Chung <t...@astronomer.io.INVALID> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 30 Aug 2024, at 17:48, Ash Berlin-Taylor <a...@apache.org> wrote:
> >>>>>>>>
> >>>>>>>> Where should DAG, TaskGroup, Labels, decorators etc for authoring
> >>>>>>>> be imported from inside the DAG files? Similarly for DagRun,
> >>>>>>>> TaskInstance (these two likely won’t be created directly by users,
> >>>>>>>> but just used for reference docs/type hints)
> >>>>>>>>
> >>>>>>> How about airflow.definitions? When discussing assets, a question
> >>>>>>> was raised about what we should call “DAG files” going forward
> >>>>>>> (because those files now may not contain user-defined DAGs at all).
> >>>>>>> “Definition files” was raised as a choice, but there’s no existing
> >>>>>>> usage and it might take a while to catch on. If we put all these
> >>>>>>> things into airflow.definitions, maybe people will start using that
> >>>>>>> term?
> >>>>>>>
> >>>>>>> As a side note, I wonder if we should do the user-internal
> >>>>>>> separation better for DagRun and TaskInstance. We already have that
> >>>>>>> separation for DAG/DagModel, Dataset/DatasetModel, and more. Maybe
> >>>>>>> we should also have constructs that only users see, which are
> >>>>>>> converted to “real” objects (i.e. ones that exist in the db) for the
> >>>>>>> scheduler. We already sort of have those in DagRunPydantic and
> >>>>>>> TaskInstancePydantic; we just need to name them better and expose
> >>>>>>> them in the right places.
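> >>>>>>>
> >>>>>>> A rough sketch of what that split could look like (class and field
> >>>>>>> names are hypothetical, purely to illustrate the idea):
> >>>>>>>
> >>>>>>> ```
> >>>>>>> import dataclasses
> >>>>>>>
> >>>>>>> @dataclasses.dataclass
> >>>>>>> class TaskInstance:
> >>>>>>>     """User-facing: plain data, no DB access, safe to hand to tasks."""
> >>>>>>>     dag_id: str
> >>>>>>>     task_id: str
> >>>>>>>     run_id: str
> >>>>>>>     try_number: int
> >>>>>>>
> >>>>>>> class TaskInstanceModel:
> >>>>>>>     """Internal: DB-backed, owns the scheduling logic methods."""
> >>>>>>>     ...
> >>>>>>> ```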
> >>>>>>>
> >>>>>>> TP
> >>>>>>>

-- 
-Fritz Davenport
Senior Data Engineer & CETA Team Lead, Customer Dept @ Astronomer
