Chiming in here mostly from the DAG author perspective!

I like `airflow.sdk` best. It makes it super clear what the user is supposed to 
interact with and what Airflow’s “public” interface is. Importing from 
`airflow.models` has always felt weird because it feels like you’re going into 
Airflow’s internals, and importing from things like `airflow.utils` just added 
to the confusion because it was always super unclear what a normal user is 
supposed to interact with vs what’s internal and subject to change.

The only slight downside (imo) to `airflow.sdk` is that an SDK is traditionally 
used to manage/interact with APIs (e.g. the Stripe SDK), so you could make the 
case that an “Airflow SDK” should be a library to interact with Airflow’s API. 
We’ve run into this before with Astro, where we published the Astro SDK as an 
Airflow provider for doing ETL. Then we were considering releasing a separate 
tool for interacting with Astro’s API (creating deployments, etc), which we 
would’ve called an “Astro SDK” but that name was already taken. I don’t think 
we’ll run into that here because we already have the `clients` concept to 
interact with the API.

The `airflow.definitions` pattern feels odd because it’s not something I’ve 
seen elsewhere, so a user would have to learn/remember the pattern just for 
Airflow. The top level option also feels nice but the “user” of Airflow is more 
than just a DAG author, so I wouldn’t want to restrict top-level imports just 
to one audience.
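For what it's worth, the explicit-initialization decorators Jarek sketches below could work in plain Python roughly like this. This is purely illustrative — the `configure_*` names mirror his example, and the "initialization" here is just a stand-in list, not actual Airflow code:

```python
import functools

# Purely illustrative sketch - not actual Airflow code. It mimics the
# "explicit initialization per CLI command" decorators from the quoted
# example below.
INITIALIZED = []  # records which subsystems a command explicitly set up


def initializer(name):
    """Build a decorator that performs one named initialization step."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            INITIALIZED.append(name)  # stand-in for real setup work
            return func(*args, **kwargs)
        return wrapper
    return decorate


# Hypothetical decorators matching the names in the quoted example.
configure_settings = initializer("settings")
configure_db = initializer("db")
configure_ui_plugins = initializer("ui_plugins")


@configure_settings
@configure_ui_plugins
def cli_worker():
    # A worker command never touches the DB, so @configure_db is absent.
    return "worker started"


print(cli_worker())  # decorators fire outermost-first
print(INITIALIZED)   # ['settings', 'ui_plugins'] - no 'db' anywhere
```

The point is just that each command declares exactly the setup it needs, in order, with no import-time magic — a worker can simply never gain DB access.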

--
Julian LaNeve
CTO

Email: jul...@astronomer.io
Mobile: 330 509 5792

> On Sep 2, 2024, at 6:46 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> Yep so. If we do not have side-effects from import airflow -> my vote would
> be "airflow.sdk" :)
> 
> On Mon, Sep 2, 2024 at 10:29 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> 
>> Yes, strongly agreed on the “no side-effects from `import airflow`”.
>> 
>> To summarise the options so far:
>> 
>> 1. `from airflow import DAG, TaskGroup` — have the imports be from the top
>> level airflow module
>> 2. `from airflow.definitions import DAG, TaskGroup`
>> 3. `from airflow.sdk import DAG, TaskGroup`
>> 
>>> On 31 Aug 2024, at 23:07, Jarek Potiuk <ja...@potiuk.com> wrote:
>>> 
>>> Should be:
>>> 
>>> ```
>>> @configure_settings
>>> @configure_worker_plugins
>>> def cli_worker():
>>>   pass
>>> ```
>>> 
>>> On Sun, Sep 1, 2024 at 12:05 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>> 
>>>> Personally for me "airflow.sdk" is best and very straightforward. And we
>>>> have not yet used that for other things before, so it's free to use.
>>>> 
>>>> "Models" and similar carried more (often misleading) information - they
>>>> were sometimes database models, sometimes they were not. This caused a
>> lot
>>>> of confusion.
>>>> 
>>>> IMHO explicitly calling something "sdk" is a clear indication "this is
>>>> what you are expected to use". And it makes it very clear what is and
>>>> what is not a public interface. We should aim to make everything in
>>>> "airflow.<sdk>" (or whatever we choose) "public" and everything else
>>>> "private". That should also reduce the need of having a separate
>>>> description of "what is public and what is not".
>>>> 
>>>> Actually - if we continue doing import initialization as we do today - I
>>>> would even go as far as a separate "airflow_sdk" package - unless we fix
>>>> something that has been a problem for a long time: getting rid of the
>>>> side effects of the "airflow" import.
>>>> 
>>>> It's a bit tangential but actually related - as part of this work we
>>>> should IMHO get rid of all side-effects of "import airflow" that we
>>>> currently have. If we stick to a sub-package of airflow, it is almost a
>>>> given, since "airflow.sdk" (or whatever we choose) will be available to
>>>> "worker", "dag file processor" and "triggerer", but the rest of
>>>> "airflow" will not be, and those components won't be able to use the DB,
>>>> whereas scheduler and api_server will.
>>>> 
>>>> Having side effects on "import" - such as connecting to the DB,
>>>> configuring settings, or initializing the plugin manager - has caused a
>>>> lot of pain, cyclic imports and a number of other problems.
>>>> 
>>>> I think we should aim to make "initialization" code explicit rather than
>>>> implicit (Python zen) - and (possibly via decorators) simply initialize
>>>> what is needed, in the right sequence, explicitly for each command. If
>>>> we manage to do that, "airflow.sdk" is ok; if we still have "import
>>>> airflow" side-effects, then "airflow_sdk" (or similar) is better,
>>>> because otherwise we will need some ugly conditional code for when you
>>>> do and do not have database access.
>>>> 
>>>> As an example - if we go for "airflow.sdk" I'd love to see something
>>>> like this:
>>>> 
>>>> ```
>>>> @configure_db
>>>> @configure_settings
>>>> def cli_db():
>>>>   pass
>>>> 
>>>> @configure_db
>>>> @configure_settings
>>>> @configure_ui_plugins
>>>> def cli_webserver():
>>>>   pass
>>>> 
>>>> @configure_settings
>>>> @configure_ui_plugins
>>>> def cli_worker():
>>>>   pass
>>>> ```
>>>> 
>>>> Rather than that:
>>>> 
>>>> ```
>>>> import airflow <-- here everything gets initialized
>>>> ```
>>>> 
>>>> J
>>>> 
>>>> 
>>>> On Sat, Aug 31, 2024 at 10:17 PM Jens Scheffler
>>>> <j_scheff...@gmx.de.invalid> wrote:
>>>> 
>>>>> Hi Ash,
>>>>> 
>>>>> I was thinking hard... set the email aside for a while, and still have
>>>>> no real _good_ ideas. I am still good with "models" and "sdk".
>>>>> 
>>>>> Actually what we want to define is an "execution interface", to which
>>>>> the structural model as an API in Python (or another language) gives
>>>>> bindings and helper methods. For the application it is centered around
>>>>> DAGs - but naming it "dags" is not good, because non-DAG side objects
>>>>> also need to belong there.
>>>>> 
>>>>> Other terms which came to my mind were "Schema", "System" and "Plan",
>>>>> but all of those are not as good as the previous "models" or "SDK".
>>>>> 
>>>>> API, by the way, is too broad and generic and smells like something
>>>>> remote. So it should _not_ be "API".
>>>>> 
>>>>> The term "Definitions" is a bit too long in my view.
>>>>> 
>>>>> So... TLDR... this email is not much help other than saying that I'd
>>>>> propose "airflow.models" or "airflow.sdk", if no other / better ideas
>>>>> come up :-D
>>>>> 
>>>>> Jens
>>>>> 
>>>>> On 30.08.24 19:03, Ash Berlin-Taylor wrote:
>>>>>>> As a side note, I wonder if we should do the user-internal
>>>>>>> separation better for DagRun and TaskInstance
>>>>>> Yes, that is a somewhat inevitable side effect of making it be behind
>>>>>> an API, and one I am looking forward to. These are almost just
>>>>>> plain-data classes (but not using dataclasses per se), so we have two
>>>>>> different classes - one that is the API representation, and a separate
>>>>>> internal one, used by the scheduler etc., that has all of the
>>>>>> scheduling logic methods.
>>>>>> 
>>>>>> -ash
>>>>>> 
>>>>>>> On 30 Aug 2024, at 17:55, Tzu-ping Chung <t...@astronomer.io.INVALID>
>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 30 Aug 2024, at 17:48, Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Where should DAG, TaskGroup, Labels, decorators etc for authoring be
>>>>> imported from inside the DAG files? Similarly for DagRun, TaskInstance
>>>>> (these two likely won’t be created directly by users, but just used for
>>>>> reference docs/type hints)
>>>>>>>> 
>>>>>>> How about airflow.definitions? When discussing assets there was a
>>>>>>> question raised on what to call "DAG files" going forward (because
>>>>>>> those files now may not contain user-defined DAGs at all).
>>>>>>> "Definition files" was raised as a choice, but there's no existing
>>>>>>> usage and it might take a while to catch on. If we put all these
>>>>>>> things into airflow.definitions, maybe people will start using that
>>>>>>> term?
>>>>>>> 
>>>>>>> As a side note, I wonder if we should do the user-internal
>>>>>>> separation better for DagRun and TaskInstance. We already have that
>>>>>>> separation for DAG/DagModel, Dataset/DatasetModel, and more. Maybe we
>>>>>>> should also have constructs that only users see, which are converted
>>>>>>> to "real" objects (i.e. ones that exist in the db) for the scheduler.
>>>>>>> We already sort of have those in DagRunPydantic and
>>>>>>> TaskInstancePydantic; we just need to name them better and expose
>>>>>>> them at the right places.
>>>>>>> 
>>>>>>> TP
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>>>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>> 
>> 
