What about the DynamoDB idea? What you are really trading off is "writing
to the Airflow metadata DB" against "writing to another DB". So yes, it is
another thing you will need write access to, besides the Airflow DB, but
the real question is whether the boundary should be "everything writable
lives in Airflow" vs. "everything writable lives in the cloud that the
integration is about".

Yes - managing state via S3 versioning is a bit more "write-y" - but on the
other hand it confines the complexity to a pure "amazon" provider, with
practically zero impact on Airflow core and the Airflow DB. Which, to be
honest, I really like.
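
As a sketch of the "S3 versioning as state storage" idea (the bucket name
and key layout here are made up purely for illustration):

import json
import boto3

s3 = boto3.client("s3")
STATE_BUCKET = "my-airflow-bundle-state"  # hypothetical; versioning enabled

def write_state(bundle_name: str, manifest: dict) -> str:
    # Overwrite the single state object; S3 versioning keeps the history.
    # The returned S3 version id can double as the bundle version id.
    response = s3.put_object(
        Bucket=STATE_BUCKET,
        Key=f"{bundle_name}/manifest.json",
        Body=json.dumps(manifest).encode("utf-8"),
    )
    return response["VersionId"]

def read_state(bundle_name: str, version_id: str | None = None) -> dict:
    # Read the latest state, or a specific historical version if given.
    kwargs = {"Bucket": STATE_BUCKET, "Key": f"{bundle_name}/manifest.json"}
    if version_id:
        kwargs["VersionId"] = version_id
    return json.loads(s3.get_object(**kwargs)["Body"].read())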

And yes, "co-location" is my goal as well. I think it is also a perfect way
to explain why it is better to keep "S3 versioning" close to "S3" rather
than to Airflow - especially since there will be a lot of "S3-specific"
things in the state that are not easy to abstract into something "common"
to other Airflow versioning implementations.

You can think about it this way:

Airflow has already done its job with abstractions - versioning metadata is
modelled in the Airflow metadata DB. If there are any missing pieces in
that abstraction that would be usable across multiple implementations of
versioning, we should - of course - add them to the Airflow metadata DB, in
a way that those different implementations can use them. But the code to
manage and use them should live in airflow-core.
If there is anything specific to the implementation of the S3 / Amazon
integration, it should be implemented independently of the Airflow metadata
DB. There are many complexities in managing and upgrading the core DB, and
we should not use the DB for provider-specific things. The discussion about
shared code and isolation is interesting in this context, because I think
if we keep going deeper in this direction we will get to the point (and we
are more or less there already) where NO (regular) providers are needed by
whatever CLI or tooling we use to manage the metadata DB. FAB and Edge are
currently exceptions - but they are by no means "regular" providers.

So I'd say - if, while designing/implementing S3 versioning, you see that
part of the implementation can be abstracted away, added to the core, and
used by other implementations - 100% - let's add it to the core. But only
then. If it is something that only the Amazon provider and S3 need - let's
make it use Amazon **whatever** as the backing storage.

I would even say - talk to the Google team and try to come up with an
abstraction that can be used for versioning in both S3 and GCS, agree on
it, and then let's see whether that abstraction should find its way into
the core. That would be my proposal.
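
Just to sketch the kind of abstraction I mean (purely hypothetical -
nothing like this exists in Airflow today):

from abc import ABC, abstractmethod

class BundleVersionStore(ABC):
    # A provider-agnostic contract both S3 and GCS bundles could implement,
    # each backed by its own cloud-native storage (S3 object, GCS object,
    # DynamoDB table, ...).

    @abstractmethod
    def save(self, bundle_name: str, version_id: str, manifest: dict) -> None:
        """Persist the manifest describing one bundle version."""

    @abstractmethod
    def get(self, bundle_name: str, version_id: str) -> dict:
        """Retrieve the manifest for a previously saved bundle version."""

    @abstractmethod
    def list_versions(self, bundle_name: str) -> list[str]:
        """List known version ids for a bundle, newest first."""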

J.




On Wed, Jul 9, 2025 at 7:37 PM Oliveira, Niko <oniko...@amazon.com.invalid>
wrote:

> Thanks for engaging folks!
>
> I don’t love the idea of using another bucket. For one, this means Airflow
> needs write access to S3, which is not ideal; some users/customers are very
> sensitive about ever allowing write access to things. And two, you will
> commonly get issues with a design that leaks state into customer-managed
> accounts/resources: they may delete the bucket not knowing what it is, or
> they may not migrate it to a new account or region if they ever move. I
> think it’s best for the data to be stored transparently to the user and
> co-located with the data it strongly relates to (i.e. the dag runs that are
> associated with those bundle versions).
>
> Is using DB Manager completely unacceptable these days? What are folks'
> thoughts on that?
>
> Cheers,
> Niko
>
> ________________________________
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Wednesday, July 9, 2025 6:23:54 AM
> To: dev@airflow.apache.org
> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
>
> > Another option would also be using a DynamoDB table? That also supports
> > snapshots, and I feel it works very well with state management.
>
> Yep that would also work.
>
> Anything "Amazon" used to keep state would do. I think it should be our
> "default" approach that if we have to keep state, and the state is
> connected with a specific provider's implementation, it's best not to keep
> that state in Airflow but in the "integration" the provider works with, if
> possible. We cannot do that in the "generic" case because we do not know
> what "integrations" the user has - but since this is the provider's
> functionality, using whatever the given integration provides makes perfect
> sense.
>
> J.
>
>
> On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu
> <gopidesupa...@gmail.com> wrote:
>
> > Agree, another S3 bucket also works here.
> >
> > Another option would also be using a DynamoDB table? That also supports
> > snapshots, and I feel it works very well with state management.
> >
> >
> > Pavan
> >
> > On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > One of the options would be to use a similar approach as terraform
> > > uses - i.e. use dedicated "metadata" state storage in a DIFFERENT s3
> > > bucket than DAG files. Since we know there must be an S3 available
> > > (obviously) - it seems not too excessive to assume that there might be
> > > another bucket, independent of the DAG bucket where the state is
> > > stored - same bucket (and dedicated connection id) could even be used
> > > to store state for multiple S3 dag bundles - each Dag bundle could
> > > have a dedicated object storing the state. The metadata is not huge,
> > > so continuously reading and replacing it should not be an issue.
> > >
> > > What's nice about it - this single object could even **actually** use
> > > S3 versioning to keep historical state - to optimize things and keep
> > > a log of changes potentially.
> > >
> > > J.
> > >
> > > On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko
> > > <oniko...@amazon.com.invalid> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > tl;dr I’d like to get some thoughts on a proposal to use DB Manager
> > > > for S3 Dag Bundle versioning.
> > > >
> > > > The initial commit for S3 Dag Bundles was recently merged [1] but it
> > > > lacks Bundle versioning (since this isn’t trivial with something
> > > > like S3, like it is with Git). The proposed solution involves
> > > > building a snapshot of the S3 bucket at the time each Bundle version
> > > > is created, noting the version of all the objects in the bucket
> > > > (using S3’s native bucket versioning feature), creating a manifest
> > > > to store those versions, and then giving that whole manifest itself
> > > > some unique id/version/uuid. These manifests now need to be stored
> > > > somewhere for future use/retrieval. The proposal is to use the
> > > > Airflow database via the DB Manager feature. Other options include
> > > > using the local filesystem to store them (but this obviously won't
> > > > work in Airflow’s distributed architecture) or the S3 bucket itself
> > > > (but this requires write access to the bucket, and we will always be
> > > > at the mercy of the user accidentally deleting/modifying the
> > > > manifests as they try to manage the lifecycle of their bucket; they
> > > > should not need to be aware of or account for this metadata). So the
> > > > Airflow DB works nicely as a persistent and internally accessible
> > > > location for this data.
> > > >
> > > > But I’m aware of the complexities of using the DB Manager and the
> > > > discussion we had during the last dev call about providers vending
> > > > DB tables (concerning migrations and ensuring smooth upgrades or
> > > > downgrades of the schema). So I wanted to reach out to see what
> > > > folks thought. I have talked to Jed, the Bundle Master (tm), and we
> > > > haven’t come up with anything else that solves the problem as
> > > > cleanly, so the DB Manager is still my top choice. I think what we
> > > > go with will pave the way for other Bundle providers of a similar
> > > > type as well, so it's worth thinking deeply about this decision.
> > > >
> > > > Let me know what you think and thanks for your time!
> > > >
> > > > Cheers,
> > > > Niko
> > > >
> > > > [1] https://github.com/apache/airflow/pull/46621
> > > >
> > >
> >
>
