Hey folks,

tl;dr I’d like to get some thoughts on a proposal to use DB Manager for S3 Dag 
Bundle versioning.

The initial commit for S3 Dag Bundles was recently merged [1], but it lacks 
Bundle versioning (this isn’t trivial with something like S3, the way it is 
with Git). The proposed solution is to build a snapshot of the S3 bucket at the 
time each Bundle version is created: record the version of every object in the 
bucket (using S3’s native bucket versioning feature), store those versions in a 
manifest, and give the manifest itself a unique id/version/uuid. These 
manifests now need to be stored 
somewhere for future use/retrieval. The proposal is to use the Airflow database 
using the DB Manager feature. Other options include using the local filesystem 
to store them (but this obviously won’t work in Airflow’s distributed 
architecture) or the S3 bucket itself (but this requires write access to the 
bucket, and we would always be at the mercy of users accidentally 
deleting/modifying the manifests as they manage the lifecycle of their 
bucket; they should not need to be aware of or account for this 
metadata). So the Airflow DB works nicely as a persistent and internally 
accessible location for this data.
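
To make the manifest idea concrete, here is a rough sketch of what one might 
look like. This is not code from the PR: the function name, the field names, 
and the choice of a content hash as the manifest id are all assumptions for 
illustration (a uuid would work just as well). It takes plain (key, version_id) 
pairs, which in practice would presumably come from listing the bucket's 
object versions.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(bucket, object_versions, created_at=None):
    """Pin every object key in a bucket snapshot to one S3 version id.

    Hypothetical helper: `object_versions` is an iterable of
    (key, version_id) pairs for the current version of each object.
    """
    # Sort by key so the same snapshot always serializes the same way.
    objects = dict(sorted(object_versions))

    # Assumption: derive the manifest id from its content, so an
    # unchanged bucket maps to the same Bundle version.
    payload = json.dumps({"bucket": bucket, "objects": objects}, sort_keys=True)
    manifest_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()

    return {
        "id": manifest_id,
        "bucket": bucket,
        "created_at": (created_at or datetime.now(timezone.utc)).isoformat(),
        "objects": objects,
    }
```

Whatever shape this takes, the resulting record (id plus the key-to-version 
mapping) is the thing that would need a persistent home, which is where the 
Airflow DB comes in.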

But I’m aware of the complexities of using the DB Manager and the discussion we 
had during the last dev call about providers vending DB tables (concerning 
migrations and ensuring smooth upgrades or downgrades of the schema). So I 
wanted to reach out to see what folks thought. I have talked to Jed, the Bundle 
Master (tm), and we haven’t come up with anything else that solves the problem 
as cleanly, so the DB Manager is still my top choice. I think what we go with 
will pave the way for other Bundle providers of a similar type as well, so it's 
worth thinking deeply about this decision.

Let me know what you think and thanks for your time!

Cheers,
Niko

[1] https://github.com/apache/airflow/pull/46621
