Hey folks, tl;dr I’d like to get some thoughts on a proposal to use DB Manager for S3 Dag Bundle versioning.
The initial commit for S3 Dag Bundles was recently merged [1], but it lacks Bundle versioning, since versioning isn't as trivial with something like S3 as it is with Git. The proposed solution is to build a snapshot of the S3 bucket each time a Bundle version is created: record the version of every object in the bucket (using S3's native bucket versioning feature), collect those object versions into a manifest, and then give that manifest its own unique id/version/uuid. (A rough sketch of what that could look like is at the bottom of this email.)

These manifests then need to be stored somewhere for future use/retrieval. The proposal is to store them in the Airflow database using the DB Manager feature. Other options include the local filesystem (which obviously won't work in Airflow's distributed architecture) or the S3 bucket itself (which requires write access to the bucket and leaves us at the mercy of users accidentally deleting or modifying the manifests while managing the lifecycle of their bucket; they should not need to be aware of, or account for, this metadata). So the Airflow DB works nicely as a persistent and internally accessible location for this data.

But I'm aware of the complexities of using the DB Manager, and of the discussion we had during the last dev call about providers vending DB tables (migrations and ensuring smooth upgrades or downgrades of the schema). So I wanted to reach out to see what folks thought. I have talked to Jed, the Bundle Master (tm), and we haven't come up with anything else that solves the problem as cleanly, so the DB Manager is still my top choice. I think whatever we go with will pave the way for other Bundle providers of a similar type as well, so it's worth thinking deeply about this decision.

Let me know what you think, and thanks for your time!

Cheers,
Niko

[1] https://github.com/apache/airflow/pull/46621
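P.S. For concreteness, here's a rough sketch of what building such a manifest could look like with boto3, assuming bucket versioning is enabled. This is purely illustrative (it's not what's in the provider today, and the function/field names are just placeholders):

import json
import uuid

import boto3


def build_manifest(bucket: str, prefix: str = "") -> dict:
    """Illustrative sketch: snapshot the current version of every object
    under ``prefix`` in a version-enabled S3 bucket into a manifest."""
    s3 = boto3.client("s3")
    entries = []
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            # Only the latest version of each key belongs in the snapshot.
            if version["IsLatest"]:
                entries.append(
                    {"key": version["Key"], "version_id": version["VersionId"]}
                )
    return {
        "manifest_id": str(uuid.uuid4()),  # this would become the Bundle version
        "bucket": bucket,
        "prefix": prefix,
        "objects": entries,
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest("my-dag-bucket", "dags/"), indent=2))

The resulting manifest (keyed by its uuid) is the piece that would then be persisted, which is where the DB Manager question comes in.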