Anjali, my thoughts are inline below:

On Mon, Aug 23, 2021 at 1:14 PM Anjali Norwood <anorw...@netflix.com.invalid>
wrote:

> *"The more I think about this, the more I like the solution to add
> multiple table roots to metadata, rather than removing table roots. Adding
> a way to plug in a root selector makes a lot of sense to me and it ensures
> that the metadata is complete (table location is set in metadata) and that
> multiple locations can be used. Are there any objections or arguments
> against doing it that way?"*
>
> In the context of your comment on multiple locations above, I am thinking
> about the following scenarios:
> 1) Disaster recovery or low-latency use case where clients connect to the
> region geographically closest to them: In this case, multiple table roots
> represent a copy of the table, the copies may or may not be in sync. (Most
> likely active-active replication would be set up in this case and the
> copies are near-identical). A root level selector works/makes sense.
> 2) Data residency requirement: data must not leave a country/region. In
> this case, federation of data from multiple roots constitutes the entirety
> of the table.
> 3) One can also imagine combinations of 1 and 2 above where some locations
> need to be federated and some locations have data replicated from other
> locations.
>
> Curious how the 2nd and 3rd scenarios would be supported with this design.
>

#2 is currently done by adding a custom LocationProvider that creates file
locations for data files. Stripe created one that maps each partition to a
bucket for use cases like this.
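
To make the partition-to-bucket idea concrete, here is a rough sketch. It
is not Iceberg's LocationProvider interface (that is Java) and it is not
Stripe's implementation. The bucket URIs, the "region" partition field,
and the function name are all made up for illustration.

```python
# Hypothetical sketch of mapping a partition to a bucket.
# REGION_BUCKETS, the "region" field, and new_data_location are
# illustrative assumptions, not part of any real Iceberg API.
REGION_BUCKETS = {
    "region1": "s3://bucket-region1/db/table",
    "region2": "s3://bucket-region2/db/table",
}


def new_data_location(partition: dict, filename: str) -> str:
    """Return the data file location for a partition's region bucket."""
    region = partition["region"]
    if region not in REGION_BUCKETS:
        # Fail loudly rather than write data to an unintended region.
        raise ValueError(f"no bucket configured for region {region!r}")
    root = REGION_BUCKETS[region]
    return f"{root}/data/region={region}/{filename}"
```

The point is only that the write path, not the table root, decides where a
file lands, so data for a region never has to leave that region's bucket.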

When combining #2 and #1, you'd have a service that is responsible for
mirroring the table and making the copies. You'd have to keep logic in that
service to mirror only certain buckets or to mirror buckets to specific
places. For example, with buckets region1 and region2, I might mirror only
region1, or I might mirror region1 to region3-in-same-country-as-region1
and region2 to region4-in-same-country-as-region2. From there, you'd be
able to choose whether to register all 3 (or 4) buckets as table roots and
how to resolve paths.
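
As a sketch of the mirroring logic, under the assumption that the policy
is just a map from a source bucket to its mirror targets (the policy shape
and the table_roots helper are hypothetical, not an existing service or
API):

```python
# Hypothetical mirroring policy: source bucket -> list of mirror buckets.
# Bucket names follow the example in this thread.
MIRROR_REGION1_ONLY = {
    "region1": ["region3-in-same-country-as-region1"],
}
MIRROR_BOTH = {
    "region1": ["region3-in-same-country-as-region1"],
    "region2": ["region4-in-same-country-as-region2"],
}


def table_roots(primaries: list, policy: dict) -> list:
    """List primaries first, then any mirrors, without duplicates."""
    roots = list(primaries)
    for bucket in primaries:
        for target in policy.get(bucket, []):
            if target not in roots:
                roots.append(target)
    return roots
```

With the first policy you end up with 3 candidate roots and with the
second you get 4. Whether all of them are actually registered as table
roots is still a separate choice.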

Layering on relative paths, if you're mirroring only region1 then it might
make sense for region1 paths to be relative and region2 paths to be
absolute. Then you could have multiple table locations that work for
relative region1 paths, but region2 is always in the same place. If you
were mirroring both regions, then it might make more sense to not use
relative paths at all and simply rewrite URIs in FileIO for fallback.
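
A minimal sketch of both options, assuming a path is "absolute" when it
carries a URI scheme. The function names and the prefix-rewrite map are
hypothetical. Nothing here is an existing FileIO method.

```python
from urllib.parse import urlparse


def is_absolute(path: str) -> bool:
    """Treat any path with a URI scheme (s3://, hdfs://, ...) as absolute."""
    return bool(urlparse(path).scheme)


def candidate_locations(path: str, roots: list) -> list:
    """Relative paths resolve against every root; absolute paths stay put."""
    if is_absolute(path):
        return [path]
    return [f"{root.rstrip('/')}/{path.lstrip('/')}" for root in roots]


def rewrite_for_fallback(uri: str, rewrites: dict):
    """Swap a known URI prefix for its mirror, or return None if no match."""
    for prefix, replacement in rewrites.items():
        if uri.startswith(prefix):
            return replacement + uri[len(prefix):]
    return None
```

The first two functions cover the mixed case, where region1 paths are
relative and region2 paths are absolute. The last one covers the case
where everything is absolute and a reader retries against a rewritten URI
when the primary location is unavailable.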

Ryan
