For the multiple table roots, do we expect or ensure that the data are identical across the different roots? Or is this best-effort background synchronization across the roots?
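To make the question concrete, here is a minimal sketch of how I picture the selector, assuming the metadata holds a list of roots and that file paths are stored relative to whichever root is chosen. Everything named here (`TableMetadata`, `select_root`, `resolve_paths`, and the `table.root.index` property key) is hypothetical and only for illustration; none of it is an existing Iceberg API, and whether the property comes from the catalog or the table metadata is still undecided.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMetadata:
    # Hypothetical shape: several roots, plus file paths stored relative to a root.
    roots: List[str]
    relative_files: List[str] = field(default_factory=list)


def select_root(metadata: TableMetadata, properties: Dict[str, str],
                key: str = "table.root.index") -> str:
    """Pick a root by reading a property (the key name is made up for this sketch)."""
    index = int(properties.get(key, "0"))  # default to the first root
    if not 0 <= index < len(metadata.roots):
        raise ValueError(f"root index {index} is out of range")
    return metadata.roots[index]


def resolve_paths(metadata: TableMetadata, properties: Dict[str, str]) -> List[str]:
    """Join every relative file path onto the selected root."""
    root = select_root(metadata, properties).rstrip("/")
    return [f"{root}/{rel.lstrip('/')}" for rel in metadata.relative_files]


if __name__ == "__main__":
    metadata = TableMetadata(
        roots=[
            "hdfs://nn:8020/path/to/the/table",
            "s3://bucket1/path/to/the/table",
            "s3://bucket2/path/to/the/table",
        ],
        relative_files=["data/00000.parquet"],  # hypothetical file name
    )
    print(resolve_paths(metadata, {}))
    print(resolve_paths(metadata, {"table.root.index": "1"}))
```

With this shape, the same relative path is resolved under whichever root the selector picks, so a reader only sees a complete table if that file already exists under the selected root. That is why the answer to the question above matters.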
On Sun, Aug 22, 2021 at 11:53 AM Ryan Blue <b...@tabular.io> wrote:

> Peter, I think that this feature would be useful when moving tables
> between root locations or when you want to maintain multiple root
> locations. Renames are orthogonal because a rename doesn't change the table
> location. You may want to move the table after a rename, and this would
> help in that case. But actually moving data is optional. That's why we put
> the table location in metadata.
>
> Anjali, DR is a big use case, but we also talked about directing accesses
> through other URLs, like S3 access points, table migration (like the rename
> case), and background data migration (e.g. lifting files between S3
> regions). There are a few uses for it.
>
> The more I think about this, the more I like the solution to add multiple
> table roots to metadata, rather than removing table roots. Adding a way to
> plug in a root selector makes a lot of sense to me and it ensures that the
> metadata is complete (table location is set in metadata) and that multiple
> locations can be used. Are there any objections or arguments against doing
> it that way?
>
> On Fri, Aug 20, 2021 at 9:00 AM Anjali Norwood
> <anorw...@netflix.com.invalid> wrote:
>
>> Hi,
>>
>> This thread is about disaster recovery and relative paths, but I wanted
>> to ask an orthogonal but related question.
>> Do we see disaster recovery as the only (or main) use case for
>> multi-region?
>> Is data residency requirement a use case for anybody? Is it possible to
>> shard an iceberg table across regions? How is the location managed in that
>> case?
>>
>> thanks,
>> Anjali.
>>
>> On Fri, Aug 20, 2021 at 12:20 AM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> Sadly, I have missed the meeting :(
>>>
>>> Quick question:
>>> Was table rename / location change discussed for tables with relative
>>> paths?
>>>
>>> AFAIK when a table rename happens then we do not move old data /
>>> metadata files, we just change the root location of the new data / metadata
>>> files. If I am correct about this then we might need to handle this
>>> differently for tables with relative paths.
>>>
>>> Thanks, Peter
>>>
>>> On Fri, 13 Aug 2021, 15:12 Anjali Norwood, <anorw...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Perfect, thank you Yufei.
>>>>
>>>> Regards
>>>> Anjali
>>>>
>>>> On Thu, Aug 12, 2021 at 9:58 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Hi Anjali,
>>>>>
>>>>> Inline...
>>>>> On Thu, Aug 12, 2021 at 5:31 PM Anjali Norwood
>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Thanks for the summary Yufei.
>>>>>> Sorry, if this was already discussed, I missed the meeting yesterday.
>>>>>> Is there anything in the design that would prevent multiple roots
>>>>>> from being in different aws regions?
>>>>>>
>>>>> No. DR is the major use case of relative paths, if not the only one.
>>>>> So, it will support roots in different regions.
>>>>>
>>>>>> For disaster recovery in the case of an entire aws region down or
>>>>>> slow, is metastore still a point of failure or can metastore be stood up
>>>>>> in a different region and could select a different root?
>>>>>>
>>>>> Normally, DR also requires a backup metastore, besides the storage(s3
>>>>> bucket). In that case, the backup metastore will be in a different region
>>>>> along with the table files. For example, the primary table is located in
>>>>> region A as well as its metastore, the backup table is located in region B
>>>>> as well as its metastore. The primary table root points to a path in
>>>>> region A, while backup table root points to a path in region B.
>>>>>
>>>>>> regards,
>>>>>> Anjali.
>>>>>>
>>>>>> On Thu, Aug 12, 2021 at 11:35 AM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Here is a summary of yesterday's community sync-up.
>>>>>>>
>>>>>>> Yufei gave a brief update on disaster recovery requirements and the
>>>>>>> current progress of relative path approach.
>>>>>>>
>>>>>>> Ryan: We all agreed that relative path is the way for disaster
>>>>>>> recovery.
>>>>>>>
>>>>>>> *Multiple roots for the relative path*
>>>>>>>
>>>>>>> Ryan proposed an idea to enable multiple roots for a table,
>>>>>>> basically, we can add a list of roots in table metadata, and use a
>>>>>>> selector to choose different roots when we move the table from one
>>>>>>> place to another.
>>>>>>> The selector reads a property to decide which root to use. The property
>>>>>>> could be either from catalog or the table metadata, which is yet to be
>>>>>>> decided.
>>>>>>>
>>>>>>> Here is an example I’d image:
>>>>>>>
>>>>>>> 1. Root1: hdfs://nn:8020/path/to/the/table
>>>>>>> 2. Root2: s3://bucket1/path/to/the/table
>>>>>>> 3. Root3: s3://bucket2/path/to/the/table
>>>>>>>
>>>>>>> *Relative path use case*
>>>>>>>
>>>>>>> We brainstormed use cases for relative paths. Please let us know if
>>>>>>> there are any other use cases.
>>>>>>>
>>>>>>> 1. Disaster Recovery
>>>>>>> 2. Jack: AWS s3 bucket alias
>>>>>>> 3. Ryan: fall-back use case. In case that the root1 doesn’t
>>>>>>>    work, the table falls back to root2, then root3. As Russell
>>>>>>>    mentioned, it is challenging to do snapshot expiration and other
>>>>>>>    table maintenance actions.
>>>>>>>
>>>>>>> *Timeline*
>>>>>>>
>>>>>>> In terms of timeline, relative path could be a feature in Spec V3,
>>>>>>> since Spec V1 and V2 assume absolute path in metadata.
>>>>>>>
>>>>>>> *Misc*
>>>>>>>
>>>>>>> 1. Miao: How is the relative path compatible with the absolute
>>>>>>>    path?
>>>>>>> 2. How do we migrate an existing table? Build a tool for that.
>>>>>>>
>>>>>>> Please let us know if you have any ideas, questions, or concerns.
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>
> --
> Ryan Blue
> Tabular