Even for Iceberg, as I noted, we track multiple locations and vend credentials scoped to those multiple locations.
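
To make that concrete, here is a rough sketch in Python of what "multiple locations" could mean for credential scoping. It is illustrative only and is not Polaris code: `is_valid_location` and `locations_to_scope` are hypothetical names, and the assumption is that the extra locations arrive through the `write.data.path` and `write.metadata.path` properties mentioned further down this thread, next to the base table location. The validation is the simple URI format check discussed below and nothing more.

```python
from urllib.parse import urlparse

# Property keys taken from this thread; everything else here is hypothetical.
LOCATION_PROPERTY_KEYS = ("write.data.path", "write.metadata.path")


def is_valid_location(value: str) -> bool:
    """Format check only: the value must parse as a URI with a scheme."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)


def locations_to_scope(base_location: str, properties: dict[str, str]) -> list[str]:
    """Collect every distinct location that vended credentials would need to cover."""
    candidates = [base_location] + [
        properties[key] for key in LOCATION_PROPERTY_KEYS if key in properties
    ]
    result = []
    for location in candidates:
        if not is_valid_location(location):
            raise ValueError(f"not a valid location URI: {location!r}")
        # Drop trailing slashes so "s3://b/t1" and "s3://b/t1/" count as one location.
        normalized = location.rstrip("/")
        if normalized not in result:
            result.append(normalized)
    return result
```

With a single `location` field and no extra properties this reduces to a one-element list, so the restricted starting point and the multi-location case can share the same code path.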

On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> No worries about the name. It is a possible alternative spelling :)
>
> On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
> > Hi Dmitri,
> >
> > Sorry, I accidentally typed your name wrong in the previous reply!
> > Apologies for this!
> >
> > For the S3 issue, I think we will need to deal with those regardless.
> > Especially with the federation work going on, we will eventually need to
> > handle all those entities coming from different Catalogs, and the URI
> > format seems to be the standard format used by various Catalog services.
> >
> > Best Regards,
> > Yun
> >
> > On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
> >
> > > Hi Dimitri and Eric,
> > >
> > > Thanks a lot for the feedback!
> > >
> > > For the questions:
> > > - Is it one value or many?
> > > It will be one value, similar to the location in Iceberg and the
> > > storage_location in Unity Catalog.
> > >
> > > Regarding the point about having new data in new locations and keeping
> > > old data in old locations, do we support that for Iceberg today?
> > > Most Spark tables seem to have only one location. Also, I think it is
> > > better to start restricted first, and then extend it to allow multiple
> > > locations when the use case arises.
> > >
> > > Ref:
> > > Iceberg location:
> > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > Storage location in Unity Catalog:
> > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > >
> > > - Is it a URI?
> > > Yes, it will be a URI, which seems to be the standard in catalog
> > > implementations. Regarding the point about s3 vs. s3a, I assume that is
> > > a common problem that every catalog implementation needs to address,
> > > and we will stay the same on this part. At least from the load table
> > > point of view, the Spark engine knows how to deal with such cases.
> > >
> > > - Does it point to any particular file?
> > > No, it doesn't point to a particular file. It is the base table
> > > location.
> > >
> > > - Is it a common prefix of all files within a table?
> > > It is supposed to be the base table location, which theoretically
> > > should be the common prefix of all files within a table, I believe.
> > >
> > > - What happens when a value does not match these expectations?
> > > Whether it is one value or many is restricted by the spec already.
> > > For the URI format, I think we can do a format check and fail the
> > > request if it does not pass. Other than that, we will not do any other
> > > special check, and we rely on the client to put the correct value;
> > > otherwise, the other engine will not be able to successfully read the
> > > table.
> > >
> > > For the location keyword, as Eric has pointed out, we can potentially
> > > have a reserved key for the properties. However, location is a common
> > > enough key among various table formats that it is worth a dedicated
> > > key to help store and load the information in a more straightforward
> > > way. For things that are specific to one or two formats, I think it
> > > makes more sense to use a reserved property key.
> > >
> > > As a reference, in Iceberg, the CreateTable request and TableMetadata
> > > do have an explicit location key in the spec. write.data.path and
> > > write.metadata.path are passed as properties today.
> > >
> > > Best Regards,
> > > Yun
> > >
> > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > >
> > >> Another point: I'm pretty sure sooner or later users will want to move
> > >> their data to some other location. As an option users may want to
> > >> write new files into another location but keep old files in place.
> > >>
> > >> Also: if the location is a URI, how do we deal with s3 vs. s3a for
> > >> example?
> > >>
> > >> In Iceberg it is quite common for different engines to use different
> > >> access tools, which often leads to different URI schemes.
> > >>
> > >> Cheers,
> > >> Dmitri.
> > >>
> > >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> > >>
> > >> > All good questions Dmitri. I'm especially interested in the first
> > >> > one, as from what I understand Iceberg tables can have metadata and
> > >> > data at two different paths that we need to vend credentials for.
> > >> >
> > >> > For Iceberg tables, we just use special properties to track these
> > >> > locations. I wonder if we couldn't do the same for generic tables.
> > >> >
> > >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > >> >
> > >> > > Hi Yun,
> > >> > >
> > >> > > Please clarify the meaning of the value of the new location
> > >> > > attribute.
> > >> > >
> > >> > > - Is it one value or many?
> > >> > > - Is it a URI?
> > >> > > - Does it point to any particular file?
> > >> > > - Is it a common prefix of all files within a table?
> > >> > > - What happens when a value does not match these expectations?
> > >> > >
> > >> > > Thanks,
> > >> > > Dmitri.
> > >> > >
> > >> > > On 2025/05/07 21:50:19 yun zou wrote:
> > >> > > > Hi folks,
> > >> > > >
> > >> > > > I would like to propose adding an optional `location` field to
> > >> > > > the CreateGenericTable request and LoadGenericTable response.
> > >> > > >
> > >> > > > The `location` is the location for the table, which is common
> > >> > > > to most table formats including Iceberg, Delta, Hudi, csv,
> > >> > > > parquet etc. The location information is critical for loading
> > >> > > > the table on the engine side, and having a dedicated keyword
> > >> > > > could help improve the robustness of cross-engine sharing,
> > >> > > > instead of relying on the properties passed by the client side.
> > >> > > >
> > >> > > > Furthermore, this information is also required to provide
> > >> > > > credential vending capabilities later.
> > >> > > >
> > >> > > > Here is the PR for adding the spec:
> > >> > > > https://github.com/apache/polaris/pull/1543
> > >> > > >
> > >> > > > Looking forward to your reply and feedback!
> > >> > > >
> > >> > > > Best Regards,
> > >> > > > Yun