Hi Dmitri,

Sorry, I misspelled your name in my previous reply. I apologize for that!

For the S3 issue, I think we will need to deal with it regardless. Especially
with the federation work going on, we will eventually need to handle entities
coming from different catalogs, and the URI format seems to be the standard
format used by various catalog services.

Best Regards,
Yun

On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dimitri and Eric,
>
> Thanks a lot for the feedback!
>
> For the questions:
> - Is it one value or many?
> It will be one value, similar to the location in Iceberg and the
> storage_location in Unity Catalog.
>
> Regarding the point about having new data in new locations and keeping
> old data in old locations, do we support that for Iceberg today?
> Most Spark tables seem to have only one location. Also, I
> think it is better to start restricted first, and then extend it to
> allow multiple locations when the use case arises.
>
> Ref:
> Iceberg location:
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> Storage location in Unity Catalog:
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
>
> - Is it a URI?
> Yes, it will be a URI, which seems to be the standard across catalog
> implementations.
> Regarding the point about s3 vs. s3a, I assume that is a common
> problem that every catalog implementation needs to address, and we will
> stay the same on this part. At least from the load table point of view,
> the Spark engine knows how to deal with such cases.
>
> - Does it point to any particular file?
> No, it doesn't point to a particular file. It is the base table location.
>
> - Is it a common prefix of all files within a table?
> It is supposed to be the base table location, which theoretically should
> be the common prefix of all files within a table, I believe.
>
> - What happens when a value does not match these expectations?
> Whether it is one value or many is restricted by the spec already.
> For the URI format, I think we can do a format check and fail the request
> if the value is malformed.
> Other than that, we will not do any other special check, and we rely on
> the client to put the correct value; otherwise, other engines will
> not be able to successfully read the table.
>
> For the location keyword, as Eric has pointed out, we can potentially have
> a reserved key for the properties. However, location is a common
> enough key among various table formats that it is worth a dedicated key to
> help store and load the information in a more straightforward
> way. For things that are specific to one or two formats, I think it makes
> more sense to use a reserved property key.
>
> As a reference, in Iceberg, the CreateTable request and TableMetadata do
> have an explicit location key in the spec. write.data.path
> and write.metadata.path are passed as properties today.
>
> Best Regards,
> Yun
>
>
> On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
>> Another point: I'm pretty sure sooner or later users will want to move
>> their data to some other location. As an option, users may want to write
>> new files into another location but keep old files in place.
>>
>> Also: if the location is a URI, how do we deal with s3 vs. s3a, for
>> example?
>>
>> In Iceberg it is quite common for different engines to use different
>> access tools, which often leads to different URI schemes.
>>
>> Cheers,
>> Dmitri.
>>
>> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com>
>> wrote:
>>
>> > All good questions, Dmitri. I’m especially interested in the first one,
>> > as from what I understand Iceberg tables can have metadata and data at
>> > two different paths that we need to vend credentials for.
>> >
>> > For Iceberg tables, we just use special properties to track these
>> > locations. I wonder if we couldn’t do the same for generic tables.
>> >
>> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org>
>> > wrote:
>> >
>> > > Hi Yun,
>> > >
>> > > Please clarify the meaning of the value of the new location attribute.
>> > >
>> > > - Is it one value or many?
>> > > - Is it a URI?
>> > > - Does it point to any particular file?
>> > > - Is it a common prefix of all files within a table?
>> > > - What happens when a value does not match these expectations?
>> > >
>> > > Thanks,
>> > > Dmitri.
>> > >
>> > > On 2025/05/07 21:50:19 yun zou wrote:
>> > > > Hi folks,
>> > > >
>> > > > I would like to propose adding an optional `location` field to the
>> > > > CreateGenericTable request and LoadGenericTable response.
>> > > >
>> > > > The `location` is the location of the table, which is common to most
>> > > > table formats, including Iceberg, Delta, Hudi, CSV, Parquet, etc. The
>> > > > location information is critical for loading the table on the engine
>> > > > side. Having a dedicated keyword could help improve the robustness of
>> > > > cross-engine sharing, instead of relying on the properties passed by
>> > > > the client side.
>> > > >
>> > > > Furthermore, this information is also required to provide credential
>> > > > vending capabilities later.
>> > > >
>> > > > Here is the PR for adding the spec:
>> > > > https://github.com/apache/polaris/pull/1543
>> > > >
>> > > > Looking forward to your reply and feedback!
>> > > >
>> > > > Best Regards,
>> > > > Yun