No worries about the name. It is a possible alternative spelling :)

On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dmitri,
>
> Sorry, I accidentally typed your name wrong in the previous reply!
> Apologies for this!
>
> For the S3 issue, I think we will need to deal with those regardless,
> especially with the federation work going on. We will need to handle all
> those entities eventually coming from different Catalogs, and the URI
> format seems to be the standard format used by various Catalog services.
>
> Best Regards,
> Yun
>
> On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
> > Hi Dimitri and Eric,
> >
> > Thanks a lot for the feedback!
> >
> > For the questions:
> > - Is it one value or many?
> > It will be one value, similar to the location in Iceberg and the
> > storage_location in Unity Catalog.
> >
> > Regarding the point about having new data in new locations and keeping
> > old data in old locations, do we support that for Iceberg today?
> > Most Spark tables seem to have only one location. Also, I
> > think it is better to start restricted first, and then extend it to
> > allow multiple locations when the use case arises.
> >
> > Ref:
> > Iceberg location:
> > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > Storage location in Unity Catalog:
> > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> >
> > - Is it a URI?
> > Yes, it will be a URI, which seems to be standard across catalog
> > implementations.
> > Regarding the point about s3 vs. s3a, I assume that is a common
> > problem that every catalog implementation needs to address, and we will
> > stay the same on this part. At least from the load table point of view,
> > the Spark engine knows how to deal with such cases.
> >
> > - Does it point to any particular file?
> > No, it doesn't point to a particular file. It is the base table location.
> >
> > - Is it a common prefix of all files within a table?
> > It is supposed to be the base table location, which in theory should
> > be the common prefix of all files within a table, I believe.
> >
> > - What happens when a value does not match these expectations?
> > Whether it is one value or many is restricted by the spec already.
> > For the URI format, I think we can do a format check and fail if the
> > value is malformed.
> > Other than that, we will not do any other special check, and we rely on
> > the client to provide the correct value; otherwise, other engines will
> > not be able to successfully read the table.
> >
> > For the location keyword, as Eric has pointed out, we can potentially have
> > a reserved key for the properties. However, location is a common
> > enough key among various table formats that it deserves a dedicated key to
> > help store and load the information in a more straightforward
> > way. For things that are specific to one or two formats, I think it makes
> > more sense to use a reserved property key.
> >
> > As a reference, in Iceberg, the CreateTable request and TableMetadata do
> > have an explicit location key in the spec. write.data.path
> > and write.metadata.path are passed as properties today.
> >
> > Best Regards,
> > Yun
> >
> > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> >
> >> Another point: I'm pretty sure sooner or later users will want to move
> >> their data to some other location. As an option, users may want to write
> >> new files into another location but keep old files in place.
> >>
> >> Also: if the location is a URI, how do we deal with s3 vs. s3a, for
> >> example?
> >>
> >> In Iceberg it is quite common for different engines to use different
> >> access tools, which often leads to different URI schemes.
> >>
> >> Cheers,
> >> Dmitri.
> >>
> >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> >>
> >> > All good questions, Dmitri. I'm especially interested in the first one,
> >> > as from what I understand Iceberg tables can have metadata and data at
> >> > two different paths that we need to vend credentials for.
> >> >
> >> > For Iceberg tables, we just use special properties to track these
> >> > locations. I wonder if we couldn't do the same for generic tables.
> >> >
> >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> >> >
> >> > > Hi Yun,
> >> > >
> >> > > Please clarify the meaning of the value of the new location attribute.
> >> > >
> >> > > - Is it one value or many?
> >> > > - Is it a URI?
> >> > > - Does it point to any particular file?
> >> > > - Is it a common prefix of all files within a table?
> >> > > - What happens when a value does not match these expectations?
> >> > >
> >> > > Thanks,
> >> > > Dmitri.
> >> > >
> >> > > On 2025/05/07 21:50:19 yun zou wrote:
> >> > > > Hi folks,
> >> > > >
> >> > > > I would like to propose adding an optional `location` field to the
> >> > > > CreateGenericTable request and LoadGenericTable response.
> >> > > >
> >> > > > The `location` is the location of the table, which is common to most
> >> > > > table formats including Iceberg, Delta, Hudi, CSV, Parquet, etc. The
> >> > > > location information is critical for loading the table on the engine
> >> > > > side. Having a dedicated keyword could help improve the robustness of
> >> > > > cross-engine sharing, instead of relying on the properties passed by
> >> > > > the client side.
> >> > > >
> >> > > > Furthermore, this information is also required to provide credential
> >> > > > vending capabilities later.
> >> > > >
> >> > > > Here is the PR for adding the spec:
> >> > > > https://github.com/apache/polaris/pull/1543
> >> > > >
> >> > > > Looking forward to your reply and feedback!
> >> > > >
> >> > > > Best Regards,
> >> > > > Yun
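
For illustration, the URI format check and the s3 vs. s3a question raised in the thread could be sketched roughly as below. This is only a sketch in Python using the standard library; `validate_location`, `canonical_location`, and the `SCHEME_ALIASES` table are hypothetical names and are not taken from the Polaris code or the PR. Treating `s3a` and `s3n` as aliases of `s3` is an assumption made for the example, not something the thread agreed on.

```python
from urllib.parse import urlsplit

# Hypothetical alias table: which schemes count as equivalent is an
# assumption for this sketch, not a decision made in the thread.
SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}


def validate_location(value: str) -> str:
    """Return value unchanged if it looks like an absolute URI, else raise."""
    if not isinstance(value, str) or not value.strip():
        raise ValueError("location must be a non-empty string")
    parts = urlsplit(value)
    if not parts.scheme:
        raise ValueError(f"location has no URI scheme: {value!r}")
    if not parts.netloc and not parts.path:
        raise ValueError(f"location has no authority or path: {value!r}")
    return value


def canonical_location(value: str) -> str:
    """Map aliased schemes to one name and drop trailing slashes."""
    parts = urlsplit(validate_location(value))
    scheme = SCHEME_ALIASES.get(parts.scheme, parts.scheme)
    return parts._replace(scheme=scheme).geturl().rstrip("/")
```

With this sketch, `canonical_location("s3a://bucket/warehouse/t1/")` and `canonical_location("s3://bucket/warehouse/t1")` compare equal, while a bare relative path such as `warehouse/t1` is rejected by `validate_location`. Whether the catalog should normalize schemes at all, or leave that to the engine as suggested above, remains an open design choice.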