Thanks for the quick reply, Yun!

> Regarding to the point about having new data in new locations and keeping
> old data in old locations, do we support that for Iceberg today?

My concern is not about supporting that for Iceberg tables. Their spec is
outside of Polaris' control. Here we're talking about a Polaris spec. I
believe we should be prepared to deal with this problem. I do not have a
solution ATM, but I think it is an important use case to consider when
making a new spec.

> For most of the Spark tables, it seems to only have one location.

If it's not "all", it is not strong enough for a spec, IMHO. If some tables
have multiple base locations, how is Polaris going to deal with them?

> - Is it a common prefix of all files within a table?
> It is supposed to be the base table location, which theoretically should be
> the common prefix of all files within a table I believe.

I believe this is too weak for a Polaris spec. If we're not sure that one
location can be a prefix for all files, we should plan for having multiple
locations.

> Regarding to the point about s3 v2 s3a, i assume that is a common
> problem that every catalog implementation needs to address, and we will
> stay the same on this part

We're making a specification for Polaris. I do not think it is sufficient
to say we'll do the same as other (unspecified ATM) catalogs.

> At least from the load table point of view,
> Spark engine knows how to deal with such cases.

What will Polaris Server do with this location?

> For the location keyword, as Eric has pointed out, we can potentially have
> a reserved key for the properties. However, location is a common
> enough key among various table formats, which worths a dedicated key to
> help store and load the information in a more straightforward
> way. For things that are specific to one or two formats, I think it makes
> more sense to use a reserved property key.

The "straightforward" part is what makes me uneasy. If Polaris has to
define it in a spec, it will be hard to change in the future.

From a different angle: If Polaris has location in a Polaris spec, it is
Polaris that tells how engines should interpret that location, not the other
way around. If Polaris takes control of the location, I think we have to be
more careful and at least try to make it future-proof.

Thanks,
Dmitri.

On Wed, May 7, 2025 at 7:56 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dimitri and Eric,
>
> Thanks a lot for the feedback!
>
> For the questions:
> - Is one value or many?
> It will be one value, similar to the location in Iceberg and the
> storage_location in unity catalog.
>
> Regarding to the point about having new data in new locations and keeping
> old data in old locations, do we support that for Iceberg
> today?
> For most of the Spark tables, it seems to only have one location. Also, I
> think it is better to start restricted first, and then extend it to
> allow multiple locations when the use case raises.
>
> Ref:
> Iceberg location:
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> Storage location in Unity Catalog:
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
>
> - Is it a URI?
> Yes, it will be a URI, which seems the standard catalog implementation.
> Regarding to the point about s3 v2 s3a, i assume that is a common
> problem that every catalog implementation needs to address, and we will
> stay the same on this part. At least from the load table point of view,
> Spark engine knows how to deal with such cases.
>
> - Does it point to any particular file?
> No, it doesn't point to a particular file. It is the base table location.
>
> - Is it a common prefix of all files within a table?
> It is supposed to be the base table location, which theoretically should be
> the common prefix of all files within a table I believe.
>
> - What happens when a value does not match these expectations?
> Whether it is one value or many is restricted by the spec already.
> For URI format, I think we can do a format check, and fail it.
> Other than that, we will not do any other special check, and we rely on the
> client to put the correct value, otherwise, the other engine will
> not be able to successfully read the table.
>
> For the location keyword, as Eric has pointed out, we can potentially have
> a reserved key for the properties. However, location is a common
> enough key among various table formats, which worths a dedicated key to
> help store and load the information in a more straightforward
> way. For things that are specific to one or two formats, I think it makes
> more sense to use a reserved property key.
>
> As a reference, in Iceberg, the CreateTable request and TableMetadata does
> have an explicit location key in the spec. For write.data.path
> and write.metadata.path, they are passed as properties today.
>
> Best Regards,
> Yun
>
>
> On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
> > Another point: I'm pretty sure sooner or later users will want to move
> > their data to some other location. As an option users may want to write
> > new files into another location but keep old files in place.
> >
> > Also: if the location is a URI, how do we deal with s3 vs. s3a for
> > example?
> >
> > In Iceberg it is quite common for different engines to use different
> > access tools, which often leads to different URI schemes.
> >
> > Cheers,
> > Dmitri.
> >
> > On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com>
> > wrote:
> >
> > > All good questions Dmitri — I’m especially interested in the first one
> > > as from what I understand Iceberg tables can have metadata and data at
> > > two different paths that we need to vend credentials for.
> > >
> > > For iceberg tables, we just use special properties to track these
> > > locations. I wonder if we couldn’t do the same for generic tables.
> > >
> > > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org>
> > > wrote:
> > >
> > > > Hi Yun,
> > > >
> > > > Please clarify the meaning of the value of the new location
> > > > attribute.
> > > >
> > > > - Is it one value or many?
> > > > - Is it a URI?
> > > > - Does it point to any particular file?
> > > > - Is it a common prefix of all files within a table?
> > > > - What happens when a value does not match these expectations?
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On 2025/05/07 21:50:19 yun zou wrote:
> > > > > Hi folks,
> > > > >
> > > > > I would like to propose to add an optional `location` field to
> > > > > CreateGenericTable request and LoadGenericTable response.
> > > > >
> > > > > The `location` is the location for the table, which is common to
> > > > > most table formats including Iceberg, Delta, Hudi, csv, parquet
> > > > > etc. The location information is critical for loading the table
> > > > > at engine side, having a dedicated keyword could help improve the
> > > > > robustness for cross engine sharing, instead of relying on the
> > > > > properties passed by the client side.
> > > > >
> > > > > Furthermore, this information is also required to provide
> > > > > credential vending capabilities later.
> > > > >
> > > > > Here is the PR for adding the spec:
> > > > > https://github.com/apache/polaris/pull/1543
> > > > >
> > > > > Looking forward to your reply and feedback!
> > > > >
> > > > > Best Regards,
> > > > > Yun
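
For illustration only, the three checks discussed in this thread (a URI
format check, the s3 vs. s3a question, and the "common prefix of all files"
expectation) could be sketched roughly as below. This is a minimal Python
sketch under assumptions, not Polaris code and not part of the proposed
spec: the function names and the SCHEME_ALIASES table are hypothetical, and
treating s3a/s3n as aliases of s3 is just one possible policy.

```python
from urllib.parse import urlparse

# Hypothetical policy: schemes that address the same object store.
SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}


def normalize_location(location: str) -> str:
    """Check that `location` is an absolute URI and normalize it.

    The scheme is mapped through SCHEME_ALIASES and the result always
    ends with exactly one trailing slash.
    """
    parsed = urlparse(location)
    if not parsed.scheme or "://" not in location:
        raise ValueError(f"not an absolute URI: {location!r}")
    scheme = SCHEME_ALIASES.get(parsed.scheme, parsed.scheme)
    path = parsed.path.rstrip("/")
    return f"{scheme}://{parsed.netloc}{path}/"


def is_under_location(base: str, file_uri: str) -> bool:
    """Return True if `file_uri` lies under the base table location."""
    # Comparing normalized forms with trailing slashes keeps
    # ".../tbl2/..." from matching a base of ".../tbl".
    return normalize_location(file_uri).startswith(normalize_location(base))
```

With this policy, a file written through s3a would still be recognized as
living under a base location registered with the s3 scheme, while a sibling
directory such as ".../tbl2/" would not. A table with several base
locations would need a list of bases rather than a single string.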