As I commented in my other recent email, I think that by introducing a "location" property Polaris enters the realm of table format specs.

This is fine from my POV. However, since Polaris is the defining project behind that property, I believe Polaris should provide a more definitive description of the meaning and intended processing of that property.

To repeat myself, I think the Open API spec defines only the API for obtaining the location. We need a place to define what this location means. I do not insist on calling this a "spec" for Generic Tables, but I think it deserves a separate page in Polaris docs, where it would be defined with more rigor.

Specifically, I think we need to call out that:

* The location is a base URI (essentially a prefix) for all files in a generic table.
* Clients (engines) are responsible for writing files only under the specified location.
* A table whose files exist outside the declared location is not compliant with Polaris' definition of a Generic Table.

By extension, I think we ought to describe other existing properties too.

WDYT?

Thanks,
Dmitri.

On Mon, May 19, 2025 at 5:39 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dmitri,
>
> I think for Iceberg, we all agreed that there can be multiple locations, and I definitely agree with Russell that the extension should be done with the IRC endpoints. The Generic Table APIs are designed for non-Iceberg table usage today, and we still want Iceberg table usage to go through the IRC endpoint to have full IRC support.
>
> As for the following point:
> "a more strict spec for that (define where file should and should not go)"
> Are you suggesting that Polaris needs to generate a location for the table to use? If that is the case, I don't think engines respect that today. The table locations are either generated by the engine or specified by the user.
> Or are you suggesting that, as in Iceberg, we should have an allowed location and validate that the table location is under it? Would you mind elaborating more on this point?
>
> Best Regards,
> Yun
>
> On Mon, May 19, 2025 at 1:45 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
> > Yeah, I think Iceberg and Hive are the only ones trying to make life difficult. I think we should also cover that, but in changes to the Iceberg Spec. Hive can just stay how it is ...
> >
> > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> >
> > > For context: my location concerns are rooted in Nessie's experience, where we often get problem reports related to files being outside the declared Iceberg metadata location.
> > >
> > > Example:
> > >
> > > https://github.com/projectnessie/nessie/issues/10817#issuecomment-2887329227
> > >
> > > I'm OK going with a single location for generic tables, but I think Polaris needs to have a more strict spec for that (define where file should and should not go) because Polaris owns this spec. Polaris ought to define what complies with the spec and what does not. Having a proper spec is essential to ensure a mutual understanding among all parties dealing with Generic Tables.
> > >
> > > Open API yaml comments are not sufficient, IMHO. I'd prefer to have a dedicated doc page to define expectations and compliance.
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > > On Mon, May 19, 2025 at 2:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> > >
> > > > The only multiple-location table formats I'm currently aware of are Hive (partitions can live wherever) and Iceberg.
> > > >
> > > > I think for Delta, Hudi, LanceDB, Paimon and file-based tables, they all have to live in the root location. I'm not sure of any other "file" based tables where this would be an issue, but I'd love to know if someone else has ideas.
> > > > I think with the rise in credential vending, splitting things amongst multiple prefixes is becoming less common. I don't oppose doing an array of locations, but it may be enough to just leave this as an extension for later. (Support location or locations)
> > > >
> > > > On Wed, May 7, 2025 at 8:52 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > >
> > > > > Hi Dmitri,
> > > > >
> > > > > If it's not "all" it is not strong enough for a spec, IMHO. If some tables have multiple base locations how is Polaris going to deal with them?
> > > > >
> > > > > Sorry, when I said most of them, it was because I haven't tested all of them (I only tested Delta and CSV before). However, if Unity Catalog only takes one location, I think that is strong enough evidence that one location is enough today.
> > > > >
> > > > > It is also more natural to start with one location, and if there are use cases that require support for multiple locations later, we can move on to a V2 spec to support multiple table locations.
> > > > >
> > > > > We're making a specification for Polaris. I do not think it is sufficient to say we'll do the same as other (unspecified ATM) catalogs.
> > > > >
> > > > > If we want to migrate users from other Catalog services to Polaris (through federation), then Polaris will need to provide corresponding capabilities. For example, the Unity Catalog storage location is a URI representation; when entities are federated from Unity Catalog, we will need to be able to handle the URI location.
> > > > > If URI representation is a common standard that has been accepted by other Catalog services like Unity Catalog and Gravitino, Polaris should be compatible with it; otherwise it might cause problems for users when they migrate from one to another.
> > > > >
> > > > > What will Polaris Server do with this location?
> > > > >
> > > > > For generic tables, Polaris will provide credential vending for this location in the near future. I don't see us providing anything else in the short or mid term, since we still want to promote native support for Iceberg. Or do you have anything special in mind that you think we should support?
> > > > >
> > > > > If Polaris has to define it in a spec, it will be hard to change in the future.
> > > > >
> > > > > Regardless of whether it is explicitly in the spec definition or a reserved property key, as long as it is explicitly documented, it will be hard to change in the future. From that perspective, those two approaches seem the same to me.
> > > > >
> > > > > Table location is critical information that is required on the engine side to read and write the tables, and it should be explicitly defined to provide better sharing across engines. For example, the Delta table location is passed in the table properties with a property key of either "location" or "path", depending on how the table is created. Now, if another engine wants to read the Delta table, it will need to understand those keys, which are controlled by Spark today. If Spark changes them one day, all sharing will stop working.
> > > > >
> > > > > As to whether we want to make it an explicit field or a reserved key, I think for a field common to various table formats, it makes more sense to have it as an explicit field. For properties that are specific to a particular table format, it is more appropriate to just have a reserved key.
> > > > >
> > > > > If Polaris takes control of the location, I think we have to be more careful and at least try to make it future-proof.
> > > > >
> > > > > I don't think Polaris is taking control of the location; the location is still controlled by the engine and users today, like table names. Polaris is a Catalog service: it records the generic table entity and returns the information back to the user on query. It might be able to do some validation on the location (like checking for special characters), but it doesn't decide which location the table will use. I personally don't think it is a bad idea to let the Catalog service also take control of generating the table location, but I think that will require a lot of work.
> > > > >
> > > > > Best Regards,
> > > > > Yun
> > > > >
> > > > > On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > >
> > > > > > No worries about the name. It is a possible alternative spelling :)
> > > > > >
> > > > > > On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Dmitri,
> > > > > > >
> > > > > > > Sorry, I accidentally typed your name wrong in the previous reply! Apologies for this!
> > > > > > >
> > > > > > > For the S3 issue, I think we will need to deal with it regardless. Especially with the federation work going on, we will eventually need to handle all those entities coming from different Catalogs, and the URI format seems to be the standard format used by various Catalog services.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Yun
> > > > > > >
> > > > > > > On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Dimitri and Eric,
> > > > > > > >
> > > > > > > > Thanks a lot for the feedback!
> > > > > > > >
> > > > > > > > For the questions:
> > > > > > > >
> > > > > > > > - Is it one value or many?
> > > > > > > > It will be one value, similar to the location in Iceberg and the storage_location in Unity Catalog.
> > > > > > > >
> > > > > > > > Regarding the point about having new data in new locations and keeping old data in old locations, do we support that for Iceberg today? Most Spark tables seem to have only one location. Also, I think it is better to start restricted, and then extend it to allow multiple locations when the use case arises.
> > > > > > > >
> > > > > > > > Ref:
> > > > > > > > Iceberg location:
> > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > > Storage location in Unity Catalog:
> > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > >
> > > > > > > > - Is it a URI?
> > > > > > > > Yes, it will be a URI, which seems to be the standard across catalog implementations.
> > > > > > > > Regarding the point about s3 vs. s3a, I assume that is a common problem that every catalog implementation needs to address, and we will stay the same on this part. At least from the load table point of view, the Spark engine knows how to deal with such cases.
> > > > > > > >
> > > > > > > > - Does it point to any particular file?
> > > > > > > > No, it doesn't point to a particular file. It is the base table location.
> > > > > > > >
> > > > > > > > - Is it a common prefix of all files within a table?
> > > > > > > > It is supposed to be the base table location, which I believe should theoretically be the common prefix of all files within a table.
> > > > > > > >
> > > > > > > > - What happens when a value does not match these expectations?
> > > > > > > > Whether it is one value or many is already restricted by the spec. For the URI format, I think we can do a format check and fail the request if it does not pass. Other than that, we will not do any other special check; we rely on the client to provide the correct value, otherwise other engines will not be able to read the table successfully.
> > > > > > > >
> > > > > > > > For the location keyword, as Eric has pointed out, we can potentially have a reserved key in the properties. However, location is a common enough key among various table formats that it is worth a dedicated key to help store and load the information in a more straightforward way. For things that are specific to one or two formats, I think it makes more sense to use a reserved property key.
> > > > > > > >
> > > > > > > > As a reference, in Iceberg, the CreateTable request and TableMetadata do have an explicit location key in the spec. write.data.path and write.metadata.path are passed as properties today.
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Yun
> > > > > > > >
> > > > > > > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > > > >
> > > > > > > >> Another point: I'm pretty sure sooner or later users will want to move their data to some other location. As an option, users may want to write new files into another location but keep old files in place.
> > > > > > > >>
> > > > > > > >> Also: if the location is a URI, how do we deal with s3 vs. s3a, for example?
> > > > > > > >>
> > > > > > > >> In Iceberg it is quite common for different engines to use different access tools, which often leads to different URI schemes.
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >> Dmitri.
> > > > > > > >>
> > > > > > > >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> > All good questions, Dmitri. I'm especially interested in the first one, as from what I understand Iceberg tables can have metadata and data at two different paths that we need to vend credentials for.
> > > > > > > >> >
> > > > > > > >> > For Iceberg tables, we just use special properties to track these locations. I wonder if we couldn't do the same for generic tables.
> > > > > > > >> >
> > > > > > > >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi Yun,
> > > > > > > >> > >
> > > > > > > >> > > Please clarify the meaning of the value of the new location attribute.
> > > > > > > >> > >
> > > > > > > >> > > - Is it one value or many?
> > > > > > > >> > > - Is it a URI?
> > > > > > > >> > > - Does it point to any particular file?
> > > > > > > >> > > - Is it a common prefix of all files within a table?
> > > > > > > >> > > - What happens when a value does not match these expectations?
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Dmitri.
> > > > > > > >> > >
> > > > > > > >> > > On 2025/05/07 21:50:19 yun zou wrote:
> > > > > > > >> > > > Hi folks,
> > > > > > > >> > > >
> > > > > > > >> > > > I would like to propose adding an optional `location` field to the CreateGenericTable request and LoadGenericTable response.
> > > > > > > >> > > >
> > > > > > > >> > > > The `location` is the location of the table, which is common to most table formats including Iceberg, Delta, Hudi, CSV, Parquet, etc. The location information is critical for loading the table on the engine side; having a dedicated keyword could help improve the robustness of cross-engine sharing, instead of relying on the properties passed by the client side.
> > > > > > > >> > > >
> > > > > > > >> > > > Furthermore, this information is also required to provide credential vending capabilities later.
> > > > > > > >> > > >
> > > > > > > >> > > > Here is the PR for adding the spec:
> > > > > > > >> > > > https://github.com/apache/polaris/pull/1543
> > > > > > > >> > > >
> > > > > > > >> > > > Looking forward to your reply and feedback!
> > > > > > > >> > > >
> > > > > > > >> > > > Best Regards,
> > > > > > > >> > > > Yun
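
For illustration only, the prefix rule proposed at the top of this thread (the location is a base URI, and every table file must live under it) could be checked roughly as below. This is a minimal Python sketch under stated assumptions, not Polaris code: `is_under_location` and `_normalize` are hypothetical names, and treating `s3a`/`s3n` as aliases of `s3` is only one possible answer to the s3 vs. s3a question, which the thread leaves open.

```python
from urllib.parse import urlsplit

# Assumption: these schemes address the same objects as "s3".
# The thread does not settle how scheme differences should be handled.
_SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}


def _normalize(uri):
    """Split a URI into (scheme, authority, path segments)."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    scheme = _SCHEME_ALIASES.get(scheme, scheme)
    # Drop empty segments so a trailing "/" on the base does not matter.
    segments = [s for s in parts.path.split("/") if s]
    return scheme, parts.netloc, segments


def is_under_location(file_uri, base_location):
    """Return True if file_uri lies strictly under base_location.

    Hypothetical helper: compares whole path segments, so
    ".../tables/t10/f" is not treated as being under ".../tables/t1".
    It does not resolve "." or ".." segments.
    """
    f_scheme, f_host, f_segs = _normalize(file_uri)
    b_scheme, b_host, b_segs = _normalize(base_location)
    if (f_scheme, f_host) != (b_scheme, b_host):
        return False
    return len(f_segs) > len(b_segs) and f_segs[:len(b_segs)] == b_segs
```

Comparing path segments rather than raw strings avoids the false positive a plain `startswith` check would give for sibling tables that share a name prefix.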