Hi Dmitri,

"I do not think those doc comments provide enough visibility to ensure that
the key information is received by users, unless they are dealing directly
with the API"

Yeah, I agree that this information may not be visible enough for users who
don't work directly with the API. However, I think having a whole page just
for "location" might be a bit of overkill. Given that Generic Table API
support is a new catalog capability that Polaris added and that is not part
of IRC, I think it might be worth having a more general page that describes
Polaris Generic Table support and covers critical fields like *location*.
I think we should also have the description in the spec, so that things are
clear for API users.
Please let me know what you think.

Best Regards,
Yun

On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> I believe the Open API spec and the definition of "location" are slightly
> different concerns.
>
> The former is about the API used to obtain information about Generic
> Tables.
>
> The latter is about the interpretation of that information. One can think
> of the location value being handled / transferred beyond the immediate
> Polaris client, in which case it loses its connection to the API, but does
> not lose its meaning as a location of a Generic Table.
>
> Also, I think that Open API doc comments are too low-level and too obscure
> for people who will work with processing actual Generic Table files. I do
> not think those doc comments provide enough visibility to ensure that the
> key information is received by users, unless they are dealing directly
> with the API.
>
> That said, if you prefer to keep the finer points about Generic Table
> locations in the Open API spec, I'd be fine with that.
>
> Cheers,
> Dmitri.
>
> On Mon, May 19, 2025 at 6:46 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
> > Hi Dmitri,
> >
> > Thanks for the detailed explanation, I definitely agree we need to call
> > out those restrictions and compliance in our Spec.
> >
> > As for the documentation, Polaris today already publishes the API spec:
> > if you go to https://polaris.apache.org/in-dev/unreleased/ and click on
> > the Catalog API Spec, it will lead you to the published Spec, which
> > contains all the descriptions in the Spec.
> > That basically means we have both published doc and spec code, and the
> > single source of truth is the description in the doc.
> > Or do you think we should have an extra page for the Generic Table API
> > spec?
> >
> > Best Regards,
> > Yun
> >
> > On Mon, May 19, 2025 at 3:20 PM Yufei Gu <flyrain...@gmail.com> wrote:
> >
> > > > * Clients (engines) are responsible for writing files only under the
> > > > specified location.
> > >
> > > It's nice to have a doc like that. But the open API spec is *the* place
> > > to define the behavior of client and server, and how they interact with
> > > each other. Just as we said before, spec change is recommended to have a
> > > ML discussion.
> > >
> > > > * A table, whose files exist outside the declared location, is not
> > > > compliant with the Polaris' definition for a Generic Table.
> > >
> > > I'm not sure we should go that far. "location" is an optional field.
> > > It's just some features like credential vending that don't work if
> > > "location" is missing.
> > >
> > > Yufei
> > >
> > > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > >
> > > > As I commented in my other recent email, I think by introducing a
> > > > "location" property Polaris enters the realm of table format specs.
> > > >
> > > > This is fine, from my POV, however, since Polaris is the defining
> > > > project behind that property, I believe Polaris should provide a more
> > > > definitive description of the meaning and intended processing of that
> > > > property.
> > > >
> > > > To repeat myself, I think the Open API spec defines only the API for
> > > > obtaining the location. We need a place to define what this location
> > > > means. I do not insist on calling this a "spec" for Generic Tables,
> > > > but I think it deserves a separate page in Polaris docs, where it
> > > > would be defined with more rigor.
> > > >
> > > > Specifically, I think we need to call out that:
> > > > * The location is a base URI (essentially prefix) for all files in a
> > > > generic table.
> > > > * Clients (engines) are responsible for writing files only under the
> > > > specified location.
> > > > * A table, whose files exist outside the declared location, is not
> > > > compliant with the Polaris' definition for a Generic Table.
> > > >
> > > > By extension, I think we ought to describe other existing properties
> > > > too.
> > > >
> > > > WDYT?
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On Mon, May 19, 2025 at 5:39 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > >
> > > > > Hi Dmitri,
> > > > >
> > > > > I think for Iceberg, we all agreed that there can be multiple
> > > > > locations, and I definitely agree with Russell that the extension
> > > > > should be done with the IRC endpoints. The Generic Table APIs are
> > > > > designed for non-Iceberg table usage today, and we still want
> > > > > Iceberg table usage to go through the IRC endpoint to have full IRC
> > > > > support.
> > > > >
> > > > > As for the following point
> > > > > "a more strict spec for that (define where file should and should
> > > > > not go)"
> > > > > Are you suggesting that Polaris needs to generate a location for the
> > > > > table to use? If that is the case, I don't think engines respect
> > > > > that today. The table locations are either generated by the engine
> > > > > or specified by the user.
> > > > > Or are you suggesting that, like Iceberg, we should have an allowed
> > > > > location and validate that the table location is under it? Would you
> > > > > mind elaborating more on this point?
> > > > >
> > > > > Best Regards,
> > > > > Yun
> > > > >
> > > > > On Mon, May 19, 2025 at 1:45 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> > > > >
> > > > > > Yeah I think Iceberg and Hive are the only ones trying to make
> > > > > > life difficult, that I think we should also cover but in changes
> > > > > > to the Iceberg Spec. Hive can just stay how it is ...
> > > > > >
> > > > > > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > >
> > > > > > > For context: my location concerns are rooted in Nessie's
> > > > > > > experience where we often get problem reports related to files
> > > > > > > being outside the declared Iceberg metadata location.
> > > > > > >
> > > > > > > Example:
> > > > > > > https://github.com/projectnessie/nessie/issues/10817#issuecomment-2887329227
> > > > > > >
> > > > > > > I'm ok going with a single location for generic tables, but I
> > > > > > > think Polaris needs to have a more strict spec for that (define
> > > > > > > where file should and should not go) because Polaris owns this
> > > > > > > spec. Polaris ought to define what complies with the spec and
> > > > > > > what does not. Having a proper spec is essential to ensure a
> > > > > > > mutual understanding of all parties dealing with Generic Tables.
> > > > > > >
> > > > > > > Open API yaml comments are not sufficient, IMHO. I'd prefer to
> > > > > > > have a dedicated doc page to define expectations and compliance.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dmitri.
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 2:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> > > > > > >
> > > > > > > > The only multiple locations table formats I'm currently aware
> > > > > > > > of are Hive (partitions can live wherever) and Iceberg.
> > > > > > > >
> > > > > > > > I think for Delta, Hudi, LanceDB, Paimon and File based tables
> > > > > > > > they all have to live in the root location. I'm not sure of
> > > > > > > > any other "file" based tables where this would be an issue but
> > > > > > > > I'd love to know if someone else has ideas. I think with the
> > > > > > > > rise in credential vending, splitting things amongst multiple
> > > > > > > > prefixes is becoming less common. I don't oppose doing an
> > > > > > > > array of locations but it may be enough to just leave this as
> > > > > > > > an extension later. (Support location or locations)
> > > > > > > >
> > > > > > > > On Wed, May 7, 2025 at 8:52 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Dmitri,
> > > > > > > > >
> > > > > > > > > If it's not "all" it is not strong enough for a spec, IMHO.
> > > > > > > > > If some tables have multiple base locations how is Polaris
> > > > > > > > > going to deal with them?
> > > > > > > > >
> > > > > > > > > Sorry, when I say most of them, it was because I haven't
> > > > > > > > > tested all of them (I only tested Delta and CSV before).
> > > > > > > > > However, if Unity Catalog is only taking one location, I
> > > > > > > > > think that is a strong enough proof that one location is
> > > > > > > > > enough today.
> > > > > > > > >
> > > > > > > > > It is also more natural to start with one location, and if
> > > > > > > > > there are use cases that require support for multiple
> > > > > > > > > locations later, we can move on to V2 spec to support
> > > > > > > > > multiple table locations.
> > > > > > > > >
> > > > > > > > > We're making a specification for Polaris. I do not think it
> > > > > > > > > is sufficient to say we'll do the same as other (unspecified
> > > > > > > > > ATM) catalogs.
> > > > > > > > > If we want to migrate users from other Catalog services to
> > > > > > > > > Polaris (through federation), then Polaris will need to
> > > > > > > > > provide corresponding capabilities. For example, the Unity
> > > > > > > > > Catalog storage location is a URI representation, so when
> > > > > > > > > entities are federated from Unity Catalog, we will need to
> > > > > > > > > be able to handle the URI location.
> > > > > > > > > If URI representation is a common standard that has been
> > > > > > > > > accepted by other Catalog services like Unity Catalog and
> > > > > > > > > Gravitino, Polaris should be compatible with that, otherwise
> > > > > > > > > it might cause problems for users when they are migrating
> > > > > > > > > from one to another.
> > > > > > > > >
> > > > > > > > > What will Polaris Server do with this location?
> > > > > > > > > For generic tables, Polaris will provide credential vending
> > > > > > > > > for this location in the near future. I don't see that we
> > > > > > > > > will provide anything else in the short or mid term, since
> > > > > > > > > we still want to promote native support for Iceberg.
> > > > > > > > > Or do you have anything special in mind that you think we
> > > > > > > > > should support?
> > > > > > > > >
> > > > > > > > > If Polaris has to define it in a spec, it will be hard to
> > > > > > > > > change in the future.
> > > > > > > > > Regardless of whether it is explicitly in the spec
> > > > > > > > > definition or a reserved property key, as long as they are
> > > > > > > > > explicitly documented, they will be hard to change in the
> > > > > > > > > future. From that perspective, those two approaches seem the
> > > > > > > > > same to me.
> > > > > > > > >
> > > > > > > > > Table location is critical information that is required by
> > > > > > > > > the engine side to read and write the tables, which should
> > > > > > > > > be explicitly defined to provide better sharing across
> > > > > > > > > engines. For example, the delta table location is passed in
> > > > > > > > > the table properties with a property key of either
> > > > > > > > > "location" or "path", depending on how the table is created.
> > > > > > > > > Now, if another engine wants to read the delta table, it
> > > > > > > > > will need to understand those keys, which are controlled by
> > > > > > > > > Spark today. If Spark changes them one day, all sharing will
> > > > > > > > > stop working.
> > > > > > > > >
> > > > > > > > > As to whether we want to put it as an explicit field or a
> > > > > > > > > reserved key, I think for a common field among various table
> > > > > > > > > formats, it makes more sense to have it as an explicit
> > > > > > > > > field. For properties that are specific to a particular
> > > > > > > > > table format, it is more proper to just have a reserved key.
> > > > > > > > >
> > > > > > > > > If Polaris takes control of the location, I think we have to
> > > > > > > > > be more careful and at least try to make it future-proof.
> > > > > > > > >
> > > > > > > > > I don't think Polaris is taking control of the location; the
> > > > > > > > > location is still controlled by the engine and users today,
> > > > > > > > > like table names.
> > > > > > > > > Polaris is a Catalog service: it records the generic table
> > > > > > > > > entity, and returns the information back to the user on
> > > > > > > > > query.
> > > > > > > > > It might be able to do some validation on the location (like
> > > > > > > > > checking for special characters), but it doesn't decide
> > > > > > > > > which location the table will use. I personally don't think
> > > > > > > > > it is a bad idea to let the Catalog service also take
> > > > > > > > > control of generating the table location, but I think that
> > > > > > > > > will require a lot of work.
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Yun
> > > > > > > > >
> > > > > > > > > On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > No worries about the name. It is a possible alternative
> > > > > > > > > > spelling :)
> > > > > > > > > >
> > > > > > > > > > On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Dmitri,
> > > > > > > > > > >
> > > > > > > > > > > Sorry, I accidentally typed your name wrong in the
> > > > > > > > > > > previous reply! Apologies for this!
> > > > > > > > > > >
> > > > > > > > > > > For the S3 issue, I think we will need to deal with
> > > > > > > > > > > those regardless. Especially with the federation work
> > > > > > > > > > > going on, we will need to handle all those entities
> > > > > > > > > > > eventually coming from different Catalogs, and the URI
> > > > > > > > > > > format seems to be the standard format used by various
> > > > > > > > > > > Catalog services.
> > > > > > > > > > >
> > > > > > > > > > > Best Regards,
> > > > > > > > > > > Yun
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Dimitri and Eric,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks a lot for the feedback!
> > > > > > > > > > > >
> > > > > > > > > > > > For the questions:
> > > > > > > > > > > > - Is it one value or many?
> > > > > > > > > > > > It will be one value, similar to the location in
> > > > > > > > > > > > Iceberg and the storage_location in Unity Catalog.
> > > > > > > > > > > >
> > > > > > > > > > > > Regarding the point about having new data in new
> > > > > > > > > > > > locations and keeping old data in old locations, do we
> > > > > > > > > > > > support that for Iceberg today?
> > > > > > > > > > > > For most of the Spark tables, it seems to only have
> > > > > > > > > > > > one location. Also, I think it is better to start
> > > > > > > > > > > > restricted first, and then extend it to allow multiple
> > > > > > > > > > > > locations when the use case arises.
> > > > > > > > > > > >
> > > > > > > > > > > > Ref:
> > > > > > > > > > > > Iceberg location:
> > > > > > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > > > > > > Storage location in Unity Catalog:
> > > > > > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > > > > > >
> > > > > > > > > > > > - Is it a URI?
> > > > > > > > > > > > Yes, it will be a URI, which seems to be the standard
> > > > > > > > > > > > catalog implementation.
> > > > > > > > > > > > Regarding the point about s3 vs. s3a, I assume that is
> > > > > > > > > > > > a common problem that every catalog implementation
> > > > > > > > > > > > needs to address, and we will stay the same on this
> > > > > > > > > > > > part. At least from the load table point of view, the
> > > > > > > > > > > > Spark engine knows how to deal with such cases.
> > > > > > > > > > > >
> > > > > > > > > > > > - Does it point to any particular file?
> > > > > > > > > > > > No, it doesn't point to a particular file. It is the
> > > > > > > > > > > > base table location.
> > > > > > > > > > > >
> > > > > > > > > > > > - Is it a common prefix of all files within a table?
> > > > > > > > > > > > It is supposed to be the base table location, which
> > > > > > > > > > > > theoretically should be the common prefix of all files
> > > > > > > > > > > > within a table I believe.
> > > > > > > > > > > >
> > > > > > > > > > > > - What happens when a value does not match these
> > > > > > > > > > > > expectations?
> > > > > > > > > > > > Whether it is one value or many is restricted by the
> > > > > > > > > > > > spec already.
> > > > > > > > > > > > For URI format, I think we can do a format check, and
> > > > > > > > > > > > fail the request if it does not pass.
> > > > > > > > > > > > Other than that, we will not do any other special
> > > > > > > > > > > > check, and we rely on the client to put the correct
> > > > > > > > > > > > value, otherwise, the other engine will not be able to
> > > > > > > > > > > > successfully read the table.
> > > > > > > > > > > >
> > > > > > > > > > > > For the location keyword, as Eric has pointed out, we
> > > > > > > > > > > > can potentially have a reserved key for the
> > > > > > > > > > > > properties. However, location is a common enough key
> > > > > > > > > > > > among various table formats that it is worth a
> > > > > > > > > > > > dedicated key to help store and load the information
> > > > > > > > > > > > in a more straightforward way. For things that are
> > > > > > > > > > > > specific to one or two formats, I think it makes more
> > > > > > > > > > > > sense to use a reserved property key.
> > > > > > > > > > > >
> > > > > > > > > > > > As a reference, in Iceberg, the CreateTable request
> > > > > > > > > > > > and TableMetadata do have an explicit location key in
> > > > > > > > > > > > the spec. For write.data.path and write.metadata.path,
> > > > > > > > > > > > they are passed as properties today.
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Yun
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Another point: I'm pretty sure sooner or later users
> > > > > > > > > > > > > will want to move their data to some other location.
> > > > > > > > > > > > > As an option users may want to write new files into
> > > > > > > > > > > > > another location but keep old files in place.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also: if the location is a URI, how do we deal with
> > > > > > > > > > > > > s3 vs. s3a for example?
> > > > > > > > > > > > >
> > > > > > > > > > > > > In Iceberg it is quite common for different engines
> > > > > > > > > > > > > to use different access tools, which often leads to
> > > > > > > > > > > > > different URI schemes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Dmitri.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > All good questions, Dmitri. I’m especially
> > > > > > > > > > > > > > interested in the first one, as from what I
> > > > > > > > > > > > > > understand Iceberg tables can have metadata and
> > > > > > > > > > > > > > data at two different paths that we need to vend
> > > > > > > > > > > > > > credentials for.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For Iceberg tables, we just use special properties
> > > > > > > > > > > > > > to track these locations. I wonder if we couldn’t
> > > > > > > > > > > > > > do the same for generic tables.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Yun,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please clarify the meaning of the value of the
> > > > > > > > > > > > > > > new location attribute.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > - Is it one value or many?
> > > > > > > > > > > > > > > - Is it a URI?
> > > > > > > > > > > > > > > - Does it point to any particular file?
> > > > > > > > > > > > > > > - Is it a common prefix of all files within a
> > > > > > > > > > > > > > > table?
> > > > > > > > > > > > > > > - What happens when a value does not match these
> > > > > > > > > > > > > > > expectations?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Dmitri.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 2025/05/07 21:50:19 yun zou wrote:
> > > > > > > > > > > > > > > > Hi folks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to propose adding an optional
> > > > > > > > > > > > > > > > `location` field to the CreateGenericTable
> > > > > > > > > > > > > > > > request and LoadGenericTable response.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The `location` is the location for the table,
> > > > > > > > > > > > > > > > which is common to most table formats
> > > > > > > > > > > > > > > > including Iceberg, Delta, Hudi, csv, parquet
> > > > > > > > > > > > > > > > etc.
> > > > > > > > > > > > > > > > The location information is critical for
> > > > > > > > > > > > > > > > loading the table on the engine side. Having a
> > > > > > > > > > > > > > > > dedicated keyword could help improve the
> > > > > > > > > > > > > > > > robustness of cross-engine sharing, instead of
> > > > > > > > > > > > > > > > relying on the properties passed by the client
> > > > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Furthermore, this information is also required
> > > > > > > > > > > > > > > > to provide credential vending capabilities
> > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Here is the PR for adding the spec:
> > > > > > > > > > > > > > > > https://github.com/apache/polaris/pull/1543
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking forward to your reply and feedback!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > > Yun