Inlined.

On Thu, May 22, 2025 at 7:48 AM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > Can we keep it simple for v1 [...]

> What is v1 in this context?

I meant as the first iteration, sorry for the confusion.

> Thanks,
> Dmitri.

> On Wed, May 21, 2025 at 8:42 PM Yufei Gu <flyrain...@gmail.com> wrote:

> > Can we keep it simple for v1, as one location field is enough for today’s use cases? And we can revisit multi-location support when there’s real demand.

> > The current API spec already implies that a table’s location is immutable: there’s no “alter location” call. I’m fine leaving it implicit, but we could add an explicit note to make that clear if it helps avoid confusion.

> > Yufei

> > On Wed, May 21, 2025 at 4:36 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:

> > > No two tables globally can have a location overlap? That’s a stricter requirement than we have for even Iceberg tables and doesn’t sound correct.

> > > Similarly, the restriction that you can’t change location is stricter than what we have for Iceberg.

> > > Finally, I’m still not sure what the problem is with having multiple locations. Again, we already track multiple locations for Iceberg.

> > > On Thu, May 22, 2025 at 12:32 AM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > Hi All,

> > > > Want to summarize the thread here:

> > > > For generic tables, we will add a `location` key to help cross-engine sharing and future support for credential vending.

> > > > Here is a description of this `location` key and the corresponding restrictions and responsibilities:
> > > > - `location` (OPTIONAL): table root location in URI format. For example: s3://<my-bucket>/path/to/table.
> > > >   - The table root location is a location that includes all files for the table.
> > > >   - Clients (engines) are responsible for making sure all files are written under the configured location.
> > > >   - A table with multiple root locations (i.e. containing files that are outside the configured root location) is not compliant with the current generic table support in Polaris.
> > > >   - No two tables can have the same or overlapping locations; otherwise, a ForbiddenException will be thrown on creation.
> > > >   - If no location is provided, clients or users are responsible for managing the location and location-related concerns such as path conflict checks.
> > > >   - The location configuration cannot be updated once the table is created.

> > > > This description will be added to the spec. To help non-API users discover the information easily, we will also add a site page describing the Generic Table support and its key fields.

> > > > Best Regards,
> > > > Yun

> > > > On Mon, May 19, 2025 at 11:16 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > Hi Dmitri,

> > > > > "I do not think those doc comments provide enough visibility to ensure that the key information is received by users, unless they are dealing directly with the API"
> > > > > -- Yeah, I agree that information may not be visible enough for users who don't work directly with the APIs. However, I think having a page just for "location" might be a bit of overkill. Given that generic table API support is a new catalog capability that Polaris added and that is not IRC, I think it might be worth having a more general page to describe the Polaris Generic Table support and some of the critical fields like *location*. I think we should have the description in the spec also, so that things are clear for API users.

> > > > > Please let me know what you think.

> > > > > Best Regards,
> > > > > Yun

> > > > > On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> > > > > > I believe the Open API spec and the definition of "location" are slightly different concerns.

> > > > > > The former is about the API used to obtain information about Generic Tables.

> > > > > > The latter is about the interpretation of that information. One can think of the location value being handled / transferred beyond the immediate Polaris client, in which case it loses its connection to the API, but does not lose its meaning as a location of a Generic Table.

> > > > > > Also, I think that Open API doc comments are too low-level and too obscure for people who will work with processing actual Generic Table files. I do not think those doc comments provide enough visibility to ensure that the key information is received by users, unless they are dealing directly with the API.

> > > > > > That said, if you prefer to keep the finer points about Generic Table locations in the Open API spec, I'd be fine with that.

> > > > > > Cheers,
> > > > > > Dmitri.

> > > > > > On Mon, May 19, 2025 at 6:46 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > > > Hi Dmitri,

> > > > > > > Thanks for the detailed explanation, I definitely agree we need to call out those restrictions and compliance in our Spec.

> > > > > > > As for the documentation, Polaris today already publishes the API spec. If you go to the page https://polaris.apache.org/in-dev/unreleased/ and click on the Catalog API Spec, it will lead you to the published Spec, which contains all the descriptions in the Spec. That basically means we have both published doc and spec code, and the single source of truth is the description in the doc. Or do you think we should have an extra page for the Generic Table API spec?

> > > > > > > Best Regards,
> > > > > > > Yun

> > > > > > > On Mon, May 19, 2025 at 3:20 PM Yufei Gu <flyrain...@gmail.com> wrote:

> > > > > > > > > * Clients (engines) are responsible for writing files only under the specified location.

> > > > > > > > It's nice to have a doc like that. But the open API spec is *the* place to define the behavior of client and server, and how they interact with each other. Just as we said before, a spec change is recommended to have an ML discussion.

> > > > > > > > > * A table, whose files exist outside the declared location, is not compliant with the Polaris' definition for a Generic Table.

> > > > > > > > I'm not sure we should go that far. "location" is an optional field. It's just that some features like credential vending don't work if "location" is missing.

> > > > > > > > Yufei

> > > > > > > > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> > > > > > > > > As I commented in my other recent email, I think by introducing a "location" property Polaris enters the realm of table format specs.

> > > > > > > > > This is fine, from my POV, however, since Polaris is the defining project behind that property, I believe Polaris should provide a more definitive description of the meaning and intended processing of that property.

> > > > > > > > > To repeat myself, I think the Open API spec defines only the API for obtaining the location. We need a place to define what this location means. I do not insist on calling this a "spec" for Generic Tables, but I think it deserves a separate page in Polaris docs, where it would be defined with more rigor.

> > > > > > > > > Specifically, I think we need to call out that:
> > > > > > > > > * The location is a base URI (essentially prefix) for all files in a generic table.
> > > > > > > > > * Clients (engines) are responsible for writing files only under the specified location.
> > > > > > > > > * A table, whose files exist outside the declared location, is not compliant with the Polaris' definition for a Generic Table.

> > > > > > > > > By extension, I think we ought to describe other existing properties too.

> > > > > > > > > WDYT?

> > > > > > > > > Thanks,
> > > > > > > > > Dmitri.

> > > > > > > > > On Mon, May 19, 2025 at 5:39 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > > > > > > Hi Dmitri,

> > > > > > > > > > I think for Iceberg, we all agreed that there can be multiple locations, and I definitely agree with Russell that the extension should be done with the IRC endpoints. The Generic Table APIs are designed for non-Iceberg table usage today, and we still want Iceberg table usage to go through the IRC endpoint to have full IRC support.

> > > > > > > > > > As for the following point
> > > > > > > > > > "a more strict spec for that (define where file should and should not go)"
> > > > > > > > > > Are you suggesting that Polaris needs to generate a location for the table to use? If that is the case, I don't think engines respect that today. The table locations are either generated by the engine or specified by the user. Or are you suggesting that we should have something like Iceberg, with an allowed location and a validation to make sure the location is under the allowed location? Would you mind elaborating more on this point?

> > > > > > > > > > Best Regards,
> > > > > > > > > > Yun

> > > > > > > > > > On Mon, May 19, 2025 at 1:45 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> > > > > > > > > > > Yeah I think Iceberg and Hive are the only ones trying to make life difficult, which I think we should also cover, but in changes to the Iceberg Spec. Hive can just stay how it is ...

> > > > > > > > > > > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> > > > > > > > > > > > For context: my location concerns are rooted in Nessie's experience where we often get problem reports related to files being outside the declared Iceberg metadata location.

> > > > > > > > > > > > Example:
> > > > > > > > > > > > https://github.com/projectnessie/nessie/issues/10817#issuecomment-2887329227

> > > > > > > > > > > > I'm ok going with a single location for generic tables, but I think Polaris needs to have a more strict spec for that (define where files should and should not go) because Polaris owns this spec.
> > > > > > > > > > > > Polaris ought to define what complies with the spec and what does not. Having a proper spec is essential to ensure a mutual understanding of all parties dealing with Generic Tables.

> > > > > > > > > > > > Open API yaml comments are not sufficient, IMHO. I'd prefer to have a dedicated doc page to define expectations and compliance.

> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Dmitri.

> > > > > > > > > > > > On Mon, May 19, 2025 at 2:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> > > > > > > > > > > > > The only multiple locations table formats I'm currently aware of are Hive (partitions can live wherever) and Iceberg.

> > > > > > > > > > > > > I think for Delta, Hudi, LanceDB, Paimon and File based tables they all have to live in the root location. I'm not sure of any other "file" based tables where this would be an issue but I'd love to know if someone else has ideas. I think with the rise in credential vending, splitting things amongst multiple prefixes is becoming less common.
> > > > > > > > > > > > > I don't oppose doing an array of locations but it may be enough to just leave this as an extension later. (Support location or locations)

> > > > > > > > > > > > > On Wed, May 7, 2025 at 8:52 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > > > > > > > > > > Hi Dmitri,

> > > > > > > > > > > > > > If it's not "all", it is not strong enough for a spec, IMHO. If some tables have multiple base locations how is Polaris going to deal with them?

> > > > > > > > > > > > > > Sorry, when I say most of them, it was because I haven't tested all of them (I only tested Delta and CSV before). However, if Unity Catalog is only taking one location, I think that is strong enough proof that one location is enough today.

> > > > > > > > > > > > > > It is also more natural to start with one location, and if there are use cases that require support for multiple locations later, we can move on to a V2 spec to support multiple table locations.

> > > > > > > > > > > > > > We're making a specification for Polaris. I do not think it is sufficient to say we'll do the same as other (unspecified ATM) catalogs.

> > > > > > > > > > > > > > If we want to migrate users from other Catalog services to Polaris (through federation), then Polaris will need to provide corresponding capabilities. For example, the Unity Catalog storage location is a URI representation, so when entities are federated from Unity Catalog, we will need to be able to handle the URI location. If the URI representation is a common standard that has been accepted by other Catalog services like Unity Catalog and Gravitino, Polaris should be compatible with it, otherwise it might cause problems for users when they are migrating from one to another.

> > > > > > > > > > > > > > What will Polaris Server do with this location?

> > > > > > > > > > > > > > For generic tables, Polaris will provide credential vending for this location in the near future. I don't see us providing anything else in the short or mid term, since we still want to promote native support for Iceberg.
> > > > > > > > > > > > > > Or do you have anything special in mind that you think we should support?

> > > > > > > > > > > > > > If Polaris has to define it in a spec, it will be hard to change in the future.

> > > > > > > > > > > > > > Regardless of whether it is explicitly in the spec definition or a reserved property key, as long as it is explicitly documented, it will be hard to change in the future. From that perspective, those two approaches seem the same to me.

> > > > > > > > > > > > > > Table location is critical information that is required on the engine side to read and write the tables, and it should be explicitly defined to provide better sharing across engines. For example, the delta table location is passed in the table properties with a property key of either "location" or "path", depending on how the table is created. Now, if another engine wants to read the delta table, it will need to understand those keys, which are controlled by Spark today. If Spark changes them one day, all sharing will stop working.

> > > > > > > > > > > > > > As to whether we want to put it as an explicit field or a reserved key, I think for a common field among various table formats, it makes more sense to have it as an explicit field. For properties that are specific to a particular table format, it is more proper to just have a reserved key.

> > > > > > > > > > > > > > If Polaris takes control of the location, I think we have to be more careful and at least try to make it future-proof.

> > > > > > > > > > > > > > I don't think Polaris is taking control of the location. The location is still controlled by the engine and users today, like table names. Polaris is a Catalog service: it records the generic table entity, and returns the information back to the user on query. It might be able to do some validation on the location (like checking for special characters), but it doesn't decide which location the table will use. I personally don't think it is a bad idea to let the Catalog service also take control of generating the table location, but I think that will require a lot of work.

> > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > Yun

> > > > > > > > > > > > > > On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> > > > > > > > > > > > > > > No worries about the name. It is a possible alternative spelling :)

> > > > > > > > > > > > > > > On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > > > > > > > > > > > > Hi Dmitri,

> > > > > > > > > > > > > > > > Sorry, I accidentally typed your name wrong in the previous reply! Apologies for this!

> > > > > > > > > > > > > > > > For the S3 issue, I think we will need to deal with those regardless. Especially with the federation work going on, we will eventually need to handle all those entities coming from different Catalogs, and the URI format seems to be the standard format used by various Catalog services.

> > > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > > Yun

> > > > > > > > > > > > > > > > On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:

> > > > > > > > > > > > > > > > > Hi Dimitri and Eric,

> > > > > > > > > > > > > > > > > Thanks a lot for the feedback!

> > > > > > > > > > > > > > > > > For the questions:
> > > > > > > > > > > > > > > > > - Is it one value or many?
> > > > > > > > > > > > > > > > > It will be one value, similar to the location in Iceberg and the storage_location in Unity Catalog.

> > > > > > > > > > > > > > > > > Regarding the point about having new data in new locations and keeping old data in old locations, do we support that for Iceberg today?
> > > > > > > > > > > > > > > > > Most of the Spark tables seem to have only one location. Also, I think it is better to start restricted first, and then extend it to allow multiple locations when the use case arises.

> > > > > > > > > > > > > > > > > Ref:
> > > > > > > > > > > > > > > > > Iceberg location:
> > > > > > > > > > > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > > > > > > > > > > > Storage location in Unity Catalog:
> > > > > > > > > > > > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451

> > > > > > > > > > > > > > > > > - Is it a URI?
> > > > > > > > > > > > > > > > > Yes, it will be a URI, which seems to be the standard catalog implementation. Regarding the point about s3 vs. s3a, I assume that is a common problem that every catalog implementation needs to address, and we will stay the same on this part. At least from the load table point of view, the Spark engine knows how to deal with such cases.

> > > > > > > > > > > > > > > > > - Does it point to any particular file?
> > > > > > > > > > > > > > > > > No, it doesn't point to a particular file.
> > > > > > > > > > > > > > > > > It is the base table location.

> > > > > > > > > > > > > > > > > - Is it a common prefix of all files within a table?
> > > > > > > > > > > > > > > > > It is supposed to be the base table location, which theoretically should be the common prefix of all files within a table I believe.

> > > > > > > > > > > > > > > > > - What happens when a value does not match these expectations?
> > > > > > > > > > > > > > > > > Whether it is one value or many is restricted by the spec already. For URI format, I think we can do a format check, and fail it. Other than that, we will not do any other special check, and we rely on the client to put the correct value, otherwise, the other engine will not be able to successfully read the table.

> > > > > > > > > > > > > > > > > For the location keyword, as Eric has pointed out, we can potentially have a reserved key for the properties.
> > > > > > > > > > > > > > > > > However, location is a common enough key among various table formats that it is worth a dedicated key to help store and load the information in a more straightforward way. For things that are specific to one or two formats, I think it makes more sense to use a reserved property key.

> > > > > > > > > > > > > > > > > As a reference, in Iceberg, the CreateTable request and TableMetadata do have an explicit location key in the spec. write.data.path and write.metadata.path are passed as properties today.

> > > > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > > > Yun

> > > > > > > > > > > > > > > > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> > > > > > > > > > > > > > > > > > Another point: I'm pretty sure sooner or later users will want to move their data to some other location.
> > > > > > > > > > > > > > > > > > As an option users may want to write new files into another location but keep old files in place.

> > > > > > > > > > > > > > > > > > Also: if the location is a URI, how do we deal with s3 vs. s3a for example?

> > > > > > > > > > > > > > > > > > In Iceberg it is quite common for different engines to use different access tools, which often leads to different URI schemes.

> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Dmitri.

> > > > > > > > > > > > > > > > > > On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:

> > > > > > > > > > > > > > > > > > > All good questions, Dmitri. I’m especially interested in the first one, as from what I understand Iceberg tables can have metadata and data at two different paths that we need to vend credentials for.

For Iceberg tables, we just use special properties to track these locations. I wonder if we couldn't do the same for generic tables.

On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

Hi Yun,

Please clarify the meaning of the value of the new location attribute.

- Is it one value or many?
- Is it a URI?
- Does it point to any particular file?
- Is it a common prefix of all files within a table?
- What happens when a value does not match these expectations?

Thanks,
Dmitri.
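
To make the s3 vs. s3a and common-prefix questions above concrete, here is a minimal Python sketch of one way a location check could work. It is not taken from the thread or from the Polaris code base: the function names `normalize_location` and `is_under_location`, and the choice to treat `s3`, `s3a`, and `s3n` as aliases of one scheme, are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: these schemes all address the same S3 object store.
S3_ALIASES = {"s3", "s3a", "s3n"}


def normalize_location(uri):
    """Return (scheme, bucket, path), folding the S3 aliases into "s3"."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme in S3_ALIASES:
        scheme = "s3"
    # Drop trailing slashes so "s3://b/t" and "s3://b/t/" compare equal.
    path = parts.path.rstrip("/")
    return scheme, parts.netloc, path


def is_under_location(file_uri, table_location):
    """True if file_uri lies under the table root location (hypothetical check)."""
    f_scheme, f_bucket, f_path = normalize_location(file_uri)
    t_scheme, t_bucket, t_path = normalize_location(table_location)
    if (f_scheme, f_bucket) != (t_scheme, t_bucket):
        return False
    # Compare whole path segments so ".../table2" is not treated as under ".../table".
    return f_path == t_path or f_path.startswith(t_path + "/")
```

With this sketch, a file written through `s3a://my-bucket/path/to/table/...` would be accepted for a table whose location is `s3://my-bucket/path/to/table`, while a file under a sibling directory such as `.../table2/` would not. Whether scheme aliasing belongs in the catalog or in the client is exactly the open question raised above.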

On 2025/05/07 21:50:19 yun zou wrote:

Hi folks,

I would like to propose adding an optional `location` field to the CreateGenericTable request and LoadGenericTable response.

The `location` is the location for the table, which is common to most table formats including Iceberg, Delta, Hudi, csv, parquet etc. The location information is critical for loading the table on the engine side; having a dedicated keyword could help improve the robustness of cross-engine sharing, instead of relying on the properties passed by the client side.

Furthermore, this information is also required to provide credential vending capabilities later.

Here is the PR for adding the spec:
https://github.com/apache/polaris/pull/1543

Looking forward to your reply and feedback!

Best Regards,
Yun