The Open API spec defines the API for obtaining the "location" property for
Generic Tables.

My concern is with the meaning of that property, which is at the level of
Generic Table files. It is essentially about making a table format spec for
Generic Tables, even though this spec may be very simple (compared to other
table formats). This is why I think a separate doc would make sense.

Open API descriptions can (and should) direct users to the Generic Tables
doc (in Polaris) for the meaning of the location (and other) properties.
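
To illustrate the kind of rule such a doc could pin down, here is a rough
Python sketch. Everything in it is hypothetical: the function names, the
scheme aliases (treating s3a/s3n as s3), and the "all files live under one
base location" rule are assumptions for discussion, not anything Polaris
defines today.

```python
from urllib.parse import urlparse

# Hypothetical scheme aliases; treating s3a/s3n as s3 is an assumption,
# not something the Polaris spec states.
SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}


def normalize_location(location: str) -> str:
    """Return a canonical form of a table base location URI."""
    parsed = urlparse(location)
    if not parsed.scheme:
        # A bare path such as "/tmp/tbl" is rejected in this sketch.
        raise ValueError(f"location is not a URI with a scheme: {location!r}")
    scheme = SCHEME_ALIASES.get(parsed.scheme.lower(), parsed.scheme.lower())
    # Force a trailing slash so ".../tbl" does not match ".../tbl2/file".
    path = parsed.path.rstrip("/") + "/"
    return f"{scheme}://{parsed.netloc}{path}"


def is_under_location(file_uri: str, table_location: str) -> bool:
    """True if file_uri lies under the table's single base location."""
    return normalize_location(file_uri).startswith(
        normalize_location(table_location)
    )
```

With these assumptions, a file at s3://bucket/warehouse/tbl/data/f.parquet
would count as inside a table declared at s3a://bucket/warehouse/tbl, while
anything under s3://bucket/warehouse/tbl2/ would not.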

Cheers,
Dmitri.

On Mon, May 19, 2025 at 5:36 PM Yufei Gu <flyrain...@gmail.com> wrote:

> >
> > Open API yaml comments are not sufficient, IMHO. I'd prefer to have a
> > dedicated doc page to define expectations and compliance.
>
>
> I'm not against a dedicated doc page for that, but I think the Open API
> spec, including comments, should be the source of truth, rather than
> anything else.
>
>
> Yufei
>
>
> On Mon, May 19, 2025 at 1:45 PM Russell Spitzer <russell.spit...@gmail.com
> >
> wrote:
>
> > Yeah, I think Iceberg and Hive are the only ones trying to make life
> > difficult. I think we should cover that as well, but in changes to the
> > Iceberg Spec. Hive can just stay how it is ...
> >
> > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org>
> > wrote:
> >
> > > For context: my locations concerns are rooted in Nessie's experience
> > where
> > > we often get problem reports related to files being outside the
> declared
> > > Iceberg metadata location.
> > >
> > > Example:
> > >
> > >
> >
> https://github.com/projectnessie/nessie/issues/10817#issuecomment-2887329227
> > >
> > > I'm ok going with a single location for generic tables, but I think
> > Polaris
> > > needs to have a more strict spec for that (define where files should and
> > > should not go) because Polaris owns this spec. Polaris ought to define
> > what
> > > complies with the spec and what does not. Having a proper spec is
> > essential
> > > to ensure a mutual understanding of all parties dealing with Generic
> > > Tables.
> > >
> > > Open API yaml comments are not sufficient, IMHO. I'd prefer to have a
> > > dedicated doc page to define expectations and compliance.
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > >
> > > On Mon, May 19, 2025 at 2:17 PM Russell Spitzer <
> > russell.spit...@gmail.com
> > > >
> > > wrote:
> > >
> > > > The only multi-location table formats I'm currently aware of are
> > Hive
> > > > (partitions can live wherever) and Iceberg.
> > > >
> > > >  I think for Delta, Hudi, LanceDB, Paimon and File based tables they
> > all
> > > > have to live in the root location. I'm not sure of any other "file"
> > based
> > > > tables where this would be an issue but I'd love to know if someone
> > else
> > > > has ideas. I think with the rise in credential vending, splitting
> > things
> > > > amongst multiple prefixes is becoming less common. I don't oppose
> doing
> > > an
> > > > array of locations but it may be enough to just leave this as an
> > > extension
> > > > later. (Support location or locations)
> > > >
> > > > On Wed, May 7, 2025 at 8:52 PM yun zou <yunzou.colost...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Dmitri,
> > > > >
> > > > > > If it's not "all", it is not strong enough for a spec, IMHO. If some
> > > > tables
> > > > > have multiple base locations how is Polaris going to deal with
> them?
> > > > >
> > > > > Sorry, when I say most of them, it was because I haven't tested all
> > of
> > > > them
> > > > > (I only tested Delta and CSV before).
> > > > > However, if Unity Catalog is only taking one location, I think that
> > is
> > > a
> > > > > strong enough proof that
> > > > > one location is enough today.
> > > > >
> > > > > > It is also more natural to start with one location, and if there are
> > > > > > use cases that require support for multiple locations later, we can
> > > > > > move on to a V2 spec to support multiple table locations.
> > > > >
> > > > > We're making a specification for Polaris. I do not think it is
> > > sufficient
> > > > > to say we'll do the same as other (unspecified ATM) catalogs.
> > > > > If we want to migrate users from other Catalog services to Polaris
> > > > (through
> > > > > federation), then Polaris will need to
> > > > > provide corresponding capabilities.  For example, Unity Catalog
> > storage
> > > > > > location is a URI representation; when entities
> > > > > are federated from Unity Catalog, we will need to be able to handle
> > the
> > > > URI
> > > > > location.
> > > > > If URI representation is a common standard that has been accepted
> by
> > > > other
> > > > > Catalog services like Unity Catalog, Gravitino,
> > > > > Polaris should be compatible with that, otherwise it might cause
> > > problems
> > > > > for users when they are migrating from one to
> > > > > another.
> > > > >
> > > > > What will Polaris Server do with this location?
> > > > > > For generic tables, Polaris will provide credential vending for this
> > > > > > location in the near future. I don't see us providing
> > > > > > anything else in the short or mid term, since we still want to promote
> > > > > > native support for Iceberg.
> > > > > > Or do you have anything specific in mind that you think we should
> > > > > > support?
> > > > >
> > > > > If Polaris has to define it in a spec, it will be hard to change in
> > the
> > > > > future.
> > > > > Regardless of whether it is explicitly in the spec definition or
> as a
> > > > > reserved property key, as long as they are explicitly
> > > > > documented, they will be hard to change in the future. From that
> > > > > perspective, those two approaches seem the same to me.
> > > > >
> > > > > Table location is critical information that is required by the
> engine
> > > > side
> > > > > to read and write the tables, which should
> > > > > be explicitly defined to provide better sharing across engines. For
> > > > > example, the delta table location is passed in the
> > > > > > table properties under the key "location" or "path", depending on
> > > > > > how the table is created. Now, if another
> > > > > engine wants to read the delta table, it will need to understand
> > those
> > > > > keys, which are controlled by Spark today. If Spark
> > > > > changes them one day, all sharing will stop working.
> > > > >
> > > > > As to whether we want to put it as an explicit field or a reserved
> > > key, I
> > > > > think for a common field among various
> > > > > table formats, it makes more sense to have it as an explicit field.
> > For
> > > > > properties that are specific to a particular table format,
> > > > > it is more proper to just have a reserved key.
> > > > >
> > > > > If Polaris takes control of the location, I think we have to be
> more
> > > > > careful
> > > > > and at least try to make it future-proof.
> > > > >
> > > > > I don't think Polaris is taking control of the location, the
> location
> > > is
> > > > > still controlled by the engine and users today like table names.
> > > > > Polaris is a Catalog service, it records the generic table entity,
> > and
> > > > > returns the information back to the user on query.
> > > > > > It might be able to do some validation on the location (like checking
> > > > > > for special characters), but it doesn't decide which location
> > > > > > the table will use. I personally don't think it is a bad idea
> to
> > > let
> > > > > the Catalog service also take control of generating
> > > > > the table location, but I think that will require a lot of work.
> > > > >
> > > > > Best Regards,
> > > > > Yun
> > > > >
> > > > > On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <
> di...@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > No worries about the name. It is a possible alternative spelling
> :)
> > > > > >
> > > > > > On Wed, May 7, 2025 at 8:04 PM yun zou <
> yunzou.colost...@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Dmitri,
> > > > > > >
> > > > > > > Sorry, I accidentally typed your name wrong in the previous
> > reply!
> > > > > > > > Apologies for this!
> > > > > > >
> > > > > > > For the S3 issue, I think we will need to deal with those
> > > regardless,
> > > > > > > especially with the federation work going on, we will need to
> > > handle
> > > > > all
> > > > > > > those entities eventually coming from different Catalogs, and
> the
> > > URI
> > > > > > > > format seems to be the standard format used by various Catalog
> > services.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Yun
> > > > > > >
> > > > > > > On Wed, May 7, 2025 at 4:55 PM yun zou <
> > yunzou.colost...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Dimitri and Eric,
> > > > > > > >
> > > > > > > > Thanks a lot for the feedback!
> > > > > > > >
> > > > > > > > For the questions:
> > > > > > > > - Is one value or many?
> > > > > > > > It will be one value, similar to the location in Iceberg and
> > the
> > > > > > > > storage_location in unity catalog.
> > > > > > > >
> > > > > > > > > Regarding the point about having new data in new locations
> > and
> > > > > > keeping
> > > > > > > > old data in old locations, do we support that for Iceberg
> > > > > > > > today?
> > > > > > > > For most of the Spark tables, it seems to only have one
> > location.
> > > > > > Also, I
> > > > > > > > think it is better to start restricted first, and then extend
> > it
> > > to
> > > > > > > > > allow multiple locations when the use case arises.
> > > > > > > >
> > > > > > > > Ref:
> > > > > > > > Iceberg location:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > > Storage location in Unity Catalog:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > >
> > > > > > > > - Is it a URI?
> > > > > > > > > Yes, it will be a URI, which seems to be the standard across catalog
> > > > > > > > > implementations.
> > > > > > > > > Regarding the point about s3 vs. s3a, I assume that is a
> > common
> > > > > > > > problem that every catalog implementation needs to address,
> and
> > > we
> > > > > will
> > > > > > > > stay the same on this part. At least from the load table
> point
> > of
> > > > > view,
> > > > > > > > Spark engine knows how to deal with such cases.
> > > > > > > >
> > > > > > > > - Does it point to any particular file?
> > > > > > > > No, it doesn't point to a particular file. It is the base
> table
> > > > > > location.
> > > > > > > >
> > > > > > > > - Is it a common prefix of all files within a table?
> > > > > > > > It is supposed to be the base table location, which
> > theoretically
> > > > > > should
> > > > > > > > be the common prefix of all files within a table I believe.
> > > > > > > >
> > > > > > > > - What happens when a value does not match these
> expectations?
> > > > > > > > Whether it is one value or many is restricted by the spec
> > > already.
> > > > > > > > For URI format, I think we can do a format check, and fail
> it.
> > > > > > > > Other than that, we will not do any other special check, and
> we
> > > > rely
> > > > > on
> > > > > > > > the client to put the correct value, otherwise, the other
> > engine
> > > > will
> > > > > > > > not be able to successfully read the table.
> > > > > > > >
> > > > > > > > For the location keyword, as Eric has pointed out, we can
> > > > potentially
> > > > > > > have
> > > > > > > > a reserved key for the properties. However, location is a
> > common
> > > > > > > > > enough key among various table formats that it is worth a
> > dedicated
> > > > key
> > > > > to
> > > > > > > > help store and load the information in a more straightforward
> > > > > > > > way.  For things that are specific to one or two formats, I
> > think
> > > > it
> > > > > > > makes
> > > > > > > > more sense to use a reserved property key.
> > > > > > > >
> > > > > > > > As a reference, in Iceberg, the CreateTable request and
> > > > TableMetadata
> > > > > > > > do
> > > > > > > > have an explicit location key in the spec. For
> write.data.path
> > > > > > > > and write.metadata.path, they are passed as properties today.
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Yun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <
> > > > di...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Another point: I'm pretty sure sooner or later users will
> want
> > > to
> > > > > move
> > > > > > > >> their data to some other location. As an option users may
> want
> > > to
> > > > > > write
> > > > > > > >> new
> > > > > > > >> files into another location but keep old files in place.
> > > > > > > >>
> > > > > > > >> Also: if the location is a URI, how do we deal with s3 vs.
> s3a
> > > for
> > > > > > > >> example?
> > > > > > > >>
> > > > > > > >> In Iceberg it is quite common for different engines to use
> > > > different
> > > > > > > >> access
> > > > > > > >> tools, which often leads to different URI schemes.
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >> Dmitri.
> > > > > > > >>
> > > > > > > >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <
> > > > > eric.w.mayn...@gmail.com
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > All good questions Dmitri — I’m especially interested in
> the
> > > > first
> > > > > > one
> > > > > > > >> as
> > > > > > > >> > from what I understand Iceberg tables can have metadata
> and
> > > data
> > > > > at
> > > > > > > two
> > > > > > > >> > different paths that we need to vend credentials for.
> > > > > > > >> >
> > > > > > > >> > For iceberg tables, we just use special properties to
> track
> > > > these
> > > > > > > >> > locations. I wonder if we couldn’t do the same for generic
> > > > tables.
> > > > > > > >> >
> > > > > > > >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <
> > > > > > di...@apache.org>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi Yun,
> > > > > > > >> > >
> > > > > > > >> > > Please clarify the meaning of the value of the new
> > location
> > > > > > > attribute.
> > > > > > > >> > >
> > > > > > > > >> > > - Is it one value or many?
> > > > > > > >> > > - Is it a URI?
> > > > > > > >> > > - Does it point to any particular file?
> > > > > > > >> > > - Is it a common prefix of all files within a table?
> > > > > > > >> > > - What happens when a value does not match these
> > > > expectations?
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Dmitri.
> > > > > > > >> > >
> > > > > > > >> > > On 2025/05/07 21:50:19 yun zou wrote:
> > > > > > > >> > > > Hi folks,
> > > > > > > >> > > >
> > > > > > > >> > > > I would like to propose to add an optional `location`
> > > field
> > > > to
> > > > > > > > >> > > > CreateGenericTable Request and LoadGenericTable
> response.
> > > > > > > >> > > >
> > > > > > > >> > > > The `location` is the location for the table, which is
> > > > common
> > > > > to
> > > > > > > >> most
> > > > > > > >> > > table
> > > > > > > >> > > > formats including Iceberg, Delta, Hudi, csv, parquet
> > etc.
> > > > The
> > > > > > > >> location
> > > > > > > >> > > > information is critical for loading the table at
> engine
> > > > side,
> > > > > > > >> having a
> > > > > > > >> > > > dedicated keyword could help improve the robustness
> for
> > > > cross
> > > > > > > engine
> > > > > > > >> > > > sharing, instead of relying on the properties passed
> by
> > > the
> > > > > > client
> > > > > > > >> > side.
> > > > > > > >> > > >
> > > > > > > >> > > > Furthermore, this information is also required to
> > provide
> > > > > > > credential
> > > > > > > >> > > > vending capabilities later.
> > > > > > > >> > > >
> > > > > > > >> > > > Here is the PR for adding the spec:
> > > > > > > >> > > > https://github.com/apache/polaris/pull/1543
> > > > > > > >> > > >
> > > > > > > >> > > > Looking forward to your reply and feedback!
> > > > > > > >> > > >
> > > > > > > >> > > > Best Regards,
> > > > > > > >> > > > Yun
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
