Re: [Discuss] Add `location` to generic table spec

yun zou Thu, 08 May 2025 14:52:33 -0700

Hi folks,

Thanks a lot for the feedback! I want to summarize the major discussion
points here so that we can focus on them and avoid spreading the
conversation across too many topics.


The major questions I see are the following:
1) Shall we introduce an explicit definition of location, and what is
`location`?
The location refers to the root location of the table, and we should
definitely document it clearly.
The table root location is required information for different engines to
access tables in formats like Delta, CSV, etc. It
is important that we explicitly define this information to provide robust
cross-engine interoperability.

2) Do we support a single table root location or multiple root locations?
Today, only Iceberg tables allow multiple root locations; other table
formats, including Delta and Hive-style tables (CSV, Parquet), only
support a single table root location. Since Generic Table is not designed to
serve Iceberg functionality today, there is no use case
for multiple table root locations, and starting with a single location
should be sufficient.

If, in the future, we want to repurpose Generic Table to also support
Iceberg capabilities and encounter the following use case:
"people may want to move their data to some other location. As an option
users may want to write new
files into another location but keep old files in place"

One option is to introduce an extra `additional location` field to
record the
old data locations, which would clearly separate the current location
from the
old data locations, similar to what Glue has introduced. Another
option is to address this in a V2 spec.

Glue Table Catalog:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table
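
To make the "additional location" option concrete, here is a rough sketch
in Python. It is only an illustration, not the Polaris spec: the class
`GenericTable`, the field `additional_locations`, and the helper `relocate`
are hypothetical names I made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GenericTable:
    """Hypothetical in-memory view of a generic table entity (not the spec)."""
    name: str
    format: str
    # Current table root location (single value).
    location: Optional[str] = None
    # Hypothetical field: previous root locations that still hold old files.
    additional_locations: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)


def relocate(table: GenericTable, new_location: str) -> GenericTable:
    """Return a copy pointing at new_location, remembering the old root."""
    old = [table.location] if table.location else []
    return GenericTable(
        name=table.name,
        format=table.format,
        location=new_location,
        additional_locations=table.additional_locations + old,
        properties=dict(table.properties),
    )
```

With this shape, a reader can always tell which location is current and
which ones only contain older files.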

3) Should the location allow all URI schemes and special characters,
especially s3a, s3n?
There are various issues raised in the Iceberg community when dealing with
all those S3 schemes during path matching. However, since the root
table location is not an absolute path, and Generic Table has restricted
support in the short and mid term (Polaris is still promoting
native Iceberg support), Polaris will not do any complicated matching
operation on the path. I do not see a problem with allowing
all schemes, or even special characters. Sometimes people use
other schemes for specific reasons such as performance.
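
As a rough illustration of what a permissive check could look like, here
is a minimal Python sketch. It assumes the only validation is "the value
parses as a URI with a scheme"; the function name `check_location` is
hypothetical and this is not what Polaris implements today.

```python
from urllib.parse import urlsplit


def check_location(location: str) -> str:
    """Return location unchanged if it looks like a URI, else raise ValueError.

    Assumption for this sketch: any scheme (s3, s3a, s3n, file, ...) is
    accepted, and no path matching or normalization is performed.
    """
    parts = urlsplit(location)
    if not parts.scheme:
        raise ValueError(f"location has no URI scheme: {location!r}")
    if not (parts.netloc or parts.path):
        raise ValueError(f"location has no bucket or path: {location!r}")
    return location
```

Under this sketch, `s3a://bucket/db/tbl` and `s3n://bucket/db/tbl` are both
accepted as-is, while a bare path such as `/tmp/tbl` is rejected.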

Dmitri: Is the above your concern? If not, could you elaborate more on the
concerns?

4) Should the location be an explicit field or a reserved property key?
The table root location is important information for most
non-Iceberg table formats, so having an explicit field could make
things clearer when sharing across engines.
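
To illustrate the difference from the engine side, here is a small Python
sketch. The dict layout and the function `resolve_location` are
hypothetical; the "location" and "path" property keys are the ones I
mentioned earlier in this thread for Delta tables.

```python
from typing import Mapping, Optional

# Property keys mentioned earlier in this thread for Delta tables.
LEGACY_KEYS = ("location", "path")


def resolve_location(table: Mapping[str, object]) -> Optional[str]:
    """Prefer the explicit `location` field; fall back to property keys."""
    explicit = table.get("location")
    if isinstance(explicit, str) and explicit:
        return explicit
    props = table.get("properties")
    if isinstance(props, dict):
        # Without an explicit field, every engine has to know these keys.
        for key in LEGACY_KEYS:
            value = props.get(key)
            if value:
                return value
    return None
```

With an explicit field, the fallback loop is unnecessary and engines do
not need to agree on property key names.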

Please let me know if there is any point I am missing!

Looking forward to your reply and feedback!

Best Regards,
Yun

On Wed, May 7, 2025 at 6:51 PM yun zou <[email protected]> wrote:

> Hi Dmitri,
>
> If it's not "all", it is not strong enough for a spec, IMHO. If some tables
> have multiple base locations how is Polaris going to deal with them?
>
> Sorry, when I say most of them, it was because I haven't tested all of
> them (I only tested Delta and CSV before).
> However, if Unity Catalog is only taking one location, I think that is a
> strong enough proof that
> one location is enough today.
>
> It is also more natural to start with one location, and if there are use
> cases that
> require support for multiple locations later, we can move on to V2 spec to
> support multiple
> tables locations.
>
> We're making a specification for Polaris. I do not think it is sufficient
> to say we'll do the same as other (unspecified ATM) catalogs.
> If we want to migrate users from other Catalog services to Polaris
> (through federation), then Polaris will need to
> provide corresponding capabilities.  For example, Unity Catalog storage
> location is a URI representation; when entities
> are federated from Unity Catalog, we will need to be able to handle the
> URI location.
> If URI representation is a common standard that has been accepted by
> other Catalog services like Unity Catalog, Gravitino,
> Polaris should be compatible with that, otherwise it might cause problems
> for users when they are migrating from one to
> another.
>
> What will Polaris Server do with this location?
> For generic tables, Polaris will provide credential vending for this
> location in near future, I don't see we will provide
> anything else in short or mid term, since we still want to promote
> native support for Iceberg.
> Or if you have anything special in your mind that you think we should
> support?
>
> If Polaris has to define it in a spec, it will be hard to change in the
> future.
> Regardless of whether it is explicitly in the spec definition or as a
> reserved property key, as long as they are explicitly
> documented, they will be hard to change in the future. From that
> perspective, those two approaches seem the same to me.
>
> Table location is critical information that is required by the engine side
> to read and write the tables, which should
> be explicitly defined to provide better sharing across engines. For
> example, the delta table location is passed in the
> table properties with a property key either "location" or "path" depending
> on how the table is created. Now, if another
> engine wants to read the delta table, it will need to understand those
> keys, which are controlled by Spark today. If Spark
> changes them one day, all sharing will stop working.
>
> As to whether we want to put it as an explicit field or a reserved key, I
> think for a common field among various
> table formats, it makes more sense to have it as an explicit field. For
> properties that are specific to a particular table format,
> it is more proper to just have a reserved key.
>
> If Polaris takes control of the location, I think we have to be more
> careful
> and at least try to make it future-proof.
>
> I don't think Polaris is taking control of the location, the location is
> still controlled by the engine and users today like table names.
> Polaris is a Catalog service, it records the generic table entity, and
> returns the information back to the user on query.
> It might be able to do some validation on the location (like check special
> character), but it doesn't decide which location
> the table will use. I personally don't think it is a bad idea to let
> the Catalog service also take control of generating
> the table location, but I think that will require a lot of work.
>
> Best Regards,
> Yun
>
> On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <[email protected]>
> wrote:
>
>> No worries about the name. It is a possible alternative spelling :)
>>
>> On Wed, May 7, 2025 at 8:04 PM yun zou <[email protected]>
>> wrote:
>>
>> > Hi Dmitri,
>> >
>> > Sorry, I accidentally typed your name wrong in the previous reply!
>> > Apologies for this!
>> >
>> > For the S3 issue, I think we will need to deal with those regardless,
>> > especially with the federation work going on, we will need to handle all
>> > those entities eventually coming from different Catalogs, and the URI
>> > format seems the standard format used by various Catalog services.
>> >
>> > Best Regards,
>> > Yun
>> >
>> > On Wed, May 7, 2025 at 4:55 PM yun zou <[email protected]>
>> wrote:
>> >
>> > > Hi Dimitri and Eric,
>> > >
>> > > Thanks a lot for the feedback!
>> > >
>> > > For the questions:
>> > > - Is it one value or many?
>> > > It will be one value, similar to the location in Iceberg and the
>> > > storage_location in unity catalog.
>> > >
>> > > Regarding the point about having new data in new locations and
>> keeping
>> > > old data in old locations, do we support that for Iceberg
>> > > today?
>> > > For most of the Spark tables, it seems to only have one location.
>> Also, I
>> > > think it is better to start restricted first, and then extend it to
>> > > allow multiple locations when the use case arises.
>> > >
>> > > Ref:
>> > > Iceberg location:
>> > >
>> >
>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
>> > > Storage location in Unity Catalog:
>> > >
>> >
>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
>> > >
>> > > - Is it a URI?
>> > > Yes, it will be a URI, which seems the standard catalog
>> implementation.
>> > > Regarding the point about s3 vs. s3a, I assume that is a common
>> > > problem that every catalog implementation needs to address, and we
>> will
>> > > stay the same on this part. At least from the load table point of
>> view,
>> > > Spark engine knows how to deal with such cases.
>> > >
>> > > - Does it point to any particular file?
>> > > No, it doesn't point to a particular file. It is the base table
>> location.
>> > >
>> > > - Is it a common prefix of all files within a table?
>> > > It is supposed to be the base table location, which theoretically
>> should
>> > > be the common prefix of all files within a table I believe.
>> > >
>> > > - What happens when a value does not match these expectations?
>> > > Whether it is one value or many is restricted by the spec already.
>> > > For URI format, I think we can do a format check, and fail it.
>> > > Other than that, we will not do any other special check, and we rely
>> on
>> > > the client to put the correct value, otherwise, the other engine will
>> > > not be able to successfully read the table.
>> > >
>> > > For the location keyword, as Eric has pointed out, we can potentially
>> > have
>> > > a reserved key for the properties. However, location is a common
>> > > enough key among various table formats that it is worth a dedicated key
>> to
>> > > help store and load the information in a more straightforward
>> > > way.  For things that are specific to one or two formats, I think it
>> > makes
>> > > more sense to use a reserved property key.
>> > >
>> > > As a reference, in Iceberg, the CreateTable request and TableMetadata
>> > does
>> > > have an explicit location key in the spec. For write.data.path
>> > > and write.metadata.path, they are passed as properties today.
>> > >
>> > > Best Regards,
>> > > Yun
>> > >
>> > >
>> > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <[email protected]>
>> > > wrote:
>> > >
>> > >> Another point: I'm pretty sure sooner or later users will want to
>> move
>> > >> their data to some other location. As an option users may want to
>> write
>> > >> new
>> > >> files into another location but keep old files in place.
>> > >>
>> > >> Also: if the location is a URI, how do we deal with s3 vs. s3a for
>> > >> example?
>> > >>
>> > >> In Iceberg it is quite common for different engines to use different
>> > >> access
>> > >> tools, which often leads to different URI schemes.
>> > >>
>> > >> Cheers,
>> > >> Dmitri.
>> > >>
>> > >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <
>> [email protected]>
>> > >> wrote:
>> > >>
>> > >> > All good questions Dmitri — I’m especially interested in the first
>> one
>> > >> as
>> > >> > from what I understand Iceberg tables can have metadata and data at
>> > two
>> > >> > different paths that we need to vend credentials for.
>> > >> >
>> > >> > For iceberg tables, we just use special properties to track these
>> > >> > locations. I wonder if we couldn’t do the same for generic tables.
>> > >> >
>> > >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <
>> [email protected]>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi Yun,
>> > >> > >
>> > >> > > Please clarify the meaning of the value of the new location
>> > attribute.
>> > >> > >
>> > >> > > - Is it one value or many?
>> > >> > > - Is it a URI?
>> > >> > > - Does it point to any particular file?
>> > >> > > - Is it a common prefix of all files within a table?
>> > >> > > - What happens when a value does not match these expectations?
>> > >> > >
>> > >> > > Thanks,
>> > >> > > Dmitri.
>> > >> > >
>> > >> > > On 2025/05/07 21:50:19 yun zou wrote:
>> > >> > > > Hi folks,
>> > >> > > >
>> > >> > > > I would like to propose to add an optional `location` field to
>> > >> > > > CreateGenericTable request and LoadGenericTable response.
>> > >> > > >
>> > >> > > > The `location` is the location for the table, which is common
>> to
>> > >> most
>> > >> > > table
>> > >> > > > formats including Iceberg, Delta, Hudi, csv, parquet etc. The
>> > >> location
>> > >> > > > information is critical for loading the table at engine side,
>> > >> having a
>> > >> > > > dedicated keyword could help improve the robustness for cross
>> > engine
>> > >> > > > sharing, instead of relying on the properties passed by the
>> client
>> > >> > side.
>> > >> > > >
>> > >> > > > Furthermore, this information is also required to provide
>> > credential
>> > >> > > > vending capabilities later.
>> > >> > > >
>> > >> > > > Here is the PR for adding the spec:
>> > >> > > > https://github.com/apache/polaris/pull/1543
>> > >> > > >
>> > >> > > > Looking forward to your reply and feedback!
>> > >> > > >
>> > >> > > > Best Regards,
>> > >> > > > Yun
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>
