Re: [Discuss] Add `location` to generic table spec

2025-05-08 Thread yun zou
Hi folks,

Thanks a lot for the feedback! I want to summarize the major discussion
points here so that we can focus the discussion on those points and avoid
spreading topics around.

The major questions I see are the following:
1) Shall we introduce an explicit definition of location, and what is
`location`?
The location refers to the root location of the table, and we should
definitely document it clearly.
The table root location is required information for different engines to
access the table with formats like Delta, CSV, etc. It
is important that we explicitly define this information to provide robust
cross-engine interoperability.
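
To make this concrete, here is a minimal sketch of how an engine consumes
the root location for a non-Iceberg format; it assumes Spark with the Delta
connector, and the bucket/path below is only a placeholder:

    import org.apache.spark.sql.SparkSession

    fun main() {
        val spark = SparkSession.builder()
            .appName("read-generic-table")
            .getOrCreate()

        // For Delta/CSV/Parquet the engine reads the table directly by its
        // root location, so the catalog must hand that location back
        // unambiguously.
        val df = spark.read()
            .format("delta")
            .load("s3://my-bucket/warehouse/db/my_table") // placeholder location
        df.show()
    }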

2) Do we support a single table root location or multiple root locations?
Today, only Iceberg tables allow multiple root locations; other table
formats, including Delta and Hive-style tables (CSV, Parquet), only
support a single table root location. Since Generic Table is not designed
to serve Iceberg functionality today, there is no use case for multiple
table root locations, and starting with a single location should be
sufficient.

If, in the future, we want to repurpose Generic Table to also support
Iceberg capabilities and encounter the following use case:
"people may want to move their data to some other location. As an option
users may want to write new
files into another location but keep old files in place"

One option is to introduce an extra `additional location` field to record
the old data locations, which makes a clear separation between the current
location and the old data locations, similar to what Glue has introduced
(a rough sketch follows the Glue link below). Another option is to move on
to a V2 spec.

Glue Table Catalog:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table
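
For illustration only (not a spec proposal), that option could look roughly
like the sketch below; the field names are placeholders, loosely mirroring
the additional-locations idea from the Glue Table API linked above:

    // Placeholder names, sketching the `additional location` option: one
    // current root location plus older root locations that still hold data.
    data class GenericTableLocations(
        val baseLocation: String,                           // where new files are written
        val additionalLocations: List<String> = emptyList() // historical data locations
    )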

3) Should the location allow all URI schemes and special characters,
especially s3a and s3n?
Various issues have been raised in the Iceberg community when dealing with
all those S3 schemes during path matching. However, since the root
table location is not an absolute path, and Generic Table has restricted
support in the short and mid term (Polaris is still promoting
native Iceberg support), Polaris will not do any complicated matching
operation with the path, so I do not see a problem with allowing
all schemes, or even special characters. Sometimes people may use
other schemes for specific reasons such as performance.

Dmitri: Is the above your concern? If not, could you elaborate more on
your concerns?
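
To show what "no complicated matching" could mean in practice, here is a
hedged sketch of lenient location handling: the value only needs to parse
as a URI with some scheme, and no scheme normalization or path comparison
is performed. The function and checks are hypothetical, not existing
Polaris code.

    import java.net.URI

    // Hypothetical lenient validation: accept any parseable URI with a
    // scheme (s3, s3a, s3n, abfss, gs, ...) and never normalize or compare
    // schemes.
    fun validateBaseLocation(location: String): URI {
        val uri = URI(location) // throws URISyntaxException if malformed
        require(!uri.scheme.isNullOrBlank()) {
            "base location must include a scheme: $location"
        }
        return uri
    }

    fun main() {
        listOf("s3://bucket/db/tbl", "s3a://bucket/db/tbl", "s3n://bucket/db/tbl")
            .forEach { println(validateBaseLocation(it)) }
    }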

4) Should the location be an explicit field or a reserved property key?
Given that the table root location is important information for most
non-Iceberg table formats, having an explicit field could make things
clearer when sharing across engines.
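
To contrast the two options from an engine's point of view, here is a small
hypothetical sketch; neither the explicit field nor the reserved key name
below is part of the current spec:

    // Option (a): `location` as an explicit field; option (b): a reserved
    // key in the free-form properties map. Both names are placeholders.
    data class GenericTable(
        val name: String,
        val format: String,
        val location: String?,               // option (a)
        val properties: Map<String, String>  // option (b)
    )

    fun resolveBaseLocation(table: GenericTable): String =
        table.location
            ?: table.properties["location"]  // placeholder reserved key
            ?: error("table ${table.name} has no base location")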

Please let me know if there is any point I am missing!

Looking forward to your reply and feedback!

Best Regards,
Yun

On Wed, May 7, 2025 at 6:51 PM yun zou  wrote:

> Hi Dmitri,
>
> If it's not "all" is it not strong enough for a spec, IMHO. If some tables
> have multiple base locations how is Polaris going to deal with them?
>
> Sorry, when I said most of them, it was because I haven't tested all of
> them (I only tested Delta and CSV before).
> However, if Unity Catalog only takes one location, I think that is
> strong enough proof that one location is enough today.
>
> It is also more natural to start with one location, and if there are use
> cases that require support for multiple locations later, we can move on
> to a V2 spec to support multiple table locations.
>
> We're making a specification for Polaris. I do not think it is sufficient
> to say we'll do the same as other (unspecified ATM) catalogs.
> If we want to migrate users from other Catalog services to Polaris
> (through federation), then Polaris will need to
> provide corresponding capabilities. For example, the Unity Catalog storage
> location is a URI representation; when entities are federated from Unity
> Catalog, we will need to be able to handle the URI location.
> If the URI representation is a common standard that has been accepted by
> other catalog services like Unity Catalog and Gravitino, Polaris should be
> compatible with it; otherwise it might cause problems for users when they
> are migrating from one to another.
>
> What will Polaris Server do with this location?
> For generic tables, Polaris will provide credential vending for this
> location in the near future; I don't see us providing
> anything else in the short or mid term, since we still want to promote
> native support for Iceberg.
> Or is there anything specific in mind that you think we should
> support?
>
> If Polaris has to define it in a spec, it will be hard to change in the
> future.
> Regardless of whether it is an explicit field in the spec definition or a
> reserved property key, as long as it is explicitly
> documented, it will be hard to change in the future. From that
> perspective, the two approaches seem the same to me.
>
> Table loc

[Discuss] Credentials Management in Polaris

2025-05-08 Thread Rulin Xing
Hi folks,

As Polaris expands its support for external services, such as federated
catalogs and cloud storage, it needs to securely access systems like AWS
S3, AWS Glue, Azure Storage, and others. This external access requires
Polaris to handle credentials correctly, whether they’re long-lived
credentials in self-managed deployments or temporary credentials in
multi-tenant SaaS setups.

We've had several ongoing discussions about how credential handling should
evolve, especially in light of the work around SigV4 Auth for catalog
federation.
* [PR#1191] Fix updating the storage config
* [PR#1506] Spec: Add SigV4 Auth Support for Catalog Federation
* [Spec] Add SigV4 Auth Support for Catalog Federation

To frame the problem and proposed solutions, I’ve drafted a design doc:
Apache Polaris Creds Management Proposal:
https://docs.google.com/document/d/1MAW87DtyHWPPNIEkUCRVUKBGjhh5bPn0GbtV7fifm30/edit?usp=sharing

The proposal breaks the problem into four key areas:
1. How Polaris gets vendor-specific service identity and credentials
(e.g., from server config and service context registry)
2. How Polaris surfaces service identity info to users
(e.g., exposing userArn or consentUrl for trust policy setup)
3. How Polaris injects service-managed identity fields into catalog or
storage configs
(e.g., using entity mutators at creation time)
4. How Polaris retrieves temporary credentials to access external services
(e.g., via STS, with caching support; see the sketch after this list)
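
For area 4, a minimal sketch of what STS-based temporary credentials with
caching could look like; the role ARN handling, session name, and cache
policy here are assumptions, not the proposal's actual design:

    import software.amazon.awssdk.services.sts.StsClient
    import software.amazon.awssdk.services.sts.model.AssumeRoleRequest
    import software.amazon.awssdk.services.sts.model.Credentials
    import java.time.Duration
    import java.time.Instant
    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical cache: reuse vended credentials per role ARN until they
    // are within 5 minutes of expiring, then ask STS for a fresh set.
    class StsCredentialCache(private val sts: StsClient = StsClient.create()) {
        private val cache = ConcurrentHashMap<String, Credentials>()

        fun credentialsFor(roleArn: String): Credentials =
            cache.compute(roleArn) { _, cached ->
                if (cached != null &&
                    cached.expiration().isAfter(Instant.now().plus(Duration.ofMinutes(5)))) {
                    cached
                } else {
                    sts.assumeRole(
                        AssumeRoleRequest.builder()
                            .roleArn(roleArn)
                            .roleSessionName("polaris-vended-session") // placeholder
                            .durationSeconds(3600)
                            .build()
                    ).credentials()
                }
            }!!
    }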

The goal is to unify credential handling across storage and connection
configs, support both SaaS and self-managed deployments, and cleanly
separate user-provided config from Polaris-managed properties.

Would love to hear your thoughts, feedback, and suggestions. Happy to
refine based on feedback!

Thanks,
Rulin


Re: [Discuss] Credentials Management in Polaris

2025-05-08 Thread Yufei Gu
Thanks a lot for driving this, Rulin! Left some comments in the doc. I
think this is the right direction to make credential management more secure
and more flexible.

Yufei


On Thu, May 8, 2025 at 1:34 PM Rulin Xing  wrote:

> Hi folks,
>
> As Polaris expands its support for external services, such as federated
> catalogs and cloud storage, it needs to securely access systems like AWS
> S3, AWS Glue, Azure Storage, and others. This external access requires
> Polaris to handle credentials correctly, whether they’re long-lived
> credentials in self-managed deployments or temporary credentials in
> multi-tenant SaaS setups.
>
> We've had several ongoing discussions about how credential handling should
> evolve, especially in light of the work around SigV4 Auth for catalog
> federation.
> * [PR#1191] Fix updating the storage config
> 
> * [PR#1506] Spec: Add SigV4 Auth Support for Catalog Federation
> 
> * [Spec] Add SigV4 Auth Support for Catalog Federation
> 
>
> To frame the problem and proposed solutions, I’ve drafted a design doc:
> Apache Polaris Creds Management Proposal
> <
> https://docs.google.com/document/d/1MAW87DtyHWPPNIEkUCRVUKBGjhh5bPn0GbtV7fifm30/edit?usp=sharing
> >
>
> The proposal breaks the problem into four key areas:
> 1. How Polaris gets vendor-specific service identity and credentials
> (e.g., from server config and service context registry)
> 2. How Polaris surfaces service identity info to users
> (e.g., exposing userArn or consentUrl for trust policy setup)
> 3. How Polaris injects service-managed identity fields into catalog or
> storage configs
> (e.g., using entity mutators at creation time)
> 4. How Polaris retrieves temporary credentials to access external services
> (e.g., via STS, with caching support)
>
> The goal is to unify credential handling across storage and connection
> configs, support both SaaS and self-managed deployments, and cleanly
> separate user-provided config from Polaris-managed properties.
>
> Would love to hear your thoughts, feedback, and suggestions. Happy to
> refine based on feedback!
>
> Thanks,
> Rulin
>


Re: Polaris SNAPSHOT available and nightly build

2025-05-08 Thread Yufei Gu
Great news! Are we planning to publish a nightly Docker image as a follow-up?

Yufei


On Wed, May 7, 2025 at 12:49 AM Jean-Baptiste Onofré 
wrote:

> Hi folks,
>
> FYI, I merged the PR about nightly builds (SNAPSHOTs).
>
> The GitHub Action ran yesterday (as expected):
> https://github.com/apache/polaris/actions/workflows/nightly.yml
> and the SNAPSHOT artifacts have been deployed in the snapshots
> repository (for instance
>
> https://repository.apache.org/content/groups/snapshots/org/apache/polaris/polaris-core/0.11.0-beta-incubating-SNAPSHOT/
> ).
>
> Regards
> JB
>
> On Wed, Apr 16, 2025 at 5:03 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi folks,
> >
> > I received several requests to get Polaris SNAPSHOT artifacts
> > published (to build external tools, or from polaris-tools repo).
> >
> > I did a first SNAPSHOT deployment:
> >
> https://repository.apache.org/content/groups/snapshots/org/apache/polaris/
> >
> > I also created a PR to have a GH Action for nightly build, publishing
> > SNAPSHOT every day:
> > https://github.com/apache/polaris/pull/1383
> >
> > Thanks !
> > Regards
> > JB
>