Re: [DISCUSS] Add Polaris Observe API

2025-04-10 Thread Eric Maynard
I think the concept is really useful. The only thing that I think requires
some more investigation is how exactly we implement this API --
where the data is stored, how long it's retained, etc. We might need to
consider pushing this data out into another service, or at least supporting
such an implementation.

I'm also glad you called out the idea of a "fine-triggered" TMS based on
events. A while ago, I had started drafting a design with a similar idea:

[image: Screenshot 2025-04-10 at 11.46.19 AM.png]

The concept was that some service can scrape events from Polaris (or
Polaris can push events to it), and that service will persist the events so
that TMS, an observability service, etc. can query those events.

To this end, I think it might be worth finishing the ongoing Event Listeners
<https://docs.google.com/document/d/1sJiFKeMlPVlqRUj8Rv4YufMMrfCq_3ZFtDuNv_8eOQ0/edit?tab=t.0#heading=h.8d519gwzsle2>
work, so we have a way to collect the kind of information that the observe API
will report. This gets us canonical event types as well.

On Wed, Apr 9, 2025 at 10:09 PM Jean-Baptiste Onofré wrote:

> Hi folks,
>
> I would like to discuss a proposal that I have in mind: the "observe" API.
>
> The purpose of this API is to return some metrics and gauges from Polaris,
> like:
> - the number of entities (tables, views, etc.) in a Polaris catalog
> - the number of times an entity has been accessed over a period
> - optionally, access to "polished" metrics for a table (extracted from
> the metadata)
> - optionally, extra details (from Parquet metrics, for instance)
>
> In terms of use cases, this API could be helpful:
> - to have a policy "activated" depending on these metrics (something
> like: policy A is only valid if a catalog has more than X tables, or
> policy B is activated when a view has been accessed more than Y times
> in the last hour, etc.). We can have the TMS service "fine-triggered" by
> these policies.
> - to be leveraged by an FGAC mechanism (e.g. governance depending on
> these metrics)
> - to be easily displayed by a UI or CLI
>
> I already have a few ideas in mind that I would be happy to share in a
> design document. But before that, I would like to get your feedback
> about this proposal.
>
> Thanks !
> Regards
> JB
>


Re: [DISCUSS] Release cycle for Spark Client

2025-04-10 Thread yun zou
Hi Dmitri,

Thanks a lot for the feedback!

For compatibility, if the client is released along with the server, one
implicit guarantee we make is that a given client version will be compatible
with the server of the same version and with later versions, until there is
some API field deprecation. For example, a client released in 1.0.0 will be
compatible with servers >= 1.0.0. If a field of a particular API is
deprecated in server version 1.2.3, then client 1.0.0 will not be compatible
with server versions >= 1.2.3. We will definitely need to make sure such
information is published on the Polaris website and in the client README, or
another place we think should contain the information.

Furthermore, I think it is important for the server to maintain API backward
compatibility: for example, only add extra optional fields to existing APIs,
and give enough time before completely deprecating a field of an existing
API, etc.

Please let me know if that makes sense. Thanks again for the feedback!

Best Regards,



On Fri, Apr 4, 2025 at 5:36 PM Dmitri Bourlatchkov  wrote:

> Hi Yun,
>
> Your proposal LGTM.
>
> However, regarding compatibility, I think this information has to be
> tracked regardless of the release cycle, because users can mix different
> client / server versions in their environments.
>
> Cheers,
> Dmitri.
>
> On Tue, Mar 25, 2025 at 5:01 PM yun zou 
> wrote:
>
> > Hi Team,
> >
> > Given that we are now introducing Spark Client, one thing we need to
> decide
> > is the release cycle for the Spark Plugin.
> >
> > I propose to bundle the client release with the Polaris main release, like
> > Iceberg. That way, users will be able to get client support for new
> > APIs in the same release, and version compatibility is implicit in the
> > release.
> >
> > An alternative is to release the client independently, like
> > spark-cassandra-connector. In that case, the client will not have to do an
> > extra release if there is no server API change. The drawback is that the
> > client release will need to start after the server release is finished in
> > case there are new changes, and extra compatible-version information needs
> > to be published to help users understand the compatibility.
> >
> > Please let me know your thoughts on the client release cycles.
> >
> > Best Regards,
> > Yun
> >
>


Re: [DISCUSS] Add Polaris Observe API

2025-04-10 Thread Jean-Baptiste Onofré
Hi Eric

Thanks a lot for your feedback!

As a first step, I would not store additional entities for this; instead, I
would "query" the existing entities (tables, etc.) and the Iceberg metadata
(including table properties) to display the metrics.

I agree about finishing Event Listeners. In the meantime, I would start
with a first version of the "Observe API", kept pretty simple (just entity
metrics like the number of tables, views, etc.); a rough sketch follows
below. The idea is to put a façade over persistence to provide some kind of
metrics (clients should not directly access the persistence layer). A first
use case would be UI/CLI, which we can extend later for a "fine-triggered"
TMS.

Regards
JB



Re: [DISCUSS] Add Polaris Observe API

2025-04-10 Thread Michael Collado
I think a simple metrics API makes a lot of sense. Decoupling it from
events also makes sense, as it would be useful to query periodically for a
variety of reasons not tied to event triggering.

Mike



Re: [DISCUSS] Release cycle for Spark Client

2025-04-10 Thread Yufei Gu
I wrote this last Friday but forgot to send it out. It is similar to what
Yun said.

It's true that users can mix different client and server versions in their
environments. However, version interoperability should be a core design
goal for Polaris or any web service. Specifically:

   - *Server backward compatibility is a must-have*: newer server versions
   must continue to support older clients. We cannot expect each server
   update to break existing clients. There are exceptions, like the removal
   of endpoints, which should be planned, announced, and taken through the
   deprecation process.
   - *Server forward compatibility is desirable*: newer clients should be
   able to interact with older servers, though they may not get access to
   newer features.

So, for example, a v1.2 client can seamlessly connect to any server with
version v1.2 or above. If it connects to an older server (v1.0 or v1.1), it
will still function, but some newer features of v1.2 may not be available.
The Iceberg "endpoints" field in /v1/config is designed specifically for
that. It isn't perfect, but it covers most of the use cases.
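
To illustrate how a newer client can degrade gracefully against an older
server, a rough sketch along these lines could gate a feature on what the
server advertises (the helper class name is made up; the endpoint strings
come from the "endpoints" list of the server's GET /v1/config response):

```java
import java.util.List;
import java.util.Set;

// Rough sketch: a newer client gating optional features on the endpoints an
// older server advertises in its GET /v1/config response.
public final class EndpointGate {

  private final Set<String> serverEndpoints;

  public EndpointGate(List<String> endpointsFromConfigResponse) {
    this.serverEndpoints = Set.copyOf(endpointsFromConfigResponse);
  }

  /** True if the server advertises the given endpoint identifier. */
  public boolean supports(String endpoint) {
    return serverEndpoints.contains(endpoint);
  }
}

// Illustrative usage: a v1.2 client falling back when talking to an older server.
// EndpointGate gate = new EndpointGate(configResponse.endpoints());
// if (gate.supports("POST /v1/{prefix}/transactions/commit")) { /* new path */ }
// else { /* fall back to the older behavior */ }
```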

With that, it's much easier for users to understand how compatibility
applies if we release the client and server with the same version number.

Yufei




[Discuss] Correct Handling of Service-Provided Storage Config Properties During Catalog Updates

2025-04-10 Thread Rulin Xing
Hi folks,

I'd like to initiate a discussion on the expected behavior when updating
catalog properties, specifically around the handling of storage
configuration fields that are automatically provided by the Polaris service.

*Background*
When a catalog is created, certain storage configuration properties are
provided by Polaris itself; Polaris users don't need to provide these
properties. Depending on the cloud provider:

- S3
  - *externalId*: Generated by Polaris if not provided. This is immutable.
  - *userARN*: Represents the Polaris service identity, provided by Polaris.
- Azure
  - *consentUrl*: URL used to authorize Polaris to access the user’s
    storage account, generated by Polaris.
  - *multiTenantAppName*: Name of the Polaris client app that must be
    granted permissions to access the specified storage.
- GCP
  - *gcsServiceAccount*: Represents the Polaris service account.

These values are not required during catalog creation; Polaris sets and
stores them automatically. Users can retrieve them via a GET request
post-creation.
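
For OSS Polaris, retrieving those values after creation might look roughly
like the following (a sketch only; the host, port, management path, and
token handling are assumptions about a typical deployment, not prescriptive):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch a catalog after creation and inspect the storage configuration
// (externalId, userARN, etc.) that Polaris filled in automatically.
public class GetCatalogExample {
  public static void main(String[] args) throws Exception {
    String polaris = "http://localhost:8181";        // assumed Polaris host/port
    String token = System.getenv("POLARIS_TOKEN");   // assumed pre-obtained bearer token

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(polaris + "/api/management/v1/catalogs/my_catalog"))
        .header("Authorization", "Bearer " + token)
        .GET()
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // The JSON body includes the storage configuration with the service-provided fields.
    System.out.println(response.body());
  }
}
```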

*Workflow:*
Here is the guidance from Open Catalog for creating a catalog:
https://other-docs.snowflake.com/en/opencatalog/create-catalog

To illustrate, consider the scenario of loading an Iceberg table from S3.

1. Before spinning up Polaris, a long-lived AWS user credential needs to be
configured for Polaris (via environment variables or configuration properties).
2. Polaris users create a catalog with an S3 storage configuration that
provides the IAM role.
3. Polaris users send a getCatalog request to get the service-provided
properties (e.g., the IAM user ARN).
4. Polaris users add the IAM user ARN (which represents Polaris) to the
trust relationship of their IAM role so that Polaris can assume the
user-provided IAM role.
5. When Polaris accesses S3, it creates an S3FileIO, which internally uses
an S3 client to send requests to S3. This S3 client leverages sub-scoped
storage credentials to read Iceberg table metadata. These credentials are
derived by assuming a customer-provided IAM role. *Polaris, acting as an
IAM user, uses long-lived AWS credentials *(AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY) *to assume this role with a restricted IAM policy
and requests temporary session credentials* (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) for use during this session.
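
Step 5 corresponds roughly to an STS AssumeRole call like the one below (a
simplified sketch; the role ARN, external ID, and session policy are
placeholders, and Polaris builds the real scoped-down policy from the
table's storage location):

```java
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;
import software.amazon.awssdk.services.sts.model.AssumeRoleResponse;
import software.amazon.awssdk.services.sts.model.Credentials;

// Simplified sketch of step 5: Polaris (as an IAM user with long-lived keys)
// assumes the customer-provided role with a restricted session policy and the
// catalog's externalId, yielding temporary credentials for the S3 client.
public class AssumeCustomerRoleExample {
  public static void main(String[] args) {
    try (StsClient sts = StsClient.create()) { // picks up the long-lived AWS credentials
      AssumeRoleRequest request = AssumeRoleRequest.builder()
          .roleArn("arn:aws:iam::123456789012:role/customer-iceberg-role") // placeholder
          .externalId("polaris-generated-external-id")                     // placeholder
          .roleSessionName("polaris-subscoped-session")
          .policy(restrictedPolicyJson()) // scoped-down policy for the table location
          .durationSeconds(3600)
          .build();

      AssumeRoleResponse response = sts.assumeRole(request);
      Credentials creds = response.credentials();
      // creds.accessKeyId(), creds.secretAccessKey(), creds.sessionToken() are the
      // temporary session credentials handed to S3FileIO's S3 client.
      System.out.println("Session expires at: " + creds.expiration());
    }
  }

  private static String restrictedPolicyJson() {
    // Placeholder: a policy limited to the table's S3 prefix would go here.
    return "{\"Version\":\"2012-10-17\",\"Statement\":[]}";
  }
}
```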

*Problem:*
Previously, when users submitted an *UpdateCatalogRequest*, the provided
storage configuration would completely replace the existing configuration,
including the service-provided fields. If customers forgot to manually
include the service-provided properties in the new storage configurations,
this unintentionally resulted in the loss of those critical properties.

*Fix*
A recent PR addresses this by ensuring that service-provided fields are
inherited during catalog updates. This prevents accidental loss of these
values and keeps the catalog entity intact.
https://github.com/apache/polaris/pull/1191
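
Without reproducing the PR's actual code, the idea behind the fix is along
these lines (an illustrative sketch; the key names and helper are
hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the fix's intent (not the actual code in the PR):
// carry service-provided storage fields over from the existing catalog when an
// update request omits them.
public final class StorageConfigMerger {

  // Hypothetical names for the service-provided keys per cloud provider.
  private static final String[] SERVICE_PROVIDED_KEYS = {
      "externalId", "userArn", "consentUrl", "multiTenantAppName", "gcsServiceAccount"
  };

  public static Map<String, String> merge(
      Map<String, String> existingConfig, Map<String, String> updatedConfig) {
    Map<String, String> merged = new HashMap<>(updatedConfig);
    for (String key : SERVICE_PROVIDED_KEYS) {
      // Keep the stored service-provided value if the update omitted it.
      merged.computeIfAbsent(key, existingConfig::get);
    }
    return merged;
  }
}
```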

*Open Questions for Discussion:*
*1. Do users need to provide these properties? *

For Open Catalog, users do not need to provide service-generated properties
like userArn, externalId, etc., and Open Catalog will provide them
automatically. However, this leads to a gap in OSS Polaris, where there’s
no existing mechanism to configure these properties.

*2. Where should these properties live? Should we store these properties in
the Catalog Entity, or just inject this info when generating the loadCatalog
response?*

Right now, these properties will be persisted in the metastore.

*3. Should we support both catalog-level and service-level userArn?*

From a cost and complexity perspective, supporting a catalog-level userArn
would require creating a dedicated AWS user credential per catalog, which
is very expensive and likely unnecessary.

It’s better to rely on the externalId to scope permissions at the catalog
level. Users can then configure their IAM role policies to allow access
only for specific Polaris-generated externalIds, offering sufficient
granularity without credential sprawl.

*4. Where and how does Polaris use these properties?*

Taking userArn as an example: Polaris does not use this property directly
in the service logic. Instead, it uses the associated AWS user credentials
to assume the customer’s IAM role. The userArn exists mainly for the
customer’s awareness: they need to know the ARN in order to update the
trust relationship of their IAM role accordingly.

Sorry for the long post, appreciate you making it through! Please feel free
to share your thoughts, suggestions, or any alternative ideas. Happy to
refine our direction based on what makes the most sense.

Best
Rulin


[DISCUSS] Add In-Dev Polaris Migrator/Synchronizer Tool to apache/polaris-tools

2025-04-10 Thread Mansehaj Singh
Hi all! Nice to meet you.

I opened up https://github.com/apache/polaris-tools/pull/4 recently to add
a Polaris migration/synchronizer tool I've been working on to the
polaris-tools repo. By request, I'm sharing a design document here
detailing how the tool works and the roadmap for functionality that is in
development.

Here's the design doc giving a full overview:
https://docs.google.com/document/d/1AXKmzp3JaTuUS_FMNnxr_pHsBTs86rWRMborMi3deCw/edit?usp=sharing


To summarize:

We can think of this tool as a configurable mirroring/migration tool for
moving between two Polaris instances. I believe this would enable and
support many use cases that are quite cumbersome to carry out manually
today, and break down barriers to switching between open-source and managed
offerings of Polaris. The tool has been designed with goals that go beyond
supporting just the CLI implementation.

Please take a look at the design doc if you're interested!

Thank you!
- Sehaj