Hi everyone,

I believe Hive Metastore (HMS) federation is a critical feature for the
success and widespread adoption of the project. The reality of the current
data lake ecosystem is that a vast number of organizations have a
significant footprint in HMS. For Polaris to be the Iceberg REST Catalog
(IRC) of choice, it needs to meet users where they are. Ignoring the
existing HMS landscape creates a significant barrier to entry for a large
portion of our potential user base. We have already seen requests for this
functionality from the community. For instance, there has been discussion
in the Polaris Slack
(https://apache-polaris.slack.com/archives/C084QSKD6S2/p1744998865663079)
about the need for HMS integration.

Furthermore, other open-source catalog projects are already moving in this
direction. Apache Gravitino has this capability, and Unity Catalog OSS is
expected to support HMS federation in its upcoming 0.4 release. The
transition from a traditional HMS to a modern IRC is rarely an overnight
flip. Users will likely operate in a hybrid mode for a considerable period,
with both HMS and an IRC coexisting. Thus, a one-time migration tool is not
sufficient. By supporting federation, we will provide a practical path for
users to gradually migrate to Polaris.

I understand there may be concerns about pulling in extensive dependencies
from the HMS client. However, these challenges can be managed. We can be
selective about the dependencies we include and explore ways to minimize
the impact on the core Polaris project. The strategic benefit of enabling a
smooth migration path for a huge segment of users far outweighs the
technical challenge of managing these dependencies.

With regard to the concern about supporting only a single HMS, we can add
LDAP support in the next iteration. That will enable Polaris to federate to
multiple HMS instances, as long as those installations use LDAP
authentication. The existing IMPLICIT auth will enable
(experimental/limited) Kerberos support without pulling any Kerberos
dependencies into the Polaris project.
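
To make the multi-instance point concrete, here is a minimal sketch in Java
of what per-catalog connection settings could look like. It is only an
illustration under my own assumptions: the names HmsFederationSketch,
HmsConnection, register and clientProperties, and the two example.*
property keys, are hypothetical and are not Polaris or Hive APIs. The only
real Hive setting used is hive.metastore.uris. The idea is that each
federated catalog carries its own URI and LDAP identity, so nothing depends
on JVM-wide state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Polaris or Hive code.
final class HmsFederationSketch {

    // Connection settings for one federated HMS instance.
    static final class HmsConnection {
        final String metastoreUri;
        final String ldapUser;
        final String secretRef; // reference into a secrets manager, not the password

        HmsConnection(String metastoreUri, String ldapUser, String secretRef) {
            this.metastoreUri = metastoreUri;
            this.ldapUser = ldapUser;
            this.secretRef = secretRef;
        }
    }

    // One entry per federated catalog; no global (per-JVM) configuration.
    private final Map<String, HmsConnection> byCatalog = new HashMap<>();

    void register(String catalogName, HmsConnection conn) {
        if (byCatalog.putIfAbsent(catalogName, conn) != null) {
            throw new IllegalArgumentException("catalog already registered: " + catalogName);
        }
    }

    // Builds the client properties for exactly one catalog.
    Map<String, String> clientProperties(String catalogName) {
        HmsConnection c = byCatalog.get(catalogName);
        if (c == null) {
            throw new IllegalArgumentException("unknown catalog: " + catalogName);
        }
        Map<String, String> props = new HashMap<>();
        props.put("hive.metastore.uris", c.metastoreUri); // real Hive setting
        props.put("example.ldap.user", c.ldapUser);       // placeholder key
        props.put("example.ldap.secret-ref", c.secretRef); // placeholder key
        return props;
    }

    public static void main(String[] args) {
        HmsFederationSketch sketch = new HmsFederationSketch();
        sketch.register("sales", new HmsConnection("thrift://hms-a:9083", "svc_a", "secret/a"));
        sketch.register("ops", new HmsConnection("thrift://hms-b:9083", "svc_b", "secret/b"));
        System.out.println(sketch.clientProperties("sales").get("hive.metastore.uris"));
        System.out.println(sketch.clientProperties("ops").get("hive.metastore.uris"));
    }
}
```

Because the settings are looked up by catalog name, two catalogs in the
same deployment can point at different HMS instances with different
credentials. This is the property that a krb5.conf and keytab configured
once per JVM cannot offer.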

I welcome a discussion on the best way to implement this. I strongly
believe that deciding against HMS support will prevent a large subset of
data lake users from adopting Polaris. I'm hoping we can find a better path
forward together.

Thanks,
Pooja

On 2025/07/10 23:47:21 Yufei Gu wrote:
> It’s a valid point that Polaris needs to support multi-tenancy, or even
> across different external catalogs (such as remote HMS) within a single
> realm.
> 
> Unfortunately, Kerberos isn’t compatible with this model, as it requires
> global configuration per JVM, making it inherently single-tenant. So I’d
> suggest we rule out the Kerberos option and explore more flexible
> authentication schemes.
> 
> Here’s a quick summary of viable alternatives:
> 
>    - Implicit Authentication: This can not support multi-tenancy when
>    environment variables are leveraged per instance. We could further extend
>    this in the future by integrating a more sophisticated secrets manager to
>    improve tenant isolation and credential handling.
> 
> 
>    - LDAP: A well-established solution that naturally supports multiple
>    credentials and user contexts. It aligns well with multi-tenant needs.
> 
> Given that, I think we have a clear path forward for enabling multi-tenancy
> with HMS federation. Introducing implicit authentication as a starting
> point seems reasonable. It can be disabled by default, with Polaris admins
> choosing to enable it based on their environment. Even in single-HMS
> deployments, this option brings real value without adding complexity,
> especially since a lot of organizations only have one HMS instance.
> 
> Yufei
> 
> 
> On Wed, Jul 9, 2025 at 7:44 AM Robert Stupp <sn...@snazy.de> wrote:
> 
> > Following up on my email:
> >
> > Polaris would really benefit from supporting HMS and other catalog
> > types. And the way I see to get there is to have a "HMS only" IRC
> > service, which can be legibly built on Java 11, use Kerberos, etc.
> > Polaris can then federate to that HMS catalog.
> > AFAIU clients can authenticate to k8s and get OAuth tokens. Those can
> > be used to talk to Polaris, which can in turn talk to the HMS service.
> >
> > What I do object to, though, is making Polaris effectively a single
> > realm + single catalog service and adding new dependencies on Hadoop +
> > Hive + Kerberos to Polaris.
> >
> > On Wed, Jul 9, 2025 at 12:17 PM Robert Stupp <sn...@snazy.de> wrote:
> > >
> > > Let's recap what Polaris offers:
> > > 1. Multi tenancy via realms
> > > 2. Multiple catalogs per realm
> > > 3. OAuth/OIDC
> > >
> > > Adding Kerberos is global per JVM, making #1 impossible and likely
> > > also not suitable for #2, plus adding another complicated and complex
> > > auth mechanism.
> > > If Kerberos is a strong concern, I propose to contribute necessary
> > > changes to the "Iceberg auth manager project" [1] to let clients use
> > > krb and receive OAuth tokens for it. It is also worth mentioning that
> > > testing all that (development and CI including unit and especially
> > > integration tests) is a huge effort in itself.
> > >
> > > Again, federating to another "single tenant / single catalog HMS krb"
> > > Iceberg REST service behind Polaris is fine. Krb clients can authorize
> > > against Polaris via OAuth, and Polaris itself can likely authorize
> > > itself using OAuth.
> > >
> > > I strongly object to depending even more on Hadoop for the reasons
> > > outlined earlier. I also strongly object to adding Kerberos to
> > > Polaris.
> > >
> > > BTW: Hadoop is not necessary for Iceberg to work, it is rather an "opt
> > > in" (ex: org.apache.iceberg.hadoop.Configurable#setConf).
> > >
> > > [1] https://github.com/dremio/iceberg-auth-manager
> > >
> > > On Tue, Jul 8, 2025 at 6:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
> > > >
> > > > HMS integration is a key step toward one of Polaris’s critical
> > missions:
> > > > helping users move off HMS. It brings clear value by aligning with our
> > > > long-term direction.
> > > >
> > > > I’m not too concerned about hive.xml, most of its configurations can be
> > > > dynamically injected at runtime. The real challenge lies in Kerberos
> > > > integration. Since krb5.conf and the keytab are globally configured per
> > > > JVM, a single JVM instance cannot support true multi-tenancy. As far
> > as I
> > > > know, there isn’t a clean solution to this limitation.
> > > >
> > > > If that's indeed the case, Option 2a becomes far less appealing to me.
> > > >
> > > > Yufei
> > > >
> > > >
> > > > On Mon, Jul 7, 2025 at 11:18 AM Russell Spitzer <
> > russell.spit...@gmail.com>
> > > > wrote:
> > > >
> > > > > I think having some integration with HMS is definitely a good idea.
> > We've
> > > > > already seen
> > > > > users build this in the wild on top of Polaris showing that there is
> > > > > definitely a demand.
> > > > > > I'm still a strong believer that we should be helping users get
> > > > > > to Polaris from whatever systems they are currently using.
> > > > >
> > > > > On Mon, Jul 7, 2025 at 12:59 PM Eric Maynard <
> > eric.w.mayn...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > 1. We (Polaris) can provide end users a way to migrate off of these
> > > > > > catalogs that the Iceberg project no longer wants to invest into.
> > > > > > Implementing HMS federation is in service to the goal of removing
> > > > > > non-Iceberg catalogs, not in contradiction to it.
> > > > > >
> > > > > > 2. This does not seem like a user-centered concern, but I'm also
> > not
> > > > > sure I
> > > > > > understand exactly what is being expressed here. Are you saying
> > that the
> > > > > > current HADOOP federation does not work somehow?
> > > > > >
> > > > > > 3. Yes, please see the other thread about the IMPLICIT
> > authentication
> > > > > type
> > > > > > for discussion of this topic. Note, however, that HMS federation
> > may
> > > > > > support authentication types other than IMPLICIT.
> > > > > >
> > > > > > 4. That depends on what you mean by "depends on" -- it could also
> > be said
> > > > > > that Iceberg itself depends on Hadoop.
> > > > > >
> > > > > > 5. This not only also applies to HADOOP federation, which already
> > exists,
> > > > > > but also does *not* apply to HMS federation when using an
> > authentication
> > > > > > mechanism other than IMPLICIT -- again, please see the other
> > thread for
> > > > > > more discussion of this topic.
> > > > > >
> > > > > > On Fri, Jul 4, 2025 at 3:52 AM Robert Stupp <sn...@snazy.de>
> > wrote:
> > > > > >
> > > > > > > I'd really prefer to not add "anything Hive" to Polaris itself,
> > and I'd
> > > > > > > really like to see Hadoop being removed entirely from the
> > Polaris code
> > > > > > > base.
> > > > > > >
> > > > > > > There are multiple reasons for this:
> > > > > > >
> > > > > > > 1. The Iceberg project would rather like to remove all catalogs
> > except
> > > > > > > the REST catalog. (That's at least what I understood from
> > discussions
> > > > > > > quite a while ago.)
> > > > > > >
> > > > > > > 2. Hadoop is quite behind supporting recent Java versions. It is
> > > > > already
> > > > > > > impossible to run "anything Hadoop" with Java 24. Considering
> > how long
> > > > > > > it took Hadoop to even support Java 11, it will take a long time
> > until
> > > > > > > Hadoop is ready for Java 24+, especially since Hadoop has to
> > refactor a
> > > > > > > lot of things. Polaris requires Java 21 and we know it works in
> > CI with
> > > > > > > Java 22+23 (both are EOL). Hadoop does only support Java 11, not
> > 17,
> > > > > not
> > > > > > > 21.
> > > > > > >
> > > > > > > > 3. Hadoop (HDFS) has a very different security model, which is
> > the
> > > > > > > reason why HDFS is not suitable for Polaris production
> > configuration,
> > > > > > > guarded by explicit configuration options.
> > > > > > >
> > > > > > > 4. Hive depends on Hadoop, so all concerns about Hadoop also
> > apply to
> > > > > > Hive.
> > > > > > >
> > > > > > > 5. Polaris is multi-tenant (realms). A _single_ instance of Hive
> > > > > > > contradicts this.
> > > > > > >
> > > > > > >
> > > > > > > My vote would be on *not* adding Hive and also on removing Hadoop
> > > > > > entirely.
> > > > > > >
> > > > > > > If someone comes up with an Iceberg REST catalog for Hive or
> > HDFS and
> > > > > > > Polaris can connect to it, that's fine for me, because it's
> > outside of
> > > > > > > > Polaris. But I strongly object to having Hadoop or even Hive in
> > Polaris.
> > > > > > >
> > > > > > >
> > > > > > > On 7/1/25 20:48, Pooja Nilangekar wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I wanted to start a discussion around the support for Hive
> > Catalog
> > > > > > > > federation in Polaris. In particular, there are two primary
> > ways we
> > > > > can
> > > > > > > add
> > > > > > > > support for Hive federation:
> > > > > > > >
> > > > > > > > *1. Support a single Hive instance per Polaris deployment* The
> > Hive
> > > > > > > > workflow would be identical to the Hadoop catalog workflow.
> > Polaris
> > > > > > > > would invoke the Iceberg connection library, that would try to
> > find
> > > > > the
> > > > > > > > hive-site.xml file in (1) the CLASSPATH and (2) the default
> > Hadoop
> > > > > > > > locations: HADOOP_PATH and HADOOP_CONF_DIR. Polaris would then
> > > > > > initialize
> > > > > > > > the Hive connection using the configurations it found at these
> > > > > > locations.
> > > > > > > >
> > > > > > > >     -
> > > > > > > >
> > > > > > > >     *Drawbacks: *The primary drawback of this approach is that
> > if
> > > > > > Polaris
> > > > > > > >     finds multiple hive-site.xml files, it would merge their
> > > > > > > configurations,
> > > > > > > >     which could lead to potentially inconsistent connection
> > state.
> > > > > > > >     Furthermore, there is no clear documentation of the order
> > in
> > > > > which
> > > > > > > the
> > > > > > > >     configuration would be applied. While this is often
> > predictable
> > > > > on
> > > > > > a
> > > > > > > given
> > > > > > > >     OS, it is not guaranteed across environments. The other key
> > > > > > drawback
> > > > > > > is
> > > > > > > >     that if a Polaris user wants to federate to multiple Hive
> > > > > catalogs,
> > > > > > > their
> > > > > > > >     only option is to deploy a separate Polaris instance for
> > each
> > > > > Hive
> > > > > > > >     instance.
> > > > > > > >
> > > > > > > > *2. Support multiple Hive instances per Polaris deployment* The
> > > > > > alternate
> > > > > > > > (and in my view, ideal) solution is to allow Polaris to
> > federate with
> > > > > > > > multiple Hive catalogs. To support multiple catalogs, Polaris
> > would
> > > > > > > > explicitly disallow the connection library from reading
> > hive-site.xml
> > > > > > > files
> > > > > > > > in the default paths. To pass in the configurations, Polaris
> > can
> > > > > adopt
> > > > > > > one
> > > > > > > > of two options:
> > > > > > > >
> > > > > > > >     -
> > > > > > > >
> > > > > > > >     *Option 2a: Accept a canonical path to the target
> > hive-site.xml.*
> > > > > > > >     -
> > > > > > > >
> > > > > > > >        *Advantages:* This guarantees that the connection
> > > > > configurations
> > > > > > > are
> > > > > > > >        derived from a single source. It also allows Polaris to
> > rely
> > > > > on
> > > > > > > the
> > > > > > > >        NONE/ENVIRONMENT/PROVIDER/UNMANAGED mechanism, making it
> > > > > > > especially
> > > > > > > >        useful in case the Hive instance relies on Kerberos or
> > custom
> > > > > > > >        authentication that Polaris does not natively
> > support/manage.
> > > > > > > >        -
> > > > > > > >
> > > > > > > >        *Drawbacks:* The user needs to have access (or some
> > mechanism
> > > > > to
> > > > > > > >        upload files) to the Polaris server's file system.
> > > > > > > >        -
> > > > > > > >
> > > > > > > >     *Option 2b: Accept all the connection-specific parameters
> > as a
> > > > > part
> > > > > > > of
> > > > > > > >     the create-catalog request.*
> > > > > > > >     -
> > > > > > > >
> > > > > > > >        *Advantage:* Polaris can directly accept and store the
> > > > > > > configurations
> > > > > > > >        in a DPO instead of relying on the user having access
> > to the
> > > > > > > > server's file
> > > > > > > >        system (to create/update hive-site.xml).
> > > > > > > >        -
> > > > > > > >
> > > > > > > > >        *Drawback:* Polaris would need to manage the secrets.
> > > > > > > > >        This is easy to support for certain authentication
> > > > > > > > >        types (LDAP/Simple). However, it would preclude the
> > > > > > > > >        support for other authentication mechanisms, such as
> > > > > > > > >        Kerberos or Custom.
> > > > > > > >
> > > > > > > > I prefer option 2a primarily because it provides the
> > flexibility of
> > > > > > > > supporting multiple federated Hive catalogs while allowing
> > Polaris to
> > > > > > > > support authentication that it does not natively manage.
> > Please let
> > > > > me
> > > > > > > know
> > > > > > > > if you have any thoughts or feedback.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Pooja
> > > > > > > >
> > > > > > > --
> > > > > > > Robert Stupp
> > > > > > > @snazy
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> >
> 
