I'd really prefer to not add "anything Hive" to Polaris itself, and I'd
really like to see Hadoop being removed entirely from the Polaris code base.
There are multiple reasons for this:
1. The Iceberg project would like to remove all catalogs except
the REST catalog. (That's at least what I understood from discussions
quite a while ago.)
2. Hadoop is quite behind in supporting recent Java versions. It is already
impossible to run "anything Hadoop" with Java 24. Considering how long
it took Hadoop to even support Java 11, it will take a long time until
Hadoop is ready for Java 24+, especially since Hadoop has to refactor a
lot of things. Polaris requires Java 21 and we know it works in CI with
Java 22+23 (both are EOL). Hadoop only supports Java 11, not 17 and not 21.
3. Hadoop (HDFS) has a very different security model, which is the
reason why HDFS is not suitable for Polaris production configurations
and is guarded by explicit configuration options.
4. Hive depends on Hadoop, so all concerns about Hadoop also apply to Hive.
5. Polaris is multi-tenant (realms). A _single_ instance of Hive
contradicts this.
My vote would be on *not* adding Hive and also on removing Hadoop entirely.
If someone comes up with an Iceberg REST catalog for Hive or HDFS and
Polaris can connect to it, that's fine for me, because it's outside of
Polaris. But I strongly object to having Hadoop or even Hive in Polaris.
On 7/1/25 20:48, Pooja Nilangekar wrote:
Hi all,
I wanted to start a discussion around the support for Hive Catalog
federation in Polaris. In particular, there are two primary ways we can add
support for Hive federation:
*1. Support a single Hive instance per Polaris deployment* The Hive
workflow would be identical to the Hadoop catalog workflow. Polaris
would invoke the Iceberg connection library, which would try to find the
hive-site.xml file in (1) the CLASSPATH and (2) the default Hadoop
locations: HADOOP_PATH and HADOOP_CONF_DIR. Polaris would then initialize
the Hive connection using the configurations it found at these locations.
- *Drawbacks:* The primary drawback of this approach is that if Polaris
  finds multiple hive-site.xml files, it would merge their configurations,
  which could lead to potentially inconsistent connection state.
  Furthermore, there is no clear documentation of the order in which the
  configuration would be applied. While this is often predictable on a given
  OS, it is not guaranteed across environments. The other key drawback is
  that if a Polaris user wants to federate to multiple Hive catalogs, their
  only option is to deploy a separate Polaris instance for each Hive
  instance.
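To make the merge concern concrete, here is a minimal sketch in plain Java. It does not use any Hadoop or Hive classes: plain maps stand in for parsed hive-site.xml files, and the `merge` helper, the class name, and the host names are hypothetical. It assumes that a later-loaded file overrides keys from an earlier one, which may not match every detail of the real loader (for example, properties marked final).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: maps stand in for parsed hive-site.xml files.
class HiveSiteMergeSketch {

    // Merge files in load order; a later file overrides keys set by an earlier one.
    static Map<String, String> merge(List<Map<String, String>> filesInLoadOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> file : filesInLoadOrder) {
            merged.putAll(file);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed contents of a hive-site.xml found on the CLASSPATH.
        Map<String, String> fromClasspath = Map.of(
            "hive.metastore.uris", "thrift://metastore-a.example:9083",
            "hive.metastore.sasl.enabled", "true");
        // Assumed contents of a second hive-site.xml found in a default location.
        Map<String, String> fromConfDir = Map.of(
            "hive.metastore.uris", "thrift://metastore-b.example:9083");

        // Same two files, different load order, different effective connection.
        System.out.println(merge(List.of(fromClasspath, fromConfDir)));
        System.out.println(merge(List.of(fromConfDir, fromClasspath)));
    }
}
```

In the first ordering the effective configuration points at metastore-b while still carrying the SASL setting that was written for metastore-a, which is the kind of mixed state described above.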
*2. Support multiple Hive instances per Polaris deployment* The alternate
(and in my view, ideal) solution is to allow Polaris to federate with
multiple Hive catalogs. To support multiple catalogs, Polaris would
explicitly disallow the connection library from reading hive-site.xml files
in the default paths. To pass in the configurations, Polaris can adopt one
of two options:
- *Option 2a: Accept a canonical path to the target hive-site.xml.*
  - *Advantages:* This guarantees that the connection configurations are
    derived from a single source. It also allows Polaris to rely on the
    NONE/ENVIRONMENT/PROVIDER/UNMANAGED mechanism, making it especially
    useful in case the Hive instance relies on Kerberos or custom
    authentication that Polaris does not natively support/manage.
  - *Drawbacks:* The user needs to have access (or some mechanism to
    upload files) to the Polaris server's file system.
- *Option 2b: Accept all the connection-specific parameters as a part of
  the create-catalog request.*
  - *Advantage:* Polaris can directly accept and store the configurations
    in a DPO instead of relying on the user having access to the
    server's file system (to create/update hive-site.xml).
  - *Drawback:* Polaris would need to manage the secrets. This is easy to
    support for certain authentication types (LDAP/Simple). However, it
    would preclude support for other authentication mechanisms, such as
    Kerberos or Custom.
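As a rough illustration of the "single source" property of option 2a, the sketch below reads name/value pairs from exactly one explicitly given file using only the JDK's XML parser. It is not Polaris or Iceberg code: the class and method names are hypothetical, and it assumes the usual `<configuration><property><name/><value/></property></configuration>` layout with both elements present in every property.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only: one explicit file, no CLASSPATH or environment lookup.
class SingleHiveSiteSketch {

    // Resolve the given path to its real location and parse only that file.
    static Map<String, String> load(Path hiveSiteXml) {
        try {
            Path canonical = hiveSiteXml.toRealPath(); // fails if the file does not exist
            return parse(Files.readAllBytes(canonical));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Extract <name>/<value> pairs from every <property> element.
    static Map<String, String> parse(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations, since the file content is user supplied.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

            Map<String, String> props = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                // Assumes each property carries both a <name> and a <value>.
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Cannot parse hive-site.xml", e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("hive-site", ".xml");
        Files.writeString(file,
            "<configuration><property><name>hive.metastore.uris</name>"
                + "<value>thrift://metastore-a.example:9083</value></property></configuration>");
        System.out.println(load(file));
    }
}
```

A real implementation would hand the resulting properties to the connection library per catalog; how that is wired up, and how authentication is configured, is outside the scope of this sketch.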
I prefer option 2a primarily because it provides the flexibility of
supporting multiple federated Hive catalogs while allowing Polaris to
support authentication that it does not natively manage. Please let me know
if you have any thoughts or feedback.
Thanks,
Pooja
--
Robert Stupp
@snazy