Re:Fwd: Spark + Hive 4 Integration Guide (Practical Approach)

lisoda Fri, 03 Apr 2026 03:53:18 -0700

Hello,
It is a pleasure to hear from you. Thank you for sharing your insights.
We have adopted a similar approach to address component stack upgrade and
evolution challenges, and we are delighted to see that the community is
actively advancing this work as well.
Our general strategy focuses on two key principles:
1. Decoupling the Engine-Data Relationship
This appears to be a consensus among industry peers. Given that our data
processing remains predominantly structured, the data lake technology stack
naturally became our preferred choice, as it inherently preserves data schema.
In our production environment, we utilize HadoopCatalog to migrate data
previously managed by legacy HMS versions into Iceberg. While alternative
RestCatalog implementations are certainly worth considering, our preference for
minimizing component stack complexity led us to favor a FileSystemCatalog-like
solution.
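As a rough illustration of what a FileSystemCatalog-like setup looks like on the Spark side, the sketch below assembles the Spark properties that register an Iceberg HadoopCatalog. The property keys are the ones documented by Iceberg for Spark; the catalog name `lake` and the warehouse path are placeholders invented for this example and are not our production values.

```python
"""Sketch: Spark properties for an Iceberg HadoopCatalog (directory-based catalog)."""


def hadoop_catalog_conf(catalog_name, warehouse):
    """Return Spark properties that register an Iceberg HadoopCatalog.

    catalog_name and warehouse are placeholders chosen for this example.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the Iceberg catalog implementation under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" selects the directory-based HadoopCatalog; no HMS is involved.
        f"{prefix}.type": "hadoop",
        # Root directory that holds <namespace>/<table>/ subdirectories.
        f"{prefix}.warehouse": warehouse,
    }


def as_submit_args(conf):
    """Flatten the properties into spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical warehouse location, for illustration only.
    conf = hadoop_catalog_conf("lake", "hdfs://nameservice1/warehouse/iceberg")
    print(" ".join(as_submit_args(conf)))
```

With properties like these, tables would be addressed as `lake.<namespace>.<table>`, and the only state the catalog needs is the warehouse directory itself.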
We are grateful that the community has incorporated similar experience: Hive
now supports both HadoopCatalog and LocationBasedIcebergTable.
Leveraging these features, we have achieved seamless data interoperability
between Spark and HIVE4 in our production environment. This liberates us from
concerns about engine upgrades potentially corrupting data or compromising
accessibility. For Trino and other MPP databases, we are now implementing a
compatibility layer adhering to RestCatalog specifications, enabling these
engines to access FileSystemCatalog-based tables.
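For the Hive side of the Spark/Hive interoperability described above, the following sketch generates the HiveQL that, per the Iceberg documentation for Hive, overlays an existing Iceberg table either through a named Hadoop catalog or directly by location. The helper names, database, table, and paths are hypothetical; the exact statements accepted by a given Hive 4 release should be checked against its documentation.

```python
"""Sketch: HiveQL for exposing existing Iceberg tables to Hive."""

# Storage handler class used by Hive to read and write Iceberg tables.
STORAGE_HANDLER = "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"


def hive_catalog_settings(catalog_name, warehouse):
    """Session settings that declare a Hadoop-type Iceberg catalog in Hive."""
    return [
        f"SET iceberg.catalog.{catalog_name}.type=hadoop;",
        f"SET iceberg.catalog.{catalog_name}.warehouse={warehouse};",
    ]


def register_catalog_table(db, table, catalog_name):
    """DDL that overlays a table already present in the named catalog."""
    return (
        f"CREATE EXTERNAL TABLE {db}.{table}\n"
        f"STORED BY '{STORAGE_HANDLER}'\n"
        f"TBLPROPERTIES ('iceberg.catalog'='{catalog_name}');"
    )


def register_location_table(db, table, location):
    """DDL that overlays a table identified only by its root directory."""
    return (
        f"CREATE EXTERNAL TABLE {db}.{table}\n"
        f"STORED BY '{STORAGE_HANDLER}'\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('iceberg.catalog'='location_based_table');"
    )


if __name__ == "__main__":
    # All names and paths below are placeholders for illustration.
    warehouse = "hdfs://nameservice1/warehouse/iceberg"
    for stmt in hive_catalog_settings("lake", warehouse):
        print(stmt)
    print(register_catalog_table("sales", "orders", "lake"))
    print(register_location_table("sales", "orders_by_path",
                                  warehouse + "/sales/orders"))
```

Both variants only register a pointer in Hive; the table data and Iceberg metadata stay where Spark wrote them.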
Our current production architecture maintains separate deployments: Spark 3.x
operates with legacy HMS 3.x, while HIVE 4.x runs independently. This
arrangement allows us to upgrade Spark and gracefully phase out the legacy
HMS (3.x, 2.x, ...) at our discretion, and then adopt HMS 4.x. Should we need to
reconfigure extensively, even clearing all metadata and redeploying engines, the
data remains secure within Iceberg. We simply re-establish the connection
between Iceberg and the engine.
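The reason re-establishing the connection is cheap with a HadoopCatalog is that the table's current state is discoverable from its directory alone. The sketch below shows the idea on a local file system: it reads `metadata/version-hint.text` and derives the current `v<N>.metadata.json` file. It is a simplification under stated assumptions: real deployments read from HDFS or object storage, and Iceberg itself also handles a missing hint file and compressed metadata names, which this example ignores. The function name and demo paths are hypothetical.

```python
"""Sketch: locate the current metadata file of a HadoopCatalog-style table."""
import os
import tempfile


def current_metadata_path(table_dir):
    """Return the path of the metadata file named by version-hint.text.

    Assumes an uncompressed v<N>.metadata.json naming scheme on a local
    file system; raises FileNotFoundError if the expected file is absent.
    """
    metadata_dir = os.path.join(table_dir, "metadata")
    with open(os.path.join(metadata_dir, "version-hint.text")) as fh:
        version = int(fh.read().strip())
    candidate = os.path.join(metadata_dir, f"v{version}.metadata.json")
    if not os.path.exists(candidate):
        raise FileNotFoundError(candidate)
    return candidate


if __name__ == "__main__":
    # Build a throwaway table directory that mimics the on-disk layout.
    with tempfile.TemporaryDirectory() as root:
        meta = os.path.join(root, "sales", "orders", "metadata")
        os.makedirs(meta)
        for n in (1, 2, 3):
            with open(os.path.join(meta, f"v{n}.metadata.json"), "w") as fh:
                fh.write("{}")
        with open(os.path.join(meta, "version-hint.text"), "w") as fh:
            fh.write("3")
        print(current_metadata_path(os.path.join(root, "sales", "orders")))
```

Because nothing outside the table directory is consulted, a freshly deployed engine only needs the warehouse location to find every table again.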
2. Minimizing Hadoop Cluster Dependencies
Similar to Spark-on-Kubernetes deployments, our approach involves bundling
essential runtime dependencies within the engine's self-contained libraries,
ensuring the engine operates exclusively with its own libraries during
execution.
This method has effectively decoupled Hadoop versioning from our engines.
Provided base APIs remain stable, we can successfully run engines dependent on
newer Hadoop versions atop older Hadoop/YARN infrastructures.
We were honored to contribute this approach to the official documentation:
https://hive.apache.org/docs/latest/admin/manual-installation/#installing-with-old-version-hadoopgreater-than-or-equal-310
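For readers more familiar with Spark, a comparable idea on YARN can be sketched as follows. This is not the Hive procedure from the page linked above; it is an analogous illustration that assumes a Spark distribution bundling its own Hadoop client libraries, packaged as an archive at a hypothetical HDFS path. `spark.yarn.archive` and `spark.yarn.populateHadoopClasspath` are standard Spark-on-YARN properties; the helper names are made up for this example.

```python
"""Sketch: Spark-on-YARN settings that prefer the engine's bundled libraries."""


def self_contained_yarn_conf(spark_archive):
    """Return properties that ship Spark's own jars and skip the cluster classpath."""
    return {
        "spark.master": "yarn",
        # Archive of Spark jars (including bundled Hadoop client libraries)
        # that YARN distributes to containers.
        "spark.yarn.archive": spark_archive,
        # Do not add the cluster's Hadoop jars from yarn.application.classpath
        # and mapreduce.application.classpath to the container classpath.
        "spark.yarn.populateHadoopClasspath": "false",
    }


def render_defaults(conf):
    """Render the properties in spark-defaults.conf style (key, space, value)."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items())) + "\n"


if __name__ == "__main__":
    # Hypothetical archive location, for illustration only.
    print(render_defaults(
        self_contained_yarn_conf("hdfs://nameservice1/apps/spark/spark-libs.zip")
    ), end="")
```

Whether this is sufficient depends on the APIs the older cluster exposes, which is the stability caveat mentioned above.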
Employing these techniques, we currently operate three or more distinct
HIVE+HMS version combinations in production. In our customer engagements, we
have similarly enabled HIVE4 (dependent on Hadoop 3.4+) to run within Hadoop
2.x environments.
The above reflects our humble experience and observations. We would be
delighted to exchange ideas should you have alternative approaches or insights
to share. Please kindly point out any misconceptions or areas for improvement.
Warm regards,
Lisoda
