If you already have a solution in place, feel free to create a Jira & PR with it. However, third-party dependencies present significant challenges: different Hadoop versions bring their own sets of third-party libraries, which can conflict with the versions Hive uses. A prime example is Guava: Hadoop upgraded Guava after 3.1.x, but Hive couldn't follow suit. Hadoop eventually shaded Guava in 3.3.x, which is why we aligned with that version.
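To make that concrete: from 3.3.x onward, Hadoop depends on a relocated Guava from the hadoop-thirdparty project rather than plain com.google.guava, roughly along these lines (the version below is illustrative, not tied to a specific Hadoop release):

    <!-- Hadoop 3.3.x pulls Guava in via the hadoop-thirdparty
         relocation instead of com.google.guava:guava -->
    <dependency>
      <groupId>org.apache.hadoop.thirdparty</groupId>
      <artifactId>hadoop-shaded-guava</artifactId>
      <version>1.1.1</version> <!-- illustrative version -->
    </dependency>
    <!-- Its classes live under
         org.apache.hadoop.thirdparty.com.google.common, so they no
         longer collide with whatever Guava version Hive ships. -->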
One potential improvement could be to switch to using hadoop-client-api, hadoop-client-runtime, and hadoop-client-minicluster instead of directly specifying the Hadoop dependencies. These artifacts shade most of the third-party libraries, which may help minimize conflicts. Spark, for example, already uses them [1]; a rough sketch of what that could look like follows the quoted mail below.

As for releasing separate binaries for different Hadoop versions, I don't think that's feasible. However, users are free to build their own versions from the source tarball we provide, using -Dhadoop.version=X. The actual release is the source code; the binaries are just convenience binaries.

That said, I don't believe supporting the 2.x Hadoop line would be easy, or even possible, at this point, but we could perhaps attempt it for 3.x.

-Ayush

[1] https://github.com/apache/spark/blob/6734d4883e76b82249df5c151d42bc83173f4122/pom.xml#L1401-L1424

On Wed, 9 Oct 2024 at 17:32, lisoda <lis...@yeah.net> wrote:

> Hi team,
>
> I would like to discuss the issue of running Hive4 in Hadoop environments
> below version 3.3.6. Currently, a large number of Hive users are still on
> older environments such as Hadoop 2.6/2.7/3.1.1. To be honest, upgrading
> Hadoop is a challenging task. We cannot force users to upgrade their
> Hadoop cluster versions just to use Hive4. To encourage these potential
> users to adopt Hive4, we need to provide a general solution that allows
> Hive4 to run on older Hadoop versions (at the very least, we need to
> address the compatibility issues with Hadoop 3.1.0).
>
> The general plan is as follows: in both the Hive and Tez projects, in
> addition to the existing tar packages, we should also provide tar packages
> that bundle recent Hadoop dependencies. Through configuration, users can
> then avoid picking up any jar dependencies from the Hadoop cluster. In
> this way, users can launch Tez tasks on older Hadoop clusters using only
> the bundled Hadoop dependencies.
>
> This is how Spark does it, and it is also the main reason why users are
> more likely to adopt Spark as a SQL engine. Spark not only provides tar
> packages without Hadoop dependencies but also tar packages with Hadoop 3
> or Hadoop 2 built in, so users can upgrade to a new version of Spark
> without upgrading their Hadoop version.
>
> We have implemented such a plan in our production environment and have
> successfully run Hive 4.0.0 and Hive 4.0.1 in an HDP 3.1.0 environment;
> they are currently working well.
>
> Based on this successful experience, I believe we should provide tar
> packages with all Hadoop dependencies built in. At the very least, we
> should document that users can run Hive4 on older Hadoop versions in
> this way.
>
> However, my idea may not be fully mature, so I would like to hear what
> others think. It would be great if more people could join in and discuss
> this topic.
>
> Thanks,
> LISODA.
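For reference, a rough sketch of the hadoop-client-* switch suggested above, modelled loosely on the Spark pom in [1]; the scopes here are assumptions on my side, not a tested configuration for Hive:

    <!-- Replace direct hadoop-common / hadoop-hdfs / hadoop-mapreduce-*
         dependencies with the shaded client artifacts, which bundle and
         relocate most of Hadoop's third-party libraries. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>${hadoop.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>

Building against a different Hadoop line would then remain a matter of overriding the property, e.g. mvn clean install -DskipTests -Dhadoop.version=X, as with the source builds mentioned above.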
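And, if I read the quoted plan correctly, the "configuration files" part presumably boils down to a tez-site.xml fragment along these lines; the tarball path and name are hypothetical, so take this as a sketch of the mechanism rather than lisoda's exact setup:

    <!-- Point Tez at a full tarball on HDFS that bundles its own
         Hadoop jars (path and file name below are made up). -->
    <property>
      <name>tez.lib.uris</name>
      <value>hdfs:///apps/tez/tez-with-hadoop.tar.gz</value>
    </property>
    <!-- Keep the cluster's own Hadoop jars off the task classpath. -->
    <property>
      <name>tez.use.cluster.hadoop-libs</name>
      <value>false</value>
    </property>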