Re:Re: Running Hive4 in low-version Hadoop environments.

lisoda Sat, 12 Oct 2024 08:27:37 -0700

Hello Sir.
I agree with your comments about Hadoop 2, and I don't actually intend to
support it. We just need to support Hadoop 3.x and we're good to go.

On a low version of Hadoop 3, this is what we do:
1. Download the Hadoop binaries separately (a higher version, for example
3.3.6), and set HADOOP_HOME for Hive to the directory where that higher
version of Hadoop is stored.
2. Package Tez with all of its dependencies and native libraries (including
the ones required for Hadoop).
3. In tez-site.xml, specify that Tez uses only the jars in its own lib folder,
and not any Hadoop-related dependencies from the cluster.
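
As a rough illustration of step 3, a tez-site.xml fragment along the following
lines could express this. It is only a sketch under assumptions, not
necessarily the configuration used in the setup described here: the HDFS path
and tarball name are hypothetical, and it assumes the Tez tarball was built to
bundle the Hadoop jars as in step 2.

```xml
<configuration>
  <!-- Hypothetical HDFS location of a Tez tarball that bundles its own Hadoop jars. -->
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez/tez-0.10.4-with-hadoop.tar.gz</value>
  </property>
  <!-- false: do not add the cluster's Hadoop libraries to the Tez classpath,
       so the jars shipped via tez.lib.uris must include the Hadoop dependencies. -->
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>false</value>
  </property>
</configuration>
```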

With the above steps, we are currently running Hive 4.0.1 + Tez 0.10.4 on HDP
3.1.0 (Hadoop 3.1.1). They work fine.

This is the solution we are currently using. Do you see any problems with it?
If not, can we extend it to all Hive users?

Thanks.
LiSoDa.

On 2024-10-12 14:27:32, "Ayush Saxena" <[email protected]> wrote:

If you already have a solution in place, feel free to create a Jira & PR with
it. However, third-party dependencies present significant challenges. Different
versions of Hadoop bring their own set of third-party libraries, which can
cause compatibility issues with the versions used by Hive. A prime example is
Guava: while Hadoop upgraded Guava in versions post-3.1.x, Hive couldn’t follow
suit. Hadoop eventually shaded Guava in 3.3.x, which is why we aligned with
that version.

One potential improvement could be to switch to using hadoop-client-api,
hadoop-client-runtime, and hadoop-client-minicluster instead of directly
specifying the Hadoop dependencies. These artifacts shade most of the
third-party libraries, which may help minimize conflicts. Spark, for example,
already uses them [1].
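
In a Maven pom.xml, that switch might look roughly like the following. This is
a minimal sketch, not Hive's actual build: the scopes are an assumption based
on how these artifacts are typically consumed, and ${hadoop.version} is assumed
to be defined as a property elsewhere in the pom.

```xml
<dependencies>
  <!-- Shaded Hadoop client API: the classes code compiles against. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <!-- Shaded runtime: relocated third-party libraries needed at run time. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
    <version>${hadoop.version}</version>
    <scope>runtime</scope>
  </dependency>
  <!-- Shaded mini cluster, only for tests. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-minicluster</artifactId>
    <version>${hadoop.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```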

As for releasing separate binaries for different Hadoop versions, I don't think
that’s feasible. However, users are free to build their own versions from the
source tarball we provide, using -Dhadoop.version=X. The actual release is the
source code; the binaries are just convenience binaries.

That said, I don’t believe supporting the 2.x Hadoop line would be easy, or
even possible, at this point, but we can attempt it for 3.x maybe.

-Ayush

[1]
https://github.com/apache/spark/blob/6734d4883e76b82249df5c151d42bc83173f4122/pom.xml#L1401-L1424

On Wed, 9 Oct 2024 at 17:32, lisoda <[email protected]> wrote:

Hi team.

I would like to discuss with everyone the issue of running Hive4 in Hadoop
environments below version 3.3.6. Currently, a large number of Hive users are
still using low-version environments such as Hadoop 2.6/2.7/3.1.1. To be
honest, upgrading Hadoop is a challenging task. We cannot force users to
upgrade their Hadoop cluster versions just to use Hive4. In order to encourage
these potential users to adopt and use Hive4, we need to provide a general
solution that allows Hive4 to run on low-version Hadoop (at least we need to
address the compatibility issues with Hadoop version 3.1.0).
The general plan is as follows: In both the Hive and Tez projects, in addition
to providing the existing tar packages, we should also provide tar packages
that include high-version Hadoop dependencies. By defining configuration files,
users can avoid using any jar package dependencies from the Hadoop cluster. In
this way, users can initiate Tez tasks on low-version Hadoop clusters using
only the built-in Hadoop dependencies.
This is how Spark does it, which is also the main reason why users are more
likely to adopt Spark as a SQL engine. Spark not only provides tar packages
without Hadoop dependencies but also provides tar packages with built-in Hadoop
3 and Hadoop 2. Users can upgrade to a new version of Spark without upgrading
the Hadoop version.
We have implemented such a plan in our production environment, and we have
successfully run Hive4.0.0 and Hive4.0.1 in the HDP 3.1.0 environment. They are
currently working well.
Based on our successful experience, I believe it is necessary for us to provide
tar packages with all Hadoop dependencies built in. At the very least, we
should document that users can successfully run Hive4 on low-version Hadoop in
this way.
However, my idea may not be mature enough, so I would like to know what others
think. It would be great if someone could participate in this topic and discuss
it.

Thanks.
LiSoDa.
