Hello Edward
Yeah, your point is absolutely valid—my intention was never to drag along an
ancient tech stack either.
In fact, the compatibility work is already done; we no longer need to compile
against crusty old dependencies. What we're doing is simple: just avoid pulling
in the environment's libraries at runtime. We bundle every runtime dependency
into self-lib, which contains virtually zero legacy packages. (For example: Tez
shipped with the full Hadoop 3.4.1 libraries, running on a Hadoop 3.1.0 YARN
cluster without using any Hadoop 3.1.0 jars. If something weird does surface,
the user can debug it themselves.)
As long as self-lib can call out to external Hadoop/OSS environments via
function calls, the compatibility box is ticked. The effort is tiny and we
basically wash our hands of historical debt—why not do it? (Of course, if this
approach still fails for a particular user, we drop it and move on; people have
to look forward.)
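To make the "debug it themselves" part concrete: a user can check with a few
lines of Java which Hadoop actually ended up on the classpath and which jar it
came from. The class below is only an illustrative sketch of mine, not
something shipped with Hive or Tez; run it with the same classpath HiveServer2
or the Tez client uses.

  import org.apache.hadoop.util.VersionInfo;

  // Sketch only: if "Loaded from" points into the cluster's own Hadoop
  // install (e.g. the 3.1.0 jars) rather than the bundled self-lib
  // directory, the classpath isolation has leaked.
  public class HadoopClasspathCheck {
      public static void main(String[] args) {
          // Version baked into whichever hadoop-common jar was loaded.
          System.out.println("Hadoop version on classpath: "
                  + VersionInfo.getVersion());
          // Physical jar that class came from.
          System.out.println("Loaded from: "
                  + VersionInfo.class.getProtectionDomain()
                        .getCodeSource().getLocation());
      }
  }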
As for the REST Catalog, its original intent is sound, yet I can’t shake a
strong sense of déjà vu. Is it really wise to discard the long-honed power of
the metadata system itself and expose everything through a generic protocol
that will always be treated as a second-class citizen? The world works like
this: there may be a universal law or framework, but the closer you get to
reality, the more compromises and distortions pile up. So my view of
the REST Catalog remains that of a generic, second-tier protocol. (I wouldn't rule
out the possibility that one day we’ll find ourselves returning to a
purpose-built metadata system.)
Still, no matter what, jettisoning historical debt and marching forward
together is unquestionably the right call.
Lisoda.
At 2025-12-24 06:23:27, "Edward Capriolo" <[email protected]> wrote:
Hey all,
I have been out of the game for a while but I am getting active again. My
opinion is 'out with the old, in with the new'. I often work in regulated
environments. The first thing they look at is the OSS issues, and the problem is
nightmare level; see all this red vulnerability stuff:
https://mvnrepository.com/artifact/org.apache.hive/hive-exec?p=2
I am building Hadoop on Alpine. hadoop-common, even in the latest Hadoop 3.4.2,
still wants to use protobuf 2.5.0. You can't even find a version of Alpine that
ships that protobuf! That's how old it is. That dependency then forces everything
downstream of it to have the same problem. Hive will need to include a protobuf
lib that's 6 years old to keep up with a Hadoop that's 8 years old. Protobuf
2.5.0 is this old:
https://groups.google.com/g/protobuf/c/CZngjTrQqdI?pli=1
"This is how Spark does it, which is also the main reason why users are more
likely to adopt Spark as a SQL engine"
Spark somehow still depends on hive 2.8:
https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.13/3.5.7
But if you read between the lines, that is going away: HMS is end of life, and
Unity Catalog goes forward.
https://community.databricks.com/t5/data-engineering/hive-metastore-end-of-life/td-p/136152
No disrespect to any of the platform maintainers. I understand their work and
its value: "HDP", "CDH", "Bigtop", and all the distros. But they shouldn't be
Hive's target. Targeting them results in this insane backporting, supporting
protobuf from 2013 just to run on Rocky Linux. No one builds software like this
anymore. StarRocks https://www.starrocks.io/ is running on the latest Ubuntu.
Back in the day Hive built off master; there were no CDHs or HDPs. Then came the
"shim layer", which was clever, but now it is a crutch: it is making everyone
target the past. Literally targeting a protobuf version from 2013.
All the DuckDB people brag on LinkedIn that they can query JSON. Hive is so
unhip I have to jam it into the conversation :) If you see someone talking
about a metastore it is Nessie, Polaris, Unity Catalog, or GLUE!
See this:
https://issues.apache.org/jira/browse/HADOOP-19756
A fortress of if statements to make it run on Sun and glibc Red Hat 4 :). If we
keep maintaining this basura, we just continue the push into irrelevance.
It's really time to stop targeting the "distros"; they are fading fast. Make it
build on master, make it build on Alpine.
https://github.com/edwardcapriolo/edgy-ansible/blob/main/imaging/hadoop/compositions/ha_rm_zk/compose.yml
Hip, Docker, features, winning! Not "compatibility with RH4 running on a
mainframe, stable releases from a vendor".
Edward
On Wed, Oct 9, 2024 at 8:02 AM lisoda <[email protected]> wrote:
HI TEAM.
I would like to discuss with everyone the issue of running Hive 4 in Hadoop
environments below version 3.3.6. Currently, a large number of Hive users are
still on older environments such as Hadoop 2.6/2.7/3.1.1. To be honest,
upgrading Hadoop is a challenging task, and we cannot force users to upgrade
their Hadoop clusters just to use Hive 4. To encourage these potential users to
adopt Hive 4, we need to provide a general solution that allows Hive 4 to run
on older Hadoop versions (at the very least, we need to address compatibility
with Hadoop 3.1.0).
The general plan is as follows: in both the Hive and Tez projects, in addition
to the existing tar packages, we should also provide tar packages that bundle
the newer Hadoop dependencies. With the right configuration, users can avoid
picking up any jars from the Hadoop cluster; in this way, they can launch Tez
tasks on older Hadoop clusters using only the bundled Hadoop dependencies.
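For illustration, here is a minimal sketch of the kind of client-side settings
involved. The property names tez.lib.uris and tez.use.cluster.hadoop-libs are
standard Tez configuration keys; the HDFS path, the class name, and the tarball
name are placeholders of my own, and the exact layout would depend on how the
full-Hadoop package is built.

  import org.apache.hadoop.conf.Configuration;

  // Sketch only: the idea is that every DAG localizes the "full" Tez tarball
  // (Tez plus its own Hadoop jars) and never puts the cluster's Hadoop
  // libraries on the container classpath.
  public class TezSelfContainedConf {
      public static void main(String[] args) {
          Configuration conf = new Configuration();

          // Point Tez at the bundled tarball (placeholder path).
          conf.set("tez.lib.uris",
                  "hdfs:///apps/tez/tez-with-hadoop-3.4.1.tar.gz");

          // Do not add the cluster's (e.g. 3.1.0) Hadoop libs to Tez
          // containers; everything must come from the tarball above.
          conf.setBoolean("tez.use.cluster.hadoop-libs", false);

          System.out.println("tez.lib.uris = " + conf.get("tez.lib.uris"));
          System.out.println("tez.use.cluster.hadoop-libs = "
                  + conf.getBoolean("tez.use.cluster.hadoop-libs", true));
      }
  }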
This is how Spark does it, which is also the main reason why users are more
likely to adopt Spark as a SQL engine. Spark not only provides tar packages
without Hadoop dependencies but also provides tar packages with built-in Hadoop
3 and Hadoop 2. Users can upgrade to a new version of Spark without upgrading
the Hadoop version.
We have implemented such a plan in our production environment, and we have
successfully run Hive 4.0.0 and Hive 4.0.1 in an HDP 3.1.0 environment. They
are currently working well.
Based on this successful experience, I believe it is necessary for us to
provide tar packages with all Hadoop dependencies built in. At the very least,
we should document that users can successfully run Hive 4 on older Hadoop
versions in this way.
However, my idea may not be mature enough, so I would like to know what others
think. It would be great if someone could participate in this topic and discuss
it.
TKS.
LISODA.