@itai, I agree that, pending the outcome of this discussion, it makes sense
to start a wider community thread to announce any decisions we make here.
Thanks for bringing that up.

@rajiv, MinIO support seems unrelated to this discussion. It seems like a
reasonable request, but I recommend starting another thread to see if
someone is interested in taking up that effort.

@jihoon I definitely agree that Hadoop should be refactored into an
extension longer term. I don't think this upgrade would make such a
refactor any easier, but it wouldn't make it harder either. Unfortunately,
moving Hadoop to an extension by itself doesn't really do anything to help
our dependency problem, which is the thing that has agitated me enough to
start this thread and start looking into solutions.

@will/@frank I feel like the stranglehold Hadoop has on our dependencies
has become especially painful in the last couple of years. Most painful to
me is that we are stuck on a version of Apache Calcite from 2019 (six
versions behind the latest), because newer versions require a newer version
of Guava. This means we cannot pick up any bug fixes or improvements in our
SQL parsing layer without doing something like packaging a shaded version
of it ourselves or solving our Hadoop dependency problem.
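
To make the conflict concrete, here is a rough sketch of the kind of
version pins involved (the exact property names and versions are
illustrative, not quoted from our pom, but the shape of the problem is
that Hadoop 2.x is compiled against Guava 11, which forces the pin):

    <!-- hypothetical pom.xml properties illustrating the pin -->
    <properties>
      <!-- held back because Hadoop 2.x is compiled against old Guava -->
      <guava.version>16.0.1</guava.version>
      <!-- a 2019 release; newer Calcite releases require newer Guava -->
      <calcite.version>1.21.0</calcite.version>
    </properties>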

Many other dependencies have also proved problematic alongside Hadoop in
the past, and since we aren't able to run the Hadoop integration tests in
Travis, there is always a chance that we don't catch these conflicts when
they go in. Now that we have turned on dependabot this week,
https://github.com/apache/druid/pull/11079, I imagine we are going to have
to proceed very carefully with it until we are able to resolve this
dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java
version newer than Java 8, per
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
which ties into another goal we have been working towards: official Druid
support for Java 11+ environments.

I'm sort of at a loss for what else to do besides one of:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom package our own Hadoop 2.x dependencies,
shaded similarly to the Hadoop 3 client jars, and only supporting Hadoop
with application classpath isolation (mapreduce.job.classloader = true;
see the sketch after this list)
- just dropping support for Hadoop completely
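
For reference, the classpath isolation in the second option is the same
setting our Hadoop ingestion docs already suggest for dependency
conflicts. A minimal sketch of the relevant tuningConfig fragment of an
index_hadoop task spec (the system.classes value here is illustrative):

    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "java., javax., org.apache.hadoop."
      }
    }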

I would much rather devote all effort into making Druid's native batch
ingestion better, to encourage people to migrate to it, than keep fighting
to figure out how to continue supporting Hadoop, so upgrading and switching
to the shaded client jars at least seemed like a reasonable compromise
short of dropping Hadoop completely. Maybe making custom shaded Hadoop
dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I
am imagining, but it does seem like the most work of the solutions I could
think of to potentially resolve this problem.
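
For anyone who hasn't looked at the Hadoop 3 shaded clients yet: they
amount to replacing the sprawling hadoop-client dependency tree with an
API jar plus a runtime jar whose transitive dependencies are relocated. A
sketch of what the swap looks like in a pom (version shown is 3.3.0, the
release mentioned above):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>3.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>3.3.0</version>
      <scope>runtime</scope>
    </dependency>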

Does anyone have any other ideas for how we can isolate our dependencies
from Hadoop? Solutions like shading Guava,
https://github.com/apache/druid/pull/10964, would let Druid itself use a
newer Guava, but that doesn't help with conflicts between our dependencies,
which has always seemed like the larger problem to me. Moving Hadoop
support to an extension doesn't help anything unless we can ensure that
Druid ingestion tasks can run on Hadoop without having to match all of the
Hadoop cluster's dependencies via some sort of classloader wizardry.
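
For context, that kind of Guava shading boils down to a maven-shade-plugin
relocation along these lines (a sketch, not the exact contents of that PR;
the shaded package name is made up for illustration):

    <relocations>
      <relocation>
        <!-- rewrite Guava's packages so Druid's copy can't collide with
             whatever Guava version Hadoop drags onto the classpath -->
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.druid.guava</shadedPattern>
      </relocation>
    </relocations>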

Maybe we could consider keeping a 0.22.x release line in Druid that gets
security and minor bug fixes for some period of time, to give people longer
to migrate off of Hadoop 2.x? I can't speak for the rest of the committers,
but I would personally be more open to maintaining such a branch if it
meant that, moving forward, we could update all of our dependencies to
newer versions while still providing a transition path for users until they
migrate to Hadoop 3 or to native Druid batch ingestion.

Any other ideas?



On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org> wrote:

> Considering Druid takes advantage of lots of external components to work, I
> think we should upgrade Druid in a somewhat conservative way. Dropping
> support for hadoop2 is not a good idea.
> The upgrade of the ZooKeeper client in Druid also prevents me from
> adopting 0.22 for a longer time.
>
> Although users could upgrade these dependencies first to use the latest
> Druid releases, frankly speaking, these upgrades are not so easy in
> production and usually take a long time, which would prevent users from
> experiencing new features of Druid.
> As for hadoop3, I have heard of some performance issues, which also gives
> me little confidence to upgrade.
>
> I think what Jihoon proposes is a good idea: separating hadoop2 from Druid
> core as an extension.
> Since hadoop2 has not reached EOL, to achieve a balance between
> compatibility and long-term evolution, maybe we could provide two
> extensions, one for hadoop2 and one for hadoop3.
>
>
>
> On Wed, Jun 9, 2021 at 4:13 AM Will Lauer <wla...@verizonmedia.com.invalid> wrote:
>
> > Just to follow up on this, our main problem with hadoop3 right now has
> > been instability in HDFS, to the extent that we have put on hold any
> > plans to deploy it to our production systems. I would claim Hadoop3 isn't
> > mature enough yet to consider migrating Druid to it.
> >
> > Will
> >
> >
> > Will Lauer
> >
> > Senior Principal Architect, Audience & Advertising Reporting
> > Data Platforms & Systems Engineering
> >
> > M 508 561 6427
> > 1908 S. First St
> > Champaign, IL 61822
> >
> >
> > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wla...@verizonmedia.com> wrote:
> >
> > > Unfortunately, the migration off of hadoop2 is a hard one (maybe not
> > > for Druid, but certainly for big organizations running large hadoop2
> > > workloads). If Druid migrated to hadoop3 after 0.22, that would
> > > probably prevent me from taking any new versions of Druid for at least
> > > the remainder of the year and possibly longer.
> > >
> > > Will
> > >
> > >
> > >
> > > Will Lauer
> > >
> > > Senior Principal Architect, Audience & Advertising Reporting
> > > Data Platforms & Systems Engineering
> > >
> > > M 508 561 6427
> > > 1908 S. First St
> > > Champaign, IL 61822
> > >
> > >
> > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I've been assisting with some experiments to see how we might want to
> > >> migrate Druid to support Hadoop 3.x and, more importantly, to see if
> > >> maybe we can finally be free of some of the dependency issues it has
> > >> been causing for as long as I can remember working with Druid.
> > >>
> > >> Hadoop 3 introduced shaded client jars,
> > >> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose
> > >> of allowing applications to talk to the Hadoop cluster without
> > >> drowning in its transitive dependencies. The experimental branch that
> > >> I have been helping with, which uses these new shaded client jars, can
> > >> be seen in this PR, https://github.com/apache/druid/pull/11314, and is
> > >> currently working with the HDFS integration tests as well as the
> > >> Hadoop tutorial flow in the Druid docs (which is pretty much
> > >> equivalent to the HDFS integration test).
> > >>
> > >> The cloud deep storages still need some further testing, and some
> > >> minor cleanup still needs to be done for the docs and such.
> > >> Additionally, we still need to figure out how to handle the Kerberos
> > >> extension: because it extends some Hadoop classes, it isn't able to
> > >> use the shaded client jars in a straightforward manner, and so it
> > >> still has heavy dependencies and hasn't been tested. However, the
> > >> experiment has started to pan out enough that I think it is worth
> > >> starting this discussion, because it does have some implications.
> > >>
> > >> Making this change will, I think, allow us to update our dependencies
> > >> with a lot more freedom (I'm looking at you, Guava), but the catch is
> > >> that once we make this change and start updating these dependencies,
> > >> it will become hard, nearing impossible, to support Hadoop 2.x, since
> > >> as far as I know there isn't an equivalent set of shaded client jars
> > >> for 2.x. I am also not certain how far back the Hadoop job classpath
> > >> isolation support goes (mapreduce.job.classloader = true), which I
> > >> think is required to be set on Druid tasks for the shaded jars to work
> > >> alongside updated Druid dependencies.
> > >>
> > >> Is anyone opposed to or worried about dropping Hadoop 2.x support
> > >> after the Druid 0.22 release?
> > >>
> > >
> >
>
