Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Maytas Monsereenusorn Mon, 22 Aug 2022 00:15:57 -0700

Hi Julian,

Thank you so much for your contribution on Spark support. As an existing
committer, I would like to help get the Spark connector merged into OSS
(including PR reviews and any other development work that may be needed).
We can move the conversation regarding Spark support into a new thread or
reuse the GitHub issue already opened to keep this thread focused on
dropping support for Hadoop 2.x.


Best Regards,
Maytas

On Sun, Aug 21, 2022 at 11:55 PM Julian Jaffe <[email protected]>
wrote:

> For Spark support, the connector I wrote remains functional but I haven’t
> updated the PR for six months or so since it didn’t seem like there was an
> appetite for review. If that’s changing I could migrate back some more
> recent changes to the OSS PR. Even with an up-to-date patch though I see
> two problems:
>
> First, I remain worried that there isn’t sufficient support among
> committers for the Spark connector. I don’t want Druid to end up in the
> same place it is now for Hadoop 2 support where no one really maintains the
> Spark code and we wind up with another awkward corner of the code base that
> holds back other development.
>
> Secondly, the PR I have up is for Spark 2.4, which is now 2 years further
> out of date than it was back in 2020. Similarly to Hadoop there is a
> bifurcation in the community and Spark 2.4 is still in heavy use but we
> might be trading one problem for another if we deprecate Hadoop 2 in favor
> of Spark 2.4. I have written a Spark 3.2 connector as well but it’s been
> deployed to significantly smaller use cases than the 2.4 line.
>
> Even with these two caveats, if there’s a desire among the Druid
> development community to add Spark functionality and support it I’d love to
> push this across the finish line.
>
> > On Aug 9, 2022, at 1:04 AM, Abhishek Agarwal <[email protected]>
> wrote:
> >
> > Yes. We should deprecate it first which is similar to dropping the
> support
> > (no more active development) but we will still ship it for a release or
> > two. In a way, we are already in that mode to a certain extent. Many
> > features are being built with native ingestion as a first-class citizen.
> > E.g. range partitioning is still not supported on Hadoop ingestion. It's
> > hard for developers to build and test their business logic for all the
> > ingestion modes.
> >
> > It will be good to hear what gaps the community sees between native
> > ingestion vs Hadoop-based batch ingestion. And then work toward fixing
> > those gaps before dropping the Hadoop ingestion entirely. For example, if
> > users want the resource elasticity that a Hadoop cluster gives, we could
> > push forward PRs such as https://github.com/apache/druid/pull/10910.
> It's
> > not the same as a Hadoop cluster but nonetheless will let users reuse
> their
> > existing infrastructure to run druid jobs.
> >
> >> On Tue, Aug 9, 2022 at 9:43 AM Gian Merlino <[email protected]> wrote:
> >>
> >> It's always good to deprecate things for some time prior to removing
> them,
> >> so we don't need to (nor should we) remove Hadoop 2 support right now.
> My
> >> vote is that in this upcoming release, we should deprecate it. The main
> >> problem in my eyes is the one Abhishek brought up: the dependency
> >> management situation with Hadoop 2 is really messy, and I'm not sure
> >> there's a good way to handle them given the limited classloader
> isolation.
> >> This situation becomes tougher to manage with each release, and we
> haven't
> >> had people volunteering to find and build comprehensive solutions. It is
> >> time to move on.
> >>
> >> The concern Samarth raised, that people may end up stuck on older Druid
> >> versions because they aren't able to upgrade to Hadoop 3, is valid. I
> can
> >> see two good solutions to this. First: we can improve native ingest to
> the
> >> point where people feel broadly comfortable moving Hadoop 2 workloads to
> >> native. The work planned as part of doing ingest via multi-stage
> >> distributed query <https://github.com/apache/druid/issues/12262> is
> going
> >> to be useful here, by improving the speed and scalability of native
> ingest.
> >> Second: it would also be great to have something similar that runs on
> >> Spark, for people that have made investments in Spark. I suspect that
> most
> >> people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so
> supporting
> >> both of those would ease a lot of the potential pain of dropping Hadoop
> 2
> >> support.
> >>
> >> On Spark: I'm not familiar with the current state of the Spark work. Is
> it
> >> stuck? If so could something be done to unstick it? I agree with
> Abhishek
> >> that I wouldn't want to block moving off Hadoop 2 on this. However,
> it'd be
> >> great if we could get it done before actually removing Hadoop 2 support
> >> from the code base.
> >>
> >>
> >> On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <
> [email protected]
> >>>
> >> wrote:
> >>
> >>> I was thinking that moving from Hadoop 2 to Hadoop 3 will be a
> >>> lower-resistance path than moving from Hadoop to Spark. Even if we get that
> >>> PR merged, it will take a good while for Spark integration to reach the same
> >>> level of maturity as Hadoop or native ingestion. BTW, I am not making an
> >>> argument against Spark integration. It will certainly be nice to have Spark
> >>> as an option. I just don't want Spark integration to become a blocker for us
> >>> to get off Hadoop.
> >>>
> >>> BTW, are you using Hadoop 2 right now with the latest Druid version? If so,
> >>> did you run into errors similar to the ones I posted in my last email?
> >>>
> >>> On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <[email protected]>
> >>> wrote:
> >>>
> >>>> I am sure there are other companies out there who are still on Hadoop
> >> 2.x
> >>>> with migration to Hadoop 3.x being a no-go.
> >>>> If Druid were to drop support for Hadoop 2.x completely, I am afraid it
> >>>> would prevent users from updating to newer versions of Druid which
> >> would
> >>> be
> >>>> a shame.
> >>>>
> >>>> FWIW, we have found in practice for high volume use cases that compaction
> >>>> based on Druid's Hadoop based batch ingestion is a lot more scalable than
> >>>> the native compaction.
> >>>>
> >>>> Having said that, as an alternative, if we can merge Julian's Spark based
> >>>> ingestion PRs <https://github.com/apache/druid/issues/9780> in Druid, that
> >>>> might provide an alternate way for users to get rid of the Hadoop
> >>>> dependency.
> >>>>
> >>>> On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <
> >>>> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> Reviving this conversation again.
> >>>>> @Will - Do you still have concerns about HDFS stability? Hadoop 3 has
> >>>> been
> >>>>> around for some time now and is very stable as far as I know.
> >>>>>
> >>>>> The dependencies coming from Hadoop 2 are also old enough that they
> >>> cause
> >>>>> dependency scans to fail. E.g. Log4j 1.x dependencies that are coming
> >>>> from
> >>>>> Hadoop 2, get flagged during these scans. We have also seen issues
> >> when
> >>>>> customers try to use Hadoop ingestion with the latest log4j2 library.
> >>>>>
> >>>>> Exception in thread "main" java.lang.NoSuchMethodError:
> >>>>> org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
> >>>>> at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
> >>>>> at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
> >>>>> at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
> >>>>>
> >>>>>
> >>>>> Instead of fixing these point issues, we would be better served by
> >>>>> moving to Hadoop 3 entirely. Hadoop 3 does get more frequent
> >>>>> releases and dependencies are well isolated.
> >>>>>
> >>>>> On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <
> >> [email protected]
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello
> >>>>>> We can also use maven profiles. We keep hadoop2 support by default
> >>> and
> >>>>> add
> >>>>>> a new maven profile with hadoop3. This will allow the user to
> >> choose
> >>>> the
> >>>>>> profile which is best suited for the use case.
> >>>>>> Agreed, it will not help in the Hadoop dependency problems but does
> >>>>> enable
> >>>>>> our users to use druid with multiple flavors.
> >>>>>> Also with hadoop3, as Clint mentioned, the dependencies come pre-shaded, so
> >>>>>> we significantly reduce our effort in solving the dependency problems.
> >>>>>> I have the PR in the last phases where I am able to run the entire test
> >>>>>> suite (unit + integration tests) on both the default, i.e. hadoop2, and the new
> >>>>>> hadoop3 profile.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2021/06/09 11:55:31, Will Lauer <[email protected]
> >> .INVALID>
> >>>>>> wrote:
> >>>>>>> Clint,
> >>>>>>>
> >>>>>>> I fully understand what type of headache dealing with these
> >>>> dependency
> >>>>>>> issues is. We deal with this all the time, and based on
> >>> conversations
> >>>>>> I've
> >>>>>>> had with our internal hadoop development team, they are quite
> >> aware
> >>>> of
> >>>>>> them
> >>>>>>> and just as frustrated by them as you are. I'm certainly in favor
> >>> of
> >>>>>> doing
> >>>>>>> something to improve this situation, as long as it doesn't
> >> abandon
> >>> a
> >>>>>> large
> >>>>>>> section of the user base, which I think DROPPING hadoop2 would
> >> do.
> >>>>>>>
> >>>>>>> I think there are solutions there that can help solve the
> >>> conflicting
> >>>>>>> dependency problem. Refactoring Hadoop support into an
> >> independent
> >>>>>>> extension is certainly a start. But I think the dependency
> >> problem
> >>> is
> >>>>>>> bigger than that. There are always going to be conflicts between
> >>>>>>> dependencies in the core system and in extensions as the system
> >>> gets
> >>>>>>> bigger. We have one right now internally that prevents us from
> >>>> enabling
> >>>>>> SQL
> >>>>>>> in our instance of Druid due to conflicts between versions of
> >>>> protobuf
> >>>>>> used
> >>>>>>> by Calcite vs one of our critical extensions. Long term, I think
> >>> you
> >>>>> are
> >>>>>>> going to need to carefully think through a ClassLoader based
> >>> strategy
> >>>>> to
> >>>>>>> truly separate the impact of various dependencies.
> >>>>>>>
> >>>>>>> While I'm not seriously suggesting it for Druid, OSGi WOULD solve
> >>>> this
> >>>>>>> problem. It's a system that allows you to explicitly declare what
> >>>> each
> >>>>>>> bundle exposes to the system, and what each bundle consumes from
> >>> the
> >>>>>>> system, allowing multiple conflicting dependencies to co-exist
> >>>> without
> >>>>>>> impacting each other. OSGi is the big hammer approach, but I bet
> >> a
> >>>> more
> >>>>>>> appropriate solution would be a simpler custom-ClassLoader based
> >>>>> solution
> >>>>>>> that hid all dependencies in extensions, keeping them from
> >>> impacting
> >>>>> the
> >>>>>>> core, and that only exposed "public" pieces of the core to
> >>>> extensions.
> >>>>> If
> >>>>>>> Druid's core could be extended without impacting the various
> >>>>> extensions,
> >>>>>>> and the extensions' dependencies could be modified without
> >>> impacting
> >>>>> the
> >>>>>>> core, this would go a long way towards solving the problem that
> >> you
> >>>>> have
> >>>>>>> described.
> >>>>>>>
> >>>>>>> Will
> >>>>>>>
> >>>>>>> <http://www.verizonmedia.com>
> >>>>>>>
> >>>>>>> Will Lauer
> >>>>>>>
> >>>>>>> Senior Principal Architect, Audience & Advertising Reporting
> >>>>>>> Data Platforms & Systems Engineering
> >>>>>>>
> >>>>>>> M 508 561 6427
> >>>>>>> 1908 S. First St
> >>>>>>> Champaign, IL 61822
> >>>>>>>
> >>>>>>> <http://www.facebook.com/verizonmedia>   <
> >>>>>> http://twitter.com/verizonmedia>
> >>>>>>> <https://www.linkedin.com/company/verizon-media/>
> >>>>>>> <http://www.instagram.com/verizonmedia>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <[email protected]>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> @itai, I think pending the outcome of this discussion that it
> >>> makes
> >>>>>> sense
> >>>>>>>> to have a wider community thread to announce any decisions we
> >>> make
> >>>>>> here,
> >>>>>>>> thanks for bringing that up.
> >>>>>>>>
> >>>>>>>> @rajiv, Minio support seems unrelated to this discussion. It
> >>> seems
> >>>>>> like a
> >>>>>>>> reasonable request, but I recommend starting another thread to
> >>> see
> >>>> if
> >>>>>>>> someone is interested in taking up this effort.
> >>>>>>>>
> >>>>>>>> @jihoon I definitely agree that Hadoop should be refactored to
> >> be
> >>>> an
> >>>>>>>> extension longer term. I don't think this upgrade would
> >>> necessarily
> >>>>>>>> make doing such a refactor any easier, but not harder either.
> >>> Just
> >>>>>> moving
> >>>>>>>> Hadoop to an extension also unfortunately doesn't really do
> >>>> anything
> >>>>> to
> >>>>>>>> help our dependency problem though, which is the thing that has
> >>>>>> agitated me
> >>>>>>>> enough to start this thread and start looking into solutions.
> >>>>>>>>
> >>>>>>>> @will/@frank I feel like the stranglehold Hadoop has on our
> >>>>>> dependencies
> >>>>>>>> has started to become especially more painful in the last
> >> couple
> >>> of
> >>>>>>>> years. Most painful to me is that we are stuck using a version
> >> of
> >>>>>> Apache
> >>>>>>>> Calcite from 2019 (six versions behind the latest), because
> >> newer
> >>>>>> versions
> >>>>>>>> require a newer version of Guava. This means we cannot get any
> >>> bug
> >>>>>> fixes
> >>>>>>>> and improvements in our SQL parsing layer without doing
> >> something
> >>>>> like
> >>>>>>>> packaging a shaded version of it ourselves or solving our
> >> Hadoop
> >>>>>> dependency
> >>>>>>>> problem.
> >>>>>>>>
> >>>>>>>> Many other dependencies have also proved problematic with
> >> Hadoop
> >>> as
> >>>>>> well in
> >>>>>>>> the past, and since we aren't able to run the Hadoop
> >> integration
> >>>>> tests
> >>>>>> in
> >>>>>>>> Travis, there is always the chance that sometimes we don't
> >> catch
> >>>>> these
> >>>>>> when
> >>>>>>>> they go in. I imagine now that we have turned on dependabot this week,
> >>>>>>>> https://github.com/apache/druid/pull/11079, that we are going to have to
> >>>>>>>> proceed very carefully with it until we are able to resolve this dependency
> >>>>>>>> issue.
> >>>>>>>>
> >>>>>>>> Hadoop 3.3.0 is also the first to support running on a Java version that is
> >>>>>>>> newer than Java 8 per
> >>>>>>>> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
> >>>>>>>> which is another area we have been working towards - Druid to officially
> >>>>>>>> support Java 11+ environments.
> >>>>>>>>
> >>>>>>>> I'm sort of at a loss of what else to do besides one of
> >>>>>>>> - switching to these Hadoop 3 shaded jars and dropping 2.x support
> >>>>>>>> - figuring out how to custom package our own Hadoop 2.x dependencies
> >>>>>>>> that are shaded similarly to the Hadoop 3 client jars, and only supporting
> >>>>>>>> Hadoop with application classpath isolation (mapreduce.job.classloader =
> >>>>>>>> true)
> >>>>>>>> - just dropping support for Hadoop completely
> >>>>>>>>
> >>>>>>>> I would much rather devote all effort into making Druid's native batch
> >>>>>>>> ingestion better to encourage people to migrate to that, than
> >>>>>> continuing to
> >>>>>>>> fight with figuring out how to keep supporting Hadoop, so
> >>> upgrading
> >>>>> and
> >>>>>>>> switching to the shaded client jars at least seemed like a
> >>>> reasonable
> >>>>>>>> compromise to dropping it completely. Maybe making custom
> >> shaded
> >>>>> Hadoop
> >>>>>>>> dependencies in the spirit of the Hadoop 3 shaded jars isn't as
> >>>> hard
> >>>>>> as I
> >>>>>>>> am imagining, but it does seem like the most amount of work
> >>> between
> >>>>> the
> >>>>>>>> solutions I could think of to potentially resolve this problem.
> >>>>>>>>
> >>>>>>>> Does anyone have any other ideas of how we can isolate our
> >>>>> dependencies
> >>>>>>>> from Hadoop? Solutions like shading Guava,
> >>>>>>>> https://github.com/apache/druid/pull/10964, would let Druid itself use
> >>>>>>>> newer Guava, but that doesn't help conflicts within our
> >>>> dependencies
> >>>>>> which
> >>>>>>>> has always seemed to be the larger problem to me. Moving Hadoop
> >>>>>> support to
> >>>>>>>> an extension doesn't help anything unless we can ensure that we
> >>> can
> >>>>> run
> >>>>>>>> Druid ingestion tasks on Hadoop without having to match all of the Hadoop
> >>>>>>>> cluster's dependencies with some sort of classloader wizardry.
> >>>>>>>>
> >>>>>>>> Maybe we could consider keeping a 0.22.x release line in Druid
> >>> that
> >>>>>> gets
> >>>>>>>> security and minor bug fixes for some period of time to give
> >>>> people a
> >>>>>>>> longer period to migrate off of Hadoop 2.x? I can't speak for
> >> the
> >>>>> rest
> >>>>>> of
> >>>>>>>> the committers, but I would personally be more open to
> >>> maintaining
> >>>>>> such a
> >>>>>>>> branch if it meant that moving forward at least we could update
> >>> all
> >>>>> of
> >>>>>> our
> >>>>>>>> dependencies to newer versions, while providing a transition
> >> path
> >>>> to
> >>>>>> still
> >>>>>>>> have at least some support until migrating to Hadoop 3 or
> >> native
> >>>>> Druid
> >>>>>>>> batch ingestion.
> >>>>>>>>
> >>>>>>>> Any other ideas?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Jun 8, 2021 at 7:44 PM frank chen <
> >> [email protected]>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Considering Druid takes advantage of lots of external
> >>> components
> >>>> to
> >>>>>>>> work, I
> >>>>>>>>> think we should upgrade Druid in a slightly more conservative
> >> way.
> >>>>>> Dropping
> >>>>>>>>> support of hadoop2 is not a good idea.
> >>>>>>>>> The upgrading of the ZooKeeper client in Druid also prevents
> >> me
> >>>>> from
> >>>>>>>>> adopting 0.22 for a longer time.
> >>>>>>>>>
> >>>>>>>>> Although users could upgrade these dependencies first to use
> >>> the
> >>>>>> latest
> >>>>>>>>> Druid releases, frankly speaking, these upgrades are not so
> >>> easy
> >>>> in
> >>>>>>>>> production and usually take longer time, which would prevent
> >>>> users
> >>>>>> from
> >>>>>>>>> experiencing new features of Druid.
> >>>>>>>>> For hadoop3, I have heard of some performance issues, which
> >>> also
> >>>>>> makes me
> >>>>>>>>> have no confidence to upgrade.
> >>>>>>>>>
> >>>>>>>>> I think what Jihoon proposes is a good idea, separating
> >> hadoop2
> >>>>> from
> >>>>>>>> Druid
> >>>>>>>>> core as an extension.
> >>>>>>>>> Since hadoop2 has not reached EOL, to achieve balance between
> >>>>>> compatibility
> >>>>>>>>> and long term evolution, maybe we could provide two
> >> extensions,
> >>>> one
> >>>>>> for
> >>>>>>>>> hadoop2, one for hadoop3.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 9, 2021 at 4:13 AM Will Lauer <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Just to follow up on this, our main problem with hadoop3
> >>> right
> >>>>> now
> >>>>>> has
> >>>>>>>>> been
> >>>>>>>>>> instability in HDFS, to the extent that we put on hold any
> >>>> plans
> >>>>> to
> >>>>>>>>> deploy
> >>>>>>>>>> it to our production systems. I would claim Hadoop3 isn't
> >>>> mature
> >>>>>> enough
> >>>>>>>>> yet
> >>>>>>>>>> to consider migrating Druid to it.
> >>>>>>>>>>
> >>>>>>>>>> Will
> >>>>>>>>>>
> >>>>>>>>>> <http://www.verizonmedia.com>
> >>>>>>>>>>
> >>>>>>>>>> Will Lauer
> >>>>>>>>>>
> >>>>>>>>>> Senior Principal Architect, Audience & Advertising
> >> Reporting
> >>>>>>>>>> Data Platforms & Systems Engineering
> >>>>>>>>>>
> >>>>>>>>>> M 508 561 6427
> >>>>>>>>>> 1908 S. First St
> >>>>>>>>>> Champaign, IL 61822
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <
> >>>>> [email protected]
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Unfortunately, the migration off of hadoop2 is a hard one
> >>>>> (maybe
> >>>>>> not
> >>>>>>>>> for
> >>>>>>>>>>> Druid, but certainly for big organizations running large
> >>>>> hadoop2
> >>>>>>>>>>> workloads). If druid migrated to hadoop3 after 0.22, that
> >>>> would
> >>>>>>>>> probably
> >>>>>>>>>>> prevent me from taking any new versions of Druid for at
> >>> least
> >>>>> the
> >>>>>>>>>> remainder
> >>>>>>>>>>> of the year and possibly longer.
> >>>>>>>>>>>
> >>>>>>>>>>> Will
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> <http://www.verizonmedia.com>
> >>>>>>>>>>>
> >>>>>>>>>>> Will Lauer
> >>>>>>>>>>>
> >>>>>>>>>>> Senior Principal Architect, Audience & Advertising
> >>> Reporting
> >>>>>>>>>>> Data Platforms & Systems Engineering
> >>>>>>>>>>>
> >>>>>>>>>>> M 508 561 6427
> >>>>>>>>>>> 1908 S. First St
> >>>>>>>>>>> Champaign, IL 61822
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <
> >>>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've been assisting with some experiments to see how we
> >>>> might
> >>>>>> want
> >>>>>>>> to
> >>>>>>>>>>>> migrate Druid to support Hadoop 3.x, and more
> >> importantly,
> >>>> see
> >>>>>> if
> >>>>>>>>> maybe
> >>>>>>>>>> we
> >>>>>>>>>>>> can finally be free of some of the dependency issues it
> >>> has
> >>>>> been
> >>>>>>>>> causing
> >>>>>>>>>>>> for as long as I can remember working with Druid.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hadoop 3 introduced shaded client jars,
> >>>>>>>>>>>> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose to
> >>>>>>>>>>>> allow applications to talk to the Hadoop cluster without
> >>>>>> drowning in
> >>>>>>>>> its
> >>>>>>>>>>>> transitive dependencies. The experimental branch that I
> >>> have
> >>>>>> been
> >>>>>>>>>> helping
> >>>>>>>>>>>> with, which is using these new shaded client jars, can be seen in this PR
> >>>>>>>>>>>> https://github.com/apache/druid/pull/11314, and is currently working with
> >>>>>>>>>>>> the HDFS integration tests as well as the Hadoop
> >> tutorial
> >>>> flow
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>> Druid
> >>>>>>>>>>>> docs (which is pretty much equivalent to the HDFS
> >>>> integration
> >>>>>> test).
> >>>>>>>>>>>>
> >>>>>>>>>>>> The cloud deep storages still need some further testing
> >>> and
> >>>>> some
> >>>>>>>> minor
> >>>>>>>>>>>> cleanup still needs to be done for the docs and such.
> >>> Additionally
> >>>>> we
> >>>>>>>> still
> >>>>>>>>>> need
> >>>>>>>>>>>> to figure out how to handle the Kerberos extension,
> >>> because
> >>>> it
> >>>>>>>> extends
> >>>>>>>>>>>> some
> >>>>>>>>>>>> Hadoop classes so isn't able to use the shaded client
> >> jars
> >>>> in
> >>>>> a
> >>>>>>>>>>>> straight-forward manner, and so still has heavy
> >>> dependencies
> >>>>> and
> >>>>>>>>> hasn't
> >>>>>>>>>>>> been tested. However, the experiment has started to pan
> >>> out
> >>>>>> enough
> >>>>>>>> to
> >>>>>>>>>>>> where
> >>>>>>>>>>>> I think it is worth starting this discussion, because it
> >>>> does
> >>>>>> have
> >>>>>>>>> some
> >>>>>>>>>>>> implications.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Making this change I think will allow us to update our
> >>>>>> dependencies
> >>>>>>>>>> with a
> >>>>>>>>>>>> lot more freedom (I'm looking at you, Guava), but the
> >>> catch
> >>>> is
> >>>>>> that
> >>>>>>>>> once
> >>>>>>>>>>>> we
> >>>>>>>>>>>> make this change and start updating these dependencies,
> >> it
> >>>>> will
> >>>>>>>> become
> >>>>>>>>>>>> hard, nearing impossible to support Hadoop 2.x, since as
> >>> far
> >>>>> as
> >>>>>> I
> >>>>>>>> know
> >>>>>>>>>>>> there isn't an equivalent set of shaded client jars. I
> >> am
> >>>> also
> >>>>>> not
> >>>>>>>>>> certain
> >>>>>>>>>>>> how far back the Hadoop job classpath isolation stuff
> >> goes
> >>>>>>>>>>>> (mapreduce.job.classloader = true) which I think is
> >>> required
> >>>>> to
> >>>>>> be
> >>>>>>>> set
> >>>>>>>>>> on
> >>>>>>>>>>>> Druid tasks for this shaded stuff to work alongside
> >>> updated
> >>>>>> Druid
> >>>>>>>>>>>> dependencies.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is anyone opposed to or worried about dropping Hadoop
> >> 2.x
> >>>>>> support
> >>>>>>>>> after
> >>>>>>>>>>>> the
> >>>>>>>>>>>> Druid 0.22 release?
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
