For Spark support, the connector I wrote remains functional, but I haven't updated the PR for six months or so since there didn't seem to be an appetite for review. If that's changing, I could port some more recent changes back to the OSS PR. Even with an up-to-date patch, though, I see two problems:
First, I remain worried that there isn't sufficient support among committers for the Spark connector. I don't want Druid to end up in the same place it is now with Hadoop 2 support, where no one really maintains the code and we wind up with another awkward corner of the code base that holds back other development.

Second, the PR I have up is for Spark 2.4, which is now two years further out of date than it was back in 2020. Similarly to Hadoop, there is a bifurcation in the community: Spark 2.4 is still in heavy use, so we might be trading one problem for another if we deprecate Hadoop 2 in favor of Spark 2.4. I have written a Spark 3.2 connector as well, but it has been deployed to significantly smaller use cases than the 2.4 line.

Even with these two caveats, if there's a desire among the Druid development community to add Spark functionality and support it, I'd love to push this across the finish line.

On Aug 9, 2022, at 1:04 AM, Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

Yes. We should deprecate it first, which is similar to dropping support (no more active development), but we will still ship it for a release or two. In a way, we are already in that mode to a certain extent. Many features are being built with native ingestion as a first-class citizen; e.g. range partitioning is still not supported on Hadoop ingestion. It's hard for developers to build and test their business logic for all the ingestion modes.

It would be good to hear what gaps the community sees between native ingestion and Hadoop-based batch ingestion, and then work toward fixing those gaps before dropping Hadoop ingestion entirely. For example, if users want the resource elasticity that a Hadoop cluster gives, we could push forward PRs such as https://github.com/apache/druid/pull/10910. It's not the same as a Hadoop cluster, but it will nonetheless let users reuse their existing infrastructure to run Druid jobs.
On Tue, Aug 9, 2022 at 9:43 AM Gian Merlino <g...@apache.org> wrote:

It's always good to deprecate things for some time prior to removing them, so we don't need to (nor should we) remove Hadoop 2 support right now. My vote is that in this upcoming release, we should deprecate it. The main problem in my eyes is the one Abhishek brought up: the dependency management situation with Hadoop 2 is really messy, and I'm not sure there's a good way to handle it given the limited classloader isolation. This situation becomes tougher to manage with each release, and we haven't had people volunteering to find and build comprehensive solutions. It is time to move on.

The concern Samarth raised, that people may end up stuck on older Druid versions because they aren't able to upgrade to Hadoop 3, is valid. I can see two good solutions to this. First: we can improve native ingest to the point where people feel broadly comfortable moving Hadoop 2 workloads to native. The work planned as part of doing ingest via multi-stage distributed query <https://github.com/apache/druid/issues/12262> is going to be useful here, by improving the speed and scalability of native ingest. Second: it would also be great to have something similar that runs on Spark, for people that have made investments in Spark. I suspect that most people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting both of those would ease a lot of the potential pain of dropping Hadoop 2 support.

On Spark: I'm not familiar with the current state of the Spark work. Is it stuck? If so, could something be done to unstick it? I agree with Abhishek that I wouldn't want to block moving off Hadoop 2 on this. However, it would be great if we could get it done before actually removing Hadoop 2 support from the code base.
On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

I was thinking that moving from Hadoop 2 to Hadoop 3 would be a lower-resistance path than moving from Hadoop to Spark. Even if we get that PR merged, it will take a good amount of time for the Spark integration to reach the same level of maturity as Hadoop or native ingestion. BTW, I am not making an argument against Spark integration; it will certainly be nice to have Spark as an option. Just that Spark integration shouldn't become a blocker for us to get off Hadoop.

BTW, are you using Hadoop 2 right now with the latest Druid version? If so, did you run into errors similar to the ones I posted in my last email?

On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <samarth.j...@gmail.com> wrote:

I am sure there are other companies out there who are still on Hadoop 2.x, with migration to Hadoop 3.x being a no-go. If Druid were to drop support for Hadoop 2.x completely, I am afraid it would prevent users from updating to newer versions of Druid, which would be a shame.

FWIW, we have found in practice, for high-volume use cases, that compaction based on Druid's Hadoop-based batch ingestion is a lot more scalable than the native compaction.

Having said that, as an alternative, if we can merge Julian's Spark-based ingestion PR <https://github.com/apache/druid/issues/9780> in Druid, that might provide an alternate way for users to get rid of the Hadoop dependency.

On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

Reviving this conversation again. @Will - Do you still have concerns about HDFS stability? Hadoop 3 has been around for some time now and is very stable as far as I know.
The dependencies coming from Hadoop 2 are also old enough that they cause dependency scans to fail. E.g. Log4j 1.x dependencies coming from Hadoop 2 get flagged during these scans. We have also seen issues when customers try to use Hadoop ingestion with the latest log4j2 library:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
    at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
    at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
    at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)

Instead of fixing these point issues, we would be better served by moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases, and its dependencies are well isolated.

On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1...@gmail.com> wrote:

Hello,
We could also use Maven profiles: keep hadoop2 support by default and add a new Maven profile for hadoop3. This would allow users to choose the profile best suited for their use case. Agreed, it will not help with the Hadoop dependency problems, but it does enable our users to run Druid with multiple Hadoop flavors. Also, with hadoop3, as Clint mentioned, the dependencies come pre-shaded, so we significantly reduce our effort in solving the dependency problems.
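The profile split Karan describes might look roughly like this in a parent pom.xml; the property name and version numbers here are illustrative, not Druid's actual build:

```xml
<!-- Hypothetical sketch of a hadoop2 (default) / hadoop3 profile pair. -->
<profiles>
  <profile>
    <id>hadoop2</id>
    <!-- Active unless another profile is selected on the command line. -->
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <hadoop.compile.version>2.8.5</hadoop.compile.version>
    </properties>
  </profile>
  <profile>
    <id>hadoop3</id>
    <properties>
      <hadoop.compile.version>3.3.1</hadoop.compile.version>
    </properties>
  </profile>
</profiles>
```

A build would then select the flavor with e.g. `mvn clean install -P hadoop3`, while a plain `mvn clean install` keeps the hadoop2 default.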
I have the PR in its last phases; I am able to run the entire test suite (unit + integration tests) on both the default (hadoop2) and the new hadoop3 profile.

On 2021/06/09 11:55:31, Will Lauer <wla...@verizonmedia.com.INVALID> wrote:

Clint,

I fully understand what type of headache dealing with these dependency issues is. We deal with this all the time, and based on conversations I've had with our internal Hadoop development team, they are quite aware of them and just as frustrated by them as you are. I'm certainly in favor of doing something to improve this situation, as long as it doesn't abandon a large section of the user base, which I think dropping hadoop2 would do.

I think there are solutions that can help with the conflicting dependency problem. Refactoring Hadoop support into an independent extension is certainly a start, but I think the dependency problem is bigger than that. There will always be conflicts between dependencies in the core system and in extensions as the system gets bigger. We have one right now internally that prevents us from enabling SQL in our instance of Druid, due to conflicts between the versions of protobuf used by Calcite and by one of our critical extensions. Long term, I think you are going to need to carefully think through a ClassLoader-based strategy to truly separate the impact of various dependencies.

While I'm not seriously suggesting it for Druid, OSGi WOULD solve this problem.
It's a system that allows you to explicitly declare what each bundle exposes to the system and what each bundle consumes from the system, allowing multiple conflicting dependencies to co-exist without impacting each other. OSGi is the big-hammer approach, but I bet a more appropriate solution would be a simpler custom-ClassLoader-based solution that hid all dependencies in extensions, keeping them from impacting the core, and that only exposed "public" pieces of the core to extensions. If Druid's core could be extended without impacting the various extensions, and the extensions' dependencies could be modified without impacting the core, this would go a long way towards solving the problem you have described.

Will

Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
M 508 561 6427
1908 S. First St
Champaign, IL 61822

On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cwy...@apache.org> wrote:

@itai, I think, pending the outcome of this discussion, that it makes sense to have a wider community thread to announce any decisions we make here. Thanks for bringing that up.

@rajiv, MinIO support seems unrelated to this discussion. It seems like a reasonable request, but I recommend starting another thread to see if someone is interested in taking up this effort.
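The custom-ClassLoader direction Will sketches above is, at its core, a child-first (parent-last) loader, so an extension's dependency versions win over those on the core classpath. A minimal, hypothetical sketch (this is not Druid's actual extension loader):

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Child-first (parent-last) class loader: tries its own URLs before
 * delegating to the parent, so an extension's dependency versions shadow
 * the core's. JDK classes are always taken from the parent.
 * Illustrative sketch only.
 */
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Never shadow core JDK classes.
            if (name.startsWith("java.") || name.startsWith("javax.")) {
                return super.loadClass(name, resolve);
            }
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name); // look in our own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, resolve); // fall back to parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs of its own, everything falls through to the parent.
        ChildFirstClassLoader cl =
            new ChildFirstClassLoader(new URL[0], ChildFirstClassLoader.class.getClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

A real design would also need to decide which "public" core packages to delegate parent-first (as Will notes), but the inverted delegation order above is the essential trick.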
@jihoon, I definitely agree that Hadoop should be refactored into an extension longer term. I don't think this upgrade would necessarily make such a refactor any easier, but not harder either. Just moving Hadoop to an extension also unfortunately doesn't really do anything to help our dependency problem, which is the thing that has agitated me enough to start this thread and start looking into solutions.

@will / @frank, I feel like the stranglehold Hadoop has on our dependencies has become especially painful in the last couple of years. Most painful to me is that we are stuck using a version of Apache Calcite from 2019 (six versions behind the latest), because newer versions require a newer version of Guava. This means we cannot get any bug fixes and improvements in our SQL parsing layer without doing something like packaging a shaded version of it ourselves or solving our Hadoop dependency problem.

Many other dependencies have also proved problematic with Hadoop in the past, and since we aren't able to run the Hadoop integration tests in Travis, there is always the chance that we don't catch these when they go in.
I imagine, now that we have turned on dependabot this week (https://github.com/apache/druid/pull/11079), that we are going to have to proceed very carefully with it until we are able to resolve this dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java version newer than Java 8, per https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions, which is another area we have been working towards: having Druid officially support Java 11+ environments.
I'm sort of at a loss for what else to do besides one of:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom-package our own Hadoop 2.x dependencies, shaded similarly to the Hadoop 3 client jars, and only supporting Hadoop with application classpath isolation (mapreduce.job.classloader = true)
- just dropping support for Hadoop completely

I would much rather devote all effort into making Druid's native batch ingestion better, to encourage people to migrate to that, than continue fighting to figure out how to keep supporting Hadoop, so upgrading and switching to the shaded client jars at least seemed like a reasonable compromise compared to dropping it completely. Maybe making custom shaded Hadoop dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I am imagining, but it does seem like the most work among the solutions I could think of.

Does anyone have any other ideas for how we can isolate our dependencies from Hadoop? Solutions like shading Guava (https://github.com/apache/druid/pull/10964) would let Druid itself use newer Guava, but that doesn't help conflicts within our dependencies, which has always seemed to be the larger problem to me.
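For reference, the classpath-isolation flag named in the second option above is set per job, e.g. in a Hadoop ingestion spec's tuningConfig (fragment only; the surrounding spec fields are elided):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.classloader": "true"
    }
  }
}
```

With this set, MapReduce loads the job's classes in a classloader separate from the cluster's own classpath, which is what would let a shaded or updated Druid dependency set coexist with the Hadoop cluster's jars.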
Moving Hadoop support to an extension doesn't help anything unless we can ensure that we can run Druid ingestion tasks on Hadoop without having to match all of the Hadoop cluster's dependencies with some sort of classloader wizardry.

Maybe we could consider keeping a 0.22.x release line in Druid that gets security and minor bug fixes for some period of time, to give people a longer period to migrate off of Hadoop 2.x? I can't speak for the rest of the committers, but I would personally be more open to maintaining such a branch if it meant that moving forward we could at least update all of our dependencies to newer versions, while providing a transition path that still offers some support until users migrate to Hadoop 3 or native Druid batch ingestion.

Any other ideas?

On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org> wrote:

Considering Druid takes advantage of lots of external components to work, I think we should upgrade Druid in a somewhat conservative way. Dropping support for hadoop2 is not a good idea. The upgrade of the ZooKeeper client in Druid is also preventing me from adopting 0.22 for some time.

Although users could upgrade these dependencies first to use the latest Druid releases, frankly speaking, these upgrades are not so easy in production and usually take a long time, which would prevent users from experiencing new features of Druid. For hadoop3, I have heard of some performance issues, which also leaves me without confidence to upgrade.
I think what Jihoon proposes is a good idea: separating hadoop2 from Druid core as an extension. Since hadoop2 has not reached EOL, to achieve a balance between compatibility and long-term evolution, maybe we could provide two extensions: one for hadoop2, one for hadoop3.

On Wed, Jun 9, 2021 at 4:13 AM, Will Lauer <wla...@verizonmedia.com.invalid> wrote:

Just to follow up on this: our main problem with hadoop3 right now has been instability in HDFS, to the extent that we have put on hold any plans to deploy it to our production systems. I would claim hadoop3 isn't mature enough yet to consider migrating Druid to it.

Will
On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wla...@verizonmedia.com> wrote:

Unfortunately, the migration to hadoop3 is a hard one (maybe not for Druid, but certainly for big organizations running large hadoop2 workloads). If Druid migrated to hadoop3 after 0.22, that would probably prevent me from taking any new versions of Druid for at least the remainder of the year, and possibly longer.
Will

On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org> wrote:

Hi all,

I've been assisting with some experiments to see how we might want to migrate Druid to support Hadoop 3.x and, more importantly, to see if maybe we can finally be free of some of the dependency issues it has been causing for as long as I can remember working with Druid.

Hadoop 3 introduced shaded client jars (https://issues.apache.org/jira/browse/HADOOP-11804), with the purpose of allowing applications to talk to the Hadoop cluster without drowning in its transitive dependencies. The experimental branch that I have been helping with, which uses these new shaded client jars, can be seen in this PR: https://github.com/apache/druid/pull/11314. It is currently working with the HDFS integration tests as well as the Hadoop tutorial flow in the Druid docs (which is pretty much equivalent to the HDFS integration test).

The cloud deep storages still need some further testing, and some minor cleanup still needs to be done for the docs and such.
Additionally, we still need to figure out how to handle the Kerberos extension: because it extends some Hadoop classes, it isn't able to use the shaded client jars in a straightforward manner, so it still has heavy dependencies and hasn't been tested. However, the experiment has started to pan out enough that I think it is worth starting this discussion, because it does have some implications.

Making this change, I think, will allow us to update our dependencies with a lot more freedom (I'm looking at you, Guava), but the catch is that once we make this change and start updating these dependencies, it will become hard, nearing impossible, to support Hadoop 2.x, since as far as I know there isn't an equivalent set of shaded client jars. I am also not certain how far back the Hadoop job classpath isolation stuff goes (mapreduce.job.classloader = true), which I think is required to be set on Druid tasks for this shaded stuff to work alongside updated Druid dependencies.

Is anyone opposed to or worried about dropping Hadoop 2.x support after the Druid 0.22 release?
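For reference, the shaded clients from HADOOP-11804 that Clint mentions are consumed roughly like this in a Maven build (the version shown is illustrative):

```xml
<!-- hadoop-client-api holds the public Hadoop classes; hadoop-client-runtime
     carries the relocated third-party dependencies needed at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.3.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.3.1</version>
  <scope>runtime</scope>
</dependency>
```

Because the transitive dependencies (Guava, protobuf, etc.) are relocated inside hadoop-client-runtime, they no longer collide with the versions the application itself wants to use, which is the freedom being discussed in this thread.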
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org