For Spark support, the connector I wrote remains functional, but I haven't updated the PR for six months or so since there didn't seem to be an appetite for review. If that's changing, I could port some more recent changes back to the OSS PR. Even with an up-to-date patch, though, I see two problems:
First, I remain worried that there isn't sufficient support among committers for the Spark connector. I don't want Druid to end up in the same place it is now with Hadoop 2 support, where no one really maintains the code and we wind up with another awkward corner of the code base that holds back other development.

Second, the PR I have up is for Spark 2.4, which is now two years further out of date than it was back in 2020. Similarly to Hadoop, there is a bifurcation in the community: Spark 2.4 is still in heavy use, so we might be trading one problem for another if we deprecate Hadoop 2 in favor of Spark 2.4. I have written a Spark 3.2 connector as well, but it has been deployed to significantly smaller use cases than the 2.4 line.

Even with these two caveats, if there's a desire among the Druid development community to add Spark functionality and support it, I'd love to push this across the finish line.

On Aug 9, 2022, at 1:04 AM, Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

Yes. We should deprecate it first, which is similar to dropping support (no more active development), but we will still ship it for a release or two. In a way, we are already in that mode to a certain extent. Many features are being built with native ingestion as a first-class citizen; e.g. range partitioning is still not supported on Hadoop ingestion. It's hard for developers to build and test their business logic for all the ingestion modes.

It would be good to hear what gaps the community sees between native ingestion and Hadoop-based batch ingestion, and then work toward fixing those gaps before dropping Hadoop ingestion entirely. For example, if users want the resource elasticity that a Hadoop cluster gives, we could push forward PRs such as https://github.com/apache/druid/pull/10910. It's not the same as a Hadoop cluster, but it will nonetheless let users reuse their existing infrastructure to run Druid jobs.
On Tue, Aug 9, 2022 at 9:43 AM Gian Merlino <g...@apache.org> wrote:

It's always good to deprecate things for some time prior to removing them, so we don't need to (nor should we) remove Hadoop 2 support right now. My vote is that in this upcoming release, we should deprecate it. The main problem in my eyes is the one Abhishek brought up: the dependency management situation with Hadoop 2 is really messy, and I'm not sure there's a good way to handle it given the limited classloader isolation. This situation becomes tougher to manage with each release, and we haven't had people volunteering to find and build comprehensive solutions. It is time to move on.

The concern Samarth raised, that people may end up stuck on older Druid versions because they aren't able to upgrade to Hadoop 3, is valid. I can see two good solutions to this. First: we can improve native ingest to the point where people feel broadly comfortable moving Hadoop 2 workloads to native. The work planned as part of doing ingest via multi-stage distributed query <https://github.com/apache/druid/issues/12262> is going to be useful here, by improving the speed and scalability of native ingest. Second: it would also be great to have something similar that runs on Spark, for people that have made investments in Spark. I suspect that most people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting both of those would ease a lot of the potential pain of dropping Hadoop 2 support.

On Spark: I'm not familiar with the current state of the Spark work. Is it stuck? If so, could something be done to unstick it? I agree with Abhishek that I wouldn't want to block moving off Hadoop 2 on this. However, it would be great if we could get it done before actually removing Hadoop 2 support from the code base.
On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

I was thinking that moving from Hadoop 2 to Hadoop 3 would be a lower-resistance path than moving from Hadoop to Spark. Even if we get that PR merged, it will take a good amount of time for the Spark integration to reach the same level of maturity as Hadoop or native ingestion. BTW, I am not making an argument against Spark integration; it will certainly be nice to have Spark as an option. Just that Spark integration shouldn't become a blocker for us to get off Hadoop.

BTW, are you using Hadoop 2 right now with the latest Druid version? If so, did you run into errors similar to the ones I posted in my last email?

On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <samarth.j...@gmail.com> wrote:

I am sure there are other companies out there who are still on Hadoop 2.x, with migration to Hadoop 3.x being a no-go. If Druid were to drop support for Hadoop 2.x completely, I am afraid it would prevent users from updating to newer versions of Druid, which would be a shame.

FWIW, we have found in practice, for high-volume use cases, that compaction based on Druid's Hadoop-based batch ingestion is a lot more scalable than the native compaction.

Having said that, as an alternative, if we can merge Julian's Spark-based ingestion PR <https://github.com/apache/druid/issues/9780> in Druid, that might provide an alternate way for users to get rid of the Hadoop dependency.

On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

Reviving this conversation again. @Will - Do you still have concerns about HDFS stability? Hadoop 3 has been around for some time now and is very stable as far as I know.
The dependencies coming from Hadoop 2 are also old enough that they cause dependency scans to fail. E.g. Log4j 1.x dependencies coming from Hadoop 2 get flagged during these scans. We have also seen issues when customers try to use Hadoop ingestion with the latest log4j2 library:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
    at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
    at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
    at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)

Instead of fixing these point issues, we would be better served by moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases, and its dependencies are well isolated.

On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1...@gmail.com> wrote:

Hello,
We could also use Maven profiles: keep hadoop2 support by default and add a new Maven profile for hadoop3. This would allow users to choose the profile best suited for their use case. Agreed, it will not help with the Hadoop dependency problems, but it does enable our users to run Druid with multiple Hadoop flavors. Also, with hadoop3, as Clint mentioned, the dependencies come pre-shaded, so we significantly reduce our effort in solving the dependency problems.
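The profile split Karan describes might look roughly like this in a parent pom.xml; the property name and version numbers here are illustrative, not Druid's actual build:

```xml
<!-- Hypothetical sketch of a hadoop2 (default) / hadoop3 profile pair. -->
<profiles>
  <profile>
    <id>hadoop2</id>
    <!-- Active unless another profile is selected on the command line. -->
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <hadoop.compile.version>2.8.5</hadoop.compile.version>
    </properties>
  </profile>
  <profile>
    <id>hadoop3</id>
    <properties>
      <hadoop.compile.version>3.3.1</hadoop.compile.version>
    </properties>
  </profile>
</profiles>
```

A build would then select the flavor with e.g. `mvn clean install -P hadoop3`, while a plain `mvn clean install` keeps the hadoop2 default.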
I have the PR in its last phases; I am able to run the entire test suite (unit + integration tests) on both the default (hadoop2) and the new hadoop3 profile.

On 2021/06/09 11:55:31, Will Lauer <wla...@verizonmedia.com.INVALID> wrote:

Clint,

I fully understand what type of headache dealing with these dependency issues is. We deal with this all the time, and based on conversations I've had with our internal Hadoop development team, they are quite aware of them and just as frustrated by them as you are. I'm certainly in favor of doing something to improve this situation, as long as it doesn't abandon a large section of the user base, which I think dropping hadoop2 would do.

I think there are solutions that can help with the conflicting dependency problem. Refactoring Hadoop support into an independent extension is certainly a start, but I think the dependency problem is bigger than that. There will always be conflicts between dependencies in the core system and in extensions as the system gets bigger. We have one right now internally that prevents us from enabling SQL in our instance of Druid, due to conflicts between the versions of protobuf used by Calcite and by one of our critical extensions. Long term, I think you are going to need to carefully think through a ClassLoader-based strategy to truly separate the impact of various dependencies.

While I'm not seriously suggesting it for Druid, OSGi WOULD solve this problem.
It's a system that allows you to explicitly declare what each bundle exposes to the system and what each bundle consumes from the system, allowing multiple conflicting dependencies to co-exist without impacting each other. OSGi is the big-hammer approach, but I bet a more appropriate solution would be a simpler custom-ClassLoader-based solution that hid all dependencies in extensions, keeping them from impacting the core, and that only exposed "public" pieces of the core to extensions. If Druid's core could be extended without impacting the various extensions, and the extensions' dependencies could be modified without impacting the core, this would go a long way towards solving the problem you have described.

Will

Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
M 508 561 6427
1908 S. First St
Champaign, IL 61822

On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cwy...@apache.org> wrote:

@itai, I think, pending the outcome of this discussion, that it makes sense to have a wider community thread to announce any decisions we make here. Thanks for bringing that up.

@rajiv, MinIO support seems unrelated to this discussion. It seems like a reasonable request, but I recommend starting another thread to see if someone is interested in taking up this effort.
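The custom-ClassLoader direction Will sketches above is, at its core, a child-first (parent-last) loader, so an extension's dependency versions win over those on the core classpath. A minimal, hypothetical sketch (this is not Druid's actual extension loader):

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Child-first (parent-last) class loader: tries its own URLs before
 * delegating to the parent, so an extension's dependency versions shadow
 * the core's. JDK classes are always taken from the parent.
 * Illustrative sketch only.
 */
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Never shadow core JDK classes.
            if (name.startsWith("java.") || name.startsWith("javax.")) {
                return super.loadClass(name, resolve);
            }
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name); // look in our own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, resolve); // fall back to parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs of its own, everything falls through to the parent.
        ChildFirstClassLoader cl =
            new ChildFirstClassLoader(new URL[0], ChildFirstClassLoader.class.getClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

A real design would also need to decide which "public" core packages to delegate parent-first (as Will notes), but the inverted delegation order above is the essential trick.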
@jihoon, I definitely agree that Hadoop should be refactored into an extension longer term. I don't think this upgrade would necessarily make such a refactor any easier, but not harder either. Just moving Hadoop to an extension also unfortunately doesn't really do anything to help our dependency problem, which is the thing that has agitated me enough to start this thread and start looking into solutions.

@will / @frank, I feel like the stranglehold Hadoop has on our dependencies has become especially painful in the last couple of years. Most painful to me is that we are stuck using a version of Apache Calcite from 2019 (six versions behind the latest), because newer versions require a newer version of Guava. This means we cannot get any bug fixes and improvements in our SQL parsing layer without doing something like packaging a shaded version of it ourselves or solving our Hadoop dependency problem.

Many other dependencies have also proved problematic with Hadoop in the past, and since we aren't able to run the Hadoop integration tests in Travis, there is always the chance that we don't catch these when they go in.
I imagine, now that we have turned on dependabot this week (https://github.com/apache/druid/pull/11079), that we are going to have to proceed very carefully with it until we are able to resolve this dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java version newer than Java 8, per https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions, which is another area we have been working towards: having Druid officially support Java 11+ environments.
I'm sort of at a loss for what else to do besides one of:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom-package our own Hadoop 2.x dependencies, shaded similarly to the Hadoop 3 client jars, and only supporting Hadoop with application classpath isolation (mapreduce.job.classloader = true)
- just dropping support for Hadoop completely

I would much rather devote all effort into making Druid's native batch ingestion better, to encourage people to migrate to that, than continue fighting to figure out how to keep supporting Hadoop, so upgrading and switching to the shaded client jars at least seemed like a reasonable compromise compared to dropping it completely. Maybe making custom shaded Hadoop dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I am imagining, but it does seem like the most work among the solutions I could think of.

Does anyone have any other ideas for how we can isolate our dependencies from Hadoop? Solutions like shading Guava (https://github.com/apache/druid/pull/10964) would let Druid itself use newer Guava, but that doesn't help conflicts within our dependencies, which has always seemed to be the larger problem to me.
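For reference, the classpath-isolation flag named in the second option above is set per job, e.g. in a Hadoop ingestion spec's tuningConfig (fragment only; the surrounding spec fields are elided):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.classloader": "true"
    }
  }
}
```

With this set, MapReduce loads the job's classes in a classloader separate from the cluster's own classpath, which is what would let a shaded or updated Druid dependency set coexist with the Hadoop cluster's jars.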
Moving Hadoop support to an extension doesn't help anything unless we can ensure that we can run Druid ingestion tasks on Hadoop without having to match all of the Hadoop cluster's dependencies with some sort of classloader wizardry.

Maybe we could consider keeping a 0.22.x release line in Druid that gets security and minor bug fixes for some period of time, to give people a longer period to migrate off of Hadoop 2.x? I can't speak for the rest of the committers, but I would personally be more open to maintaining such a branch if it meant that moving forward we could at least update all of our dependencies to newer versions, while providing a transition path that still offers some support until users migrate to Hadoop 3 or native Druid batch ingestion.

Any other ideas?

On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org> wrote:

Considering Druid takes advantage of lots of external components to work, I think we should upgrade Druid in a somewhat conservative way. Dropping support for hadoop2 is not a good idea. The upgrade of the ZooKeeper client in Druid is also preventing me from adopting 0.22 for some time.

Although users could upgrade these dependencies first to use the latest Druid releases, frankly speaking, these upgrades are not so easy in production and usually take a long time, which would prevent users from experiencing new features of Druid. For hadoop3, I have heard of some performance issues, which also leaves me without confidence to upgrade.
I think what Jihoon proposes is a good idea: separating hadoop2 from Druid core as an extension. Since hadoop2 has not reached EOL, to achieve a balance between compatibility and long-term evolution, maybe we could provide two extensions: one for hadoop2, one for hadoop3.

On Wed, Jun 9, 2021 at 4:13 AM, Will Lauer <wla...@verizonmedia.com.invalid> wrote:

Just to follow up on this: our main problem with hadoop3 right now has been instability in HDFS, to the extent that we have put on hold any plans to deploy it to our production systems. I would claim hadoop3 isn't mature enough yet to consider migrating Druid to it.

Will
On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wla...@verizonmedia.com> wrote:

Unfortunately, the migration to hadoop3 is a hard one (maybe not for Druid, but certainly for big organizations running large hadoop2 workloads). If Druid migrated to hadoop3 after 0.22, that would probably prevent me from taking any new versions of Druid for at least the remainder of the year, and possibly longer.
Will

On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org> wrote:

Hi all,

I've been assisting with some experiments to see how we might want to migrate Druid to support Hadoop 3.x and, more importantly, to see if maybe we can finally be free of some of the dependency issues it has been causing for as long as I can remember working with Druid.

Hadoop 3 introduced shaded client jars (https://issues.apache.org/jira/browse/HADOOP-11804), with the purpose of allowing applications to talk to the Hadoop cluster without drowning in its transitive dependencies. The experimental branch that I have been helping with, which uses these new shaded client jars, can be seen in this PR: https://github.com/apache/druid/pull/11314. It is currently working with the HDFS integration tests as well as the Hadoop tutorial flow in the Druid docs (which is pretty much equivalent to the HDFS integration test).

The cloud deep storages still need some further testing, and some minor cleanup still needs to be done for the docs and such.
Additionally, we still need to figure out how to handle the Kerberos extension: because it extends some Hadoop classes, it isn't able to use the shaded client jars in a straightforward manner, so it still has heavy dependencies and hasn't been tested. However, the experiment has started to pan out enough that I think it is worth starting this discussion, because it does have some implications.

Making this change, I think, will allow us to update our dependencies with a lot more freedom (I'm looking at you, Guava), but the catch is that once we make this change and start updating these dependencies, it will become hard, nearing impossible, to support Hadoop 2.x, since as far as I know there isn't an equivalent set of shaded client jars. I am also not certain how far back the Hadoop job classpath isolation stuff goes (mapreduce.job.classloader = true), which I think is required to be set on Druid tasks for this shaded stuff to work alongside updated Druid dependencies.

Is anyone opposed to or worried about dropping Hadoop 2.x support after the Druid 0.22 release?
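For reference, the shaded clients from HADOOP-11804 that Clint mentions are consumed roughly like this in a Maven build (the version shown is illustrative):

```xml
<!-- hadoop-client-api holds the public Hadoop classes; hadoop-client-runtime
     carries the relocated third-party dependencies needed at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.3.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.3.1</version>
  <scope>runtime</scope>
</dependency>
```

Because the transitive dependencies (Guava, protobuf, etc.) are relocated inside hadoop-client-runtime, they no longer collide with the versions the application itself wants to use, which is the freedom being discussed in this thread.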
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org