Sharing another data point: we still use YARN to run Hadoop-based batch ingestion. It is very useful on-premises for resource sharing, where autoscaling isn't always an option. That said, we plan to move ingestion to Kubernetes sometime next year.

On Tue, Jun 17, 2025 at 12:20 PM Gian Merlino <g...@apache.org> wrote:

> I'm on board with this. I also think we should deprecate it ASAP, starting in the next major release. It'd be nice to also build a migration guide that helps people move from Hadoop ingestion to SQL/MSQ ingestion, and from YARN to K8S pod runners.
>
> Gian
>
> On 2025/06/09 20:10:03 Clint Wylie wrote:
>> Following up on this, I want to propose the first release of 2026 for removal, which I think would be Druid 36, to give some lead time for those affected to prepare.
>>
>> On Wed, Apr 9, 2025 at 8:42 AM Frank Chen <frankc...@apache.org> wrote:
>>>
>>> We don't use Hadoop ingestion, so it's OK for us to drop support for Hadoop.
>>>
>>> We can make an announcement to deprecate it first (from 33?), remove it from the official distribution (but keep the ability to build it, as suggested above; from 34?), and remove it completely at a proper time.
>>>
>>> On Wed, Apr 9, 2025 at 5:02 AM Maytas Monsereenusorn <mayt...@apache.org> wrote:
>>>
>>>> I'm in favor of removing too, but we should not rush the removal and should make sure we give enough time for users to migrate to other types of ingestion.
>>>> Similar to what Lucas said, if Hadoop is holding back Druid then we should remove it. Druid also supports many other types of ingestion compared to back when Hadoop ingestion was added.
>>>> For Netflix, we will be migrating to MM-less Druid ingestion in K8s. I think MM-less Druid ingestion in K8s is probably the closest to Hadoop ingestion, as we do not have to maintain a dedicated Druid-specific MM cluster (works well for companies with existing large/shared compute clusters). Personally, I feel we should focus our energy on things like MM-less Druid in K8s (which is still marked as Experimental) rather than Hadoop.
>>>>
>>>> Best Regards,
>>>> Maytas
>>>>
>>>> On Tue, Apr 8, 2025 at 4:06 AM Lucas Capistrant <capistrant.lu...@gmail.com> wrote:
>>>>
>>>>> Yes, I’m in favor of removing it from the core release and also in favor of officially announcing deprecation with a timeline for removal, if we have not yet. It stinks to lose the Hadoop ingest support, but if that project is going to hold back Druid, it seems we don’t have much choice.
>>>>>
>>>>> Thanks,
>>>>> Lucas
>>>>>
>>>>> On Tue, Apr 8, 2025 at 4:27 AM Karan Kumar <ka...@apache.org> wrote:
>>>>>
>>>>>> Like the plan of having a hadoop profile, not shipping it as part of the Apache release, and then we can eventually remove it in a release or two.
>>>>>> Does that work for you folks, Maytas, Lucas?
>>>>>>
>>>>>> On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> I was also bumping into this while I was running dependency-checks for Druid-33
>>>>>>> * I've encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded jar
>>>>>>> * we have a PR to upgrade to 3.4.0; so I also checked 3.4.1 - but they are also affected as they ship with jetty 9.4.53.v20231009 [2]
>>>>>>>
>>>>>>> ..so right now there is no normal way to solve this - the fact that it's a shaded jar further complicates things..
>>>>>>>
>>>>>>> Note: the trunk Hadoop uses jetty 9.4.57 [3] - which is good; so there will be some future version which might not be affected
>>>>>>> I wanted to be thorough and dug into a few things - to see how soon an updated version may come out:
>>>>>>> * there are 300+ tickets targeted for 3.5.0 .. so that doesn't look promising
>>>>>>> * but even for 3.4.2 there is a huge jira [4] with 159 subtasks, out of which 123 are unassigned...
>>>>>>>   if that's really needed for 3.4.2 then I doubt they'll be rolling out a release soon...
>>>>>>> * I was also peeking into jdk17 jiras, which will most likely arrive in 3.5.0 [5]
>>>>>>>
>>>>>>> Keeping Hadoop like this:
>>>>>>> * holds us back from upgrading 3rd party deps
>>>>>>> * forces us to add security suppressions
>>>>>>> * slows down newer jdk adoption - as officially hadoop only supports 11
>>>>>>>
>>>>>>> I think most of the companies using Hadoop are utilizing binaries which are being built from forks - and they also have the ability and bandwidth to fix these 3rd party libraries...
>>>>>>> I would also guess that they might also be using a custom built Druid - and as a result they have more control over what kind of features they have or not.
>>>>>>>
>>>>>>> So I was wondering about the following:
>>>>>>> * add a maven profile for hadoop support (defaults to off)
>>>>>>> * retain compatibility: during CI runs, build with jdk11 and run all hadoop tests
>>>>>>> * future releases (>=34) would ship w/o hadoop ingestion
>>>>>>> * companies using hadoop-ingestion could turn on the profile and use it
>>>>>>>
>>>>>>> What do you guys think?
>>>>>>>
>>>>>>> cheers,
>>>>>>> Zoltan
>>>>>>>
>>>>>>> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
>>>>>>> [2] https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
>>>>>>> [3] https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
>>>>>>> [4] https://issues.apache.org/jira/browse/HADOOP-19353
>>>>>>> [5] https://issues.apache.org/jira/browse/HADOOP-17177
>>>>>>>
>>>>>>> On 1/8/25 11:56, Abhishek Agarwal wrote:
>>>>>>>> @Adarsh - FYI since you are the release manager for 32.
>>>>>>>>
>>>>>>>> On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> I don't want to kick that can too far down the road either :) We don't want to give false hope that it's going to remain around forever. But yes, let's deprecate both Hadoop and Java 11 support in the upcoming 32 release. It's unfortunate that Hadoop still doesn't support Java 17. We shouldn't let it hold us back. Jetty and pac4j are dropping Java 11 support and we would want to upgrade to newer versions of these dependencies soon. There are also nice language features in Java 17 such as pattern matching, multiline strings, and a lot more that we can't use if we have to be compile compatible with Java 11. If you need the resource elasticity that Hadoop provides or want to reuse shared infrastructure in the company, MM-less ingestion is a good alternative.
>>>>>>>>>
>>>>>>>>> So let's deprecate it in 32. We can decide on removal later, but hopefully it doesn't take too many releases to do that.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Okay, from what I can gather a few folks still need hadoop ingestion. So let's kick the can down the road regarding removal of that support, but let's agree on the deprecation plan. Since druid 32 is around the corner, let's at least deprecate hadoop ingestion so that any new users are not onboarded to this way of ingestion. Deprecation also becomes a forcing function in internal company channels for prioritization of getting off hadoop.
>>>>>>>>>>
>>>>>>>>>> How does this plan look?
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn <mayt...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> We at Netflix are in a similar situation to Target Corporation (Lucas C's email above).
>>>>>>>>>>> We currently rely on Hadoop ingestion for all our batch ingestion jobs. The main reason for this is that we already have a large Hadoop cluster supporting our Spark workloads that we can leverage for Druid ingestion. I imagine that the closest alternative for us would be moving to K8s / MiddleManager-less ingestion jobs.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <capistrant.lu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Apologies for the empty email… fat fingers.
>>>>>>>>>>>>
>>>>>>>>>>>> Just wanted to say that we at Target Corporation (USA) still rely heavily on Hadoop ingest. We’d selfishly want support forever, but if forced to pivot to a new ingestion style for our larger batch ingest jobs that currently leverage the cheap compute on YARN, the longer the lead time between announcement by the community and the actual release with no support, the better. Making these types of changes can be a slow process for the slow-to-maneuver corporate cruise ship.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <capistrant.lu...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1 for removal of Hadoop based ingestion. It's a maintenance overhead and stops us from moving to java 17.
>>>>>>>>>>>>>> I am not aware of any gaps in sql based ingestion that keep users from moving off hadoop. If there are any, please feel free to reach out via slack/github.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is about that time again to take a pulse on how commonly Hadoop based ingestion is used with Druid in order to determine if we should keep supporting it or not going forward.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In my view, Hadoop based ingestion has unofficially been on life support for quite some time as we do not really go out of our way to add new features to it, and we perform very minimal testing to ensure everything keeps working. The most recent change to it I am aware of was to bump versions and require Hadoop 3, but that was primarily motivated by selfish reasons of wanting to use its contained client library and better isolation so that we could free up our own dependencies to be updated. This thread is motivated by a similar reason I guess; see the other thread I started recently discussing dropping support for Java 11, where Hadoop does not yet support the Java 17 runtime, and so the outcome of this discussion is involved in those plans.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think SQL based ingestion with the multi-stage query engine is the future of batch ingestion, and the Kubernetes based task runner provides an alternative for task auto scaling capabilities. Because of this, I don't personally see a lot of compelling reasons to keep supporting Hadoop, so I would be in favor of just dropping support for it completely, though I see no harm in keeping HDFS deep storage around. In past discussions I think we had tied Hadoop removal to adding something like Spark to replace it, but I wonder if this still needs to be the case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do know that classically there have been quite a lot of large Druid clusters in the wild still relying on Hadoop in previous dev list discussions about this topic, so I wanted to check to see if this is still true and, if so, if any of these clusters have plans to transition to newer ways of ingesting data like SQL based ingestion. While from a dev/maintenance perspective it would be best to just drop it completely, if there is still a large user base I think we need to be open to keeping it around for a while longer. If we do need to keep it, maybe it would be worth it to invest some time in moving it into a contrib extension so that it isn't bundled by default with Druid releases, to discourage new adoption and more accurately reflect its current status in Druid.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
>>>>>>>>>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org

--
Best regards,
Eyal Yurman