Re: [DISCUSS] Hadoop ingestion support

Gian Merlino Tue, 17 Jun 2025 12:20:30 -0700

I'm on board with this. I also think we should deprecate it ASAP, starting in 
the next major release. It'd be nice to also build a migration guide that helps 
people move from Hadoop ingestion to SQL/MSQ ingestion, and from YARN to K8S 
pod runners.


Gian

On 2025/06/09 20:10:03 Clint Wylie wrote:
> Following up on this, I want to propose the first release of 2026 for
> removal, which I think would be Druid 36, to give some lead time for
> those affected to prepare.
> 
> On Wed, Apr 9, 2025 at 8:42 AM Frank Chen <frankc...@apache.org> wrote:
> >
> > We don't use Hadoop ingestion, so it's OK for us to drop Hadoop support.
> >
> > We can announce the deprecation first (from 33?), remove it from the
> > official distribution while keeping the ability to build it as suggested
> > above (from 34?), and remove it completely at a proper time.
> >
> >
> >
> >
> > On Wed, Apr 9, 2025 at 5:02 AM Maytas Monsereenusorn <mayt...@apache.org>
> > wrote:
> >
> > > I'm in favor of removing too but we should not rush the removal and make
> > > sure we give enough time for users to migrate to other types of ingestion.
> > > Similar to what Lucas said, if Hadoop is holding back Druid then we should
> > > remove it. Druid also supports many other types of ingestion compared to
> > > back when Hadoop ingestion was added.
> > > For Netflix, we will be migrating to MM-less Druid ingestion in K8s. I
> > > think MM-less Druid ingestion in K8s is probably the closest to Hadoop
> > > ingestion as we do not have to maintain a dedicated Druid specific MM
> > > cluster (works well for companies with existing large/shared Compute
> > > clusters). Personally, I feel we should focus our energy on things
> > > like MM-less Druid in K8s (which is still marked as Experimental) rather
> > > than Hadoop.
> > >
> > > Best Regards,
> > > Maytas
> > >
> > > On Tue, Apr 8, 2025 at 4:06 AM Lucas Capistrant <
> > > capistrant.lu...@gmail.com>
> > > wrote:
> > >
> > > > Yes, I’m in favor of removing it from the core release and also in favor of
> > > > officially announcing deprecation with a timeline for removal, if we have
> > > > not yet. It stinks to lose the Hadoop ingest support, but if that project
> > > > is going to hold back Druid, it seems we don’t have much choice.
> > > >
> > > > Thanks,
> > > > Lucas
> > > >
> > > > On Tue, Apr 8, 2025 at 4:27 AM Karan Kumar <ka...@apache.org> wrote:
> > > >
> > > > >
> > > > > I like the plan of having a Hadoop profile and not shipping it as part
> > > > > of the Apache release; we can then eventually remove it in a release or
> > > > > two. Does that work for you folks, Maytas and Lucas?
> > > > >
> > > > > On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> > > > >
> > > > >> Hey,
> > > > >>
> > > > > >> I also bumped into this while running dependency checks for Druid 33:
> > > > > >> * I encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded jar
> > > > > >> * we have a PR to upgrade to 3.4.0, so I also checked 3.4.1, but both
> > > > > >> are affected as well because they ship with jetty 9.4.53.v20231009 [2]
> > > > > >>
> > > > > >> ...so right now there is no normal way to solve this, and the fact that
> > > > > >> it's a shaded jar further complicates things.
> > > > > >>
> > > > > >> Note: the trunk Hadoop uses jetty 9.4.57 [3], which is good, so there
> > > > > >> will be some future version which might not be affected.
> > > > > >> I wanted to be thorough and dug into a few things to see how soon an
> > > > > >> updated version may come out:
> > > > > >> * there are 300+ tickets targeted for 3.5.0, so that doesn't look
> > > > > >> promising
> > > > > >> * but even for 3.4.2 there is a huge jira [4] with 159 subtasks, out of
> > > > > >> which 123 are unassigned...
> > > > > >>    if that's really needed for 3.4.2 then I doubt they'll be rolling out
> > > > > >> a release soon...
> > > > > >> * I was also peeking into the jdk17 jiras, which will most likely arrive
> > > > > >> in 3.5.0 [5]
> > > > >>
> > > > > >> Keeping Hadoop like this:
> > > > > >> * holds us back from upgrading 3rd party deps
> > > > > >> * forces us to add security suppressions
> > > > > >> * slows down newer jdk adoption, as officially hadoop only supports 11
> > > > >>
> > > > > >> I think most of the companies using Hadoop are utilizing binaries which
> > > > > >> are built from forks, and they also have the ability and bandwidth to
> > > > > >> fix these 3rd party libraries...
> > > > > >> I would also guess that they might be using a custom built Druid, and
> > > > > >> as a result they have more control over which features they have or
> > > > > >> not.
> > > > >>
> > > > >> So I was wondering about the following:
> > > > >> * add a maven profile for hadoop support (defaults to off)
> > > > > >> * retain compatibility: during CI runs, build with jdk11 and run all
> > > > > >> hadoop tests
> > > > > >> * future releases (>=34) would ship w/o hadoop ingestion
> > > > > >> * companies using hadoop-ingestion could turn on the profile and use it
> > > > >>
> > > > >> What do you guys think?
> > > > >>
> > > > >> cheers,
> > > > >> Zoltan
> > > > >>
> > > > >>
> > > > >> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> > > > >> [2]
> > > > >>
> > > >
> > > https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> > > > >> [3]
> > > > >>
> > > >
> > > https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> > > > >> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> > > > >> [5] https://issues.apache.org/jira/browse/HADOOP-17177
> > > > >>
> > > > >>
> > > > >> On 1/8/25 11:56, Abhishek Agarwal wrote:
> > > > >> > @Adarsh - FYI since you are the release manager for 32.
> > > > >> >
> > > > > >> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org>
> > > > > >> > wrote:
> > > > >> >
> > > > > >> >> I don't want to kick that can too far down the road either :) We don't
> > > > > >> >> want to give false hope that it's going to remain around forever. But yes,
> > > > > >> >> let's deprecate both Hadoop and Java 11 support in the upcoming 32 release.
> > > > > >> >> It's unfortunate that Hadoop still doesn't support Java 17. We shouldn't
> > > > > >> >> let it hold us back. Jetty and pac4j are dropping Java 11 support and we
> > > > > >> >> would want to upgrade to newer versions of these dependencies soon. There
> > > > > >> >> are also nice language features in Java 17 such as pattern matching,
> > > > > >> >> multiline strings, and a lot more that we can't use if we have to stay
> > > > > >> >> compile-compatible with Java 11. If you need the resource elasticity that
> > > > > >> >> Hadoop provides or want to reuse shared infrastructure in the company,
> > > > > >> >> MM-less ingestion is a good alternative.
> > > > > >> >>
> > > > > >> >> So let's deprecate it in 32. We can decide on removal later, but
> > > > > >> >> hopefully it doesn't take too many releases to do that.
> > > > >> >>
> > > > > >> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
> > > > >> >>
> > > > > >> >>> Okay, from what I can gather a few folks still need Hadoop ingestion. So
> > > > > >> >>> let's kick the can down the road regarding removal of that support, but
> > > > > >> >>> let's agree on the deprecation plan. Since Druid 32 is around the corner,
> > > > > >> >>> let's at least deprecate Hadoop ingestion so that no new users are
> > > > > >> >>> onboarded to this way of ingestion. Deprecation also becomes a forcing
> > > > > >> >>> function in internal company channels for prioritizing getting off
> > > > > >> >>> Hadoop.
> > > > > >> >>>
> > > > > >> >>> How does this plan look?
> > > > >> >>>
> > > > > >> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn
> > > > > >> >>> <mayt...@apache.org> wrote:
> > > > >> >>>
> > > > > >> >>>> We at Netflix are in a similar situation to Target Corporation (Lucas
> > > > > >> >>>> C's email above).
> > > > > >> >>>> We currently rely on Hadoop ingestion for all our batch ingestion jobs.
> > > > > >> >>>> The main reason for this is that we already have a large Hadoop cluster
> > > > > >> >>>> supporting our Spark workloads that we can leverage for Druid ingestion.
> > > > > >> >>>> I imagine that the closest alternative for us would be moving to K8s /
> > > > > >> >>>> MiddleManager-less ingestion jobs.
> > > > >> >>>>
> > > > >> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <
> > > > >> >>>> capistrant.lu...@gmail.com> wrote:
> > > > >> >>>>
> > > > >> >>>>> Apologies for the empty email… fat fingers.
> > > > >> >>>>>
> > > > > >> >>>>> Just wanted to say that we at Target Corporation (USA) still rely
> > > > > >> >>>>> heavily on Hadoop ingest. We’d selfishly want support forever, but if
> > > > > >> >>>>> forced to pivot to a new ingestion style for our larger batch ingest
> > > > > >> >>>>> jobs that currently leverage the cheap compute on YARN, the longer the
> > > > > >> >>>>> lead time between the community's announcement and the actual release
> > > > > >> >>>>> with no support, the better. Making these types of changes can be a
> > > > > >> >>>>> slow process for the slow-to-maneuver corporate cruise ship.
> > > > >> >>>>>
> > > > >> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <
> > > > >> >>>>> capistrant.lu...@gmail.com>
> > > > >> >>>>> wrote:
> > > > >> >>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org>
> > > > >> >>> wrote:
> > > > >> >>>>>>
> > > > > >> >>>>>>> +1 for removal of Hadoop based ingestion. It's a maintenance overhead
> > > > > >> >>>>>>> and stops us from moving to Java 17.
> > > > > >> >>>>>>> I am not aware of any gaps in SQL based ingestion that prevent users
> > > > > >> >>>>>>> from moving off Hadoop. If there are any, please feel free to reach
> > > > > >> >>>>>>> out via slack/github.
> > > > >> >>>>>>>
> > > > > >> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>>> Hey everyone,
> > > > >> >>>>>>>>
> > > > > >> >>>>>>>> It is about that time again to take a pulse on how commonly Hadoop
> > > > > >> >>>>>>>> based ingestion is used with Druid in order to determine if we should
> > > > > >> >>>>>>>> keep supporting it or not going forward.
> > > > >> >>>>>>>>
> > > > > >> >>>>>>>> In my view, Hadoop based ingestion has unofficially been on life
> > > > > >> >>>>>>>> support for quite some time as we do not really go out of our way to
> > > > > >> >>>>>>>> add new features to it, and we perform very minimal testing to ensure
> > > > > >> >>>>>>>> everything keeps working. The most recent change to it I am aware of
> > > > > >> >>>>>>>> was to bump versions and require Hadoop 3, but that was primarily
> > > > > >> >>>>>>>> motivated by selfish reasons of wanting to use its contained client
> > > > > >> >>>>>>>> library and better isolation so that we could free up our own
> > > > > >> >>>>>>>> dependencies to be updated. This thread is motivated by a similar
> > > > > >> >>>>>>>> reason I guess; see the other thread I started recently discussing
> > > > > >> >>>>>>>> dropping support for Java 11. Hadoop does not yet support the Java 17
> > > > > >> >>>>>>>> runtime, and so the outcome of this discussion is involved in those
> > > > > >> >>>>>>>> plans.
> > > > >> >>>>>>>>
> > > > > >> >>>>>>>> I think SQL based ingestion with the multi-stage query engine is the
> > > > > >> >>>>>>>> future of batch ingestion, and the Kubernetes based task runner
> > > > > >> >>>>>>>> provides an alternative for task auto scaling capabilities. Because of
> > > > > >> >>>>>>>> this, I don't personally see a lot of compelling reasons to keep
> > > > > >> >>>>>>>> supporting Hadoop, so I would be in favor of just dropping support for
> > > > > >> >>>>>>>> it completely, though I see no harm in keeping HDFS deep storage
> > > > > >> >>>>>>>> around. In past discussions I think we had tied Hadoop removal to
> > > > > >> >>>>>>>> adding something like Spark to replace it, but I wonder if this still
> > > > > >> >>>>>>>> needs to be the case.
> > > > >> >>>>>>>>
> > > > > >> >>>>>>>> I do know from previous dev list discussions about this topic that
> > > > > >> >>>>>>>> there have classically been quite a lot of large Druid clusters in the
> > > > > >> >>>>>>>> wild still relying on Hadoop, so I wanted to check to see if this is
> > > > > >> >>>>>>>> still true and, if so, whether any of these clusters have plans to
> > > > > >> >>>>>>>> transition to newer ways of ingesting data like SQL based ingestion.
> > > > > >> >>>>>>>> While from a dev/maintenance perspective it would be best to just drop
> > > > > >> >>>>>>>> it completely, if there is still a large user base I think we need to
> > > > > >> >>>>>>>> be open to keeping it around for a while longer. If we do need to keep
> > > > > >> >>>>>>>> it, maybe it would be worth it to invest some time in moving it into a
> > > > > >> >>>>>>>> contrib extension so that it isn't bundled by default with Druid
> > > > > >> >>>>>>>> releases, to discourage new adoption and more accurately reflect its
> > > > > >> >>>>>>>> current status in Druid.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>
> > > > ---------------------------------------------------------------------
> > > > >> >>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > > >> >>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >>
> > > >
> > >
