Re: [DISCUSS] Hadoop ingestion support

Eyal Yurman Tue, 17 Jun 2025 15:28:25 -0700

Sharing as another data point -

We still use YARN to run Hadoop-based batch ingestion. Very useful
on-premise for resource sharing, where autoscaling isn't always an option.
But we plan to move to Kubernetes for ingestion sometime next year.



On Tue, Jun 17, 2025 at 12:20 PM Gian Merlino <[email protected]> wrote:

> I'm on board with this. I also think we should deprecate it ASAP, starting
> in the next major release. It'd be nice to also build a migration guide
> that helps people move from Hadoop ingestion to SQL/MSQ ingestion, and from
> YARN to K8S pod runners.
>
> Gian
>
> On 2025/06/09 20:10:03 Clint Wylie wrote:
> > Following up on this, I want to propose the first release of 2026 for
> > removal, which I think would be Druid 36, to give some lead time for
> > those affected to prepare.
> >
> > On Wed, Apr 9, 2025 at 8:42 AM Frank Chen <[email protected]> wrote:
> > >
> > > > We don't use Hadoop ingestion, so it's OK for us to drop support for
> > > > Hadoop.
> > > >
> > > > We can first announce the deprecation (from 33?), then remove it from
> > > > the official distribution (but keep the ability to build it, as
> > > > suggested above; from 34?), and remove it completely at a proper time.
> > >
> > >
> > >
> > >
> > > On Wed, Apr 9, 2025 at 5:02 AM Maytas Monsereenusorn <
> [email protected]>
> > > wrote:
> > >
> > > > > I'm in favor of removing it too, but we should not rush the removal
> > > > > and should make sure we give users enough time to migrate to other
> > > > > types of ingestion. Similar to what Lucas said, if Hadoop is holding
> > > > > back Druid then we should remove it. Druid also supports many more
> > > > > types of ingestion than it did back when Hadoop ingestion was added.
> > > > > For Netflix, we will be migrating to MM-less Druid ingestion in K8s.
> > > > > I think MM-less Druid ingestion in K8s is probably the closest to
> > > > > Hadoop ingestion, as we do not have to maintain a dedicated
> > > > > Druid-specific MM cluster (this works well for companies with
> > > > > existing large/shared compute clusters). Personally, I feel we should
> > > > > focus our energy on things like MM-less Druid in K8s (which is still
> > > > > marked as Experimental) rather than Hadoop.
> > > >
> > > > Best Regards,
> > > > Maytas
> > > >
> > > > On Tue, Apr 8, 2025 at 4:06 AM Lucas Capistrant <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > > Yes, I’m in favor of removing it from the core release and also in
> > > > > > favor of officially announcing deprecation with a timeline for
> > > > > > removal, if we have not yet. It stinks to lose the Hadoop ingest
> > > > > > support, but if that project is going to hold back Druid, it seems
> > > > > > we don’t have much choice.
> > > > >
> > > > > Thanks,
> > > > > Lucas
> > > > >
> > > > > On Tue, Apr 8, 2025 at 4:27 AM Karan Kumar <[email protected]>
> wrote:
> > > > >
> > > > > >
> > > > > > > I like the plan of having a Hadoop profile and not shipping it as
> > > > > > > part of the Apache release; then we can eventually remove it in a
> > > > > > > release or two. Does that work for you folks, Maytas, Lucas?
> > > > > >
> > > > > > On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <[email protected]>
> wrote:
> > > > > >
> > > > > >> Hey,
> > > > > >>
> > > > > > >> I also bumped into this while running dependency checks for
> > > > > > >> Druid 33:
> > > > > > >> * I encountered a CVE [1] in hadoop-runtime-3.3.6, which is a
> > > > > > >>   shaded jar
> > > > > > >> * we have a PR to upgrade to 3.4.0, so I also checked 3.4.1, but
> > > > > > >>   they are affected too, as they ship with jetty
> > > > > > >>   9.4.53.v20231009 [2]
> > > > > >>
> > > > > > >> So right now there is no normal way to solve this, and the fact
> > > > > > >> that it's a shaded jar further complicates things.
> > > > > >>
> > > > > > >> Note: Hadoop trunk uses jetty 9.4.57 [3], which is good, so
> > > > > > >> some future version might not be affected.
> > > > > > >> I wanted to be thorough and dug into a few things to see how
> > > > > > >> soon an updated version may come out:
> > > > > > >> * there are 300+ tickets targeted for 3.5.0, so that doesn't
> > > > > > >>   look promising
> > > > > > >> * but even for 3.4.2 there is a huge jira [4] with 159 subtasks,
> > > > > > >>   of which 123 are unassigned; if that's really needed for
> > > > > > >>   3.4.2 then I doubt they'll be rolling out a release soon
> > > > > > >> * I was also peeking into the jdk17 jiras, which will most
> > > > > > >>   likely arrive in 3.5.0 [5]
> > > > > >>
> > > > > > >> Keeping Hadoop like this:
> > > > > > >> * holds us back from upgrading 3rd party deps
> > > > > > >> * forces us to add security suppressions
> > > > > > >> * slows down newer jdk adoption, as hadoop officially only
> > > > > > >>   supports 11
> > > > > > >> I think most of the companies using Hadoop are using binaries
> > > > > > >> built from forks, and they also have the ability and bandwidth
> > > > > > >> to fix these 3rd party libraries.
> > > > > > >> I would also guess that they might be using a custom-built
> > > > > > >> Druid as well, and as a result have more control over which
> > > > > > >> features they include.
> > > > > >>
> > > > > >> So I was wondering about the following:
> > > > > >> * add a maven profile for hadoop support (defaults to off)
> > > > > > >> * retain compatibility: during CI runs, build with jdk11 and run
> > > > > > >>   all hadoop tests
> > > > > > >> * future releases (>=34) would ship w/o hadoop ingestion
> > > > > > >> * companies using hadoop-ingestion could turn on the profile
> > > > > > >>   and use it
> > > > > >> What do you guys think?
> > > > > >>
> > > > > >> cheers,
> > > > > >> Zoltan
> > > > > >>
> > > > > >>
> > > > > >> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> > > > > >> [2]
> > > > > >>
> > > > >
> > > >
> https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> > > > > >> [3]
> > > > > >>
> > > > >
> > > >
> https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> > > > > >> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> > > > > >> [5] https://issues.apache.org/jira/browse/HADOOP-17177
> > > > > >>
> > > > > >>
> > > > > >> On 1/8/25 11:56, Abhishek Agarwal wrote:
> > > > > >> > @Adarsh - FYI since you are the release manager for 32.
> > > > > >> >
> > > > > >> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <
> > > > [email protected]
> > > > > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > > >> >> I don't want to kick that can too far down the road either :)
> > > > > > >> >> We don't want to give a false hope that it's going to remain
> > > > > > >> >> around forever. But yes let's deprecate both Hadoop and Java 11
> > > > > > >> >> support in the upcoming 32 release. It's unfortunate that
> > > > > > >> >> Hadoop still doesn't support Java 17. We shouldn't let it hold
> > > > > > >> >> us back. Jetty, pac4j are dropping Java 11 support and we would
> > > > > > >> >> want to upgrade to newer versions of these dependencies soon.
> > > > > > >> >> There are also nice language features in Java 17 such as
> > > > > > >> >> pattern matching, multiline strings, and a lot more that we
> > > > > > >> >> can't use if we have to be compile compatible with Java 11. If
> > > > > > >> >> you need the resource elasticity that Hadoop provides or want
> > > > > > >> >> to reuse shared infrastructure in the company, MM-less
> > > > > > >> >> ingestion is a good alternative.
> > > > > > >> >>
> > > > > > >> >> So let's deprecate it in 32. We can decide on removal later but
> > > > > > >> >> hopefully, it doesn't take too many releases to do that.
> > > > > >> >>
> > > > > >> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <[email protected]
> >
> > > > wrote:
> > > > > >> >>
> > > > > > >> >>> Okay, from what I can gather a few folks still need Hadoop
> > > > > > >> >>> ingestion. So let's kick the can down the road regarding
> > > > > > >> >>> removal of that support, but let's agree on the deprecation
> > > > > > >> >>> plan. Since Druid 32 is around the corner, let's at least
> > > > > > >> >>> deprecate Hadoop ingestion so that new users are not
> > > > > > >> >>> onboarded to this way of ingestion. Deprecation also becomes
> > > > > > >> >>> a forcing function in internal company channels for
> > > > > > >> >>> prioritizing getting off Hadoop.
> > > > > >> >>>
> > > > > >> >>> How does this plan look?
> > > > > >> >>>
> > > > > >> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn <
> > > > > >> [email protected]
> > > > > >> >>>>
> > > > > >> >>> wrote:
> > > > > >> >>>
> > > > > > >> >>>> We at Netflix are in a similar situation to Target
> > > > > > >> >>>> Corporation (see Lucas C's email above).
> > > > > > >> >>>> We currently rely on Hadoop ingestion for all our batch
> > > > > > >> >>>> ingestion jobs. The main reason for this is that we already
> > > > > > >> >>>> have a large Hadoop cluster supporting our Spark workloads
> > > > > > >> >>>> that we can leverage for Druid ingestion. I imagine that the
> > > > > > >> >>>> closest alternative for us would be moving to K8s /
> > > > > > >> >>>> MiddleManager-less ingestion jobs.
> > > > > >> >>>>
> > > > > >> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <
> > > > > >> >>>> [email protected]> wrote:
> > > > > >> >>>>
> > > > > >> >>>>> Apologies for the empty email… fat fingers.
> > > > > >> >>>>>
> > > > > > >> >>>>> Just wanted to say that we at Target Corporation (USA) still
> > > > > > >> >>>>> rely heavily on Hadoop ingest. We’d selfishly want support
> > > > > > >> >>>>> forever, but if forced to pivot to a new ingestion style for
> > > > > > >> >>>>> our larger batch ingest jobs that currently leverage the
> > > > > > >> >>>>> cheap compute on YARN, the longer the lead time between the
> > > > > > >> >>>>> announcement by the community and the actual release with no
> > > > > > >> >>>>> support, the better. Making these types of changes can be a
> > > > > > >> >>>>> slow process for the slow-to-maneuver corporate cruise ship.
> > > > > >> >>>>>
> > > > > >> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <
> > > > > >> >>>>> [email protected]>
> > > > > >> >>>>> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <
> [email protected]>
> > > > > >> >>> wrote:
> > > > > >> >>>>>>
> > > > > > >> >>>>>>> +1 for removal of Hadoop based ingestion. It's a maintenance
> > > > > > >> >>>>>>> overhead and stops us from moving to Java 17.
> > > > > > >> >>>>>>> I am not aware of any gaps in SQL based ingestion that
> > > > > > >> >>>>>>> prevent users from moving off Hadoop. If there are any,
> > > > > > >> >>>>>>> please feel free to reach out via slack/github.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <
> > > > [email protected]>
> > > > > >> >>>> wrote:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>> Hey everyone,
> > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> It is about that time again to take a pulse on how commonly
> > > > > > >> >>>>>>>> Hadoop based ingestion is used with Druid in order to
> > > > > > >> >>>>>>>> determine if we should keep supporting it or not going
> > > > > > >> >>>>>>>> forward.
> > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> In my view, Hadoop based ingestion has unofficially been on
> > > > > > >> >>>>>>>> life support for quite some time as we do not really go out
> > > > > > >> >>>>>>>> of our way to add new features to it, and we perform very
> > > > > > >> >>>>>>>> minimal testing to ensure everything keeps working. The most
> > > > > > >> >>>>>>>> recent changes to it that I am aware of were to bump
> > > > > > >> >>>>>>>> versions and require Hadoop 3, but that was primarily
> > > > > > >> >>>>>>>> motivated by selfish reasons of wanting to use its contained
> > > > > > >> >>>>>>>> client library and better isolation so that we could free up
> > > > > > >> >>>>>>>> our own dependencies to be updated. This thread is motivated
> > > > > > >> >>>>>>>> by a similar reason I guess; see the other thread I started
> > > > > > >> >>>>>>>> recently discussing dropping support for Java 11. Hadoop
> > > > > > >> >>>>>>>> does not yet support a Java 17 runtime, so the outcome of
> > > > > > >> >>>>>>>> this discussion is involved in those plans.
> > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> I think SQL based ingestion with the multi-stage query
> > > > > > >> >>>>>>>> engine is the future of batch ingestion, and the Kubernetes
> > > > > > >> >>>>>>>> based task runner provides an alternative for task auto
> > > > > > >> >>>>>>>> scaling capabilities. Because of this, I don't personally
> > > > > > >> >>>>>>>> see a lot of compelling reasons to keep supporting Hadoop,
> > > > > > >> >>>>>>>> so I would be in favor of just dropping support for it
> > > > > > >> >>>>>>>> completely, though I see no harm in keeping HDFS deep
> > > > > > >> >>>>>>>> storage around. In past discussions I think we had tied
> > > > > > >> >>>>>>>> Hadoop removal to adding something like Spark to replace it,
> > > > > > >> >>>>>>>> but I wonder if this still needs to be the case.
> > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> I do know from previous dev list discussions about this
> > > > > > >> >>>>>>>> topic that there have classically been quite a lot of large
> > > > > > >> >>>>>>>> Druid clusters in the wild still relying on Hadoop, so I
> > > > > > >> >>>>>>>> wanted to check to see if this is still true and, if so,
> > > > > > >> >>>>>>>> whether any of these clusters have plans to transition to
> > > > > > >> >>>>>>>> newer ways of ingesting data like SQL based ingestion. While
> > > > > > >> >>>>>>>> from a dev/maintenance perspective it would be best to just
> > > > > > >> >>>>>>>> drop it completely, if there is still a large user base I
> > > > > > >> >>>>>>>> think we need to be open to keeping it around for a while
> > > > > > >> >>>>>>>> longer. If we do need to keep it, maybe it would be worth it
> > > > > > >> >>>>>>>> to invest some time in moving it into a contrib extension so
> > > > > > >> >>>>>>>> that it isn't bundled by default with Druid releases, to
> > > > > > >> >>>>>>>> discourage new adoption and more accurately reflect its
> > > > > > >> >>>>>>>> current status in Druid.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>
> > > > >
> ---------------------------------------------------------------------
> > > > > >> >>>>>>>> To unsubscribe, e-mail:
> [email protected]
> > > > > >> >>>>>>>> For additional commands, e-mail:
> [email protected]
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > >
> > > >

-- 

Best regards,
Eyal Yurman
