Re: [DISCUSS] Hadoop ingestion support

Maytas Monsereenusorn Thu, 12 Dec 2024 11:41:06 -0800

We at Netflix are in a similar situation to Target Corporation (see Lucas C's
email above).
We currently rely on Hadoop ingestion for all our batch ingestion jobs. The
main reason for this is that we already have a large Hadoop cluster
supporting our Spark workloads that we can leverage for Druid ingestion. I
imagine that the closest alternative for us would be moving to K8s /
MiddleManager-less ingestion jobs.


On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <
capistrant.lu...@gmail.com> wrote:

> Apologies for the empty email… fat fingers.
>
> Just wanted to say that we at Target Corporation (USA) still rely heavily
> on Hadoop ingest. We’d selfishly want support forever, but if forced to
> pivot to a new ingestion style for our larger batch ingest jobs that
> currently leverage the cheap compute on YARN, the longer the lead time
> between the community's announcement and the actual release with no
> support, the better. Making these types of changes can be a slow process
> for the slow-to-maneuver corporate cruise ship.
>
> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <
> capistrant.lu...@gmail.com>
> wrote:
>
> >
> >
> > On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org> wrote:
> >
> >> +1 for removal of Hadoop based ingestion. It's a maintenance overhead and
> >> stops us from moving to Java 17.
> >> I am not aware of any gaps in SQL based ingestion that prevent users from
> >> moving off Hadoop. If there are any, please feel free to reach out via
> >> Slack/GitHub.
> >>
> >> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:
> >>
> >> > Hey everyone,
> >> >
> >> > It is about that time again to take a pulse on how commonly Hadoop
> >> > based ingestion is used with Druid in order to determine if we should
> >> > keep supporting it or not going forward.
> >> >
> >> > In my view, Hadoop based ingestion has unofficially been on life
> >> > support for quite some time as we do not really go out of our way to
> >> > add new features to it, and we perform very minimal testing to ensure
> >> > everything keeps working. The most recent changes to it I am aware of
> >> > were to bump versions and require Hadoop 3, but that was primarily
> >> > motivated by selfish reasons of wanting to use its contained client
> >> > library and better isolation so that we could free up our own
> >> > dependencies to be updated. This thread is motivated by a similar
> >> > reason, I guess: see the other thread I started recently about
> >> > dropping support for Java 11. Hadoop does not yet support the Java 17
> >> > runtime, so the outcome of this discussion feeds into those
> >> > plans.
> >> >
> >> > I think SQL based ingestion with the multi-stage query engine is the
> >> > future of batch ingestion, and the Kubernetes based task runner
> >> > provides an alternative for task auto scaling capabilities. Because of
> >> > this, I don't personally see a lot of compelling reasons to keep
> >> > supporting Hadoop, so I would be in favor of just dropping support for
> >> > it completely, though I see no harm in keeping HDFS deep storage
> >> > around. In past discussions I think we had tied Hadoop removal to
> >> > adding something like Spark to replace it, but I wonder if this still
> >> > needs to be the case.
> >> >
> >> > I do know from previous dev list discussions about this topic that
> >> > there have classically been quite a lot of large Druid clusters in the
> >> > wild still relying on Hadoop, so I wanted to check to see if this is
> >> > still true and, if so, whether any of these clusters have plans to
> >> > transition to newer ways of ingesting data like SQL based ingestion.
> >> > While from a
> >> > dev/maintenance perspective it would be best to just drop it
> >> > completely, if there is still a large user base I think we need to be
> >> > open to keeping it around for a while longer. If we do need to keep
> >> > it, maybe it would be worth it to invest some time in moving it into a
> >> > contrib extension so that it isn't bundled by default with Druid
> >> > releases to discourage new adoption and more accurately reflect its
> >> > current status in Druid.
> >> >
> >>
> >
>
