Re: KIP-1182 Quality of Service (QoS) for Apache Kafka

Peter Corless Tue, 13 May 2025 10:51:55 -0700

Responses in-line below:

On Tue, May 13, 2025 at 8:26 AM Almog Gavra <almog.ga...@gmail.com> wrote:

> Thanks for the KIP Peter! Curious to see where this one goes, I think it's
> good to start a discussion around this though perhaps we'll need to split
> it up into more focused improvements as there's a lot bundled in this one
> idea!
>
> A0. I'd like to see some folk that are more familiar with the broker
> implementation to chime in around the feasibility of implementing some of
> this. AFAIK, there's no capabilities that allow (for example) shifting
> resources between topics. Isolating that from a resource allocation
> perspective may be a huge lift, though certainly a valuable one.
>

Correct. While I think the Confluent people are likely in the lead in this
regard, even for them there are limits:

> *CFK does not support an automated change to storage classes on an existing
> deployment. To make changes to a storage class on an existing deployment,
> such as listed below, contact Confluent Support:*
>
>    - *Migrating from one storage class to another.*
>    - *Changing the storage class, for example, enabling encryption on the
>    persistent volumes.*
>
> Docs here <https://docs.confluent.io/operator/current/co-storage.html>

For today, I am presuming what most people will do is have entirely
different clusters for different tiers of service. ("Do you want to go with
an S3 slow lane, or an NVMe-backed fast lane?" And the producer decides
which to shunt their traffic to.) But the proposal should not *preclude*
the concepts of dynamic provisioning on the same cluster.

Yet if the work and thought this requires means forking resource allocation
off into a different KIP (or series of KIPs), I would not be averse.

> A1. With A0 in mind, I'm wondering what the benefit for making the QoS spec
> an open standard - it depends heavily both on the broker implementation and
> on how it's deployed (containerized? bare metal? k8s?). That makes what we
> can practically offer bundled with the default implementation limited.
> OTOH, I'm not sure whether users benefit from "open standards, free of
> vendor bias as much as possible" If the specification is customizable
> enough to allow for vendor specific extensions.
>

Good point. How this gets implemented on bare metal, or with containers or
k8s, is a vital consideration. The design should not *preclude* any of
those. Yet more advanced infrastructure like k8s will invariably enable
more advanced capabilities, such as dynamic provisioning.

To me, it *has* to be an open standard if we want to see something take off
today, now-ish, and get implemented before the end of time. Even if a major
vendor or two weighs in with their take, I would hope that the overall OSS
Apache Kafka community has some strong opinions on how to ensure it's the
best overall implementation to meet the widest and most pressing of
community needs, and to prevent it from being a walled garden of
proprietary implementation.

Yet I don't want to hold back vendors if they want to add some special
sauce because their implementation of Kafka-compatible services is capable
of something that vanilla Apache Kafka just can't (or won't) do.

> A2. More a technical note, but the dynamic negotiation between producer and
> consumer seems to break a key abstraction of Kafka which is decoupling
> producers from consumers. That might work well if you have one consumer,
> but if you have multiple I imagine you wouldn't want one lagging to cause
> the producer to back up.
>

That's the thing. Right now, from what I understand [please correct me if I
am mistaken], they are basically unaware of each other. The producer just
pumps out data. The consumer just reads the topic as fast as it can
manage. While consumer groups can scale dynamically, you still might have
caps on spending, time-to-scale issues (provisioning and rebalancing, which
can take anywhere from seconds to a minute or more), or other resource
constraints (disk I/O, etc.).

Today, you just get impedance mismatches between producers and consumer
groups.
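
To put rough numbers on that mismatch, here is a back-of-the-envelope sketch
in Python. The function name and the rates are illustrative assumptions, not
measurements and not anything from Kafka's APIs; it only shows that, at
constant rates, the unread backlog grows linearly with the gap between the
produce rate and the consume rate.

```python
def backlog_after(produce_rate: float, consume_rate: float, seconds: float) -> float:
    """Messages left unread after `seconds`, assuming constant rates and an
    empty backlog at the start (an illustrative model only)."""
    gap = max(produce_rate - consume_rate, 0.0)  # a faster consumer has no backlog
    return gap * seconds


# Hypothetical example: producer at 50,000 msg/s, consumer group at 40,000 msg/s.
print(backlog_after(50_000, 40_000, 60))  # backlog after one minute
```

Under those assumed rates the group falls behind by 10,000 messages every
second, so scaling that takes a minute to kick in starts 600,000 messages
in the hole.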

In a QoS-enabled future world, there could be new ways to deal with it.

The consumer group could say, "Wait! You're doing 1,000,000 objects/second!
I can't keep up! Please, can I get it downsampled to 10,000 objects per
second?"

The *producer* could say, "Well, that's all well and good, but I am going
to keep pumping out my data stream. Deal with it."

The producer could get notified that one or more consumers were unable to
drink from the firehose. Let's say *every* consumer barked at the producer
and said, "Just so you know, *none* of us can ingest that data." The
producer could, of course, just keep on going. Or, potentially, it could
say "whoops! my bad," and do its own downsampling.

Alternatively, it could publish a separate, downsampled topic just for a
consumer (or class of consumers).

Or the cluster, on behalf of the producers and consumers, could say, "You
know what? Let me fire up a Flink job to do the downsampling for you."

This would enable more dynamic end-to-end negotiations of services than
simply telling the Kafka consumer group to "deal with it."
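
The options above can be sketched as a toy decision function in Python.
Everything here is hypothetical: `ConsumerReport`, `downsample_factor`,
`negotiate`, and the action names are made up for illustration and are not
part of the KIP or of any Kafka API. It assumes each consumer group reports
the maximum rate it can ingest and that downsampling means keeping 1 of
every N objects.

```python
import math
from dataclasses import dataclass


@dataclass
class ConsumerReport:
    """Hypothetical feedback a consumer group sends about its capacity."""
    group_id: str
    max_rate: float  # objects/second the group says it can ingest


def downsample_factor(produce_rate: float, max_rate: float) -> int:
    """Smallest N such that keeping 1 of every N objects fits within max_rate."""
    if max_rate >= produce_rate:
        return 1
    return math.ceil(produce_rate / max_rate)


def negotiate(produce_rate: float, reports: list[ConsumerReport]):
    """Suggest an action; the producer remains free to ignore it."""
    factors = {
        r.group_id: downsample_factor(produce_rate, r.max_rate)
        for r in reports
        if r.max_rate < produce_rate
    }
    if not factors:
        return ("keep_full_rate", {})
    if len(factors) == len(reports):
        # Nobody can keep up: the producer might downsample at the source.
        return ("downsample_at_source", factors)
    # Only some are behind: a side topic (or a stream job) per slow group.
    return ("side_topics", factors)


reports = [ConsumerReport("dashboards", 10_000), ConsumerReport("archiver", 2_000_000)]
print(negotiate(1_000_000, reports))
```

With the numbers from the example above (1,000,000 objects/second produced,
10,000 requested), the slow group gets a factor of 100 while the group that
can keep up still sees the full stream. Whether the downsampling runs in the
producer, in a side topic, or in something like a Flink job is exactly the
part that would need to be negotiated.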

> I'll be following along, I'm sure there will be some good discussions
> around this!
>
> - Almog
>

Definitely! Great discussion. Hope my answers aren't completely off-base
technically. Gentle corrections (and links to related docs) are welcome.

-Peter.

> On Mon, May 12, 2025 at 4:47 PM Peter Corless
> <peter.corl...@startree.ai.invalid> wrote:
>
> > David Kjerrumgaard and I wrote up the following KIP for Kafka Quality of
> > Service (QoS). It would be a mechanism to describe desired behaviors and
> > actual capabilities of producers, clusters and consumers, and to allow
> > them to negotiate desired throughputs, latencies, data retention, and other
> > elements of data streaming. It would also provide instrumentality for
> > observability to measure actual performance to compare to desired
> > performance.
> >
> > Would love to hear frank and thoughtful feedback, as well as committers
> > who would be interested in working on implementation.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1182%3A+Quality+of+Service+%28QoS%29+Framework
> >
>

-- 

StarTree <https://startree.ai>
Peter Corless
Director of Product Marketing
650-906-3134