Responses in-line below:

On Tue, May 13, 2025 at 8:26 AM Almog Gavra <almog.ga...@gmail.com> wrote:

> Thanks for the KIP Peter! Curious to see where this one goes, I think it's
> good to start a discussion around this though perhaps we'll need to split
> it up into more focused improvements as there's a lot bundled in this one
> idea!
>
> A0. I'd like to see some folk that are more familiar with the broker
> implementation to chime in around the feasibility of implementing some of
> this. AFAIK, there's no capabilities that allow (for example) shifting
> resources between topics. Isolating that from a resource allocation
> perspective may be a huge lift, though certainly a valuable one.

Correct. While I think the Confluent people are likely in the lead in this
regard, even for them there are limits:

    *CFK does not support an automated change to storage classes on an
    existing deployment. To make changes to a storage class on an existing
    deployment, such as listed below, contact Confluent Support:*

    - *Migrating from one storage class to another.*
    - *Changing the storage class, for example, enabling encryption on the
      persistent volumes.*

    Docs here: <https://docs.confluent.io/operator/current/co-storage.html>

For today, I am presuming what most people will do is have entirely
different clusters for different tiers of service. ("Do you want to go with
an S3 slow lane, or an NVMe-backed fast lane?" And the producer decides
which to shunt their traffic to.) But the proposal should not *preclude*
the concepts of dynamic provisioning on the same cluster.

Yet if the work and thought for this requires forking it off to a different
KIP (or series of KIPs) to handle resource allocation, I would not be
averse.

> A1. With A0 in mind, I'm wondering what the benefit for making the QoS spec
> an open standard - it depends heavily both on the broker implementation and
> on how it's deployed (containerized? bare metal? k8s?). That makes what we
> can practically offer bundled with the default implementation limited.
> OTOH, I'm not sure whether users benefit from "open standards, free of
> vendor bias as much as possible" If the specification is customizable
> enough to allow for vendor specific extensions.

Good point. How this gets implemented on bare metal, or with containers or
k8s, is a vital consideration. The design should not *preclude* any of
those. Yet having more advanced infrastructure like k8s will, invariably,
enable more advanced capabilities such as dynamic provisioning.

To me, it *has* to be an open standard if we want to see something take off
today, now-ish, and get implemented before the end of time. Even if a major
vendor or two weighs in with their take, I would hope that the overall OSS
Apache Kafka community has some strong opinions on how to ensure it's the
best overall implementation to meet the widest and most pressing of
community needs, and to prevent it from being a walled garden of
proprietary implementation.

Yet I don't want to hold back vendors if they want to add some special
sauce because their implementation of Kafka-compatible services is capable
of something that vanilla Apache Kafka just can't (or won't) do.

> A2. More a technical note, but the dynamic negotiation between producer and
> consumer seems to break a key abstraction of Kafka which is decoupling
> producers from consumers. That might work well if you have one consumer,
> but if you have multiple I imagine you wouldn't want one lagging to cause
> the producer to back up.

That's the thing. Right now, from what I understand [please correct me if I
am mistaken], producers and consumers are basically unaware of each other.
The producer just pumps out data. The consumer just takes in the topic if
and as it can manage it. While there is dynamic scaling of Kafka consumer
groups, you still might have caps on spending, time-to-scale issues
(provisioning and rebalancing, which can take seconds or a minute or more),
or other resource constraints such as disk IO.
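
To put rough numbers on the time-to-scale issue, here is a minimal Python
sketch. It assumes constant produce and consume rates, and the function
names and figures are hypothetical illustrations, not anything defined by
Kafka or by the KIP:

```python
import math


def backlog_after(producer_rate: float, consumer_rate: float, seconds: float) -> float:
    """Messages left unconsumed after `seconds`, assuming constant rates.

    Hypothetical helper for illustration; rates are in messages per second.
    """
    return max(0.0, (producer_rate - consumer_rate) * seconds)


def catch_up_seconds(backlog: float, producer_rate: float, scaled_consumer_rate: float) -> float:
    """Seconds for a scaled-up consumer group to drain `backlog`.

    Returns infinity when the group has no headroom over the producer,
    i.e. it can never catch up at these rates.
    """
    headroom = scaled_consumer_rate - producer_rate
    if headroom <= 0:
        return math.inf
    return backlog / headroom


# Made-up figures: the producer writes 50,000 msg/s, the group reads
# 20,000 msg/s, and scaling out (provisioning plus rebalance) takes 60 s.
lag = backlog_after(50_000, 20_000, 60)
drain = catch_up_seconds(lag, 50_000, 80_000)
print(f"lag after scale-up window: {lag:,.0f} messages")
print(f"time to drain once scaled to 80,000 msg/s: {drain:.0f} s")
```

With these made-up figures, a one-minute scale-up window leaves 1.8 million
messages of lag, and the scaled-up group needs another minute to drain it.
If the scaled-up group still cannot exceed the producer's rate, the lag
never clears.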

Today, you just get impedance mismatches between producers and consumer
groups. In a QoS-enabled future world, there could be new ways to deal with
it.

The consumer group could say, "Wait! You're doing 1,000,000 objects/second!
I can't keep up! Please, can I get it downsampled to 10,000 objects per
second?"

The *producer* would say, "Well, that's all well and good, but I am going
to keep pumping out my data stream. Deal with it."

The producer could get notified that one or more consumers were incapable
of drinking from the firehose. Let's say *every* consumer barked at the
producer and said, "Just so you know, *none* of us can ingest that data."

The producer could, of course, just keep on going. Or, potentially, it
could say "whoops! my bad," and do its own downsampling. Alternatively, it
could fork a process to produce a separate, downsampled topic just for a
consumer (or class of consumers).

Or the cluster, on behalf of the producers and consumers, could say, "You
know what? Let me fire up a Flink job to do the downsampling for you."

This would enable more dynamic end-to-end negotiations of services than
simply telling the Kafka consumer group to "deal with it."

> I'll be following along, I'm sure there will be some good discussions
> around this!
>
> - Almog

Definitely! Great discussion. Hope my answers aren't completely off-base
technically. Gentle corrections (and links to related docs) are welcome.

-Peter.

> On Mon, May 12, 2025 at 4:47 PM Peter Corless
> <peter.corl...@startree.ai.invalid> wrote:
>
> > David Kjerrumgaard and I wrote up the following KIP for Kafka Quality of
> > Service (QoS). It would be a mechanism to describe desired behaviors and
> > actual capabilities of producers, clusters and consumers, and to allow them
> > to negotiate desired throughputs, latencies, data retention, and other
> > elements of data streaming.
> > It would also provide instrumentality for
> > observability to measure actual performance to compare to desired
> > performance.
> >
> > Would love to hear frank and thoughtful feedback, as well as committers who
> > would be interested in working on implementation.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1182%3A+Quality+of+Service+%28QoS%29+Framework
> >
> > --
> > Peter Corless
> > Director of Product Marketing, StarTree <https://startree.ai>
> > 650-906-3134