Hi Christo and all! Thank you for your questions.
> CL - 1: In the same lane as Luke's comment, it would be very useful to see explicitly what will stay on disk and what won't stay on disk

Good point. I added a small section "Disk usage and lack thereof" to KIP-1150.

> CL - 2: It would also be very useful to explicitly say what the interactions will be with the Kraft-related topic - would it be diskless or on disk?

If I understand your question correctly, the KRaft metadata topic is not affected: it remains on disk.

> CL - 3: Do you envision that this feature will work with KIP-932?

Yes. We haven't checked this explicitly in our PoC, but KIP-932 appears orthogonal to how data is stored, so we expect little change (if any) will be needed to make the two work together.

> CL - 4: KIP-1163 says that there won't be a production-grade implementation of the Batch Coordinator and KIP-1164 says the opposite. Which one would it be?

Sorry for the confusion. To be clear: there will be a production-grade implementation of the Batch Coordinator. However, I wonder: KIP-1163 says there won't be a production-grade implementation of the storage backend interface; maybe that's what you meant?

> CL - 5: KIP-1163 says that the Batch Coordinator doesn't need to concern itself with object storage and KIP-1164 says that it will manage the object physical deletion. Which one would it be?

Yes, you're right: it's a bit confusing. KIP-1164 used to say that the coordinator "manages" physical object deletion, but it doesn't perform the deletion itself (the brokers do). That was indeed unclear, so I added this clarification to the KIP. Does it make sense?

> CL - 6: Could you go in a bit more details on whether we would need changes to the Kafka clients to achieve what you are proposing? If no changes are necessary to the clients then what changes would be necessary to brokers to make clients believe they are communicating with the "right" brokers? Would those make it in KIP-1163?

The Fetch API already supports sending the rack ID in modern versions. We plan to let the broker use the consumer racks passed through this field to form the topic metadata response correctly for consumers. Unfortunately, there's no such field in the Produce request, so we plan to propose introducing one in the (yet to be published) "KIP-E: Producer Rack Awareness". There's also the possibility of providing a fallback for older and slower-moving third-party clients via the "client.id" field (e.g. appending something like `,diskless_rack=abc`).
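To make CL-6 a bit more concrete, here is a minimal sketch of what the client side could look like. It only uses existing client configs; the broker-side behavior and the `,diskless_rack=...` client.id convention are illustrative assumptions, not something any of the KIPs specifies yet:

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class DisklessRackHints {
        public static void main(String[] args) {
            // Consumer side: the existing client.rack config (KIP-392) already reaches the
            // broker via the Fetch protocol, so a diskless-aware broker could use it to
            // build rack-local metadata and fetch responses. No client changes needed.
            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            consumerProps.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);

            // Producer side: there is no rack field in the Produce request today, so the
            // hypothetical fallback is to piggyback the rack on client.id. The
            // ",diskless_rack=..." format is purely illustrative and not part of any KIP.
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            producerProps.put(ProducerConfig.CLIENT_ID_CONFIG, "my-app,diskless_rack=us-east-1a");
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps);

            producer.close();
            consumer.close();
        }
    }

The client.id route is clearly a stopgap: the broker would have to parse the suffix out of the string, which is why the preferred path is a proper field in the Produce request.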
> CL - 7: Where and how would indexes (offset, time, producer snapshot) live? In particular, I am interested in how the reference Batch Coordinator will quickly (for a certain definition of quickly) rebuild state?

This will be a concern of a particular batch coordinator implementation. For example, if the implementation is backed by a DB, using the native indexing mechanism of that DB would be the most natural choice. If it's the topic-based implementation proposed by KIP-1164, these indices should be built, updated, and stored locally when the state is materialized from the topic content. If we converge on using SQLite, it would be either the native SQLite indexing mechanism or dedicated tables (if we figure out we need something fancier).

> CL - 8: I think that we try to have as few Kafka dependencies as possible. The closure of compile + runtime broker-only dependencies is currently 16 (if I have done my analysis correctly). What problem(s) do you envision w.r.t. spilling to disk which we wouldn't be able to solve with our own implementation that require SQLite?

We will need to perform reads and writes on (potentially) multi-gigabyte state. It would be great to have atomic transactions and fault-tolerant writes on it. As an example: when we read a record from the metadata topic and apply it, we need to update the state and, at the same time, the consumed offset in the metadata partition, and we want this to survive a process crash without corrupting the whole state. SQLite makes this trivial, but implementing it on our own would be quite challenging. There are also quality-of-life improvements like the aforementioned indices and the query language. Does it make sense?
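Since CL-7 and CL-8 both touch on this, here is a small illustration of why SQLite is attractive for the coordinator state. The sketch is not from KIP-1164: the schema, table names, and the use of the sqlite-jdbc JDBC driver are assumptions made up for the example. It shows the pattern described above, i.e. applying a materialized batch and advancing the consumed offset of the metadata partition in one transaction:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class CoordinatorStateSketch {

        // Record one committed batch and advance the consumed offset of the metadata
        // partition in a single transaction: after a crash, either both writes are
        // visible or neither is, so the materialized state never gets out of sync.
        static void applyRecord(Connection db, long batchId, String topic, int partition,
                                long baseOffset, long lastOffset, long consumedOffset) throws SQLException {
            db.setAutoCommit(false);
            try (PreparedStatement insertBatch = db.prepareStatement(
                     "INSERT INTO batches(batch_id, topic, part, base_offset, last_offset) VALUES (?, ?, ?, ?, ?)");
                 PreparedStatement advance = db.prepareStatement(
                     "UPDATE progress SET consumed_offset = ? WHERE id = 0")) {
                insertBatch.setLong(1, batchId);
                insertBatch.setString(2, topic);
                insertBatch.setInt(3, partition);
                insertBatch.setLong(4, baseOffset);
                insertBatch.setLong(5, lastOffset);
                insertBatch.executeUpdate();

                advance.setLong(1, consumedOffset);
                advance.executeUpdate();

                db.commit();   // both writes become durable together
            } catch (SQLException e) {
                db.rollback(); // a failure leaves the previous state fully intact
                throw e;
            }
        }

        public static void main(String[] args) throws SQLException {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:coordinator-state.db");
                 Statement ddl = db.createStatement()) {
                ddl.executeUpdate("CREATE TABLE IF NOT EXISTS batches("
                        + "batch_id INTEGER PRIMARY KEY, topic TEXT, part INTEGER, "
                        + "base_offset INTEGER, last_offset INTEGER)");
                ddl.executeUpdate("CREATE TABLE IF NOT EXISTS progress(id INTEGER PRIMARY KEY, consumed_offset INTEGER)");
                ddl.executeUpdate("INSERT OR IGNORE INTO progress(id, consumed_offset) VALUES (0, -1)");
                applyRecord(db, 1L, "payments", 0, 0L, 99L, 42L);
            }
        }
    }

The indices from CL-7 (time, producer snapshot, etc.) would then be ordinary CREATE INDEX statements or dedicated tables on top of the same state.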
Best,
Ivan

On Tue, Apr 22, 2025, at 16:04, Christo Lolov wrote:
> Hello!
>
> I want to start with saying that this is a big and impressive undertaking and I am really excited to see its progression! I am posting my initial comments in this thread, but they span a few of the child KIPs. Let me know which questions you would like to move elsewhere. I understand that you want first a consensus on the direction, but I think I still need designs on a few of the core areas to form an opinion.
>
> CL - 1: In the same lane as Luke's comment, it would be very useful to see explicitly what will stay on disk and what won't stay on disk
>
> CL - 2: It would also be very useful to explicitly say what the interactions will be with the Kraft-related topic - would it be diskless or on disk?
>
> CL - 3: Do you envision that this feature will work with KIP-932?
>
> CL - 4: KIP-1163 says that there won't be a production-grade implementation of the Batch Coordinator and KIP-1164 says the opposite. Which one would it be?
>
> CL - 5: KIP-1163 says that the Batch Coordinator doesn't need to concern itself with object storage and KIP-1164 says that it will manage the object physical deletion. Which one would it be?
>
> CL - 6: Could you go in a bit more details on whether we would need changes to the Kafka clients to achieve what you are proposing? If no changes are necessary to the clients then what changes would be necessary to brokers to make clients believe they are communicating with the "right" brokers? Would those make it in KIP-1163?
>
> CL - 7: Where and how would indexes (offset, time, producer snapshot) live? In particular, I am interested in how the reference Batch Coordinator will quickly (for a certain definition of quickly) rebuild state?
>
> CL - 8: I think that we try to have as few Kafka dependencies as possible. The closure of compile + runtime broker-only dependencies is currently 16 (if I have done my analysis correctly). What problem(s) do you envision w.r.t. spilling to disk which we wouldn't be able to solve with our own implementation that require SQLite?
>
> Once again, great work so far!
>
> Best,
> Christo
>
> On Sun, 20 Apr 2025 at 23:04, Stanislav Kozlovski <stanislavkozlov...@apache.org> wrote:
> > This is an amazing initiative. Huge kudos for driving it. We should incorporate it one way or another.
> >
> > I have a suggestion I'd like to hear your thoughts on. I'm cognizant of the effort required for KIP-1150 so I don't necessarily want to increase the scope - but thinking about this early on can help design later on, plus shape the motivation.
> >
> > The idea is to introduce support for replicationless acks=1 writes. This would be very similar to how AutoMQ's WAL+S3 feature works, as far as I understand it.
> >
> > Could we have Diskless Brokers serve acks=1 produce requests by immediately persisting the data on disk (not sure if we should use fsync or not), responding to the request, and then still asynchronously batching said data with regular acks=all data via the "diskless.append.commit.interval.ms"/ "diskless.append.buffer.max.bytes" configs?
> >
> > If I'm not mistaken, this would offer very similar guarantees as today's acks=1 requests, where a period of low durability exists b/w the time the leader persists to its local disk and the time all followers persist to their disk. Granted, in traditional Kafka this period is probably no more than a hundred milliseconds, and here it'd be at least 2x higher. But I believe that given the major savings, many acks=1 users will be happy to make the tradeoff.
> >
> > While on the topic of cost, I hastily ran some cost calculations and found that the KIP should reduce replication costs by more than 80x. (https://topicpartition.io/blog/kip-1150-diskless-topics-in-apache-kafka). There may be some errors there as the batch coordinator RPC and merging isn't fully fleshed out - but I believe it's directionally correct. It may be worth to add that to the motivation in one way or another - so as to be able to quantify the numbers.
> >
> > Best,
> > Stanislav
> >
> > On 2025/04/19 11:02:30 Ivan Yurchenko wrote:
> > > Hi Ziming,
> > >
> > > > 1. Is this feature available by just a minor adjust of config or it will intrude current code heavily, say, AutoMq is 100% compatible with Kafka and doesn’t intrude the code heavily
> > >
> > > If we speak about the part visible to the user, we expect:
> > > 1. Minimal changes to the client code (with potential fallback with even 0 changes for older clients).
> > > 2. A limited set of new configurations for broker and topics.
> > > Otherwise, this should be a perfectly normal Apache Kafka.
> > >
> > > > 2. Though we are not discussing implement details, it’s worth giving some high-level architecture ideas, and it’s better to compare with AutoMq like systems.
> > >
> > > There's quite a bit of high-level architecture in a sub-KIP-1163 [1]. We didn't do comparison to AutoMQ (to the best of our knowledge, they have a fairly different approach), but if this helps the community to get the idea then sure, we should do this.
> > >
> > > > 3. What we will provide through it, I think we will just provide a common interface and put implementations in another repos, just as we did for Kafka Connect and Kafka Tired Storage.
> > >
> > > This is true for the component that does CRUD operations on object storage. However, for the batch coordinator we would like to provide a decent out-of-the-box self-contained (i.e. no external deps like database) implementation that many Kafka users who don't have challenging scaling requirements would benefit from. There's the sub-KIP-1164 [2] for this.
> > >
> > > > 4. How to deal with KRaft related protocol, since metadata topic is managed differently with __cluster_metadata, through this KIP, will we align the gap between __cluster_metadata and data topics by put metadata in an object storage? if so, there will be no standby controller? since standby controller is the __cluster_metadata followers and there will be no followers.
> > >
> > > The current plan is to not directly work with the KRaft and __cluster_metadata. What we need from KRaft is 3 types of events: topic/partition creation, topic deletion, and topic configuration changes (with the possibility to limit this set to topic deletion only). We think that'd be enough if we have a "bridge" that watches for these events in __cluster_metadata and reflects them in the batch coordinator (basically, by sending requests).
> > > Does this answer the question or maybe I misunderstood?
> > >
> > > Best,
> > > Ivan
> > >
> > > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > > [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Topic+Based+Batch+Coordinator
> > >
> > > On Fri, Apr 18, 2025, at 12:42, Ziming Deng wrote:
> > > > Hi Josep,
> > > >
> > > > This would be a fascinating feature, some well known Kafka users are using Kafka in a cloud-native env. As for as I know, there are already some secondary development version Kafka which provide this feature, for example, I am using AutoMq(https://github.com/AutoMQ/automq) in my environment, which significantly helped ms reduced the cost, so I think it’s worthwhile to clarify some related details:
> > > > 1. Is this feature available by just a minor adjust of config or it will intrude current code heavily, say, AutoMq is 100% compatible with Kafka and doesn’t intrude the code heavily
> > > > 2. Though we are not discussing implement details, it’s worth giving some high-level architecture ideas, and it’s better to compare with AutoMq like systems.
> > > > 3. What we will provide through it, I think we will just provide a common interface and put implementations in another repos, just as we did for Kafka Connect and Kafka Tired Storage.
> > > > 4. How to deal with KRaft related protocol, since metadata topic is managed differently with __cluster_metadata, through this KIP, will we align the gap between __cluster_metadata and data topics by put metadata in an object storage? if so, there will be no standby controller? since standby controller is the __cluster_metadata followers and there will be no followers.
> > > >
> > > > —
> > > > Ziming
> > > >
> > > > > On Apr 16, 2025, at 19:58, Josep Prat <josep.p...@aiven.io.INVALID> wrote:
> > > > >
> > > > > Hi Kafka Devs!
> > > > >
> > > > > We want to start a new KIP discussion about introducing a new type of topics that would make use of Object Storage as the primary source of storage. However, as this KIP is big we decided to split it into multiple related KIPs.
> > > > > We have the motivational KIP-1150 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics) that aims to discuss if Apache Kafka should aim to have this type of feature at all. This KIP doesn't go onto details on how to implement it. This follows the same approach used when we discussed KRaft.
> > > > >
> > > > > But as we know that it is sometimes really hard to discuss on that meta level, we also created several sub-kips (linked in KIP-1150) that offer an implementation of this feature.
> > > > >
> > > > > We kindly ask you to use the proper DISCUSS threads for each type of concern and keep this one to discuss whether Apache Kafka wants to have this feature or not.
> > > > >
> > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > >
> > > > > ------------------
> > > > > Josep Prat
> > > > > Open Source Engineering Director, Aiven
> > > > > josep.p...@aiven.io | +491715557497 | aiven.io
> > > > > Aiven Deutschland GmbH
> > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > Anna Richardson, Kenneth Chen
> > > > > Amtsgericht Charlottenburg, HRB 209739 B