Hi Assane,

Thank you for the further information about your motivation and intended use-cases; that adds a lot of context.
> Our motivation here is to accelerate compression with the use of hardware
> accelerators.

This is a very broad statement, so let me break it down into cases, and what I would recommend in each:

Case A: Open source accelerators for supported compression codecs (e.g. zstd)
1. Try to add your accelerator to an existing upstream implementation (e.g. zstd-jni), so that whenever that library is used, people benefit from your accelerator.
2. Fork an existing implementation, and propose that the Kafka project use your fork.

Case B: Closed-source accelerators for supported compression codecs (e.g. zstd)
1. Fork an existing implementation, and structure your fork such that it can be swapped out at runtime by operators that want a particular accelerator.
2. Kafka can add a Java pluggable interface to the broker and clients to pick among the accelerated and non-accelerated plugins, falling back to non-accelerated "reference implementations" as necessary. This wouldn't require protocol changes. (See the sketch below.)

Case C: Accelerators for unsupported open source compression codecs (e.g. brotli)
1. I think that these should be proposed as official codecs for Kafka to support, and then the acceleration can be implemented as in Case A or B.

Case D: Accelerators for unsupported closed-source compression codecs
These are codecs that would require a fully pluggable implementation, and reserved bits in the binary protocol. They are also the codecs which are most damaging to the ecosystem. If you have a specific proprietary codec in mind please say so; otherwise I want to invoke the YAGNI principle here.
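To make Case B option 2 concrete, the plugin surface could be as small as the following sketch. To be clear, this is hypothetical and not an existing Kafka API; which plugin to use, and the fallback, would live in broker/client configuration:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Sketch only: a hypothetical accelerator SPI. An implementation
    // accelerates an already-supported codec, so the bytes it produces
    // must stay readable by the reference implementation of that codec.
    public interface CompressionAccelerator {

        // The built-in codec this accelerates, e.g. "zstd".
        String codecName();

        // Wrap the batch output stream with the accelerated compressor.
        OutputStream wrapForOutput(OutputStream out) throws IOException;

        // Wrap the batch input stream with the accelerated decompressor.
        InputStream wrapForInput(InputStream in) throws IOException;
    }

Because the output stays byte-compatible with the named codec, a peer without the accelerator simply decodes with the reference implementation, and nothing changes on the wire.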
Thanks,
Greg

On Mon, May 13, 2024 at 11:22 AM Diop, Assane <assane.d...@intel.com> wrote:
>
> Hi Greg,
>
> Thank you for your thoughtful response. Resending this email to continue engagement on the KIP discussion.
>
> Our motivation here is to accelerate compression with the use of hardware accelerators.
>
> If the community prefers, we would be happy to contribute code to support compression accelerators, but we believe that introducing a pluggable compression framework is more scalable than enabling new compression algorithms in an ad hoc manner.
>
> A pluggable compression interface would enable hardware accelerators without requiring vendor-specific code in the Kafka code base.
>
> We aim to ensure robustness by supporting all possible language-clients. In this latest iteration, the design provides a path to support other languages, where each client has its own topic holding the plugin information for that language.
>
> The pluggable interface does not replace the built-in functionality; rather, it is an optional compression path seamlessly added for Kafka users who would like to use custom compression algorithms or simply accelerate current algorithms. In the latter case, a vendor providing acceleration for compression will need to support their plugins.
>
> As for your concerns, I appreciate you taking the time to respond. Let me address them the best I can:
>
> 1) When an operator adds a plugin to a cluster, they must ensure that the compression algorithms for all the supported language-clients of that plugin are compatible. For the plugin to be installed, the language must support dynamic loading or linking of libraries, and these mechanisms exist in at least Java, C, Go and Python. Clients written in a language that does not support dynamic loading or linking can still use built-in codecs and coexist in a cluster where plugins were registered. This coexistence highlights that the use of plugins is an optional feature.
>
> 2) Plugin sources should come from reputable developers; this is true of any dependency. Should an operator register a plugin, the plugin should have a path for support, including deprecation of the plugin. If the community finds it useful, there could be an official Kafka repository, and we are open to discussing ways to provide governance of the plugin ecosystem.
>
> 3) We do not see this as a fork of the binary protocol, but rather an evolution of the protocol to provide additional flexibility for compression. Once a plugin is registered, it is compatible with all the "flavors" of the plugin, which here means different minor versions of a codec. Compression algorithms typically follow semantic versioning, where v1.x is compatible with v1.y and where v2.x is not necessarily compatible with v1.x. If a plugin version breaks compatibility with an older version, then it should be treated as a new plugin with a new plugin alias. In parallel to the plugin topic holding plugin information during registration, additional topics holding the plugin binaries can be published by the plugin admin tool during installation to ensure compatibility. We view this as improving performance at the cost of extra operator work.
>
> 4) We only require the operator to register and then install the plugins. During the registration process, the plugin admin tool takes in the plugin information (plugin alias and classname/library) and then internally assigns the pluginID; the operator is only responsible for providing the plugin alias and the className/library. (See the sketch after this list.) The plugin admin tool is a new Java class in Tools that interacts with the operator to set up the plugins in the cluster. At this stage of the KIP, we have assumed a manual installation of the plugin; installing here means the deployment of the plugin binary, making it ready to be dynamically loaded/linked when needed. We are looking at an option for dynamic installation of the plugin, which would require the operator to install the binary using the plugin admin tool. Using the same concept as plugin registration, the operator can install the plugin binary by publishing it to a topic using the plugin admin tool. Clients that register a plugin by consuming the plugin list would also consume the necessary binaries from the cluster.
>
> 5) When a plugin is used, the set of built-in codecs is augmented by the set of plugins described in the plugin topic. The additional set of codecs is cluster-dependent, so, while a given batch of records stays within a cluster, it remains self-contained. If these batches are produced into another cluster, then the operator needs to either recompress the data using builtins/available plugins or install the plugins in the dependent cluster. In this scenario a consumer would decompress the data, and, if the mirrored data needs the same compression plugin, then the operator is required to register and install the plugins in the secondary cluster. Our assertion is that the additional work required by an operator could be justified by improved performance.
>
> 6) There is a finite number of pluginIDs available, based on the number of bits used in the attribute. If a developer or operator is experimenting with multiple plugins, then they can also unregister a plugin if they hit the limit. The number of attribute bits required to represent the pluginID is arbitrary, and we are open to community input here. Ultimately, with the ability to unregister a plugin, fewer bits could be used to represent the pluginID.
>
> 7) While plugins add some complexity to a Kafka deployment, that complexity is mostly the work of the operator to register and install the plugins. Additionally, this increased complexity is all upfront and out-of-band. We try to manage it by using existing Kafka mechanisms such as the plugin topic described earlier.
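> To make the registration flow in 4) concrete, here is a rough sketch of what the plugin admin tool could do under the hood; the topic name, alias and record layout are illustrative only, not part of the KIP's current text:
>
>     import java.util.Properties;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     // Illustrative only: the plugin topic name and value layout are made up.
>     public class PluginRegistrationSketch {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("bootstrap.servers", "localhost:9092");
>             props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>             props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>
>             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>                 // Key: the operator-chosen alias. Value: the implementation
>                 // and the pluginID assigned internally by the tool.
>                 producer.send(new ProducerRecord<>("__compression_plugins", "qzstd",
>                         "{\"language\": \"java\", \"impl\": \"com.example.QzstdCodec\", \"pluginID\": 1}"));
>             }
>         }
>     }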
> We have discussed using a custom Serializer/Deserializer, but, since compression happens at the batch level, using a custom Serializer/Deserializer would compress each message rather than compressing the whole batch. It seems only large records could benefit from this scheme. We are open to suggestions or clarification on this topic.
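> As a minimal sketch of that per-record approach (plain Java; GZIP stands in for any codec):
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.io.UncheckedIOException;
>     import java.util.zip.GZIPOutputStream;
>     import org.apache.kafka.common.serialization.Serializer;
>
>     // A serializer only ever sees one record, so each value is compressed
>     // in isolation; redundancy across the records of a batch is never
>     // exploited, and per-value codec overhead is paid on every message.
>     public class GzipValueSerializer implements Serializer<byte[]> {
>         @Override
>         public byte[] serialize(String topic, byte[] data) {
>             if (data == null) return null;
>             try {
>                 ByteArrayOutputStream buffer = new ByteArrayOutputStream();
>                 try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
>                     gzip.write(data);
>                 }
>                 return buffer.toByteArray();
>             } catch (IOException e) {
>                 throw new UncheckedIOException(e);
>             }
>         }
>     }
>
> This is why only large records tend to benefit: the codec header and warm-up costs are paid once per message instead of once per batch.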
> Again, thank you for sharing your concerns about balancing this proposal against the impact to the ecosystem. We think the additional performance that this could provide, along with the improved flexibility to add or accelerate compression codecs, outweighs the increased complexity for the operators.
>
> Assane
>
> -----Original Message-----
> From: Greg Harris <greg.har...@aiven.io.INVALID>
> Sent: Wednesday, May 1, 2024 12:09 PM
> To: dev@kafka.apache.org
> Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
>
> Hi Assane,
>
> Thanks for the update. Unfortunately, I don't think that the design changes have solved all of the previous concerns, and I feel it has raised new ones.
>
> From my earlier email:
> 1. The KIP has now included Python, but this feature is still disproportionately difficult for statically-linked languages to support.
> 2. This is unaddressed.
> 3. This is unaddressed.
> 4. The KIP now includes a metadata topic that is used to persist a mapping from the binary ID to the full class name, but requires the operator to manage this mapping.
>
> My new concerns are:
> 5. It is not possible to interpret a single message without also reading from this additional metadata (messages are not self-contained).
> 6. There are a finite number of pluggable IDs, and this finite number is baked into the protocol.
> 6a. This is a problem with the existing binary protocol, but this is acceptable as the frequency that a new protocol is added is quite low, and is discussed with the community.
> 6b. Someone experimenting with compression plugins could easily exhaust this limit in a single day, and the limit is exhausted for the lifetime of the cluster. This could be done accidentally or maliciously.
> 6c. Consuming 4 of the remaining 8 reserved bits feels wasteful, compared to the benefit that the protocol is receiving from this feature.
> 7. Implementing support for this feature would require distributing and caching the metadata, which is a significant increase in complexity compared to the current compression mechanisms.
>
> From your motivation section:
> > Although compression is not a new problem, it has continued to be an important research topic.
> > The integration and testing of new compression algorithms into Kafka currently requires significant code changes and rebuilding of the distribution package for Kafka.
>
> I think it is completely appropriate for someone testing an experimental compression algorithm to temporarily fork Kafka, and then discard that fork and all of the compressed data when the experiment is over.
> The project has to balance the experience of upstream developers (including compression researchers), ecosystem developers, and operators, and this proposal's cost to ecosystem developers and operators is too high to justify the benefits.
>
> As an alternative, have you considered implementing a custom Serializer/Deserializer that could implement this feature, and just leave the Kafka compression off?
> I think an "Add Brotli Compression" KIP is definitely worth pursuing, if that is the compression algorithm you have in mind currently.
>
> Thanks,
> Greg
>
> On Mon, Apr 29, 2024 at 3:10 PM Diop, Assane <assane.d...@intel.com> wrote:
> >
> > Hi Divij, Greg and Luke,
> >
> > I have updated the KIP for Kafka pluggable compression, addressing the concerns from the original design. I believe this new design takes lots of the concerns into account and solves them. I would like to receive feedback as I am working on getting this KIP accepted. Not targeting a release or anything, but accepting the concept will help move towards this direction.
> >
> > The link to the KIP is here:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka
> >
> > Assane
> >
> > -----Original Message-----
> > From: Diop, Assane <assane.d...@intel.com>
> > Sent: Wednesday, April 24, 2024 4:58 PM
> > To: dev@kafka.apache.org
> > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi,
> >
> > I would like to bring back attention to https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka
> > I have made significant changes to the design to accommodate the concerns, and would like some feedback from the community to re-engage the discussion.
> >
> > Assane
> >
> > -----Original Message-----
> > From: Diop, Assane
> > Sent: Friday, March 1, 2024 4:45 PM
> > To: dev@kafka.apache.org
> > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi Luke,
> >
> > The proposal doesn't preclude supporting multiple clients, but each client would need an implementation of the pluggable architecture. At the very least, we envision other clients such as librdkafka and kafka-python could be supported by C implementations.
> >
> > We agree with community feedback regarding the need to support these clients, and we are looking at alternative approaches for brokers and clients to coordinate the plugin.
> >
> > One way to do this coordination is for each client to have a configuration mapping of the plugin name to its implementation.
> >
> > Assane
> >
> > -----Original Message-----
> > From: Luke Chen <show...@gmail.com>
> > Sent: Monday, February 26, 2024 7:50 PM
> > To: dev@kafka.apache.org
> > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi Assane,
> >
> > I also share the same concern as Greg, which is that the KIP is not Kafka-ecosystem friendly. It will also couple the Kafka client and broker tightly: once you use the pluggable compression interface, the producer must be a Java client. This seems to go against Kafka's original design.
> > If the proposal can support all kinds of clients, that would be great.
> >
> > Thanks.
> > Luke
> >
> > On Tue, Feb 27, 2024 at 7:44 AM Diop, Assane <assane.d...@intel.com> wrote:
> >
> > > Hi Greg,
> > >
> > > Thanks for taking the time to give some feedback. It was very insightful.
> > >
> > > I have some answers:
> > >
> > > 1. The current proposal is Java centric. We want to figure things out with Java first and then later incorporate other languages. We will get there.
> > >
> > > 2. The question of where the plugins would live is an important one. I would like to get the community's input on where a plugin would live. Officially supported plugins could be part of Kafka, and others could live in a plugin repository. Is there currently a way to store plugins in Kafka and load them into the classpath? If such a space could be allowed, then it would provide a standard way of installing officially supported plugins. In OpenSearch, for example, there is a plugin utility that takes the jar and installs it across the cluster; privileges can be granted by an admin. Such a utility could be implemented in Kafka.
> > >
> > > 3. There are many ways to look at this. We could change the message format that uses the pluggable interface to be, for example, v3, and synchronize against that. In order to use the pluggable codec, you would have to be at message version 3, for example.
> > >
> > > 4. Passing the class name as metadata is one way to have the producer talk to the broker about which plugin to use. However, there could be other implementations where everything needed is set at the topic level, using topic-level compression. In this case, for example, a rule could be that in order to use the pluggable interface, you should use topic-level compression, as sketched below.
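> > > For example (purely illustrative: only compression.type exists today; the "pluggable" value and the compression.plugin key are made up):
> > >
> > >     import java.util.Collection;
> > >     import java.util.List;
> > >     import java.util.Map;
> > >     import org.apache.kafka.clients.admin.Admin;
> > >     import org.apache.kafka.clients.admin.AlterConfigOp;
> > >     import org.apache.kafka.clients.admin.ConfigEntry;
> > >     import org.apache.kafka.common.config.ConfigResource;
> > >
> > >     // Hypothetical: pin a registered plugin via topic-level configs.
> > >     public class TopicPluginConfigSketch {
> > >         public static void main(String[] args) throws Exception {
> > >             ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
> > >             Map<ConfigResource, Collection<AlterConfigOp>> ops = Map.of(topic, List.of(
> > >                     new AlterConfigOp(new ConfigEntry("compression.type", "pluggable"), AlterConfigOp.OpType.SET),
> > >                     new AlterConfigOp(new ConfigEntry("compression.plugin", "qzstd"), AlterConfigOp.OpType.SET)));
> > >             try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
> > >                 admin.incrementalAlterConfigs(ops).all().get();
> > >             }
> > >         }
> > >     }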
> > > I would like to have your valuable inputs on this!!
> > >
> > > Thanks beforehand,
> > > Assane
> > >
> > > -----Original Message-----
> > > From: Greg Harris <greg.har...@aiven.io.INVALID>
> > > Sent: Wednesday, February 14, 2024 2:36 PM
> > > To: dev@kafka.apache.org
> > > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > >
> > > Hi Assane,
> > >
> > > Thanks for the KIP!
> > > Looking back, it appears that the project has only ever added compression types twice: lz4 in 2014 and zstd in 2018, and perhaps Kafka has fallen behind the state-of-the-art compression algorithms. Thanks for working to fix that!
> > >
> > > I do have some concerns:
> > >
> > > 1. I think this is a very "java centric" proposal, and doesn't take non-java clients into enough consideration. librdkafka [1] is a great example of an implementation of the Kafka protocol which doesn't have the same classloading and plugin infrastructure that Java has, which would make implementing this feature much more difficult.
> > >
> > > 2. By making the interface pluggable, it puts the burden of maintaining individual compression codecs onto external developers, who may not be willing to maintain a codec for the service-lifetime of such a codec. An individual developer can easily implement a plugin to allow them to use a cutting-edge compression algorithm without consulting the Kafka project, but as soon as data is compressed using that algorithm, they are on the hook to support that plugin going forward for the organizations using their implementation. Part of the collective benefit of the Kafka project is to ensure the ongoing maintenance of such codecs, and provide a long deprecation window when a codec reaches EOL. I think the Kafka project is well-equipped to evaluate the maturity and properties of compression codecs and then maintain them going forward.
> > >
> > > 3. Also by making the interface pluggable, it reduces the scope of individual compression codecs. No longer is there a single lineage of Kafka protocols, where vN+1 of a protocol supports a codec that vN does not. Now there will be "flavors" of the protocol, and operators will need to ensure that their servers and their clients support the same "flavors" or else encounter errors. This is the sort of protocol forking which can be dangerous to the Kafka community going forward. If there is a single lineage of codecs such that the upstream Kafka vX.Y supports codec Z, it is much simpler for other implementations to check and specify "Kafka vX.Y compatible" than it is to check & specify "Kafka vX.Y & Z compatible".
> > >
> > > 4. The Java class namespace is distributed, as anyone can name their class anything. It achieves this by being very verbose, with long fully-qualified names for classes. This is in conflict with a binary protocol, where it is desirable for the overhead to be as small as possible. This may incentivise developers to keep their class names short, which also makes conflicts more likely. If you have the option of naming your class "B" instead of "org.example.blah.BrotlCompressionCodecVersionOne", and meaningfully save a flat 47 bytes on every request, somebody/everybody is going to do that. This now increases the likelihood of conflict, as perhaps two developers want the same short name. Yes, there are 52 one-letter class names, but to ensure that no two codecs ever conflict requires the global coordination that a pluggable interface tries to avoid. Operators then take on the burden of ensuring that the "B" codec on the other machine is indeed the "B" codec that they have installed on their machines, or else encounter errors.
> > >
> > > I think that having contributors propose that Kafka support their favorite compression type in order to get assigned a globally unique number is much healthier for the ecosystem than making this a pluggable interface and leaving the namespace to be wrangled by operators and client libraries.
> > >
> > > Thanks,
> > > Greg
> > >
> > > [1] https://github.com/confluentinc/librdkafka
> > > [2] https://github.com/apache/kafka/blob/e8c70fce26626ed2ab90f2728a45f6e55e907ec1/clients/src/main/java/org/apache/kafka/common/record/DefaultRecordBatch.java#L130
> > >
> > > On Wed, Feb 14, 2024 at 12:59 PM Diop, Assane <assane.d...@intel.com> wrote:
> > > >
> > > > Hi Divij, Mickael,
> > > >
> > > > Since Mickael's KIP-390 was accepted, I did not want to respond in that thread, so as not to confuse the work.
> > > > As mentioned in that thread, KIP-390 and KIP-984 do not supersede each other; the scope of KIP-984 goes beyond the scope of KIP-390. The pluggable compression interface is added as a new codec, and the codecs already implemented are not affected by this change. I believe these two KIPs are not the same; rather, they complement each other.
> > > >
> > > > As I stated before, the motivation is to give users the ability to use different compressors without needing future changes in Kafka. Kafka currently supports zstd, snappy, gzip and lz4. However, other open source compression projects like the Brotli algorithm are also gaining traction. For example, the HTTP servers Apache and nginx offer Brotli compression as an option. With a pluggable interface, any Kafka developer could integrate and test Brotli with Kafka simply by writing a plugin. This same motivation can be applied to any other compression algorithm, including hardware-accelerated compression. There are hardware companies, including Intel and AMD, that are working on accelerating compression.
> > > >
> > > > The main change in itself is an update to the message format to allow metadata indicating which plugin to use to be passed to the broker. This only happens if the user selects the pluggable codec. The metadata adds an additional 52 bytes to the message format.
> > > >
> > > > Broker recompression is taken care of when the producer and brokers have different codecs, because the plugin is just another codec as far as Kafka is concerned.
> > > >
> > > > I have added more information to https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka and I am ready for a PR if this gets accepted.
> > > >
> > > > Assane
> > > >
> > > > -----Original Message-----
> > > > From: Diop, Assane <assane.d...@intel.com>
> > > > Sent: Wednesday, January 31, 2024 10:24 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > >
> > > > Hi Divij,
> > > >
> > > > Thank you for your response!
> > > >
> > > > Although compression is not a new problem, it has continued to be an important research topic. The integration and testing of new compression algorithms into Kafka currently requires significant code changes and rebuilding of the distribution package for Kafka. This KIP will allow any compression algorithm to be seamlessly integrated into Kafka by writing a plugin that would bind into the wrapForInput and wrapForOutput methods in Kafka (see the sketch below).
> > > >
> > > > As you mentioned, Kafka currently supports zstd, snappy, gzip and lz4. However, other open source compression projects like the Brotli algorithm are also gaining traction. For example, the HTTP servers Apache and nginx offer Brotli compression as an option. With a pluggable interface, any Kafka developer could integrate and test Brotli with Kafka simply by writing a plugin. This same motivation can be applied to any other compression algorithm, including hardware-accelerated compression. There are hardware companies, including Intel and AMD, that are working on accelerating compression.
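> > > > To sketch the plugin shape mentioned above (the PluggableCompressionCodec interface is hypothetical; GZIP merely stands in for Brotli or a hardware-accelerated codec):
> > > >
> > > >     import java.io.IOException;
> > > >     import java.io.InputStream;
> > > >     import java.io.OutputStream;
> > > >     import java.util.zip.GZIPInputStream;
> > > >     import java.util.zip.GZIPOutputStream;
> > > >
> > > >     // Hypothetical plugin contract mirroring the wrapForOutput /
> > > >     // wrapForInput hooks that Kafka's built-in codecs implement.
> > > >     public interface PluggableCompressionCodec {
> > > >         OutputStream wrapForOutput(OutputStream out) throws IOException;
> > > >         InputStream wrapForInput(InputStream in) throws IOException;
> > > >     }
> > > >
> > > >     // Example plugin; any algorithm or accelerator could sit here.
> > > >     class GzipCodecPlugin implements PluggableCompressionCodec {
> > > >         @Override
> > > >         public OutputStream wrapForOutput(OutputStream out) throws IOException {
> > > >             return new GZIPOutputStream(out);
> > > >         }
> > > >
> > > >         @Override
> > > >         public InputStream wrapForInput(InputStream in) throws IOException {
> > > >             return new GZIPInputStream(in);
> > > >         }
> > > >     }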
> > > > This KIP would certainly complement the current https://issues.apache.org/jira/browse/KAFKA-7632 by adding even more flexibility for the users. A plugin could be tailored to arbitrary datasets in response to a user's specific resource requirements.
> > > >
> > > > For reference, other open source projects have already started or implemented this type of plugin technology:
> > > > 1. Cassandra, which has implemented the same concept of a pluggable interface.
> > > > 2. OpenSearch, which is also working on enabling the same type of plugin framework.
> > > >
> > > > With respect to message recompression, the plugin interface would handle this use case on the broker side similarly to the current recompression process.
> > > >
> > > > Assane
> > > >
> > > > -----Original Message-----
> > > > From: Divij Vaidya <divijvaidy...@gmail.com>
> > > > Sent: Friday, December 22, 2023 2:27 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > >
> > > > Thank you for writing the KIP, Assane.
> > > >
> > > > In general, exposing a "pluggable" interface is not a decision made lightly, because it limits our ability to remove or change that interface in the future. Any future changes to the interface will have to remain compatible with existing plugins, which limits the flexibility of changes we can make inside Kafka. Hence, we need a strong motivation for adding a pluggable interface.
> > > >
> > > > 1\ May I ask the motivation for this KIP? Are the current compression codecs (zstd, gzip, lz4, snappy) not sufficient for your use case? Would providing fine-grained compression options as proposed in https://issues.apache.org/jira/browse/KAFKA-7632 and https://cwiki.apache.org/confluence/display/KAFKA/KIP-390%3A+Support+Compression+Level address your use case?
> > > > 2\ "This option impacts the following processes" -> This should also include the decompression and compression that occur during message version transformation, i.e. when a client sends messages with V1 and the broker expects V2, we convert the message and recompress it.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > > On Mon, Dec 18, 2023 at 7:22 PM Diop, Assane <assane.d...@intel.com> wrote:
> > > > >
> > > > > I would like to bring some attention to this KIP. We have added an interface to the compression code that allows anyone to build their own compression plugin and integrate it easily back into Kafka.
> > > > >
> > > > > Assane
> > > > >
> > > > > -----Original Message-----
> > > > > From: Diop, Assane <assane.d...@intel.com>
> > > > > Sent: Monday, October 2, 2023 9:27 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka