Hi Assane,

Thank you for the further information about your motivation and intended use-cases; that adds a lot of context.
> Our motivation here is to accelerate compression with the use of hardware
> accelerators.

This is a very broad statement, so let me break it down into cases, and what I would recommend in each:

Case A: Open source accelerators for supported compression codecs (e.g. zstd)
1. Try to add your accelerator to an existing upstream implementation (e.g. zstd-jni), so that whenever that library is used, people benefit from your accelerator.
2. Fork an existing implementation, and propose that the Kafka project use your fork.

Case B: Closed-source accelerators for supported compression codecs (e.g. zstd)
1. Fork an existing implementation, and structure your fork such that it can be swapped out at runtime by operators that want a particular accelerator.
2. Kafka can add a Java pluggable interface to the broker and clients to pick among the accelerated and non-accelerated plugins, falling back to non-accelerated "reference implementations" as necessary. This wouldn't require protocol changes. (See the sketch below.)

Case C: Accelerators for unsupported open source compression codecs (e.g. brotli)
1. I think that these should be proposed as official codecs for Kafka to support, and then the acceleration can be implemented as in Case A or B.

Case D: Accelerators for unsupported closed-source compression codecs
These are codecs that would require a fully pluggable implementation, and reserved bits in the binary protocol. They are also the codecs which are most damaging to the ecosystem. If you have a specific proprietary codec in mind please say so; otherwise I want to invoke the YAGNI principle here.
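To make Case B option 2 concrete, the plugin surface could be as small as the following sketch. To be clear, this is hypothetical and not an existing Kafka API; which plugin to use, and the fallback, would live in broker/client configuration:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Sketch only: a hypothetical accelerator SPI. An implementation
    // accelerates an already-supported codec, so the bytes it produces
    // must stay readable by the reference implementation of that codec.
    public interface CompressionAccelerator {

        // The built-in codec this accelerates, e.g. "zstd".
        String codecName();

        // Wrap the batch output stream with the accelerated compressor.
        OutputStream wrapForOutput(OutputStream out) throws IOException;

        // Wrap the batch input stream with the accelerated decompressor.
        InputStream wrapForInput(InputStream in) throws IOException;
    }

Because the output stays byte-compatible with the named codec, a peer without the accelerator simply decodes with the reference implementation, and nothing changes on the wire.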
Thanks,
Greg

On Mon, May 13, 2024 at 11:22 AM Diop, Assane <assane.d...@intel.com> wrote:
>
> Hi Greg,
>
> Thank you for your thoughtful response. Resending this email to continue engagement on the KIP discussion.
>
> Our motivation here is to accelerate compression with the use of hardware accelerators.
>
> If the community prefers, we would be happy to contribute code to support compression accelerators, but we believe that introducing a pluggable compression framework is more scalable than enabling new compression algorithms in an ad hoc manner.
>
> A pluggable compression interface would enable hardware accelerators without requiring vendor-specific code in the Kafka code base.
>
> We aim to ensure robustness by supporting all possible language-clients. In this latest iteration, the design provides a path to support other languages, where each client has its own topic holding the plugin information for that language.
>
> The pluggable interface does not replace the built-in functionality; rather, it is an optional compression path seamlessly added for Kafka users who would like to use custom compression algorithms or simply accelerate current algorithms. In the latter case, a vendor providing acceleration for compression will need to support their plugins.
>
> As for your concerns, I appreciate you taking the time to respond. Let me address them the best I can:
>
> 1) When an operator adds a plugin to a cluster, they must ensure that the compression algorithms for all the supported language-clients of that plugin are compatible. For the plugin to be installed, the language must support dynamic loading or linking of libraries, and these mechanisms exist in at least Java, C, Go and Python. Clients written in a language that does not support dynamic loading or linking can still use built-in codecs and coexist in a cluster where plugins were registered. This coexistence highlights that the use of plugins is an optional feature.
>
> 2) Plugin sources should come from reputable developers; this is true of any dependency. Should an operator register a plugin, the plugin should have a path for support, including deprecation of the plugin. If the community finds it useful, there could be an official Kafka repository, and we are open to discussing ways to provide governance of the plugin ecosystem.
>
> 3) We do not see this as a fork of the binary protocol, but rather an evolution of the protocol to provide additional flexibility for compression. Once a plugin is registered, it is compatible with all the "flavors" of the plugin, which here means different minor versions of a codec. Compression algorithms typically follow semantic versioning, where v1.x is compatible with v1.y and where v2.x is not necessarily compatible with v1.x. If a plugin version breaks compatibility with an older version, then it should be treated as a new plugin with a new plugin alias. In parallel to the plugin topic holding plugin information during registration, additional topics holding the plugin binaries can be published by the plugin admin tool during installation to ensure compatibility. We view this as improving performance at the cost of extra operator work.
>
> 4) We only require the operator to register and then install the plugins. During the registration process, the plugin admin tool takes in the plugin information (plugin alias and classname/library) and then internally assigns the pluginID; the operator is only responsible for providing the plugin alias and the className/library. (See the sketch after this list.) The plugin admin tool is a new Java class in Tools that interacts with the operator to set up the plugins in the cluster. At this stage of the KIP, we have assumed a manual installation of the plugin; installing here means the deployment of the plugin binary, making it ready to be dynamically loaded/linked when needed. We are looking at an option for dynamic installation of the plugin, which would require the operator to install the binary using the plugin admin tool. Using the same concept as plugin registration, the operator can install the plugin binary by publishing it to a topic using the plugin admin tool. Clients that register a plugin by consuming the plugin list would also consume the necessary binaries from the cluster.
>
> 5) When a plugin is used, the set of built-in codecs is augmented by the set of plugins described in the plugin topic. The additional set of codecs is cluster-dependent, so, while a given batch of records stays within a cluster, it remains self-contained. If these batches are produced into another cluster, then the operator needs to either recompress the data using builtins/available plugins or install the plugins in the dependent cluster. In this scenario a consumer would decompress the data, and, if the mirrored data needs the same compression plugin, then the operator is required to register and install the plugins in the secondary cluster. Our assertion is that the additional work required by an operator could be justified by improved performance.
>
> 6) There is a finite number of pluginIDs available, based on the number of bits used in the attribute. If a developer or operator is experimenting with multiple plugins, then they can also unregister a plugin if they hit the limit. The number of attribute bits required to represent the pluginID is arbitrary, and we are open to community input here. Ultimately, with the ability to unregister a plugin, fewer bits could be used to represent the pluginID.
>
> 7) While plugins add some complexity to a Kafka deployment, that complexity is mostly the work of the operator to register and install the plugins. Additionally, this increased complexity is all upfront and out-of-band. We try to manage it by using existing Kafka mechanisms such as the plugin topic described earlier.
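> To make the registration flow in 4) concrete, here is a rough sketch of what the plugin admin tool could do under the hood; the topic name, alias and record layout are illustrative only, not part of the KIP's current text:
>
>     import java.util.Properties;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     // Illustrative only: the plugin topic name and value layout are made up.
>     public class PluginRegistrationSketch {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("bootstrap.servers", "localhost:9092");
>             props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>             props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>
>             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>                 // Key: the operator-chosen alias. Value: the implementation
>                 // and the pluginID assigned internally by the tool.
>                 producer.send(new ProducerRecord<>("__compression_plugins", "qzstd",
>                         "{\"language\": \"java\", \"impl\": \"com.example.QzstdCodec\", \"pluginID\": 1}"));
>             }
>         }
>     }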
> We have discussed using a custom Serializer/Deserializer, but, since compression happens at the batch level, using a custom Serializer/Deserializer would compress each message rather than compressing the whole batch. It seems only large records could benefit from this scheme. We are open to suggestions or clarification on this topic.
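> As a minimal sketch of that per-record approach (plain Java; GZIP stands in for any codec):
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.io.UncheckedIOException;
>     import java.util.zip.GZIPOutputStream;
>     import org.apache.kafka.common.serialization.Serializer;
>
>     // A serializer only ever sees one record, so each value is compressed
>     // in isolation; redundancy across the records of a batch is never
>     // exploited, and per-value codec overhead is paid on every message.
>     public class GzipValueSerializer implements Serializer<byte[]> {
>         @Override
>         public byte[] serialize(String topic, byte[] data) {
>             if (data == null) return null;
>             try {
>                 ByteArrayOutputStream buffer = new ByteArrayOutputStream();
>                 try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
>                     gzip.write(data);
>                 }
>                 return buffer.toByteArray();
>             } catch (IOException e) {
>                 throw new UncheckedIOException(e);
>             }
>         }
>     }
>
> This is why only large records tend to benefit: the codec header and warm-up costs are paid once per message instead of once per batch.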
> Again, thank you for sharing your concerns about balancing this proposal against the impact to the ecosystem. We think the additional performance that this could provide, along with the improved flexibility to add or accelerate compression codecs, outweighs the increased complexity for the operators.
>
> Assane
>
> -----Original Message-----
> From: Greg Harris <greg.har...@aiven.io.INVALID>
> Sent: Wednesday, May 1, 2024 12:09 PM
> To: dev@kafka.apache.org
> Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
>
> Hi Assane,
>
> Thanks for the update. Unfortunately, I don't think that the design changes have solved all of the previous concerns, and I feel it has raised new ones.
>
> From my earlier email:
> 1. The KIP has now included Python, but this feature is still disproportionately difficult for statically-linked languages to support.
> 2. This is unaddressed.
> 3. This is unaddressed.
> 4. The KIP now includes a metadata topic that is used to persist a mapping from the binary ID to the full class name, but requires the operator to manage this mapping.
>
> My new concerns are:
> 5. It is not possible to interpret a single message without also reading from this additional metadata (messages are not self-contained).
> 6. There are a finite number of pluggable IDs, and this finite number is baked into the protocol.
> 6a. This is a problem with the existing binary protocol, but this is acceptable as the frequency that a new protocol is added is quite low, and is discussed with the community.
> 6b. Someone experimenting with compression plugins could easily exhaust this limit in a single day, and the limit is exhausted for the lifetime of the cluster. This could be done accidentally or maliciously.
> 6c. Consuming 4 of the remaining 8 reserved bits feels wasteful, compared to the benefit that the protocol is receiving from this feature.
> 7. Implementing support for this feature would require distributing and caching the metadata, which is a significant increase in complexity compared to the current compression mechanisms.
>
> From your motivation section:
> > Although compression is not a new problem, it has continued to be an important research topic.
> > The integration and testing of new compression algorithms into Kafka currently requires significant code changes and rebuilding of the distribution package for Kafka.
>
> I think it is completely appropriate for someone testing an experimental compression algorithm to temporarily fork Kafka, and then discard that fork and all of the compressed data when the experiment is over.
> The project has to balance the experience of upstream developers (including compression researchers), ecosystem developers, and operators, and this proposal's cost to ecosystem developers and operators is too high to justify the benefits.
>
> As an alternative, have you considered implementing a custom Serializer/Deserializer that could implement this feature, and just leave the Kafka compression off?
> I think an "Add Brotli Compression" KIP is definitely worth pursuing, if that is the compression algorithm you have in mind currently.
>
> Thanks,
> Greg
>
> On Mon, Apr 29, 2024 at 3:10 PM Diop, Assane <assane.d...@intel.com> wrote:
> >
> > Hi Divij, Greg and Luke,
> >
> > I have updated the KIP for Kafka pluggable compression, addressing the concerns from the original design. I believe this new design takes lots of the concerns into account and solves them. I would like to receive feedback as I am working on getting this KIP accepted. Not targeting a release or anything, but accepting the concept will help move towards this direction.
> >
> > The link to the KIP is here:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka
> >
> > Assane
> >
> > -----Original Message-----
> > From: Diop, Assane <assane.d...@intel.com>
> > Sent: Wednesday, April 24, 2024 4:58 PM
> > To: dev@kafka.apache.org
> > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi,
> >
> > I would like to bring back attention to https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka
> > I have made significant changes to the design to accommodate the concerns, and would like some feedback from the community to re-engage the discussion.
> >
> > Assane
> >
> > -----Original Message-----
> > From: Diop, Assane
> > Sent: Friday, March 1, 2024 4:45 PM
> > To: dev@kafka.apache.org
> > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi Luke,
> >
> > The proposal doesn't preclude supporting multiple clients, but each client would need an implementation of the pluggable architecture. At the very least, we envision other clients such as librdkafka and kafka-python could be supported by C implementations.
> >
> > We agree with community feedback regarding the need to support these clients, and we are looking at alternative approaches for brokers and clients to coordinate the plugin.
> >
> > One way to do this coordination is for each client to have a configuration mapping of the plugin name to its implementation.
> >
> > Assane
> >
> > -----Original Message-----
> > From: Luke Chen <show...@gmail.com>
> > Sent: Monday, February 26, 2024 7:50 PM
> > To: dev@kafka.apache.org
> > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> >
> > Hi Assane,
> >
> > I also share the same concern as Greg, which is that the KIP is not Kafka-ecosystem friendly. It will also couple the Kafka client and broker tightly: once you use the pluggable compression interface, the producer must be a Java client. This seems to go against Kafka's original design.
> > If the proposal can support all kinds of clients, that would be great.
> >
> > Thanks.
> > Luke
> >
> > On Tue, Feb 27, 2024 at 7:44 AM Diop, Assane <assane.d...@intel.com> wrote:
> >
> > > Hi Greg,
> > >
> > > Thanks for taking the time to give some feedback. It was very insightful.
> > >
> > > I have some answers:
> > >
> > > 1. The current proposal is Java centric. We want to figure things out with Java first and then later incorporate other languages. We will get there.
> > >
> > > 2. The question of where the plugins would live is an important one. I would like to get the community's input on where a plugin would live. Officially supported plugins could be part of Kafka, and others could live in a plugin repository. Is there currently a way to store plugins in Kafka and load them into the classpath? If such a space could be allowed, then it would provide a standard way of installing officially supported plugins. In OpenSearch, for example, there is a plugin utility that takes the jar and installs it across the cluster; privileges can be granted by an admin. Such a utility could be implemented in Kafka.
> > >
> > > 3. There are many ways to look at this. We could change the message format that uses the pluggable interface to be, for example, v3, and synchronize against that. In order to use the pluggable codec, you would have to be at message version 3, for example.
> > >
> > > 4. Passing the class name as metadata is one way to have the producer talk to the broker about which plugin to use. However, there could be other implementations where everything needed is set at the topic level, using topic-level compression. In this case, for example, a rule could be that in order to use the pluggable interface, you should use topic-level compression, as sketched below.
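> > > For example (purely illustrative: only compression.type exists today; the "pluggable" value and the compression.plugin key are made up):
> > >
> > >     import java.util.Collection;
> > >     import java.util.List;
> > >     import java.util.Map;
> > >     import org.apache.kafka.clients.admin.Admin;
> > >     import org.apache.kafka.clients.admin.AlterConfigOp;
> > >     import org.apache.kafka.clients.admin.ConfigEntry;
> > >     import org.apache.kafka.common.config.ConfigResource;
> > >
> > >     // Hypothetical: pin a registered plugin via topic-level configs.
> > >     public class TopicPluginConfigSketch {
> > >         public static void main(String[] args) throws Exception {
> > >             ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
> > >             Map<ConfigResource, Collection<AlterConfigOp>> ops = Map.of(topic, List.of(
> > >                     new AlterConfigOp(new ConfigEntry("compression.type", "pluggable"), AlterConfigOp.OpType.SET),
> > >                     new AlterConfigOp(new ConfigEntry("compression.plugin", "qzstd"), AlterConfigOp.OpType.SET)));
> > >             try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
> > >                 admin.incrementalAlterConfigs(ops).all().get();
> > >             }
> > >         }
> > >     }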
> > > I would like to have your valuable inputs on this!!
> > >
> > > Thanks beforehand,
> > > Assane
> > >
> > > -----Original Message-----
> > > From: Greg Harris <greg.har...@aiven.io.INVALID>
> > > Sent: Wednesday, February 14, 2024 2:36 PM
> > > To: dev@kafka.apache.org
> > > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > >
> > > Hi Assane,
> > >
> > > Thanks for the KIP!
> > > Looking back, it appears that the project has only ever added compression types twice: lz4 in 2014 and zstd in 2018, and perhaps Kafka has fallen behind the state-of-the-art compression algorithms. Thanks for working to fix that!
> > >
> > > I do have some concerns:
> > >
> > > 1. I think this is a very "java centric" proposal, and doesn't take non-java clients into enough consideration. librdkafka [1] is a great example of an implementation of the Kafka protocol which doesn't have the same classloading and plugin infrastructure that Java has, which would make implementing this feature much more difficult.
> > >
> > > 2. By making the interface pluggable, it puts the burden of maintaining individual compression codecs onto external developers, who may not be willing to maintain a codec for the service-lifetime of such a codec. An individual developer can easily implement a plugin to allow them to use a cutting-edge compression algorithm without consulting the Kafka project, but as soon as data is compressed using that algorithm, they are on the hook to support that plugin going forward for the organizations using their implementation. Part of the collective benefit of the Kafka project is to ensure the ongoing maintenance of such codecs, and provide a long deprecation window when a codec reaches EOL. I think the Kafka project is well-equipped to evaluate the maturity and properties of compression codecs and then maintain them going forward.
> > >
> > > 3. Also by making the interface pluggable, it reduces the scope of individual compression codecs. No longer is there a single lineage of Kafka protocols, where vN+1 of a protocol supports a codec that vN does not. Now there will be "flavors" of the protocol, and operators will need to ensure that their servers and their clients support the same "flavors" or else encounter errors. This is the sort of protocol forking which can be dangerous to the Kafka community going forward. If there is a single lineage of codecs such that the upstream Kafka vX.Y supports codec Z, it is much simpler for other implementations to check and specify "Kafka vX.Y compatible" than it is to check & specify "Kafka vX.Y & Z compatible".
> > >
> > > 4. The Java class namespace is distributed, as anyone can name their class anything. It achieves this by being very verbose, with long fully-qualified names for classes. This is in conflict with a binary protocol, where it is desirable for the overhead to be as small as possible. This may incentivise developers to keep their class names short, which also makes conflicts more likely. If you have the option of naming your class "B" instead of "org.example.blah.BrotlCompressionCodecVersionOne", and meaningfully save a flat 47 bytes on every request, somebody/everybody is going to do that. This now increases the likelihood of conflict, as perhaps two developers want the same short name. Yes, there are 52 one-letter class names, but to ensure that no two codecs ever conflict requires the global coordination that a pluggable interface tries to avoid. Operators then take on the burden of ensuring that the "B" codec on the other machine is indeed the "B" codec that they have installed on their machines, or else encounter errors.
> > >
> > > I think that having contributors propose that Kafka support their favorite compression type in order to get assigned a globally unique number is much healthier for the ecosystem than making this a pluggable interface and leaving the namespace to be wrangled by operators and client libraries.
> > >
> > > Thanks,
> > > Greg
> > >
> > > [1] https://github.com/confluentinc/librdkafka
> > > [2] https://github.com/apache/kafka/blob/e8c70fce26626ed2ab90f2728a45f6e55e907ec1/clients/src/main/java/org/apache/kafka/common/record/DefaultRecordBatch.java#L130
> > >
> > > On Wed, Feb 14, 2024 at 12:59 PM Diop, Assane <assane.d...@intel.com> wrote:
> > > >
> > > > Hi Divij, Mickael,
> > > >
> > > > Since Mickael's KIP-390 was accepted, I did not want to respond in that thread, so as not to confuse the work.
> > > > As mentioned in that thread, KIP-390 and KIP-984 do not supersede each other; the scope of KIP-984 goes beyond the scope of KIP-390. The pluggable compression interface is added as a new codec, and the codecs already implemented are not affected by this change. I believe these two KIPs are not the same; rather, they complement each other.
> > > >
> > > > As I stated before, the motivation is to give users the ability to use different compressors without needing future changes in Kafka. Kafka currently supports zstd, snappy, gzip and lz4. However, other open source compression projects like the Brotli algorithm are also gaining traction. For example, the HTTP servers Apache and nginx offer Brotli compression as an option. With a pluggable interface, any Kafka developer could integrate and test Brotli with Kafka simply by writing a plugin. This same motivation can be applied to any other compression algorithm, including hardware-accelerated compression. There are hardware companies, including Intel and AMD, that are working on accelerating compression.
> > > >
> > > > The main change in itself is an update to the message format to allow metadata indicating which plugin to use to be passed to the broker. This only happens if the user selects the pluggable codec. The metadata adds an additional 52 bytes to the message format.
> > > >
> > > > Broker recompression is taken care of when the producer and brokers have different codecs, because the plugin is just another codec as far as Kafka is concerned.
> > > >
> > > > I have added more information to https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka and I am ready for a PR if this gets accepted.
> > > >
> > > > Assane
> > > >
> > > > -----Original Message-----
> > > > From: Diop, Assane <assane.d...@intel.com>
> > > > Sent: Wednesday, January 31, 2024 10:24 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > >
> > > > Hi Divij,
> > > >
> > > > Thank you for your response!
> > > >
> > > > Although compression is not a new problem, it has continued to be an important research topic. The integration and testing of new compression algorithms into Kafka currently requires significant code changes and rebuilding of the distribution package for Kafka. This KIP will allow any compression algorithm to be seamlessly integrated into Kafka by writing a plugin that would bind into the wrapForInput and wrapForOutput methods in Kafka (see the sketch below).
> > > >
> > > > As you mentioned, Kafka currently supports zstd, snappy, gzip and lz4. However, other open source compression projects like the Brotli algorithm are also gaining traction. For example, the HTTP servers Apache and nginx offer Brotli compression as an option. With a pluggable interface, any Kafka developer could integrate and test Brotli with Kafka simply by writing a plugin. This same motivation can be applied to any other compression algorithm, including hardware-accelerated compression. There are hardware companies, including Intel and AMD, that are working on accelerating compression.
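> > > > To sketch the plugin shape mentioned above (the PluggableCompressionCodec interface is hypothetical; GZIP merely stands in for Brotli or a hardware-accelerated codec):
> > > >
> > > >     import java.io.IOException;
> > > >     import java.io.InputStream;
> > > >     import java.io.OutputStream;
> > > >     import java.util.zip.GZIPInputStream;
> > > >     import java.util.zip.GZIPOutputStream;
> > > >
> > > >     // Hypothetical plugin contract mirroring the wrapForOutput /
> > > >     // wrapForInput hooks that Kafka's built-in codecs implement.
> > > >     public interface PluggableCompressionCodec {
> > > >         OutputStream wrapForOutput(OutputStream out) throws IOException;
> > > >         InputStream wrapForInput(InputStream in) throws IOException;
> > > >     }
> > > >
> > > >     // Example plugin; any algorithm or accelerator could sit here.
> > > >     class GzipCodecPlugin implements PluggableCompressionCodec {
> > > >         @Override
> > > >         public OutputStream wrapForOutput(OutputStream out) throws IOException {
> > > >             return new GZIPOutputStream(out);
> > > >         }
> > > >
> > > >         @Override
> > > >         public InputStream wrapForInput(InputStream in) throws IOException {
> > > >             return new GZIPInputStream(in);
> > > >         }
> > > >     }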
> > > > This KIP would certainly complement the current https://issues.apache.org/jira/browse/KAFKA-7632 by adding even more flexibility for the users. A plugin could be tailored to arbitrary datasets in response to a user's specific resource requirements.
> > > >
> > > > For reference, other open source projects have already started or implemented this type of plugin technology:
> > > > 1. Cassandra, which has implemented the same concept of a pluggable interface.
> > > > 2. OpenSearch, which is also working on enabling the same type of plugin framework.
> > > >
> > > > With respect to message recompression, the plugin interface would handle this use case on the broker side similarly to the current recompression process.
> > > >
> > > > Assane
> > > >
> > > > -----Original Message-----
> > > > From: Divij Vaidya <divijvaidy...@gmail.com>
> > > > Sent: Friday, December 22, 2023 2:27 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > >
> > > > Thank you for writing the KIP, Assane.
> > > >
> > > > In general, exposing a "pluggable" interface is not a decision made lightly, because it limits our ability to remove or change that interface in the future. Any future changes to the interface will have to remain compatible with existing plugins, which limits the flexibility of changes we can make inside Kafka. Hence, we need a strong motivation for adding a pluggable interface.
> > > >
> > > > 1\ May I ask the motivation for this KIP? Are the current compression codecs (zstd, gzip, lz4, snappy) not sufficient for your use case? Would providing fine-grained compression options as proposed in https://issues.apache.org/jira/browse/KAFKA-7632 and https://cwiki.apache.org/confluence/display/KAFKA/KIP-390%3A+Support+Compression+Level address your use case?
> > > > 2\ "This option impacts the following processes" -> This should also include the decompression and compression that occur during message version transformation, i.e. when a client sends messages with V1 and the broker expects V2, we convert the message and recompress it.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > > On Mon, Dec 18, 2023 at 7:22 PM Diop, Assane <assane.d...@intel.com> wrote:
> > > > >
> > > > > I would like to bring some attention to this KIP. We have added an interface to the compression code that allows anyone to build their own compression plugin and integrate it easily back into Kafka.
> > > > >
> > > > > Assane
> > > > >
> > > > > -----Original Message-----
> > > > > From: Diop, Assane <assane.d...@intel.com>
> > > > > Sent: Monday, October 2, 2023 9:27 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: DISCUSS KIP-984 Add pluggable compression interface to Kafka
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka