Oh, and meant to say, zstd is a good compromise between CPU and compression ratio, IIRC it was far less costly on CPU than gzip.
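As a quick illustration of how well redundant string payloads compress (this is a sketch using Python's stdlib gzip; zstd itself isn't in the stdlib and would need a third-party package such as `zstandard`, and the payload below is made up):

```python
import gzip

# A made-up payload resembling a batch of records full of redundant
# string fields: repeated field names and values compress very well.
record = b'{"user_id": "user-12345", "action": "impression", "source": "web"}' * 20

gz = gzip.compress(record)
saved = 100 * (1 - len(gz) / len(record))
print(f"raw: {len(record)} bytes, gzip: {len(gz)} bytes ({saved:.0f}% smaller)")
```

Running something like this against a sample of your real record batches is a cheap way to sanity-check whether compression is worth it before touching any Kafka config.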
So yeah, I generally recommend setting your topic's compression to "producer", and then going from there.

On Wed, 16 Mar 2022 at 11:49, Liam Clarke-Hutchinson <[email protected]> wrote:

> Sounds like a goer then :) Those strings in the protobuf always get ya;
> you can't use clever encodings for them like you can with numbers.

On Wed, 16 Mar 2022 at 11:29, Dan Hill <[email protected]> wrote:

> We're using protos, but there are still a bunch of custom fields where
> clients specify redundant strings.
>
> My local test is showing a 75% reduction in size if I use zstd or gzip. I
> care the most about Kafka storage costs right now.

On Tue, Mar 15, 2022 at 2:25 PM Liam Clarke-Hutchinson <[email protected]> wrote:

> Hi Dan,
>
> Okay, so if you're looking for low latency, I'm guessing that you're using
> a very low linger.ms in the producers? Also, what format are the records?
> If they're already in a binary format like Protobuf or Avro, unless
> they're composed largely of strings, compression may offer little benefit.
>
> With your small records, I'd suggest running some tests with your current
> config under different compression settings (none, snappy, lz4; don't
> bother with gzip unless that's all you have) and checking producer metrics
> (available via JMX if you're using the Java clients) for avg-batch-size
> and compression-ratio.
>
> You may just wish to start with no compression, and then consider moving
> to it if/when network bandwidth becomes a bottleneck.
>
> Regards,
>
> Liam

On Tue, 15 Mar 2022 at 17:05, Dan Hill <[email protected]> wrote:

> Thanks, Liam!
>
> I have a mixture of Kafka record sizes: 10% are large (>100 KB) and 90%
> are smaller than 1 KB. I'm working on a streaming analytics solution that
> streams impressions, user actions, and serving info and combines them
> together. End-to-end latency is more important than storage size.

On Mon, Mar 14, 2022 at 3:27 PM Liam Clarke-Hutchinson <[email protected]> wrote:

> Hi Dan,
>
> Decompression generally only happens in the broker if the topic has a
> particular compression algorithm set and the producer is using a different
> one; in that case the broker will decompress records from the producer,
> then recompress them using the topic's configured algorithm. (The
> LogCleaner will also decompress then recompress records when compacting
> compressed topics.)
>
> The consumer decompresses compressed record batches it receives.
>
> In my opinion, using topic compression instead of producer compression
> would only make sense if the overhead of the few extra CPU cycles
> compression uses was not tolerable for the producing app. In all of my use
> cases, network throughput becomes a bottleneck long before producer
> compression CPU cost does.
>
> For your "if X, do Y" formulation I'd say: if your producer is sending
> tiny batches, do some analysis of compressed vs. uncompressed size for
> your given compression algorithm; you may find that compression overhead
> increases batch size for tiny batches.
>
> If you're sending a large amount of data, do tune your batching and use
> compression to reduce the data being sent over the wire.
>
> If you can tell us more about what your problem domain is, there might be
> more advice that's applicable :)
>
> Cheers,
>
> Liam Clarke-Hutchinson

On Tue, 15 Mar 2022 at 10:05, Dan Hill <[email protected]> wrote:

> Hi. I looked around for advice about Kafka compression. I've seen mixed
> and conflicting advice.
>
> Is there any sorta "if X, do Y" type of documentation around Kafka
> compression?
>
> Any advice? Any good posts to read that talk about this trade-off?
>
> *Detailed comments*
> I tried looking for producer vs topic compression. I didn't find much.
> Some of the information I see is from back in 2011 (which I'm guessing is
> pretty stale).
>
> I can guess some potential benefits, but I don't know if they are actually
> real. I've also seen some sites claim certain trade-offs, but it's unclear
> if they're true.
>
> It looks like I can modify an existing topic's compression. I don't know
> if that actually works. I'd assume it'd just impact data going forward.
>
> I've seen multiple sites say that decompression happens in the broker, and
> multiple that say it happens in the consumer.
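For reference, topic-level compression can be set (or changed on an existing topic, which only affects batches written from that point on) via kafka-configs; a sketch, assuming a broker at localhost:9092 and a topic named "events" (both placeholders):

```shell
# Tell the broker to keep whatever compression the producer used,
# avoiding broker-side decompress/recompress.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config compression.type=producer

# Verify the current setting.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name events
```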

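Producer-side, the knobs discussed in this thread live in the producer config; a sketch of a properties fragment (the values are illustrative, not recommendations):

```properties
# Producer compression codec: none | gzip | snappy | lz4 | zstd
compression.type=lz4
# A low linger keeps end-to-end latency down, at the cost of smaller
# (and therefore less compressible) batches.
linger.ms=5
# Upper bound on batch size in bytes (16384 is the default).
batch.size=16384
```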