Re: Compression - producer vs topic?

Dan Hill Tue, 15 Mar 2022 17:30:59 -0700

Thanks, Liam!  I was convinced to do zstd.  I'm using an older version of
Flink that uses an older Kafka Producer (so zstd isn't available in it).
I'll switch to zstd when I upgrade.


On Tue, Mar 15, 2022 at 3:52 PM Liam Clarke-Hutchinson <lclar...@redhat.com>
wrote:

> Oh, and meant to say, zstd is a good compromise between CPU and compression
> ratio, IIRC it was far less costly on CPU than gzip.
>
> So yeah, I generally recommend setting your topic's compression to
> "producer", and then going from there.
>
> On Wed, 16 Mar 2022 at 11:49, Liam Clarke-Hutchinson <lclar...@redhat.com>
> wrote:
>
> > Sounds like a goer then :) Those strings in the protobuf always get ya,
> > can't use clever encodings for them like you can with numbers.
> >
> > On Wed, 16 Mar 2022 at 11:29, Dan Hill <quietgol...@gmail.com> wrote:
> >
> >> We're using protos but there are still a bunch of custom fields where
> >> clients specify redundant strings.
> >>
> >> My local test is showing 75% reduction in size if I use zstd or gzip.  I
> >> care the most about Kafka storage costs right now.
> >>
> >> On Tue, Mar 15, 2022 at 2:25 PM Liam Clarke-Hutchinson <
> >> lclar...@redhat.com>
> >> wrote:
> >>
> >> > Hi Dan,
> >> >
> >> > Okay, so if you're looking for low latency, I'm guessing that you're
> >> > using a very low linger.ms in the producers? Also, what format are
> >> > the records? If they're already in a binary format like Protobuf or
> >> > Avro, unless they're composed largely of strings, compression may
> >> > offer little benefit.
> >> >
> >> > With your small records, I'd suggest running some tests with your
> >> > current config with different compression settings - none, snappy,
> >> > lz4, (don't bother with gzip unless that's all you have) and checking
> >> > producer metrics (available via JMX if you're using the Java clients)
> >> > for batch-size-avg and compression-rate-avg.
> >> >
> >> > You may just wish to start with no compression, and then consider
> >> > moving to it if/when network bandwidth becomes a bottleneck.
> >> >
> >> > Regards,
> >> >
> >> > Liam
> >> >
> >> > On Tue, 15 Mar 2022 at 17:05, Dan Hill <quietgol...@gmail.com> wrote:
> >> >
> >> > > Thanks, Liam!
> >> > >
> >> > > I have a mixture of Kafka record sizes.  10% are large (>100 KB) and
> >> > > 90% of the records are smaller than 1 KB.  I'm working on a streaming
> >> > > analytics solution that streams impressions, user actions and serving
> >> > > info and combines them together.  End-to-end latency is more
> >> > > important than storage size.
> >> > >
> >> > >
> >> > > On Mon, Mar 14, 2022 at 3:27 PM Liam Clarke-Hutchinson <
> >> > > lclar...@redhat.com>
> >> > > wrote:
> >> > >
> >> > > > Hi Dan,
> >> > > >
> >> > > > Decompression generally only happens in the broker if the topic
> >> > > > has a particular compression algorithm set, and the producer is
> >> > > > using a different one - then the broker will decompress records
> >> > > > from the producer, then recompress them using the topic's
> >> > > > configured algorithm. (The LogCleaner will also decompress then
> >> > > > recompress records when compacting compressed topics).
> >> > > >
> >> > > > The consumer decompresses compressed record batches it receives.
> >> > > >
> >> > > > In my opinion, using topic compression instead of producer
> >> > > > compression would only make sense if the overhead of a few more
> >> > > > CPU cycles compression uses was not tolerable for the producing
> >> > > > app. In all of my use cases, network throughput becomes a
> >> > > > bottleneck long before producer compression CPU cost does.
> >> > > >
> >> > > > For your "if X, do Y" formulation I'd say - if your producer is
> >> > > > sending tiny batches, do some analysis of compressed vs.
> >> > > > uncompressed size for your given compression algorithm - you may
> >> > > > find that compression overhead increases batch size for tiny
> >> > > > batches.
> >> > > >
> >> > > > If you're sending a large amount of data, do tune your batching
> >> > > > and use compression to reduce data being sent over the wire.
> >> > > >
> >> > > > If you can tell us more about your problem domain, there might be
> >> > > > more advice that's applicable :)
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Liam Clarke-Hutchinson
> >> > > >
> >> > > > On Tue, 15 Mar 2022 at 10:05, Dan Hill <quietgol...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi.  I looked around for advice about Kafka compression.  I've
> >> > > > > seen mixed and conflicting advice.
> >> > > > >
> >> > > > > Is there any sorta "if X, do Y" type of documentation around
> >> > > > > Kafka compression?
> >> > > > >
> >> > > > > Any advice?  Any good posts to read that talk about this
> >> > > > > trade-off?
> >> > > > >
> >> > > > > *Detailed comments*
> >> > > > > I tried looking for producer vs topic compression.  I didn't
> >> > > > > find much.  Some of the information I see is back from 2011
> >> > > > > (which I'm guessing is pretty stale).
> >> > > > >
> >> > > > > I can guess some potential benefits but I don't know if they
> >> > > > > are actually real.  I've also seen some sites claim certain
> >> > > > > trade-offs but it's unclear if they're true.
> >> > > > >
> >> > > > > It looks like I can modify an existing topic's compression.  I
> >> > > > > don't know if that actually works.  I'd assume it'd just impact
> >> > > > > data going forward.
> >> > > > >
> >> > > > > I've seen multiple sites say that decompression happens in the
> >> > > > > broker and multiple that say it happens in the consumer.
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
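
The "compressed vs. uncompressed size" analysis suggested in the thread can
be roughed out locally before touching any producer config. The sketch below
is only an illustration under several assumptions: it uses gzip because it
ships with Python (the thread prefers zstd, lz4 or snappy), the record layout
and field names in make_record are hypothetical, and plain concatenation is a
crude stand-in for a real Kafka record batch, which has its own framing. It
should still show the two effects mentioned above: fixed compression overhead
can make a tiny payload larger, while a large batch of records with redundant
strings shrinks considerably.

```python
import gzip
import json


def make_record(i: int) -> bytes:
    """Build one small, hypothetical record with redundant string fields."""
    record = {
        "user_id": i,
        "event_type": "impression",
        "platform": "web",
        "custom": {"campaign": "spring_sale", "placement": "home_feed"},
    }
    return json.dumps(record).encode("utf-8")


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by uncompressed size (lower is better)."""
    return len(gzip.compress(payload)) / len(payload)


# A "tiny batch": one very short payload, dominated by gzip's fixed overhead.
tiny_batch = b"click:user:12345"

# A "large batch": many similar records concatenated, as a rough stand-in
# for what a producer would accumulate before sending.
large_batch = b"".join(make_record(i) for i in range(1000))

for name, payload in [("tiny", tiny_batch), ("large", large_batch)]:
    ratio = compression_ratio(payload)
    print(f"{name}: {len(payload)} bytes uncompressed, ratio {ratio:.2f}")
```

A ratio above 1.0 means compression made the payload bigger. For real
numbers, the same comparison is better done with sampled production records
and the codec you actually plan to use, and then confirmed against the
producer's own metrics.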
