Hey David,

The followers replicate from the leader, and when they do that they write
to their own local log. For the Ceph cluster, it sounds like the followers'
writes to their local log are slower? That would make sense if those
writes are going over the network. This could explain why the leader ends
up having to wait longer to hear back from the followers before sending the
produce response, which in turn could explain why the producer purgatory is
bigger. See the section "Commit time: Replicating the record from leader to
followers" in
https://www.confluent.io/blog/configure-kafka-to-minimize-latency/.

To amortize the cost of slower followers, you could look into increasing
linger.ms so that the producer batches a bit more.
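
For example, here's a rough sketch of the producer settings (the broker
address and the linger.ms / batch.size values below are just placeholders to
illustrate the idea, not recommendations for your clusters):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(ProducerConfig.ACKS_CONFIG, "all");         // same acks=all you are testing with
            props.put(ProducerConfig.LINGER_MS_CONFIG, "10");     // illustrative: wait up to 10 ms to fill a batch instead of linger.ms=0
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // optional: give the larger batches room to form
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Produce as usual; with batching, each produce request carries more
                // records, so the per-record cost of waiting on the slower followers
                // is amortized.
            }
        }
    }

Each produce request still waits for all in-sync replicas, but fewer,
larger requests at the same message rate means fewer requests sitting in
purgatory at any one time.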

Hope that helps a bit.

Andrew

On Mon, Feb 27, 2023 at 3:39 PM David Ballano Fernandez <
dfernan...@demonware.net> wrote:

> thank you!
>
> On Mon, Feb 27, 2023 at 12:37 PM David Ballano Fernandez <
> dfernan...@demonware.net> wrote:
>
> > Hi guys,
> >
> > I am load testing a couple of clusters, one with local SSD disks and
> > another one with Ceph.
> >
> > Both clusters have the same amount of CPU/RAM and they are configured the
> > same way. I'm sending the same number of messages and producing with
> > linger.ms=0 and acks=all.
> >
> > Besides seeing higher latencies on Ceph for the most part compared to
> > local disk, there is something that I don't understand.
> >
> > On the local disk cluster, messages per second matches exactly the
> > number of produce requests per second, but on the Ceph cluster messages
> > do not match total produce requests per second.
> >
> > The only thing I can find is that the producer purgatory in the Ceph
> > Kafka cluster has more requests queued up than in the local disk one.
> >
> > Also, RemoteTime-ms for producers is high, which could explain why there
> > are more requests in the purgatory.
> >
> > To me, this means that the producer is waiting to hear back for its
> > acks, which are set to all. But I don't understand why the local disk
> > Kafka cluster's purgatory queue is way lower.
> >
> > Since I don't think disk is used for this, could network saturation
> > (since Ceph is network storage) be interfering with the producer waiting
> > for acks? Is there a way to tune the producer purgatory? I did change
> > num.replica.fetchers, but that only lowered the fetch purgatory.
> >
