Hi guys, I am load testing a couple of Kafka clusters, one with local SSD disks and another one with Ceph.
Both clusters have the same amount of CPU/RAM and are configured the same way. I'm sending the same number of messages and producing with linger.ms=0 and acks=all. Besides seeing higher latencies on Ceph for the most part compared to local disk, there is something I don't understand.

On the local-disk cluster, messages per second matches the number of produce requests exactly, but on the Ceph cluster messages per second does not match total produce requests per second. The only difference I can find is that the producer purgatory on the Ceph cluster has many more requests queued up than on the local-disk cluster. RemoteTime-ms for produce requests is also high, which could explain why more requests are sitting in purgatory.

To me this means the produce requests are parked waiting for acknowledgements from all replicas, since acks is set to all. But I don't understand why the local-disk cluster's purgatory queue is so much lower, since I don't think the disk is involved in that wait. Could it be that network saturation, since Ceph is network storage, is interfering with the replication the producer is waiting on for its acks? Is there a way to tune the producer purgatory? I did change num.replica.fetchers, but that only lowered the fetch purgatory.
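For reference, this is roughly how I'm driving the load test on both clusters (a minimal sketch; the bootstrap address, topic name, and message count are placeholders, the relevant parts are linger.ms=0 and acks=all):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LoadTestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address; both clusters get the same producer settings.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Send immediately, no batching delay.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");
        // Wait for the full in-sync replica set to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same message volume against both clusters.
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("loadtest-topic", Integer.toString(i), "payload-" + i));
            }
        }
    }
}
```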