I've recently implemented further monitoring of our Kafka cluster to home in on where I think we have bottlenecks. I'm interested in one metric in particular: *kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}*
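
For clarity, the brace notation above stands for three separate MBeans, one per request type. A minimal sketch of that expansion follows, along with a rough way to express how much of TotalTimeMs is RemoteTimeMs. The helper names `request_metric_mbean` and `remote_share` are hypothetical and only for illustration; nothing here talks to a broker.

```python
# Request types named in the metric pattern above.
REQUEST_TYPES = ("Produce", "FetchConsumer", "FetchFollower")


def request_metric_mbean(name, request):
    """Build the JMX MBean name for one RequestMetrics metric (hypothetical helper)."""
    return f"kafka.network:type=RequestMetrics,name={name},request={request}"


def remote_share(total_ms, remote_ms):
    """Fraction of total request time spent in the remote phase.

    Assumes both values are taken at the same percentile. Percentiles of the
    individual phases do not add up exactly to the percentile of the total,
    so treat the result as a rough indicator only.
    """
    if total_ms <= 0:
        return 0.0
    return remote_ms / total_ms


if __name__ == "__main__":
    # Print the MBean names to compare side by side for each request type.
    for request in REQUEST_TYPES:
        print(request_metric_mbean("TotalTimeMs", request))
        print(request_metric_mbean("RemoteTimeMs", request))
```

Comparing the two metrics per request type this way makes it easier to see whether the remote phase dominates for fetches only or for produce requests as well.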

All the docs I've seen accompanying the metric state "non-zero for produce requests when ack=-1". What does it mean, however, for consume requests (FetchConsumer) or follower requests (FetchFollower)?

On my cluster, TotalTimeMs is nice and low for produce requests, which I would expect as we don't set a high acks value. For follower and consume requests, however, TotalTimeMs is nearly 500ms at the 99th percentile, and RemoteTimeMs accounts for the vast majority of that.

My gut is telling me that followers are struggling to replicate from leaders fast enough. Does a high RemoteTimeMs for FetchConsumer therefore mean there is high commit lag (waiting for all replicas in the ISR to be updated)?

Many thanks in advance,
Marcus