Hi Wes,

Are you monitoring librdkafka statistics as well?
If so, are there any correlating spikes in the per-broker and per-partition
statistics?
Such as:
 - brokers.<NAME>.rtt.avg                     <-- broker round-trip-time (latency)
 - brokers.<NAME>.waitresp_cnt                <-- requests in flight
 - topics.<TOPIC>.partitions.<N>.msgq_cnt     <-- internal message queue
 - topics.<TOPIC>.partitions.<N>.xmitq_cnt    <-- transmit queue
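As a rough illustration of pulling out the fields listed above: librdkafka emits its statistics as a JSON document (at the interval set by `statistics.interval.ms`), and the dotted names correspond to nested keys in it. The sketch below is in Python only for brevity. The function name `extract_spike_metrics` and the sample payload are made up for this example and are not part of librdkafka; a real statistics document contains many more fields.

```python
import json


def extract_spike_metrics(stats_json):
    """Flatten the per-broker and per-partition fields of interest
    from one librdkafka statistics JSON document into a dict keyed
    by dotted name."""
    stats = json.loads(stats_json)
    metrics = {}

    # Per-broker: round-trip time and requests awaiting a response.
    for name, broker in stats.get("brokers", {}).items():
        metrics[f"brokers.{name}.rtt.avg"] = broker.get("rtt", {}).get("avg")
        metrics[f"brokers.{name}.waitresp_cnt"] = broker.get("waitresp_cnt")

    # Per-partition: internal message queue and transmit queue depths.
    for topic, topic_data in stats.get("topics", {}).items():
        for part, part_data in topic_data.get("partitions", {}).items():
            prefix = f"topics.{topic}.partitions.{part}"
            metrics[f"{prefix}.msgq_cnt"] = part_data.get("msgq_cnt")
            metrics[f"{prefix}.xmitq_cnt"] = part_data.get("xmitq_cnt")

    return metrics


# Hypothetical, heavily trimmed payload used only to exercise the function.
sample = json.dumps({
    "brokers": {
        "broker1:9092/1": {"rtt": {"avg": 1500}, "waitresp_cnt": 4},
    },
    "topics": {
        "pings": {
            "partitions": {
                "342": {"msgq_cnt": 10, "xmitq_cnt": 2},
            },
        },
    },
})

for key, value in sorted(extract_spike_metrics(sample).items()):
    print(key, value)
```

Feeding each emitted statistics document through something like this and graphing the values next to the broker req/sec graphs should make any correlation with the spikes easier to spot.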

Regards,
Magnus

2014-11-11 19:50 GMT+01:00 Wes Chow <w...@chartbeat.com>:

> We're seeing periodic spikes in req/sec rates across our nodes. Our
> cluster is 10 nodes, and the topic has a replication factor of 3. We push
> around 200k messages / sec into Kafka.
>
> The machines are running the most recent version of Kafka and we're
> connecting via librdkafka. pingstream02-10 are using the CMS garbage
> collector, but I switched pingstream01 to use G1GC under the theory that
> maybe these were GC pauses. The graph shows that likely didn't improve the
> situation.
>
> My next thought is that maybe this is the effect of log rolling. Checking
> in the logs, I see a lot of this:
>
> [2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO kafka.log.Log - Rolled new log segment for 'pings-342' in 3 ms.
> [2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO kafka.log.Log - Rolled new log segment for 'pings-186' in 2 ms.
> [2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO kafka.log.Log - Rolled new log segment for 'pings-253' in 3 ms.
> [2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO kafka.log.Log - Rolled new log segment for 'pings-209' in 3 ms.
> [2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO kafka.log.Log - Rolled new log segment for 'pings-299' in 2 ms.
> [2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO kafka.log.Log - Rolled new log segment for 'pings-506' in 2 ms.
>
> The "pings" topic in question has 512 partitions, so it does this 512
> times every so often. We have an effective retention period of a bit less
> than 30 min, so rolling happens pretty frequently. Still, if I assume worst
> case that rolling locks up the process for 2ms and there are 512 rolls
> every few minutes, I'd expect halting to happen for about a second at a
> time. The graphs seem to indicate much longer dips, but it's hard for me to
> know if I'm looking at real data or some sort of artifact.
>
> Fwiw, the producers are not reporting any errors, so it does not seem like
> we're losing data.
>
> I'm new to Kafka. Should I be worried? If so, how should I be debugging
> this?
>
> Thanks,
> Wes
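
For reference, the worst-case estimate in the quoted message (512 partition rolls, each stalling the process for a couple of milliseconds) can be checked with a few lines of Python. This is only a sketch of that back-of-the-envelope reasoning: it keeps the quoted assumption that every roll blocks the whole process and that the rolls happen back to back, which may not hold in practice. The function name is made up for the example.

```python
PARTITIONS = 512  # partitions in the "pings" topic, per the quoted message


def total_roll_stall_seconds(partitions, roll_ms):
    """Total stall time in seconds if every partition roll blocks the
    process for roll_ms milliseconds and the rolls are fully serialized
    (the worst-case assumption from the quoted message)."""
    return partitions * roll_ms / 1000.0


# The quoted log lines report rolls taking 2 ms or 3 ms each.
for roll_ms in (2, 3):
    total = total_roll_stall_seconds(PARTITIONS, roll_ms)
    print(f"{roll_ms} ms per roll -> {total:.3f} s total")
```

Both values are on the order of one to one and a half seconds, in line with the "about a second" estimate, so dips that last much longer than that would need some other explanation.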