[ https://issues.apache.org/jira/browse/KAFKA-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harel Ben Attia updated KAFKA-12225:
------------------------------------
Description:
*TLDR*: There seems to be a major lock contention that can happen on
*{{Log.lock}}* during producer-scaling when produce-request sending is
time-based ({{linger.ms}}) and not data-size based (max batch size).
Hi,
We're running a 5-node Kafka cluster on one of our production systems on AWS.
Recently, we have started to notice that as our producer services scale out,
the Kafka idle percentage drops abruptly from ~70% to 0% on all brokers, even
though none of the physical resources of the brokers are exhausted.
Initially, we realised that our {{num.io.threads}} count was too low, causing
high request queuing and the low idle percentage, so we increased it, hoping to
see one of the physical resources maxing out. After the change we still saw
abrupt drops of the idle percentage to 0% (with no physical resource maxing
out), so we continued to investigate.
The investigation has shown that there's a direct relation to {{linger.ms}}
being the controlling factor of sending out produce requests. Whenever messages
are being sent out from the producer due to the {{linger.ms}} threshold,
scaling out the service increased the number of produce requests in a way which
is not proportional to our traffic increase, bringing down all the brokers to a
near-halt in terms of being able to process requests and, as mentioned, without
any exhaustion of physical resources.
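As a rough illustration of why the request count tracks the number of producer
instances rather than the traffic, here is a minimal Python sketch. It assumes
evenly spread traffic and that a time-bound producer flushes at most one batch
per partition every {{linger.ms}}. The function name and all numbers are
hypothetical and are not measurements from our cluster.

```python
def batches_per_partition_per_sec(total_bytes_per_sec, producers, partitions,
                                  batch_size_bytes, linger_ms):
    """Estimate batches/sec arriving at ONE partition from all producers.

    Simplified model (assumption): traffic is spread evenly over producers
    and partitions, and every producer has data for every partition in each
    linger window, so the time-bound figure is an upper bound.
    """
    # Bytes per second that one producer sends to one partition.
    per_pair_bytes = total_bytes_per_sec / (producers * partitions)
    # Time one producer needs to fill a whole batch for one partition.
    fill_ms = batch_size_bytes / per_pair_bytes * 1000.0
    if fill_ms <= linger_ms:
        # Size-bound: a batch is sent whenever it fills up.
        per_pair_rate = per_pair_bytes / batch_size_bytes
        regime = "size-bound"
    else:
        # Time-bound: a (partly filled) batch is sent every linger_ms.
        per_pair_rate = 1000.0 / linger_ms
        regime = "time-bound"
    return regime, producers * per_pair_rate


# Hypothetical workload: 6 MB/s over 60 partitions, 16 KiB batches.
for producers in (10, 20, 40):
    print(producers, batches_per_partition_per_sec(
        6_000_000, producers, 60, 16_384, linger_ms=5))
```

Under these assumptions the time-bound estimate grows linearly with the number
of producers while the traffic stays constant, whereas the size-bound estimate
reduces to total bytes / partitions / batch size and does not depend on the
producer count at all.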
After some more experiments and profiling a broker through flight recorder, we
found that the cause of the issue is lock contention on a
*{{java.lang.Object}}*, wasting a lot of time on all the
{{data-plane-kafka-request-handler}} threads. 90% of the locks were on Log's
*{{lock: Object}}* instance, inside the *{{Log.append()}}* method. The stack
traces show that these locks occur during the {{handleProduceRequest}} method.
We have ruled out replication as the source of the issues, as there were no
replication issues, and the control-plane has a separate thread pool, so this
focused us back on the actual producers, leading back to the behaviour of our
producer service when scaling out.
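To show why the number of appends can matter more than the byte volume, the
following sketch models the fraction of each second during which a single
{{Log}} lock is held. It assumes that every append pays a fixed cost under the
lock plus a cost proportional to the batch size. Both cost constants are
made-up placeholders, not values we measured, and the model ignores everything
else the broker does.

```python
def lock_busy_fraction(appends_per_sec, bytes_per_sec,
                       fixed_cost_us=100.0, cost_per_kib_us=5.0):
    """Fraction of each second one partition's lock is held (1.0 = saturated).

    Assumption: time under the lock = fixed per-append cost + per-byte cost.
    The default cost constants are hypothetical placeholders.
    """
    fixed_us = appends_per_sec * fixed_cost_us
    copy_us = bytes_per_sec / 1024.0 * cost_per_kib_us
    return (fixed_us + copy_us) / 1_000_000.0


# Same 100 KB/s into one partition, sent as few large or many small batches.
bytes_per_sec = 100_000
for appends in (10, 2_000, 8_000):
    print(appends, round(lock_busy_fraction(appends, bytes_per_sec), 3))
```

In this model the request-handler threads start queuing on the lock as the
fraction approaches 1.0, even though CPU, disk and network are far from
exhausted, which would be consistent with what we observed.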
At that point we thought that the issue might be related to the number of
partitions of the topic (60 currently), and that increasing it would reduce the
lock contention on each {{Log}} instance. However, since each producer writes
to all partitions (data is evenly spread and not skewed), increasing the number
of partitions would only cause each producer to generate more produce-requests,
without alleviating the lock contention. Similarly, increasing the number of
brokers would increase the idle percentage per broker, but would not help
reduce the produce-request latency, since it would not change the rate of
produce-requests per {{Log}}.
Eventually, we worked around the issue by making the {{linger.ms}} value high
enough that it stopped being the controlling factor for sending messages (i.e.
produce-requests became coupled to the traffic volume, with the max batch size
becoming the controlling factor). This allowed us to utilise the cluster better
without upscaling it.
From our analysis, it seems that this lock behaviour limits Kafka's ability to
be robust to producer configuration and scaling, and hurts the ability to do
efficient capacity planning for the cluster, increasing the risk of an
unexpected bottleneck when traffic increases.
It would be great if you can validate these conclusions, or provide any more
information that will help us understand the issue better or work around it in
a more efficient way.
> Unexpected broker bottleneck when scaling producers
> ---------------------------------------------------
>
> Key: KAFKA-12225
> URL: https://issues.apache.org/jira/browse/KAFKA-12225
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Environment: AWS Based
> 5-node cluster running on k8s with EBS attached disks (HDD)
> Kafka Version 2.5.0
> Multiple Producers (KafkaStreams, Akka Streams, golang Sarama)
> Reporter: Harel Ben Attia
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)