[ 
https://issues.apache.org/jira/browse/KAFKA-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534668#comment-16534668
 ] 

ASF GitHub Bot commented on KAFKA-7136:
---------------------------------------

rajinisivaram opened a new pull request #5341: KAFKA-7136: Avoid deadlocks in 
synchronized metrics reporters
URL: https://github.com/apache/kafka/pull/5341
 
 
   Metric update and read must use the same lock to avoid NPEs and concurrent 
modification exceptions. Sensor add/remove/update are synchronized on `Sensor` 
because they access lists and maps that are not thread-safe. Reporters are 
notified of metric add/remove while the (`Sensor`, `Metrics`) locks are held, 
and a reporter may synchronize on its own lock. A reporter may also read 
metrics while holding that lock. Read/update therefore cannot be synchronized 
on `Sensor`, since that could lead to deadlock. This PR introduces a new lock 
in `Sensor` for update/read. 
   Locking order:
   ```
   - Sensor#add: Sensor -> Metrics -> MetricsReporter
   - Metrics#removeSensor: Sensor -> Metrics -> MetricsReporter
   - KafkaMetric#metricValue: MetricsReporter -> Sensor#metricLock
   - Sensor#record: Sensor -> Sensor#metricLock
   ```
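
The locking order above can be illustrated with a minimal sketch. This is not the code in the PR: `SketchSensor`, `SketchReporter` and their methods are hypothetical stand-ins, and the `Metrics` registry lock is left out for brevity. The point it tries to show is that reads take only the dedicated metric lock, so a reporter holding its own lock never waits on the sensor monitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a sensor; not Kafka's actual Sensor class.
class SketchSensor {
    // Dedicated lock for value update/read (plays the role of Sensor#metricLock).
    private final Object metricLock = new Object();
    private final List<String> metricNames = new ArrayList<>();
    private double total = 0.0;

    // Structural change: Sensor -> MetricsReporter (Metrics lock omitted here).
    synchronized void add(String name, SketchReporter reporter) {
        metricNames.add(name);
        reporter.metricChange(name);
    }

    // Update path: Sensor -> metricLock.
    synchronized void record(double value) {
        synchronized (metricLock) {
            total += value;
        }
    }

    // Read path: takes only metricLock, never the sensor monitor.
    double metricValue() {
        synchronized (metricLock) {
            return total;
        }
    }
}

// Hypothetical reporter that guards its own state with a private lock.
class SketchReporter {
    private final Object lock = new Object();
    private final List<String> registered = new ArrayList<>();

    // Called while the sensor monitor is held.
    void metricChange(String name) {
        synchronized (lock) {
            registered.add(name);
        }
    }

    // Read path: MetricsReporter -> metricLock.
    double report(SketchSensor sensor) {
        synchronized (lock) {
            return sensor.metricValue();
        }
    }

    int registeredCount() {
        synchronized (lock) {
            return registered.size();
        }
    }
}

class LockOrderSketch {
    public static void main(String[] args) {
        SketchSensor sensor = new SketchSensor();
        SketchReporter reporter = new SketchReporter();
        sensor.add("bytes-total", reporter);
        sensor.record(2.5);
        sensor.record(1.5);
        System.out.println(reporter.report(sensor)); // prints 4.0
    }
}
```

In this sketch no thread ever requests the sensor monitor while holding the reporter lock, so the registration path and the read path cannot form a cycle.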
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   



> PushHttpMetricsReporter may deadlock when processing metrics changes
> --------------------------------------------------------------------
>
>                 Key: KAFKA-7136
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7136
>             Project: Kafka
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 1.1.0, 2.0.0
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>            Priority: Blocker
>             Fix For: 2.0.0
>
>
> We noticed a deadlock in {{PushHttpMetricsReporter}}. Locking for metrics was 
> changed under KAFKA-6765 to avoid {{NullPointerException}} in metrics 
> reporters due to concurrent reads and updates. {{PushHttpMetricsReporter}} 
> acquires its own lock to process metrics registration, which is invoked while 
> the sensor lock is held. It also reads metrics, which attempts to acquire the 
> sensor lock, while holding its own lock (the inverse order). This resulted in 
> the deadlock below.
> {quote}Found one Java-level deadlock:
>  Java stack information for the threads listed above:
>  ===================================================
>  "StreamThread-7":
>  at org.apache.kafka.tools.PushHttpMetricsReporter.metricChange(PushHttpMetricsReporter.java:144)
>  - waiting to lock <0x0000000655a54310> (a java.lang.Object)
>  at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:563)
>  - locked <0x0000000655a44a28> (a org.apache.kafka.common.metrics.Metrics)
>  at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:236)
>  - locked <0x000000065629c170> (a org.apache.kafka.common.metrics.Sensor)
>  at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:217)
>  at org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:1016)
>  at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:462)
>  at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
>  at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
>  at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271)
>  at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
>  at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
>  at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
>  at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:254)
>  at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1820)
>  at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1798)
>  at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:224)
>  at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:121)
>  at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:74)
>  at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:317)
>  at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:824)
>  at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
>  at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
>  "pool-17-thread-1":
>  at org.apache.kafka.common.metrics.KafkaMetric.measurableValue(KafkaMetric.java:82)
>  - waiting to lock <0x000000065629c170> (a org.apache.kafka.common.metrics.Sensor)
>  at org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.java:58)
>  at org.apache.kafka.tools.PushHttpMetricsReporter$HttpReporter.run(PushHttpMetricsReporter.java:177)
>  - locked <0x0000000655a54310> (a java.lang.Object)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Found 1 deadlock.
> {quote}
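
The inverse lock order visible in the thread dump can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the reporter's code: the class and variable names are hypothetical, and it uses `ReentrantLock.tryLock` with a timeout in place of the intrinsic monitors in the dump, so that the program terminates instead of hanging. One thread takes the "sensor" lock and then wants the "reporter" lock; the other does the opposite.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo of the lock inversion; names do not come from Kafka.
class LockInversionSketch {

    // Returns how many of the two threads could not obtain their second lock.
    static int run() {
        ReentrantLock sensorLock = new ReentrantLock();
        ReentrantLock reporterLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // Registration path: sensor lock first, then reporter lock.
        Thread registration = new Thread(() -> acquireInOrder(
                sensorLock, reporterLock, bothHoldFirst, bothAttempted, blocked), "registration");
        // Reporting path: reporter lock first, then sensor lock (inverse order).
        Thread reporting = new Thread(() -> acquireInOrder(
                reporterLock, sensorLock, bothHoldFirst, bothAttempted, blocked), "reporting");

        registration.start();
        reporting.start();
        try {
            registration.join();
            reporting.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for demo threads", e);
        }
        return blocked.get();
    }

    private static void acquireInOrder(ReentrantLock first, ReentrantLock second,
                                       CountDownLatch bothHoldFirst, CountDownLatch bothAttempted,
                                       AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until each thread holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With plain synchronized blocks this acquisition would never return.
            if (second.tryLock(100, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep the first lock until both threads have finished their attempt.
            bothAttempted.countDown();
            bothAttempted.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on second lock: " + run());
    }
}
```

Because each thread keeps its first lock until both have tried for their second, both attempts time out and `run()` returns 2. With intrinsic monitors there is no timeout, which is why the JVM reports the cycle as a Java-level deadlock.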



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
