[ https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602285#comment-16602285 ]
Rajini Sivaram commented on KAFKA-7304:
---------------------------------------

[~yuyang08] Thanks for reporting this issue. Do you see an actual difference with the patch from [~yuzhih...@gmail.com]? I don't see how the patch helps since the changes are only in {{Selector.close()}}. When a broker goes down, with a very large number of clients, I imagine the connections made for metadata requests could create a load spike on other brokers. What is your {{connections.max.idle.ms}} for brokers? Just to confirm - all the brokers are up for longer than this after a broker restart/leader change, but don't see any drop in memory usage? Also, the metrics attached show a large number of failed authentications, is that expected? I wasn't sure if the metrics correspond to the heap objects in any of the attached screenshots (because the number of active connections in the metrics is quite low).

> memory leakage in org.apache.kafka.common.network.Selector
> ----------------------------------------------------------
>
>                 Key: KAFKA-7304
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7304
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.1.0, 1.1.1
>            Reporter: Yu Yang
>            Priority: Critical
>             Fix For: 1.1.2, 2.0.1, 2.1.0
>
>         Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png,
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png, Screen Shot 2018-08-28 at
> 11.09.45 AM.png, Screen Shot 2018-08-29 at 10.49.03 AM.png, Screen Shot
> 2018-08-29 at 10.50.47 AM.png
>
>
> We are testing secured writing to kafka through ssl. Testing at small scale,
> ssl writing to kafka was fine.
> However, when we enabled ssl writing at a larger scale (>40k clients writing
> concurrently), the kafka brokers soon hit an OutOfMemory issue with a 4G memory
> setting. We tried increasing the heap size to 10Gb, but encountered the same
> issue.
> We took a few heap dumps, and found that most of the heap memory is
> referenced through org.apache.kafka.common.network.Selector objects. There
> are two Channel map fields in Selector. It seems that somehow the objects are
> not deleted from the maps in a timely manner.
> One observation is that the memory leak seems related to kafka partition
> leader changes. If there is a broker restart etc. in the cluster that causes
> a partition leadership change, the brokers may hit the OOM issue faster.
> {code}
>     private final Map<String, KafkaChannel> channels;
>     private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for sample gc
> analysis.
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> the command line for running kafka:
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M
> -Djava.awt.headless=true
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=9999
> -Dcom.sun.management.jmxremote.rmi.port=9999 -cp /usr/local/libs/*
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use java 1.8.0_102, and have applied a
> TLS patch on reducing X509Factory.certCache map size from 750 to 20.
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
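
For readers following the discussion of the two channel maps and the idle timeout, here is a rough, simplified model of the bookkeeping being described. It is not Kafka's Selector code: the class IdleChannelSketch, its methods, and the use of plain Object values in place of KafkaChannel are all hypothetical, and maxIdleMs only stands in for the role that connections.max.idle.ms plays on a broker. The point it illustrates is that a connection keeps its heap footprint until it has left both maps, so entries that are never expired, or that linger in the closing map, still count against the heap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of idle-connection bookkeeping.
// NOT Kafka's Selector; names and structure are assumptions for illustration.
class IdleChannelSketch {
    private final long maxIdleMs; // stands in for connections.max.idle.ms

    // Stand-ins for the two maps quoted in the issue (values would be KafkaChannel).
    private final Map<String, Object> channels = new HashMap<>();
    private final Map<String, Object> closingChannels = new HashMap<>();

    // Access-ordered map: the least recently active connection is iterated first.
    private final Map<String, Long> lastActiveMs = new LinkedHashMap<>(16, 0.75f, true);

    IdleChannelSketch(long maxIdleMs) {
        this.maxIdleMs = maxIdleMs;
    }

    // A new connection is tracked in channels and given an activity timestamp.
    void register(String id, long nowMs) {
        channels.put(id, new Object());
        lastActiveMs.put(id, nowMs);
    }

    // Any traffic on an open connection refreshes its timestamp.
    void touch(String id, long nowMs) {
        if (channels.containsKey(id)) {
            lastActiveMs.put(id, nowMs);
        }
    }

    // Moves every connection idle for at least maxIdleMs into closingChannels.
    List<String> expireIdle(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastActiveMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() < maxIdleMs) {
                break; // everything after this entry was active more recently
            }
            String id = entry.getKey();
            it.remove();
            closingChannels.put(id, channels.remove(id));
            expired.add(id);
        }
        return expired;
    }

    // Only after this call is the connection's memory no longer referenced here.
    void finishClose(String id) {
        closingChannels.remove(id);
    }

    // Number of connections still referenced from either map.
    int retained() {
        return channels.size() + closingChannels.size();
    }

    public static void main(String[] args) {
        IdleChannelSketch sketch = new IdleChannelSketch(1000L);
        sketch.register("client-1", 0L);
        sketch.register("client-2", 0L);
        sketch.touch("client-2", 900L);
        System.out.println("expired: " + sketch.expireIdle(1000L));
        System.out.println("retained before finishClose: " + sketch.retained());
        sketch.finishClose("client-1");
        System.out.println("retained after finishClose: " + sketch.retained());
    }
}
```

In this model, a heap dump taken between expireIdle and finishClose would still show the expired connection as reachable, which is why the idle timeout and how long brokers have been up both matter when reading the attached screenshots.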