[ https://issues.apache.org/jira/browse/KAFKA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423839#comment-15423839 ]

Joel Koshy commented on KAFKA-4050:
-----------------------------------

A stack trace should help further clarify. (This is from a thread dump that Todd shared with us offline). Thanks [~toddpalino] and [~mgharat] for finding this.

{noformat}
"kafka-network-thread-1393-SSL-30" #114 prio=5 os_prio=0 tid=0x00007f2ec8c30800 nid=0x5c1e waiting for monitor entry [0x00007f213b8f9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.security.provider.NativePRNG$RandomIO.implNextBytes(NativePRNG.java:481)
        - waiting to lock <0x0000000641508bf8> (a java.lang.Object)
        at sun.security.provider.NativePRNG$RandomIO.access$400(NativePRNG.java:329)
        at sun.security.provider.NativePRNG.engineNextBytes(NativePRNG.java:218)
        at java.security.SecureRandom.nextBytes(SecureRandom.java:468)
        - locked <0x000000066aad9880> (a java.security.SecureRandom)
        at sun.security.ssl.CipherBox.createExplicitNonce(CipherBox.java:1015)
        at sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:287)
        at sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:225)
        at sun.security.ssl.EngineWriter.writeRecord(EngineWriter.java:186)
        - locked <0x0000000671c5c978> (a sun.security.ssl.EngineWriter)
        at sun.security.ssl.SSLEngineImpl.writeRecord(SSLEngineImpl.java:1300)
        at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1271)
        - locked <0x0000000671ce7170> (a java.lang.Object)
        at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
        - locked <0x0000000671ce7150> (a java.lang.Object)
        at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
        at org.apache.kafka.common.network.SslTransportLayer.write(p.java:557)
        at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:146)
        at org.apache.kafka.common.network.MultiSend.writeTo(MultiSend.java:81)
        at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:292)
        at org.apache.kafka.common.network.KafkaChannel.send(KafkaChannel.java:158)
        at org.apache.kafka.common.network.KafkaChannel.write(KafkaChannel.java:146)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:329)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:283)
        at kafka.network.Processor.poll(SocketServer.scala:472)
        at kafka.network.Processor.run(SocketServer.scala:412)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

Of note is that all of the network threads are waiting on the same NativePRNG lock (0x0000000641508bf8)

> Allow configuration of the PRNG used for SSL
> --------------------------------------------
>
>                 Key: KAFKA-4050
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4050
>             Project: Kafka
>          Issue Type: Improvement
>          Components: security
>    Affects Versions: 0.10.0.1
>            Reporter: Todd Palino
>            Assignee: Todd Palino
>              Labels: security, ssl
>
> This change will make the pseudo-random number generator (PRNG)
> implementation used by the SSLContext configurable. The configuration is not
> required, and the default is to use whatever the default PRNG for the JDK/JRE
> is. Providing a string, such as "SHA1PRNG", will cause that specific
> SecureRandom implementation to get passed to the SSLContext.
> When enabling inter-broker SSL in our certification cluster, we observed
> severe performance issues. For reference, this cluster can take up to 600
> MB/sec of inbound produce traffic over SSL, with RF=2, before it gets close
> to saturation, and the mirror maker normally produces about 400 MB/sec
> (unless it is lagging). When we enabled inter-broker SSL, we saw persistent
> replication problems in the cluster at any inbound rate of more than about 6
> or 7 MB/sec per-broker. This was narrowed down to all the network threads
> blocking on a single lock in the SecureRandom code.
> It turns out that the default PRNG implementation on Linux is NativePRNG.
> This uses randomness from /dev/urandom (which, by itself, is a non-blocking
> read) and mixes it with randomness from SHA1.
> The problem is that the entire
> application shares a single SecureRandom instance, and NativePRNG has a
> global lock within the implNextBytes method. Switching to another
> implementation (SHA1PRNG, which has better performance characteristics and is
> still considered secure) completely eliminated the bottleneck and allowed the
> cluster to work properly at saturation.
> The SSLContext initialization has an optional argument to provide a
> SecureRandom instance, which the code currently sets to null. This change
> creates a new config to specify an implementation, and instantiates that and
> passes it to SSLContext if provided. This will also let someone select a
> stronger source of randomness (obviously at a performance cost) if desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
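
For illustration, the mechanism described in the issue (look up a SecureRandom implementation by name and hand it to SSLContext, or pass null to keep the JDK default) might look roughly like the sketch below. This is not the actual patch: the class name ConfigurablePrngSketch and the helpers secureRandomFor and buildContext are hypothetical, and the key managers and trust managers are left as null so the JDK defaults are used, which a real broker would not do.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import javax.net.ssl.SSLContext;

public class ConfigurablePrngSketch {

    // Hypothetical helper: maps an optional algorithm name to a SecureRandom.
    // Returns null when nothing is configured, so SSLContext.init falls back
    // to the default SecureRandom implementation of the JDK/JRE.
    static SecureRandom secureRandomFor(String algorithm) throws NoSuchAlgorithmException {
        if (algorithm == null || algorithm.isEmpty()) {
            return null;
        }
        // Throws NoSuchAlgorithmException if no provider supports the name.
        return SecureRandom.getInstance(algorithm);
    }

    // Hypothetical helper: builds an SSLContext using the selected PRNG.
    // Null key managers and trust managers select the installed defaults;
    // this keeps the sketch short and is an assumption, not Kafka's setup.
    static SSLContext buildContext(String prngAlgorithm) throws Exception {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, null, secureRandomFor(prngAlgorithm));
        return context;
    }

    public static void main(String[] args) throws Exception {
        // With an explicit algorithm, that implementation is passed through.
        SecureRandom random = secureRandomFor("SHA1PRNG");
        System.out.println("PRNG algorithm: " + random.getAlgorithm());

        // With no algorithm, the third argument to init is null, matching
        // the behaviour the issue describes as the current default.
        SSLContext defaultContext = buildContext(null);
        SSLContext sha1Context = buildContext("SHA1PRNG");
        System.out.println("Protocols: " + defaultContext.getProtocol()
                + ", " + sha1Context.getProtocol());
    }
}
```

Returning null for an unset value keeps the configuration optional, as the description requires, and an unknown algorithm name fails fast at startup instead of silently falling back.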