[ 
https://issues.apache.org/jira/browse/KAFKA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423839#comment-15423839
 ] 

Joel Koshy commented on KAFKA-4050:
-----------------------------------

A stack trace should help clarify further. (This is from a thread dump that 
Todd shared with us offline.) Thanks [~toddpalino] and [~mgharat] for finding 
this.

{noformat}
"kafka-network-thread-1393-SSL-30" #114 prio=5 os_prio=0 tid=0x00007f2ec8c30800 nid=0x5c1e waiting for monitor entry [0x00007f213b8f9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.security.provider.NativePRNG$RandomIO.implNextBytes(NativePRNG.java:481)
        - waiting to lock <0x0000000641508bf8> (a java.lang.Object)
        at sun.security.provider.NativePRNG$RandomIO.access$400(NativePRNG.java:329)
        at sun.security.provider.NativePRNG.engineNextBytes(NativePRNG.java:218)
        at java.security.SecureRandom.nextBytes(SecureRandom.java:468)
        - locked <0x000000066aad9880> (a java.security.SecureRandom)
        at sun.security.ssl.CipherBox.createExplicitNonce(CipherBox.java:1015)
        at sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:287)
        at sun.security.ssl.EngineOutputRecord.write(EngineOutputRecord.java:225)
        at sun.security.ssl.EngineWriter.writeRecord(EngineWriter.java:186)
        - locked <0x0000000671c5c978> (a sun.security.ssl.EngineWriter)
        at sun.security.ssl.SSLEngineImpl.writeRecord(SSLEngineImpl.java:1300)
        at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1271)
        - locked <0x0000000671ce7170> (a java.lang.Object)
        at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
        - locked <0x0000000671ce7150> (a java.lang.Object)
        at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
        at org.apache.kafka.common.network.SslTransportLayer.write(p.java:557)
        at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:146)
        at org.apache.kafka.common.network.MultiSend.writeTo(MultiSend.java:81)
        at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:292)
        at org.apache.kafka.common.network.KafkaChannel.send(KafkaChannel.java:158)
        at org.apache.kafka.common.network.KafkaChannel.write(KafkaChannel.java:146)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:329)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:283)
        at kafka.network.Processor.poll(SocketServer.scala:472)
        at kafka.network.Processor.run(SocketServer.scala:412)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

Note that all of the network threads are waiting on the same NativePRNG 
lock (0x0000000641508bf8).
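
For anyone who wants to check whether their own brokers would hit the same code path, a minimal sketch along these lines should report which SecureRandom implementation the JVM selects when no algorithm is requested. The class and method names here (DefaultPrngCheck, describeDefault) are made up for illustration and are not part of Kafka; the result depends on the JDK and platform.

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration only; not part of Kafka.
class DefaultPrngCheck {

    // Describes the SecureRandom the JVM picks when no algorithm is requested,
    // which is what an SSLContext falls back to when it is given a null SecureRandom.
    static String describeDefault() {
        SecureRandom random = new SecureRandom();
        return random.getAlgorithm() + " from provider " + random.getProvider().getName();
    }

    public static void main(String[] args) {
        // The printed value varies by JDK and operating system.
        System.out.println("Default SecureRandom: " + describeDefault());
    }
}
```

If this reports NativePRNG, the frames in the trace above are the ones the SSL write path would go through on that host.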

> Allow configuration of the PRNG used for SSL
> --------------------------------------------
>
>                 Key: KAFKA-4050
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4050
>             Project: Kafka
>          Issue Type: Improvement
>          Components: security
>    Affects Versions: 0.10.0.1
>            Reporter: Todd Palino
>            Assignee: Todd Palino
>              Labels: security, ssl
>
> This change will make the pseudo-random number generator (PRNG) 
> implementation used by the SSLContext configurable. The configuration is not 
> required, and the default is to use whatever the default PRNG for the JDK/JRE 
> is. Providing a string, such as "SHA1PRNG", will cause that specific 
> SecureRandom implementation to be passed to the SSLContext.
>
> When enabling inter-broker SSL in our certification cluster, we observed 
> severe performance issues. For reference, this cluster can take up to 600 
> MB/sec of inbound produce traffic over SSL, with RF=2, before it gets close 
> to saturation, and the mirror maker normally produces about 400 MB/sec 
> (unless it is lagging). When we enabled inter-broker SSL, we saw persistent 
> replication problems in the cluster at any inbound rate of more than about 6 
> or 7 MB/sec per broker. This was narrowed down to all the network threads 
> blocking on a single lock in the SecureRandom code.
>
> It turns out that the default PRNG implementation on Linux is NativePRNG. 
> This uses randomness from /dev/urandom (which, by itself, is a non-blocking 
> read) and mixes it with randomness from SHA1. The problem is that the entire 
> application shares a single SecureRandom instance, and NativePRNG has a 
> global lock within the implNextBytes method. Switching to another 
> implementation (SHA1PRNG, which has better performance characteristics and is 
> still considered secure) completely eliminated the bottleneck and allowed the 
> cluster to work properly at saturation.
>
> The SSLContext initialization has an optional argument to provide a 
> SecureRandom instance, which the code currently sets to null. This change 
> creates a new config to specify an implementation, and instantiates that and 
> passes it to SSLContext if provided. This will also let someone select a 
> stronger source of randomness (obviously at a performance cost) if desired.
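
The change described in the issue could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual patch: the class name, the method names, and the prngImplementation parameter are hypothetical, and the key and trust managers are left as null (JDK defaults) to keep the example self-contained. Only the standard SecureRandom.getInstance and SSLContext.init calls are used.

```java
import java.security.GeneralSecurityException;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import javax.net.ssl.SSLContext;

// Hypothetical sketch for illustration only; not the actual Kafka patch.
class ConfigurablePrngSketch {

    // Returns null when nothing is configured, so SSLContext keeps the JDK default PRNG.
    static SecureRandom createSecureRandom(String prngImplementation)
            throws NoSuchAlgorithmException {
        if (prngImplementation == null || prngImplementation.isEmpty()) {
            return null;
        }
        // Throws NoSuchAlgorithmException if no installed provider offers the algorithm.
        return SecureRandom.getInstance(prngImplementation);
    }

    // Builds an SSLContext whose third init argument is the configured SecureRandom.
    // Null key and trust managers mean "use the JDK defaults" in this sketch.
    static SSLContext createSslContext(String prngImplementation)
            throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, null, createSecureRandom(prngImplementation));
        return context;
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SSLContext context = createSslContext("SHA1PRNG");
        System.out.println("Created SSLContext for protocol " + context.getProtocol());
    }
}
```

Leaving the setting empty preserves the current behavior (a null SecureRandom), which matches the "configuration is not required" point in the description.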



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
