Hello!

I have tried running SSL tests again and they seem to pass (only one test
fails for some different reason)

Can you try running those 2 nodes in stand-alone processes, see if problem
persists? I can see that you have SSL-enabled Tomcat running in same VM,
which I imaging could interfere with Ignite's SSL.

Note that you will need to do some load (such as REST cache operations) to
see if communication indeed works (or doesn't).

Regards,
-- 
Ilya Kasnacheev


ср, 28 нояб. 2018 г. в 16:01, Loredana Radulescu Ivanoff <[email protected]
>:

> Hello again,
>
> I haven't been able to solve this issue on my own, so I'm hoping you'd be
> able to take another look.
>
> To recap: only with Java 11 and TLS enabled, the second node I bring in
> the cluster never starts up, and remains stuck at "Still waiting for
> initial partition map exchange". The first nodes  keeps logging "Unable to
> await partitions release latch within timeout". To me, this looks like an
> Ignite issue, and no matter what causes the situation (arguably in this
> case an SSL error), there should be a more elegant exit out of it, i.e. the
> second node should give up after a while, if there isn't a better way to
> retry and achieve successful communication. The two nodes are able to
> communicate, and increasing various network timeouts/failure detection
> timeout does not help.
>
> Previously it was mentioned that the Ignite unit test did not show a
> repro. I suggest running a test that uses two different machines, because
> when I run the nodes on the same machine, I do not get a repro either.
>
> Attaching here logging from the two nodes including SSL messages.
>
> Is Ignite support for Java 11 going to be available before Oracle ends
> free commercial support for Java 1.8 in Jan 2019?
>
> Thank you.
>
> On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <[email protected]>
> wrote:
>
>> Hello!
>>
>> I have tried to run the test with protocol "TLSv1.2", didn't see any
>> difference.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> ср, 24 окт. 2018 г. в 20:23, Loredana Radulescu Ivanoff <
>> [email protected]>:
>>
>>> Hello again,
>>>
>>> I am working on getting the full SSL logs over to you, but I have one
>>> more question in between: TLS v1.3 is enabled by default in JDK 11, and my
>>> app is using TLS v1.2 specifically. There's a known issue that's recently
>>> addressed by the JDK related to TLS v1.3 half close policy, details here:
>>> https://bugs.java.com/view_bug.do?bug_id=8207009
>>>
>>> Would you be able to confirm whether your SSL test runs successfully
>>> when the connecting client/server use TLS v.12 specifically ?
>>>
>>> FYI, I have tried specifically disabling TLS v1.3 using both the
>>> "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system
>>> properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on
>>> both sides as indicated here -
>>> https://bugs.java.com/view_bug.do?bug_id=8208526
>>>
>>> Based on my explorations so far, I think this may be a JDK issue
>>> (specifically in the JSSE provider) that has not been addressed yet. Either
>>> way, do you think three is anything that could be done in Ignite to
>>> explicitly close the connection on both sides in a scenario like this ?
>>>
>>> What I can safely share on the SSL logs so far is this (both nodes get
>>> stuck, node 1 in failing to close the SSL connection, node 2 in waiting for
>>> partition exchange)
>>>
>>> Node 1:
>>>
>>> "2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of
>>> SSLEngine
>>> javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23
>>> 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing
>>> inbound before receiving peer's close_notify (
>>> "throwable" : {
>>>   javax.net.ssl.SSLException: closing inbound before receiving peer's
>>> close_notify
>>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129)
>>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
>>>   at
>>> java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724)
>>>   at
>>> org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185)
>>>   at
>>> org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
>>>   at
>>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>>   at java.base/java.lang.Thread.run(Thread.java:834)}
>>>
>>> )
>>> 2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated
>>> session:  Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
>>> DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable
>>> to correctly close inbound data stream (will ignore) "
>>>
>>>
>>> Server 2
>>>
>>> "javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12
>>> application_data, length = 1
>>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION (
>>>   0000: 01                                                 .
>>> )
>>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:324|Raw write (
>>>   0000: 17 03 03 00 19 00 00 00   00 00 00 00 A1 45 E5 C8
>>> .............E..
>>>   0010: 27 3D EE E2 89 4A 0F DD   B6 9A 2A A8 5F 6B        '=...J....*._k
>>> )
>>> WARN 14:18:10,585-0700 [*Initialization*]
>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>>> initial partition map exchange "
>>>
>>> On Mon, Oct 22, 2018 at 12:31 AM Ilya Kasnacheev <
>>> [email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> I would suggest regular (or DEBUG) Ignite logs + SSL debug logs.
>>>>
>>>>
>>>> https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пт, 19 окт. 2018 г. в 20:58, Loredana Radulescu Ivanoff <
>>>> [email protected]>:
>>>>
>>>>> Definitely - do you want debug logs, and if yes, should I enable them
>>>>> on all Ignite packages, or just for certain ones? Any other diagnostic
>>>>> flags that I should set?
>>>>>
>>>>> On Fri, Oct 19, 2018 at 7:48 AM Ilya Kasnacheev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Apache Ignite does not claim support JDK 11 at this moment. However,
>>>>>> I was able to run SSL test (TcpDiscoverySslTrustedUntrustedTest) and it
>>>>>> turned out mostly fine.
>>>>>>
>>>>>> More info is needed from your side, such as full instances logs.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пт, 19 окт. 2018 г. в 0:56, Loredana Radulescu Ivanoff <
>>>>>> [email protected]>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I can consistently reproduce this issue with Ignite 2.6.0, JDK 11
>>>>>>> and SSL enabled:
>>>>>>>
>>>>>>>
>>>>>>>    - the second node that I bring up joins, and then shortly after
>>>>>>>    freezes and prints this message every minute:
>>>>>>>
>>>>>>> "WARN ...[*Initialization*]
>>>>>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>>>>>>> initial partition map exchange"
>>>>>>>
>>>>>>>
>>>>>>>    - once the second node joins, the first node starts experiencing
>>>>>>>    very frequent 100% CPU spikes; these are the messages I see:
>>>>>>>
>>>>>>> WARN 2018-10-18T13:50:52,728-0700 []
>>>>>>> communication.tcp.TcpCommunicationSpi: Communication SPI session write
>>>>>>> timed out (consider increasing 'socketWriteTimeout' configuration 
>>>>>>> property)
>>>>>>> [remoteAddr=/10.100.36.82:51620, writeTimeout=15000]
>>>>>>> WARN 2018-10-18T13:50:52,737-0700 []
>>>>>>> communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session
>>>>>>> gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect 
>>>>>>> SSL
>>>>>>> engine status after closeOutbound call [status=OK,
>>>>>>> handshakeStatus=NEED_WRAP,
>>>>>>> WARN 2018-10-18T13:51:01,441-0700 []
>>>>>>> dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await 
>>>>>>> partitions
>>>>>>> release latch within timeout: ServerLatch [permits=1,
>>>>>>> pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], 
>>>>>>> super=CompletableLatch
>>>>>>> [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]]
>>>>>>>
>>>>>>> Other observations:
>>>>>>>
>>>>>>> I can reproduce this every time I start the nodes, and it doesn't
>>>>>>> matter which node comes up first.
>>>>>>>
>>>>>>>
>>>>>>> The issue goes away if I disable SSL.
>>>>>>>
>>>>>>>
>>>>>>> Increasing the socketWriteTimeout, networkTimeout or the
>>>>>>> failureDetectionTimeout does not help.
>>>>>>>
>>>>>>> It seems to be happening only with JDK 11, and not with JDK 8.
>>>>>>>
>>>>>>>
>>>>>>> Do you have any suggestions/known issues about this?
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Loredana
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Reply via email to