Hello! I have tried running SSL tests again and they seem to pass (only one test fails for some different reason)
Can you try running those 2 nodes in stand-alone processes, see if problem persists? I can see that you have SSL-enabled Tomcat running in same VM, which I imaging could interfere with Ignite's SSL. Note that you will need to do some load (such as REST cache operations) to see if communication indeed works (or doesn't). Regards, -- Ilya Kasnacheev ср, 28 нояб. 2018 г. в 16:01, Loredana Radulescu Ivanoff <[email protected] >: > Hello again, > > I haven't been able to solve this issue on my own, so I'm hoping you'd be > able to take another look. > > To recap: only with Java 11 and TLS enabled, the second node I bring in > the cluster never starts up, and remains stuck at "Still waiting for > initial partition map exchange". The first nodes keeps logging "Unable to > await partitions release latch within timeout". To me, this looks like an > Ignite issue, and no matter what causes the situation (arguably in this > case an SSL error), there should be a more elegant exit out of it, i.e. the > second node should give up after a while, if there isn't a better way to > retry and achieve successful communication. The two nodes are able to > communicate, and increasing various network timeouts/failure detection > timeout does not help. > > Previously it was mentioned that the Ignite unit test did not show a > repro. I suggest running a test that uses two different machines, because > when I run the nodes on the same machine, I do not get a repro either. > > Attaching here logging from the two nodes including SSL messages. > > Is Ignite support for Java 11 going to be available before Oracle ends > free commercial support for Java 1.8 in Jan 2019? > > Thank you. > > On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <[email protected]> > wrote: > >> Hello! >> >> I have tried to run the test with protocol "TLSv1.2", didn't see any >> difference. >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> ср, 24 окт. 2018 г. в 20:23, Loredana Radulescu Ivanoff < >> [email protected]>: >> >>> Hello again, >>> >>> I am working on getting the full SSL logs over to you, but I have one >>> more question in between: TLS v1.3 is enabled by default in JDK 11, and my >>> app is using TLS v1.2 specifically. There's a known issue that's recently >>> addressed by the JDK related to TLS v1.3 half close policy, details here: >>> https://bugs.java.com/view_bug.do?bug_id=8207009 >>> >>> Would you be able to confirm whether your SSL test runs successfully >>> when the connecting client/server use TLS v.12 specifically ? >>> >>> FYI, I have tried specifically disabling TLS v1.3 using both the >>> "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system >>> properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on >>> both sides as indicated here - >>> https://bugs.java.com/view_bug.do?bug_id=8208526 >>> >>> Based on my explorations so far, I think this may be a JDK issue >>> (specifically in the JSSE provider) that has not been addressed yet. Either >>> way, do you think three is anything that could be done in Ignite to >>> explicitly close the connection on both sides in a scenario like this ? >>> >>> What I can safely share on the SSL logs so far is this (both nodes get >>> stuck, node 1 in failing to close the SSL connection, node 2 in waiting for >>> partition exchange) >>> >>> Node 1: >>> >>> "2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of >>> SSLEngine >>> javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23 >>> 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing >>> inbound before receiving peer's close_notify ( >>> "throwable" : { >>> javax.net.ssl.SSLException: closing inbound before receiving peer's >>> close_notify >>> at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129) >>> at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117) >>> at >>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308) >>> at >>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264) >>> at >>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255) >>> at >>> java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724) >>> at >>> org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185) >>> at >>> org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223) >>> at >>> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) >>> at >>> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447) >>> at >>> org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149) >>> at >>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689) >>> at >>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002) >>> at >>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764) >>> at >>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) >>> at java.base/java.lang.Thread.run(Thread.java:834)} >>> >>> ) >>> 2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated >>> session: Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) >>> DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable >>> to correctly close inbound data stream (will ignore) " >>> >>> >>> Server 2 >>> >>> "javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 >>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12 >>> application_data, length = 1 >>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 >>> 14:18:09.194 PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION ( >>> 0000: 01 . >>> ) >>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 >>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:324|Raw write ( >>> 0000: 17 03 03 00 19 00 00 00 00 00 00 00 A1 45 E5 C8 >>> .............E.. >>> 0010: 27 3D EE E2 89 4A 0F DD B6 9A 2A A8 5F 6B '=...J....*._k >>> ) >>> WARN 14:18:10,585-0700 [*Initialization*] >>> processors.cache.GridCachePartitionExchangeManager: Still waiting for >>> initial partition map exchange " >>> >>> On Mon, Oct 22, 2018 at 12:31 AM Ilya Kasnacheev < >>> [email protected]> wrote: >>> >>>> Hello! >>>> >>>> I would suggest regular (or DEBUG) Ignite logs + SSL debug logs. >>>> >>>> >>>> https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html >>>> >>>> Regards, >>>> -- >>>> Ilya Kasnacheev >>>> >>>> >>>> пт, 19 окт. 2018 г. в 20:58, Loredana Radulescu Ivanoff < >>>> [email protected]>: >>>> >>>>> Definitely - do you want debug logs, and if yes, should I enable them >>>>> on all Ignite packages, or just for certain ones? Any other diagnostic >>>>> flags that I should set? >>>>> >>>>> On Fri, Oct 19, 2018 at 7:48 AM Ilya Kasnacheev < >>>>> [email protected]> wrote: >>>>> >>>>>> Hello! >>>>>> >>>>>> Apache Ignite does not claim support JDK 11 at this moment. However, >>>>>> I was able to run SSL test (TcpDiscoverySslTrustedUntrustedTest) and it >>>>>> turned out mostly fine. >>>>>> >>>>>> More info is needed from your side, such as full instances logs. >>>>>> >>>>>> Regards, >>>>>> -- >>>>>> Ilya Kasnacheev >>>>>> >>>>>> >>>>>> пт, 19 окт. 2018 г. в 0:56, Loredana Radulescu Ivanoff < >>>>>> [email protected]>: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> I can consistently reproduce this issue with Ignite 2.6.0, JDK 11 >>>>>>> and SSL enabled: >>>>>>> >>>>>>> >>>>>>> - the second node that I bring up joins, and then shortly after >>>>>>> freezes and prints this message every minute: >>>>>>> >>>>>>> "WARN ...[*Initialization*] >>>>>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for >>>>>>> initial partition map exchange" >>>>>>> >>>>>>> >>>>>>> - once the second node joins, the first node starts experiencing >>>>>>> very frequent 100% CPU spikes; these are the messages I see: >>>>>>> >>>>>>> WARN 2018-10-18T13:50:52,728-0700 [] >>>>>>> communication.tcp.TcpCommunicationSpi: Communication SPI session write >>>>>>> timed out (consider increasing 'socketWriteTimeout' configuration >>>>>>> property) >>>>>>> [remoteAddr=/10.100.36.82:51620, writeTimeout=15000] >>>>>>> WARN 2018-10-18T13:50:52,737-0700 [] >>>>>>> communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session >>>>>>> gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect >>>>>>> SSL >>>>>>> engine status after closeOutbound call [status=OK, >>>>>>> handshakeStatus=NEED_WRAP, >>>>>>> WARN 2018-10-18T13:51:01,441-0700 [] >>>>>>> dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await >>>>>>> partitions >>>>>>> release latch within timeout: ServerLatch [permits=1, >>>>>>> pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], >>>>>>> super=CompletableLatch >>>>>>> [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]] >>>>>>> >>>>>>> Other observations: >>>>>>> >>>>>>> I can reproduce this every time I start the nodes, and it doesn't >>>>>>> matter which node comes up first. >>>>>>> >>>>>>> >>>>>>> The issue goes away if I disable SSL. >>>>>>> >>>>>>> >>>>>>> Increasing the socketWriteTimeout, networkTimeout or the >>>>>>> failureDetectionTimeout does not help. >>>>>>> >>>>>>> It seems to be happening only with JDK 11, and not with JDK 8. >>>>>>> >>>>>>> >>>>>>> Do you have any suggestions/known issues about this? >>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> Loredana >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>
