---------- Forwarded message ---------
From: Yash Khandelwal <khandelwalyash...@gmail.com>
Date: Tue, Nov 12, 2024 at 11:09 AM
Subject: Re: Intermittent Communication Timeout Issue Between Nodes in Mixed OS Ignite Cluster
To: <u...@ignite.apache.org>
Dear Apache Ignite Community,

I hope this message finds you well. I'm following up on my previous email regarding the intermittent connectivity issue in my Ignite 2.15 cluster (Linux-to-Windows). I have since been able to do further analysis and would like to share my findings and ask for additional insights or recommendations.

*Problem Summary:*
We migrated our cluster from *Ignite 2.8* to *Ignite 2.15*, and we are seeing TCP connections between our Linux and Windows nodes that are not closed as expected. Despite configuring idleConnectionTimeout to 30 seconds in Ignite's TcpCommunicationSpi, some connections remain open for extended periods and eventually fail with java.io.IOException: Connection timed out errors after up to 2 hours.

*Observations from Further Analysis:*

1. *TCP Keep-Alive*: The TCP keep-alive time is set to 2 hours on both the Linux and the Windows machines, and the firewall idle timeout is set to 1 hour. This could be contributing to the prolonged connection retention (see the JVM-level keep-alive sketch right after this list).

2. *Wireshark Analysis*: In Wireshark captures, the last acknowledged packet arrives approximately 1 to 2 hours before the timeout error. This aligns with the OS-level keep-alive and firewall timeout values and suggests that Ignite's idleConnectionTimeout is not taking effect on these connections.

3. *Log Snippets*: After a connection times out, Ignite cleans up the stale connection and re-establishes it, allowing communication to continue. Below is a detailed snippet from the logs indicating the sequence of events:

```
09:58:54:979603|2923-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {ERROR} Action {Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=3, bytesRcvd=450, bytesSent=42303, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-3, igniteInstanceName=TcpCommunicationSpi, finished=false, heartbeatTs=1730973533970, hashCode=1781702350, interrupted=false, runner=grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=null, outRecovery=GridNioRecoveryDescriptor [acked=220, resendCnt=0, rcvCnt=0, sentCnt=221, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=6bd0cb88-2519-4fa5-af61-c5ced899486e, consistentId=192.168.1.112,127.0.0.1:48900, addrs=ArrayList [192.168.1.112, 127.0.0.1], sockAddrs=null, discPort=48500, order=1, intOrder=1, lastExchangeTime=1730940088189, loc=false, ver=2.15.0#20230425-sha1:f98f7f35, isClient=false], connected=false, connectCnt=8, queueLimit=4096, reserveCnt=8, pairedConnections=true], closeSocket=true, outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1, super=GridNioSessionImpl [locAddr=/11.88.130.14:34662, rmtAddr=192.168.1.112:48102, createTime=1730967723110, closeTime=0, bytesSent=9307, bytesRcvd=70, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1730973524081, lastSndTime=1730972583303, lastRcvTime=1730973524081, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@7d0aac30, directMode=true], GridConnectionBytesVerifyFilter], accepted=false, markedForClose=false]]]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
java.io.IOException: Connection timed out
    at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
    at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330)
    at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:284)
    at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:259)
    at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
    at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1351)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2526)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2281)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
    at java.base/java.lang.Thread.run(Thread.java:840)
09:58:54:980744|0364-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {WARN} Action {Client disconnected abruptly due to network connection loss or because the connection was left open on application shutdown. [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection timed out]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
09:58:54:980987|0266-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {The node was disconnected [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e, err=Connection timed out]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
09:58:54:984441|2105-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {The client was removed [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e, client=GridTcpNioCommunicationClient [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=3, bytesRcvd=450, bytesSent=42303, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-3, igniteInstanceName=TcpCommunicationSpi, finished=false, heartbeatTs=1730973533970, hashCode=1781702350, interrupted=false, runner=grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=null, outRecovery=GridNioRecoveryDescriptor [acked=220, resendCnt=0, rcvCnt=0, sentCnt=221, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=6bd0cb88-2519-4fa5-af61-c5ced899486e, consistentId=192.168.1.112,127.0.0.1:48500, addrs=ArrayList [192.168.1.112, 127.0.0.1], sockAddrs=null, discPort=48500, order=1, intOrder=1, lastExchangeTime=1730940088189, loc=false, ver=2.15.0#20230425-sha1:f98f7f35, isClient=false], connected=false, connectCnt=8, queueLimit=4096, reserveCnt=8, pairedConnections=true], closeSocket=true, outboundMessagesQueueSizeMetric=org.apache.ignite.internal.processors.metric.impl.LongAdderMetric@69a257d1, super=GridNioSessionImpl [locAddr=/11.88.130.14:34662, rmtAddr=192.168.1.112:48102, createTime=1730967723110, closeTime=1730973534976, bytesSent=9307, bytesRcvd=70, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1730973524081, lastSndTime=1730972583303, lastRcvTime=1730973524081, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=org.apache.ignite.internal.util.nio.GridDirectParser@7d0aac30, directMode=true], GridConnectionBytesVerifyFilter], accepted=false, markedForClose=true]], super=GridAbstractCommunicationClient [lastUsed=1730967723110, closed=true, connIdx=0]]].} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
09:58:54:984517|0295-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {Session was closed but there are unacknowledged messages, will try to reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
09:58:54:984899|0229-00121: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {Recovery reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread {tcp-comm-worker-#1%WebCluster%-#28%WebCluster%}
09:58:54:984908|0250-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {The node was disconnected [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e, err=null]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
09:58:54:984960|0269-00121: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {The node client is going to reserve a connection [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e, connIdx=0]} Thread {tcp-comm-worker-#1%WebCluster%-#28%WebCluster%}
09:58:54:985005|0295-00119: [class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action {Session was closed but there are unacknowledged messages, will try to reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
```

4. *Issue Consistency*: The behavior occurs intermittently, but the pattern is consistent: the connection is closed only after a 1-hour or 2-hour period of inactivity.
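Regarding observation 1: below is a minimal, self-contained Java sketch of the kind of JVM-level keep-alive tuning that could sidestep the 2-hour OS default (jdk.net.ExtendedSocketOptions, JDK 11+). The 60-second/10-second/5-probe values are purely illustrative, not settings we currently run, and the peer address is just the one from the log above. As far as I can tell, Ignite's TcpCommunicationSpi does not expose these per-socket options directly, which is part of what I am asking about.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.StandardSocketOptions;

import jdk.net.ExtendedSocketOptions;

public class KeepAliveSketch {
    public static void main(String[] args) throws IOException {
        // Peer address taken from the log snippet above, used here purely for illustration.
        InetSocketAddress peer = new InetSocketAddress("192.168.1.112", 48102);

        try (Socket sock = new Socket()) {
            // Standard keep-alive flag; without the extended options below, probe timing
            // falls back to the OS default (2 hours on both of our machines).
            sock.setOption(StandardSocketOptions.SO_KEEPALIVE, true);

            // JDK 11+ extended options; platform support varies, so check each one first.
            if (sock.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE))
                sock.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);     // idle seconds before the first probe (example value)
            if (sock.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPINTERVAL))
                sock.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 10); // seconds between probes (example value)
            if (sock.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPCOUNT))
                sock.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // failed probes before the connection is dropped (example value)

            sock.connect(peer, 10_000);
            System.out.println("Connected; keep-alive options applied where supported.");
        }
    }
}
```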
*Stack Overflow Link:*
I have posted a detailed analysis and a follow-up question on Stack Overflow, where I seek advice regarding the root cause and any necessary configuration adjustments:
- *Stack Overflow Link to My Question <https://stackoverflow.com/questions/79170201/apache-ignite-idle-connection-timeout-not-working-consistently-between-linux-and>*

*Questions:*
- *Why are some connections not closed after the configured idleConnectionTimeout of 30 seconds?*
- *Are there any additional Ignite settings or OS configurations I should tweak to ensure idle connections are closed promptly?*
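For reference, here is a plain-Java equivalent of the TcpCommunicationSpi configuration from my original email (quoted below). The values mirror the XML; the surrounding IgniteConfiguration setup is only a minimal sketch, not our full startup code.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class CommSpiConfigSketch {
    public static void main(String[] args) {
        // Communication SPI settings, matching the XML bean quoted below.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalPort(48100);
        commSpi.setLocalPortRange(20);
        commSpi.setSlowClientQueueLimit(100_000);
        commSpi.setIdleConnectionTimeout(30_000);   // 30 s; the timeout that does not appear to fire
        commSpi.setUsePairedConnections(true);

        // Node-level settings, matching the remaining properties in the quoted config.
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        cfg.setFailureDetectionTimeout(10_000);
        cfg.setClientFailureDetectionTimeout(30_000);
        cfg.setSystemWorkerBlockedTimeout(300_000);
        cfg.setPeerClassLoadingEnabled(false);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Node runs here; the rest of our startup code is omitted.
        }
    }
}
```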
I'd appreciate any insights or suggestions based on your experience with similar issues or with cross-platform Ignite setups. Thank you for your time and help!

Best regards,
Yash Khandelwal

On Mon, Nov 4, 2024 at 10:50 AM Yash Khandelwal <khandelwalyash...@gmail.com> wrote:

> Dear Apache Ignite Community,
>
> I am currently facing an intermittent issue in an Ignite 2.15 cluster
> setup with the following configuration:
>
> *Cluster Setup:*
> - Ignite version: 2.15
> - One Ignite server node and five additional nodes (node1 to node5).
> - Nodes 1 to 4 and the server node run on Windows machines, while node5
>   runs on a Linux machine.
> - Each node has distinct functions and configurations.
>
> *Issue Details:*
> Intermittently, communication from node5 (Linux) to node1 (Windows) fails,
> and the following exception appears in the logs:
>
> ```
> java.io.IOException: Connection timed out
>     at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
>     at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
>     at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330)
>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:284)
>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:259)
>     at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
>     at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1351)
>     ...
> ```
>
> This occurs even though no system-critical-thread logs indicate resource
> starvation or high CPU usage.
>
> *Observations:*
> 1. The last read and write actions on node5 occur approximately 15 minutes
>    and 50 seconds before the exception is logged. This pattern is consistent
>    across multiple occurrences.
> 2. The `idleConnectionTimeout` is configured to 30 seconds, yet no timeout
>    fires after that duration, even when the connection has been inactive for
>    more than 15 minutes.
>
> *Configuration Summary:*
>
> ```xml
> <bean id="communicationSpiBean"
>       class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
>     <property name="localPort" value="48100" />
>     <property name="localPortRange" value="20" />
>     <property name="slowClientQueueLimit" value="100000" />
>     <property name="idleConnectionTimeout" value="30000" />
>     <property name="usePairedConnections" value="true" />
> </bean>
>
> <!-- Properties set on the IgniteConfiguration bean -->
> <property name="failureDetectionTimeout" value="10000"/>
> <property name="clientFailureDetectionTimeout" value="30000"/>
> <property name="systemWorkerBlockedTimeout" value="300000"/>
> <property name="peerClassLoadingEnabled" value="false"/>
> ```
>
> *Request for Assistance:*
> I am looking for guidance or similar experiences that might help identify
> the root cause of this problem. Specifically:
> - Insights into why `idleConnectionTimeout` does not trigger as expected
>   after prolonged inactivity.
> - An explanation for the consistent 15-minute-50-second inactivity gap that
>   precedes every occurrence of the issue.
> - Potential configuration adjustments or known issues related to mixed-OS
>   (Windows and Linux) environments impacting communication.
> - Any recommendations on handling cross-platform network idiosyncrasies in
>   Ignite setups.
>
> Any advice or assistance from the community would be greatly appreciated.
>
> Thank you in advance for your support.
>
> Best regards,
> Yash Khandelwal
> mail: khandelwalyash...@gmail.com
>