Hi Jeremy,

Thank you for the detailed response. I understand the distinction between idleConnectionTimeout and connection timeouts, but we are facing a specific issue that I hope you can help clarify.

Current Behavior:
- We’ve set idleConnectionTimeout to 30 seconds, expecting idle connections to be closed after this period. However, we are observing that some connections remain open much longer (up to 1 or 2 hours) without any log entry indicating that idleConnectionTimeout was triggered.
- These connections then fail with a "Failed to process selector key ..." error (java.io.IOException: Connection timed out) when our clients attempt to use them after an extended idle period.
- This behavior is inconsistent and only affects some connections, suggesting that idleConnectionTimeout is not being triggered for these specific cases.

My Questions:
1. Why is Ignite’s idleConnectionTimeout not closing these connections, even when they remain idle for extended periods?
2. Could this issue be related to external factors such as OS-level TCP keep-alive settings or firewall configurations? (See the sketch below for the kind of keep-alive timers I mean.)
3. What would be the recommended approach to troubleshoot why idleConnectionTimeout is not behaving as expected for certain connections?
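For concreteness, question 2 refers to timers like the ones below. This is a minimal sketch, assuming JDK 11+ and a platform that exposes jdk.net.ExtendedSocketOptions; Ignite manages its own sockets internally, so this only illustrates which OS-level knobs are in play (the class and values are hypothetical, not code we run):

```java
import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveTuning {
    // Shorten per-socket keep-alive so a dead peer is detected in roughly
    // 2 minutes (60 s idle + 5 probes * 10 s) instead of the 2-hour kernel
    // default that matches the hangs we observe.
    static void tune(Socket s) throws IOException {
        s.setKeepAlive(true); // enable keep-alive probing at all
        if (s.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            s.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);     // idle seconds before the first probe
            s.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 10); // seconds between probes
            s.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // failed probes before the OS drops the socket
        }
    }
}
```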
Any guidance on how to further diagnose or resolve this issue would be greatly appreciated. Thank you for your support!

Best regards,
Yash Khandelwal

On Wed, Nov 13, 2024 at 4:21 AM Jeremy McMillan <j...@gridgain.com> wrote:

> Idle connection timeout errors are a (noise) nuisance in log files, but
> not operationally fatal unless you run out of log space...
>
> If you want to determine whether Ignite idle timeouts are working, it may
> help to attach a debugger to a running Ignite instance, connect some
> clients, let them idle for a while, and then inspect the session objects'
> idle timeouts to see whether they are properly counting idle time. It's
> good that you used Wireshark to rule out TCP keepalive packets or other
> traffic which might affect socket timers.
>
> Connection timeouts are a completely different thing: those are timers
> expiring between the moment an initial connection request arrives and the
> moment a session can be established, so please treat the two (idle
> timeouts and connection timeouts) as distinct. Connection timeouts
> usually indicate client misconfiguration if only some connections fail,
> and server misconfiguration if all connections fail this way (check SSL
> settings first).
>
> On Mon, Nov 11, 2024 at 11:57 PM Yash Khandelwal <
> khandelwalyash...@gmail.com> wrote:
>
>> ---------- Forwarded message ---------
>> From: Yash Khandelwal <khandelwalyash...@gmail.com>
>> Date: Tue, Nov 12, 2024 at 11:09 AM
>> Subject: Re: Intermittent Communication Timeout Issue Between Nodes in
>> Mixed OS Ignite Cluster
>> To: <user@ignite.apache.org>
>>
>> Dear Apache Ignite Community,
>>
>> I hope this message finds you well. I’m following up on my previous
>> email regarding the intermittent connectivity issue in my Ignite 2.15
>> cluster (Linux-to-Windows). I have since conducted further analysis and
>> would like to share my findings and ask for additional insights or
>> recommendations.
>>
>> Problem Summary:
>>
>> We migrated from *Ignite 2.8* to *Ignite 2.15* in our cluster, and we
>> are encountering issues where TCP connections between our Linux and
>> Windows nodes are not being closed as expected. Despite configuring
>> idleConnectionTimeout to 30 seconds in Ignite's TcpCommunicationSpi,
>> some connections remain open for extended periods, leading to
>> java.io.IOException: Connection timed out errors after up to 2 hours.
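>>
>> For reference, here are those settings expressed programmatically. This
>> is only a sketch that mirrors the XML configuration quoted in my first
>> mail at the bottom of this thread (the class name is just for
>> illustration):
>>
>> ```java
>> import org.apache.ignite.Ignition;
>> import org.apache.ignite.configuration.IgniteConfiguration;
>> import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
>>
>> public class CommSpiConfig {
>>     public static void main(String[] args) {
>>         TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
>>         commSpi.setLocalPort(48100);
>>         commSpi.setLocalPortRange(20);
>>         commSpi.setSlowClientQueueLimit(100_000);
>>         commSpi.setIdleConnectionTimeout(30_000); // 30 s; the timeout that does not appear to fire
>>         commSpi.setUsePairedConnections(true);
>>
>>         IgniteConfiguration cfg = new IgniteConfiguration();
>>         cfg.setCommunicationSpi(commSpi);
>>         cfg.setFailureDetectionTimeout(10_000);
>>         cfg.setClientFailureDetectionTimeout(30_000);
>>         cfg.setSystemWorkerBlockedTimeout(300_000L);
>>         cfg.setPeerClassLoadingEnabled(false);
>>
>>         Ignition.start(cfg); // start a node with exactly these settings
>>     }
>> }
>> ```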
>>
>> Observations from Further Analysis:
>>
>> 1. *TCP Keep-Alive*: On Linux and Windows, the TCP keep-alive intervals
>> are both set to 2 hours, and the firewall timeout is set to 1 hour. This
>> could be contributing to the prolonged connection retention.
>>
>> 2. *Wireshark Analysis*: Using Wireshark, I observed that the last
>> packet with an ACK was received approximately 1 or 2 hours before the
>> timeout error, aligning with the OS-level and firewall timeout values,
>> which might be preventing Ignite’s idleConnectionTimeout from being
>> triggered.
>>
>> 3. *Log Snippets*: After a connection timeout, Ignite cleans up the
>> stale connection and re-establishes it, allowing communication to
>> continue. Below is a detailed snippet from the logs indicating the
>> sequence of events:
>>
>> 09:58:54:979603|2923-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {ERROR} Action
>> {Failed to process selector key [ses=GridSelectorNioSessionImpl
>> [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=3,
>> bytesRcvd=450, bytesSent=42303, bytesRcvd0=0, bytesSent0=0, select=true,
>> super=GridWorker [name=grid-nio-worker-tcp-comm-3,
>> igniteInstanceName=TcpCommunicationSpi, finished=false,
>> heartbeatTs=1730973533970, hashCode=1781702350, interrupted=false,
>> runner=grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%]]],
>> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
>> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
>> inRecovery=null, outRecovery=GridNioRecoveryDescriptor [acked=220,
>> resendCnt=0, rcvCnt=0, sentCnt=221, reserved=true, lastAck=0,
>> nodeLeft=false, node=TcpDiscoveryNode
>> [id=6bd0cb88-2519-4fa5-af61-c5ced899486e,
>> consistentId=192.168.1.112,127.0.0.1:48900, addrs=ArrayList [192.168.1.112,
>> 127.0.0.1], sockAddrs=null, discPort=48500, order=1, intOrder=1,
>> lastExchangeTime=1730940088189, loc=false,
>> ver=2.15.0#20230425-sha1:f98f7f35, isClient=false], connected=false,
>> connectCnt=8, queueLimit=4096, reserveCnt=8, pairedConnections=true],
>> closeSocket=true,
>> outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1,
>> super=GridNioSessionImpl [locAddr=/11.88.130.14:34662,
>> rmtAddr=192.168.1.112:48102, createTime=1730967723110, closeTime=0,
>> bytesSent=9307, bytesRcvd=70, bytesSent0=0, bytesRcvd0=0,
>> sndSchedTime=1730973524081, lastSndTime=1730972583303,
>> lastRcvTime=1730973524081, readsPaused=false,
>> filterChain=FilterChain[filters=[GridNioCodecFilter
>> [parser=o.a.i.i.util.nio.GridDirectParser@7d0aac30, directMode=true],
>> GridConnectionBytesVerifyFilter], accepted=false, markedForClose=false]]]}
>> Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> java.io.IOException: Connection timed out
>>     at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
>>     at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
>>     at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330)
>>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:284)
>>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:259)
>>     at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
>>     at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1351)
>>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2526)
>>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2281)
>>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
>>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
>>     at java.base/java.lang.Thread.run(Thread.java:840)
>>
>> 09:58:54:980744|0364-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {WARN} Action
>> {Client disconnected abruptly due to network connection loss or because the
>> connection was left open on application shutdown. [cls=class
>> o.a.i.i.util.nio.GridNioException, msg=Connection timed out]} Thread
>> {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 09:58:54:980987|0266-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {The node was disconnected [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e,
>> err=Connection timed out]} Thread
>> {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 09:58:54:984441|2105-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {The client was removed [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e,
>> client=GridTcpNioCommunicationClient [ses=GridSelectorNioSessionImpl
>> [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=3,
>> bytesRcvd=450, bytesSent=42303, bytesRcvd0=0, bytesSent0=0, select=true,
>> super=GridWorker [name=grid-nio-worker-tcp-comm-3,
>> igniteInstanceName=TcpCommunicationSpi, finished=false,
>> heartbeatTs=1730973533970, hashCode=1781702350, interrupted=false,
>> runner=grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%]]],
>> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
>> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
>> inRecovery=null, outRecovery=GridNioRecoveryDescriptor [acked=220,
>> resendCnt=0, rcvCnt=0, sentCnt=221, reserved=true, lastAck=0,
>> nodeLeft=false, node=TcpDiscoveryNode
>> [id=6bd0cb88-2519-4fa5-af61-c5ced899486e,
>> consistentId=192.168.1.112,127.0.0.1:48500, addrs=ArrayList [192.168.1.112,
>> 127.0.0.1], sockAddrs=null, discPort=48500, order=1, intOrder=1,
>> lastExchangeTime=1730940088189, loc=false,
>> ver=2.15.0#20230425-sha1:f98f7f35, isClient=false], connected=false,
>> connectCnt=8, queueLimit=4096, reserveCnt=8, pairedConnections=true],
>> closeSocket=true,
>> outboundMessagesQueueSizeMetric=org.apache.ignite.internal.processors.metric.impl.LongAdderMetric@69a257d1,
>> super=GridNioSessionImpl [locAddr=/11.88.130.14:34662,
>> rmtAddr=192.168.1.112:48102, createTime=1730967723110,
>> closeTime=1730973534976, bytesSent=9307, bytesRcvd=70, bytesSent0=0,
>> bytesRcvd0=0, sndSchedTime=1730973524081, lastSndTime=1730972583303,
>> lastRcvTime=1730973524081, readsPaused=false,
>> filterChain=FilterChain[filters=[GridNioCodecFilter
>> [parser=org.apache.ignite.internal.util.nio.GridDirectParser@7d0aac30,
>> directMode=true], GridConnectionBytesVerifyFilter], accepted=false,
>> markedForClose=true]], super=GridAbstractCommunicationClient
>> [lastUsed=1730967723110, closed=true, connIdx=0]]].} Thread
>> {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 09:58:54:984517|0295-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {Session was closed but there are unacknowledged messages, will try to
>> reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread
>> {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 09:58:54:984899|0229-00121: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {Recovery reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread
>> {tcp-comm-worker-#1%WebCluster%-#28%WebCluster%}
>>
>> 09:58:54:984908|0250-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {The node was disconnected [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e,
>> err=null]} Thread {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 09:58:54:984960|0269-00121: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {The node client is going to reserve a connection
>> [nodeId=6bd0cb88-2519-4fa5-af61-c5ced899486e, connIdx=0]} Thread
>> {tcp-comm-worker-#1%WebCluster%-#28%WebCluster%}
>>
>> 09:58:54:985005|0295-00119: [class
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]: {DEBUG} Action
>> {Session was closed but there are unacknowledged messages, will try to
>> reconnect [rmtNode=6bd0cb88-2519-4fa5-af61-c5ced899486e]} Thread
>> {grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%}
>>
>> 4. *Issue Consistency*: This behavior occurs intermittently, but the
>> pattern is consistent with a 1-hour or 2-hour inactivity period before
>> the connection is closed.
>>
>> Stack Overflow Link:
>>
>> I have posted a detailed analysis and follow-up question on Stack
>> Overflow, where I seek advice regarding the root cause and any necessary
>> configuration adjustments:
>>
>> - *Stack Overflow Link to My Question
>> <https://stackoverflow.com/questions/79170201/apache-ignite-idle-connection-timeout-not-working-consistently-between-linux-and>*
>>
>> Questions:
>>
>> - *Why are some connections not closing after the configured
>> idleConnectionTimeout of 30 seconds?*
>> - *Are there any additional Ignite settings or OS configurations I
>> should tweak to ensure idle connections are closed promptly?*
>>
>> I’d appreciate any insights or suggestions based on your experience with
>> similar issues or cross-platform Ignite setups. Thank you for your time
>> and help!
>>
>> Best regards,
>>
>> On Mon, Nov 4, 2024 at 10:50 AM Yash Khandelwal <
>> khandelwalyash...@gmail.com> wrote:
>>
>>> Dear Apache Ignite Community,
>>>
>>> I am currently facing an intermittent issue in an Ignite 2.15 cluster
>>> with the following setup:
>>>
>>> *Cluster Setup:*
>>> - Ignite version: 2.15
>>> - One Ignite server node and five additional nodes (node1 to node5).
>>> - Nodes 1 to 4 and the server node run on Windows machines, while node5
>>> runs on a Linux machine.
>>> - Each node has distinct functions and configurations.
>>>
>>> *Issue Details:*
>>> Intermittently, communication from node5 (Linux) to node1 (Windows)
>>> fails, and the following exception appears in the logs:
>>>
>>> ```
>>> java.io.IOException: Connection timed out
>>>     at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
>>>     at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
>>>     at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330)
>>>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:284)
>>>     at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:259)
>>>     at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
>>>     at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1351)
>>>     ...
>>> ```
>>>
>>> This occurs despite no system-critical thread logs indicating resource
>>> starvation or high CPU usage.
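>>>
>>> To check whether anything on the path ever closes a truly idle socket,
>>> I have been experimenting with a crude probe along these lines (a
>>> sketch only: the host and port are assumptions taken from our
>>> communication SPI settings, and a raw socket never completes Ignite's
>>> handshake, so this exercises the OS and firewall timers rather than
>>> Ignite's session logic):
>>>
>>> ```java
>>> import java.io.IOException;
>>> import java.net.InetSocketAddress;
>>> import java.net.Socket;
>>>
>>> public class IdleProbe {
>>>     public static void main(String[] args) throws IOException {
>>>         long start = System.currentTimeMillis();
>>>         try (Socket s = new Socket()) {
>>>             // Host/port assumed from our setup (localPort=48100 on node1).
>>>             s.connect(new InetSocketAddress("192.168.1.112", 48100), 5_000);
>>>             // Blocks until data arrives or the peer closes the socket
>>>             // (read() returns -1); an RST surfaces as an IOException.
>>>             int b = s.getInputStream().read();
>>>             System.out.printf("Socket ended after %d ms (read returned %d)%n",
>>>                     System.currentTimeMillis() - start, b);
>>>         }
>>>     }
>>> }
>>> ```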
>>> *Observations:*
>>> 1. The last read and write actions on node5 occur approximately 15
>>> minutes and 50 seconds before the exception is logged. This pattern is
>>> consistent across multiple instances.
>>> 2. The `idleConnectionTimeout` is configured to 30 seconds, but no
>>> timeout triggers after this duration, even with inactivity exceeding 15
>>> minutes.
>>>
>>> *Configuration Summary:*
>>>
>>> ```xml
>>> <bean id="communicationSpiBean"
>>>       class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
>>>     <property name="localPort" value="48100" />
>>>     <property name="localPortRange" value="20" />
>>>     <property name="slowClientQueueLimit" value="100000" />
>>>     <property name="idleConnectionTimeout" value="30000" />
>>>     <property name="usePairedConnections" value="true" />
>>> </bean>
>>>
>>> <!-- Set on the IgniteConfiguration bean: -->
>>> <property name="failureDetectionTimeout" value="10000"/>
>>> <property name="clientFailureDetectionTimeout" value="30000"/>
>>> <property name="systemWorkerBlockedTimeout" value="300000"/>
>>> <property name="peerClassLoadingEnabled" value="false"/>
>>> ```
>>>
>>> *Request for Assistance:*
>>> I am looking for guidance or similar experiences that might help
>>> identify the root cause of this problem. Specifically:
>>> - Insights into why `idleConnectionTimeout` does not trigger as expected
>>> after prolonged inactivity.
>>> - An explanation of the recurring ~15 min 50 s idle interval that
>>> precedes every occurrence of this issue.
>>> - Potential configuration adjustments or known issues related to mixed
>>> OS (Windows and Linux) environments impacting communication.
>>> - Any recommendations on handling cross-platform network idiosyncrasies
>>> in Ignite setups.
>>>
>>> Any advice or assistance from the community would be greatly appreciated.
>>>
>>> Thank you in advance for your support.
>>>
>>> Best regards,
>>> Yash Khandelwal
>>> mail: khandelwalyash...@gmail.com
>>