The most common cause of a segmented cluster is not the network but your Java garbage collection configuration. Do you see any "Long JVM pause" warnings in your logs before the problem occurs?
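To illustrate why a long pause can look like a network failure: if the whole JVM stops for longer than the discovery timeout, the node cannot answer its neighbour in time and is dropped from the topology. Here is a minimal sketch of how such a pause can be measured from inside the JVM. It is not Ignite's own detector; the class name `PauseProbe`, the 50 ms interval and the 500 ms threshold are all assumptions chosen for illustration.

```java
public class PauseProbe {
    /** Time spent beyond the requested sleep, clamped at zero. */
    public static long excessMillis(long requestedMs, long actualMs) {
        return Math.max(0, actualMs - requestedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 50;    // hypothetical probe interval
        final long thresholdMs = 500;  // hypothetical reporting threshold

        for (int i = 0; i < 100; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // If the sleep took much longer than requested, the JVM
            // (or the whole container) was probably stalled meanwhile.
            long excess = excessMillis(intervalMs, elapsedMs);
            if (excess >= thresholdMs) {
                System.out.println("Possible pause of about " + excess + " ms");
            }
        }
    }
}
```

If pauses reported this way approach the 5 seconds you mention, GC tuning is likely a better place to start than the network settings.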

On Wed, 8 Nov 2023 at 08:48, Alan Rose <alan_r...@trimble.com> wrote:
>
> I am hoping someone can help me understand some log entries better.
> I have two ignite nodes A & B running in linux containers that appear to
> have a network issue that result in node A restarting approx 5 seconds
> later.
> From the logs Node B states about Node A
> "Previous node alive status [alive=false,
> checkPreviousNodeId=fb9c943e-aa4a-4e6c-ae00-1df5212a3f3f,
> actualPreviousNode=TcpDiscoveryNode
> [id=58424f0b-e77f-4127-835a-4274f57955a1,
> consistentId=5faca106-0c39-45ab-8c64-f38df8910238, etc.
> What is this line telling me about Node A?
>
> I then get
> "Node FAILED: TcpDiscoveryNode ..etc"
> and "Close incoming connection, unknown node..?" I think talking about
> node A
>
> Node A log states
> Failed to send message to remote node [node=TcpDiscoveryNode [id= etc
> but it does appear to be able to ping node B Ok
> within 5 second I see in Node A log
> Node is out of topology (probably, due to short-time network problems).
> Local node SEGMENTED: TcpDiscoveryNode [id=58 etc
> finally there is a restart of the node A.
> I see no other evidence of a network issue. Is there something I can
> configure, so it is not so quick to timeout
> The only thing I see in the log at startup around 5 seconds
> is netTimeout=5000
>
> --
> *Alan Rose*
> *Senior Software Engineer. *
>
> *CCSS Team Merino*
> *Trimble Navigation New Zealand Limited*
> P O Box 8729, Riccarton, Christchurch 8440 , New Zealand
> +64 3 9635616 Ext 604016