Alexey Goncharuk created IGNITE-5580:
----------------------------------------

             Summary: Improve node failure cause information
                 Key: IGNITE-5580
                 URL: https://issues.apache.org/jira/browse/IGNITE-5580
             Project: Ignite
          Issue Type: Improvement
          Components: general
    Affects Versions: 1.7
            Reporter: Alexey Goncharuk
             Fix For: 2.1


When a node fails, we do not print any information about the root cause of 
the failure. This makes the failure extremely hard to investigate: I have to 
find the node preceding the failed one and check the logs there.

I suggest that we add extensive information about the reason for the node 
failure and the sequence of events that led to it, e.g.:
[time] [NODE] Sending a message to next node - failed _because_ - write timeout, read timeout, ...?
[time] [NODE] Connection check - failed - why? Connection refused, handshake timed out, ...?
...
[time] [NODE] Decided to drop the node because of the sequence above

Maybe we do not need to print this information every time, but we do need it 
when the troubleshooting logger is enabled.
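
A minimal sketch of how such a trace could be assembled, for illustration only. 
NodeFailureTrace, stepFailed and report are hypothetical names and are not part 
of the Ignite API; the boolean flag merely stands in for "troubleshooting 
logger is enabled".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an Ignite class): collects failed steps for one node. */
final class NodeFailureTrace {
    private final String nodeId;
    private final List<String> steps = new ArrayList<>();

    NodeFailureTrace(String nodeId) {
        this.nodeId = nodeId;
    }

    /** Records one failed step together with the exception that caused it. */
    void stepFailed(long timeMillis, String step, Throwable cause) {
        steps.add("[" + timeMillis + "] [" + nodeId + "] " + step
            + " - failed because - " + describe(cause));
    }

    /** Renders the cause as "ExceptionType: message" (message omitted when null). */
    private static String describe(Throwable cause) {
        String msg = cause.getMessage();
        return cause.getClass().getSimpleName() + (msg == null ? "" : ": " + msg);
    }

    /**
     * Returns the lines to log once the node is dropped.
     * Assumption: nothing is reported unless troubleshooting output is enabled.
     */
    List<String> report(long timeMillis, boolean troubleshootingEnabled) {
        if (!troubleshootingEnabled)
            return new ArrayList<>();

        List<String> out = new ArrayList<>(steps);
        out.add("[" + timeMillis + "] [" + nodeId
            + "] Decided to drop the node because of the sequence above");
        return out;
    }
}
```

The caller would pass each returned line to whatever logger is in use; keeping 
the steps in memory until the decision is made avoids log noise for 
connections that recover.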

Also, DiscoverySpi should collect the latest important events and dump them 
when the local node is segmented. This will allow users to match the events 
in the cluster against the events on the local node and get to the bottom of 
the failure.
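
One possible shape for this event history is a small bounded buffer, sketched 
below. RecentEventBuffer, record and dump are hypothetical names, not existing 
DiscoverySpi methods, and the capacity and line format are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not an Ignite class): keeps only the most recent events. */
final class RecentEventBuffer {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    RecentEventBuffer(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Records an event, evicting the oldest one once the buffer is full. */
    synchronized void record(long timeMillis, String nodeId, String description) {
        if (events.size() == capacity)
            events.removeFirst();
        events.addLast("[" + timeMillis + "] [" + nodeId + "] " + description);
    }

    /** Returns a snapshot, oldest first, to be logged when the local node is segmented. */
    synchronized List<String> dump() {
        return new ArrayList<>(events);
    }
}
```

A fixed capacity keeps memory use constant regardless of how long the node 
runs, at the cost of losing older events.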



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
