Han,

Thanks for your reply, and thanks for confirming my reading of the code at the 
time as well: “from what I can see, raft.leader_sid is also updated in the 
only two places where raft.candidate_retrying is set (raft_start_election() 
and raft_set_leader()). Which means it is not possible that 
raft.candidate_retrying is set to TRUE while raft->leader_sid is non-zero.”

We don't see it very often, probably once every half month or so. If it happens 
again, what information do you think we should collect to help with further 
investigation?

Thanks
Yun



From: Han Zhou <[email protected]>
Sent: Sunday, August 16, 2020 10:14 PM
To: Yun Zhou <[email protected]>
Cc: [email protected]; [email protected]; Girish 
Moodalbail <[email protected]>
Subject: Re: the raft_is_connected state of a raft server stays as false and 
cannot recover

On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou 
<[email protected]> wrote:
Hi,

Need an expert's view to address a problem we are seeing now and then: an 
ovsdb-server node in a 3-node raft cluster keeps printing the 
"raft_is_connected: false" message, and its "connected" state in its _Server DB 
stays false.

According to the ovsdb-server(5) manpage, this means the server is not in 
contact with a majority of its cluster.

Except its "connected" state, from what we can see, this server is in the 
follower state and works fine, and connection between it and the other two 
servers appear healthy as well.

Below is its raft structure snapshot at the time of the problem. Note that its 
candidate_retrying field stays as true.
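
For context, here is a rough sketch of how we read the connectivity check 
behind this state (a paraphrase only, not the exact ovsdb/raft.c source; the 
struct raft_flags and sketch_is_connected() names are made up for 
illustration). It shows why candidate_retrying staying true is enough to keep 
"connected" false on an otherwise healthy follower:

#include <stdbool.h>
#include <stdio.h>

/* Field names follow the struct raft dump below. */
struct raft_flags {
    bool candidate_retrying;              /* still retrying an election?   */
    bool joining, leaving, left, failed;
    bool ever_had_leader;
};

/* Assumed shape of the connectivity check (paraphrased, not the exact
 * ovsdb/raft.c source): every condition must hold to report "connected". */
static bool
sketch_is_connected(const struct raft_flags *r)
{
    return !r->candidate_retrying
           && !r->joining
           && !r->leaving
           && !r->left
           && !r->failed
           && r->ever_had_leader;
}

int
main(void)
{
    /* The flag values seen in the gdb dump below. */
    struct raft_flags stuck = {
        .candidate_retrying = true,
        .ever_had_leader = true,
    };
    /* Prints "connected: false": candidate_retrying alone keeps it false. */
    printf("connected: %s\n", sketch_is_connected(&stuck) ? "true" : "false");
    return 0;
}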

Hopefully the provided information can help figure out what is going wrong 
here. Unfortunately we don't have a reliable way to reproduce it:

Thanks for reporting the issue. This looks really strange. In the state below, 
leader_sid is non-zero, but candidate_retrying is true.
According to the latest code, whenever leader_sid is set to non-zero (in 
raft_set_leader()), candidate_retrying is set to false; whenever 
candidate_retrying is set to true (in raft_start_election()), leader_sid is 
set to UUID_ZERO. And the data structure is initialized with xzalloc, making 
sure candidate_retrying is false in the beginning. So, sorry, I can't explain 
how it ends up in this conflicting situation. It would be helpful if there 
were a way to reproduce it. How often does it happen?
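
To make the invariant concrete, here is a minimal, self-contained sketch (my 
own simplification for illustration, not the actual ovsdb/raft.c code; 
set_leader(), start_election() and the is_retry parameter are stand-ins). 
With only these two update paths, candidate_retrying == true should always 
imply a zero leader_sid:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct uuid { unsigned int parts[4]; };       /* stand-in for the OVS uuid */
static const struct uuid UUID_ZERO = { { 0, 0, 0, 0 } };

struct raft_sketch {
    struct uuid leader_sid;   /* server ID of the known leader (zero if none) */
    bool candidate_retrying;  /* retrying an election after a timeout?        */
};

/* Mirrors the behavior described for raft_set_leader(): learning a leader
 * always clears the retry flag. */
static void
set_leader(struct raft_sketch *r, const struct uuid *sid)
{
    r->leader_sid = *sid;
    r->candidate_retrying = false;
}

/* Mirrors the behavior described for raft_start_election(): starting (or
 * retrying) an election always forgets the leader first. */
static void
start_election(struct raft_sketch *r, bool is_retry)
{
    if (is_retry) {
        r->candidate_retrying = true;
    }
    r->leader_sid = UUID_ZERO;
}

int
main(void)
{
    struct raft_sketch r;
    memset(&r, 0, sizeof r);        /* like xzalloc(): everything zero/false */
    /* leader_sid values taken from the gdb dump below. */
    struct uuid leader = { { 642765114u, 43797788u, 2533161504u, 3088745929u } };

    start_election(&r, true);       /* retrying: leader_sid is zeroed        */
    set_leader(&r, &leader);        /* leader learned: retry flag cleared    */

    /* With only these two entry points, the state in the gdb dump
     * (candidate_retrying == true AND a non-zero leader_sid) is unreachable. */
    printf("candidate_retrying=%d, leader_sid is zero=%d\n",
           r.candidate_retrying,
           !memcmp(&r.leader_sid, &UUID_ZERO, sizeof r.leader_sid));
    return 0;
}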

Thanks,
Han


(gdb) print *(struct raft *)0xa872c0
$19 = {
  hmap_node = {
    hash = 2911123117,
    next = 0x0
  },
  log = 0xa83690,
  cid = {
    parts = {2699238234, 2258650653, 3035282424, 813064186}
  },
  sid = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  local_address = 0xa874e0 "tcp:10.8.51.55:6643",
  local_nickname = 0xa876d0 "3fdb",
  name = 0xa876b0 "OVN_Northbound",
  servers = {
    buckets = 0xad4bc0,
    one = 0x0,
    mask = 3,
    n = 3
  },
  election_timer = 1000,
  election_timer_new = 0,
  term = 3,
  vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  synced_term = 3,
  synced_vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  entries = 0xbf0fe0,
  log_start = 2,
  log_end = 312,
  log_synced = 311,
  allocated_log = 512,
  snap = {
    term = 1,
    data = 0xaafb10,
    eid = {
      parts = {1838862864, 1569866528, 2969429118, 3021055395}
    },
    servers = 0xaafa70,
    election_timer = 1000
  },
  role = RAFT_FOLLOWER,
  commit_index = 311,
  last_applied = 311,
  leader_sid = {
    parts = {642765114, 43797788, 2533161504, 3088745929}
  },
  election_base = 6043283367,
  election_timeout = 6043284593,
  joining = false,
  remote_addresses = {
    map = {
      buckets = 0xa87410,
      one = 0xa879c0,
      mask = 0,
      n = 1
    }
  },
  join_timeout = 6037634820,
  leaving = false,
  left = false,
  leave_timeout = 0,
  failed = false,
  waiters = {
    prev = 0xa87448,
    next = 0xa87448
  },
  listener = 0xaafad0,
  listen_backoff = -9223372036854775808,
  conns = {
    prev = 0xbcd660,
    next = 0xaafc20
  },
  add_servers = {
    buckets = 0xa87480,
    one = 0x0,
    mask = 0,
    n = 0
  },
  remove_server = 0x0,
  commands = {
    buckets = 0xa874a8,
    one = 0x0,
    mask = 0,
    n = 0
  },
  ping_timeout = 6043283700,
  n_votes = 1,
  candidate_retrying = true,
  had_leader = false,
  ever_had_leader = true
}

Thanks
- Yun
