Folks,

I found and fixed the root cause of the CI job failures.

Two of the four Nomad server nodes were disconnected from the Nomad server election pool: both believed there was no leader in the server pool, yet neither initiated a new election.  The other two Nomad servers were unaware of the problem, since they agreed on the existing leader, and the leader was apparently able to verify that all Nomad server nodes were alive.

This issue was resolved by doing the following on each of the affected Nomad nodes, one node at a time:

1. Stop Nomad (sudo systemctl stop nomad)
2. Wait for the node to be set as 'left' (i.e. offline) in the leader's
   Nomad Server pool status
3. Start Nomad (sudo systemctl start nomad)
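
Step 2 can be checked from a healthy server by parsing the output of `nomad server members`. The following is a minimal Python sketch, not the procedure actually used during the outage. It assumes the default column layout of that command (name first, status fourth), and the helper names `member_status` and `wait_until_left` are hypothetical.

```python
import subprocess
import time


def member_status(members_output, node_name):
    """Return the Status column for node_name, or None if it is not listed.

    Assumes the default `nomad server members` layout, where the first
    column is the server name and the fourth column is its status.
    """
    for line in members_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] == node_name:
            return fields[3]
    return None


def wait_until_left(node_name, timeout_s=300, poll_s=5):
    """Poll `nomad server members` until node_name is reported as 'left'.

    Returns True once the status is 'left', False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["nomad", "server", "members"],
            capture_output=True, text=True, check=True,
        ).stdout
        if member_status(out, node_name) == "left":
            return True
        time.sleep(poll_s)
    return False
```

Run on a server that still sees the leader, `wait_until_left("<node name>")` would sit between the stop and start commands for each node; the timeout and polling interval are arbitrary placeholders.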

After Nomad had been restarted on both of the errant nodes, a new Nomad server pool leader was elected and the CI returned to normal. Unfortunately, the current monitoring tools were insufficient to directly identify the disconnected state of the two bad nodes.  I will work with Peter Mikus, who helps maintain the servers, to put a monitor in place that detects this state and sends an alert, to reduce the downtime should this happen again.
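
One possible shape for such a monitor is to ask every server individually who it thinks the leader is and flag any server whose answer is empty or differs from the majority. The sketch below is only an illustration under stated assumptions, not the monitor that will be deployed: it assumes each server's HTTP API is reachable and that `/v1/status/leader` returns an empty string when that server knows of no leader. The function names are hypothetical.

```python
import json
import urllib.request


def fetch_leader(server_url, timeout_s=5):
    """Ask one server who it thinks the leader is; '' means no known leader."""
    url = f"{server_url}/v1/status/leader"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.load(resp) or ""
    except (OSError, ValueError):
        # An unreachable server or unparsable reply is treated as "no leader".
        return ""


def find_disconnected(leader_by_server):
    """Return servers whose leader view is empty or differs from the majority."""
    known = [leader for leader in leader_by_server.values() if leader]
    if not known:
        # Nobody reports a leader: every server is suspect.
        return sorted(leader_by_server)
    majority = max(sorted(set(known)), key=known.count)
    return sorted(
        server for server, leader in leader_by_server.items()
        if leader != majority
    )
```

A periodic job could build `leader_by_server` by calling `fetch_leader` for each server URL (the list of URLs is site-specific) and raise an alert whenever `find_disconnected` returns a non-empty list.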

It is not yet clear what triggered the disconnected state in the two Nomad server nodes and/or whether one or both of them had been in that state for a long time.

After resolving the issue, I have issued a 'recheck' on all VPP gerrit changes which failed due to TCP timeouts.  So far I have not seen any new job failures due to TCP connection resets.

Thanks for your patience during this outage.
-daw-

On 6/1/22 2:22 PM, Dave Wallace via lists.fd.io wrote:
Folks,

The FD.io CI is currently experiencing a rash of CI job failures due to TCP connection failures between Jenkins and the docker executor images.  I am currently working with LF-IT and the lab hosting vendor to diagnose and fix the issues.

Thank you in advance for your patience while this issue is being resolved.
-daw-


