Folks,
I found and fixed the root cause of the CI job failures.
Two of the four Nomad server nodes were disconnected from the Nomad
server election pool: both believed that the server pool had no leader,
yet neither initiated a new election. The other two Nomad servers were
blissfully unaware of the problem, since they agreed on the existing
leader and the leader was apparently able to verify that all Nomad
server nodes were alive.
This issue was resolved by doing the following on each of the two
affected Nomad nodes, one node at a time:
1. Stop Nomad (sudo systemctl stop nomad)
2. Wait for the node to be set as 'left' (i.e. offline) in the leader's
Nomad Server pool status
3. Start Nomad (sudo systemctl start nomad)
After both of the errant nodes had Nomad restarted, a new Nomad server
pool leader was elected and the CI returned to normal.
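For anyone who needs to repeat the procedure, step 2 (waiting for the node to show as 'left') could be automated roughly as follows. This is only a sketch, not what was run during the outage. It assumes the member list is read from the Nomad agent HTTP endpoint /v1/agent/members on a healthy server, and that the payload carries a "Members" list with "Name" and "Status" fields. The helper names and the get_members callable are hypothetical.

```python
import time


def member_status(members_payload, node_name):
    """Return the Status of node_name in a /v1/agent/members-style payload.

    Assumes the payload looks like {"Members": [{"Name": ..., "Status": ...}]}.
    Returns None if the node is not listed.
    """
    for member in members_payload.get("Members", []):
        if member.get("Name") == node_name:
            return member.get("Status")
    return None


def wait_until_left(get_members, node_name, timeout=120.0, interval=2.0,
                    sleep=time.sleep):
    """Poll get_members() until node_name is reported as 'left'.

    get_members is a hypothetical callable that returns the decoded
    member list as seen by a healthy server. Returns True once the node
    is 'left', or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if member_status(get_members(), node_name) == "left":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The stop and start steps themselves stay as the systemctl commands listed above. Nomad should only be started again once wait_until_left returns True.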
Unfortunately the current monitoring tools were insufficient to directly
identify the disconnected state of the two bad nodes. I will work with
Peter Mikus, who helps maintain the servers, to put a monitor in place
that detects this state and sends an alert, reducing the downtime should
this happen again.
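One possible shape for such a check, offered as a starting point rather than the monitor that will actually be deployed: ask every server which leader it sees and flag any server that reports no leader or disagrees with the rest. The sketch assumes each server's HTTP API is reachable and that /v1/status/leader returns the leader address as a JSON string. The function names and server URLs are hypothetical, and an even split between two different leaders is not handled rigorously.

```python
import json
import urllib.request
from collections import Counter


def fetch_leader(server_url, timeout=5.0):
    """Return the leader address reported by one server, or None.

    None covers every failure mode: unreachable server, HTTP error
    (for example when the server sees no leader), or an empty answer.
    """
    try:
        url = f"{server_url}/v1/status/leader"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            leader = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return None
    return leader or None


def find_disconnected(leader_views):
    """Return the sorted names of servers whose view looks wrong.

    leader_views maps a server name to the leader it reports (or None).
    A server is flagged if it reports no leader or disagrees with the
    most common non-empty answer. If no server sees a leader, all of
    them are flagged.
    """
    known = [view for view in leader_views.values() if view]
    if not known:
        return sorted(leader_views)
    consensus, _count = Counter(known).most_common(1)[0]
    return sorted(name for name, view in leader_views.items()
                  if view != consensus)
```

In the situation described above, two servers would report the same leader and two would report none, so find_disconnected would return the two bad nodes and the alert could name them directly.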
It is not yet clear what triggered the disconnected state in the two
Nomad server nodes and/or whether one or both of them had been in that
state for a long time.
After resolving the issue, I issued a 'recheck' on all VPP Gerrit
changes that failed due to TCP timeouts. So far I have not seen any
new job failures due to TCP connection resets.
Thanks for your patience during this outage.
-daw-
On 6/1/22 2:22 PM, Dave Wallace via lists.fd.io wrote:
Folks,
The FD.io CI is currently experiencing a rash of CI job failures due
to TCP connection failures between Jenkins and the docker executor
images. I am currently working with LF-IT and the lab hosting vendor
to diagnose and fix the issues.
Thank you in advance for your patience while this issue is being resolved.
-daw-