Folks,
As you may have already noticed, Jenkins job operations have returned to
normal.
The root cause of this outage was a Vexxhost technician's incorrect
update to the firewall rules, which isolated a secondary data center
from the primary one hosting the FD.io Nomad hosts. Many CSIT testbeds
were similarly isolated.
Unfortunately, the lead Nomad Server controlling the cluster was located
in the secondary data center, and for some unknown reason the Nomad
cluster failed to elect a new lead Nomad Server. In theory this should
not have happened, but apparently we hit an untested error case. This
will be mitigated by ensuring that all Nomad Server instances run on
hosts in the primary FD.io data center and by reducing the number of
Nomad Server instances from 7 to 5. Lastly, automated testing of the
Nomad Server election process has been added to the Nomad Planning Wish
List [0] so that we can verify correct operation of the Nomad Server
fail-over functionality on a regular basis.
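For reference, here is a minimal sketch of the majority arithmetic that
Raft-style leader election relies on, applied to the server counts
mentioned above. The function names are illustrative only and are not
part of Nomad. The sketch assumes a plain strict-majority rule over the
configured servers.

```python
def raft_quorum(servers: int) -> int:
    """Votes needed to elect a leader: a strict majority of all servers."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """Servers that can be unreachable while a majority still remains."""
    return servers - raft_quorum(servers)


def can_elect_leader(servers: int, reachable: int) -> bool:
    """True if the reachable servers alone can form a majority."""
    return reachable >= raft_quorum(servers)


for n in (5, 7):
    print(f"{n} servers: quorum={raft_quorum(n)}, "
          f"tolerates {failures_tolerated(n)} unreachable")
```

By this arithmetic a 7-server cluster that loses 2 servers still has 5
reachable, which is above the quorum of 4. That is why the failed
election was unexpected.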
A hearty shout-out to Peter Mikus and Andrew Yourtchenko for their help
identifying the root cause and mitigation steps.
Thank you for your patience & have a great weekend!
-daw-
[0] https://wiki.fd.io/view/Nomad_Operations_and_Planning#Nomad_Planning_Wish_List
On 5/29/2020 1:04 PM, Dave Wallace via lists.fd.io wrote:
Adding csit-dev to the thread...
On 5/29/2020 1:03 PM, Dave Wallace via lists.fd.io wrote:
FYI, I have opened a case with Vexxhost:
https://secure.vexxhost.com/billing/viewticket.php?tid=QDU-864405&c=xgaBi2wP
On 5/29/2020 12:56 PM, Dave Wallace via lists.fd.io wrote:
Folks,
There has been an outage in the Nomad cluster (2 nodes offline)
which is currently preventing VPP Jenkins jobs from executing. I'm
working on getting hold of Vexxhost to get the servers that are down
back online.
Apparently Raft doesn't handle multiple servers very well :(
Will post updates as I get more information.
Thank you for your patience.
-daw-