Folks,

As you may have already noticed, Jenkins job operations have returned to normal.

The root cause of this outage was a Vexxhost technician incorrectly updating firewall rules, which isolated a secondary data center from the primary one hosting the FD.io Nomad hosts.  Many CSIT testbeds were similarly isolated.

Unfortunately, the lead Nomad Server controlling the cluster was located in the secondary data center, and for some unknown reason the Nomad cluster failed to elect a new lead Nomad Server.  In theory this should not have happened, but apparently we hit an untested error case.  This will be mitigated by running all Nomad Server instances on hosts in the primary FD.io data center and by reducing the number of Nomad Server instances from 7 to 5.  Lastly, automated testing of the Nomad Server election process has been added to the Nomad Planning Wish List [0] so that we can regularly verify that Nomad Server fail-over works correctly.
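
For anyone wondering why this should not have happened in theory, here is a minimal sketch of the Raft majority arithmetic. The helper names are made up for illustration, and the 5/2 split of servers between the two data centers is a hypothetical example, not the confirmed layout of our cluster.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a Raft cluster with `servers` voting members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many voting members can be lost while a majority remains."""
    return servers - quorum(servers)


def can_elect_leader(total: int, reachable: int) -> bool:
    """A partition can elect a leader only if it still holds a majority."""
    return reachable >= quorum(total)


if __name__ == "__main__":
    for n in (5, 7):
        print(f"{n} servers: quorum={quorum(n)}, tolerates={failure_tolerance(n)}")

    # Hypothetical split: 5 servers reachable in one data center, 2 cut off.
    print("majority side can elect:", can_elect_leader(total=7, reachable=5))
    print("minority side can elect:", can_elect_leader(total=7, reachable=2))
```

Under that assumed split, the side with 5 reachable servers still holds a majority of 7, so a new leader election would be expected to succeed there. Going to 5 servers lowers the failure tolerance from 3 to 2, but with every server in the primary data center a firewall change between data centers can no longer split the server set.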

A hearty shout-out to Peter Mikus and Andrew Yourtchenko for their help identifying the root cause and mitigation steps.

Thank you for your patience & have a great weekend!
-daw-

[0] https://wiki.fd.io/view/Nomad_Operations_and_Planning#Nomad_Planning_Wish_List

On 5/29/2020 1:04 PM, Dave Wallace via lists.fd.io wrote:
Adding csit-dev to the thread...

On 5/29/2020 1:03 PM, Dave Wallace via lists.fd.io wrote:
FYI, I have opened a case with Vexxhost: https://secure.vexxhost.com/billing/viewticket.php?tid=QDU-864405&c=xgaBi2wP

On 5/29/2020 12:56 PM, Dave Wallace via lists.fd.io wrote:
Folks,

There has been an outage in the Nomad cluster (2 nodes offline), which is currently preventing VPP Jenkins jobs from executing. I'm trying to get hold of Vexxhost to bring the downed servers back online.

Apparently Raft doesn't handle multiple servers very well :(

Will post updates as I get more information.

Thank you for your patience.
-daw-