Problems still ongoing:
The 1device cluster worker nodes are currently down. I've notified CSIT in
Slack and am cc'ing them here. In the meantime I have a Gerrit change to remove
the 1device per-patch jobs so they don't delay voting on verify jobs.
Jenkins just crashed, so that will take a while to sort.
Vanessa and I are trying to empty the build queue at this point to get back
to zero so that Jenkins won't just crash again when it is reopened.
History:
Root cause:
a. We will have to wait on the CSIT folks for answers on the two 1device node
failures.
b. During the night the internal Docker registry stopped responding
(but still passed the socket health check, so it didn't fail over; see the
sketch after this list).
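For context on why failover never kicked in: a socket-level check only proves the
port accepts a TCP connection, which a wedged registry can still do. Below is a
minimal sketch (illustration only, not our actual Nomad health check; the host and
port are placeholders) contrasting that with an application-level probe of the
standard Docker registry /v2/ endpoint:

#!/usr/bin/env python3
"""Illustration only: why a socket-level check can keep passing while the
registry is effectively dead. Host/port are placeholders, not our setup."""

import socket
import urllib.error
import urllib.request

REGISTRY_HOST = "registry.example.internal"  # hypothetical
REGISTRY_PORT = 5000                          # hypothetical

def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """The kind of check that kept passing: only verifies the port
    accepts a TCP connection, not that the registry can serve requests."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Application-level check: a healthy Docker registry answers
    GET /v2/ with 200 (or 401 when auth is required)."""
    url = f"http://{host}:{port}/v2/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        # A 401 still proves the registry process is up and serving HTTP.
        return err.code == 401
    except OSError:
        return False

if __name__ == "__main__":
    print("tcp :", tcp_check(REGISTRY_HOST, REGISTRY_PORT))
    print("http:", http_check(REGISTRY_HOST, REGISTRY_PORT))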
Workflow:
1. I saw there was an issue while reading email around 6am Pacific this
morning.
2. Saw that the registry wasn't responding and attempted a restart.
3. Due to the Jenkins server queue hammering on the Nomad cluster, it
took a long while (roughly 40 minutes) to get that restart to go through.
4. Once the bottle was uncorked, the sixty pending jobs (including a
large number of checkstyle jobs) turned into 160.
5. Jenkins 'choked' and crashed.
6. 'We' started scrubbing the queue, which will cause a huge number of
rechecks, but at least Jenkins won't crash again.
****** current time ******
Future:
7. Will force in the ci-man patch removing the per-patch verify.
8. The Jenkins queue will re-open and I'll send another email.
9. I'm adding myself to the LF queue high-threshold alarm system so I
get paged/called when the queue gets above 90 (their current severe watermark).
10. I'll see if I can find a way to trawl Gerrit and manually recheck
what I can find (a rough sketch of one possible approach is below).
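On item 10, one possible approach (a sketch only, not necessarily what will
actually run): use the standard Gerrit SSH CLI to list open changes and post a
"recheck" comment, assuming the Jenkins Gerrit trigger treats that comment as a
re-verify request. The host, user, and query string below are placeholders and
would need adjusting:

#!/usr/bin/env python3
"""Rough sketch: find open changes that lost their verify vote in the
queue scrub and post a 'recheck' comment on each. Host, user, and the
query are assumptions/placeholders."""

import json
import subprocess

# Placeholder SSH target; Gerrit's SSH API listens on port 29418 by default.
GERRIT = ["ssh", "-p", "29418", "someuser@gerrit.example.org", "gerrit"]
# Assumed query for changes with no verify vote; adjust to taste.
QUERY = "status:open project:vpp label:Verified=0"

def gerrit_query(query: str):
    """Run `gerrit query` over SSH and yield one JSON record per change."""
    out = subprocess.run(
        GERRIT + ["query", "--format=JSON", "--current-patch-set", query],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        rec = json.loads(line)
        if rec.get("type") != "stats":  # the last record is a stats summary
            yield rec

def recheck(change: dict) -> None:
    """Comment 'recheck' on the current patch set; the Gerrit trigger is
    assumed to re-queue the verify job when it sees that comment."""
    rev = change["currentPatchSet"]["revision"]
    subprocess.run(GERRIT + ["review", "--message", "recheck", rev], check=True)

if __name__ == "__main__":
    for change in gerrit_query(QUERY):
        print("rechecking", change["number"], change.get("subject", ""))
        recheck(change)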
more as it rolls along