Update: Still ongoing: the 1device cluster is still dead. The per-patch job has been removed so verifies can happen.
Jenkins is back up with an empty queue as of about 30 minutes ago. I have three rechecks running, and when those pass I'll be going through all the open gerrits I see (without any verification vote at all) for the past day and rechecking them. I won't send another update unless something else goes sideways or the 1device cluster is back in service and part of the verify process again.

I'll be looking into the health-checker style so 'the right thing' will happen when the port is open and receiving but without a fully functional brain behind it.

Ed

> On Oct 8, 2019, at 9:59 AM, Ed Kern via Lists.Fd.Io <ejk=cisco....@lists.fd.io> wrote:
>
> Problems currently still ongoing:
> The 1device cluster worker nodes are currently down. I've notified csit in slack and am cc'ing them here. In the meantime I have a gerrit to remove 1device per-patch so it doesn't delay voting on verify jobs.
> Jenkins just crashed, so that will take a while to sort. Vanessa and I are trying to just empty the build queue at this point to get back to zero, so Jenkins won't crash again when it gets reopened.
>
> History:
>
> Root cause:
> a. Will have to wait on the csit folks for answers on the two 1device node failures.
> b. During the night the internal docker registry stopped responding (but still passed the socket health check, so it didn't fail over).
>
> Workflow:
> 1. I saw there was an issue while reading email around 6am Pacific this morning.
> 2. Saw that the registry wasn't responding and attempted a restart.
> 3. Due to the Jenkins server queue hammering on the nomad cluster, it took a long while for that restart to go through (roughly 40 minutes).
> 4. Once the bottle was uncorked, the sixty pending jobs (including a large number of checkstyle jobs) turned into 160.
> 5. Jenkins 'chokes' and crashes.
> 6. 'We' start scrubbing the queue, which will cause a huge number of rechecks, but at least Jenkins won't crash again.
>
> ****** current time ******
>
> Future:
> 7. Will force the ci-man patch removing the per-patch verify.
> 8. The Jenkins queue will reopen and I'll send another email.
> 9. I'm adding myself to the LF system's queue-high-threshold alarm so I get paged/called when the queue gets above 90 (their current severe watermark).
> 10. I'll see if I can find a way to troll gerrit to manually recheck what I can find.
>
> More as it rolls along.
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14147): https://lists.fd.io/g/vpp-dev/message/14147
> Mute This Topic: https://lists.fd.io/mt/34443895/675649
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [e...@cisco.com]
> -=-=-=-=-=-=-=-=-=-=-=-
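For anyone curious about the "port open but no brain behind it" failure mode in root cause (b): the gap between a socket-level check and an application-level check can be sketched roughly like this. This is only an illustration, not the actual fd.io health checker; the `/v2/` endpoint is the standard Docker registry base path, and host/port values are placeholders.

```python
import socket
import urllib.request

def socket_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Passes as long as *something* is listening -- even a hung process.
    This is the kind of check the registry kept passing while dead."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def registry_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Requires an actual HTTP 200 from the registry's /v2/ base endpoint,
    so a listener with no functional service behind it fails the check."""
    try:
        url = f"http://{host}:{port}/v2/"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A process that accepts connections but never answers requests will pass `socket_alive` yet fail `registry_alive`, which is what would let the failover trigger correctly in this scenario.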
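Finding open changes with no Verified vote (steps above re rechecking the past day's gerrits) could be done against the standard Gerrit REST API, roughly as sketched below. This is a hypothetical sketch, not the script actually used: the query string, the XSSI-prefix handling, and the review endpoint follow documented Gerrit conventions, but authentication details are omitted and the "recheck" comment convention is the one used by this CI.

```python
import json
import urllib.parse
import urllib.request

GERRIT = "https://gerrit.fd.io/r"  # public fd.io Gerrit; auth omitted here

def build_query(project: str, age: str = "1d") -> str:
    """Open changes in `project`, updated within `age`, with no Verified vote."""
    return f"status:open project:{project} -age:{age} label:Verified=0"

def strip_xssi(body: str):
    """Gerrit prefixes JSON responses with )]}' to defeat XSSI; drop that line."""
    return json.loads(body.split("\n", 1)[1])

def find_unverified(project: str):
    """Network call: list candidate changes to recheck."""
    url = f"{GERRIT}/changes/?q={urllib.parse.quote(build_query(project))}"
    with urllib.request.urlopen(url) as resp:
        return strip_xssi(resp.read().decode())

def recheck(change_id: str, opener: urllib.request.OpenerDirector) -> None:
    """Post a 'recheck' comment so CI re-runs the verify jobs.
    `opener` must carry HTTP auth for the Gerrit account."""
    req = urllib.request.Request(
        f"{GERRIT}/a/changes/{change_id}/revisions/current/review",
        data=json.dumps({"message": "recheck"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    opener.open(req)
```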
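The queue-high-threshold alarm in step 9 amounts to watching the Jenkins build-queue depth against the watermark of 90 mentioned above. Jenkins exposes the queue at `/queue/api/json`; the sketch below only shows the depth check, with the paging/alerting wiring (an LF-side system) left out as an assumption.

```python
import json
import urllib.request

def parse_queue_depth(body: str) -> int:
    """Count queued items in a /queue/api/json response body."""
    return len(json.loads(body)["items"])

def queue_depth(jenkins_url: str) -> int:
    """Network call: fetch the current build-queue depth from Jenkins."""
    with urllib.request.urlopen(f"{jenkins_url}/queue/api/json") as resp:
        return parse_queue_depth(resp.read().decode())

def over_watermark(depth: int, watermark: int = 90) -> bool:
    """True when the queue exceeds the severe watermark and should page."""
    return depth > watermark
```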