Update:

Still ongoing: the 1device cluster is still dead… the per-patch job has been
removed so verifies can happen.

Jenkins is back up with an empty queue as of ~30 minutes ago. I have three
rechecks running, and when those pass I'll go through all the open gerrits I
can find (those without any verification vote at all) from the past day and
recheck them.
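
For anyone who wants to do the same sweep, a rough sketch of how it could be
scripted against the Gerrit REST API follows (Python; the server URL,
credentials, and exact query string are assumptions for illustration, not a
tested tool):

    #!/usr/bin/env python3
    # Sketch: find open changes with no Verified vote and post "recheck".
    # The base URL and credentials below are placeholders.
    import json
    import requests

    GERRIT = "https://gerrit.example.org/r"   # placeholder base URL
    AUTH = ("username", "http-password")      # Gerrit HTTP credentials

    # Open changes touched within the last day that carry neither a +1
    # nor a -1 Verified vote.
    QUERY = "status:open -age:1d -label:Verified=+1 -label:Verified=-1"

    resp = requests.get(f"{GERRIT}/a/changes/", params={"q": QUERY}, auth=AUTH)
    resp.raise_for_status()

    # Gerrit prefixes its JSON responses with a ")]}'" line to defeat XSSI,
    # so strip the first line before parsing.
    changes = json.loads(resp.text.split("\n", 1)[1])

    for change in changes:
        print(f"rechecking {change['_number']}: {change['subject']}")
        # A review comment containing "recheck" retriggers the verify jobs.
        requests.post(
            f"{GERRIT}/a/changes/{change['_number']}/revisions/current/review",
            json={"message": "recheck"},
            auth=AUTH,
        ).raise_for_status()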


I won't send another update unless something else goes sideways or the 1device
cluster is back in service and part of the verify process again.

I'll also be looking into the health-checker setup so 'the right thing' happens
when the port is open and receiving connections but there's no fully functional
brain behind it.
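
As a strawman of what that could look like (the registry URL is a placeholder
and this is only a sketch of the idea, not our actual config): have the check
hit an application-level endpoint, e.g. the Docker registry's /v2/ API, rather
than doing a bare TCP connect, so a wedged process behind an open port still
fails the check.

    #!/usr/bin/env python3
    # Sketch: application-level health check for a Docker registry.
    # A bare TCP connect passes as long as the port is open, even when the
    # process behind it is wedged; hitting the Docker Registry HTTP API's
    # /v2/ version-check endpoint catches that case.
    import sys
    import requests

    REGISTRY = "http://registry.example.org:5000"   # placeholder URL

    def healthy(timeout=5):
        try:
            # A healthy registry answers GET /v2/ with 200 (or 401 when
            # it requires auth); anything else means trouble.
            r = requests.get(f"{REGISTRY}/v2/", timeout=timeout)
            return r.status_code in (200, 401)
        except requests.RequestException:
            # Connection refused, a timeout, or a socket that accepts but
            # never answers all count as unhealthy.
            return False

    if __name__ == "__main__":
        # Exit status drives the monitor's pass/fail (and fail-over) logic.
        sys.exit(0 if healthy() else 1)

The timeout matters most here: a socket that accepts the connection but never
responds (last night's failure mode) now fails the check instead of passing it.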

Ed



> On Oct 8, 2019, at 9:59 AM, Ed Kern via Lists.Fd.Io 
> <ejk=cisco....@lists.fd.io> wrote:
> 
> 
> Problems currently still ongoing:
>       1device cluster worker nodes are currently down. I've notified CSIT in
> Slack and am cc'ing them here. In the meantime I have a gerrit to remove
> 1device per patch so it doesn't delay voting on verify jobs.
>       Jenkins just crashed, so that will take a while to sort. Vanessa and I
> are trying to just empty the build queue at this point, to get back to zero
> so Jenkins won't crash again as soon as it gets opened.
> 
> 
> History:
> 
> root cause: 
>       a. will have to wait on the CSIT folks for answers on the two 1device
> node failures
>       b. during the night the internal Docker registry stopped responding
> (but still passed its socket health check, so it didn't fail over)
> 
> Workflow:
>       1. I saw there was an issue while reading email around 6 a.m. Pacific
> this morning.
>       2. Saw that the registry wasn't responding and attempted a restart.
>       3. Due to the Jenkins server queue hammering on the Nomad cluster, it
> took a long while for that restart to go through (roughly 40 min).
>       4. Once the bottle was uncorked, the sixty pending jobs (including a
> large number of checkstyle jobs) turned into 160.
>       5. Jenkins 'chokes' and crashes.
>       6. 'We' start scrubbing the queue, which will cause a huge number of
> rechecks but at least Jenkins won't crash again.
> 
> ****** current time ******
> 
> future:
>       7. Will force the ci-man patch removing the per-patch verify.
>       8. Jenkins queue will re-open and I'll send another email.
>       9. I'm adding myself to the LF system's queue-high-threshold alarm so
> I get paged/called when the queue gets above 90 (their current severe
> watermark).
>       10. I'll see if I can find a way to trawl gerrit to manually recheck
> what I can find.
> 
> 
> more as it rolls along
