On Fri, Dec 08, 2017 at 08:38:24PM +1100, Ian Wienand wrote:
> Hello,
>
> Just to save people reverse-engineering IRC logs...
>
> At ~04:00UTC frickler called out that things had been sitting in the
> gate for ~17 hours.
>
> Upon investigation, one of the stuck jobs was a
> legacy-tempest-dsvm-neutron-full job
> (bba5d98bb7b14b99afb539a75ee86a80) as part of
> https://review.openstack.org/475955
>
> Checking the zuul logs, it had sent that to ze04:
>
>   2017-12-07 15:06:20,962 DEBUG zuul.Pipeline.openstack.gate: Build <Build
>   bba5d98bb7b14b99afb539a75ee86a80 of legacy-tempest-dsvm-neutron-full on
>   <Worker ze04.openstack.org>> started
>
> However, zuul-executor was not running on ze04.  I believe there were
> issues with this host yesterday.  Both "/etc/init.d/zuul-executor start"
> and "service zuul-executor start" reported OK, but didn't actually
> start the daemon.  Rather than debug it, I just used
> _SYSTEMCTL_SKIP_REDIRECT=1 and that got it going.  We should look into
> that; I've noticed similar things with zuul-scheduler too.
>
> At this point, the evidence suggested zuul was waiting for jobs that
> would never return.  Thus I saved the queues, restarted zuul-scheduler
> and re-queued.
>
> Soon after, frickler again noticed that releasenotes jobs were now
> failing with "could not import extension openstackdocstheme" [1].  We
> suspect [2].
>
> However, the gate did not become healthy.  Upon further investigation,
> the executors are very frequently failing jobs with
>
>   2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build:
>   11062f1cca144052afb733813cdb16d8] Exception while executing job
>   Traceback (most recent call last):
>     File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py",
>       line 588, in execute
>       str(self.job.unique))
>     File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py",
>       line 702, in _execute
>     File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py",
>       line 1157, in prepareAnsibleFiles
>     File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py",
>       line 500, in make_inventory_dict
>       for name in node['name']:
>   TypeError: unhashable type: 'list'
>
> This is leading to the very high "retry_limit" failures.
>
> We suspect change [3], as it made some changes in the node area.  I did
> not want to revert it via a force-merge, and I unfortunately don't have
> time to do something like apply it manually on the host and babysit it.
> (I did not have time for a short email, so I sent a long one instead :)
>
> At this point, I sent the alert to warn people the gate is unstable,
> which is about the latest state.
>
> Good luck,
>
> -i
>
> [1] http://logs.openstack.org/95/526595/1/check/build-openstack-releasenotes/f38ccb4/job-output.txt.gz
> [2] https://review.openstack.org/525688
> [3] https://review.openstack.org/521324

Digging into some of the issues this morning, I believe that
citycloud-sto2 has been wedged for some time.  I see ready / locked
nodes sitting for 2+ days.  We also have a few ready / locked nodes in
rax-iad, which I think are related to the unhashable list error from
this morning.
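For anyone skimming, the "unhashable type: 'list'" error is just standard
Python behaviour when a list ends up somewhere a dictionary key or set
member is expected.  A minimal sketch of that failure mode follows; this
is not the actual Zuul code in server.py, and the field names here are
invented for illustration:

    # A node whose 'name' is now a list rather than a single string.
    node = {'name': ['primary', 'subnode'], 'host_vars': {}}

    inventory = {}
    try:
        # Dict keys must be hashable; a list is not, so keying the
        # inventory on node['name'] directly blows up:
        inventory[node['name']] = node['host_vars']
    except TypeError as e:
        print(e)  # unhashable type: 'list'

    # Keying on each element of the list instead is fine:
    for name in node['name']:
        inventory[name] = node['host_vars']

That would be consistent with [3] having changed the shape of the node
data that feeds the inventory, though I haven't confirmed the exact code
path.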
As I understand it, the only way to release these nodes is to stop the
scheduler; is that correct?  If so, I'd like to request that we add some
sort of --force option to the CLI delete command, or some other command,
if that makes sense.

I'll hold off on a restart until jeblair or shrews has a moment to look
at logs.

Paul