----- Original Message -----
> On December 3, 2017 9:27 pm, Paul Belanger wrote:
> [snip]
> > Please reach out to me the next time you restart it, something is seriously
> > wrong if we have to keep restarting nodepool every few days.
> > At this rate, I would even leave nodepool-launcher in the bad state until
> > we inspect it.
> >
> > Thanks,
> > PB
> >
>
> Hello,
>
> nodepoold was stuck again. Before restarting it I dumped the threads'
> stack traces, and it seems that 8 threads were trying to acquire a single
> lock (futex=0xe41de0):
> https://review.rdoproject.org/paste/show/9VnzowfzBogKG4Gw0Kes/
>
> This leaves the main loop stuck at
> http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/nodepool.py#n1281
>
> I'm not entirely sure what caused this deadlock; the other threads involved
> are quite complex:
> * kazoo zk_loop
> * zmq received
> * apscheduler mainloop
> * periodicCheck paramiko client connect
> * paramiko transport run
> * nodepool webapp handle request
>
> Next time, before restarting the process, it would be good to know which
> thread is actually holding the lock, using (gdb) py-print, as explained
> here:
> https://stackoverflow.com/questions/42169768/debug-pythread-acquire-lock-deadlock/42256864#42256864
>
> Paul: any other debug instructions would be appreciated.
>
Hello,

As a follow-up: the Zuul queue for rdoinfo, DLRN-rpmbuild and other jobs
using the rdo-centos-7/rdo-centos-7-ssd nodes was moving very slowly. After
checking, there were multiple nodes seen by nodepool as "ready", but those
nodes were not in Jenkins. For example:

+-------+-----------+------+--------------+-----+-------+-------------+---------+
| ID    | Provider  | AZ   | Label        | ... | State | Age         | Comment |
+-------+-----------+------+--------------+-----+-------+-------------+---------+
| 62045 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:24 | None    |
| 62047 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:19 | None    |

The queue was only moving when there were more pending requests than nodes
in this state, since that is when nodepool tries to build new nodes.

I have manually removed them to allow the reviews to move on. This is
already documented in the etherpad at
https://review.rdoproject.org/etherpad/p/nodepool-infra-debugging.

Regards,
Javier

> Regards,
> -Tristan
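
For anyone who wants to spot this condition without reading the table by eye, here is a minimal Python sketch of the check described above. It is only an illustration under several assumptions: the table text has been saved from the node listing with the elided columns dropped, the Age column is days:hours:minutes:seconds, and a node that has been "ready" for longer than some threshold is a candidate for being stuck. The helper names, the threshold and node 62050 in the sample are hypothetical; it does not query Jenkins, so each candidate still needs to be confirmed by hand.

```python
from datetime import timedelta


def parse_age(text):
    """Parse an age such as '01:10:24:24' (assumed days:hours:minutes:seconds)."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_table(text):
    """Turn a '|'-delimited ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stale_ready_nodes(text, max_age):
    """Return IDs of nodes in state 'ready' that are older than max_age."""
    return [
        row["ID"]
        for row in parse_table(text)
        if row.get("State") == "ready" and parse_age(row["Age"]) > max_age
    ]


# Sample input: the two nodes from this mail plus one hypothetical fresh node.
SAMPLE = """
+-------+-----------+------+--------------+-------+-------------+---------+
| ID    | Provider  | AZ   | Label        | State | Age         | Comment |
+-------+-----------+------+--------------+-------+-------------+---------+
| 62045 | rdo-cloud | None | rdo-centos-7 | ready | 01:10:24:24 | None    |
| 62047 | rdo-cloud | None | rdo-centos-7 | ready | 01:10:24:19 | None    |
| 62050 | rdo-cloud | None | rdo-centos-7 | ready | 00:00:05:10 | None    |
+-------+-----------+------+--------------+-------+-------------+---------+
"""

if __name__ == "__main__":
    # One hour is an arbitrary threshold chosen for illustration.
    print(stale_ready_nodes(SAMPLE, timedelta(hours=1)))
```

With the sample above and a one-hour threshold, only nodes 62045 and 62047 are reported; the threshold would need tuning to how long a healthy node normally waits in "ready".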