Reviewed:  https://review.opendev.org/717662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=941559042f609ee43ff3160c0f0d0c45187be17f
Submitter: Zuul
Branch:    master
commit 941559042f609ee43ff3160c0f0d0c45187be17f
Author: melanie witt <melwi...@gmail.com>
Date:   Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the
    scheduler service gets into a state where many requests fail in it
    with CellTimeout errors. Example:

        Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes
    starting off with already locked internal oslo.db locks. This is a
    known issue in python [1]: if a parent process forks a child process
    while a lock is held, the child will inherit the held lock, which
    can never be acquired.

    The python issue is not considered a bug, and the recommended way to
    handle it is to use os.register_at_fork() in oslo.db to reinitialize
    its lock. That method is new in python 3.7, so as long as we still
    support python 3.6, we must handle the situation outside of oslo.db.
    We can do this by clearing the cell cache that holds oslo.db
    database transaction context manager objects during service
    start(). This way, we get fresh oslo.db locks that are in an
    unlocked state when a child process begins.

    We can also take this opportunity to resolve part of a TODO by
    clearing the same cell cache during service reset() (SIGHUP), since
    that is another case where we intended to clear it. The rest of the
    TODO, related to periodic clearing of the cache, is removed after
    discussion on the review: such clearing would be unsynchronized
    among the services, and for periods of time each service might have
    a different view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
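To make the failure mode concrete, here is a minimal, standalone Python
sketch (POSIX-only; not nova or oslo.db code, and the names _db_lock and
_reinit_db_lock are illustrative) of both the inherited-lock problem and
the os.register_at_fork() remedy described in the commit message:

import os
import threading
import time

# Stand-in for an internal oslo.db lock.
_db_lock = threading.Lock()

def _reinit_db_lock():
    # Runs in the child immediately after fork(); the inherited lock
    # may be held by a parent thread that does not exist in the child,
    # so replace it with a fresh, unlocked one.
    global _db_lock
    _db_lock = threading.Lock()

# os.register_at_fork() is new in Python 3.7, which is why nova could
# not rely on it while Python 3.6 was still supported.
os.register_at_fork(after_in_child=_reinit_db_lock)

def hold_lock_briefly():
    with _db_lock:
        time.sleep(1)

# A second thread holds the lock while the main thread forks: exactly
# the situation from https://bugs.python.org/issue6721.
t = threading.Thread(target=hold_lock_briefly)
t.start()
time.sleep(0.1)  # let the thread acquire the lock first
pid = os.fork()

if pid == 0:
    # Only the forking thread survives in the child. Without the
    # at-fork hook, this acquire would block forever: the inherited
    # lock is held and no thread remains to release it.
    with _db_lock:
        print('child acquired a fresh, unlocked lock', flush=True)
    os._exit(0)

t.join()
os.wait()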
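The nova workaround operates one level up: rather than reinitializing
locks at fork time, each worker discards the cached oslo.db transaction
context managers when it starts, so those objects (and their internal
locks) are rebuilt from scratch. A simplified, illustrative sketch
follows, assuming a module-level cell cache along the lines of nova's
context.CELL_CACHE; per the commit message, the real change clears the
cache in the service's start() and reset() methods:

# Maps cell UUIDs to oslo.db database transaction context managers; in
# nova this cache lives in nova/context.py. It is shown as a module
# global here to keep the sketch self-contained.
CELL_CACHE = {}

class Service(object):
    """Abbreviated sketch of a nova-style service process."""

    def start(self):
        global CELL_CACHE
        # Discard cached transaction context managers so that a freshly
        # forked worker rebuilds them, getting brand-new (and therefore
        # unlocked) internal oslo.db locks instead of locks inherited,
        # possibly held, from the parent process.
        CELL_CACHE = {}
        # ... the rest of normal service startup ...

    def reset(self):
        global CELL_CACHE
        # Called on SIGHUP; clearing here as well resolves part of the
        # TODO mentioned in the commit message.
        CELL_CACHE = {}

Rebuilding the cached context managers on demand is cheap compared to a
worker that can never reach its cells, and it sidesteps the
synchronization problem that periodic cache clearing would have
introduced across multiple services.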
** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844929

Title:
  grenade jobs failing due to "Timed out waiting for response from
  cell" in scheduler

Status in grenade:
  New
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Seen here:

  https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

  Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90

  It looks like something is causing timeouts reaching cell1 during
  grenade runs. The only errors I see in the rabbit logs are these, for
  the uwsgi (API) servers:

  =ERROR REPORT==== 22-Sep-2019::00:35:30 ===
  closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):
  missed heartbeats from client, timeout: 60s

  It looks like we don't have mysql logs in this grenade run; maybe we
  need a fix like this somewhere for grenade:

  https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

  logstash shows 1101 hits in the last 7 days (since Sept 17, actually):

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

  All of the hits are failures, in both the check and gate queues. The
  failures also appear to show up only on fortnebula and OVH nodes,
  primarily fortnebula. I wonder if there is a performance/timing issue
  on those nodes: if they are slower, we may not be waiting for
  something during the grenade upgrade before proceeding.

To manage notifications about this bug go to:
https://bugs.launchpad.net/grenade/+bug/1844929/+subscriptions