Mike, I think the exact error (InterfaceError vs. TimeoutException) varies depending on what code is being executed at the moment the process receives SIGALRM.
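To make that concrete, here is a minimal standalone sketch of the scenario I have in mind (this is *not* the actual fixtures.Timeout code; the DSN and the pg_sleep duration are placeholders for illustration): a handler raises from SIGALRM while psycopg2 is blocked inside cursor.execute(), and a later operation on the same connection then fails with InterfaceError.

    # Minimal sketch, NOT the real test fixture: emulate what fixtures.Timeout
    # does (raise an exception from a SIGALRM handler) while psycopg2 is busy
    # executing a query.  DSN and timings are placeholders.
    import signal

    import psycopg2


    class TimeoutException(Exception):
        pass


    def _on_alarm(signum, frame):
        # fixtures.Timeout raises from the signal handler in a similar way
        raise TimeoutException("test timed out")


    signal.signal(signal.SIGALRM, _on_alarm)

    conn = psycopg2.connect("dbname=test user=test")  # placeholder DSN
    cur = conn.cursor()

    signal.alarm(1)  # stands in for OS_TEST_TIMEOUT, just much shorter
    try:
        cur.execute("SELECT pg_sleep(10)")  # long-running statement
    except TimeoutException:
        # Depending on where the signal lands, psycopg2 may have already
        # closed the connection at this point (conn.closed becomes non-zero).
        print("interrupted; conn.closed = %r" % conn.closed)
    finally:
        signal.alarm(0)

    try:
        conn.rollback()  # roughly what the test teardown ends up doing
    except psycopg2.InterfaceError as exc:
        print("InterfaceError: %s" % exc)  # "connection already closed"

Which of the two exceptions surfaces first depends entirely on where in that sequence the alarm happens to fire.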
I tried to run the tests against PostgreSQL passing very small timeout values (OS_TEST_TIMEOUT=5 python -m testtools.run nova.tests.unit.db.test_migrations.TestNovaMigrationsPostgreSQL.test_walk_versions) and saw both InterfaceError and TimeoutException:

http://paste.openstack.org/show/577410/
http://paste.openstack.org/show/577411/

strace'ing shows that a connection to PostgreSQL is closed right after SIGALRM is handled:

http://paste.openstack.org/show/577425/

I tried to reproduce that manually by means of gdb, setting a breakpoint on close():

http://paste.openstack.org/show/577422/

^ it looks like psycopg2 closes the connection automatically if a query was interrupted by SIGALRM. The corresponding Python-level backtrace is:

http://paste.openstack.org/show/577423/

^ i.e. the connection closing happens in the middle of the cursor.execute() call. In the end I see a similar InterfaceError:

http://paste.openstack.org/show/577424/

That being said, this does not explain the "DB is in use" part.

Thanks,
Roman

On Thu, Sep 15, 2016 at 6:05 AM, Mike Bayer <mba...@redhat.com> wrote:
> There's a different set of logs attached to the launchpad issue, that's not
> what I was looking at before.
>
> These logs are at
> http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031
> In these logs, I see something *very* different: not just the MySQL tests
> but the Postgresql tests are definitely hitting conflicts against the
> randomly generated database.
>
> This set of traces, e.g.:
>
> sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database
> "dbzrtmgbxv" is being accessed by other users
> 2016-09-13 14:54:18.093723 | DETAIL: There is 1 other session using the
> database.
> 2016-09-13 14:54:18.093736 | [SQL: 'DROP DATABASE dbzrtmgbxv']
>
> and
>
>   File
> "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 668, in _rollback_impl
> 2016-09-13 14:54:18.095470 |     self.engine.dialect.do_rollback(self.connection)
> 2016-09-13 14:54:18.095513 |   File
> "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py",
> line 420, in do_rollback
> 2016-09-13 14:54:18.095526 |     dbapi_connection.rollback()
> 2016-09-13 14:54:18.095548 | sqlalchemy.exc.InterfaceError:
> (psycopg2.InterfaceError) connection already closed
>
> are a very different animal. For one thing, they're on Postgresql, where
> the driver and DB act extremely rationally. For another, there's no timeout
> exception here, and not all the conflicts are within the teardown.
>
> Are *these* errors also new as of version 4.13.3 of oslo.db? Because here
> I have more suspicion of one particular oslo.db change.
>
> This fits much more with your initial description.
>
> On 09/14/2016 10:48 PM, Mike Bayer wrote:
>>
>> On 09/14/2016 07:04 PM, Alan Pevec wrote:
>>>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>>>> think we need to strongly consider blocking it and revisiting these
>>>> issues post-Newton.
>>>
>>> So that means reverting all stable/newton changes; previous 4.13.x
>>> releases have already been blocked: https://review.openstack.org/365565
>>> How would we proceed, do we need to revert all backports on stable/newton?
>>
>> In case my previous email wasn't clear, I don't *yet* see evidence that
>> the recent 4.13.3 release of oslo.db is the cause of this problem.
>> However, that is based only on what I see in this stack trace, which is
>> that the test framework is acting predictably (though erroneously) based
>> on the timeout condition which is occurring. I don't (yet) see a reason
>> that the same effect would not occur prior to 4.13.3 in the face of a
>> signal pre-empting the work of the PyMySQL driver mid-stream. However,
>> this assumes that the timeout condition itself is not a product of the
>> current oslo.db version, and that is not known yet.
>>
>> There's a list of questions that should all be answerable and could give
>> some hints here.
>>
>> There are two parts to the error in the logs: the "timeout" condition,
>> and then the bad reaction of the PyMySQL driver and the test framework
>> as a result of the operation being interrupted within the test.
>>
>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition
>> occur? If so, was it also accompanied by the same "resource closed"
>> condition, or did this second part of the condition only appear at 4.13.3?
>>
>> * Did we see a similar "timeout" / "resource closed" combination prior
>> to 4.13.3, just with less frequency?
>>
>> * Was the version of PyMySQL also recently upgraded (I'm assuming this
>> environment has been on PyMySQL for a long time at this point)? If so,
>> what was the version change? Especially if we previously saw "timeout"
>> but no "resource closed", perhaps an older version of PyMySQL didn't
>> react in this way.
>>
>> * Was the version of MySQL running in the CI environment changed? If so,
>> what was the version change? Were there any configuration changes such
>> as transaction isolation, memory, or process settings?
>>
>> * Have there been changes to the "timeout" logic itself in the test
>> suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some
>> change that alters how teardown of tests occurs when a test is
>> interrupted via this timeout?
>>
>> * What is the magnitude of the "timeout" this fixture is using: is it on
>> the order of seconds, minutes, or hours?
>>
>> * If many minutes or hours, can the test suite be observed to be stuck
>> on this test? Has someone tried to run a "SHOW PROCESSLIST" while this
>> condition is occurring to see what SQL is pausing?
>>
>> * Has there been some change such that the migration tests are running
>> against non-empty tables or tables with much more data than was present
>> before?
>>
>> * Is this failure only present within the Nova test suite, or has it
>> been observed in the test suites of other projects?
>>
>> * Is this failure present only in the "database migration" test suite,
>> or is it present in other opportunistic tests, for Nova and others?
>>
>> * Have there been new database migrations added to Nova which are being
>> exercised here and may be involved?
>>
>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
>> downgrading it is feasible, that would at least be a way to eliminate it
>> as a possibility if these same failures continue to occur, or a way to
>> confirm its involvement if they disappear. But if downgrading is
>> disruptive, then there are other things to look at in order to have a
>> better chance at predicting its involvement.
>>> Cheers,
>>> Alan

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev