Some more comments inline.

Salvatore

On 16 June 2015 at 19:00, Carl Baldwin <c...@ecbaldwin.net> wrote:

> On Tue, Jun 16, 2015 at 12:33 AM, Kevin Benton <blak...@gmail.com> wrote:
> >>Do these kinds of test even make sense? And are they feasible at all? I
> >> doubt we have any framework for injecting anything in neutron code under
> >> test.
> >
> > I was thinking about this in the context of a lot of the fixes we have for
> > other concurrency issues with the database. There are several exception
> > handlers that aren't exercised in normal functional, tempest, and API tests
> > because they require a very specific order of events between workers.
> >
> > I wonder if we could write a small shim DB driver that wraps the python one
> > for use in tests that just makes a desired set of queries take a long time
> > or fail in particular ways? That wouldn't require changes to the neutron
> > code, but it might not give us the right granularity of control.
>
> Might be worth a look.

It's a solution for pretty much mocking out the DB interactions. This would
work for fault injection on most neutron-server scenarios, both for RESTful
and RPC interfaces, but we'll need something else to "mock" interactions
with the data plane that are performed by agents. I think we already have a
mock for the AMQP bus on which we shall just install hooks for injecting
faults.

> >>Finally, please note I am using DB-level locks rather than non-locking
> >> algorithms for making reservations.
> >
> > I thought these were effectively broken in Galera clusters. Is that not
> > correct?
>
> As I understand it, if two writes to two different masters end up
> violating some db-level constraint then the operation will cause a
> failure regardless if there is a lock.
>
> Basically, on Galera, instead of waiting for the lock, each will
> proceed with the transaction. Finally, on commit, a write
> certification will double check constraints with the rest of the
> cluster (with a write certification).
> It is at this point where
> Galera will fail one of them as a deadlock for violating the
> constraint. Hence the need to retry. To me, non-locking just means
> that you embrace the fact that the lock won't work and you don't
> bother to apply it in the first place.

This is correct. DB-level locks are broken in Galera. As Carl says, write
sets are sent out for certification after a transaction is committed. So
the write-intent lock, or even primary key constraint violations, cannot be
verified before committing the transaction. As a result you incur a write
set certification failure, which is notably more expensive than an
instance-level rollback, and manifests as a DBDeadlock exception to the
OpenStack service.

Retrying a transaction is also a way of embracing this behaviour: you just
accept the idea of having to go through write set certification and
occasionally fail it. Non-locking approaches instead aim at avoiding write
set certification failures. The downside is that, especially in
high-concurrency scenarios, the operation is retried many times, and this
might become even more expensive than dealing with the write set
certification failure.

But zzzeek (Mike Bayer) is coming to our help; as a part of his DBFacade
work, we should be able to treat an active/active cluster as active/passive
for writes, and active/active for reads. This means that the write set
certification issue just won't show up, and the benefits of active/active
clusters will still be attained for most operations (I don't think there's
any doubt that SELECT operations represent the majority of all DB
statements).

> If my understanding is incorrect, please set me straight.

You're already straight enough ;)

> > If you do go that route, I think you will have to contend with DBDeadlock
> > errors when we switch to the new SQL driver anyway. From what I've observed,
> > it seems that if someone is holding a lock on a table and you try to grab
> > it, pymysql immediately throws a deadlock exception.
>
> I'm not familiar with pymysql to know if this is true or not. But,
> I'm sure that it is possible not to detect the lock at all on galera.
> Someone else will have to chime in to set me straight on the details.

DBDeadlocks without multiple workers also suggest we should look closely at
what eventlet is doing before placing the blame on pymysql. I don't think
that the switch to pymysql is changing the behaviour of the database
interface; I think it's changing the way in which neutron interacts with
the database, thus unveiling concurrency issues that we did not spot before,
as we were relying on a sort of implicit locking triggered by the fact that
some parts of MySQL-Python were implemented in C.

> Carl

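To make the shim and retry ideas discussed above a bit more concrete, here
is a rough sketch in plain Python. It is not neutron or oslo.db code:
FakeDBDeadlock, FaultInjectingShim, retry_on_deadlock and make_reservation
are all hypothetical names invented for illustration, and the "query" is
just a callable. The shim makes the first N calls fail the way a write set
certification failure would surface, and the retry loop shows the "embrace
the failure and retry" approach.

```python
import time


class FakeDBDeadlock(Exception):
    """Hypothetical stand-in for the deadlock error seen by the service."""


class FaultInjectingShim:
    """Wrap a callable 'query' and make its first `failures` calls fail."""

    def __init__(self, query, failures):
        self._query = query
        self._failures_left = failures
        self.calls = 0  # total number of times the shim was invoked

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self._failures_left > 0:
            # Simulate a certification failure reported as a deadlock.
            self._failures_left -= 1
            raise FakeDBDeadlock("injected failure")
        return self._query(*args, **kwargs)


def retry_on_deadlock(operation, max_retries=5, delay=0.0):
    """Run operation(); retry up to max_retries times on FakeDBDeadlock."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except FakeDBDeadlock:
            if attempt == max_retries:
                raise  # retries exhausted, let the caller see the error
            time.sleep(delay)


def make_reservation():
    """Hypothetical transaction body; a real one would talk to the DB."""
    return "reserved"
```

For instance, retry_on_deadlock(FaultInjectingShim(make_reservation,
failures=2)) exercises the exception handler twice before succeeding, which
is the kind of path that normal functional tests rarely reach. A real shim
would have to sit at the DB driver level, and whether that gives enough
granularity is exactly the open question above.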
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev