Some more comments inline.

Salvatore

On 16 June 2015 at 19:00, Carl Baldwin <c...@ecbaldwin.net> wrote:

> On Tue, Jun 16, 2015 at 12:33 AM, Kevin Benton <blak...@gmail.com> wrote:
> >>Do these kinds of test even make sense? And are they feasible at all? I
> >> doubt we have any framework for injecting anything in neutron code under
> >> test.
> >
> > I was thinking about this in the context of a lot of the fixes we have for
> > other concurrency issues with the database. There are several exception
> > handlers that aren't exercised in normal functional, tempest, and API tests
> > because they require a very specific order of events between workers.
> >
> > I wonder if we could write a small shim DB driver that wraps the python one
> > for use in tests that just makes a desired set of queries take a long time
> > or fail in particular ways? That wouldn't require changes to the neutron
> > code, but it might not give us the right granularity of control.
>
> Might be worth a look.
>

This is essentially a way of mocking out the DB interactions. It would
work for fault injection in most neutron-server scenarios, for both the
RESTful and RPC interfaces, but we'll need something else to "mock" the
interactions with the data plane that are performed by agents. I think we
already have a mock for the AMQP bus, on which we could simply install hooks
for injecting faults.
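
To make the shim idea a bit more concrete, here is a rough sketch of what
Kevin describes: a wrapper around a DB-API connection that delays or fails
statements matching a rule. Everything in it (InjectedFault, the rule tuples,
the "ports" table, the use of sqlite3 as the wrapped driver) is made up for
illustration and is not existing neutron or oslo.db code.

```python
import sqlite3
import time


class InjectedFault(Exception):
    """Hypothetical error raised by the shim instead of a real driver error."""


class FaultInjectingCursor:
    """Wraps a DB-API cursor; delays or fails statements matching a rule."""

    def __init__(self, cursor, rules):
        self._cursor = cursor
        # Each rule is (substring, delay_in_seconds, exception_or_None).
        self._rules = rules

    def execute(self, sql, params=()):
        for needle, delay, exc in self._rules:
            if needle in sql:
                if delay:
                    time.sleep(delay)  # simulate a slow query
                if exc is not None:
                    raise exc          # simulate a failing query
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        # fetchone, fetchall, close, ... are passed to the real cursor.
        return getattr(self._cursor, name)


class FaultInjectingConnection:
    """Wraps a DB-API connection so that its cursors inject faults."""

    def __init__(self, conn, rules):
        self._conn = conn
        self._rules = rules

    def cursor(self):
        return FaultInjectingCursor(self._conn.cursor(), self._rules)

    def __getattr__(self, name):
        return getattr(self._conn, name)


def demo():
    # Fail every INSERT into the (hypothetical) ports table.
    rules = [("INSERT INTO ports", 0, InjectedFault("injected failure"))]
    conn = FaultInjectingConnection(sqlite3.connect(":memory:"), rules)
    cur = conn.cursor()
    cur.execute("CREATE TABLE ports (id INTEGER PRIMARY KEY)")
    try:
        cur.execute("INSERT INTO ports (id) VALUES (?)", (1,))
        failed = False
    except InjectedFault:
        failed = True
    cur.execute("SELECT COUNT(*) FROM ports")
    return failed, cur.fetchone()[0]
```

Matching on a substring of the SQL text is crude, which is probably where the
"granularity of control" concern comes from: the shim cannot tell which
worker or API call issued the statement.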


> >>Finally, please note I am using DB-level locks rather than non-locking
> >> algorithms for making reservations.
> >
> > I thought these were effectively broken in Galera clusters. Is that not
> > correct?
>
> As I understand it, if two writes to two different masters end up
> violating some db-level constraint then the operation will cause a
> failure regardless of whether there is a lock.
>


> Basically, on Galera, instead of waiting for the lock, each will
> proceed with the transaction.  Finally, on commit, a write
> certification will double check constraints with the rest of the
> cluster.  It is at this point that Galera will fail one of them as
> a deadlock for violating the constraint.  Hence the need to retry.
> To me, non-locking just means that you embrace the fact that the
> lock won't work and you don't bother to apply it in the first place.
>

This is correct.

DB-level locks are broken in Galera. As Carl says, write sets are sent out
for certification when a transaction is committed.
So the write intent lock, or even primary key constraint violations, cannot
be verified before committing the transaction.
As a result you incur a write set certification failure, which is notably
more expensive than an instance-level rollback, and manifests as a
DBDeadlock exception to the OpenStack service.

Retrying a transaction is also a way of embracing this behaviour: you
just accept the idea of having to react to write set certification failures.
Non-locking approaches instead aim at avoiding write set certification
failures. The downside is that, especially in high-concurrency scenarios,
the operation is retried many times, and this might become even more
expensive than dealing with the write set certification failure.
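
As an illustration of the "just retry" approach, here is a minimal standalone
sketch. The DBDeadlock class below is only a stand-in for the exception the
service sees, and retry_on_deadlock, make_reservation and the retry/backoff
values are hypothetical names and numbers chosen for the example; this is not
the oslo.db helper.

```python
import functools
import random
import time


class DBDeadlock(Exception):
    """Stand-in for the deadlock error surfaced to the service."""


def retry_on_deadlock(max_retries=5, base_delay=0.01):
    """Re-run the decorated function when it raises DBDeadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        raise  # give up and let the caller see the failure
                    # Randomised exponential backoff, so that competing
                    # workers do not all retry at the same instant.
                    time.sleep(random.uniform(0, base_delay * 2 ** attempt))
        return wrapper
    return decorator


calls = {"n": 0}


@retry_on_deadlock(max_retries=3, base_delay=0.001)
def make_reservation():
    # Hypothetical operation: fails twice, as if certification had failed,
    # then succeeds on the third attempt.
    calls["n"] += 1
    if calls["n"] < 3:
        raise DBDeadlock("write set certification failed")
    return "reserved"
```

The whole transaction has to live inside the decorated function, since a
failed certification rolls back everything and each retry starts from scratch.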

But zzzeek (Mike Bayer) is coming to our help; as a part of his DBFacade
work, we should be able to treat an active/active cluster as active/passive
for writes, and active/active for reads. This means that the write set
certification issue just won't show up, and the benefits of active/active
clusters will still be attained for most operations (I don't think there's
any doubt that SELECT operations represent the majority of all DB
statements).
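
To show what I mean by active/passive for writes and active/active for reads,
here is a toy sketch. It is not the DBFacade design; ClusterRouter, node_for
and the node names are invented for the example, and sending
SELECT ... FOR UPDATE to the writer is my own assumption about where locking
reads should go.

```python
import itertools


class ClusterRouter:
    """Hypothetical router: one node takes all writes, all nodes serve reads."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._writer = nodes[0]                 # single node for all writes
        self._readers = itertools.cycle(nodes)  # round-robin over all nodes

    def node_for(self, statement):
        words = statement.split()
        verb = words[0].upper() if words else ""
        # Plain SELECTs can go anywhere; locking reads are treated as writes.
        if verb == "SELECT" and "FOR UPDATE" not in statement.upper():
            return next(self._readers)
        return self._writer


router = ClusterRouter(["node-a", "node-b", "node-c"])
```

With all writes funnelled to a single node, conflicting transactions meet on
that node's ordinary locks instead of at certification time.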


> If my understanding is incorrect, please set me straight.
>

You're already straight enough ;)


>
> > If you do go that route, I think you will have to contend with DBDeadlock
> > errors when we switch to the new SQL driver anyway. From what I've observed,
> > it seems that if someone is holding a lock on a table and you try to grab
> > it, pymysql immediately throws a deadlock exception.
>

> I'm not familiar enough with pymysql to know if this is true or not.  But,
> I'm sure that it is possible not to detect the lock at all on Galera.
> Someone else will have to chime in to set me straight on the details.
>

DBDeadlocks without multiple workers also suggest we should look closely at
what eventlet is doing before placing the blame on pymysql. I don't think
that the switch to pymysql is changing the behaviour of the database
interface; I think it's changing the way in which neutron interacts with the
database, thus unveiling concurrency issues that we did not spot before, as
we were relying on a sort of implicit locking triggered by the fact that
some parts of MySQL-Python were implemented in C.


>
> Carl
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>