On 12/20/2013 11:34 AM, Clint Byrum wrote:
Excerpts from Radomir Dopieralski's message of 2013-12-20 01:13:20 -0800:
On 20/12/13 00:17, Jay Pipes wrote:
On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
On 14/12/13 16:51, Jay Pipes wrote:

[snip]

Instead of focusing on locking issues -- which I agree are very
important in the virtualized side of things where resources are
"thinner" -- I believe that in the bare-metal world, a more useful focus
would be to ensure that the Tuskar API service treats related group
operations (like "deploy an undercloud on these nodes") in a way that
can handle failures in a graceful and/or atomic way.

Atomicity of operations can be achieved by introducing critical sections.
You basically have two ways of doing that, optimistic and pessimistic.
A pessimistic critical section is implemented with a locking mechanism
that prevents all other processes from entering the critical section
until it is finished.
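
To make that concrete, here is roughly what I mean, sketched against a
made-up "nodes" table in a shared database (SQLite here only to keep
the sketch self-contained):

    # Pessimistic variant: BEGIN IMMEDIATE takes the database write
    # lock up front, so any other process entering this section blocks
    # until we commit. The "nodes" table and its columns are made up.
    import sqlite3

    def reserve_nodes(db_path, deployment_id, count):
        conn = sqlite3.connect(db_path, timeout=30)
        conn.isolation_level = None  # manage transactions ourselves
        try:
            conn.execute("BEGIN IMMEDIATE")  # lock out other writers
            free = conn.execute(
                "SELECT id FROM nodes WHERE deployment_id IS NULL"
                " LIMIT ?", (count,)).fetchall()
            if len(free) < count:
                conn.rollback()
                raise RuntimeError("not enough free nodes")
            conn.executemany(
                "UPDATE nodes SET deployment_id = ? WHERE id = ?",
                [(deployment_id, node_id) for (node_id,) in free])
            conn.commit()  # lock released here
            return [node_id for (node_id,) in free]
        finally:
            conn.close()

Every deployment has to pass through that one section, so two of them
can never reserve the same node, at the cost of serializing all
reservations.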

I'm familiar with the traditional non-distributed software concept of a
mutex (or, in the Windows world, a critical section). But we aren't dealing
with traditional non-distributed software here. We're dealing with
highly distributed software where components involved in the
"transaction" may not be running on the same host or have much awareness
of each other at all.

Yes, that is precisely why you need to have a single point where they
can check if they are not stepping on each other's toes. If you don't,
you get race conditions and non-deterministic behavior. The only
difference from traditional, non-distributed software is that, since
the components involved are communicating over a relatively slow
network, you have a much, much greater chance of actually having a
conflict. Scaling the whole thing to hundreds of nodes practically
guarantees trouble.
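
The optimistic variant of the same check holds no lock at all: it
claims a node with a compare-and-swap style UPDATE and treats an
unchanged row count as a conflict to retry. Same made-up schema as
above:

    # Optimistic variant: a conditional UPDATE either claims the node
    # or tells us that somebody else got there first.
    import sqlite3

    def claim_node(db_path, node_id, deployment_id):
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.execute(
                "UPDATE nodes SET deployment_id = ?"
                " WHERE id = ? AND deployment_id IS NULL",
                (deployment_id, node_id))
            conn.commit()
            return cur.rowcount == 1  # False: lost the race
        finally:
            conn.close()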


Radomir, what Jay is suggesting is that it seems pretty unlikely that
two individuals would be given a directive to deploy OpenStack into a
single pool of hardware at such a scale that they would both use the
whole thing.

Worst case, if it does happen, they both run out of hardware, one
individual deletes their deployment, the other one resumes. This is the
optimistic position, and it will work fine. Assuming you are driving
this all through Heat (which, AFAIK, Tuskar still does), there's even
a blueprint I'm working on to support you:

https://blueprints.launchpad.net/heat/+spec/retry-failed-update

Even if both operators put the retry in a loop, one would actually
finish at some point.
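
The loop itself can be completely dumb. A sketch, assuming an
already-authenticated python-heatclient client and the resume-on-retry
behaviour that blueprint adds:

    import time

    def update_until_complete(heat, stack_id, template, retries=3):
        # 'heat' is an authenticated heatclient Client; template and
        # parameter handling are elided.
        for attempt in range(retries):
            heat.stacks.update(stack_id, template=template)
            while True:
                stack = heat.stacks.get(stack_id)
                if stack.stack_status == 'UPDATE_COMPLETE':
                    return stack
                if stack.stack_status == 'UPDATE_FAILED':
                    break  # re-issue the update and resume
                time.sleep(10)
        raise RuntimeError('stack update did not converge')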

Yes, thank you Clint. That is precisely what I was saying.

Trying to make a complex series of related but distributed actions --
like the underlying actions of the Tuskar -> Ironic API calls -- into an
atomic operation is just not a good use of programming effort, IMO.
Instead, I'm advocating that the programming effort be spent on coding
a workflow/taskflow pipeline that can gracefully retry failed
operations and report the state of the total taskflow back to the user.
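
Something along these lines, using the taskflow library; the tasks and
their inputs are made up, but the shape is the point: each step knows
how to undo itself, and the engine tracks the state of the whole flow
instead of scattering it across callers:

    # A minimal taskflow pipeline; task names and inputs are
    # illustrative only.
    import taskflow.engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class ReserveNodes(task.Task):
        def execute(self, node_ids):
            print('reserving %s' % (node_ids,))

        def revert(self, node_ids, **kwargs):
            # Runs if any later task fails: give the nodes back.
            print('releasing %s' % (node_ids,))

    class DeployUndercloud(task.Task):
        def execute(self, node_ids):
            print('deploying to %s' % (node_ids,))

    flow = linear_flow.Flow('deploy-undercloud')
    flow.add(ReserveNodes(), DeployUndercloud())
    taskflow.engines.run(flow, store={'node_ids': ['node-1', 'node-2']})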

Sure, there are many ways to solve any particular synchronisation
problem. Let's say that we have one that can actually be solved by
retrying. Do you want to retry infinitely? Would you like to increase
the delays between retries exponentially? If so, where are you going to
keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
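
For the record, this is the shape of the loop I am talking about; the
attempt counter below is local to a single caller, and my question is
where the equivalent state lives once several components are retrying
the same operation:

    # Bounded retries with exponential backoff and jitter; the
    # operation and the limits are placeholders.
    import random
    import time

    def retry_with_backoff(operation, attempts=5, base=1.0, cap=60.0):
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise
                delay = min(cap, base * 2 ** attempt)
                time.sleep(random.uniform(0, delay))  # avoid lockstep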


I don't think a sane person would retry more than maybe once without
checking with the other operators.

Or are you just saying that we should pretend that the nondeterministic
bugs appearing due to the lack of synchronization simply don't exist?
They cannot be easily reproduced, after all. We could just close our
eyes, cover our ears, sing "lalalala" and close any bug reports with
such errors with "could not reproduce on my single-user, single-machine
development installation". I know that a lot of software companies do
exactly that, so I guess it's a valid business practice, I just want to
make sure that this is actually the tactic that we are going to take,
before commiting to an architectural decision that will make those bugs
impossible to fix.


OpenStack is non-deterministic. Deterministic systems are rigid and
unable to handle any real diversity of failure modes. We tend to err
toward pushing problems back to the user and giving them the tools to
resolve the problem. Avoiding spurious problems is important too, no
doubt. However, what Jay has been suggesting is that the situation a
pessimistic locking system would avoid is entirely user-created, and
thus lower priority than, say, actually having a complete UI for
deploying OpenStack.

Bingo.

Thanks,
-jay

