May I propose we keep the conversation Icehouse-related? I don't think we can build any sort of locking
mechanism in I.

Though it would be worth creating some wiki page that presents the whole thing in a consistent
manner. I am kind of lost in these emails. :-)

So, what do you think are the biggest issues for the Icehouse tasks we have?

1. GET operations?
I don't think we need to be atomic here. We basically join resources from multiple APIs together. I think it's perfectly fine if something gets deleted in the process. Even right now we only join together the things that exist, and we can handle it when something is missing (see the sketch below). There is no need for locking or retrying here AFAIK.
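To illustrate what I mean (just a sketch; the client objects and call names
below are made up for the example, not actual Tuskar code):

    # Join Ironic nodes with their Nova instances, tolerating deletions
    # that happen between the two calls. No locking, no retrying.
    def list_nodes_with_instances(ironic, nova):
        result = []
        for node in ironic.node.list():
            instance = None
            if node.instance_uuid:
                try:
                    instance = nova.servers.get(node.instance_uuid)
                except Exception:
                    # The instance may have been deleted in the meantime;
                    # that's fine, we just show the node without it.
                    instance = None
            result.append((node, instance))
        return result

If something disappears halfway through, the user simply sees the node
without it on the next refresh.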

2. Heat stack create, update
The stack is locked for the duration of the operation, so nobody can mess with it while it is creating or updating. Once we pack all the operations that are currently done on the side into this, we should be alright (sketched below). And that should be doable in I. So we should push towards this rather than building some temporary locking solution in Tuskar-API.
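Roughly what I have in mind (a sketch only, with python-heatclient-style
calls; the status names and parameters are simplified):

    # Rely on Heat's own stack locking: refuse to start a new update while
    # a create/update is already in progress.
    IN_PROGRESS = ('CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS')

    def update_overcloud(heat, stack_id, template, parameters):
        stack = heat.stacks.get(stack_id)
        if stack.stack_status in IN_PROGRESS:
            # Heat is already holding the 'lock'; bail out instead of
            # racing it.
            raise RuntimeError("Stack %s is busy (%s), try again later"
                               % (stack_id, stack.stack_status))
        heat.stacks.update(stack_id, template=template,
                           parameters=parameters)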

3. Reservation of resources
As we can deploy only one stack now, I don't think multiple users should be a problem there. When somebody deletes resources from the 'free pool' while the deploy is in progress, it will simply fail with 'Not enough free resources'.
I guess that is fine.
Also, I'm not sure how it works now, but it should be possible to deploy smartly, so the stack keeps working even with a smaller amount of resources. Then we would just run heat stack-update with the numbers it ended up with (see the sketch below),
and it would switch to OK status without changing anything.
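Something along these lines (a hypothetical sketch; 'compute_count' is an
invented parameter name, not a real template parameter):

    # If the deploy ended up with fewer nodes than requested, re-run the
    # update with the count we actually got, so Heat marks the stack OK
    # without changing anything.
    def settle_on_actual_count(heat, stack_id, template, requested, actual):
        if actual < requested:
            heat.stacks.update(stack_id, template=template,
                               parameters={'compute_count': actual})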

So, are there any other critical sections you see?

I know we did it the wrong way in the previous Tuskar-API, and I think we are avoiding that now. And we will avoid it in the future, simply by not doing this kind of thing until there is a proper way to do it.

Thanks,
Ladislav


On 12/20/2013 10:13 AM, Radomir Dopieralski wrote:
On 20/12/13 00:17, Jay Pipes wrote:
On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
On 14/12/13 16:51, Jay Pipes wrote:

[snip]

Instead of focusing on locking issues -- which I agree are very
important in the virtualized side of things where resources are
"thinner" -- I believe that in the bare-metal world, a more useful focus
would be to ensure that the Tuskar API service treats related group
operations (like "deploy an undercloud on these nodes") in a way that
can handle failures in a graceful and/or atomic way.
Atomicity of operations can be achieved by introducing critical sections.
You basically have two ways of doing that, optimistic and pessimistic.
A pessimistic critical section is implemented with a locking mechanism
that prevents all other processes from entering the critical section
until it is finished.
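For example (a toy, single-process illustration in Python, just to keep
the two terms straight -- nothing Tuskar-specific):

    import threading

    _lock = threading.Lock()

    def pessimistic_update(store, key, new_value):
        # Take the lock up front; nobody else can enter meanwhile.
        with _lock:
            store[key] = new_value

    def optimistic_update(store, versions, key, new_value, seen_version):
        # Do the work only if nothing changed since we read it; otherwise
        # fail and let the caller retry.
        if versions[key] != seen_version:
            raise RuntimeError("conflict, please retry")
        store[key] = new_value
        versions[key] += 1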
I'm familiar with the traditional non-distributed software concept of a
mutex (or in Windows world, a critical section). But we aren't dealing
with traditional non-distributed software here. We're dealing with
highly distributed software where components involved in the
"transaction" may not be running on the same host or have much awareness
of each other at all.
Yes, that is precisely why you need to have a single point where they
can check if they are not stepping on each other's toes. If you don't,
you get race conditions and non-deterministic behavior. The only
difference from traditional, non-distributed software is that since the
components involved are communicating over a relatively slow network,
you have a much, much greater chance of actually having a conflict.
Scaling the whole thing to hundreds of nodes practically guarantees trouble.

And, in any case (see below), I don't think that this is a problem that
needs to be solved in Tuskar.

Perhaps you have some other way of making them atomic that I can't
think of?
I should not have used the term atomic above. I actually do not think
that the things that Tuskar/Ironic does should be viewed as an atomic
operation. More below.
OK, no operations performed by Tuskar need to be atomic, noted.

For example, if the construction or installation of one compute worker
failed, adding some retry or retry-after-wait-for-event logic would be
more useful than trying to put locks in a bunch of places to prevent
multiple sysadmins from trying to deploy on the same bare-metal nodes
(since it's just not gonna happen in the real world, and IMO, if it did
happen, the sysadmins/deployers should be punished and have to clean up
their own mess ;)
I don't see why they should be punished, if the UI was assuring them
that they are doing exactly the thing that they wanted to do, at every
step, and in the end it did something completely different, without any
warning. If anyone deserves punishment in such a situation, it's the
programmers who wrote the UI in such a way.
The issue I am getting at is that, in the real world, the problem of
multiple users of Tuskar attempting to deploy an undercloud on the exact
same set of bare metal machines is just not going to happen. If you
think this is actually a real-world problem, and have seen two sysadmins
actively trying to deploy an undercloud on bare-metal machines at the
same time, unbeknownst to each other, then I feel bad for the
sysadmins that found themselves in such a situation, but I feel it's
their own fault for not knowing what the other was doing.
How can it be their fault, when at every step of their interaction with
the user interface, the user interface was assuring them that they are
going to do the right thing (deploy a certain set of nodes), but when
they finally hit the confirmation button, did a completely different
thing (deployed a different set of nodes)? The only fault I see is in
them using such software. Or are you suggesting that they should
implement the lock themselves, through e-mails or some other means of
communication?

Don't get me wrong, the deploy button is just one easy example of this
problem. We have it all over the user interface. Even such a simple
operation as retrieving a list of node ids and then displaying the
corresponding information to the user has a race condition in it -- what
if some of the nodes get deleted after we get the list of ids, but
before we make the call to get the node details? This should be
done as an atomic operation that either locks, or fails if there was a
change in the middle of it, and since the calls go to different
systems, the only place where you can set a lock or check if there was a
change is the tuskar-api. And no, requesting the information about a
deleted node again won't help -- you can keep retrying for years, and
the node will still remain deleted. This is all over the place. And
saying "this is the user's fault" doesn't help.

Trying to make a complex series of related but distributed actions --
like the underlying actions of the Tuskar -> Ironic API calls -- into an
atomic operation is just not a good use of programming effort, IMO.
Instead, I'm advocating that programming effort should instead be spent
coding a workflow/taskflow pipeline that can gracefully retry failed
operations and report the state of the total taskflow back to the user.
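A bare-bones version of that retry logic, in plain Python (a real
implementation would presumably build on something like taskflow rather
than hand-rolling it; the names and delays below are illustrative only):

    import time

    def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up and report the failure to the user
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff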
Sure, there are many ways to solve any particular synchronisation
problem. Let's say that we have one that can actually be solved by
retrying. Do you want to retry infinitely? Would you like to increase
the delays between retries exponentially? If so, where are you going to
keep the shared counters for the retries? Perhaps in tuskar-api, hmm?

Or are you just saying that we should pretend that the nondeterministic
bugs appearing due to the lack of synchronization simply don't exist?
They cannot be easily reproduced, after all. We could just close our
eyes, cover our ears, sing "lalalala" and close any bug reports with
such errors with "could not reproduce on my single-user, single-machine
development installation". I know that a lot of software companies do
exactly that, so I guess it's a valid business practice, I just want to
make sure that this is actually the tactic that we are going to take,
before committing to an architectural decision that will make those bugs
impossible to fix.


