On 03/06/2015 03:19 PM, Attila Fazekas wrote:
> Looks like we need some kind of _per compute node_ mutex in the critical
> section: multiple schedulers MAY schedule to two different compute nodes at
> the same time, but not to the same compute node.
>
> If we don't want to introduce another required component or reinvent the
> wheel, there are some possible tricks with the existing globally visible
> components, such as the RDBMS.
>
> A `randomized` destination choice is recommended in most of the possible
> solutions; the alternatives are much more complex.
>
> One SQL example:
>
> * Add a `sched_cnt` integer field, default=0, to the hypervisor-related
>   table.
>
> When the scheduler picks one (or multiple) node(s), it needs to verify that
> the node(s) are still good before sending the message to n-cpu.
>
> This can be done by re-reading ONLY the picked hypervisor(s)' related data
> with `LOCK IN SHARE MODE`.
> If the destination hypervisors are still OK:
>
> Increase the sched_cnt value by exactly 1, and test whether the UPDATE
> really updated the required number of rows; the WHERE part needs to contain
> the previous value.
>
> You also need to update the resource usage on the hypervisor by the
> expected cost of the new VMs.
>
> If at least one selected node was OK, the transaction can be COMMITted.
> If you were able to COMMIT the transaction, the relevant messages can be
> sent.
>
> The whole process needs to be repeated with the items which did not pass
> the post-verification.
>
> If sending a message failed, `act like` migrating the VM to another host.
>
> If multiple schedulers try to pick multiple different hosts in a different
> order, it can lead to a DEADLOCK situation.
> Solution: have all schedulers acquire the shared RW locks in the same
> order at the end.
>
> Galera multi-writer (Active-Active) implication:
> As always, retry on deadlock.
>
> n-sch + n-cpu crash at the same time:
> * If the scheduling did not finish properly, it might be fixed manually,
>   or we need to work out which still-alive scheduler instance is
>   responsible for fixing the particular scheduling.
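
To make sure I am parsing the proposal right, here is roughly the transaction
I think is being described (a sketch only - the table/column names and the
:placeholders are mine, for illustration, not the actual Nova schema):

    BEGIN;

    -- Re-read ONLY the picked hypervisor's row, taking a shared lock on it
    SELECT sched_cnt, memory_mb_used
      FROM compute_nodes
     WHERE id = :picked_node_id
      LOCK IN SHARE MODE;

    -- If the node still looks good, bump sched_cnt (a compare-and-swap on
    -- the value read above) and add the expected cost of the new VM
    UPDATE compute_nodes
       SET sched_cnt = sched_cnt + 1,
           memory_mb_used = memory_mb_used + :instance_memory_mb
     WHERE id = :picked_node_id
       AND sched_cnt = :sched_cnt_read_above;

    -- An affected row count of 1 means the claim went through; 0 means
    -- another scheduler raced us on this node and we have to re-pick/retry
    COMMIT;
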
So if I am reading the above correctly - you are basically proposing to move
claims to the scheduler: we would atomically check whether anything changed
since the time we picked the host (the UPDATE .. WHERE on the previously read
value, using LOCK IN SHARE MODE and assuming REPEATABLE READ is the isolation
level in use) and update the usage in the same transaction, a.k.a. do the
claim in the scheduler.

The issue here is that we still have a window between sending the message and
the message getting picked up by the compute host (or timing out), or the
instance outright failing, so we will for sure need to ack/nack the claim in
some way on the compute side.

I believe something like this has come up before under the umbrella term of
"moving claims to the scheduler", and was discussed in some detail at the
latest Nova mid-cycle meetup, but the only artifacts I could find were a few
lines on the etherpad Sylvain pointed me to [1], which I am copying here:

"""
* White board the scheduler service interface
** note: this design won't change the existing way/logic of reconciling
   nova db != hypervisor view
** gantt should just return claim ids, not entire claim objects
** claims are acked as being in use via the resource tracker updates from
   nova-compute
** we still need scheduler retries for exceptional situations (admins doing
   things outside openstack, hardware changes / failures)
** retry logic in conductor? probably a separate item/spec
"""

As you can see - not much to go on (but that is material for a separate
thread that I may start soon).

The problem I have with this particular approach is that while it claims to
fix some of the races (and probably does), it does so by 1) turning the
current scheduling mechanism on its head and 2) not giving much thought to
the trade-offs it makes. For example, we may get more correct scheduling in
the general case, and correctness will no longer depend on the number of
workers, but how does doing locking DB access on every request fare against
the retry mechanism for some of the more common usage patterns? What is the
added overhead of calling back to the scheduler to confirm the claim? In the
end - how do we even measure that we are going in the right direction with
the new design?

I personally think that different workloads will have different needs from
the scheduler in terms of response times and tolerance to failure, and that
we need to design for that. As an example, a cloud operator with very simple
scheduling requirements may want to go for the no-locking approach and
optimize for response times, accepting that a small number of instances will
fail under high load/utilization due to retries, while others with more
complicated scheduling requirements, or less tolerance for data
inconsistency, might want to trade response times for locking claims in the
scheduler.

Some similar trade-offs, and how to deal with them, are discussed in [2].

N.

[1] https://etherpad.openstack.org/p/kilo-nova-midcycle
[2] http://research.google.com/pubs/pub41684.html
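
P.S. Just to be explicit about the ack/nack part above (again only a sketch,
with the same made-up names as before): if the spawn succeeds, the claim is
effectively confirmed by the normal resource tracker update from
nova-compute, but if the spawn fails or the message is never picked up,
something has to undo the usage the scheduler's transaction added, along the
lines of:

    -- nack: spawn failed or the message timed out, release the claimed usage
    UPDATE compute_nodes
       SET memory_mb_used = memory_mb_used - :instance_memory_mb
     WHERE id = :picked_node_id;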