Thanks Matt, good points raised here. I think at minimum we should address the single-node case by @synchronized-ing all access to the image cache data on the datastore, in a way that is fairly easy to reason about. Spawn failures are not desirable, but they are acceptable in edge cases if a subsequent retry on the same or a different node can succeed. What we absolutely need to avoid is a partially deleted image cache directory rendering future spawns impossible.
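To make the single-node idea concrete, here is a minimal sketch of @synchronized-ing image cache access with a per-key lock. This is stdlib-only and the names (synchronized, use_image_cache_dir, age_image_cache_dir, the "image-abc123" key) are hypothetical illustrations, not Nova's actual API — Nova's own synchronized decorator would be the real tool here.

```python
# Minimal single-node sketch: serialize all access to a given image-cache
# entry behind a per-key lock, so ageing and spawn() cannot interleave.
# All names below are hypothetical, not Nova's API.
import threading
from collections import defaultdict

_image_locks = defaultdict(threading.Lock)  # one lock per cache entry
_registry_lock = threading.Lock()           # guards the lock table itself

def synchronized(key):
    """Decorator: run the wrapped function while holding the lock for key."""
    def wrap(fn):
        def inner(*args, **kwargs):
            with _registry_lock:      # look up (or lazily create) the lock
                lock = _image_locks[key]
            with lock:                # hold it for the whole operation
                return fn(*args, **kwargs)
        return inner
    return wrap

@synchronized("image-abc123")
def use_image_cache_dir():
    # spawn()'s check-then-use of the cache dir happens under the lock
    return "spawn() used the cache dir safely"

@synchronized("image-abc123")
def age_image_cache_dir():
    # ageing's check-then-delete happens under the same lock, so it can
    # never interleave with a spawn() on the same image
    return "ageing ran under the same lock"
```

Because both functions contend on the same per-image lock, the check-then-delete and check-then-use sequences become atomic with respect to each other, which is exactly the property the race below lacks.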
There are some non-elegant ways to work around the multi-node scenarios if one really has to (e.g. dedicating a different cache location to each node), but I think the multi-node case should get addressed at some point too. There are distributed locking provisions in a vSphere datastore for files in use (you cannot delete a disk used by a running VM, say). The image cache as implemented exists as a collection of files on a datastore, many of which may not be in use by any process. Either that picture changes or we need to adopt some other means of achieving distributed locking.

Vui

----- Original Message -----
| From: "Matthew Booth" <mbo...@redhat.com>
| To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
| Sent: Friday, March 7, 2014 8:53:26 AM
| Subject: [openstack-dev] [nova][vmware] Locking: A can of worms
|
| We need locking in the VMware driver. There are 2 questions:
|
| 1. How much locking do we need?
| 2. Do we need single-node or multi-node locking?
|
| I believe these are quite separate issues, so I'm going to try not to
| confuse them. I'm going to deal with the first question first.
|
| In reviewing the image cache ageing patch, I came across a race
| condition between cache ageing and spawn(). One example of the race is:
|
| Cache Ageing                       spawn()
| * Check timestamps
| * Delete timestamp
|                                    * Check for image cache directory
| * Delete directory
|                                    * Use image cache directory
|
| This will cause spawn() to explode. There are various permutations of
| this. For example, the following are all possible:
|
| * A simple failure of spawn() with no additional consequences.
|
| * Calling volumeops.attach_disk_to_vm() with a vmdk_path that doesn't
|   exist. It's not 100% clear to me that ReconfigVM_Task will throw an
|   error in this case, btw, which would probably be bad.
|
| * Leaving a partially deleted image cache directory which doesn't
|   contain the base image. This would be really bad.
|
| The last comes about because recursive directory delete isn't atomic,
| and may partially succeed, which is a tricky problem. However, in
| discussion, Gary also pointed out that directory moves are not atomic
| (see MoveDatastoreFile_Task). This is positively nasty. We already knew
| that spawn() races with itself to create an image cache directory, and
| we've hit this problem in practice. We haven't fixed the race, but we do
| manage it. The management relies on the atomicity of a directory move.
| Unfortunately it isn't atomic, which presents the potential problem of
| spawn() attempting to use an incomplete image cache directory. We also
| have the problem of 2 spawns using a linked clone image racing to create
| the same resized copy.
|
| We could go through all of the above very carefully to assure ourselves
| that we've found all the possible failure paths, and that in every case
| the problems are manageable and documented. However, I would place a
| good bet that the above is far from a complete list, and we would have
| to revisit it in its entirety every time we touched any affected code.
| And that would be a lot of code.
|
| We need something to manage concurrent access to resources. In all of
| the above cases, if we make the rule that everything which uses an image
| cache directory must hold a lock on it whilst using it, all of the above
| problems go away. Reasoning about their correctness becomes the
| comparatively simple matter of ensuring that the lock is used correctly.
| Note that we need locking in both the single and multi node cases,
| because even single node is multi-threaded.
|
| The next question is whether that locking needs to be single node or
| multi node. Specifically, do we currently, or do we plan to, allow an
| architecture where multiple Nova nodes access the same datastore
| concurrently? If we do, then we need to find a distributed locking
| solution. Ideally this would use the datastore itself for lock
| mediation.
| Failing that, apparently this tool is used elsewhere within
| the project:
|
| http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
|
| That would be an added layer of architecture and deployment complexity,
| but if we need it, it's there.
|
| If we can confidently say that 2 Nova instances should never be
| accessing the same datastore (how about hot/warm/cold failover?), we can
| use Nova's internal synchronisation tools. This would simplify matters
| greatly!
|
| I think this is one of those areas which is going to improve both the
| quality of the driver and the confidence of reviewers to merge changes.
| Right now it takes a lot of brain cycles to work through all the various
| paths of a race to work out if any of them are really bad, and it has to
| be repeated every time you touch the code. A little up-front effort will
| make a whole class of problems go away.
|
| Matt
| --
| Matthew Booth, RHCA, RHCSS
| Red Hat Engineering, Virtualisation Team
|
| GPG ID:  D33C3490
| GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
|
| _______________________________________________
| OpenStack-dev mailing list
| OpenStack-dev@lists.openstack.org
| http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
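On the "some other means to achieve distributed locking" point: one datastore-mediated approach that gets used in practice is an exclusive lock file on the shared storage, created with O_CREAT|O_EXCL so that creation fails atomically if another node already holds it. A rough sketch follows; the function name, paths, and keys are hypothetical, and note that this deliberately punts on the hard part (stale lock files left by a crashed node), which is exactly where tools like ZooKeeper earn their complexity.

```python
# Rough sketch of lock mediation via the shared datastore itself: an
# exclusive lock file created with O_CREAT|O_EXCL, whose creation fails
# atomically if the file already exists. All names are hypothetical.
# This sketch does NOT handle stale lock files from crashed nodes.
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def datastore_lock(lock_path, timeout=10.0, poll=0.1):
    """Hold an exclusive lock file for the duration of the with-block."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes create-if-absent atomic on a shared filesystem
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break  # we created the file, so we own the lock
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire %s" % lock_path)
            time.sleep(poll)  # another holder: wait and retry
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)  # release by deleting the lock file

# Usage: serialize image-cache operations across nodes sharing the store.
# (tempfile.gettempdir() stands in for a path on the shared datastore.)
lock_file = os.path.join(tempfile.gettempdir(), "image-abc123.lock")
with datastore_lock(lock_file):
    result = "cache dir used under cross-node lock"
```

Whether vSphere datastores give this scheme the atomicity guarantees it needs is exactly the open question above; the sketch only shows the shape of the idea.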