Edison,

One thing I forgot to say is that reference counting may be an unnecessary complexity in the event that concurrent sharing of the same resource by multiple processes is rare.
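To make that concrete, here is a very rough, untested sketch of the reservation/reference-counting model from my previous mail (quoted below). All names are hypothetical -- this is not existing CloudStack code -- and if concurrent sharing really is rare, the reference count could simply be dropped so that reserve/release tracks a single owner:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the staging-area reservation model described in the
 * quoted mail below. Not existing CloudStack code.
 */
public class StagingAreaManager {

    private static class Entry {
        final long sizeBytes;
        int refCount;
        Entry(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    private final long capacityBytes;
    private long reservedBytes;
    private final Map<String, Entry> entries = new HashMap<>();

    public StagingAreaManager(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /**
     * Step 1: reserve space for the maximum size of the file(s) to be staged.
     * Returns false if the reservation cannot be fulfilled, so the caller can
     * retry later or reject the operation.
     */
    public synchronized boolean reserve(String fileKey, long maxSizeBytes) {
        Entry existing = entries.get(fileKey);
        if (existing != null) {
            existing.refCount++;   // file already staged: just bump the reference count
            return true;
        }
        if (reservedBytes + maxSizeBytes > capacityBytes) {
            return false;          // capacity is bounded: defer or reject the work
        }
        Entry entry = new Entry(maxSizeBytes);
        entry.refCount = 1;
        entries.put(fileKey, entry);
        reservedBytes += maxSizeBytes;
        return true;
    }

    /**
     * Step 3: release the file, decrementing its reference count. When the count
     * reaches zero, the staged copy is deleted and its space returned. A periodic
     * TTL-based purge would back this up against crashed processes.
     */
    public synchronized void release(String fileKey) {
        Entry entry = entries.get(fileKey);
        if (entry == null) {
            return;
        }
        if (--entry.refCount <= 0) {
            entries.remove(fileKey);
            reservedBytes -= entry.sizeBytes;
            // actual deletion of the staged file omitted here
        }
    }
}

Step 2 (the actual transfer to/from the object store) happens between reserve() and release().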
Thanks,
-John

On Jun 5, 2013, at 4:04 PM, John Burwell <jburw...@basho.com> wrote:

> Edison,
>
> You have provided some great information below, which helps greatly in understanding the role of the "NFS cache" mechanism. To summarize, this mechanism is currently required only for Xen snapshot operations driven by Xen's coalescing operations. Is my understanding correct? Just out of curiosity, is there a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage? I have Googled a bit and found very little information about it. Has the object_store branch been tested with VMware and KVM? If so, what operations on those hypervisors have been tested?
>
> In reading through the description below, my operational concerns remain regarding potential race conditions and resource exhaustion. Also, I think we should find a new name for this mechanism. As Chip has previously mentioned, a cache implies the following characteristics:
>
> 1. Optional: Systems can operate without caches, just more slowly. However, without this mechanism, snapshots on Xen will not function.
> 2. Volatility: Caches are backed by durable, non-volatile storage. Therefore, if a cache's data is lost, it can be rebuilt from the backing store and no data is permanently lost from the system. However, this mechanism contains snapshots in transit to an object store. If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.
>
> In order to set expectations with users and better frame our design conversation, I think it would be appropriate to refer to this mechanism as a staging, scratch, or temporary area. I also recommend removing the notion of NFS from its name, as NFS is only the initial implementation of this mechanism. In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.
>
> In terms of solving the potential race condition and resource exhaustion issues, I don't think an LRU approach will be sufficient, because the least recently used resource may still be in use by the system. I think we should look to a reservation model with reference counting, where files are deleted once no processes are accessing them. The following is a (handwave-handwave) overview of the process I think would meet these requirements:
>
> 1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
>    - If the file is already in the staging area, increase its reference count.
>    - If the reservation cannot be fulfilled, we can either drop the process into a retry queue or reject it.
> 2. Perform the work and transfer the file(s) to/from the object store.
> 3. Release the file(s), decrementing the reference count. When the reference count is <= 0, delete the file(s) from the staging area.
>
> We would also likely want to consider a TTL to purge files after a configurable period of inactivity as a backstop against crashed processes failing to properly decrement the reference count. In this model, we either defer or reject work when resources are not available, and we properly bound resource usage.
>
> Finally, in terms of decoupling the decision to use this mechanism by hypervisor plugins from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile). These interfaces would provide the storage subsystem with the hint that the client requires file access to the requested resource. For object store plugins, this hint would be used to wrap the resource in an object that transfers in and out of the staging area.
>
> Thoughts?
> -John
>
> On Jun 3, 2013, at 7:17 PM, Edison Su <edison...@citrix.com> wrote:
>
>> Let's start a new thread about the NFS cache storage issues on object_store. First, I'll go through how NFS storage works on the master branch, then how it works on the object_store branch, and then let's talk about the "issues".
>>
>> 0. Why do we need NFS secondary storage? NFS secondary storage is used as a place to store templates/snapshots etc.; it's zone wide, and it's widely supported by most hypervisors (except Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage, like S3/Swift, CloudStack added support for Swift in 3.x and S3 in 4.0. You may wonder, if S3/Swift is used as the place to store templates/snapshots, why do we still need NFS secondary storage?
>>
>> There are two reasons for that:
>>
>> a. CloudStack storage code is tightly coupled with NFS secondary storage, so when adding Swift/S3 support, it was easier to take a shortcut and leave NFS secondary storage as it is.
>>
>> b. Certain hypervisors, and certain storage-related operations, cannot directly operate on object storage. Examples:
>>
>> b.1 Backing up a snapshot (taken from a XenServer hypervisor) from primary storage to S3.
>>
>> If there are snapshot chains on the volume, and we want to coalesce the snapshot chains into a new disk and then copy it to S3, we either coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) supported by XenServer.
>>
>> If we coalesce on primary storage, we may blow up the primary storage, as the coalesced new disk may need a lot of space (the new disk will contain all the content from the leaf snapshot all the way up to the base template), and the primary storage was not planned for this operation (the CloudStack management server is unaware of it, so it may think the primary storage still has enough space to create volumes).
>>
>> XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to use another storage type that XenServer supports; that's why NFS storage is used during snapshot backup. So what we do is first call the XenServer API to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we do on both the master branch and the object_store branch.
>>
>> b.2 Creating a volume from a snapshot when the snapshot is stored on S3.
>>
>> If the snapshot is a delta snapshot, we need to coalesce it with its parents into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents somewhere and then coalesce them with XenServer's tools.
>>
>> Again, there are two options: download all the snapshots into primary storage, or download them into NFS storage.
>>
>> If we download all the snapshots into primary storage directly from S3, then first we need to find a way to import a snapshot from S3 into primary storage (if the primary storage is a block device, extra care is needed) and then coalesce them. If we go this way, we need to find a primary storage with enough space, and even worse, if the primary storage is not zone wide, then later on we may need to copy the volume from one primary storage to another, which is time consuming.
>>
>> If we download all the snapshots into NFS storage from S3, we then coalesce them and copy the resulting volume to primary storage. As the NFS storage is zone wide, you can copy the volume into whatever primary storage you like, without an extra copy. This is what we do on both the master branch and the object_store branch.
>>
>> b.3 Some hypervisors, or some storages, do not support directly importing a template into primary storage from a URL. For example, if Ceph is used as primary storage, then to import a template into RBD you need to transform a Qcow2 image into a RAW disk, and then into RBD format 2. In order to transform a Qcow2 image into a RAW disk, you need an extra file system: either a local file system (this is what other stacks do, which is not scalable to me), or NFS storage (this is what can be done on both master and object_store). Or one can modify the hypervisor or storage to support importing a template from S3 into RBD directly. Here is the link (http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html) that Wido posted.
>>
>> Anyway, there are many combinations of hypervisors and storages: for some hypervisors with zone-wide, file-system-based storage (e.g. KVM + Gluster/NFS as primary storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which can import a template from a URL regardless of which storage you are using, you don't need extra NFS storage. But if you are using XenServer, then in order to create a volume from a delta snapshot you will need NFS storage, and if you are using KVM + Ceph, you also may need NFS storage.
>>
>> Due to the above reasons, NFS cache storage is needed in certain cases when S3 is used as secondary storage. The combinations of hypervisors and storages are quite complicated, so whether to use cache storage or not should be decided case by case. But as long as CloudStack provides a framework that gives people the choice to enable/disable cache storage on their own, I think the framework is good enough.
>>
>> 1. Then let's talk about how NFS storage works on the master branch, with or without S3.
>>
>> If S3 is not used, here is how NFS storage is used:
>>
>> 1.1 Register a template/ISO: CloudStack downloads the template/ISO into NFS storage.
>>
>> 1.2 Backup snapshot: CloudStack sends a command to the XenServer hypervisor, which issues a vdi.copy command to copy the snapshot to NFS; for KVM, it directly uses "cp" or "qemu-img convert" to copy the snapshot into NFS storage.
>>
>> 1.3 Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.
>>
>> If S3 is used:
>>
>> 1.4 Register a template/ISO: download the template/ISO into NFS storage first; then a background thread regularly uploads the template/ISO from NFS storage into S3. The template being in the Ready state only means the template is stored on NFS storage; the admin doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones, CloudStack will copy the template from one zone-wide NFS storage into another NFS storage in another zone, even though a region-wide S3 is already available. As the template is not directly uploaded to S3 when registering it, it takes several copies to spread the template region wide.
>>
>> 1.5 Backup snapshot: CloudStack sends a command to the XenServer hypervisor to copy the snapshot to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The snapshot being in the BackedUp state means not only that the snapshot is on NFS storage, but also that it's stored on S3.
>>
>> 1.6 Create volume from snapshot: download the snapshot and its parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
>>
>> 2. Then let's talk about how it works on object_store:
>>
>> If S3 is not used, there is ZERO change from the master branch. The way NFS secondary storage worked before is the same on object_store.
>>
>> If S3 is used, and NFS cache storage is used as well (which is the default):
>>
>> 2.1 Register a template/ISO: the template/ISO is directly uploaded to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. This implies that the template is immediately available in the region as soon as it's in the Ready state, and the admin clearly knows the status of the template on S3: what percentage of the upload is done, and whether it failed or succeeded. Also, if registering the template failed for some reason, the admin can issue the register template command again. I would say this way of registering a template into S3 is far better than what we did on the master branch.
>>
>> 2.2 Backup snapshot: same as the master branch -- send a command to the XenServer host, copy the snapshot into NFS, then upload it to S3.
>>
>> 2.3 Create volume from snapshot: same as the master branch -- download the snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
>>
>> From the above few typical use cases, you can understand how S3 and NFS cache storage are used, and what the difference is between the object_store branch and the master branch: basically, we only changed the way a template is registered, nothing else.
>>
>> If S3 is used and no NFS cache storage is used (it's possible, depending on which data motion strategy is used):
>>
>> 2.4 Register a template/ISO: same as 2.1.
>>
>> 2.5 Backup snapshot: export the snapshot from primary storage into S3 directly.
>>
>> 2.6 Create volume from snapshot: download snapshots from S3 into primary storage directly, then coalesce and create the volume from them.
>>
>> Hopefully the above explanation shows how the system actually works on object_store and clarifies the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want the existing NFS storage, that's definitely OK; it works the same as before.
>> If you want to use S3, we provide a better S3 implementation when registering templates/ISOs. If you want to use S3 without NFS storage, that's also definitely OK; the framework is quite flexible and can accommodate different solutions.
>>
>> OK, let's talk about the NFS cache storage issues. The issues with NFS cache storage have been discussed in several threads, back and forth. All in all, NFS cache storage is only one use case out of the three use cases supported by the object_store branch. It's not something where, if it has an issue, nothing works.
>>
>> Items 2.2 and 2.3 above show how the NFS cache storage is involved in snapshot-related operations. The complaints -- that there is no aging policy and no capacity planner for NFS cache storage -- apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can be exhausted if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
>>
>> Let's talk about the code. Here is the code related to NFS cache storage -- not much, only one class depends on it:
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>>
>> Take copyVolumeFromSnapshot as an example, which is called when creating a volume from a snapshot. It first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage. StorageCacheManagerImpl->createCacheObject is the only place that creates objects on NFS cache storage; the code is at
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>>
>> In createCacheObject, it first finds a cache storage, in case there are multiple cache storages available in a scope: DataStore cacheStore = this.getCacheStorage(scope); getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place where an NFS cache storage is chosen based on certain criteria; the current implementation only chooses one of them randomly, and we can add a new allocator algorithm based on capacity, etc.
>>
>> Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this entry to store capacity information about NFS cache storages (such as total size and available/used capacity). So on every call to createCacheObject, we can call StorageCacheAllocator to find a proper NFS storage based on a first-fit criterion, then increase the used capacity in the op_host_capacity table. If creating the cache object fails, return the capacity to op_host_capacity.
>>
>> Regarding the aging-out policy, we can start a background thread on the management server that scans all the objects created on NFS cache storage (the tables snapshot_store_ref, template_store_ref, and volume_store_ref). Each entry in these tables has a column called "updated"; every time the object's state changes, the "updated" column is updated as well. When does the object's state change?
>> Every time the object is used in some context (such as copying the snapshot on NFS cache storage to somewhere else), the object's state is changed accordingly -- for example to "Copying", meaning the object is being copied to some place -- which is exactly the information we need to implement an LRU algorithm.
>>
>> What do you guys think about the fix? If you have a better solution, please let me know.
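For reference, the capacity-based first-fit selection and aging-out backstop discussed above could look roughly like the sketch below. Every class, field, and method name here is hypothetical -- this is not the actual StorageCacheAllocator, op_host_capacity schema, or *_store_ref code, just an illustration of the idea:

import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Illustrative only -- not the actual CloudStack allocator or schema.
public class CapacityAwareCacheAllocator {

    /** Hypothetical per-store capacity record (conceptually what op_host_capacity would hold). */
    public static class CacheStore {
        String url;
        long totalBytes;
        long usedBytes;

        long freeBytes() { return totalBytes - usedBytes; }
    }

    /** Hypothetical record of one staged object (snapshot/template/volume copy). */
    public static class StagedObject {
        CacheStore store;
        long sizeBytes;
        int refCount;        // > 0 while some operation is using the staged file
        Instant lastUpdated; // analogous to the "updated" column in the *_store_ref tables
    }

    /** First fit: pick the first store with enough free space and reserve the space up front. */
    public Optional<CacheStore> allocate(List<CacheStore> candidates, long requiredBytes) {
        Optional<CacheStore> chosen = candidates.stream()
                .filter(s -> s.freeBytes() >= requiredBytes)
                .findFirst();
        chosen.ifPresent(s -> s.usedBytes += requiredBytes); // release this again if the copy fails
        return chosen;
    }

    /** Aging-out backstop: purge staged objects idle longer than the TTL and no longer referenced. */
    public void purgeIdle(List<StagedObject> staged, Duration ttl) {
        Instant cutoff = Instant.now().minus(ttl);
        Iterator<StagedObject> it = staged.iterator();
        while (it.hasNext()) {
            StagedObject obj = it.next();
            if (obj.refCount <= 0 && obj.lastUpdated.isBefore(cutoff)) {
                obj.store.usedBytes -= obj.sizeBytes; // return the reserved capacity
                it.remove();                          // actual file deletion omitted here
            }
        }
    }
}

Whether the purge keys off an LRU-style "updated" timestamp, as Edison describes, or off a reference count, as I suggested earlier in this thread, is the part we still need to agree on.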