Edison,

One thing I forgot to say is that reference counting may be an unnecessary complexity in the event that concurrent sharing of the same resource by multiple processes is rare.
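To make that concrete, here is a very rough, untested sketch of the reservation/reference-counting model from my previous mail (quoted below). All names are hypothetical -- this is not existing CloudStack code -- and if concurrent sharing really is rare, the reference count could simply be dropped so that reserve/release tracks a single owner:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the staging-area reservation model described in the
 * quoted mail below. Not existing CloudStack code.
 */
public class StagingAreaManager {

    private static class Entry {
        final long sizeBytes;
        int refCount;
        Entry(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    private final long capacityBytes;
    private long reservedBytes;
    private final Map<String, Entry> entries = new HashMap<>();

    public StagingAreaManager(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /**
     * Step 1: reserve space for the maximum size of the file(s) to be staged.
     * Returns false if the reservation cannot be fulfilled, so the caller can
     * retry later or reject the operation.
     */
    public synchronized boolean reserve(String fileKey, long maxSizeBytes) {
        Entry existing = entries.get(fileKey);
        if (existing != null) {
            existing.refCount++;   // file already staged: just bump the reference count
            return true;
        }
        if (reservedBytes + maxSizeBytes > capacityBytes) {
            return false;          // capacity is bounded: defer or reject the work
        }
        Entry entry = new Entry(maxSizeBytes);
        entry.refCount = 1;
        entries.put(fileKey, entry);
        reservedBytes += maxSizeBytes;
        return true;
    }

    /**
     * Step 3: release the file, decrementing its reference count. When the count
     * reaches zero, the staged copy is deleted and its space returned. A periodic
     * TTL-based purge would back this up against crashed processes.
     */
    public synchronized void release(String fileKey) {
        Entry entry = entries.get(fileKey);
        if (entry == null) {
            return;
        }
        if (--entry.refCount <= 0) {
            entries.remove(fileKey);
            reservedBytes -= entry.sizeBytes;
            // actual deletion of the staged file omitted here
        }
    }
}

Step 2 (the actual transfer to/from the object store) happens between reserve() and release().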
Thanks,
-John

On Jun 5, 2013, at 4:04 PM, John Burwell <jburw...@basho.com> wrote:

> Edison,
>
> You have provided some great information below, which helps greatly in understanding the role of the "NFS cache" mechanism. To summarize, this mechanism is currently required only for Xen snapshot operations driven by Xen's coalescing operations. Is my understanding correct? Just out of curiosity, is there a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage? I have Googled a bit and found very little information about it. Has the object_store branch been tested with VMware and KVM? If so, what operations on those hypervisors have been tested?
>
> In reading through the description below, my operational concerns remain regarding potential race conditions and resource exhaustion. Also, I think we should find a new name for this mechanism. As Chip has previously mentioned, a cache implies the following characteristics:
>
> 1. Optional: Systems can operate without caches, just more slowly. However, without this mechanism, snapshots on Xen will not function.
> 2. Volatility: Caches are backed by durable, non-volatile storage. Therefore, if a cache's data is lost, it can be rebuilt from the backing store and no data is permanently lost from the system. However, this mechanism contains snapshots in transit to an object store. If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.
>
> In order to set expectations with users and better frame our design conversation, I think it would be appropriate to refer to this mechanism as a staging, scratch, or temporary area. I also recommend removing the notion of NFS from its name, as NFS is only the initial implementation of this mechanism. In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.
>
> In terms of solving the potential race condition and resource exhaustion issues, I don't think an LRU approach will be sufficient, because the least recently used resource may still be in use by the system. I think we should look to a reservation model with reference counting, where files are deleted once no processes are accessing them. The following is a (handwave-handwave) overview of the process I think would meet these requirements:
>
> 1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
>    - If the file is already in the staging area, increase its reference count.
>    - If the reservation cannot be fulfilled, we can either drop the process into a retry queue or reject it.
> 2. Perform the work and transfer the file(s) to/from the object store.
> 3. Release the file(s), decrementing the reference count. When the reference count is <= 0, delete the file(s) from the staging area.
>
> We would also likely want to consider a TTL to purge files after a configurable period of inactivity as a backstop against crashed processes failing to properly decrement the reference count. In this model, we either defer or reject work when resources are not available, and we properly bound resource usage.
>
> Finally, in terms of decoupling the decision to use this mechanism by hypervisor plugins from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile). These interfaces would provide the storage subsystem with the hint that the client requires file access to the requested resource. For object store plugins, this hint would be used to wrap the resource in an object that transfers in and out of the staging area.
>
> Thoughts?
> -John
>
> On Jun 3, 2013, at 7:17 PM, Edison Su <edison...@citrix.com> wrote:
>
>> Let's start a new thread about the NFS cache storage issues on object_store. First, I'll go through how NFS storage works on the master branch, then how it works on the object_store branch, and then let's talk about the "issues".
>>
>> 0. Why do we need NFS secondary storage? NFS secondary storage is used as a place to store templates/snapshots etc.; it's zone wide, and it's widely supported by most hypervisors (except Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage, like S3/Swift, CloudStack added support for Swift in 3.x and S3 in 4.0. You may wonder, if S3/Swift is used as the place to store templates/snapshots, why do we still need NFS secondary storage?
>>
>> There are two reasons for that:
>>
>> a. CloudStack storage code is tightly coupled with NFS secondary storage, so when adding Swift/S3 support, it was easier to take a shortcut and leave NFS secondary storage as it is.
>>
>> b. Certain hypervisors, and certain storage-related operations, cannot directly operate on object storage. Examples:
>>
>> b.1 Backing up a snapshot (taken from a XenServer hypervisor) from primary storage to S3.
>>
>> If there are snapshot chains on the volume, and we want to coalesce the snapshot chains into a new disk and then copy it to S3, we either coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) supported by XenServer.
>>
>> If we coalesce on primary storage, we may blow up the primary storage, as the coalesced new disk may need a lot of space (the new disk will contain all the content from the leaf snapshot all the way up to the base template), and the primary storage was not planned for this operation (the CloudStack management server is unaware of it, so it may think the primary storage still has enough space to create volumes).
>>
>> XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to use another storage type that XenServer supports; that's why NFS storage is used during snapshot backup. So what we do is first call the XenServer API to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we do on both the master branch and the object_store branch.
>>
>> b.2 Creating a volume from a snapshot when the snapshot is stored on S3.
>>
>> If the snapshot is a delta snapshot, we need to coalesce it with its parents into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents somewhere and then coalesce them with XenServer's tools.
>>
>> Again, there are two options: download all the snapshots into primary storage, or download them into NFS storage.
>>
>> If we download all the snapshots into primary storage directly from S3, then first we need to find a way to import a snapshot from S3 into primary storage (if the primary storage is a block device, extra care is needed) and then coalesce them. If we go this way, we need to find a primary storage with enough space, and even worse, if the primary storage is not zone wide, then later on we may need to copy the volume from one primary storage to another, which is time consuming.
>>
>> If we download all the snapshots into NFS storage from S3, we then coalesce them and copy the resulting volume to primary storage. As the NFS storage is zone wide, you can copy the volume into whatever primary storage you like, without an extra copy. This is what we do on both the master branch and the object_store branch.
>>
>> b.3 Some hypervisors, or some storages, do not support directly importing a template into primary storage from a URL. For example, if Ceph is used as primary storage, then to import a template into RBD you need to transform a Qcow2 image into a RAW disk, and then into RBD format 2. In order to transform a Qcow2 image into a RAW disk, you need an extra file system: either a local file system (this is what other stacks do, which is not scalable to me), or NFS storage (this is what can be done on both master and object_store). Or one can modify the hypervisor or storage to support importing a template from S3 into RBD directly. Here is the link (http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html) that Wido posted.
>>
>> Anyway, there are many combinations of hypervisors and storages: for some hypervisors with zone-wide, file-system-based storage (e.g. KVM + Gluster/NFS as primary storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which can import a template from a URL regardless of which storage you are using, you don't need extra NFS storage. But if you are using XenServer, then in order to create a volume from a delta snapshot you will need NFS storage, and if you are using KVM + Ceph, you also may need NFS storage.
>>
>> Due to the above reasons, NFS cache storage is needed in certain cases when S3 is used as secondary storage. The combinations of hypervisors and storages are quite complicated, so whether to use cache storage or not should be decided case by case. But as long as CloudStack provides a framework that gives people the choice to enable/disable cache storage on their own, I think the framework is good enough.
>>
>> 1. Then let's talk about how NFS storage works on the master branch, with or without S3.
>>
>> If S3 is not used, here is how NFS storage is used:
>>
>> 1.1 Register a template/ISO: CloudStack downloads the template/ISO into NFS storage.
>>
>> 1.2 Backup snapshot: CloudStack sends a command to the XenServer hypervisor, which issues a vdi.copy command to copy the snapshot to NFS; for KVM, it directly uses "cp" or "qemu-img convert" to copy the snapshot into NFS storage.
>>
>> 1.3 Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.
>>
>> If S3 is used:
>>
>> 1.4 Register a template/ISO: download the template/ISO into NFS storage first; then a background thread regularly uploads the template/ISO from NFS storage into S3. The template being in the Ready state only means the template is stored on NFS storage; the admin doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones, CloudStack will copy the template from one zone-wide NFS storage into another NFS storage in another zone, even though a region-wide S3 is already available. As the template is not directly uploaded to S3 when registering it, it takes several copies to spread the template region wide.
>>
>> 1.5 Backup snapshot: CloudStack sends a command to the XenServer hypervisor to copy the snapshot to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The snapshot being in the BackedUp state means not only that the snapshot is on NFS storage, but also that it's stored on S3.
>>
>> 1.6 Create volume from snapshot: download the snapshot and its parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
>>
>> 2. Then let's talk about how it works on object_store:
>>
>> If S3 is not used, there is ZERO change from the master branch. The way NFS secondary storage worked before is the same on object_store.
>>
>> If S3 is used, and NFS cache storage is used as well (which is the default):
>>
>> 2.1 Register a template/ISO: the template/ISO is directly uploaded to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. This implies that the template is immediately available in the region as soon as it's in the Ready state, and the admin clearly knows the status of the template on S3: what percentage of the upload is done, and whether it failed or succeeded. Also, if registering the template failed for some reason, the admin can issue the register template command again. I would say this way of registering a template into S3 is far better than what we did on the master branch.
>>
>> 2.2 Backup snapshot: same as the master branch -- send a command to the XenServer host, copy the snapshot into NFS, then upload it to S3.
>>
>> 2.3 Create volume from snapshot: same as the master branch -- download the snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
>>
>> From the above few typical use cases, you can understand how S3 and NFS cache storage are used, and what the difference is between the object_store branch and the master branch: basically, we only changed the way a template is registered, nothing else.
>>
>> If S3 is used and no NFS cache storage is used (it's possible, depending on which data motion strategy is used):
>>
>> 2.4 Register a template/ISO: same as 2.1.
>>
>> 2.5 Backup snapshot: export the snapshot from primary storage into S3 directly.
>>
>> 2.6 Create volume from snapshot: download snapshots from S3 into primary storage directly, then coalesce and create the volume from them.
>>
>> Hopefully the above explanation shows how the system actually works on object_store and clarifies the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want the existing NFS storage, that's definitely OK; it works the same as before.
>> If you want to use S3, we provide a better S3 implementation when registering templates/ISOs. If you want to use S3 without NFS storage, that's also definitely OK; the framework is quite flexible and can accommodate different solutions.
>>
>> OK, let's talk about the NFS cache storage issues. The issues with NFS cache storage have been discussed in several threads, back and forth. All in all, NFS cache storage is only one use case out of the three use cases supported by the object_store branch. It's not something where, if it has an issue, nothing works.
>>
>> Items 2.2 and 2.3 above show how the NFS cache storage is involved in snapshot-related operations. The complaints -- that there is no aging policy and no capacity planner for NFS cache storage -- apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can be exhausted if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
>>
>> Let's talk about the code. Here is the code related to NFS cache storage -- not much, only one class depends on it:
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>>
>> Take copyVolumeFromSnapshot as an example, which is called when creating a volume from a snapshot. It first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage. StorageCacheManagerImpl->createCacheObject is the only place that creates objects on NFS cache storage; the code is at
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>>
>> In createCacheObject, it first finds a cache storage, in case there are multiple cache storages available in a scope: DataStore cacheStore = this.getCacheStorage(scope); getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place where an NFS cache storage is chosen based on certain criteria; the current implementation only chooses one of them randomly, and we can add a new allocator algorithm based on capacity, etc.
>>
>> Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this entry to store capacity information about NFS cache storages (such as total size and available/used capacity). So on every call to createCacheObject, we can call StorageCacheAllocator to find a proper NFS storage based on a first-fit criterion, then increase the used capacity in the op_host_capacity table. If creating the cache object fails, return the capacity to op_host_capacity.
>>
>> Regarding the aging-out policy, we can start a background thread on the management server that scans all the objects created on NFS cache storage (the tables snapshot_store_ref, template_store_ref, and volume_store_ref). Each entry in these tables has a column called "updated"; every time the object's state changes, the "updated" column is updated as well. When does the object's state change?
>> Every time the object is used in some context (such as copying the snapshot on NFS cache storage to somewhere else), the object's state is changed accordingly -- for example to "Copying", meaning the object is being copied to some place -- which is exactly the information we need to implement an LRU algorithm.
>>
>> What do you guys think about the fix? If you have a better solution, please let me know.
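For reference, the capacity-based first-fit selection and aging-out backstop discussed above could look roughly like the sketch below. Every class, field, and method name here is hypothetical -- this is not the actual StorageCacheAllocator, op_host_capacity schema, or *_store_ref code, just an illustration of the idea:

import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Illustrative only -- not the actual CloudStack allocator or schema.
public class CapacityAwareCacheAllocator {

    /** Hypothetical per-store capacity record (conceptually what op_host_capacity would hold). */
    public static class CacheStore {
        String url;
        long totalBytes;
        long usedBytes;

        long freeBytes() { return totalBytes - usedBytes; }
    }

    /** Hypothetical record of one staged object (snapshot/template/volume copy). */
    public static class StagedObject {
        CacheStore store;
        long sizeBytes;
        int refCount;        // > 0 while some operation is using the staged file
        Instant lastUpdated; // analogous to the "updated" column in the *_store_ref tables
    }

    /** First fit: pick the first store with enough free space and reserve the space up front. */
    public Optional<CacheStore> allocate(List<CacheStore> candidates, long requiredBytes) {
        Optional<CacheStore> chosen = candidates.stream()
                .filter(s -> s.freeBytes() >= requiredBytes)
                .findFirst();
        chosen.ifPresent(s -> s.usedBytes += requiredBytes); // release this again if the copy fails
        return chosen;
    }

    /** Aging-out backstop: purge staged objects idle longer than the TTL and no longer referenced. */
    public void purgeIdle(List<StagedObject> staged, Duration ttl) {
        Instant cutoff = Instant.now().minus(ttl);
        Iterator<StagedObject> it = staged.iterator();
        while (it.hasNext()) {
            StagedObject obj = it.next();
            if (obj.refCount <= 0 && obj.lastUpdated.isBefore(cutoff)) {
                obj.store.usedBytes -= obj.sizeBytes; // return the reserved capacity
                it.remove();                          // actual file deletion omitted here
            }
        }
    }
}

Whether the purge keys off an LRU-style "updated" timestamp, as Edison describes, or off a reference count, as I suggested earlier in this thread, is the part we still need to agree on.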