> -----Original Message-----
> From: Indra Pramana [mailto:in...@sg.or.id]
> Sent: Wednesday, October 09, 2013 11:07 PM
> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
> Subject: Re: High CPU utilization on KVM hosts while doing RBD snapshot -
> was Re: snapshot caused host disconnected
> 
> Dear all,
> 
> My colleague and I tried to scrutinize the source code and found the code
> which performs the copying of the snapshot from primary storage to
> secondary storage in this file:
> 
> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
> 
> Specifically under this function:
> 
>     @Override
>     public Answer backupSnapshot(CopyCommand cmd) {
> 
> ===
>                     File snapDir = new File(snapshotDestPath);
>                     s_logger.debug("Attempting to create " + snapDir.getAbsolutePath() + " recursively");
>                     FileUtils.forceMkdir(snapDir);
> 
>                     File snapFile = new File(snapshotDestPath + "/" + snapshotName);
>                     s_logger.debug("Backing up RBD snapshot to " + snapFile.getAbsolutePath());
>                     BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(snapFile));
>                     int chunkSize = 4194304;
>                     long offset = 0;
>                     while(true) {
>                         byte[] buf = new byte[chunkSize];
> 
>                         int bytes = image.read(offset, buf, chunkSize);
>                         if (bytes <= 0) {
>                             break;
>                         }
>                         bos.write(buf, 0, bytes);
>                         offset += bytes;
>                     }
>                     s_logger.debug("Completed backing up RBD snapshot " + snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " + offset);
>                     bos.close();
> ===
> 
> (1) Is it safe to comment out the above lines and recompile/reinstall, to
> prevent CloudStack from copying the snapshots from the RBD primary
> storage to the secondary storage?
> 
> (2) What would be the impact on CloudStack operations if we leave the
> snapshots on primary storage without copying them to secondary storage?
> Would we still be able to restore from the snapshots kept in primary
> storage?

Yes, it's doable, but you need to write a SnapshotStrategy in the management
server. The current SnapshotStrategy always backs up the snapshot to secondary
storage right after taking it.
You can write a new SnapshotStrategy that just takes the snapshot without
backing it up to secondary storage; a rough sketch is below.
After commit 180cfa19e87b909cb1c8a738359e31a6111b11c5 is checked into master,
you will get a lot more freedom to manipulate snapshots.
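
For illustration only, here is a rough sketch of what such a strategy could
look like. The SnapshotStrategy interface and method signatures below are
assumptions based on the 4.2-era storage engine code and may not match your
tree exactly, and isOnRbdPrimary()/takeSnapshotOnPrimary() are hypothetical
placeholders, so treat this as pseudocode rather than a drop-in class:

    // Hypothetical sketch: a strategy that takes RBD snapshots on primary
    // storage but never copies them to secondary storage. Interface and
    // method names are assumed from the 4.2-era SnapshotStrategy and the
    // default XenserverSnapshotStrategy; verify against your source tree.
    public class RbdPrimaryOnlySnapshotStrategy implements SnapshotStrategy {

        @Override
        public boolean canHandle(Snapshot snapshot) {
            // Only claim snapshots whose volume lives on an RBD primary pool.
            return isOnRbdPrimary(snapshot);
        }

        @Override
        public SnapshotInfo takeSnapshot(SnapshotInfo snapshot) {
            // Reuse whatever the default strategy does to send the
            // CreateObjectCommand to the agent (the "rbd snap create" part).
            return takeSnapshotOnPrimary(snapshot);
        }

        @Override
        public SnapshotInfo backupSnapshot(SnapshotInfo snapshot) {
            // Skip the streaming copy to secondary storage (the step that
            // pegs the CPU on the KVM host) and keep the snapshot on primary.
            return snapshot;
        }

        @Override
        public boolean deleteSnapshot(Long snapshotId) {
            // Would have to remove the rbd snapshot from the primary pool here.
            return true;
        }

        // Hypothetical helpers; real lookups would go through the volume
        // and storage pool DAOs.
        private boolean isOnRbdPrimary(Snapshot snapshot) { return true; }
        private SnapshotInfo takeSnapshotOnPrimary(SnapshotInfo snapshot) { return snapshot; }
    }

Keep question (2) above in mind: with no copy on secondary storage, restoring
and creating templates from such snapshots would also have to be wired up
against primary storage.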

> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> 
> 
> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
> 
> > Hi Wido and all,
> >
> > Good day to you, and thank you for your e-mail reply.
> >
> > Yes, from the agent logs I can see that the RBD snapshot was created.
> > However, it seems that CPU utilisation goes up drastically while the
> > snapshot is being copied over from primary storage to secondary storage.
> >
> > ===
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}] }
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: org.apache.cloudstack.storage.command.CreateObjectCommand
> > 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}] }
> > ...
> > 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Attempting to create RBD snapshot fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
> > ...
> > 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > ...
> > 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Attempting to create /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
> > 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Backing up RBD snapshot to /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
> > ===
> >
> > After that, depending on the size of the snapshot being backed up, the
> > KVM host will suffer from high CPU utilisation and will sometimes
> > time out and become disconnected from the management server. Then HA
> > kicks in and everything goes haywire.
> >
> > This is what I have done:
> >
> > (1) Enabled jumbo frames (MTU 9000) on the NIC cards and the switch
> > ports to which the KVM hosts and the secondary storage server are
> > connected. We are using 10 Gbps NIC cards and switches.
> >
> > (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
> > improve the network performance:
> >
> > net.ipv4.tcp_wmem = 4096 65536 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777216
> > net.core.wmem_max = 16777216
> > net.core.rmem_max = 16777216
> > net.core.wmem_default = 65536
> > net.core.rmem_default = 87380
> > net.core.netdev_max_backlog = 30000
> >
> > (3) Tried to disable RBD snapshot copying from primary to secondary
> > storage by setting this global CloudStack configuration:
> >
> > snapshot.backup.rightafter = false
> >
> > However, when I tested it, the snapshotting process still tried to
> > copy the snapshot backup over to secondary storage. I checked
> > another thread and it seems to be a bug in 4.2.0.
> >
> > Is it actually necessary for the snapshots to be backed up to
> > secondary storage, or is it OK for the snapshots to stay on primary
> > storage? It seems that the process of copying the snapshot from
> > primary to secondary storage is the root cause of the high CPU
> > utilisation on the KVM hosts.
> >
> > Given the bug mentioned in point (3), is there a workaround to
> > prevent the snapshots from being copied over to secondary storage after
> > they are created on primary storage?
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> >
> >
> > On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <w...@widodh.nl> wrote:
> >
> >>
> >>
> >> On 10/08/2013 04:59 AM, Indra Pramana wrote:
> >>
> >>> Dear Wido and all,
> >>>
> >>> I performed some further tests last night:
> >>>
> >>> (1) CPU utilization of the KVM host while the RBD snapshot is running
> >>> still shoots up high, even after I set the global setting
> >>> concurrent.snapshots.threshold.perhost to 2.
> >>>
> >>> (2) Most of the concurrent snapshot processes fail: they either get
> >>> stuck in the "Creating" state or end with a "CreatedOnPrimary" error message.
> >>>
> >>>
> >> Hmm, that is odd. It uses rados-java to call the RBD library to
> >> create the snapshot and afterwards it copies it to Secondary Storage.
> >>
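(For reference, the "create the snapshot" step mentioned above boils down to
roughly the rados-java calls below. This is a simplified sketch using values
taken from the agent log earlier in the thread; the real code on the agent
gets the monitor host, pool and credentials from the primary storage pool
definition and has proper error handling, so treat it as an illustration only.)

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    public class RbdSnapCreateSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder values; the agent reads these from the primary
            // storage pool definition (cephx user, monitor, pool, secret).
            Rados r = new Rados("admin");
            r.confSet("mon_host", "ceph-mon.simplercloud.com:6789");
            r.confSet("key", "<cephx secret>");
            r.connect();  // logs "Succesfully connected to Ceph cluster at ..."

            IoCTX io = r.ioCtxCreate("simplercloud-sg-01");  // pool = primary store path
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("fc5dfa05-2431-4b42-804b-b2fb72e219d0");  // volume path

            // logs "Attempting to create RBD snapshot <volume>@<snapshot uuid>"
            image.snapCreate("d6df6e15-d2ec-46a6-8b75-1f94c127dbb8");

            rbd.close(image);
            r.ioCtxDestroy(io);
        }
    }

(The expensive part is not this create call, but the backupSnapshot() read/write
loop quoted at the top of the thread, which streams the whole image to
secondary storage.)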
> >> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
> >> London now, so I won't be able to look at this the coming 2 days.
> >>
> >>
> >>  (3) I also have adjusted some other related global settings such as
> >>> backup.snapshot.wait and job.expire.minutes, without any luck.
> >>>
> >>> Any advice on what causes the high CPU utilization is
> >>> greatly appreciated.
> >>>
> >>>
> >> You might want to set the Agent log to debug and see if the RBD
> >> snapshot was created, it should log that:
> >> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
> >>
> >> "Attempting to create RBD snapshot"
> >>
> >> If that succeeds, the problem lies with backing up the snapshot to
> >> Secondary Storage.
> >>
> >> Wido
> >>
> >>
> >>  Looking forward to your reply, thank you.
> >>>
> >>> Cheers.
> >>>
> >>>
> >>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>
> >>>  Dear all,
> >>>>
> >>>> I also found out that while the RBD snapshot is running, the CPU
> >>>> utilisation on the KVM host shoots up very high, which
> >>>> might explain why the host becomes disconnected.
> >>>>
> >>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
> >>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> >>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> >>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
> >>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
> >>>>
> >>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
> >>>>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> >>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
> >>>>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
> >>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
> >>>>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
> >>>>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
> >>>>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
> >>>>
> >>>> Is it normal that when a snapshot is being taken of a VM running on that
> >>>> host, the host's CPU utilisation is higher than usual? How can
> >>>> I limit the CPU resources used by the snapshot?
> >>>>
> >>>>
> >>>> Looking forward to your reply, thank you.
> >>>>
> >>>> Cheers.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>>
> >>>>  Dear all,
> >>>>>
> >>>>> I did some tests on snapshots since they are now supported for my Ceph
> >>>>> RBD primary storage in CloudStack 4.2. When I ran a snapshot for
> >>>>> a particular VM instance earlier, I noticed that this caused
> >>>>> the host (where the VM is running) to become disconnected.
> >>>>>
> >>>>> Here's the excerpt from the agent.log:
> >>>>>
> >>>>> http://pastebin.com/dxVV7stu
> >>>>>
> >>>>> The management-server.log doesn't show much other than
> >>>>> detecting that the host was down and activating HA:
> >>>>>
> >>>>> http://pastebin.com/UeLiSm9K
> >>>>>
> >>>>> Can anyone advise on what is causing the problem? So far there is
> >>>>> only one user doing snapshotting and it has already caused issues on
> >>>>> the host; I can't imagine what would happen if multiple users tried to
> >>>>> snapshot at the same time.
> >>>>>
> >>>>> I read about snapshot job throttling, which is described in the manual:
> >>>>>
> >>>>>
> >>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
> >>>>>
> >>>>> But I am not too sure whether this will help resolve the
> >>>>> problem, since there is only one user trying to perform a snapshot
> >>>>> and we already encounter the problem.
> >>>>>
> >>>>> Can anyone advise how I can troubleshoot further and find a
> >>>>> solution to the problem?
> >>>>>
> >>>>> Looking forward to your reply, thank you.
> >>>>>
> >>>>> Cheers.
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >
