> -----Original Message-----
> From: Indra Pramana [mailto:in...@sg.or.id]
> Sent: Wednesday, October 09, 2013 11:07 PM
> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
> Subject: Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected
>
> Dear all,
>
> My colleague and I went through the source code and found that the copying
> of the snapshot from primary storage to secondary storage is done in this
> file:
>
> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
>
> Specifically, in this method:
>
> @Override
> public Answer backupSnapshot(CopyCommand cmd) {
>
> ===
> File snapDir = new File(snapshotDestPath);
> s_logger.debug("Attempting to create " + snapDir.getAbsolutePath() + " recursively");
> FileUtils.forceMkdir(snapDir);
>
> File snapFile = new File(snapshotDestPath + "/" + snapshotName);
> s_logger.debug("Backing up RBD snapshot to " + snapFile.getAbsolutePath());
> BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(snapFile));
> int chunkSize = 4194304;
> long offset = 0;
> while (true) {
>     byte[] buf = new byte[chunkSize];
>
>     int bytes = image.read(offset, buf, chunkSize);
>     if (bytes <= 0) {
>         break;
>     }
>     bos.write(buf, 0, bytes);
>     offset += bytes;
> }
> s_logger.debug("Completed backing up RBD snapshot " + snapshotName + " to " + snapFile.getAbsolutePath() + ". Bytes written: " + offset);
> bos.close();
> ===
>
> (1) Is it safe to comment out the above lines and recompile/reinstall, to
> prevent CloudStack from copying the snapshots from the RBD primary storage
> to the secondary storage?
>
> (2) What would be the impact on CloudStack operations if we leave the
> snapshots on primary storage without copying them to secondary storage?
> Would we still be able to restore from snapshots kept on primary storage?
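For reference, the backupSnapshot() loop quoted above is a plain chunked copy: read up to 4 MiB from the RBD image, write it to the file, advance the offset, and stop on a zero-length read. The same pattern over ordinary Java streams looks like this (a standalone sketch for illustration, not CloudStack code; the class and method names are mine):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ChunkedCopy {
    // Mirrors the shape of the backupSnapshot() loop: fixed-size chunks,
    // stop when the source has no more bytes, track the total written.
    static long copy(InputStream in, OutputStream out, int chunkSize) throws IOException {
        byte[] buf = new byte[chunkSize]; // one buffer, reused each iteration
        long offset = 0;
        while (true) {
            int bytes = in.read(buf, 0, chunkSize);
            if (bytes <= 0) {
                break; // end of stream
            }
            out.write(buf, 0, bytes);
            offset += bytes;
        }
        return offset;
    }
}
```

One difference worth noting: the quoted CloudStack code allocates a fresh 4 MiB buffer on every loop iteration, while the sketch reuses a single buffer. That churn adds garbage-collection pressure, though it is unlikely to be the whole story behind the CPU spikes discussed in this thread.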
Yes, it's doable, but you need to write a SnapshotStrategy in the management
server. The current SnapshotStrategy always backs the snapshot up to
secondary storage after taking it. You can write a new SnapshotStrategy that
just takes the snapshot without backing it up to secondary storage. After
commit 180cfa19e87b909cb1c8a738359e31a6111b11c5 is checked into master, you
will get a lot of freedom to manipulate snapshots.

>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
>
> > Hi Wido and all,
> >
> > Good day to you, and thank you for your e-mail reply.
> >
> > Yes, from the agent logs I can see that the RBD snapshot was created.
> > However, it seems that CPU utilisation goes up drastically while the
> > snapshot is being copied over from primary storage to secondary storage.
> >
> > ===
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Request:Seq 34-898172006: { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
> > }
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: org.apache.cloudstack.storage.command.CreateObjectCommand
> > 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) Request:Seq 34-898172007: { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
> > }
> > ...
> > 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Attempting to create RBD snapshot fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
> > ...
> > 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > ...
> > 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Attempting to create /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
> > 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) Backing up RBD snapshot to /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
> > ===
> >
> > After that, depending on the size of the snapshot being backed up, the
> > KVM host will suffer from high CPU utilisation and will sometimes time
> > out and get disconnected from the management server. Then HA will kick
> > in and everything goes haywire.
> >
> > This is what I have done:
> >
> > (1) Enabled jumbo frames (MTU 9000) on the NIC cards and on the switch
> > ports to which the KVM hosts and the secondary storage server are
> > connected. We are using 10 Gbps NICs and switches.
> >
> > (2) Performed some fine-tuning in the KVM hosts' /etc/sysctl.conf to
> > improve network performance:
> >
> > net.ipv4.tcp_wmem = 4096 65536 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777216
> > net.core.wmem_max = 16777216
> > net.core.rmem_max = 16777216
> > net.core.wmem_default = 65536
> > net.core.rmem_default = 87380
> > net.core.netdev_max_backlog = 30000
> >
> > (3) Tried to disable RBD snapshot copying from primary to secondary
> > storage by setting this global CloudStack configuration:
> >
> > snapshot.backup.rightafter = false
> >
> > However, when I tested, the snapshotting process still tries to copy
> > the snapshot backup over to the secondary storage. I checked another
> > thread and it seems to be a bug in 4.2.0.
> >
> > Is it actually a necessity for the snapshots to be backed up to
> > secondary storage, or is it OK for the snapshots to stay on primary
> > storage? It seems that the process of copying the snapshot from primary
> > to secondary storage is the root cause of the high CPU utilisation on
> > the KVM hosts.
> >
> > Due to the above bug in point (3), is there a workaround to prevent the
> > snapshots from being copied over to the secondary storage after they
> > are created on the primary storage?
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> >
> >
> > On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <w...@widodh.nl> wrote:
> >
> >>
> >>
> >> On 10/08/2013 04:59 AM, Indra Pramana wrote:
> >>
> >>> Dear Wido and all,
> >>>
> >>> I performed some further tests last night:
> >>>
> >>> (1) CPU utilization of the KVM host while an RBD snapshot is running
> >>> still shoots up high, even after I set the global setting
> >>> concurrent.snapshots.threshold.perhost to 2.
> >>>
> >>> (2) Most of the concurrent snapshot processes will fail, either stuck
> >>> in the "Creating" state or with a "CreatedOnPrimary" error message.
> >>>
> >>>
> >> Hmm, that is odd.
> >> It uses rados-java to call the RBD library to create the snapshot,
> >> and afterwards it copies it to Secondary Storage.
> >>
> >> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
> >> London now, so I won't be able to look at this for the coming 2 days.
> >>
> >>
> >>> (3) I have also adjusted some other related global settings, such as
> >>> backup.snapshot.wait and job.expire.minutes, without any luck.
> >>>
> >>> Any advice on what causes the high CPU utilization is greatly
> >>> appreciated.
> >>>
> >>>
> >> You might want to set the Agent log to debug and see if the RBD
> >> snapshot was created; it should log that:
> >> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
> >>
> >> "Attempting to create RBD snapshot"
> >>
> >> If that succeeds, the problem lies with backing up the snapshot to
> >> Secondary Storage.
> >>
> >> Wido
> >>
> >>
> >>> Looking forward to your reply, thank you.
> >>>
> >>> Cheers.
> >>>
> >>>
> >>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>
> >>>> Dear all,
> >>>>
> >>>> I also found out that when an RBD snapshot is being run, CPU
> >>>> utilisation on the KVM host shoots up very high, which might explain
> >>>> why the host becomes disconnected.
> >>>>
> >>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
> >>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> >>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> >>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
> >>>> Swap:    545788k total,        0k used,   545788k free, 60677092k cached
> >>>>
> >>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
> >>>>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> >>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
> >>>>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
> >>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
> >>>>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
> >>>>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
> >>>>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
> >>>>
> >>>> Is it normal that, when a snapshot is run on a VM on that host, the
> >>>> host's CPU utilisation is higher than usual? How can I limit the CPU
> >>>> resources used by the snapshot?
> >>>>
> >>>>
> >>>> Looking forward to your reply, thank you.
> >>>>
> >>>> Cheers.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>>
> >>>>> Dear all,
> >>>>>
> >>>>> I did some tests on snapshots, since they are now supported for my
> >>>>> Ceph RBD primary storage in CloudStack 4.2. When I ran a snapshot
> >>>>> for a particular VM instance earlier, I noticed that this caused
> >>>>> the host (where the VM is running) to become disconnected.
> >>>>>
> >>>>> Here's an excerpt from the agent.log:
> >>>>>
> >>>>> http://pastebin.com/dxVV7stu
> >>>>>
> >>>>> The management-server.log doesn't show much, other than detecting
> >>>>> that the host was down and that HA was being activated:
> >>>>>
> >>>>> http://pastebin.com/UeLiSm9K
> >>>>>
> >>>>> Can anyone advise what is causing the problem? So far only one user
> >>>>> is doing snapshotting and it has already caused issues on the host;
> >>>>> I can't imagine what would happen if multiple users tried to do
> >>>>> snapshots at the same time.
> >>>>>
> >>>>> I read about the snapshot job throttling described in the manual:
> >>>>>
> >>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
> >>>>>
> >>>>> But I am not sure whether this will help resolve the problem, since
> >>>>> there is only one user performing snapshots and we already
> >>>>> encounter the problem.
> >>>>>
> >>>>> Can anyone advise how I can troubleshoot further and find a
> >>>>> solution to the problem?
> >>>>>
> >>>>> Looking forward to your reply, thank you.
> >>>>>
> >>>>> Cheers.
> >>>>>
> >>>>
> >>>>
> >>>
> >
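Following up on the suggestion at the top of this thread (take the snapshot without backing it up to secondary storage), the shape of such a strategy can be sketched as below. Note that SnapshotInfo, SnapshotStrategy, and the method names here are simplified stand-ins for illustration only, not CloudStack's actual 4.2 interfaces:

```java
// Stand-in types -- NOT CloudStack's real interfaces. They only illustrate
// a strategy that snapshots on primary storage and makes the backup a no-op.
interface SnapshotInfo {
    String getName();
}

interface SnapshotStrategy {
    SnapshotInfo takeSnapshot(SnapshotInfo snapshot);   // create on primary storage
    SnapshotInfo backupSnapshot(SnapshotInfo snapshot); // copy to secondary storage
}

class PrimaryOnlySnapshotStrategy implements SnapshotStrategy {
    private final SnapshotStrategy delegate; // existing strategy that knows how to snapshot

    PrimaryOnlySnapshotStrategy(SnapshotStrategy delegate) {
        this.delegate = delegate;
    }

    @Override
    public SnapshotInfo takeSnapshot(SnapshotInfo snapshot) {
        // Take the RBD snapshot on primary storage as before.
        return delegate.takeSnapshot(snapshot);
    }

    @Override
    public SnapshotInfo backupSnapshot(SnapshotInfo snapshot) {
        // Skip the expensive copy to secondary storage entirely;
        // the snapshot stays on the RBD pool.
        return snapshot;
    }
}
```

Whether restores and template creation keep working with snapshots that exist only on primary storage is exactly the open question raised in the original message, so any strategy along these lines would need testing before production use.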