Thanks everyone for your inputs.

Below is a small write-up which I wanted to share with the Ceph user 
community.

Summary of the Ceph Issue with Volumes

Our Setup
As mentioned earlier, our setup has OpenStack MOS 6.0 integrated with a 
Ceph storage cluster.
The version details are as follows:
Ceph version : 0.80.7
Libvirt version : 1.2.2
Openstack Version : Juno (Mirantis 6.0)


Statement of Problem
We attached multiple volumes (more than 6) to a VM instance, similar to 
adding multiple disks on a Hadoop bare-metal node, and then tried to write to 
the attached disks simultaneously, for example via dd: "dd if=/dev/zero 
of=/disk{1..6}/test bs=4K count=10485760".
Observing "vmstat 1" on the VM instance, we saw that over a period of time 
the "bo" (blocks out) value trickled down to zero.
As soon as the bo value reached zero, the load on the VM instance spiked and 
the system became unresponsive. We had to reboot the VM instance to recover.

We also found that all the "dd" processes were in the "D" (uninterruptible 
sleep) state.

Our Investigation and Probable Resolution
In /var/log/syslog on the compute node on which the VM instance was running, 
we found the error message "Too many open files".
Example below (ABCD = PID of the qemu instance):

<8>Nov 18 04:56:49 node-XXX qemu-system-x86_64: 2016-11-18 04:56:49.939702 
7fe9b569d700 -1 -- <COMPUTE IP>:0/70<ABCD> >> <CEPH MONITOR>:6830/14356 
pipe(0x7fede65dcbf0 sd=
-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7fede37fc8c0).connect couldn't created socket (24) 
Too many open files

When we checked the limit for the number of open files in /proc, we found the following:
XXXXXX@node-XXXX:~# cat /proc/<ABCD>/limits
...
Max open files            1024                 4096                 files
...
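
For anyone checking the same thing, a quick way to see both the limit and the 
current descriptor usage of the qemu process (ABCD again being the qemu PID; 
the commands are illustrative):

# soft/hard limit for open files of the qemu process
grep "Max open files" /proc/<ABCD>/limits
# number of file descriptors currently open by that process
ls /proc/<ABCD>/fd | wc -l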

On this basis, we increased the open file descriptor limit for the libvirt-bin 
process from 1024 to 65536. (Our understanding is that each attached RBD volume 
gets its own librados client inside qemu, and each client opens sockets to the 
monitors and to the OSDs it talks to, so heavy IO across many volumes can 
exceed the default limit of 1024 descriptors.)
We put the following ulimit commands in /etc/default/libvirt-bin:
ulimit -Hn 65536
ulimit -Sn 65536

We had to restart the qemu instances via "nova stop" and "nova start" for the 
new limits to take effect.
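
A quick sanity check after the restart (illustrative; substitute the PID of the 
freshly started qemu instance for <NEW_PID>):

# confirm the restarted qemu process picked up the new limits
grep "Max open files" /proc/<NEW_PID>/limits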

This workaround has solved our issue for now, and the above-mentioned test 
cases are now successful.

We also checked the points below, which were helpful in narrowing down the 
issue:

* Was the issue limited to a specific type of Linux OS (Ubuntu or CentOS)?

* Was the issue limited to a specific kernel? We upgraded the kernel, but the 
issue still persisted.

* Was the issue due to any limiting resource (CPU, RAM, network, disk IO) on 
either the VM instance or the compute node?

* We also tried to tune kernel parameters such as dirty_ratio and 
dirty_background_ratio (see the example after this list), but no improvement 
was observed.

* We also observed that the issue was NOT tied to the number of volumes 
attached but to the total amount of IO performed.
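
For completeness, the writeback tuning mentioned in the list above was along 
these lines (the values are only examples, not a recommendation; the sysctl 
names are vm.dirty_ratio and vm.dirty_background_ratio):

# lower the dirty page thresholds so writeback starts earlier
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10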

As per our understanding this is a good resolution for now, but it may need 
monitoring and appropriate tuning.

Please do let me know if there are any questions/concerns or pointers :)

Thanks once again.

Thanks,
Mehul


From: Mehul1 Jani
Sent: 16 November 2016 11:40
To: 'ceph-users@lists.ceph.com'
Cc: Sanjeev Jaiswal; Harshit T Shah; Hardikv Desai
Subject: Ceph Volume Issue

Hi All,

We have a Ceph storage cluster integrated with our OpenStack private cloud.
We have created a pool for volumes, which allows our OpenStack private cloud 
users to create a volume from an image and boot from that volume.
Additionally, our images (both Ubuntu 14.04 and CentOS 7) are in raw format.

One of our use cases is to attach multiple volumes in addition to the boot 
volume.
We have observed that when we attach multiple volumes and try simultaneous 
writes to these attached volumes (for example via dd), all the write processes 
go into the "D" state (uninterruptible sleep).
We can also see in the vmstat output that the "bo" values trickle down to zero.
We have checked the network utilization on the compute node, which does not 
show any issues.

Finally, after a while the system becomes unresponsive and the only way to 
recover is to reboot the VM.
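
As a side note, a simple way to spot processes stuck in uninterruptible sleep 
(shown only as an illustrative command):

# list processes currently in D state, with the kernel function they wait on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'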

Some of our version details are as follows.

Ceph version : 0.80.7
Libvirt version : 1.2.2
Openstack Version : Juno (Mirantis 6.0)

Please do let me know if anyone has faced a similar issue or has any pointers.

Any direction will be helpful.

Thanks,
Mehul


"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s). 
are confidential and may be privileged. If you are not the intended recipient. 
you are hereby notified that any 
review. re-transmission. conversion to hard copy. copying. circulation or other 
use of this message and any attachments is 
strictly prohibited. If you are not the intended recipient. please notify the 
sender immediately by return email. 
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from 
the use of this email or attachment."