Hi Edwin,
On 07/12/2013 08:03 AM, Edwin Peer wrote:
> Hi there,
>
> We've been noticing nasty multi-second cluster-wide latencies if an OSD
> drops out of an active cluster (due to power failure, or even being
> stopped cleanly). We've also seen this problem occur when an OSD is
> inserted back into the cluster.
You will probably see that some placement groups (PGs) go into a state
other than active+clean.
What does ceph -s tell you in such a case?
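For example, the output looks something like this (illustrative numbers,
yours will differ):

    $ ceph -s
       health HEALTH_WARN 12 pgs peering; 30 pgs degraded
       osdmap e842: 24 osds: 23 up, 24 in
        pgmap v120531: 2048 pgs: 2006 active+clean, 12 peering,
              30 active+degraded

    $ ceph health detail    # lists the individual PGs per state

PGs stuck in peering are typically the ones blocking client I/O.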
> Obviously, this has the effect of freezing all VMs doing I/O across the
> cluster for several seconds when a single node fails. Is this behaviour
> expected? Or have I perhaps got something configured wrong?
Not really the expected behavior, but it could be CPU power limitations
on the OSDs. I notice this latency with an Atom cluster as well, but
that's mainly because the Atoms aren't fast enough to figure out what's
happening.
Faster AMD or Intel CPUs don't suffer from this. There will be a very
short I/O stall for certain PGs when an OSD goes down, but it should be
brief and should not affect every VM.
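If the stall lasts longer than a few seconds, the failure-detection
timers are worth a look. Roughly, these are the relevant ceph.conf knobs
(defaults shown as I recall them; please verify against your version
before changing anything):

    [osd]
        # how often an OSD pings its peers, in seconds
        osd heartbeat interval = 6
        # how long a peer may stay silent before being reported down
        osd heartbeat grace = 20

    [mon]
        # how many reports the monitors need to mark an OSD down
        mon osd min down reporters = 1
        # how long a down OSD stays "in" before data gets re-replicated
        mon osd down out interval = 300

Lowering the grace period makes failures get noticed sooner, at the cost
of false positives on a loaded network.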
How many OSDs do you have, and how many PGs per pool?
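You can check with something like this (using a hypothetical pool name
"rbd" here):

    $ ceph osd lspools
    $ ceph osd pool get rbd pg_num
    $ ceph osd pool get rbd size

As a rule of thumb, aim for roughly (number of OSDs * 100) / replica
count PGs in total, rounded up to the next power of two; e.g. 24 OSDs
with 3 replicas gives 800, so 1024 PGs.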
Wido
> We're trying very hard to eliminate all single points of failure in our
> architecture; is there anything that can be done about this?
>
> Regards,
> Edwin Peer
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com