I've just stood up a Ceph cluster for some experimentation.  Unfortunately, 
we're having some performance and stability problems I'm trying to pin down.  
More unfortunately, I'm new to Ceph, so I'm not sure where to start looking for 
the problem.

Under load, the monitors repeatedly go into election cycles, OSDs get "wrongly 
marked down", and we see slow requests along with failure reports like "osd.11 
39.7.48.6:6833/21938 failed (3 reports from 1 peers after 52.914693 >= grace 
20.000000)".  While this is happening, ceph -w shows the cluster essentially 
idle, and none of the network, disks, or CPUs ever appear to max out.  It also 
doesn't seem to be the same OSDs, monitors, or node causing the problem each 
time.  top reports all 128 GB of RAM in use (negligible swap) on the storage 
nodes, and only Ceph is running on them.
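
I can gather output from commands like these the next time it happens, if that 
would help (osd.11 is just the OSD from the message above, and the daemon 
command has to run on the storage node hosting it):

ceph -s
ceph health detail                      # which requests are slow and which OSDs are involved
ceph osd perf                           # per-OSD commit/apply latencies
ceph daemon osd.11 dump_historic_ops    # recent slow ops seen by that OSD
dmesg | tail -50                        # looking for link flaps or XFS errors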

We've configured 4 nodes for storage and connected 2 identical nodes that 
access the cluster storage over the kernel RBD driver.  MONs run on the first 
three storage nodes.
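
If the layout matters, I can also post the output of the standard status 
commands showing how the mons and OSDs are arranged:

ceph mon stat        # which mons are in quorum and the current election epoch
ceph quorum_status   # quorum membership in detail
ceph osd tree        # CRUSH layout of the OSDs across the 4 storage nodes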

The nodes we're using are Dell R720xd servers:

2x 1TB spinners configured in RAID for the OS
12x 4TB spinners for OSDs (3.5 TB XFS + 10 GB journal partition on each disk)
2x Xeon E5-2620 CPUs (/proc/cpuinfo reports 24 cores)
128 GB RAM
Two networks (public + cluster), both over infiniband
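
To rule out the interconnect, I can run a quick point-to-point test between 
storage nodes on both networks, something like this (39.7.48.6 is tvsaq1's 
public address from the ceph.conf below; the cluster-network address is a 
placeholder):

iperf -s                      # on one storage node
iperf -c 39.7.48.6 -t 30      # from another node, over the public network
iperf -c 39.64.x.x -t 30      # same test over the cluster network
ping -c 100 39.7.48.6         # sanity check for packet loss or latency spikes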

Software:
SLES 11 SP3, with some in-house patching (3.0.1 kernel, "ceph-client" 
backported from 3.10)
Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE

Our ceph.conf is pretty simple (as is our configuration, I think):
fsid = c216d502-5179-49b8-9b6c-ffc2cdd29374
mon initial members = tvsaq1
mon host = 39.7.48.6

cluster network = 39.64.0.0/12
public network = 39.0.0.0/12
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 9000
filestore xattr use omap = true
osd crush update on start = false
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 4096
osd pool default pgp num = 4096
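
One thing I wasn't sure about is the pg num choice.  My reading of the usual 
rule of thumb from the docs (total PGs ~= OSDs * 100 / replicas, rounded up to 
a power of two) gives, for this cluster:

48 OSDs * 100 / 3 replicas = 1600  ->  next power of two = 2048

so 4096 is a step above that; I don't know whether that's likely to matter here.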


What sort of performance should we be getting out of a setup like this?
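
For baseline numbers, I can run something like the following from one of the 
RBD client nodes and post the results (the pool name "rbd" and osd.0 are just 
examples):

rados bench -p rbd 60 write --no-cleanup   # raw RADOS write throughput for 60s
rados bench -p rbd 60 seq                  # sequential reads of the objects just written
ceph tell osd.0 bench                      # backend throughput of a single OSD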

Any help would be appreciated, and I'd be happy to provide whatever logs, 
config files, etc. are needed.  I'm sure we're doing something wrong, but I 
don't know what it is.

Bill
