On 08/11/2014 08:26 PM, Craig Lewis wrote:
Your MON nodes are separate hardware from the OSD nodes, right?
Two nodes are OSD + MON, plus a separate MON node.
If so, with replication=2, you should be able to shut down one of the two OSD nodes, and everything will continue working.
IIUC, if one of the OSD + MON nodes shuts down, the two remaining MONs (the separate MON node plus the surviving OSD + MON node) still form a quorum, is that right?
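Either way, I plan to verify quorum directly while one node is down. Assuming nothing beyond the stock CLI, something like this should show whether the two survivors still hold a majority:

    ceph quorum_status --format json-pretty
    ceph mon stat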
Replication=2 is a little worrisome, since we've already seen two disks fail simultaneously in the year the cluster has been running. That statistically unlikely event is probably the first and last time I'll see it, but they say lightning can strike twice....
Since it's for experimentation, I wouldn't deal with the extra hassle of replication=4 and custom CRUSH rules to make it work. If you have your heart set on that, it should be possible. I'm no CRUSH expert though, so I can't say for certain until I've actually done it.

I'm a bit confused why your performance is horrible though. I'm assuming your HDDs are 7200 RPM. With the SSD journals and replication=3, you won't have a ton of IO, but you shouldn't have any problem doing > 100 MB/s with 4 MB blocks. Unless your SSDs are very low quality, the HDDs should be your bottleneck.
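For reference, here's roughly the rule I had in mind for keeping two copies on each node with size=4. It's an untested sketch; the rule name, ruleset number, and root bucket name are just placeholders for whatever is in our map:

    rule replicated_2x2 {
            ruleset 1
            type replicated
            min_size 4
            max_size 4
            step take default
            step choose firstn 2 type host
            step chooseleaf firstn 2 type osd
            step emit
    }

If I go down that road, I'd expect the usual decompile/edit/re-inject cycle and then point the pool (name is a placeholder) at the new rule:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # add the rule above to crush.txt, then:
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    ceph osd pool set <pool> crush_ruleset 1
    ceph osd pool set <pool> size 4
    ceph osd pool set <pool> min_size 2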
The below setup is tomorrow's plan; today's reality is 3 OSDs on one node and 2 on another, crappy SSDs, 1Gb networks, PGs stuck unclean, and no monitoring to pinpoint bottlenecks. My work is cut out for me. :)
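For the stuck PGs and the missing monitoring, I'll at least start with the basics (stock CLI only; pool name is a placeholder):

    ceph health detail
    ceph pg dump_stuck unclean
    ceph osd tree
    rados -p <pool> bench 30 write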
Thanks for the helpful reply. I wish we could simply add a third OSD node and have these issues just go away, but it's not in the budget ATM.
John
On Fri, Aug 8, 2014 at 10:24 PM, John Morris <j...@zultron.com> wrote:

Our experimental Ceph cluster is performing terribly (with the operator to blame!), and while it's down to address some issues, I'm curious to hear advice about the following ideas.

The cluster:
- two disk nodes (6 * CPU, 16GB RAM each)
- 8 OSDs (4 each)
- 3 monitors
- 10Gb front + back networks
- 2TB Enterprise SATA drives
- HP RAID controller w/battery-backed cache
- one SSD journal drive for each two OSDs

First, I'd like to play with taking one machine down, but with the other node continuing to serve the cluster. To maintain redundancy in this scenario, I'm thinking of setting the pool size to 4 and the min_size to 2, with the idea that a proper CRUSH map should always keep two copies on each disk node. Again, *this is for experimentation* and probably raises red flags for production, but I'm just asking if it's *possible*: Could one node go down and the other node continue to serve r/w data? Any anecdotes of performance differences between size=4 and size=3 in other clusters?

Second, does it make any sense to divide the CRUSH map into an extra level for the SSD disks, which each hold journals for two OSDs? This might increase redundancy in case of a journal disk failure, but ISTR something about too few OSDs in a bucket causing problems with the CRUSH algorithm.

Thanks-

John