Our experimental Ceph cluster is performing terribly (with the operator
to blame!), and while it's down to address some issues, I'm curious to
hear advice about the following ideas.
The cluster:
- two disk nodes (6 CPU cores, 16GB RAM each)
- 8 OSDs (4 per node)
- 3 monitors
- 10Gb front + back networks
- 2TB Enterprise SATA drives
- HP RAID controller w/battery-backed cache
- one SSD journal drive shared by each pair of OSDs
First, I'd like to experiment with taking one node down while the
other continues to serve the cluster. To maintain redundancy in this
scenario, I'm thinking of setting the pool size to 4 and the min_size to
2, with the idea that a proper CRUSH map should always keep two copies
on each disk node. Again, *this is for experimentation* and probably
raises red flags for production, but I'm just asking if it's *possible*:
Could one node go down and the other node continue to serve r/w data?
Any anecdotes of performance differences between size=4 and size=3 in
other clusters?
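In case it helps to be concrete, this is roughly what I had in mind
(untested, and the pool and rule names are just examples): four copies
with a two-copy minimum, plus a CRUSH rule that picks two hosts and
then two OSDs under each host:

    ceph osd pool set rbd size 4
    ceph osd pool set rbd min_size 2

    # in the decompiled CRUSH map
    rule two_per_node {
            ruleset 1
            type replicated
            min_size 2
            max_size 4
            step take default
            step choose firstn 2 type host
            step chooseleaf firstn 2 type osd
            step emit
    }

The plan would be to compile that back in with crushtool / 'ceph osd
setcrushmap' and point the pool at it with 'ceph osd pool set rbd
crush_ruleset 1'. Am I on the right track?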
Second, does it make any sense to add an extra level to the CRUSH
hierarchy for the SSD journal disks, each of which holds the journals
for two OSDs? That might keep a single journal disk failure from
taking out both of a node's copies at once, but I seem to recall
something about too few OSDs in a bucket causing problems for the
CRUSH algorithm.
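For that second idea, something like this is what I'm picturing in the
decompiled CRUSH map (names, IDs and weights are made up, and the type
list is abbreviated):

    # extra bucket type between osd and host
    type 0 osd
    type 1 journal        # the two OSDs sharing one SSD journal
    type 2 host
    type 3 root

    journal node1-ssd0 {
            id -10
            alg straw
            hash 0        # rjenkins1
            item osd.0 weight 1.82
            item osd.1 weight 1.82
    }

    host node1 {
            id -2
            alg straw
            hash 0
            item node1-ssd0 weight 3.64
            item node1-ssd1 weight 3.64
    }

A rule could then use 'step chooseleaf firstn 2 type journal' under
each host so the two copies on a node never share a journal SSD, but
each journal bucket holding only two OSDs is exactly where my worry
about tiny buckets comes in.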
Thanks-
John