Hello,

On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi,
> 
> I'm looking to build a full osd ssd cluster, with this config:
> 
What is your main goal for that cluster: high IOPS, high sequential writes,
or reads?

Remember my "Slow IOPS on RBD..." thread: you probably shouldn't expect
more than 800 write IOPS and 4000 read IOPS per OSD (replication 2).
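
For the whole cluster that naively works out as follows (a rough sketch,
assuming all 60 OSDs perform equally and the load is spread evenly, both
of which are optimistic):

# crude cluster-wide ceiling from the per-OSD numbers above
osds = 6 * 10                             # 6 nodes, 10 OSDs each
print("write IOPS: %d" % (osds * 800))    # 48000
print("read IOPS:  %d" % (osds * 4000))   # 240000

Latency, CPU and the network will eat into that well before you get there.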

> 6 nodes,
> 
> each node 10 osd/ ssd drives (dual 10gbit network).  (1journal + datas
> on each osd)
> 
Putting the journal on the same SSD halves its effective write speed,
leaving you with about 2GB/s max write speed per node.
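
The arithmetic behind that, assuming roughly 400MB/s of sequential write
per drive (about what Intel rates the 800GB DC S3500 at):

# per-node write ceiling with journal and data on the same SSD
drives = 10
write_mb_s = 400                  # assumed per-drive sequential write
raw = drives * write_mb_s         # 4000 MB/s raw
effective = raw / 2               # every write lands twice: journal+data
print("%d MB/s per node" % effective)    # ~2000 MB/s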

If you're after good write speeds with a replication factor of 2, I
would split the network into public and cluster ones.
If, however, you're after top read speeds, bond the 2 links into the
public network; half of your SSDs per node can saturate that.
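
A minimal sketch of what that split looks like in ceph.conf (the subnets
are placeholders, substitute your own):

[global]
    public network  = 10.0.1.0/24   ; client <-> OSD traffic
    cluster network = 10.0.2.0/24   ; replication and recovery traffic

That keeps replication writes off the link your clients are using.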

> ssd drive will be entreprise grade, 
> 
> maybe intel sc3500 800GB (well known ssd)
> 
How much write activity do you expect per OSD (remember that in your
case writes are doubled by the on-disk journal)? Those drives have a
total write capacity of about 450TB (within 5 years).
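
Spelled out, taking the 450TB rating at face value:

# daily write budget per drive, Intel DC S3500 800GB
endurance_gb = 450 * 1000
per_day = endurance_gb / (5 * 365)   # ~246 GB/day at the drive
client_budget = per_day / 2          # ~123 GB/day, journal doubles writes
print("%d GB/day of client writes per OSD" % client_budget)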

> or new Samsung SSD PM853T 960GB (don't have too much info about it for
> the moment, but price seem a little bit lower than intel)
> 

Looking at the specs it seems to have better endurance (I used
500GB/day, a value that seemed realistic given the two numbers they gave),
at least double that of the Intel.
Alas, they only give a 3-year warranty, which makes me wonder.
Also, the latencies are significantly higher than the Intel's.
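
For comparison, my 500GB/day assumption over the same 5-year horizon
works out to:

# Samsung PM853T 960GB, assumed 500GB/day of endurance
tbw = 500 * 365 * 5 / 1000.0    # ~912 TB over 5 years
print("%.0f TB" % tbw)          # vs. ~450 TB for the Intel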

> 
> I would like to have some advise on replication level,
> 
> 
> Maybe somebody have experience with intel sc3500 failure rate ?

I doubt many people have managed to wear out SSDs of that vintage in
normal usage yet. And so far none of my dozens of Intel SSDs (including
some ancient X25-M ones) have died.

> How many chance to have 2 failing disks on 2 differents nodes at the
> same time (murphy's law ;).
> 
Indeed.

From my experience and looking at the technology I would postulate that:
1. SSD failures are very rare during their guaranteed endurance
period/data volume. 
2. Once the endurance level is exceeded the probability of SSDs failing
within short periods of each other becomes pretty high.

So if you're monitoring the SSDs (SMART) religiously and take measures to
avoid clustered failures (for example by replacing SSDs early, or adding
new nodes gradually, like 1 every 6 months or so), you are probably OK.
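
A minimal check along those lines (assumes smartmontools is installed
and an Intel drive exposing the Media_Wearout_Indicator attribute, which
starts at 100 and counts down; other vendors name their wear attribute
differently):

#!/usr/bin/env python
# warn when an SSD's normalized wear value drops below a threshold
import subprocess, sys

device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
out = subprocess.check_output(["smartctl", "-A", device],
                              universal_newlines=True)
for line in out.splitlines():
    if "Media_Wearout_Indicator" in line:
        value = int(line.split()[3])   # normalized current value
        status = "replace soon!" if value < 20 else "OK"
        print("%s: wearout %d, %s" % (device, value, status))

Run it from cron or feed the value into whatever monitoring you already
have.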

Keep in mind, however, that the larger this cluster grows, the more
likely a double failure scenario becomes.
Statistics and Murphy are out to get you.
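
A toy model of that (assuming independent failures and ignoring both the
correlated-wear effect from point 2 and PG placement, so treat it as
optimistic):

# chance of a 2nd disk dying while a 1st failure still recovers
drives = 60
afr = 0.02                # assumed 2% annual failure rate per drive
window_h = 1.0            # assumed recovery window in hours
first = drives * afr                              # failures per year
second = (drives - 1) * afr * window_h / 8760.0   # overlap chance
print("expected double failures/year: %.5f" % (first * second))

Every drive (and every hour of recovery window) you add pushes that
number up.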

With normal disks I would use a Ceph replication factor of 3, or, when
using RAID6, nothing larger than 12 disks per set.

> 
> I think in case of disk failure, pgs should replicate fast with 10gbits
> links.
> 
That very much depends on your cluster load and replication settings as
well.
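
Back-of-the-envelope, assuming an 800GB OSD that was 70% full and
recovery getting half of a 10Gbit link (throttles like osd_max_backfills
and ongoing client traffic will stretch this out considerably):

# crude re-replication time for one failed OSD
data_gb = 800 * 0.7             # ~560 GB to copy
link_gbit_s = 10 * 0.5          # assume half the link for recovery
seconds = data_gb * 8 / link_gbit_s
print("~%.0f minutes" % (seconds / 60))   # ~15 minutes, best case

In reality recovery is spread over many OSDs, so the network is rarely
the only limit.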

Regards,

Christian

> 
> So the question is:
> 
> 2x or 3x ?
> 
> 
> Regards,
> 
> Alexandre


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/