On 08/23/2017 06:18 PM, Xavier Trilla wrote:
Oh man, what do you know!... I'm quite amazed. I've been reviewing more 
documentation about min_replica_size and it seems it doesn't work as I 
thought (although I remember specifically reading it somewhere some years ago 
:/ ).

And, as all replicas need to be written before primary OSD informs the client 
about the write being completed, we cannot have the third replica on HDDs, no 
way. It would kill latency.

Well, we'll just keep adding NVMe drives to our cluster (I mean, the S4500 and P4500 
price difference is negligible) and we'll decrease the primary affinity weight of the 
SATA SSDs, just to be sure we get the most out of NVMe.
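
In case it helps, a minimal sketch of how that could look (the OSD IDs below are 
made up; on pre-Luminous releases the monitors also need 
mon_osd_allow_primary_affinity = true before the setting takes effect):

  # hypothetical SATA SSD OSD: make it half as likely to be picked as primary
  ceph osd primary-affinity osd.10 0.5
  # hypothetical NVMe OSD: leave it at the default
  ceph osd primary-affinity osd.3 1.0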

BTW, does anybody have any experience so far with erasure coding and rbd? A 2/3 
profile would really save space on SSDs, but I'm afraid about the extra 
calculations needed and how they will affect performance... Well, maybe I'll 
check into it and start a new thread :)

There's a decent chance you'll get higher performance with something like EC 6+2 vs 3X replication for large writes due simply to having less data to write (we see somewhere between 2x and 3x rep performance in the lab for 4MB writes to RBD). Small random writes will almost certainly be slower due to increased latency. Reads in general will be slower as well. With replication the read comes entirely from the primary but in EC you have to fetch chunks from the secondaries and reconstruct the object before sending it back to the client.

So basically compared to 3X rep you'll likely gain some performance on large writes, lose some performance on large reads, and lose more performance on small writes/reads (dependent on cpu speed and various other factors).
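
For reference, a rough, hedged sketch of what an EC-backed RBD setup could look 
like on Luminous (profile name, pool names, PG counts and image name are 
placeholders; allow_ec_overwrites requires Bluestore OSDs):

  ceph osd erasure-code-profile set ec-k2m1 k=2 m=1 crush-failure-domain=host
  ceph osd pool create rbd-data 128 128 erasure ec-k2m1
  ceph osd pool set rbd-data allow_ec_overwrites true
  ceph osd pool create rbd-meta 64 64 replicated
  rbd create rbd-meta/vm-disk-1 --size 100G --data-pool rbd-data

The image header and other metadata stay on the replicated pool; only the data 
objects land on the EC pool.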

Mark


Anyway, thanks for the info!
Xavier.

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Tuesday, August 22, 2017 2:40
To: ceph-users@lists.ceph.com
CC: Xavier Trilla <xavier.tri...@silicontower.net>
Subject: Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...


Hello,


Firstly, what David said.

On Mon, 21 Aug 2017 20:25:07 +0000 Xavier Trilla wrote:

Hi,

I'm working on improving the costs of our current Ceph cluster. We currently 
keep 3 replicas, all of them on SSDs (that cluster hosts the RBD disks of several 
hundred VMs), and lately I've been wondering if the following setup would make 
sense in order to improve cost / performance.


Have you done a full analysis of your current cluster, as in utilization of 
your SSDs (IOPS), CPU, etc., with atop/iostat/collectd/grafana?
During peak utilization times?

If so, you should have a decent enough idea of what level IOPS you need and can 
design from there.
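
Something along these lines, run on the OSD nodes during peak hours, already 
gives a rough picture (a sketch, nothing exhaustive):

  iostat -x 5     # per-device r/s, w/s, await and %util
  ceph osd perf   # per-OSD commit/apply latency
  atop 5          # CPU, memory and disk pressure in one view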

The ideal would be to move PG primaries to high-performance nodes using NVMe, 
keep the secondary replica on SSDs and move the third replica to HDDs.

Most probably the hardware will be:

1st Replica: Intel P4500 NVMe (2TB)
2nd Replica: Intel S3520 SATA SSD (1.6TB)
Unless you have:
a) a lot of these and/or
b) very little writes
what David said.

Aside from that whole replica idea not working as you think.

3rd Replica: WD Gold hard drives (2 TB) (I'm considering either the 1TB or
2TB model, as I want to have as many spindles as possible)

Also, hosts running OSDs would have a quite different HW configuration
(in our experience NVMe drives need crazy CPU power in order to get the best
out of them).

Correct, one might run into that with pure NVMe/SSD nodes.

I know the NVMe and SATA SSD replicas will work, no problem about that (we'll 
just adjust the primary affinity and crushmap in order to have the desired data 
layout + primary OSDs); what I'm worried about is the HDD replica.
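
For what it's worth, with the Luminous device classes a single CRUSH rule can 
pick the primary from NVMe and the remaining replicas from SSD; a rough, 
untested sketch (rule name, id and class names are assumptions, and nothing in 
it prevents two replicas from sharing a host across the two take/emit blocks 
unless the tree is laid out to avoid that):

  rule nvme-primary {
          id 5
          type replicated
          min_size 1
          max_size 3
          step take default class nvme
          step chooseleaf firstn 1 type host
          step emit
          step take default class ssd
          step chooseleaf firstn -1 type host
          step emit
  }

Since the first OSD in the resulting acting set acts as primary, reads and write 
coordination would land on the NVMe OSDs without any primary-affinity tweaking.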

Also, the pool will have min_size 1 (would love to use min_size 2, but it would kill 
latency) so, even if we have to do some maintenance on the NVMe nodes, writes to 
HDDs will always be "lazy".
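
For completeness, setting it is a one-liner (the pool name here is a placeholder); 
note though that min_size only controls whether I/O is still allowed while a PG is 
degraded, the write acknowledgement itself still waits for every replica in the 
acting set:

  ceph osd pool set vm-images min_size 1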

Before Bluestore (we are planning to move to Luminous most probably by the end 
of the year or the beginning of 2018, once it is released and tested properly) I would 
just use SSD/NVMe journals for the HDDs. So all writes would go to the SSD 
journal and then be moved to the HDD. But now, with Bluestore, I don't think 
that's an option anymore.

Bluestore bits are still a bit of dark magic in terms of concise and complete 
documentation, but the essentials have been mentioned here before.

Essentially, if you can get the needed IOPS with SSD/NVMe journals and HDDs, 
Bluestore won't be worse than that, if done correctly.

With Bluestore, use NVMe for the WAL (small space, high IOPS), SSDs for the 
actual RocksDB and the (surprise, surprise!) journal for small writes 
(large space; nobody knows for sure how large is large enough), and finally the 
HDDs.

If you're trying to optimize costs, decent SSDs (good luck finding any, with the 
Intel 37xx and 36xx basically unavailable), maybe the S4600 or P4600, to hold both 
the WAL and DB should do the trick.
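
To illustrate that split, a hedged ceph-volume example for Luminous (device 
paths are made up; if a single fast device holds both, pointing only --block.db 
at it is enough, as the WAL then lives there as well):

  # HDD for the data, one fast partition for the RocksDB, another for the WAL
  ceph-volume lvm create --bluestore --data /dev/sdb \
          --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2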

Christian

What I'm worried about is how having a quite slow third replica would affect the 
NVMe primary OSDs. WD Gold hard drives seem quite decent (for a SATA drive) 
but obviously performance is nowhere near SSDs or NVMe.

So, what do you think? Does anybody have some opinions or experience they would 
like to share?

Thanks!
Xavier.





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
