On 08/23/2017 06:18 PM, Xavier Trilla wrote:
Oh man, what do you know!... I'm quite amazed. I've been reviewing more
documentation about min_size and it seems it doesn't work as I
thought (although I remember specifically reading it somewhere some years ago
:/ ).
And, as all replicas need to be written before the primary OSD informs the
client that the write has completed, we cannot have the third replica on HDDs,
no way. It would kill latency.
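To make the latency argument concrete, here is a minimal sketch of the rule described above: the client sees the write complete only after every replica has committed it, so the slowest device sets the latency. The per-device latencies are hypothetical placeholders, not measurements from any cluster.

```python
# Illustrative model only: a replicated write is acknowledged to the client
# once every replica has committed it, so the slowest replica sets the latency.
# All latency figures below are hypothetical placeholders, not measurements.

def replicated_write_latency_ms(replica_latencies_ms):
    """Client-visible write latency when all replicas must commit."""
    return max(replica_latencies_ms)

# Hypothetical per-device commit latencies in milliseconds.
all_flash = {"nvme": 0.05, "sata_ssd_a": 0.2, "sata_ssd_b": 0.2}
with_hdd = {"nvme": 0.05, "sata_ssd": 0.2, "hdd": 8.0}

print("all flash:", replicated_write_latency_ms(all_flash.values()), "ms")
print("with HDD :", replicated_write_latency_ms(with_hdd.values()), "ms")
```

In this model a fast primary does not help: adding the HDD replica moves the write latency to whatever the HDD delivers.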
Well, we'll just keep adding NVMe drives to our cluster (I mean, the S4500 and
P4500 price difference is negligible) and we'll decrease the primary affinity
weight for the SATA SSDs, just to be sure we get the most out of NVMe.
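As a rough sketch of what lowering primary affinity for the SATA SSDs could look like, the snippet below only builds the `ceph osd primary-affinity` command lines and prints them for review. The OSD inventory and the weights (1.0 for NVMe, 0.25 for SATA SSD) are assumptions made up for illustration, not values from this cluster.

```python
# Sketch: build "ceph osd primary-affinity" commands per device class.
# The OSD inventory and the weights are hypothetical; adjust before use.

# Hypothetical inventory: OSD id -> device class.
osd_classes = {0: "nvme", 1: "nvme", 2: "ssd", 3: "ssd"}

# Assumed weights: NVMe OSDs keep full affinity, SATA SSDs are de-preferred.
affinity_by_class = {"nvme": 1.0, "ssd": 0.25}

def primary_affinity_commands(osd_classes, affinity_by_class):
    """Return one CLI command string per OSD, ordered by OSD id."""
    commands = []
    for osd_id in sorted(osd_classes):
        weight = affinity_by_class[osd_classes[osd_id]]
        commands.append(f"ceph osd primary-affinity osd.{osd_id} {weight}")
    return commands

for command in primary_affinity_commands(osd_classes, affinity_by_class):
    print(command)  # printed for review, nothing is executed here
```

Printing instead of executing keeps the sketch safe to run anywhere; the output can be reviewed before being run on an admin node.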
BTW, does anybody have any experience so far with erasure coding and RBD? A 2/3
profile would really save space on SSDs, but I'm worried about the extra
calculations needed and how they will affect performance... Well, maybe I'll
look into it and start a new thread :)
There's a decent chance you'll get higher performance with something
like EC 6+2 vs 3X replication for large writes due simply to having less
data to write (we see somewhere between 2x and 3x rep performance in the
lab for 4MB writes to RBD). Small random writes will almost certainly be
slower due to increased latency. Reads in general will be slower as
well. With replication the read comes entirely from the primary but in
EC you have to fetch chunks from the secondaries and reconstruct the
object before sending it back to the client.
So basically compared to 3X rep you'll likely gain some performance on
large writes, lose some performance on large reads, and lose more
performance on small writes/reads (dependent on cpu speed and various
other factors).
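The "less data to write" point can be put in numbers with a short sketch. It computes only the raw bytes written per byte of client data, not performance, and it assumes that the "2/3 profile" mentioned earlier means k=2 data chunks plus m=1 coding chunk, which the thread does not confirm.

```python
# Raw bytes written to disk per byte of client data. This is plain arithmetic
# on the layout, not a performance prediction.

def replication_overhead(copies):
    """N-way replication writes N full copies."""
    return float(copies)

def ec_overhead(k, m):
    """Erasure coding writes k data chunks plus m coding chunks."""
    return (k + m) / k

layouts = {
    "3x replication": replication_overhead(3),
    "EC 6+2": ec_overhead(6, 2),
    "EC 2+1 (assumed meaning of '2/3')": ec_overhead(2, 1),
}

for name, factor in layouts.items():
    print(f"{name}: {factor:.2f}x raw bytes written")
```

Under these assumptions EC 6+2 writes well under half the raw bytes of 3x replication, which is where the large-write advantage comes from; the small-write and read penalties described above are not captured by this arithmetic.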
Mark
Anyway, thanks for the info!
Xavier.
-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Tuesday, August 22, 2017 2:40
To: ceph-users@lists.ceph.com
CC: Xavier Trilla <xavier.tri...@silicontower.net>
Subject: Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...
Hello,
Firstly, what David said.
On Mon, 21 Aug 2017 20:25:07 +0000 Xavier Trilla wrote:
Hi,
I'm working on reducing the cost of our current Ceph cluster. We currently
keep 3 x replicas, all of them on SSDs (that cluster hosts several hundred VM
RBD disks), and lately I've been wondering if the following setup would make
sense in order to improve cost / performance.
Have you done a full analysis of your current cluster, as in utilization of
your SSDs (IOPS), CPU, etc with atop/iostat/collectd/grafana?
During peak utilization times?
If so, you should have a decent enough idea of what level IOPS you need and can
design from there.
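One possible way to turn such monitoring data into a design number is to look at the peak and a high percentile of the sampled IOPS. The sketch below uses a nearest-rank percentile; the sample values are invented for illustration and would normally come from iostat/collectd exports.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical cluster-wide write IOPS samples taken during a busy period.
iops_samples = [1200, 1500, 900, 4800, 5100, 2200, 1300, 7000, 1600, 1400]

print("median IOPS:", percentile(iops_samples, 50))
print("p95 IOPS   :", percentile(iops_samples, 95))
print("peak IOPS  :", max(iops_samples))
```

Sizing against a high percentile or the peak, rather than the average, is what "during peak utilization times" implies.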
The ideal would be to move PG primaries to high performance nodes using NVMe,
keep secondary replica in SSDs and move the third replica to HDDs.
Most probably the hardware will be:
1st Replica: Intel P4500 NVMe (2TB)
2nd Replica: Intel S3520 SATA SSD (1.6TB)
Unless you have:
a) a lot of these and/or
b) very few writes
what David said.
Aside from that whole replica idea not working as you think.
3rd Replica: WD Gold hard drives (2 TB) (I'm considering either the 1TB or
2TB model, as I want to have as many spindles as possible)
Also, hosts running OSDs would have a quite different HW configuration
(In our experience NVMe need crazy CPU power in order to get the best
out of them)
Correct, one might run into that with pure NVMe/SSD nodes.
I know the NVMe and SATA SSD replicas will work, no problem about that (we'll
just adjust the primary affinity and crushmap in order to have the desired data
layout + primary OSDs); what I'm worried about is the HDD replica.
Also the pool will have min_size 1 (would love to use min_size 2, but it would kill
latency times) so, even if we have to do some maintenance on the NVMe nodes, writes to
HDDs will always be "lazy".
Before Bluestore (we are planning to move to Luminous most probably by the end
of the year or the beginning of 2018, once it is released and tested properly)
I would just use SSD/NVMe journals for the HDDs. So, all writes would go to the
SSD journal, and then be moved to the HDD. But now, with Bluestore, I don't
think that's an option anymore.
Bluestore bits are still a bit of dark magic in terms of concise and complete
documentation, but the essentials have been mentioned here before.
Essentially, if you can get the needed IOPS with SSD/NVMe journals and HDDs,
Bluestore won't be worse than that, if done correctly.
With Bluestore, use NVMe for the WAL (small space, high IOPS/data), SSDs
for the actual RocksDB and the (surprise, surprise!) journal for small writes
(large space, nobody knows for sure how large is large enough), and finally the
HDDs.
If you're trying to optimize costs, decent SSDs (good luck finding any with
Intel 37xx and 36xx basically unavailable), maybe the S or P 4600, to hold both
the WAL and DB should do the trick.
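A sketch of how that layout might be expressed when creating an OSD, assuming the `ceph-volume lvm create` tool from later Luminous point releases is used (the thread itself does not name a tool). The device paths are hypothetical. When only `--block.db` is given, the WAL is placed on the DB device, which matches the "one SSD for both" suggestion.

```python
# Sketch: assemble a "ceph-volume lvm create" command line for a Bluestore OSD
# with data on an HDD and DB (plus WAL) on a faster device.
# Device paths are hypothetical; the command is only printed, never executed.

def bluestore_osd_argv(data_dev, db_dev=None, wal_dev=None):
    """Return the argv list for creating one Bluestore OSD."""
    argv = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data_dev]
    if db_dev:
        argv += ["--block.db", db_dev]    # RocksDB metadata on the fast device
    if wal_dev:
        argv += ["--block.wal", wal_dev]  # only needed if the WAL is split out
    return argv

# HDD for data, one SSD/NVMe partition holding both DB and WAL.
print(" ".join(bluestore_osd_argv("/dev/sdb", db_dev="/dev/nvme0n1p1")))

# Three-tier variant: HDD data, SSD DB, NVMe WAL.
print(" ".join(bluestore_osd_argv("/dev/sdc", db_dev="/dev/sdd1",
                                  wal_dev="/dev/nvme0n1p2")))
```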
Christian
What I'm worried about is how having a quite slow third replica would affect
the NVMe primary OSDs. WD Gold hard drives seem quite decent (for a SATA drive)
but obviously their performance is nowhere near that of SSDs or NVMe.
So, what do you think? Does anybody have some opinions or experience they would
like to share?
Thanks!
Xavier.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com