Whether or not 2x replication is possible has little to do with the technology and EVERYTHING to do with your use case.  How redundant is your hardware, for instance?  If you have the best drives in the world, drives that will never fail after 100 years of constant use... but you don't have redundant power or bonded networking, are running on used hardware, or are in a cheap datacenter that doesn't guarantee 99.999% uptime, then you are going to lose hosts regardless of what your disks are.

As Wido was quoted saying, the biggest problem with 2x replication is that people use it with min_size=1.  That combination is a cancer and will eventually leave you with inconsistent data and most likely data loss.  OTOH, size=2 with min_size=2 means that you need to schedule downtime to restart your Ceph hosts for kernel updates, Ceph upgrades, restarting daemons with new config file options that can't be injected, etc.  You can get around that by dropping to min_size=1 while you perform the scheduled maintenance.  If for any reason you ever lose a server, an NVMe drive, etc. while running with 2 replicas and min_size=2, then you have unscheduled downtime.
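
As an aside, that maintenance workaround boils down to a couple of flags you flip and then flip back.  A minimal sketch, assuming a hypothetical pool named "mypool" (use your own pool names, and wait for PGs to be active+clean before restoring):

  ceph osd set noout                     # optional: keep OSDs from being marked out during the restart
  ceph osd pool set mypool min_size 1    # allow I/O with a single surviving replica

  # ... restart the host / daemons, apply the kernel or Ceph update ...

  ceph osd pool set mypool min_size 2    # restore once recovery has finished
  ceph osd unset noout

Just remember that while min_size=1 is in effect you have zero margin; a second failure in that window is exactly the scenario Wido warns about below.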

Running with 2x replication right now is possible.  For that matter, people run with 1x replication all the time (especially in testing).  But you will never get anyone to tell you that it is the optimal configuration, because for general use cases that is, and always will be, a lie no matter how robust and bulletproof your drives are.  The primary problem is that nodes need to be restarted, power goes out, and Murphy's law applies.  Is your use case such that a certain percentage of data loss is acceptable?  Then run size=2 and min_size=1 and assume that you will eventually lose data.  Does your use case allow for unexpected downtime?  Then run size=2 and min_size=2.  If you cannot lose data no matter what and must maintain the highest possible uptime, then you should be asking questions about multi-site replication and the downsides of running 4x replication... 2x replication shouldn't even cross your mind.
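
Purely to make those three options concrete, they map onto the usual pool flags.  A sketch only, again with a made-up pool name "mypool":

  # some eventual data loss is acceptable
  ceph osd pool set mypool size 2
  ceph osd pool set mypool min_size 1

  # unexpected downtime is acceptable, data loss is not
  ceph osd pool set mypool size 2
  ceph osd pool set mypool min_size 2

  # neither is acceptable: keep the defaults
  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2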

Now I'm assuming that you're broaching the topic because a 3x replica NVMe cluster is super expensive.  I think all of us feel your pain there, otherwise we'd all be running one.  A topic that has come up on the ML a couple of times is to use primary_affinity and an interesting arrangement of buckets in your CRUSH map to build a cluster with both SSD and HDD storage in such a way that your data is well protected, but all writes and reads hit the SSD/NVMe devices.  What you do is create 3 "racks" in your CRUSH map and use a rack failure domain.  One rack holds all of your SSD hosts, and your HDD hosts with SSD/NVMe journals (matching what your other nodes have) are split between the other two racks.  Then you set primary_affinity=0 on all of your HDD OSDs, forcing Ceph to use the SSD/NVMe OSDs as the primary for all of the PGs.

What you end up with is a 3-replica setup where one, and only one, copy goes onto an SSD and two copies go onto HDDs.  Once this is set up, writes still go to every OSD in a PG, so two of the three writes land on HDDs, but the write acks as soon as it is committed to the SSD/NVMe journals, so in effect your writes happen at all-flash speed.  Reads are only ever served by the primary OSD of a PG, so all reads come from the SSD/NVMe OSDs.  Recovery/backfill will be slower because you'll be reading a fair amount of your data from HDDs, but that's a fairly insignificant sacrifice for what you're gaining.  For every 1TB of flash storage you need 2TB of HDD storage; any HDD capacity beyond that ratio is wasted and won't be used.
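
A very rough sketch of what that can look like from the CLI.  Every rack, host, OSD and pool name here is made up, and on pre-Luminous releases you may also need "mon osd allow primary affinity = true" before primary-affinity values other than 1 take effect:

  # three "racks": one for the SSD hosts, two for the HDD hosts with flash journals
  ceph osd crush add-bucket rack-ssd rack
  ceph osd crush add-bucket rack-hdd-a rack
  ceph osd crush add-bucket rack-hdd-b rack
  ceph osd crush move rack-ssd root=default
  ceph osd crush move rack-hdd-a root=default
  ceph osd crush move rack-hdd-b root=default
  ceph osd crush move ssd-host1 rack=rack-ssd
  ceph osd crush move hdd-host1 rack=rack-hdd-a
  ceph osd crush move hdd-host2 rack=rack-hdd-b

  # replicated rule that places one copy per rack
  ceph osd crush rule create-simple one-per-rack default rack

  # point the pool at the new rule (check its id with 'ceph osd crush rule dump';
  # the setting is crush_ruleset pre-Luminous, crush_rule from Luminous onwards)
  ceph osd pool set mypool crush_ruleset 1

  # never pick the HDD OSDs as primaries
  ceph osd primary-affinity osd.20 0
  ceph osd primary-affinity osd.21 0
  # ...and so on for every HDD OSD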

To recap... the problem with 2x replication isn't the disk failure rate or how bulletproof your hardware is.  Unless downtime or data loss is acceptable, just don't talk about 2x replication.  But you can have 3 replicas that run as fast as all-flash while only paying for 1 replica of flash storage, plus enough flash journals for the slower HDD replicas.  The trade-off is that you limit future customization of your CRUSH map if you ever want to configure real logical racks for a growing/large cluster, and you generally increase the complexity of adding new storage nodes.

If downtime or data loss is not an acceptable running state, and running with a complex CRUSH map is not viable because of who will be in charge of adding the storage... then you're back to 3x replicas on the same type of storage.

On Thu, Jun 8, 2017 at 9:32 AM <i...@witeq.com> wrote:

> I'm thinking of delaying this project until the Luminous release to
> have Bluestore support.
>
> So are you telling me that checksum capability will be present in
> Bluestore, and that using NVMe with 2x replication for production data
> will therefore be possible?
>
>
> ------------------------------
> *From: *"nick" <n...@fisk.me.uk>
> *To: *"Vy Nguyen Tan" <vynt.kensh...@gmail.com>, i...@witeq.com
> *Cc: *"ceph-users" <ceph-users@lists.ceph.com>
> *Sent: *Thursday, June 8, 2017 3:19:20 PM
> *Subject: *RE: [ceph-users] 2x replica with NVMe
>
> There are two main concerns with using 2x replicas, recovery speed and
> coming across inconsistent objects.
>
>
>
> With spinning disks, their size relative to their access speed means recovery
> can take a long time, which increases the chance that additional failures
> happen during the recovery process. NVMe will recover a lot faster, so this
> risk is greatly reduced, which means that using 2x replicas may be possible.
>
>
>
> However, with Filestore there are no checksums, so in the event of
> inconsistent objects there is no way to determine which one is corrupt. So
> even with NVMe, I would not feel 100% confident using 2x replicas. With
> Bluestore this problem will go away.
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Vy Nguyen Tan
> *Sent:* 08 June 2017 13:47
> *To:* i...@witeq.com
> *Cc:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] 2x replica with NVMe
>
>
>
> Hi,
>
>
>
> I think that 2x replication on HDD and SSD amounts to the same thing. You
> should read the quote from Wido below:
>
>
>
> ""Hi,
>
>
> As a Ceph consultant I get numerous calls throughout the year to help
> people with getting their broken Ceph clusters back online.
>
> The causes of downtime vary widely, but one of the biggest is that
> people use 2x replication: size = 2, min_size = 1.
>
> In 2016 the number of cases I had where data was lost due to these
> settings grew exponentially.
>
> Usually a disk fails, recovery kicks in, and while recovery is happening a
> second disk fails, causing PGs to become incomplete.
>
> There have been too many times where I had to use xfs_repair on broken disks
> and use ceph-objectstore-tool to export/import PGs.
>
> I really don't like these cases, mainly because they can be prevented
> easily by using size = 3 and min_size = 2 for all pools.
>
> With size = 2 you go into the danger zone as soon as a single disk/daemon
> fails. With size = 3 you always have two additional copies left thus
> keeping your data safe(r).
>
> If you are running CephFS, at least consider running the 'metadata' pool
> with size = 3 to keep the MDS happy.
>
> Please, let this be a big warning to everybody who is running with size =
> 2. The downtime and problems caused by missing objects/replicas are usually
> big, and it takes days to recover from them. But very often data is also lost
> and/or corrupted, which causes even more problems.
>
> I can't stress this enough. Running with size = 2 in production is a
> SERIOUS hazard and should not be done imho.
>
> To anyone out there running with size = 2, please reconsider this!
>
> Thanks,
>
> Wido""
>
>
>
> On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:
>
> Hi all,
>
>
>
> I'm going to build an all-flash Ceph cluster. Looking around the existing
> documentation, I see lots of guides and use case scenarios from various
> vendors testing Ceph with 2x replication.
>
>
>
> Now, I'm an old-school Ceph user; I have always considered 2x replication
> really dangerous for production data, especially when both OSDs can't decide
> which replica is the good one.
>
> Why do all the NVMe storage vendors and partners use only 2x replication?
>
> They claim it's safe because NVMe is better at handling errors, but I
> usually don't trust marketing claims :)
>
> Is it true? Can someone confirm that NVMe is different from HDD and that
> 2x replication can therefore be considered safe for production?
>
>
>
> Many Thanks
>
> Giordano
>
>
>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
