On Fri, Jan 11, 2019 at 10:07 PM Brian Topping <brian.topp...@gmail.com>
wrote:

> Hi all,
>
> I have a simple two-node Ceph cluster that I’m comfortable with the care
> and feeding of. Both nodes are in a single rack and captured in the
> dump below: two nodes, a single mon, and all pools at size 2. Due to
> physical limitations, the primary location can’t grow past two nodes at the
> present time. As for hardware, each node is an 18-core Xeon with 128GB RAM,
> connected by 10GbE.
>
> My next goal is to add an offsite replica and would like to validate the
> plan I have in mind. For its part, the offsite replica can be considered
> read-only except for the occasional snapshot in order to run backups to
> tape. The offsite location is connected with a reliable and secured
> ~350Kbps WAN link.
>

Unfortunately this is just not going to work. All writes to a Ceph OSD are
replicated synchronously to every replica, all reads are served from the
primary OSD for any given piece of data, and unless you do some hackery on
your CRUSH map each of your 3 OSD nodes is going to be a primary for about
1/3 of the total data.
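
To put rough numbers on it: the `ceph -s` output below already shows 612
KiB/s of client writes, and with synchronous replication, any write whose
replica (or primary) lands offsite has to cross the 350 Kbps link before it
is acknowledged. A quick back-of-the-envelope (just arithmetic, using the
figures from this thread):

```shell
# Compare the WAN link to the current client write rate from `ceph -s`.
# 350 Kbps and 612 KiB/s are the numbers from the original post.
awk 'BEGIN {
    wan_kib = 350 * 1000 / 8 / 1024        # 350 Kbps in KiB/s, ~42.7
    write_kib = 612                        # "client: 612 KiB/s wr"
    printf "WAN capacity: %.1f KiB/s\n", wan_kib
    printf "Current writes are %.1fx the link\n", write_kib / wan_kib
}'
```

Even today's modest write load is roughly 14x what that link can carry, so
affected writes would stall on the WAN.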

If you want to move your data off-site asynchronously, there are various
options for doing that in RBD (either periodic snapshots and export-diff,
or by maintaining a journal and streaming it out) and RGW (with the
multi-site stuff). But you're not going to be successful trying to stretch
a Ceph cluster over that link.
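
For the RBD snapshot route, the usual pattern is `rbd snap create` plus
`rbd export-diff` piped into `rbd import-diff` on the far side, so only
changed blocks cross the WAN. A minimal sketch, assuming a pool `rbd`, an
image named `backups`, a reachable host `offsite-host`, and a previous
snapshot already present offsite (all of these names are illustrative, not
from the poster's setup):

```shell
#!/bin/sh
# Illustrative names only -- adjust pool/image/host for the real cluster.
POOL=rbd
IMAGE=backups
PREV=backup-prev                # last snapshot already shipped offsite
TODAY=backup-$(date +%Y%m%d)    # e.g. backup-20190111

# Snapshot the image on the primary cluster.
rbd snap create "$POOL/$IMAGE@$TODAY"

# Send only the blocks changed since $PREV across the slow link and
# apply them to the offsite copy of the image.
rbd export-diff --from-snap "$PREV" "$POOL/$IMAGE@$TODAY" - \
    | ssh offsite-host "rbd import-diff - $POOL/$IMAGE"
```

Run on a schedule, the WAN only ever carries the diff, and the offsite image
remains a point-in-time copy that can itself be snapshotted for tape.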
-Greg


>
> The following presuppositions bear challenge:
>
> * There is only a single mon at the present time, which could be expanded
> to three with the offsite location. Two mons at the primary location would
> obviously mean a lower MTBF than one, but with a third on the other side of
> the WAN, I could create resiliency against *either* a WAN failure or a
> single node maintenance event.
> * Because there are two mons at the primary location and one at the
> offsite, the degradation mode for a WAN loss (most likely scenario due to
> facility support) leaves the primary nodes maintaining the quorum, which is
> desirable.
> * It’s clear that a WAN failure and a mon failure at the primary location
> will halt cluster access.
> * The CRUSH maps will be managed to reflect the topology change.
>
> If that’s a good capture so far, I’m comfortable with it. What I don’t
> understand is what to expect in actual use:
>
> * Is the link speed asymmetry between the two primary nodes and the
> offsite node going to create significant risk or unexpected behaviors?
> * Will the performance of the two primary nodes be limited by the speed
> at which the offsite mon can participate? Or will the primary mons correctly
> determine they have quorum and keep moving forward under normal operation?
> * In the case of an extended WAN outage (and presuming full uptime on
> primary site mons), would return to full cluster health be simply a matter
> of time? Are there any limits on how long the WAN could be down if the
> other two maintain quorum?
>
> I hope I’m asking the right questions here. Any feedback appreciated,
> including blogs and RTFM pointers.
>
>
> Thanks for a great product!! I’m really excited for this next frontier!
>
> Brian
>
> > [root@gw01 ~]# ceph -s
> >  cluster:
> >    id:     nnnn
> >    health: HEALTH_OK
> >
> >  services:
> >    mon: 1 daemons, quorum gw01
> >    mgr: gw01(active)
> >    mds: cephfs-1/1/1 up  {0=gw01=up:active}
> >    osd: 8 osds: 8 up, 8 in
> >
> >  data:
> >    pools:   3 pools, 380 pgs
> >    objects: 172.9 k objects, 11 GiB
> >    usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
> >    pgs:     380 active+clean
> >
> >  io:
> >    client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> >
> > [root@gw01 ~]# ceph df
> > GLOBAL:
> >    SIZE        AVAIL       RAW USED     %RAW USED
> >    5.8 TiB     5.8 TiB       30 GiB          0.51
> > POOLS:
> >    NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
> >    cephfs_metadata     2      264 MiB         0       2.7 TiB        1085
> >    cephfs_data         3      8.3 GiB      0.29       2.7 TiB      171283
> >    rbd                 4      2.0 GiB      0.07       2.7 TiB         542
> > [root@gw01 ~]# ceph osd tree
> > ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
> > -1       5.82153 root default
> > -3       2.91077     host gw01
> > 0   ssd 0.72769         osd.0     up  1.00000 1.00000
> > 2   ssd 0.72769         osd.2     up  1.00000 1.00000
> > 4   ssd 0.72769         osd.4     up  1.00000 1.00000
> > 6   ssd 0.72769         osd.6     up  1.00000 1.00000
> > -5       2.91077     host gw02
> > 1   ssd 0.72769         osd.1     up  1.00000 1.00000
> > 3   ssd 0.72769         osd.3     up  1.00000 1.00000
> > 5   ssd 0.72769         osd.5     up  1.00000 1.00000
> > 7   ssd 0.72769         osd.7     up  1.00000 1.00000
> > [root@gw01 ~]# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS
> > 0   ssd 0.72769  1.00000 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115
> > 2   ssd 0.72769  1.00000 745 GiB 3.1 GiB 742 GiB 0.42 0.82  83
> > 4   ssd 0.72769  1.00000 745 GiB 3.6 GiB 742 GiB 0.49 0.96  90
> > 6   ssd 0.72769  1.00000 745 GiB 3.5 GiB 742 GiB 0.47 0.93  92
> > 1   ssd 0.72769  1.00000 745 GiB 3.4 GiB 742 GiB 0.46 0.90  76
> > 3   ssd 0.72769  1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102
> > 5   ssd 0.72769  1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02  98
> > 7   ssd 0.72769  1.00000 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104
> >                    TOTAL 5.8 TiB  30 GiB 5.8 TiB 0.51
> > MIN/MAX VAR: 0.82/1.29  STDDEV: 0.07
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
