Hi,

On Tue, Mar 21, 2017 at 11:59 AM, Adam Carheden <carhe...@ucar.edu> wrote:
> Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
> fail. Are all of the following accurate?
>
> a. An rbd image is split into lots of objects, parts of which will
> probably exist on all 4 hosts.

Correct.

> b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.

Likely correct.

> c. Reads can continue from the single online OSD even in pgs that
> happened to have 2 of 3 OSDs offline.

Hypothetically (this is partially informed guessing on my part): if the
survivor happens to be the acting primary and it was up-to-date at the time,
it can in theory serve reads (only the primary serves reads). If the survivor
wasn't the acting primary, you don't have any guarantees as to whether or not
it had the most up-to-date version of any objects. I don't know if enough
state is tracked outside of the OSDs to make this determination, but I doubt
it (it feels costly to maintain). Regardless of scenario - I'd guess - the PG
is marked as down, and will stay that way until you either revive one of the
deceased OSDs or essentially tell Ceph that they're a lost cause and accept
the potential data loss that comes with that (see: ceph osd lost).

> d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
> the min_size=2 constraint.

Correct.

> e. Rebalancing does not occur because with only two hosts online there
> is no way for CRUSH to meet the size=3 constraint even if it were to
> rebalance.

Partially correct, see c).

> f. I/O can be restored by setting min_size=1.

See c).

> g. Alternatively, I/O can be restored by setting size=2, which would
> kick off rebalancing and restore I/O as the pgs come into compliance
> with the size=2 constraint.

See c).

> h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
> two hosts fail, some pgs would have only 1 OSD online, but rebalancing
> would start immediately since CRUSH can honor the size=3 constraint by
> rebalancing. This means more nodes make for a more reliable cluster.

See c).

Side-note: this is where you start using CRUSH to enumerate what you'd
consider the likely failure domains for concurrent failures. E.g. if you have
racks with distinct power circuits and TOR switches, your most likely
large-scale failure is a rack, so you tell CRUSH to maintain replicas in
distinct racks.

> i. If I wanted to force CRUSH to bring I/O back online with size=3 and
> min_size=2 but only 2 hosts online, I could remove the host bucket from
> the crushmap. CRUSH would then rebalance, but some PGs would likely end
> up with 3 OSDs all on the same host. (This is theory. I promise not to
> do any such thing to a production system ;)

Partially correct, see c).
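To make c), f) and g) a bit more concrete, here's a rough, untested sketch of
the commands involved (pool name "rbd" and the placeholder ids are
assumptions - adjust to your setup):

  # Which PGs are stuck, and why:
  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg <pgid> query                          # peering state for one PG

  # f) accept I/O with a single surviving replica (risky):
  ceph osd pool set rbd min_size 1

  # g) or lower the replication factor instead:
  ceph osd pool set rbd size 2

  # Last resort, if the dead OSDs are never coming back (may lose data):
  ceph osd lost <osd-id> --yes-i-really-mean-it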
> Thanks
> --
> Adam Carheden
>
> On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> > If you had set min_size to 1 you would not have seen the writes pause. A
> > min_size of 1 is dangerous though because it means you are 1 hard disk
> > failure away from losing the objects within that placement group
> > entirely. A min_size of 2 is generally considered the minimum you want,
> > but many people ignore that advice, and some wish they hadn't.
> >
> > On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carhe...@ucar.edu> wrote:
> > > Thanks everyone for the replies. Very informative. However, should I
> > > have expected writes to pause if I'd had min_size set to 1 instead of 2?
> > >
> > > And yes, I was under the false impression that my rbd device was a
> > > single object. That explains what all those other things are on a test
> > > cluster where I only created a single object!
> > >
> > > --
> > > Adam Carheden
> > >
> > > On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > > > This is because of the min_size specification. I would bet you have
> > > > it set at 2 (which is good).
> > > >
> > > > ceph osd pool get rbd min_size
> > > >
> > > > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> > > > from each host) results in some of the objects having only 1 replica;
> > > > min_size dictates that IO freezes for those objects until min_size is
> > > > achieved.
> > > > http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> > > >
> > > > I can't tell if you're under the impression that your RBD device is a
> > > > single object. It is not. It is chunked up into many objects and
> > > > spread throughout the cluster, as Kjetil mentioned earlier.
> > > >
> > > > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
> > > > > Hi,
> > > > >
> > > > > rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> > > > > will get you a "prefix", which then gets you on to
> > > > > rbd_header.<prefix>. rbd_header.<prefix> contains block size,
> > > > > striping, etc. The actual data-bearing objects will be named
> > > > > something like rbd_data.<prefix>.%016x.
> > > > >
> > > > > Example: vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> > > > > <block size> of that image will be named
> > > > > rbd_data.86ce2ae8944a.0000000000000000, the second <block size>
> > > > > will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances
> > > > > are that one of these objects is mapped to a PG which has both
> > > > > host3 and host4 among its replicas.
> > > > >
> > > > > An rbd image will end up scattered across most/all OSDs of the pool
> > > > > it's in.
> > > > >
> > > > > Cheers,
> > > > > -KJ
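(An aside, since this comes up a lot: assuming the image lives in a pool
named "rbd", you can see this chunking for yourself with something along
these lines - the prefix below is the one from this thread:

  rbd info rbd/vm-100-disk-1     # look for block_name_prefix, e.g. rbd_data.86ce2ae8944a
  rados -p rbd ls | grep rbd_data.86ce2ae8944a | head
  ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000

The last command prints the PG, and the OSDs, that a given data object maps
to.)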
> > > > > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu> wrote:
> > > > > > I have a 4 node cluster shown by `ceph osd tree` below. Monitors
> > > > > > are running on hosts 1, 2 and 3. It has a single replicated pool
> > > > > > of size 3. I have a VM with its hard drive replicated to OSDs
> > > > > > 11 (host3), 5 (host1) and 3 (host2).
> > > > > >
> > > > > > I can 'fail' any one host by disabling the SAN network interface
> > > > > > and the VM keeps running with a simple slowdown in I/O
> > > > > > performance, just as expected. However, if I 'fail' both nodes 3
> > > > > > and 4, I/O hangs on the VM (i.e. `df` never completes, etc.). The
> > > > > > monitors on hosts 1 and 2 still have quorum, so that shouldn't be
> > > > > > an issue. The placement group still has 2 of its 3 replicas
> > > > > > online.
> > > > > >
> > > > > > Why does I/O hang even though host4 isn't running a monitor and
> > > > > > doesn't have anything to do with my VM's hard drive?
> > > > > >
> > > > > > Size?
> > > > > > # ceph osd pool get rbd size
> > > > > > size: 3
> > > > > >
> > > > > > Where's rbd_id.vm-100-disk-1?
> > > > > > # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
> > > > > > got osdmap epoch 1043
> > > > > > osdmaptool: osdmap file '/tmp/map'
> > > > > > object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> > > > > >
> > > > > > # ceph osd tree
> > > > > > ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > > > > > -1 8.06160 root default
> > > > > > -7 5.50308     room A
> > > > > > -3 1.88754         host host1
> > > > > >  4 0.40369             osd.4       up  1.00000          1.00000
> > > > > >  5 0.40369             osd.5       up  1.00000          1.00000
> > > > > >  6 0.54008             osd.6       up  1.00000          1.00000
> > > > > >  7 0.54008             osd.7       up  1.00000          1.00000
> > > > > > -2 3.61554         host host2
> > > > > >  0 0.90388             osd.0       up  1.00000          1.00000
> > > > > >  1 0.90388             osd.1       up  1.00000          1.00000
> > > > > >  2 0.90388             osd.2       up  1.00000          1.00000
> > > > > >  3 0.90388             osd.3       up  1.00000          1.00000
> > > > > > -6 2.55852     room B
> > > > > > -4 1.75114         host host3
> > > > > >  8 0.40369             osd.8       up  1.00000          1.00000
> > > > > >  9 0.40369             osd.9       up  1.00000          1.00000
> > > > > > 10 0.40369             osd.10      up  1.00000          1.00000
> > > > > > 11 0.54008             osd.11      up  1.00000          1.00000
> > > > > > -5 0.80737         host host4
> > > > > > 12 0.40369             osd.12      up  1.00000          1.00000
> > > > > > 13 0.40369             osd.13      up  1.00000          1.00000
> > > > > >
> > > > > > --
> > > > > > Adam Carheden
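One more note on the rack point from the side-note under h): the tree above
already has room buckets, so separating replicas by room (or rack) is just a
CRUSH rule away. A sketch only - the rule name and ruleset id are made up,
and you need at least as many rooms/racks as "size" for every replica to get
placed (with only the two rooms above and size=3, PGs would stay one replica
short):

  rule replicated_per_room {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type room
          step emit
  }

Or, without hand-editing the crushmap:
`ceph osd crush rule create-simple replicated_per_room default room`, then
point the pool at it with `ceph osd pool set rbd crush_ruleset <ruleset id>`
(jewel naming).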
--
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com