This is because of the min_size setting. I would bet you have it set to
2 (which is good):

ceph osd pool get rbd min_size

With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1 from
each host) leaves some of the objects with only 1 replica.
min_size dictates that I/O freezes for those objects until min_size is
achieved again.
http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
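If you need I/O to continue while only 1 replica survives, you could
temporarily drop min_size to 1, at the cost of accepting writes with no
redundancy (risky; revert once the hosts are back). A sketch, assuming the
pool is named rbd as in your output below:

ceph osd pool get rbd min_size    # confirm the current value (likely 2)
ceph osd pool set rbd min_size 1  # allow I/O with a single replica
ceph osd pool set rbd min_size 2  # restore once recovery completes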

I can't tell if you're under the impression that your RBD device is a single
object. It is not. It is chunked up into many objects and spread throughout
the cluster, as Kjetil mentioned earlier.
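You can see this for yourself; a sketch, assuming the image lives in the
default rbd pool (the prefix shown is just the example from below, yours
will differ):

rbd info vm-100-disk-1                        # note block_name_prefix, e.g. rbd_data.86ce2ae8944a
rados -p rbd ls | grep rbd_data.86ce2ae8944a  # one object per <block size> chunk written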

On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com>
wrote:

> Hi,
>
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
> you a "prefix", which then gets you on to rbd_header.<prefix>.
> rbd_header.<prefix> contains block size, striping, etc. The actual
> data-bearing objects will be named something like rbd_data.<prefix>.%016x.
>
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <block
> size> of that image will be named rbd_data.86ce2ae8944a.0000000000000000,
> the second <block size> will be rbd_data.86ce2ae8944a.0000000000000001,
> and so on; chances are that one of these objects is mapped to a pg which
> has both host3 and host4 among its replicas.
>
> An rbd image will end up scattered across most/all osds of the pool it's
> in.
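>
> To check where individual data objects actually land, something like this
> should work (illustrative; uses the example prefix above, substitute your
> own):
>
> ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000
> ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000001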
>
> Cheers,
> -KJ
>
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu> wrote:
>
>> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>> running on hosts 1, 2 and 3. It has a single replicated pool of size
>> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
>> 5(host1) and 3(host2).
>>
>> I can 'fail' any one host by disabling the SAN network interface, and
>> the VM keeps running with a simple slowdown in I/O performance, just as
>> expected. However, if I 'fail' both hosts 3 and 4, I/O hangs on the VM
>> (i.e. `df` never completes, etc.). The monitors on hosts 1 and 2 still
>> have quorum, so that shouldn't be an issue. The placement group still
>> has 2 of its 3 replicas online.
>>
>> Why does I/O hang even though host4 isn't running a monitor and
>> doesn't have anything to do with my VM's hard drive?
>>
>>
>> Size?
>> # ceph osd pool get rbd size
>> size: 3
>>
>> Where's rbd_id.vm-100-disk-1?
>> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
>> rbd_id.vm-100-disk-1 /tmp/map
>> got osdmap epoch 1043
>> osdmaptool: osdmap file '/tmp/map'
>>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>>
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 8.06160 root default
>> -7 5.50308     room A
>> -3 1.88754         host host1
>>  4 0.40369             osd.4       up  1.00000          1.00000
>>  5 0.40369             osd.5       up  1.00000          1.00000
>>  6 0.54008             osd.6       up  1.00000          1.00000
>>  7 0.54008             osd.7       up  1.00000          1.00000
>> -2 3.61554         host host2
>>  0 0.90388             osd.0       up  1.00000          1.00000
>>  1 0.90388             osd.1       up  1.00000          1.00000
>>  2 0.90388             osd.2       up  1.00000          1.00000
>>  3 0.90388             osd.3       up  1.00000          1.00000
>> -6 2.55852     room B
>> -4 1.75114         host host3
>>  8 0.40369             osd.8       up  1.00000          1.00000
>>  9 0.40369             osd.9       up  1.00000          1.00000
>> 10 0.40369             osd.10      up  1.00000          1.00000
>> 11 0.54008             osd.11      up  1.00000          1.00000
>> -5 0.80737         host host4
>> 12 0.40369             osd.12      up  1.00000          1.00000
>> 13 0.40369             osd.13      up  1.00000          1.00000
>>
>>
>> --
>> Adam Carheden
>
>
>
> --
> Kjetil Joergensen <kje...@medallia.com>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210