Hi,
You are right, I missed that there is a default timeout before a down OSD is
switched from in to out ("mon osd down out interval": 300), and I
didn't wait long enough before starting it again.
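For reference, the effective value can be checked on a monitor node (the mon
ID below is just an example) and overridden in ceph.conf:

    ceph daemon mon.osd01 config get mon_osd_down_out_interval

    [mon]
    mon osd down out interval = 300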
Kind regards,
Piotr Dzionek
On 28.11.2016 at 16:12, David Turner wrote:
In your cluster the OSD is down, not out. Only when an OSD goes out does
the data start to rebuild. Once the OSD is marked out, the status will
show 11/11 OSDs up instead of 1/12 OSDs down.
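If you don't want to wait for the timeout, you can also mark the down OSD
out yourself to start recovery immediately, e.g.:

    ceph osd out 3

(use the ID of whichever OSD is down).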
------------------------------------------------------------------------
David Turner | Cloud Operations Engineer |
StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
------------------------------------------------------------------------
*From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
Piotr Dzionek [piotr.dzio...@seqr.com]
*Sent:* Monday, November 28, 2016 4:54 AM
*To:* ceph-users@lists.ceph.com
*Subject:* [ceph-users] - cluster stuck and undersized if at least one
osd is down
Hi,
I recently installed a 3-node Ceph cluster, v10.2.3. It has 3 mons and
12 OSDs. I removed the default pool and created the following one:
pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags
hashpspool stripe_width 0
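For reference, I created it roughly like this (exact commands from memory,
numbers as in the dump above):

    ceph osd pool create data 1024 1024 replicated
    ceph osd pool set data size 2
    ceph osd pool set data min_size 1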
The cluster is healthy when all OSDs are up; however, if I stop any of the
OSDs, it becomes stuck and undersized - it does not rebuild.
 cluster *****
health HEALTH_WARN
166 pgs degraded
108 pgs stuck unclean
166 pgs undersized
recovery 67261/827220 objects degraded (8.131%)
1/12 in osds are down
monmap e3: 3 mons at
{**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*****.146:6789/0}
election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
flags sortbitwise
pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
2452 GB used, 42231 GB / 44684 GB avail
67261/827220 objects degraded (8.131%)
858 active+clean
166 active+undersized+degraded
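If more detail is useful I can also post the output of the usual queries:

    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg dump_stuck undersized
    ceph osd tree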
The replica size is 2 and I use the following crush map:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host osd01 {
id -2 # do not change unnecessarily
# weight 14.546
alg straw
hash 0 # rjenkins1
item osd.0 weight 3.636
item osd.1 weight 3.636
item osd.2 weight 3.636
item osd.3 weight 3.636
}
host osd02 {
id -3 # do not change unnecessarily
# weight 14.546
alg straw
hash 0 # rjenkins1
item osd.4 weight 3.636
item osd.5 weight 3.636
item osd.6 weight 3.636
item osd.7 weight 3.636
}
host osd03 {
id -4 # do not change unnecessarily
# weight 14.546
alg straw
hash 0 # rjenkins1
item osd.8 weight 3.636
item osd.9 weight 3.636
item osd.10 weight 3.636
item osd.11 weight 3.636
}
root default {
id -1 # do not change unnecessarily
# weight 43.637
alg straw
hash 0 # rjenkins1
item osd01 weight 14.546
item osd02 weight 14.546
item osd03 weight 14.546
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
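For what it's worth, the rule can also be tested offline with crushtool
(the file name is just an example):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings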
I am not sure what the reason for the undersized state is. All OSD disks
are the same size and the replica size is 2. Data is also only replicated
on a per-host basis, and I have 3 separate hosts. Maybe the number of PGs
is incorrect? Is 1024 too big? Or maybe there is some misconfiguration
in the crush map?
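(For reference, if I apply the usual rule of thumb of roughly 100 PGs per
OSD, I get (12 x 100) / 2 replicas = 600, and 1024 is the next power of two
above that, so the PG count itself does not look unreasonable to me.)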
Kind regards,
Piotr Dzionek
--
Piotr Dzionek
System Administrator
SEQR Poland Sp. z o.o.
ul. Łąkowa 29, 90-554 Łódź, Poland
Mobile: +48 796555587
Mail: piotr.dzio...@seqr.com
www.seqr.com | www.seamless.se
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com