Hi Greg,

The problem is in Kraken: when a pool is created with an EC profile,
min_size is set to k+1 instead of k, so for an m=1 profile min_size equals
the full erasure size (k+m). The pool dumps below show this; a rough repro
sketch follows them.

For a 3+1 profile, the pool status is:
pool 2 'cdvr_ec' erasure size 4 min_size 4 crush_ruleset 1 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 234 flags hashpspool
stripe_width 4128

For a 4+1 profile:
pool 5 'cdvr_ec' erasure size 5 min_size 5 crush_ruleset 1 object_hash
rjenkins pg_num 4096 pgp_num 4096

For a 3+2 profile:
pool 3 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 412 flags hashpspool
stripe_width 4128

Whereas on the Jewel release, for EC 4+1:
pool 30 'cdvr_ec' *erasure size 5 min_size 4* crush_ruleset 1 object_hash
rjenkins pg_num 4096 pgp_num 4096
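
For reference, the pools above were created along these lines (shown for
the 3+1 pool; the profile name is only illustrative and our exact profile
options may differ):

ceph osd erasure-code-profile set cdvr_profile_3_1 k=3 m=1
ceph osd pool create cdvr_ec 1024 1024 erasure cdvr_profile_3_1
ceph osd dump | grep cdvr_ec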

I am trying to modify min_size and verify the status.
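
Something like the following should do it (pool name as above; 3 is k for
the 3+1 pool):

ceph osd pool get cdvr_ec min_size
ceph osd pool set cdvr_ec min_size 3
ceph osd dump | grep cdvr_ec

Setting min_size back to k matches what Jewel does; I understand that
running with exactly k shards leaves no redundancy margin, so this is only
to verify the recovery behaviour.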

Is there a reason behind this change in Ceph Kraken, or is it a bug?

Thanks,
Muthu




On 31 January 2017 at 18:17, Muthusamy Muthiah <muthiah.muthus...@gmail.com>
wrote:

> Hi Greg,
>
> Following are the test outcomes for EC profiles (n = k + m):
>
>
>
> 1. Kraken filestore and bluestore with m=1: recovery does not start.
>
> 2. Jewel filestore and bluestore with m=1: recovery happens.
>
> 3. Kraken bluestore with all default configuration and m=1: no recovery.
>
> 4. Kraken bluestore with m=2: recovery happens when one OSD is down and
> also when 2 OSDs fail.
>
>
>
> So, the issue seems to be in the ceph-kraken release. Your views…
>
>
>
> Thanks,
>
> Muthu
>
>
>
> On 31 January 2017 at 14:18, Muthusamy Muthiah <
> muthiah.muthus...@gmail.com> wrote:
>
>> Hi Greg,
>>
>> Now we can see that the same problem exists for Kraken filestore as well.
>> Attached are the requested osdmap and crushmap.
>>
>> OSD.1 was stopped using the following procedure, and the OSD map for a
>> PG is displayed at each step.
>>
>> ceph osd dump | grep cdvr_ec
>> 2017-01-31 08:39:44.827079 7f323d66c700 -1 WARNING: the following
>> dangerous and experimental features are enabled: bluestore,rocksdb
>> 2017-01-31 08:39:44.848901 7f323d66c700 -1 WARNING: the following
>> dangerous and experimental features are enabled: bluestore,rocksdb
>> pool 2 'cdvr_ec' erasure size 4 min_size 4 crush_ruleset 1 object_hash
>> rjenkins pg_num 1024 pgp_num 1024 last_change 234 flags hashpspool
>> stripe_width 4128
>>
>> [root@ca-cn2 ~]# ceph osd getmap -o /tmp/osdmap
>>
>>
>> [root@ca-cn2 ~]# osdmaptool --pool 2 --test-map-object object1
>> /tmp/osdmap
>> osdmaptool: osdmap file '/tmp/osdmap'
>>  object 'object1' -> 2.2bc -> [20,47,1,36]
>>
>> [root@ca-cn2 ~]# ceph osd map cdvr_ec object1
>> osdmap e402 pool 'cdvr_ec' (2) object 'object1' -> pg 2.bac5debc (2.2bc)
>> -> up ([20,47,1,36], p20) acting ([20,47,1,36], p20)
>>
>> [root@ca-cn2 ~]# systemctl stop ceph-osd@1.service
>>
>> [root@ca-cn2 ~]# ceph osd getmap -o /tmp/osdmap1
>>
>>
>> [root@ca-cn2 ~]# osdmaptool --pool 2 --test-map-object object1
>> /tmp/osdmap1
>> osdmaptool: osdmap file '/tmp/osdmap1'
>>  object 'object1' -> 2.2bc -> [20,47,2147483647,36]
>>
>>
>> [root@ca-cn2 ~]# ceph osd map cdvr_ec object1
>> osdmap e406 pool 'cdvr_ec' (2) object 'object1' -> pg 2.bac5debc (2.2bc)
>> -> up ([20,47,39,36], p20) acting ([20,47,NONE,36], p20)
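>>
>> (2147483647 in the osdmaptool output is CRUSH's "item none" value, i.e.
>> no OSD is mapped to that slot; "ceph osd map" shows the same slot as
>> NONE in the acting set.)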
>>
>>
>> [root@ca-cn2 ~]# ceph osd tree
>> 2017-01-31 08:42:19.606876 7f4ed856a700 -1 WARNING: the following
>> dangerous and experimental features are enabled: bluestore,rocksdb
>> 2017-01-31 08:42:19.628358 7f4ed856a700 -1 WARNING: the following
>> dangerous and experimental features are enabled: bluestore,rocksdb
>> ID WEIGHT    TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 327.47314 root default
>> -2  65.49463     host ca-cn4
>>  3   5.45789         osd.3        up  1.00000          1.00000
>>  5   5.45789         osd.5        up  1.00000          1.00000
>> 10   5.45789         osd.10       up  1.00000          1.00000
>> 16   5.45789         osd.16       up  1.00000          1.00000
>> 21   5.45789         osd.21       up  1.00000          1.00000
>> 27   5.45789         osd.27       up  1.00000          1.00000
>> 30   5.45789         osd.30       up  1.00000          1.00000
>> 35   5.45789         osd.35       up  1.00000          1.00000
>> 42   5.45789         osd.42       up  1.00000          1.00000
>> 47   5.45789         osd.47       up  1.00000          1.00000
>> 51   5.45789         osd.51       up  1.00000          1.00000
>> 53   5.45789         osd.53       up  1.00000          1.00000
>> -3  65.49463     host ca-cn3
>>  2   5.45789         osd.2        up  1.00000          1.00000
>>  6   5.45789         osd.6        up  1.00000          1.00000
>> 11   5.45789         osd.11       up  1.00000          1.00000
>> 15   5.45789         osd.15       up  1.00000          1.00000
>> 20   5.45789         osd.20       up  1.00000          1.00000
>> 25   5.45789         osd.25       up  1.00000          1.00000
>> 29   5.45789         osd.29       up  1.00000          1.00000
>> 33   5.45789         osd.33       up  1.00000          1.00000
>> 38   5.45789         osd.38       up  1.00000          1.00000
>> 40   5.45789         osd.40       up  1.00000          1.00000
>> 45   5.45789         osd.45       up  1.00000          1.00000
>> 49   5.45789         osd.49       up  1.00000          1.00000
>> -4  65.49463     host ca-cn5
>>  0   5.45789         osd.0        up  1.00000          1.00000
>>  7   5.45789         osd.7        up  1.00000          1.00000
>> 12   5.45789         osd.12       up  1.00000          1.00000
>> 17   5.45789         osd.17       up  1.00000          1.00000
>> 23   5.45789         osd.23       up  1.00000          1.00000
>> 26   5.45789         osd.26       up  1.00000          1.00000
>> 32   5.45789         osd.32       up  1.00000          1.00000
>> 34   5.45789         osd.34       up  1.00000          1.00000
>> 41   5.45789         osd.41       up  1.00000          1.00000
>> 46   5.45789         osd.46       up  1.00000          1.00000
>> 52   5.45789         osd.52       up  1.00000          1.00000
>> 56   5.45789         osd.56       up  1.00000          1.00000
>> -5  65.49463     host ca-cn1
>>  4   5.45789         osd.4        up  1.00000          1.00000
>>  9   5.45789         osd.9        up  1.00000          1.00000
>> 14   5.45789         osd.14       up  1.00000          1.00000
>> 19   5.45789         osd.19       up  1.00000          1.00000
>> 24   5.45789         osd.24       up  1.00000          1.00000
>> 36   5.45789         osd.36       up  1.00000          1.00000
>> 43   5.45789         osd.43       up  1.00000          1.00000
>> 50   5.45789         osd.50       up  1.00000          1.00000
>> 55   5.45789         osd.55       up  1.00000          1.00000
>> 57   5.45789         osd.57       up  1.00000          1.00000
>> 58   5.45789         osd.58       up  1.00000          1.00000
>> 59   5.45789         osd.59       up  1.00000          1.00000
>> -6  65.49463     host ca-cn2
>>  1   5.45789         osd.1      down        0          1.00000
>>  8   5.45789         osd.8        up  1.00000          1.00000
>> 13   5.45789         osd.13       up  1.00000          1.00000
>> 18   5.45789         osd.18       up  1.00000          1.00000
>> 22   5.45789         osd.22       up  1.00000          1.00000
>> 28   5.45789         osd.28       up  1.00000          1.00000
>> 31   5.45789         osd.31       up  1.00000          1.00000
>> 37   5.45789         osd.37       up  1.00000          1.00000
>> 39   5.45789         osd.39       up  1.00000          1.00000
>> 44   5.45789         osd.44       up  1.00000          1.00000
>> 48   5.45789         osd.48       up  1.00000          1.00000
>> 54   5.45789         osd.54       up  1.00000          1.00000
>>
>> health HEALTH_ERR
>>             69 pgs are stuck inactive for more than 300 seconds
>>             69 pgs incomplete
>>             69 pgs stuck inactive
>>             69 pgs stuck unclean
>>             512 requests are blocked > 32 sec
>>      monmap e2: 5 mons at {ca-cn1=10.50.5.117:6789/0,ca-cn2=10.50.5.118:6789/0,ca-cn3=10.50.5.119:6789/0,ca-cn4=10.50.5.120:6789/0,ca-cn5=10.50.5.121:6789/0}
>>             election epoch 8, quorum 0,1,2,3,4
>> ca-cn1,ca-cn2,ca-cn3,ca-cn4,ca-cn5
>>         mgr active: ca-cn4 standbys: ca-cn2, ca-cn5, ca-cn3, ca-cn1
>>      osdmap e406: 60 osds: 59 up, 59 in; 69 remapped pgs
>>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>>       pgmap v23018: 1024 pgs, 1 pools, 3892 GB data, 7910 kobjects
>>             6074 GB used, 316 TB / 322 TB avail
>>                  955 active+clean
>>                   69 remapped+incomplete
>>
>> Thanks,
>> Muthu
>>
>>
>> On 31 January 2017 at 02:54, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>>> You might also check out "ceph osd tree" and crush dump and make sure
>>> they look the way you expect.
>>>
>>> On Mon, Jan 30, 2017 at 1:23 PM, Gregory Farnum <gfar...@redhat.com>
>>> wrote:
>>> > On Sun, Jan 29, 2017 at 6:40 AM, Muthusamy Muthiah
>>> > <muthiah.muthus...@gmail.com> wrote:
>>> >> Hi All,
>>> >>
>>> >> Also tried EC profile 3+1 on a 5-node cluster with bluestore enabled.
>>> >> When an OSD is down, the cluster goes to ERROR state even though the
>>> >> cluster is n+1. No recovery happens.
>>> >>
>>> >> health HEALTH_ERR
>>> >>             75 pgs are stuck inactive for more than 300 seconds
>>> >>             75 pgs incomplete
>>> >>             75 pgs stuck inactive
>>> >>             75 pgs stuck unclean
>>> >>      monmap e2: 5 mons at
>>> >> {ca-cn1=10.50.5.117:6789/0,ca-cn2=10.50.5.118:6789/0,ca-cn3=10.50.5.119:6789/0,ca-cn4=10.50.5.120:6789/0,ca-cn5=10.50.5.121:6789/0}
>>> >>             election epoch 10, quorum 0,1,2,3,4
>>> >> ca-cn1,ca-cn2,ca-cn3,ca-cn4,ca-cn5
>>> >>         mgr active: ca-cn1 standbys: ca-cn4, ca-cn3, ca-cn5, ca-cn2
>>> >>      osdmap e264: 60 osds: 59 up, 59 in; 75 remapped pgs
>>> >>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>>> >>       pgmap v119402: 1024 pgs, 1 pools, 28519 GB data, 21548 kobjects
>>> >>             39976 GB used, 282 TB / 322 TB avail
>>> >>                  941 active+clean
>>> >>                   75 remapped+incomplete
>>> >>                    8 active+clean+scrubbing
>>> >>
>>> >> This seems to be an issue with bluestore: recovery is not happening
>>> >> properly with EC.
>>> >
>>> > It's possible but it seems a lot more likely this is some kind of
>>> > config issue. Can you share your osd map ("ceph osd getmap")?
>>> > -Greg
>>> >
>>> >>
>>> >> Thanks,
>>> >> Muthu
>>> >>
>>> >> On 24 January 2017 at 12:57, Muthusamy Muthiah <
>>> muthiah.muthus...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi Greg,
>>> >>>
>>> >>> We use EC 4+1 on a 5-node cluster in production deployments with
>>> >>> filestore, and it does recovery and peering when one OSD goes down.
>>> >>> After a few minutes, another OSD on the node hosting the failed OSD
>>> >>> takes over its PGs temporarily and all PGs go to the active+clean
>>> >>> state. The cluster also does not go down during this recovery
>>> >>> process.
>>> >>>
>>> >>> Only on bluestore do we see the cluster going into an error state
>>> >>> when one OSD is down. We are still validating this and will let you
>>> >>> know of any additional findings.
>>> >>>
>>> >>> Thanks,
>>> >>> Muthu
>>> >>>
>>> >>> On 21 January 2017 at 02:06, Shinobu Kinjo <ski...@redhat.com>
>>> wrote:
>>> >>>>
>>> >>>> `ceph pg dump` should show you something like:
>>> >>>>
>>> >>>>  * active+undersized+degraded ... [NONE,3,2,4,1]    3 [NONE,3,2,4,1]
>>> >>>>
>>> >>>> Sam,
>>> >>>>
>>> >>>> Am I wrong? Or is it up to something else?
>>> >>>>
>>> >>>>
>>> >>>> On Sat, Jan 21, 2017 at 4:22 AM, Gregory Farnum <gfar...@redhat.com>
>>> >>>> wrote:
>>> >>>> > I'm pretty sure the default configs won't let an EC PG go active
>>> >>>> > with only "k" OSDs in its PG; it needs at least k+1 (or possibly
>>> >>>> > more? Not certain). Running an "n+1" EC config is just not a good
>>> >>>> > idea.
>>> >>>> > For testing you could probably adjust this with the equivalent of
>>> >>>> > min_size for EC pools, but I don't know the parameters off the top
>>> >>>> > of my head.
>>> >>>> > -Greg
>>> >>>> >
>>> >>>> > On Fri, Jan 20, 2017 at 2:15 AM, Muthusamy Muthiah
>>> >>>> > <muthiah.muthus...@gmail.com> wrote:
>>> >>>> >> Hi ,
>>> >>>> >>
>>> >>>> >> We are validating Kraken 11.2.0 with bluestore on a 5-node
>>> >>>> >> cluster with EC 4+1.
>>> >>>> >>
>>> >>>> >> When an OSD is down, peering does not happen and the ceph health
>>> >>>> >> status moves to ERR state after a few minutes. This was working
>>> >>>> >> in previous development releases. Is any additional configuration
>>> >>>> >> required in v11.2.0?
>>> >>>> >>
>>> >>>> >> Following is our ceph configuration:
>>> >>>> >>
>>> >>>> >> mon_osd_down_out_interval = 30
>>> >>>> >> mon_osd_report_timeout = 30
>>> >>>> >> mon_osd_down_out_subtree_limit = host
>>> >>>> >> mon_osd_reporter_subtree_level = host
>>> >>>> >>
>>> >>>> >> and the recovery parameters set to default.
>>> >>>> >>
>>> >>>> >> [root@ca-cn1 ceph]# ceph osd crush show-tunables
>>> >>>> >>
>>> >>>> >> {
>>> >>>> >>     "choose_local_tries": 0,
>>> >>>> >>     "choose_local_fallback_tries": 0,
>>> >>>> >>     "choose_total_tries": 50,
>>> >>>> >>     "chooseleaf_descend_once": 1,
>>> >>>> >>     "chooseleaf_vary_r": 1,
>>> >>>> >>     "chooseleaf_stable": 1,
>>> >>>> >>     "straw_calc_version": 1,
>>> >>>> >>     "allowed_bucket_algs": 54,
>>> >>>> >>     "profile": "jewel",
>>> >>>> >>     "optimal_tunables": 1,
>>> >>>> >>     "legacy_tunables": 0,
>>> >>>> >>     "minimum_required_version": "jewel",
>>> >>>> >>     "require_feature_tunables": 1,
>>> >>>> >>     "require_feature_tunables2": 1,
>>> >>>> >>     "has_v2_rules": 1,
>>> >>>> >>     "require_feature_tunables3": 1,
>>> >>>> >>     "has_v3_rules": 0,
>>> >>>> >>     "has_v4_buckets": 0,
>>> >>>> >>     "require_feature_tunables5": 1,
>>> >>>> >>     "has_v5_rules": 0
>>> >>>> >> }
>>> >>>> >>
>>> >>>> >> ceph status:
>>> >>>> >>
>>> >>>> >>      health HEALTH_ERR
>>> >>>> >>             173 pgs are stuck inactive for more than 300 seconds
>>> >>>> >>             173 pgs incomplete
>>> >>>> >>             173 pgs stuck inactive
>>> >>>> >>             173 pgs stuck unclean
>>> >>>> >>      monmap e2: 5 mons at
>>> >>>> >>
>>> >>>> >> {ca-cn1=10.50.5.117:6789/0,ca-cn2=10.50.5.118:6789/0,ca-cn3=10.50.5.119:6789/0,ca-cn4=10.50.5.120:6789/0,ca-cn5=10.50.5.121:6789/0}
>>> >>>> >>             election epoch 106, quorum 0,1,2,3,4
>>> >>>> >> ca-cn1,ca-cn2,ca-cn3,ca-cn4,ca-cn5
>>> >>>> >>         mgr active: ca-cn1 standbys: ca-cn2, ca-cn4, ca-cn5,
>>> ca-cn3
>>> >>>> >>      osdmap e1128: 60 osds: 59 up, 59 in; 173 remapped pgs
>>> >>>> >>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>>> >>>> >>       pgmap v782747: 2048 pgs, 1 pools, 63133 GB data, 46293
>>> kobjects
>>> >>>> >>             85199 GB used, 238 TB / 322 TB avail
>>> >>>> >>                 1868 active+clean
>>> >>>> >>                  173 remapped+incomplete
>>> >>>> >>                    7 active+clean+scrubbing
>>> >>>> >>
>>> >>>> >> MON log:
>>> >>>> >>
>>> >>>> >> 2017-01-20 09:25:54.715684 7f55bcafb700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> osd.54 out (down for 31.703786)
>>> >>>> >> 2017-01-20 09:25:54.725688 7f55bf4d5700  0 mon.ca-cn1@0
>>> (leader).osd
>>> >>>> >> e1120
>>> >>>> >> crush map has features 288250512065953792, adjusting msgr
>>> requires
>>> >>>> >> 2017-01-20 09:25:54.729019 7f55bf4d5700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> osdmap e1120: 60 osds: 59 up, 59 in
>>> >>>> >> 2017-01-20 09:25:54.735987 7f55bf4d5700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> pgmap v781993: 2048 pgs: 1869 active+clean, 173 incomplete, 6
>>> >>>> >> active+clean+scrubbing; 63159 GB data, 85201 GB used, 238 TB /
>>> 322 TB
>>> >>>> >> avail;
>>> >>>> >> 21825 B/s rd, 163 MB/s wr, 2046 op/s
>>> >>>> >> 2017-01-20 09:25:55.737749 7f55bf4d5700  0 mon.ca-cn1@0
>>> (leader).osd
>>> >>>> >> e1121
>>> >>>> >> crush map has features 288250512065953792, adjusting msgr
>>> requires
>>> >>>> >> 2017-01-20 09:25:55.744338 7f55bf4d5700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> osdmap e1121: 60 osds: 59 up, 59 in
>>> >>>> >> 2017-01-20 09:25:55.749616 7f55bf4d5700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> pgmap v781994: 2048 pgs: 29 remapped+incomplete, 1869
>>> active+clean,
>>> >>>> >> 144
>>> >>>> >> incomplete, 6 active+clean+scrubbing; 63159 GB data, 85201 GB
>>> used,
>>> >>>> >> 238 TB /
>>> >>>> >> 322 TB avail; 44503 B/s rd, 45681 kB/s wr, 518 op/s
>>> >>>> >> 2017-01-20 09:25:56.768721 7f55bf4d5700  0 log_channel(cluster)
>>> log
>>> >>>> >> [INF] :
>>> >>>> >> pgmap v781995: 2048 pgs: 47 remapped+incomplete, 1869
>>> active+clean,
>>> >>>> >> 126
>>> >>>> >> incomplete, 6 active+clean+scrubbing; 63159 GB data, 85201 GB
>>> used,
>>> >>>> >> 238 TB /
>>> >>>> >> 322 TB avail; 20275 B/s rd, 72742 kB/s wr, 665 op/s
>>> >>>> >>
>>> >>>> >> Thanks,
>>> >>>> >> Muthu
>>> >>>> >>
>>> >>>> >>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >>
>>>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
