Can you attach the OSDMap (ceph osd getmap -o <mapfile>)?
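For reference, something like this should produce it, and osdmaptool can give
the map a first look locally (paths here are illustrative):

    ceph osd getmap -o /tmp/osdmap.bin
    osdmaptool /tmp/osdmap.bin --print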
-Sam

On Tue, Apr 26, 2016 at 2:07 AM, Henrik Svensson <henrik.svens...@sectra.com
> wrote:

> Hi!
>
> We have a three-node Ceph cluster with 10 OSDs each.
>
> We bought three new machines with an additional 30 disks; they will reside
> in another location.
> Before adding these machines, we modified the default CRUSH map.
>
> After modifying the (default) CRUSH map with these commands, the cluster
> went down:
>
> ————————————————
> ceph osd crush add-bucket dc1 datacenter
> ceph osd crush add-bucket dc2 datacenter
> ceph osd crush add-bucket availo datacenter
> ceph osd crush move dc1 root=default
> ceph osd crush move lkpsx0120 root=default datacenter=dc1
> ceph osd crush move lkpsx0130 root=default datacenter=dc1
> ceph osd crush move lkpsx0140 root=default datacenter=dc1
> ceph osd crush move dc2 root=default
> ceph osd crush move availo root=default
> ceph osd crush add-bucket sectra root
> ceph osd crush move dc1 root=sectra
> ceph osd crush move dc2 root=sectra
> ceph osd crush move dc3 root=sectra
> ceph osd crush move availo root=sectra
> ceph osd crush remove default
> ————————————————
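>
> (In hindsight, the pre-change map could have been saved first and restored
> wholesale instead of being rebuilt bucket by bucket; a sketch, with an
> illustrative backup path:)
>
> ————————————————
> # save the CRUSH map before editing
> ceph osd getcrushmap -o /root/crush.backup
> # after a bad change, inject the saved map back verbatim
> ceph osd setcrushmap -i /root/crush.backup
> ————————————————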
>
> I tried to revert the CRUSH map, but had no luck:
>
> ————————————————
> ceph osd crush add-bucket default root
> ceph osd crush move lkpsx0120 root=default
> ceph osd crush move lkpsx0130 root=default
> ceph osd crush move lkpsx0140 root=default
> ceph osd crush remove sectra
> ————————————————
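>
> (To check what the map actually ended up as after a revert like this, it
> can be pulled and decompiled again; a sketch, paths illustrative:)
>
> ————————————————
> ceph osd getcrushmap -o /tmp/crush.cur
> crushtool -d /tmp/crush.cur -o /tmp/crush.cur.txt
> ————————————————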
>
> After trying to restart the cluster (and even rebooting the machines), no
> OSDs started up again.
> But ceph osd tree gave this output, claiming certain OSDs are up (even
> though the processes are not running):
>
> ————————————————
> # id  weight  type name         up/down  reweight
> -1    163.8   root default
> -2    54.6      host lkpsx0120
> 0     5.46        osd.0         down     0
> 1     5.46        osd.1         down     0
> 2     5.46        osd.2         down     0
> 3     5.46        osd.3         down     0
> 4     5.46        osd.4         down     0
> 5     5.46        osd.5         down     0
> 6     5.46        osd.6         down     0
> 7     5.46        osd.7         down     0
> 8     5.46        osd.8         down     0
> 9     5.46        osd.9         down     0
> -3    54.6      host lkpsx0130
> 10    5.46        osd.10        down     0
> 11    5.46        osd.11        down     0
> 12    5.46        osd.12        down     0
> 13    5.46        osd.13        down     0
> 14    5.46        osd.14        down     0
> 15    5.46        osd.15        down     0
> 16    5.46        osd.16        down     0
> 17    5.46        osd.17        down     0
> 18    5.46        osd.18        up       1
> 19    5.46        osd.19        up       1
> -4    54.6      host lkpsx0140
> 20    5.46        osd.20        up       1
> 21    5.46        osd.21        down     0
> 22    5.46        osd.22        down     0
> 23    5.46        osd.23        down     0
> 24    5.46        osd.24        down     0
> 25    5.46        osd.25        up       1
> 26    5.46        osd.26        up       1
> 27    5.46        osd.27        up       1
> 28    5.46        osd.28        up       1
> 29    5.46        osd.29        up       1
> ————————————————
>
> The monitor starts and restarts OK (only one monitor exists).
> But when starting one OSD, nothing shows up in ceph -w.
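>
> (One way to see why an OSD dies at startup is to run it in the foreground
> with debug logging to stderr; a sketch, using osd.1:)
>
> ————————————————
> ceph-osd -i 1 -d --debug-osd 20 --debug-ms 1
> ————————————————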
>
> Here is the ceph mon_status:
>
> ————————————————
> { "name": "lkpsx0120",
>   "rank": 0,
>   "state": "leader",
>   "election_epoch": 1,
>   "quorum": [
>         0],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 4,
>       "fsid": "9244194a-5e10-47ae-9287-507944612f95",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "lkpsx0120",
>               "addr": "10.15.2.120:6789\/0"}]}}
> ————————————————
>
> Here is the ceph.conf file:
>
> ————————————————
> [global]
> fsid = 9244194a-5e10-47ae-9287-507944612f95
> mon_initial_members = lkpsx0120
> mon_host = 10.15.2.120
> #debug osd = 20
> #debug ms = 1
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_crush_chooseleaf_type = 1
> osd_pool_default_size = 2
> public_network = 10.15.2.0/24
> cluster_network = 10.15.4.0/24
> rbd_cache = true
> rbd_cache_size = 67108864
> rbd_cache_max_dirty = 50331648
> rbd_cache_target_dirty = 33554432
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = true
> ————————————————
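>
> (The commented-out debug lines above can be re-enabled, or scoped to the
> OSDs only, before restarting the daemons; a sketch:)
>
> ————————————————
> [osd]
> debug osd = 20
> debug ms = 1
> debug filestore = 20
> ————————————————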
>
> Here is the decompiled CRUSH map:
>
> ————————————————
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
> device 28 osd.28
> device 29 osd.29
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host lkpsx0120 {
> id -2 # do not change unnecessarily
> # weight 54.600
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 5.460
> item osd.1 weight 5.460
> item osd.2 weight 5.460
> item osd.3 weight 5.460
> item osd.4 weight 5.460
> item osd.5 weight 5.460
> item osd.6 weight 5.460
> item osd.7 weight 5.460
> item osd.8 weight 5.460
> item osd.9 weight 5.460
> }
> host lkpsx0130 {
> id -3 # do not change unnecessarily
> # weight 54.600
> alg straw
> hash 0 # rjenkins1
> item osd.10 weight 5.460
> item osd.11 weight 5.460
> item osd.12 weight 5.460
> item osd.13 weight 5.460
> item osd.14 weight 5.460
> item osd.15 weight 5.460
> item osd.16 weight 5.460
> item osd.17 weight 5.460
> item osd.18 weight 5.460
> item osd.19 weight 5.460
> }
> host lkpsx0140 {
> id -4 # do not change unnecessarily
> # weight 54.600
> alg straw
> hash 0 # rjenkins1
> item osd.20 weight 5.460
> item osd.21 weight 5.460
> item osd.22 weight 5.460
> item osd.23 weight 5.460
> item osd.24 weight 5.460
> item osd.25 weight 5.460
> item osd.26 weight 5.460
> item osd.27 weight 5.460
> item osd.28 weight 5.460
> item osd.29 weight 5.460
> }
> root default {
> id -1 # do not change unnecessarily
> # weight 163.800
> alg straw
> hash 0 # rjenkins1
> item lkpsx0120 weight 54.600
> item lkpsx0130 weight 54.600
> item lkpsx0140 weight 54.600
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
> ————————————————
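>
> (A decompiled map like this can be recompiled and dry-run offline with
> crushtool before it is injected, which shows where each rule would place
> replicas; a sketch, filenames illustrative:)
>
> ————————————————
> crushtool -c crushmap.txt -o crushmap.bin
> crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-statistics
> ————————————————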
>
> The operating system is Debian 8.0 and the Ceph version is 0.80.7, as
> stated in the crash log.
>
> We increased the log level and tried to start osd.1 as an example. All
> OSDs we tried to start hit the same problem and die.
>
> The log file from OSD 1 (ceph-osd.1.log) can be found here:
> https://www.dropbox.com/s/dqunlufh0qtked5/ceph-osd.1.log.zip?dl=0
>
> As of now, all systems are down, including the KVM cluster that depends
> on Ceph.
>
> Best regards,
>
> Henrik
> ------------------------------
> Henrik Svensson
> OpIT
> Sectra AB
> Teknikringen 20, 58330 Linköping, Sweden
> E-mail: henrik.svens...@sectra.com
> Phone: +46 (0)13 352 884
> Cellular: +46 (0)70 395141
> Web: www.sectra.com <http://www.sectra.com/medical/>
>
> ------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
