Hi,

I have a bit of additional information that might help debug this. From the OSD logs:
2016-04-29 14:32:46.886538 7fa4cd004800 0 osd.2 14422 done with init, starting boot process
2016-04-29 14:32:46.886555 7fa4cd004800 1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_subscribe({osd_pg_creates=0+}) v2 -- ?+0 0x55d8389ee200 con 0x55d8549c4e80
2016-04-29 14:32:46.886568 7fa4cd004800 1 osd.2 14422 We are healthy, booting
2016-04-29 14:32:46.886577 7fa4cd004800 1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 0x55d837dc61e0 con 0x55d8549c4e80
2016-04-29 14:32:46.887063 7fa4b66bc700 1 -- 10.2.0.116:6808/32079 <== mon.1 10.2.0.117:6789/0 8 ==== mon_get_version_reply(handle=1 version=14422) v2 ==== 24+0+0 (1829608329 0 0) 0x55d837dc65a0 con 0x55d8549c4e80
2016-04-29 14:32:46.887087 7fa4adeab700 1 osd.2 14422 osdmap indicates one or more pre-v0.94.4 hammer OSDs is running
2016-04-29 14:32:46.887100 7fa4adeab700 1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_subscribe({osdmap=14423}) v2 -- ?+0 0x55d854d65c00 con 0x55d8549c4e80

So, it's saying there is an older OSD running, but:

# ceph tell osd.* version
Error ENXIO: problem getting command descriptions from osd.0
osd.0: problem getting command descriptions from osd.0
Error ENXIO: problem getting command descriptions from osd.1
osd.1: problem getting command descriptions from osd.1
Error ENXIO: problem getting command descriptions from osd.2
osd.2: problem getting command descriptions from osd.2
Error ENXIO: problem getting command descriptions from osd.3
osd.3: problem getting command descriptions from osd.3
osd.4: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.5: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.6: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.7: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.8: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.9: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.10: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }
osd.11: { "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)" }

root@DAL1S4UTIL8:~# ceph tell mon.* version
mon.DAL1S4UTIL6: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
mon.DAL1S4UTIL7: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
mon.DAL1S4UTIL8: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)

osd.1, osd.2, and osd.3 are the ones that have been upgraded and restarted. So it looks to me like all OSDs are newer than 0.94.4... What could be causing this?

Thanks,
Randy

On Wed, Apr 27, 2016 at 4:57 PM, Randy Orr <randy....@nimbix.net> wrote:
> Hi,
>
> I have a small dev/test ceph cluster that sat neglected for quite some
> time. It was on the firefly release until very recently. I successfully
> upgraded from firefly to hammer without issue as an intermediate step to
> get to the latest jewel release.
>
> This cluster has 3 Ubuntu 14.04 hosts with kernel 3.13.0-40-generic. MONs
> and OSDs are colocated on the same hosts, with 11 total OSDs across the
> 3 hosts.
>
> The 3 MONs have been updated to jewel and are running successfully. I set
> noout on the cluster, shut down the first 3 OSD processes, and ran
> chown -R ceph:ceph on /var/lib/ceph/osd. The OSD processes start and run,
> but never show as UP.
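>
> For reference, the steps on those OSDs looked roughly like this (upstart
> on these 14.04 hosts; osd.1 is shown only as an example id, and the exact
> invocations may have varied slightly):
>
>     ceph osd set noout
>     stop ceph-osd id=1                     # stop the daemon before changing ownership
>     chown -R ceph:ceph /var/lib/ceph/osd   # jewel runs ceph-osd as the "ceph" user
>     start ceph-osd id=1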
>
> After setting debug osd = 20 I see the following in the logs:
>
> 2016-04-27 15:55:19.042230 7fd3854c7700 10 osd.1 13324 tick
> 2016-04-27 15:55:19.042244 7fd3854c7700 10 osd.1 13324 do_waiters -- start
> 2016-04-27 15:55:19.042247 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
> 2016-04-27 15:55:19.061083 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
> 2016-04-27 15:55:19.061096 7fd384cc6700 20 osd.1 13324 scrub_random_backoff lost coin flip, randomly backing off
> 2016-04-27 15:55:20.042351 7fd3854c7700 10 osd.1 13324 tick
> 2016-04-27 15:55:20.042364 7fd3854c7700 10 osd.1 13324 do_waiters -- start
> 2016-04-27 15:55:20.042368 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
> 2016-04-27 15:55:20.061192 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
> 2016-04-27 15:55:20.061206 7fd384cc6700 20 osd.1 13324 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> 2016-04-27 15:55:20.061212 7fd384cc6700 20 osd.1 13324 scrub_time_permit should run between 0 - 24 now 15 = yes
> 2016-04-27 15:55:20.061247 7fd384cc6700 20 osd.1 13324 scrub_load_below_threshold loadavg 0.04 < max 0.5 = yes
> 2016-04-27 15:55:20.061259 7fd384cc6700 20 osd.1 13324 sched_scrub load_is_low=1
> 2016-04-27 15:55:20.061261 7fd384cc6700 20 osd.1 13324 sched_scrub done
> 2016-04-27 15:55:20.861872 7fd368ded700 20 osd.1 13324 update_osd_stat osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
> 2016-04-27 15:55:20.861886 7fd368ded700 5 osd.1 13324 heartbeat: osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
>
> The fact that no peers show up in the heartbeat seems problematic, but I
> can't see why the OSDs are failing to start correctly.
>
> A ceph status gives this:
>
>     cluster 9e3f9cab-6f1b-4c7c-ab13-e01cb774f752
>      health HEALTH_WARN
>             725 pgs degraded
>             3584 pgs stuck unclean
>             725 pgs undersized
>             recovery 23363/180420 objects degraded (12.949%)
>             recovery 49218/180420 objects misplaced (27.280%)
>             too many PGs per OSD (651 > max 300)
>             3/11 in osds are down
>             noout flag(s) set
>      monmap e3: 3 mons at {DAL1S4UTIL6=10.2.0.116:6789/0,DAL1S4UTIL7=10.2.0.117:6789/0,DAL1S4UTIL8=10.2.0.118:6789/0}
>             election epoch 32, quorum 0,1,2 DAL1S4UTIL6,DAL1S4UTIL7,DAL1S4UTIL8
>      osdmap e13324: 11 osds: 8 up, 11 in; 2859 remapped pgs
>             flags noout
>       pgmap v6332775: 3584 pgs, 7 pools, 180 GB data, 60140 objects
>             703 GB used, 9483 GB / 10186 GB avail
>             23363/180420 objects degraded (12.949%)
>             49218/180420 objects misplaced (27.280%)
>                 2238 active+remapped
>                  725 active+undersized+degraded
>                  621 active
>
> Disk utilization is low. Nothing interesting in syslog or dmesg. Any ideas
> or suggestions on where to start debugging this?
>
> Thanks,
> Randy
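
P.S. In case it's relevant to that "pre-v0.94.4 hammer OSDs" log message, the
next thing I plan to check is what flags the current osdmap carries and what
version the monitors have recorded for each OSD, along these lines (osd.2 is
just an example id, and I'm not certain this is where the problem actually
lies):

    # flags on the current osdmap, as the mons see it
    ceph osd dump | grep ^flags

    # ceph_version the OSD reported to the mons when it last booted
    ceph osd metadata 2 | grep ceph_version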