Ok, just to update everyone: after moving out all the PG directories on the
OSD that were no longer valid PGs, I was able to start it, and the cluster is
back to healthy.

To be safe, though, I'm going to trigger a deep scrub of osd.3 prior to
deleting any of those PGs.
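
For the record, that's just the following (3 being the OSD id here):

    ceph osd deep-scrub 3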

If I understand the gist of how issue 11429 is going to be addressed in
0.94.2, the OSD will disregard such "dead" PGs and complain in the logs. As
far as cleaning those up goes, would a procedure similar to mine be
appropriate (either before or after 0.94.2)?
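
For anyone who hits the same thing, the cleanup I have in mind is roughly the
following (an untested sketch -- the OSD must be stopped first, and the PG id,
paths, and export file name are just examples for this osd.3):

    # take an export of the stray PG first, so nothing is lost
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
        --journal-path /var/lib/ceph/osd/ceph-3/journal \
        --pgid 5.0 --op export --file /root/pg5.0.export

    # then remove it from the OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
        --journal-path /var/lib/ceph/osd/ceph-3/journal \
        --pgid 5.0 --op remove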

Thank you, Sam, for your help; I greatly appreciate it!
-Berant

On Tue, May 19, 2015 at 7:13 PM, Berant Lemmenes <ber...@lemmenes.com>
wrote:

> Sam,
>
> It is for a valid pool; however, the up and acting sets for 2.14 both show
> OSDs 8 and 7. I'll take a look at 7 and 8 and see if they are good.
>
> If so, it seems like its presence on osd.3 could be an artifact of previous
> topologies, and I could mv it off osd.3.
>
> Thanks very much for the assistance!
>
> Berant
>
>
> On Tuesday, May 19, 2015, Samuel Just <sj...@redhat.com> wrote:
>
>> If 2.14 is part of a non-existent pool, you should be able to rename it
>> out of current/ in the osd directory to prevent the osd from seeing it on
>> startup.
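>>
>> For example, something along these lines (a sketch only -- stop the OSD
>> first; the destination directory is arbitrary, it just keeps the data
>> around in case it turns out to be needed):
>>
>>     mkdir -p /var/lib/ceph/osd/ceph-3/stray-pgs
>>     mv /var/lib/ceph/osd/ceph-3/current/2.14_head \
>>         /var/lib/ceph/osd/ceph-3/stray-pgs/
>>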
>> -Sam
>>
>> ----- Original Message -----
>> From: "Berant Lemmenes" <ber...@lemmenes.com>
>> To: "Samuel Just" <sj...@redhat.com>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Tuesday, May 19, 2015 12:58:30 PM
>> Subject: Re: [ceph-users] OSD unable to start (giant -> hammer)
>>
>> Hello,
>>
>> So here are the steps I performed and where I sit now.
>>
>> Step 1) I used 'ceph-objectstore-tool list' to create a list of all PGs
>> not associated with the three pools (rbd, data, metadata) that are
>> actually in use on this cluster.
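>>
>> For the record, the listing step was along these lines (a rough sketch; the
>> grep assumes the default pool ids 0/1/2 for data/metadata/rbd, and the
>> output file name is arbitrary):
>>
>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
>>         --journal-path /var/lib/ceph/osd/ceph-3/journal \
>>         --op list-pgs | grep -Ev '^[012]\.' > /root/stray-pgs.txt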
>>
>> Step 2) I then did a 'ceph-objectstore-tool remove' of those PGs.
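>>
>> ...and the removal was approximately (same paths as above):
>>
>>     while read pg; do
>>       ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
>>           --journal-path /var/lib/ceph/osd/ceph-3/journal \
>>           --pgid "$pg" --op remove
>>     done < /root/stray-pgs.txt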
>>
>> Then, when starting the OSD, it would complain about PGs that were NOT in
>> the 'ceph-objectstore-tool list' output but WERE present on the filesystem
>> of the OSD in question.
>>
>> Step 3) Iterating over all of the PGs that were on disk and running
>> 'ceph-objectstore-tool info' on each, I made a list of all PGs that
>> returned ENOENT.
>>
>> Step 4) I used 'ceph-objectstore-tool remove' to remove all of those as
>> well.
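>>
>> Steps 3 and 4 together looked roughly like this (again a sketch; current/
>> also holds non-PG entries such as meta and omap, so only *_head directories
>> are considered, and I'm assuming 'info' exits non-zero on ENOENT):
>>
>>     cd /var/lib/ceph/osd/ceph-3/current
>>     for d in *_head; do
>>       pg="${d%_head}"
>>       # record any PG whose metadata lookup fails
>>       ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
>>           --journal-path /var/lib/ceph/osd/ceph-3/journal \
>>           --pgid "$pg" --op info >/dev/null 2>&1 || echo "$pg"
>>     done > /root/enoent-pgs.txt
>>     # removal was then the same 'remove' loop as in step 2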
>>
>> Now when starting osd.3 I get an "unable to load metadata" error for a PG
>> that, according to 'ceph pg 2.14 query', is not present (and shouldn't be)
>> on osd.3. Shown below with OSD debugging at 20:
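>>
>> (For reference, "OSD debugging at 20" just means something like 'debug osd
>> = 20' in ceph.conf, or running the daemon in the foreground by hand, e.g.
>> 'ceph-osd -i 3 -f --debug-osd 20'; the exact invocation may vary.)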
>>
>> <snip>
>>
>>    -23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051
>> (39533'174050) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2811937 2015-05-18 07:18:42.859501
>>
>>    -22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052
>> (39533'174051) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2812374 2015-05-18 07:33:21.973157
>>
>>    -21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053
>> (39533'174052) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2812861 2015-05-18 07:48:23.098343
>>
>>    -20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054
>> (39533'174053) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2813371 2015-05-18 08:03:54.226512
>>
>>    -19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055
>> (39533'174054) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2813922 2015-05-18 08:18:20.351421
>>
>>    -18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056
>> (39533'174055) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2814396 2015-05-18 08:33:56.476035
>>
>>    -17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057
>> (39533'174056) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2814971 2015-05-18 08:48:22.605674
>>
>>    -16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058
>> (39533'174057) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2815407 2015-05-18 09:02:48.720181
>>
>>    -15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059
>> (39533'174058) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2815434 2015-05-18 09:03:43.727839
>>
>>    -14> 2015-05-19 15:15:12.712312 7fb079a20780 20 read_log 39533'174060
>> (39533'174059) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2815889 2015-05-18 09:17:49.846406
>>
>>    -13> 2015-05-19 15:15:12.712342 7fb079a20780 20 read_log 39533'174061
>> (39533'174060) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2816358 2015-05-18 09:32:50.969457
>>
>>    -12> 2015-05-19 15:15:12.712372 7fb079a20780 20 read_log 39533'174062
>> (39533'174061) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2816840 2015-05-18 09:47:52.091524
>>
>>    -11> 2015-05-19 15:15:12.712403 7fb079a20780 20 read_log 39533'174063
>> (39533'174062) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2816861 2015-05-18 09:48:22.096309
>>
>>    -10> 2015-05-19 15:15:12.712433 7fb079a20780 20 read_log 39533'174064
>> (39533'174063) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2
>> by
>> client.18119.0:2817714 2015-05-18 10:02:53.222749
>>
>>     -9> 2015-05-19 15:15:12.713130 7fb079a20780 10 read_log done
>>
>>     -8> 2015-05-19 15:15:12.713550 7fb079a20780 10 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive] handle_loaded
>>
>>     -7> 2015-05-19 15:15:12.713570 7fb079a20780  5 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] exit Initial
>> 0.097986 0 0.000000
>>
>>     -6> 2015-05-19 15:15:12.713587 7fb079a20780  5 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] enter Reset
>>
>>     -5> 2015-05-19 15:15:12.713601 7fb079a20780 20 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
>> set_last_peering_reset 39533
>>
>>     -4> 2015-05-19 15:15:12.713614 7fb079a20780 10 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Clearing
>> blocked outgoing recovery messages
>>
>>     -3> 2015-05-19 15:15:12.713629 7fb079a20780 10 osd.3 pg_epoch: 39533
>> pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
>> ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Not blocking
>> outgoing recovery messages
>>
>>     -2> 2015-05-19 15:15:12.713643 7fb079a20780 10 osd.3 39533 load_pgs
>> loaded pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529
>> n=101 ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
>> pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
>> log((37945'171063,39533'174064], crt=39533'174062)
>>
>>     -1> 2015-05-19 15:15:12.713658 7fb079a20780 10 osd.3 39533 pgid 2.14
>> coll 2.14_head
>>
>>      0> 2015-05-19 15:15:12.716475 7fb079a20780 -1 osd/PG.cc: In function
>> 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
>> ceph::bufferlist*)'
>> thread 7fb079a20780 time 2015-05-19 15:15:12.715425
>>
>> osd/PG.cc: 2860: FAILED assert(0 == "unable to open pg metadata")
>>
>>
>>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x7f) [0xb1784f]
>>
>>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
>> [0x793dd8]
>>
>>  3: (OSD::load_pgs()+0x147f) [0x683dff]
>>
>>  4: (OSD::init()+0x1448) [0x6930b8]
>>
>>  5: (main()+0x26b9) [0x62fd89]
>>
>>  6: (__libc_start_main()+0xed) [0x7fb07767876d]
>>
>>  7: ceph-osd() [0x635679]
>>
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>> --- logging levels ---
>>
>>    0/ 5 none
>>
>>    0/ 1 lockdep
>>
>>    0/ 1 context
>>
>>    1/ 1 crush
>>
>>    1/ 5 mds
>>
>>    1/ 5 mds_balancer
>>
>>    1/ 5 mds_locker
>>
>>    1/ 5 mds_log
>>
>>    1/ 5 mds_log_expire
>>
>>    1/ 5 mds_migrator
>>
>>    0/ 1 buffer
>>
>>    0/ 1 timer
>>
>>    0/ 1 filer
>>
>>    0/ 1 striper
>>
>>    0/ 1 objecter
>>
>>    0/ 5 rados
>>
>>    0/ 5 rbd
>>
>>    0/ 5 rbd_replay
>>
>>    0/ 5 journaler
>>
>>    0/ 5 objectcacher
>>
>>    0/ 5 client
>>
>>   20/20 osd
>>
>>    0/ 5 optracker
>>
>>    0/ 5 objclass
>>
>>    1/ 3 filestore
>>
>>    1/ 3 keyvaluestore
>>
>>    1/ 3 journal
>>
>>    0/ 5 ms
>>
>>    1/ 5 mon
>>
>>    0/10 monc
>>
>>    1/ 5 paxos
>>
>>    0/ 5 tp
>>
>>    1/ 5 auth
>>
>>    1/ 5 crypto
>>
>>    1/ 1 finisher
>>
>>    1/ 5 heartbeatmap
>>
>>    1/ 5 perfcounter
>>
>>    1/ 5 rgw
>>
>>    1/10 civetweb
>>
>>    1/ 5 javaclient
>>
>>    1/ 5 asok
>>
>>    1/ 1 throttle
>>
>>    0/ 0 refs
>>
>>    1/ 5 xio
>>
>>   -2/-2 (syslog threshold)
>>
>>   99/99 (stderr threshold)
>>
>>   max_recent     10000
>>
>>   max_new         1000
>>
>>   log_file
>>
>> --- end dump of recent events ---
>>
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>
>> *** Caught signal (Aborted) **
>>
>>  in thread 7fb079a20780
>>
>>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>>  1: ceph-osd() [0xa1fe55]
>>
>>  2: (()+0xfcb0) [0x7fb078a60cb0]
>>
>>  3: (gsignal()+0x35) [0x7fb07768d0d5]
>>
>>  4: (abort()+0x17b) [0x7fb07769083b]
>>
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
>>
>>  6: (()+0xb5846) [0x7fb077fdc846]
>>
>>  7: (()+0xb5873) [0x7fb077fdc873]
>>
>>  8: (()+0xb596e) [0x7fb077fdc96e]
>>
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>>  10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
>> [0x793dd8]
>>
>>  11: (OSD::load_pgs()+0x147f) [0x683dff]
>>
>>  12: (OSD::init()+0x1448) [0x6930b8]
>>
>>  13: (main()+0x26b9) [0x62fd89]
>>
>>  14: (__libc_start_main()+0xed) [0x7fb07767876d]
>>
>>  15: ceph-osd() [0x635679]
>>
>> 2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal (Aborted) **
>>
>>  in thread 7fb079a20780
>>
>>
>>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>>  1: ceph-osd() [0xa1fe55]
>>
>>  2: (()+0xfcb0) [0x7fb078a60cb0]
>>
>>  3: (gsignal()+0x35) [0x7fb07768d0d5]
>>
>>  4: (abort()+0x17b) [0x7fb07769083b]
>>
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
>>
>>  6: (()+0xb5846) [0x7fb077fdc846]
>>
>>  7: (()+0xb5873) [0x7fb077fdc873]
>>
>>  8: (()+0xb596e) [0x7fb077fdc96e]
>>
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>>  10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
>> [0x793dd8]
>>
>>  11: (OSD::load_pgs()+0x147f) [0x683dff]
>>
>>  12: (OSD::init()+0x1448) [0x6930b8]
>>
>>  13: (main()+0x26b9) [0x62fd89]
>>
>>  14: (__libc_start_main()+0xed) [0x7fb07767876d]
>>
>>  15: ceph-osd() [0x635679]
>>
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>> --- begin dump of recent events ---
>>
>>      0> 2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal
>> (Aborted) **
>>
>>  in thread 7fb079a20780
>>
>>
>>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>>  1: ceph-osd() [0xa1fe55]
>>
>>  2: (()+0xfcb0) [0x7fb078a60cb0]
>>
>>  3: (gsignal()+0x35) [0x7fb07768d0d5]
>>
>>  4: (abort()+0x17b) [0x7fb07769083b]
>>
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
>>
>>  6: (()+0xb5846) [0x7fb077fdc846]
>>
>>  7: (()+0xb5873) [0x7fb077fdc873]
>>
>>  8: (()+0xb596e) [0x7fb077fdc96e]
>>
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>>  10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
>> [0x793dd8]
>>
>>  11: (OSD::load_pgs()+0x147f) [0x683dff]
>>
>>  12: (OSD::init()+0x1448) [0x6930b8]
>>
>>  13: (main()+0x26b9) [0x62fd89]
>>
>>  14: (__libc_start_main()+0xed) [0x7fb07767876d]
>>
>>  15: ceph-osd() [0x635679]
>>
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>> --- logging levels ---
>>
>>    0/ 5 none
>>
>>    0/ 1 lockdep
>>
>>    0/ 1 context
>>
>>    1/ 1 crush
>>
>>    1/ 5 mds
>>
>>    1/ 5 mds_balancer
>>
>>    1/ 5 mds_locker
>>
>>    1/ 5 mds_log
>>
>>    1/ 5 mds_log_expire
>>
>>    1/ 5 mds_migrator
>>
>>    0/ 1 buffer
>>
>>    0/ 1 timer
>>
>>    0/ 1 filer
>>
>>    0/ 1 striper
>>
>>    0/ 1 objecter
>>
>>    0/ 5 rados
>>
>>    0/ 5 rbd
>>
>>    0/ 5 rbd_replay
>>
>>    0/ 5 journaler
>>
>>    0/ 5 objectcacher
>>
>>    0/ 5 client
>>
>>   20/20 osd
>>
>>    0/ 5 optracker
>>
>>    0/ 5 objclass
>>
>>    1/ 3 filestore
>>
>>    1/ 3 keyvaluestore
>>
>>    1/ 3 journal
>>
>>    0/ 5 ms
>>
>>    1/ 5 mon
>>
>>    0/10 monc
>>
>>    1/ 5 paxos
>>
>>    0/ 5 tp
>>
>>    1/ 5 auth
>>
>>    1/ 5 crypto
>>
>>    1/ 1 finisher
>>
>>    1/ 5 heartbeatmap
>>
>>    1/ 5 perfcounter
>>
>>    1/ 5 rgw
>>
>>    1/10 civetweb
>>
>>    1/ 5 javaclient
>>
>>    1/ 5 asok
>>
>>    1/ 1 throttle
>>
>>    0/ 0 refs
>>
>>    1/ 5 xio
>>
>>   -2/-2 (syslog threshold)
>>
>>   99/99 (stderr threshold)
>>
>>   max_recent     10000
>>
>>   max_new         1000
>>
>>   log_file
>>
>> --- end dump of recent events ---
>>
>>
>> Here is the PG info for 2.14
>>
>> ceph pg 2.14 query
>>
>> { "state": "active+undersized+degraded",
>>
>>   "snap_trimq": "[]",
>>
>>   "epoch": 39556,
>>
>>   "up": [
>>
>>         8,
>>
>>         7],
>>
>>   "acting": [
>>
>>         8,
>>
>>         7],
>>
>>   "actingbackfill": [
>>
>>         "7",
>>
>>         "8"],
>>
>>   "info": { "pgid": "2.14",
>>
>>       "last_update": "39533'175859",
>>
>>       "last_complete": "39533'175859",
>>
>>       "log_tail": "36964'172858",
>>
>>       "last_user_version": 175859,
>>
>>       "last_backfill": "MAX",
>>
>>       "purged_snaps": "[]",
>>
>>       "history": { "epoch_created": 1,
>>
>>           "last_epoch_started": 39536,
>>
>>           "last_epoch_clean": 39536,
>>
>>           "last_epoch_split": 0,
>>
>>           "same_up_since": 39534,
>>
>>           "same_interval_since": 39534,
>>
>>           "same_primary_since": 39527,
>>
>>           "last_scrub": "39533'175859",
>>
>>           "last_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>           "last_deep_scrub": "39533'175859",
>>
>>           "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>           "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},
>>
>>       "stats": { "version": "39533'175859",
>>
>>           "reported_seq": "281883",
>>
>>           "reported_epoch": "39556",
>>
>>           "state": "active+undersized+degraded",
>>
>>           "last_fresh": "2015-05-19 06:41:09.002111",
>>
>>           "last_change": "2015-05-18 10:19:22.277851",
>>
>>           "last_active": "2015-05-19 06:41:09.002111",
>>
>>           "last_clean": "2015-05-18 06:41:38.906417",
>>
>>           "last_became_active": "2013-05-07 04:23:31.972742",
>>
>>           "last_unstale": "2015-05-19 06:41:09.002111",
>>
>>           "last_undegraded": "2015-05-18 10:18:37.449550",
>>
>>           "last_fullsized": "2015-05-18 10:18:37.449550",
>>
>>           "mapping_epoch": 39527,
>>
>>           "log_start": "36964'172858",
>>
>>           "ondisk_log_start": "36964'172858",
>>
>>           "created": 1,
>>
>>           "last_epoch_clean": 39536,
>>
>>           "parent": "0.0",
>>
>>           "parent_split_bits": 0,
>>
>>           "last_scrub": "39533'175859",
>>
>>           "last_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>           "last_deep_scrub": "39533'175859",
>>
>>           "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>           "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>           "log_size": 3001,
>>
>>           "ondisk_log_size": 3001,
>>
>>           "stats_invalid": "0",
>>
>>           "stat_sum": { "num_bytes": 441982976,
>>
>>               "num_objects": 106,
>>
>>               "num_object_clones": 0,
>>
>>               "num_object_copies": 318,
>>
>>               "num_objects_missing_on_primary": 0,
>>
>>               "num_objects_degraded": 106,
>>
>>               "num_objects_misplaced": 0,
>>
>>               "num_objects_unfound": 0,
>>
>>               "num_objects_dirty": 11,
>>
>>               "num_whiteouts": 0,
>>
>>               "num_read": 61399,
>>
>>               "num_read_kb": 1285319,
>>
>>               "num_write": 135192,
>>
>>               "num_write_kb": 2422029,
>>
>>               "num_scrub_errors": 0,
>>
>>               "num_shallow_scrub_errors": 0,
>>
>>               "num_deep_scrub_errors": 0,
>>
>>               "num_objects_recovered": 79,
>>
>>               "num_bytes_recovered": 329883648,
>>
>>               "num_keys_recovered": 0,
>>
>>               "num_objects_omap": 0,
>>
>>               "num_objects_hit_set_archive": 0,
>>
>>               "num_bytes_hit_set_archive": 0},
>>
>>           "stat_cat_sum": {},
>>
>>           "up": [
>>
>>                 8,
>>
>>                 7],
>>
>>           "acting": [
>>
>>                 8,
>>
>>                 7],
>>
>>           "blocked_by": [],
>>
>>           "up_primary": 8,
>>
>>           "acting_primary": 8},
>>
>>       "empty": 0,
>>
>>       "dne": 0,
>>
>>       "incomplete": 0,
>>
>>       "last_epoch_started": 39536,
>>
>>       "hit_set_history": { "current_last_update": "0'0",
>>
>>           "current_last_stamp": "0.000000",
>>
>>           "current_info": { "begin": "0.000000",
>>
>>               "end": "0.000000",
>>
>>               "version": "0'0"},
>>
>>           "history": []}},
>>
>>   "peer_info": [
>>
>>         { "peer": "7",
>>
>>           "pgid": "2.14",
>>
>>           "last_update": "39533'175859",
>>
>>           "last_complete": "39533'175859",
>>
>>           "log_tail": "36964'172858",
>>
>>           "last_user_version": 175859,
>>
>>           "last_backfill": "MAX",
>>
>>           "purged_snaps": "[]",
>>
>>           "history": { "epoch_created": 1,
>>
>>               "last_epoch_started": 39536,
>>
>>               "last_epoch_clean": 39536,
>>
>>               "last_epoch_split": 0,
>>
>>               "same_up_since": 39534,
>>
>>               "same_interval_since": 39534,
>>
>>               "same_primary_since": 39527,
>>
>>               "last_scrub": "39533'175859",
>>
>>               "last_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>               "last_deep_scrub": "39533'175859",
>>
>>               "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
>>
>>               "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},
>>
>>           "stats": { "version": "39533'175858",
>>
>>               "reported_seq": "281598",
>>
>>               "reported_epoch": "39533",
>>
>>               "state": "active+clean",
>>
>>               "last_fresh": "2015-05-13 21:58:43.553887",
>>
>>               "last_change": "2015-05-12 22:50:16.011917",
>>
>>               "last_active": "2015-05-13 21:58:43.553887",
>>
>>               "last_clean": "2015-05-13 21:58:43.553887",
>>
>>               "last_became_active": "2013-05-07 04:23:31.972742",
>>
>>               "last_unstale": "2015-05-13 21:58:43.553887",
>>
>>               "last_undegraded": "2015-05-13 21:58:43.553887",
>>
>>               "last_fullsized": "2015-05-13 21:58:43.553887",
>>
>>               "mapping_epoch": 39527,
>>
>>               "log_start": "36964'172857",
>>
>>               "ondisk_log_start": "36964'172857",
>>
>>               "created": 1,
>>
>>               "last_epoch_clean": 39529,
>>
>>               "parent": "0.0",
>>
>>               "parent_split_bits": 0,
>>
>>               "last_scrub": "39533'175857",
>>
>>               "last_scrub_stamp": "2015-05-12 22:50:16.011867",
>>
>>               "last_deep_scrub": "39533'175856",
>>
>>               "last_deep_scrub_stamp": "2015-05-10 10:30:24.933431",
>>
>>               "last_clean_scrub_stamp": "2015-05-12 22:50:16.011867",
>>
>>               "log_size": 3001,
>>
>>               "ondisk_log_size": 3001,
>>
>>               "stats_invalid": "0",
>>
>>               "stat_sum": { "num_bytes": 441982976,
>>
>>                   "num_objects": 106,
>>
>>                   "num_object_clones": 0,
>>
>>                   "num_object_copies": 315,
>>
>>                   "num_objects_missing_on_primary": 0,
>>
>>                   "num_objects_degraded": 0,
>>
>>                   "num_objects_misplaced": 0,
>>
>>                   "num_objects_unfound": 0,
>>
>>                   "num_objects_dirty": 11,
>>
>>                   "num_whiteouts": 0,
>>
>>                   "num_read": 61157,
>>
>>                   "num_read_kb": 1281187,
>>
>>                   "num_write": 135192,
>>
>>                   "num_write_kb": 2422029,
>>
>>                   "num_scrub_errors": 0,
>>
>>                   "num_shallow_scrub_errors": 0,
>>
>>                   "num_deep_scrub_errors": 0,
>>
>>                   "num_objects_recovered": 79,
>>
>>                   "num_bytes_recovered": 329883648,
>>
>>                   "num_keys_recovered": 0,
>>
>>                   "num_objects_omap": 0,
>>
>>                   "num_objects_hit_set_archive": 0,
>>
>>                   "num_bytes_hit_set_archive": 0},
>>
>>               "stat_cat_sum": {},
>>
>>               "up": [
>>
>>                     8,
>>
>>                     7],
>>
>>               "acting": [
>>
>>                     8,
>>
>>                     7],
>>
>>               "blocked_by": [],
>>
>>               "up_primary": 8,
>>
>>               "acting_primary": 8},
>>
>>           "empty": 0,
>>
>>           "dne": 0,
>>
>>           "incomplete": 0,
>>
>>           "last_epoch_started": 39536,
>>
>>           "hit_set_history": { "current_last_update": "0'0",
>>
>>               "current_last_stamp": "0.000000",
>>
>>               "current_info": { "begin": "0.000000",
>>
>>                   "end": "0.000000",
>>
>>                   "version": "0'0"},
>>
>>               "history": []}}],
>>
>>   "recovery_state": [
>>
>>         { "name": "Started\/Primary\/Active",
>>
>>           "enter_time": "2015-05-18 10:18:37.449561",
>>
>>           "might_have_unfound": [],
>>
>>           "recovery_progress": { "backfill_targets": [],
>>
>>               "waiting_on_backfill": [],
>>
>>               "last_backfill_started": "0\/\/0\/\/-1",
>>
>>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>>
>>                   "end": "0\/\/0\/\/-1",
>>
>>                   "objects": []},
>>
>>               "peer_backfill_info": [],
>>
>>               "backfills_in_flight": [],
>>
>>               "recovering": [],
>>
>>               "pg_backend": { "pull_from_peer": [],
>>
>>                   "pushing": []}},
>>
>>           "scrub": { "scrubber.epoch_start": "39527",
>>
>>               "scrubber.active": 0,
>>
>>               "scrubber.block_writes": 0,
>>
>>               "scrubber.waiting_on": 0,
>>
>>               "scrubber.waiting_on_whom": []}},
>>
>>         { "name": "Started",
>>
>>           "enter_time": "2015-05-18 10:18:05.335040"}],
>>
>>   "agent_state": {}}
>>
>> On Mon, May 18, 2015 at 2:34 PM, Berant Lemmenes <ber...@lemmenes.com>
>> wrote:
>>
>> > Sam,
>> >
>> > Thanks for taking a look. It does seem to fit my issue. Would just
>> > removing the 5.0_head directory be appropriate or would using
>> > ceph-objectstore-tool be better?
>> >
>> > Thanks,
>> > Berant
>> >
>> > On Mon, May 18, 2015 at 1:47 PM, Samuel Just <sj...@redhat.com> wrote:
>> >
>> >> You have most likely hit http://tracker.ceph.com/issues/11429.  There
>> >> are some workarounds in the bugs marked as duplicates of that bug, or
>> >> you can wait for the next hammer point release.
>> >> -Sam
>> >>
>> >> ----- Original Message -----
>> >> From: "Berant Lemmenes" <ber...@lemmenes.com>
>> >> To: ceph-users@lists.ceph.com
>> >> Sent: Monday, May 18, 2015 10:24:38 AM
>> >> Subject: [ceph-users] OSD unable to start (giant -> hammer)
>> >>
>> >> Hello all,
>> >>
>> >> I've encountered a problem when upgrading my single-node home cluster
>> >> from giant to hammer, and I would greatly appreciate any insight.
>> >>
>> >> I upgraded the packages as normal, then proceeded to restart the mon
>> >> and, once that came back, restarted the first OSD (osd.3). However, it
>> >> subsequently won't start, and crashes with the following failed
>> >> assertion:
>> >>
>> >>
>> >>
>> >> osd/OSD.h: 716: FAILED assert(ret)
>> >>
>> >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >>
>> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x7f) [0xb1784f]
>> >>
>> >> 2: (OSD::load_pgs()+0x277b) [0x6850fb]
>> >>
>> >> 3: (OSD::init()+0x1448) [0x6930b8]
>> >>
>> >> 4: (main()+0x26b9) [0x62fd89]
>> >>
>> >> 5: (__libc_start_main()+0xed) [0x7f2345bc976d]
>> >>
>> >> 6: ceph-osd() [0x635679]
>> >>
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> >> to interpret this.
>> >>
>> >>
>> >>
>> >>
>> >> --- logging levels ---
>> >>
>> >> 0/ 5 none
>> >>
>> >> 0/ 1 lockdep
>> >>
>> >> 0/ 1 context
>> >>
>> >> 1/ 1 crush
>> >>
>> >> 1/ 5 mds
>> >>
>> >> 1/ 5 mds_balancer
>> >>
>> >> 1/ 5 mds_locker
>> >>
>> >> 1/ 5 mds_log
>> >>
>> >> 1/ 5 mds_log_expire
>> >>
>> >> 1/ 5 mds_migrator
>> >>
>> >> 0/ 1 buffer
>> >>
>> >> 0/ 1 timer
>> >>
>> >> 0/ 1 filer
>> >>
>> >> 0/ 1 striper
>> >>
>> >> 0/ 1 objecter
>> >>
>> >> 0/ 5 rados
>> >>
>> >> 0/ 5 rbd
>> >>
>> >> 0/ 5 rbd_replay
>> >>
>> >> 0/ 5 journaler
>> >>
>> >> 0/ 5 objectcacher
>> >>
>> >> 0/ 5 client
>> >>
>> >> 0/ 5 osd
>> >>
>> >> 0/ 5 optracker
>> >>
>> >> 0/ 5 objclass
>> >>
>> >> 1/ 3 filestore
>> >>
>> >> 1/ 3 keyvaluestore
>> >>
>> >> 1/ 3 journal
>> >>
>> >> 0/ 5 ms
>> >>
>> >> 1/ 5 mon
>> >>
>> >> 0/10 monc
>> >>
>> >> 1/ 5 paxos
>> >>
>> >> 0/ 5 tp
>> >>
>> >> 1/ 5 auth
>> >>
>> >> 1/ 5 crypto
>> >>
>> >> 1/ 1 finisher
>> >>
>> >> 1/ 5 heartbeatmap
>> >>
>> >> 1/ 5 perfcounter
>> >>
>> >> 1/ 5 rgw
>> >>
>> >> 1/10 civetweb
>> >>
>> >> 1/ 5 javaclient
>> >>
>> >> 1/ 5 asok
>> >>
>> >> 1/ 1 throttle
>> >>
>> >> 0/ 0 refs
>> >>
>> >> 1/ 5 xio
>> >>
>> >> -2/-2 (syslog threshold)
>> >>
>> >> 99/99 (stderr threshold)
>> >>
>> >> max_recent 10000
>> >>
>> >> max_new 1000
>> >>
>> >> log_file
>> >>
>> >> --- end dump of recent events ---
>> >>
>> >> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> >>
>> >> *** Caught signal (Aborted) **
>> >>
>> >> in thread 7f2347f71780
>> >>
>> >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >>
>> >> 1: ceph-osd() [0xa1fe55]
>> >>
>> >> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>> >>
>> >> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>> >>
>> >> 4: (abort()+0x17b) [0x7f2345be183b]
>> >>
>> >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>> >>
>> >> 6: (()+0xb5846) [0x7f234652d846]
>> >>
>> >> 7: (()+0xb5873) [0x7f234652d873]
>> >>
>> >> 8: (()+0xb596e) [0x7f234652d96e]
>> >>
>> >> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x259) [0xb17a29]
>> >>
>> >> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>> >>
>> >> 11: (OSD::init()+0x1448) [0x6930b8]
>> >>
>> >> 12: (main()+0x26b9) [0x62fd89]
>> >>
>> >> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>> >>
>> >> 14: ceph-osd() [0x635679]
>> >>
>> >> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **
>> >>
>> >> in thread 7f2347f71780
>> >>
>> >>
>> >>
>> >>
>> >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >>
>> >> 1: ceph-osd() [0xa1fe55]
>> >>
>> >> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>> >>
>> >> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>> >>
>> >> 4: (abort()+0x17b) [0x7f2345be183b]
>> >>
>> >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>> >>
>> >> 6: (()+0xb5846) [0x7f234652d846]
>> >>
>> >> 7: (()+0xb5873) [0x7f234652d873]
>> >>
>> >> 8: (()+0xb596e) [0x7f234652d96e]
>> >>
>> >> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x259) [0xb17a29]
>> >>
>> >> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>> >>
>> >> 11: (OSD::init()+0x1448) [0x6930b8]
>> >>
>> >> 12: (main()+0x26b9) [0x62fd89]
>> >>
>> >> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>> >>
>> >> 14: ceph-osd() [0x635679]
>> >>
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> >> to interpret this.
>> >>
>> >>
>> >>
>> >>
>> >> --- begin dump of recent events ---
>> >>
>> >> 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **
>> >>
>> >> in thread 7f2347f71780
>> >>
>> >>
>> >>
>> >>
>> >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >>
>> >> 1: ceph-osd() [0xa1fe55]
>> >>
>> >> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>> >>
>> >> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>> >>
>> >> 4: (abort()+0x17b) [0x7f2345be183b]
>> >>
>> >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>> >>
>> >> 6: (()+0xb5846) [0x7f234652d846]
>> >>
>> >> 7: (()+0xb5873) [0x7f234652d873]
>> >>
>> >> 8: (()+0xb596e) [0x7f234652d96e]
>> >>
>> >> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x259) [0xb17a29]
>> >>
>> >> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>> >>
>> >> 11: (OSD::init()+0x1448) [0x6930b8]
>> >>
>> >> 12: (main()+0x26b9) [0x62fd89]
>> >>
>> >> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>> >>
>> >> 14: ceph-osd() [0x635679]
>> >>
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> >> to interpret this.
>> >>
>> >>
>> >>
>> >>
>> >> --- logging levels ---
>> >>
>> >> 0/ 5 none
>> >>
>> >> 0/ 1 lockdep
>> >>
>> >> 0/ 1 context
>> >>
>> >> 1/ 1 crush
>> >>
>> >> 1/ 5 mds
>> >>
>> >> 1/ 5 mds_balancer
>> >>
>> >> 1/ 5 mds_locker
>> >>
>> >> 1/ 5 mds_log
>> >>
>> >> 1/ 5 mds_log_expire
>> >>
>> >> 1/ 5 mds_migrator
>> >>
>> >> 0/ 1 buffer
>> >>
>> >> 0/ 1 timer
>> >>
>> >> 0/ 1 filer
>> >>
>> >> 0/ 1 striper
>> >>
>> >> 0/ 1 objecter
>> >>
>> >> 0/ 5 rados
>> >>
>> >> 0/ 5 rbd
>> >>
>> >> 0/ 5 rbd_replay
>> >>
>> >> 0/ 5 journaler
>> >>
>> >> 0/ 5 objectcacher
>> >>
>> >> 0/ 5 client
>> >>
>> >> 0/ 5 osd
>> >>
>> >> 0/ 5 optracker
>> >>
>> >> 0/ 5 objclass
>> >>
>> >> 1/ 3 filestore
>> >>
>> >> 1/ 3 keyvaluestore
>> >>
>> >> 1/ 3 journal
>> >>
>> >> 0/ 5 ms
>> >>
>> >> 1/ 5 mon
>> >>
>> >> 0/10 monc
>> >>
>> >> 1/ 5 paxos
>> >>
>> >> 0/ 5 tp
>> >>
>> >> 1/ 5 auth
>> >>
>> >> 1/ 5 crypto
>> >>
>> >> 1/ 1 finisher
>> >>
>> >> 1/ 5 heartbeatmap
>> >>
>> >> 1/ 5 perfcounter
>> >>
>> >> 1/ 5 rgw
>> >>
>> >> 1/10 civetweb
>> >>
>> >> 1/ 5 javaclient
>> >>
>> >> 1/ 5 asok
>> >>
>> >> 1/ 1 throttle
>> >>
>> >> 0/ 0 refs
>> >>
>> >> 1/ 5 xio
>> >>
>> >> -2/-2 (syslog threshold)
>> >>
>> >> 99/99 (stderr threshold)
>> >>
>> >> max_recent 10000
>> >>
>> >> max_new 1000
>> >>
>> >> log_file
>> >>
>> >> --- end dump of recent events ---
>> >>
>> >>
>> >> I've included a 'ceph osd dump' here:
>> >> http://pastebin.com/RKbaY7nv
>> >>
>> >> ceph osd tree:
>> >>
>> >>
>> >> ceph osd tree
>> >>
>> >> ID WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> >> -1 24.14000 root default
>> >> -3        0 rack unknownrack
>> >> -2        0 host ceph-test
>> >> -4 24.14000 host ceph01
>> >>  0  1.50000 osd.0              down        0          1.00000
>> >>  2  1.50000 osd.2              down        0          1.00000
>> >>  3  1.50000 osd.3              down  1.00000          1.00000
>> >>  5  2.00000 osd.5                up  1.00000          1.00000
>> >>  6  2.00000 osd.6                up  1.00000          1.00000
>> >>  7  2.00000 osd.7                up  1.00000          1.00000
>> >>  8  2.00000 osd.8                up  1.00000          1.00000
>> >>  9  2.00000 osd.9                up  1.00000          1.00000
>> >> 10  2.00000 osd.10               up  1.00000          1.00000
>> >>  4  4.00000 osd.4                up  1.00000          1.00000
>> >>  1  3.64000 osd.1                up  1.00000          1.00000
>> >>
>> >>
>> >>
>> >>
>> >> Note that osd.0 and osd.2 were down prior to the upgrade and the cluster
>> >> was healthy (these are failed disks that have been out for some time,
>> >> just not removed from CRUSH).
>> >>
>> >> I've also included a log with OSD debugging set to 20 here:
>> >>
>> >> https://dl.dropboxusercontent.com/u/1043493/osd.3.log.gz
>> >>
>> >>
>> >> Looking through that file, it appears the last pg that it loads
>> >> successfully is 2.3f6; it then moves on to 5.0:
>> >>
>> >> -3> 2015-05-18 12:25:24.292091 7f6f407f9780 10 osd.3 39533 load_pgs
>> >> loaded pg[2.3f6( v 39533'289849 (37945'286848,39533'289849] local-les=39532
>> >> n=99 ec=1 les/c 39532/39532 39531/39531/39523) [5,4,3] r=2 lpr=39533
>> >> pi=34961-39530/34 crt=39533'289846 lcod 0'0 inactive NOTIFY]
>> >> log((37945'286848,39533'289849], crt=39533'289846)
>> >>
>> >> -2> 2015-05-18 12:25:24.292100 7f6f407f9780 10 osd.3 39533 pgid 5.0 coll
>> >> 5.0_head
>> >>
>> >> -1> 2015-05-18 12:25:24.570188 7f6f407f9780 20 osd.3 0 get_map 34144 -
>> >> loading and decoding 0x411fd80
>> >>
>> >> 0> 2015-05-18 12:26:02.758914 7f6f407f9780 -1 osd/OSD.h: In function
>> >> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6f407f9780 time
>> >> 2015-05-18 12:25:24.620468
>> >>
>> >>
>> >>
>> >> osd/OSD.h: 716: FAILED assert(ret)
>> >>
>> >> [snip]
>> >>
>> >> Note that I don't see 5.0 in a pg dump.
>> >>
>> >>
>> >>
>> >>
>> >> Thanks in advance!
>> >>
>> >> Berant
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
