Are you asking to add the OSD back with its data, or to add it back in as a
fresh OSD?  What is your `ceph status`?

On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado <
gagui...@aguilardelgado.com> wrote:

> Hi David,
>
> Thank you for the great explanation of the weights. I thought that Ceph
> was adjusting them based on disk size, but it seems it's not.
>
> But that was not the problem. I think the node was failing because of a
> software bug, since the disk was not full by any means:
>
> /dev/sdb1                     976284608 172396756   803887852  18%
> /var/lib/ceph/osd/ceph-1
>
> Now the question is whether I can safely add this OSD back again. Is it
> possible?
>
> Best regards,
>
>
>
> On 14/09/17 23:29, David Turner wrote:
>
> Your weights should more closely represent the size of the OSDs.  OSD3 and
> OSD6 are weighted properly, but your other 3 OSDs have the same weight even
> though OSD0 is twice the size of OSD2 and OSD4.
>
> Your OSD weights are what I thought you were referring to when you said you
> set the crush map to 1.  At some point it does look like you set all of
> your OSD weights to 1, which would also apply to OSD1.  If that OSD was too
> small for that much data, it would have filled up and been too full to
> start.  Can you mount that disk and see how much free space is on it?
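>
> For example, something along these lines should show it (a sketch; it
> assumes the old OSD1 data partition is /dev/sdb1 and that it is not
> currently mounted):
>
> mkdir -p /mnt/osd1
> mount /dev/sdb1 /mnt/osd1
> df -h /mnt/osd1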
>
> Just so you understand what that weight is, it is how much data the
> cluster is going to put on the OSD.  The default is for the weight to be
> the size of the OSD in TiB (1024-based, as opposed to TB, which is
> 1000-based).  If you set the weight of a 1TB disk and a 4TB disk both to 1,
> then the cluster will try to give them the same amount of data.  If you set
> the 4TB disk to a weight of 4, then the cluster will try to give it 4x more
> data than the 1TB drive (usually what you want).
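> (For reference, a 4TB drive is 4 x 10^12 bytes, or about 3.64 TiB, so its
> size-based default weight would be roughly 3.64 rather than exactly 4.)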
>
> In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2 has a
> weight of 1 so the cluster thinks they should each receive the same amount
> of data (which it did; they each have ~275GB of data).  OSD3 has a weight
> of 1.36380 (its size in TiB) and OSD6 has a weight of 0.90919, so they have
> basically the same %used space (17%), rather than the same amount of data,
> because their weights are based on their sizes.
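>
> If you want the data spread proportionally to disk size, you could set the
> other three weights to their sizes in TiB as well, along these lines (a
> sketch based on the sizes in your `ceph osd df` output; note that each
> reweight will move data around):
>
> ceph osd crush reweight osd.0 0.90430   # 926G / 1024
> ceph osd crush reweight osd.2 0.44922   # 460G / 1024
> ceph osd crush reweight osd.4 0.45410   # 465G / 1024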
>
> As long as you had enough replicas of your data in the cluster for it to
> recover from you removing OSD1 such that your cluster is health_ok without
> any missing objects, then there is nothing that you need off of OSD1 and
> ceph recovered from the lost disk successfully.
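>
> (A quick way to confirm that is to check that `ceph status` and `ceph
> health detail` report HEALTH_OK with no unfound or degraded objects.)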
>
> On Thu, Sep 14, 2017 at 4:39 PM Gonzalo Aguilar Delgado <
> gagui...@aguilardelgado.com> wrote:
>
>> Hello,
>>
>> I was on an old version of Ceph, and it showed a warning saying:
>>
>> *crush map* has straw_calc_version=*0*
>>
>> I read that adjusting it would only trigger a full rebalance, so the admin
>> should choose when to do it. So I went straight ahead and ran:
>>
>>
>> ceph osd crush tunables optimal
>>
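>> (In hindsight, I believe the straw_calc_version warning alone could have
>> been cleared with just `ceph osd crush set-tunable straw_calc_version 1`,
>> but I went for the full optimal tunables instead.)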
>>
>> It rebalanced as expected, but then I started to see lots of wrong PGs. I
>> discovered that it was because of my OSD1. I thought it was a disk failure,
>> so I added a new OSD6 and the system started to rebalance. In any case, the
>> OSD was still not starting.
>>
>> I thought about wiping it all, but I preferred to leave the disk as it
>> was, with the journal intact, in case I can recover and get data from it.
>> (See mail: [ceph-users] Scrub failing all the time, new inconsistencies
>> keep appearing.)
>>
>>
>> So here's the information, but it has OSD1 replaced by OSD3, sorry.
>>
>> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>>  0 1.00000  1.00000  926G  271G  654G 29.34 1.10 369
>>  2 1.00000  1.00000  460G  284G  176G 61.67 2.32 395
>>  4 1.00000  1.00000  465G  151G  313G 32.64 1.23 214
>>  3 1.36380  1.00000 1396G  239G 1157G 17.13 0.64 340
>>  6 0.90919  1.00000  931G  164G  766G 17.70 0.67 210
>>               TOTAL 4179G 1111G 3067G 26.60
>> MIN/MAX VAR: 0.64/2.32  STDDEV: 16.99
>>
>> As I said, I still have OSD1 intact, so I can do whatever you need except
>> re-adding it to the cluster, since I don't know what it will do; it might
>> cause havoc.
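>>
>> If it helps, I could try reading data directly from it with
>> ceph-objectstore-tool while it stays out of the cluster. Something like
>> this (a sketch, assuming the filestore and journal under
>> /var/lib/ceph/osd/ceph-1 are still readable and the OSD daemon is stopped):
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
>>     --journal-path /var/lib/ceph/osd/ceph-1/journal --op list-pgs
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
>>     --journal-path /var/lib/ceph/osd/ceph-1/journal \
>>     --pgid 10.8c --op export --file /tmp/pg-10.8c.export
>>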
>> Best regards,
>>
>>
>> On 14/09/17 17:12, David Turner wrote:
>>
>> What do you mean by "updated crush map to 1"?  Can you please provide a
>> copy of your crush map and `ceph osd df`?
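>>
>> Something like this should give a readable copy of the crush map (a
>> sketch; it assumes crushtool is installed on the node):
>>
>> ceph osd getcrushmap -o crushmap.bin
>> crushtool -d crushmap.bin -o crushmap.txt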
>>
>> On Wed, Sep 13, 2017 at 6:39 AM Gonzalo Aguilar Delgado <
>> gagui...@aguilardelgado.com> wrote:
>>
>>> Hi,
>>>
>>> I recently updated the crush map to 1, and all the relocation of the PGs
>>> finished. At the end I found that one of the OSDs is not starting.
>>>
>>> This is what it shows:
>>>
>>>
>>> 2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
>>>  in thread 7f49cbe12700 thread_name:filestore_sync
>>>
>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>  1: (()+0x9616ee) [0xa93c6ef6ee]
>>>  2: (()+0x11390) [0x7f49d9937390]
>>>  3: (gsignal()+0x38) [0x7f49d78d3428]
>>>  4: (abort()+0x16a) [0x7f49d78d502a]
>>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x26b) [0xa93c7ef43b]
>>>  6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
>>>  7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
>>>  8: (()+0x76ba) [0x7f49d992d6ba]
>>>  9: (clone()+0x6d) [0x7f49d79a53dd]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>>
>>> --- begin dump of recent events ---
>>>     -3> 2017-09-13 10:37:34.253808 7f49dac6e8c0  5 osd.1 pg_epoch: 6293
>>> pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282
>>> ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0
>>> pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit
>>> Initial 0.029683 0 0.000000
>>>     -2> 2017-09-13 10:37:34.253848 7f49dac6e8c0  5 osd.1 pg_epoch: 6293
>>> pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282
>>> ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0
>>> pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter
>>> Reset
>>>     -1> 2017-09-13 10:37:34.255018 7f49dac6e8c0  5 osd.1 pg_epoch: 6293
>>> pg[10.90(unlocked)] enter Initial
>>>      0> 2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal
>>> (Aborted) **
>>>  in thread 7f49cbe12700 thread_name:filestore_sync
>>>
>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>  1: (()+0x9616ee) [0xa93c6ef6ee]
>>>  2: (()+0x11390) [0x7f49d9937390]
>>>  3: (gsignal()+0x38) [0x7f49d78d3428]
>>>  4: (abort()+0x16a) [0x7f49d78d502a]
>>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x26b) [0xa93c7ef43b]
>>>  6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
>>>  7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
>>>  8: (()+0x76ba) [0x7f49d992d6ba]
>>>  9: (clone()+0x6d) [0x7f49d79a53dd]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>>
>>> --- logging levels ---
>>>    0/ 5 none
>>>    0/ 1 lockdep
>>>    0/ 1 context
>>>    1/ 1 crush
>>>    1/ 5 mds
>>>    1/ 5 mds_balancer
>>>    1/ 5 mds_locker
>>>    1/ 5 mds_log
>>>    1/ 5 mds_log_expire
>>>    1/ 5 mds_migrator
>>>    0/ 1 buffer
>>>    0/ 1 timer
>>>    0/ 1 filer
>>>    0/ 1 striper
>>>    0/ 1 objecter
>>>    0/ 5 rados
>>>    0/ 5 rbd
>>>    0/ 5 rbd_mirror
>>>    0/ 5 rbd_replay
>>>    0/ 5 journaler
>>>    0/ 5 objectcacher
>>>    0/ 5 client
>>>    0/ 5 osd
>>>    0/ 5 optracker
>>>    0/ 5 objclass
>>>    1/ 3 filestore
>>>    1/ 3 journal
>>>    0/ 5 ms
>>>    1/ 5 mon
>>>    0/10 monc
>>>    1/ 5 paxos
>>>    0/ 5 tp
>>>    1/ 5 auth
>>>    1/ 5 crypto
>>>    1/ 1 finisher
>>>    1/ 5 heartbeatmap
>>>    1/ 5 perfcounter
>>>    1/ 5 rgw
>>>    1/10 civetweb
>>>    1/ 5 javaclient
>>>    1/ 5 asok
>>>    1/ 1 throttle
>>>    0/ 0 refs
>>>    1/ 5 xio
>>>    1/ 5 compressor
>>>    1/ 5 newstore
>>>    1/ 5 bluestore
>>>    1/ 5 bluefs
>>>    1/ 3 bdev
>>>    1/ 5 kstore
>>>    4/ 5 rocksdb
>>>    4/ 5 leveldb
>>>    1/ 5 kinetic
>>>    1/ 5 fuse
>>>   -2/-2 (syslog threshold)
>>>   -1/-1 (stderr threshold)
>>>   max_recent     10000
>>>   max_new         1000
>>>   log_file /var/log/ceph/ceph-osd.1.log
>>> --- end dump of recent events ---
>>>
>>>
>>>
>>> Is there any way to recover it or should I open a bug?
>>>
>>>
>>> Best regards
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>