[ceph-users] osdmap::decode crc error -- 13.2.7 -- most osds down
Hi,

My turn. We suddenly have a big outage which is similar/identical to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html

Some of the osds are runnable, but most crash when they start -- crc error in osdmap::decode.
I'm able to extract an osd map from a good osd and it decodes well with osdmaptool:

# ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-680/ --file osd.680.map

But when I try on one of the bad osds I get:

# ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.666.map
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: bad crc, actual 822724616 != expected 2334082500
*** Caught signal (Aborted) **
 in thread 7f600aa42d00 thread_name:ceph-objectstor
 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7f5ffefc45f0]
 2: (gsignal()+0x37) [0x7f5ffdbae337]
 3: (abort()+0x148) [0x7f5ffdbafa28]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
 5: (()+0x5e746) [0x7f5ffe4bc746]
 6: (()+0x5e773) [0x7f5ffe4bc773]
 7: (()+0x5e993) [0x7f5ffe4bc993]
 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1d0) [0x55d30a489190]
 11: (main()+0x5340) [0x55d30a3aae70]
 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
 13: (()+0x3a0f40) [0x55d30a483f40]
Aborted (core dumped)

I think I want to inject the osdmap, but can't:

# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.680.map
osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.

How do I do this?

Thanks for any help!

dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
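For reference, the osdmaptool check mentioned above would look roughly like this -- a sketch reusing the file name from the dump command; a clean decode that prints a plausible epoch near the top is what "decodes well" means here:

# osdmaptool osd.680.map --print | head -5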
[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down
On 2/20/20 12:40 PM, Dan van der Ster wrote:
> I think I want to inject the osdmap, but can't:
>
> # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.680.map
> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.

Have you tried to list which epoch osd.680 is at and which one osd.666 is at? And which one the MONs are at?

Maybe there is a difference there?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
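One rough way to do that comparison is sketched below -- ceph-objectstore-tool needs the OSD stopped, and the grep pattern assumes the on-disk map objects are named osdmap.<epoch>:

# ceph osd dump | head -1
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-666/ --op meta-list | grep osdmap | tail -5

The first command prints the mons' current epoch; the second lists the newest osdmap epochs stored on the OSD itself.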
[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down
680 is epoch 2983572
666 crashes at 2982809 or 2982808

 -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
 -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
 in thread 7f4d931b5b80 thread_name:ceph-osd

So I grabbed 2982809 and 2982808 and am setting them.

Checking if the osds will start with that.

-- dan

On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander wrote:
> Have you tried to list which epoch osd.680 is at and which one osd.666
> is at? And which one the MONs are at?
>
> Maybe there is a difference there?
>
> Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
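The epoch to fetch is the one in the last add_map_bl line before the abort; a sketch of finding it and pulling those maps from the mons, assuming the default log location:

# grep add_map_bl /var/log/ceph/ceph-osd.666.log | tail -3
# ceph osd getmap 2982808 -o 2982808
# ceph osd getmap 2982809 -o 2982809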
[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down
For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6

Somehow single bits are getting flipped in the osdmaps -- maybe network, maybe memory, maybe a bug.

To get an osd starting, we have to extract the full osdmap from the mon, and set it into the crashing osd. So for the osd.666:

# ceph osd getmap 2982809 -o 2982809
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809

Some osds had multiple corrupted osdmaps -- so we scriptified the above.

As of now our PGs are all active, but we're not confident that this won't happen again (without knowing why the maps were corrupting).

Thanks to all who helped!

dan

On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster wrote:
> 680 is epoch 2983572
> 666 crashes at 2982809 or 2982808
>
> So I grabbed 2982809 and 2982808 and am setting them.
>
> Checking if the osds will start with that.
>
> -- dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
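The "scriptified" loop itself isn't included in the thread; a minimal sketch of what it could look like, assuming the OSD is stopped and the suspect epoch range is passed in by hand (the script name and layout are hypothetical):

#!/bin/bash
# repair-osdmaps.sh <osd-id> <first-epoch> <last-epoch> -- hypothetical helper
set -euo pipefail
OSD=$1 FIRST=$2 LAST=$3
DATA=/var/lib/ceph/osd/ceph-$OSD

for EPOCH in $(seq "$FIRST" "$LAST"); do
    # pull the authoritative full map from the mons ...
    ceph osd getmap "$EPOCH" -o "/tmp/osdmap.$EPOCH"
    # ... and overwrite the corrupted copy inside the stopped OSD
    ceph-objectstore-tool --op set-osdmap --data-path "$DATA" --file "/tmp/osdmap.$EPOCH"
done

systemctl start "ceph-osd@$OSD"

Invoked for the example above as ./repair-osdmaps.sh 666 2982808 2982809, with the ceph-osd@666 service stopped first.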
[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down
> On 20 Feb 2020, at 19:54, Dan van der Ster wrote:
>
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.

Weird!

But I did see things like this happen before. This was under Hammer and Jewel, where I needed to do these kinds of things. Crashes looked very similar.

> To get an osd starting, we have to extract the full osdmap from the
> mon, and set it into the crashing osd. So for the osd.666:
>
> # ceph osd getmap 2982809 -o 2982809
> # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809
>
> Some osds had multiple corrupted osdmaps -- so we scriptified the above.

Were those corrupted ones in sequence?

> As of now our PGs are all active, but we're not confident that this
> won't happen again (without knowing why the maps were corrupting).

Awesome!

Wido

> Thanks to all who helped!
>
> dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down
On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote:
> But I did see things like this happen before. This was under Hammer and Jewel,
> where I needed to do these kinds of things. Crashes looked very similar.
>
> Were those corrupted ones in sequence?

Yes, usually 1 to 3 osdmaps corrupted in sequence.

There's a theory that this might be related (https://tracker.ceph.com/issues/43903) but the backports to mimic or even nautilus look challenging.

-- dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
Hi Troy,

Looks like we hit the same today -- Sage posted some observations here:
https://tracker.ceph.com/issues/39525#note-6

Did it happen again in your cluster?

Cheers, Dan

On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan wrote:
>
> While I'm still unsure how this happened, this is what was done to solve this.
>
> Started OSD in foreground with debug 10, watched for the most recent
> osdmap epoch mentioned before abort(). For example, if it mentioned
> that it just tried to load 80896 and then crashed
>
> # ceph osd getmap -o osdmap.80896 80896
> # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
>
> Then I restarted the osd in foreground/debug, and repeated for the next
> osdmap epoch until it got past the first few seconds. This process
> worked for all but two OSDs. For the ones that succeeded I'd ^C and
> then start the osd via systemd
>
> For the remaining two, it would try loading the incremental map and then
> crash. I had presence of mind to make dd images of every OSD before
> starting this process, so I reverted these two to the state before
> injecting the osdmaps.
>
> I then injected the last 15 or so epochs of the osdmap in sequential
> order before starting the osd, with success.
>
> This leads me to believe that the step-wise injection didn't work
> because the osd had more subtle corruption that it got past, but it was
> confused when it requested the next incremental delta.
>
> Thanks again to Brad/badone for the guidance!
>
> Tracker issue updated.
>
> Here's the closing IRC dialogue re this issue (UTC-0700)
>
> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out
> yesterday, you've helped a ton, twice now :) I'm still concerned
> because I don't know how this happened. I'll feel better once
> everything's active+clean, but it's all at least active.
>
> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with
> Josh earlier and he shares my opinion this is likely somehow related to
> these drives or perhaps controllers, or at least specific to these machines
>
> 2019-08-19 16:31:04 < badone> however, there is a possibility you are
> seeing some extremely rare race that no one up to this point has seen before
>
> 2019-08-19 16:31:20 < badone> that is less likely though
>
> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
> successfully but wrote it out to disk in a format that it could not then
> read back in (unlikely) or...
>
> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
> written to disk
>
> 2019-08-19 16:33:46 < badone> the second is considered most likely by us
> but I recognise you may not share that opinion
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
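Troy's precaution of imaging every OSD before injecting anything is worth copying; a rough sketch for a BlueStore OSD, assuming the data directory's block symlink points at the underlying device and that /backup has enough room:

# OSD=77
# DEV=$(readlink -f /var/lib/ceph/osd/ceph-$OSD/block)
# dd if="$DEV" of="/backup/osd-$OSD.img" bs=4M conv=sparse status=progress

The OSD has to be stopped while the image is taken.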
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
Dan,

Yes, I have had this happen several times since, but fortunately the last couple of times it has only happened to one or two OSDs at a time, so it didn't take down entire pools. Remedy has been the same.

I had been holding off on too much further investigation because I thought the source of the issue may have been some old hardware gremlins, and we're waiting on some new hardware.

-Troy

On 2/20/20 1:40 PM, Dan van der Ster wrote:
> Looks like we hit the same today -- Sage posted some observations here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Did it happen again in your cluster?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
Thanks Troy for the quick response.

Are you still running mimic on that cluster? Seeing the crashes in nautilus too?

Our cluster is also quite old -- so it could very well be memory or network gremlins.

Cheers, Dan

On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan wrote:
> Yes, I have had this happen several times since, but fortunately the
> last couple of times it has only happened to one or two OSDs at a time,
> so it didn't take down entire pools. Remedy has been the same.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
I hope I don't sound too happy to hear that you've run into this same problem, but still I'm glad to see that it's not just a one-off problem with us. :)

We're still running Mimic. I haven't yet deployed Nautilus anywhere.

Thanks
-Troy

On 2/20/20 2:14 PM, Dan van der Ster wrote:
> Are you still running mimic on that cluster? Seeing the crashes in nautilus too?
>
> Our cluster is also quite old -- so it could very well be memory or network gremlins.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
Another thing... in your thread you said that only the *SSDs* in your cluster had crashed, but not the HDDs. Were both the SSDs and HDDs bluestore? Did the HDDs ever crash subsequently?

Which OS/kernel do you run? We're CentOS 7 with quite some uptime.

On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan wrote:
> I hope I don't sound too happy to hear that you've run into this same
> problem, but still I'm glad to see that it's not just a one-off problem
> with us. :)
>
> We're still running Mimic. I haven't yet deployed Nautilus anywhere.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]
Dan,

This has happened to HDDs also, and NVMe most recently.

CentOS 7.7; usually the kernel is within 6 months of current updates. We try to stay relatively up to date.

-Troy

On 2/20/20 5:28 PM, Dan van der Ster wrote:
> Another thing... in your thread you said that only the *SSDs* in your
> cluster had crashed, but not the HDDs. Were both the SSDs and HDDs
> bluestore? Did the HDDs ever crash subsequently?
>
> Which OS/kernel do you run? We're CentOS 7 with quite some uptime.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Module 'telemetry' has experienced an error
This evening I was awakened by an error message:

  cluster:
    id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
    health: HEALTH_ERR
            Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))

  services:

I have not seen any other problems with anything else on the cluster. I disabled and enabled the telemetry module, and health returned to OK status.

Any ideas on what could cause the issue? As far as I understand, telemetry is a module that sends messages to an external ceph server outside of the network.

Thank you for any advice,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
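For anyone who hits the same HEALTH_ERR, the disable/enable cycle the poster describes corresponds roughly to the following (a sketch; module behaviour varies a bit between releases):

# ceph health detail
# ceph mgr module disable telemetry
# ceph mgr module enable telemetry
# ceph mgr module ls | grep -A 5 enabled_modules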