Hi,
My turn.
We suddenly have a big outage which is similar/identical to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
Some of the osds are runnable, but most crash when they start -- crc
error in OSDMap::decode.
I'm able to extract an osd map from a good osd and set it into the
crashing osds to get them started.
On 2/20/20 12:40 PM, Dan van der Ster wrote:
> Hi,
>
> My turn.
> We suddenly have a big outage which is similar/identical to
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
>
> Some of the osds are runnable, but most crash when they start -- crc
> error in OSDMap::decode.
osd.680 is at epoch 2983572.
osd.666 crashes at 2982809 or 2982808:
-407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
-407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
in thread 7f4d931b5b80 thread_name:ceph-osd
So I grabbed 2982809 and 2982808.
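Roughly, this is how one can compare an osd's copy of a map against the
mon's copy -- a sketch only; the osd id, data path, and epoch below are
the ones from our logs above, and the osd must be stopped before running
ceph-objectstore-tool against its store:

  # dump one osdmap epoch from the stopped osd's local store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-666 \
      --op get-osdmap --epoch 2982809 --file osdmap.666.2982809

  # fetch the mon's copy of the same epoch
  ceph osd getmap 2982809 -o osdmap.mon.2982809

  # byte-by-byte diff; a single flipped bit shows up as one differing byte
  cmp -l osdmap.666.2982809 osdmap.mon.2982809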
For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6
Somehow single bits are getting flipped in the osdmaps -- maybe
network, maybe memory, maybe a bug.
To get an osd starting, we have to extract the full osdmap from the
mon, and set it into the crashing osd.
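Concretely, the remedy looks something like this -- a sketch, not a
verbatim transcript; the osd id and epoch are the ones from this thread,
and the osd must be down while you write to its store:

  # fetch the full map for the bad epoch from the mon
  ceph osd getmap 2982809 -o osdmap.good

  # inject it into the crashing osd's local store, then start the osd
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-666 \
      --op set-osdmap --file osdmap.good
  systemctl start ceph-osd@666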
> On 20 Feb 2020, at 19:54, Dan van der Ster wrote the following:
>
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.
>
Weird!
On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote:
>
> > On 20 Feb 2020, at 19:54, Dan van der Ster wrote the following:
> >
> > For those following along, the issue is here:
> > https://tracker.ceph.com/issues/39525#note-6
> >
> > Somehow single bits are getting flipped in the osdmaps -- maybe
> > network, maybe memory, maybe a bug.
Hi Troy,
Looks like we hit the same thing today -- Sage posted some observations
here: https://tracker.ceph.com/issues/39525#note-6
Did it happen again in your cluster?
Cheers, Dan
On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan wrote:
>
> While I'm still unsure how this happened, this is what was done
Dan,
Yes, I have had this happen several times since, but fortunately the
last couple of times it has only happened to one or two OSDs at a time,
so it didn't take down entire pools. The remedy has been the same.
I had been holding off on too much further investigation because I
thought the source
Thanks Troy for the quick response.
Are you still running Mimic on that cluster? Are you seeing the crashes
in Nautilus too?
Our cluster is also quite old -- so it could very well be memory or
network gremlins.
Cheers, Dan
On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan wrote:
>
> Dan,
>
> Yes, I have had this happen several times since, but fortunately the
> last couple of times it has only happened to one or two OSDs at a time
I hope I don't sound too happy to hear that you've run into this same
problem, but I'm glad to see that it's not just a one-off with us. :)
We're still running Mimic. I haven't yet deployed Nautilus anywhere.
Thanks
-Troy
On 2/20/20 2:14 PM, Dan van der Ster wrote:
Thanks Troy
Another thing... in your thread you said that only the *SSDs* in your
cluster had crashed, but not the HDDs.
Were both the SSDs and HDDs bluestore? Did the HDDs ever crash subsequently?
Which OS/kernel do you run? We're on CentOS 7 with quite some uptime.
On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan wrote:
Dan,
This has happened to HDDs also, and to NVMe most recently. CentOS 7.7;
the kernel is usually within 6 months of current updates. We try to
stay relatively up to date.
-Troy
On 2/20/20 5:28 PM, Dan van der Ster wrote:
> Another thing... in your thread you said that only the *SSDs* in your
> cluster had crashed, but not the HDDs.
This evening I was awakened by an error message:

  cluster:
    id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
    health: HEALTH_ERR
            Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))

  services:
I have not seen any other problems with anything else in the cluster.
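If it helps anyone hitting the same error: once the underlying network
issue is fixed, the failed-module error can usually be cleared by
bouncing the module. A sketch using the standard mgr module commands
(nothing here is specific to this cluster):

  # check which mgr module reported the failure
  ceph mgr module ls

  # disable and re-enable telemetry to clear the failed state
  ceph mgr module disable telemetry
  ceph mgr module enable telemetry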