For whatever reason, in recent years I’ve seen these more often with Dells than with other systems. My first thought was that you might be running an ancient kernel, but I see that you aren’t. Is the kernel you’re running the stock one that comes with your distribution? I’ve seen CPU reset events on R750s running an elrepo kernel.
I suspect that some code change may have tickled a latent issue that you were perhaps fortunate not to have run into before, but this is entirely speculation.

> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
> <janek.bevendo...@uni-weimar.de> wrote:
>
> Yes, they are older Dell PowerEdges. I have to check whether there's newer
> firmware, but we've been running Ceph for years without these problems.
>
> I checked the logs on the host on which I had a lockup just an hour ago, but
> there's nothing besides the expected hard-reset messages. There are two older
> watchdog messages, but they are from March:
>
> --------------------------------------------------------------------------------
> SeqNumber = 2089
> Message ID = ASR0000
> Category = System
> AgentID = SEL
> Severity = Critical
> Timestamp = 2025-03-27 07:16:03
> Message = The watchdog timer expired.
> RawEventData =
> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> FQDD = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
> SeqNumber = 2088
> Message ID = ASR0000
> Category = System
> AgentID = SEL
> Severity = Critical
> Timestamp = 2025-03-27 07:06:41
> Message = The watchdog timer expired.
> RawEventData =
> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
> FQDD = WatchdogTimer.iDRAC.1
> --------------------------------------------------------------------------------
>
> I grepped the logs of another host where it happened, but couldn't find any
> watchdog messages there. I also think it's unlikely that all MDS hosts (we
> have five active, five hot standbys, and one cold standby) would suddenly
> start having hardware issues. I also ran a memtest on one of the hosts last
> week and couldn't find anything there either.
>
>
> On 16/04/2025 15:14, Anthony D'Atri wrote:
>> Curious, are your systems Dells?
>> If so, you might see some improvement from running DSU to update all the
>> firmware. It might also be illuminating to run `racadm lclog view`.
>>
>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>
>>> Hi,
>>>
>>> Since the latest Reef update, I have the problem that some of my hosts
>>> suddenly go into a state where all CPUs are stuck in kernel mode, causing
>>> all daemons on that host to become unresponsive. When I connect to the
>>> IPMI console, I see a lot of messages like:
>>>
>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>
>>> (it's basically a list of all processes running on the machine).
>>>
>>> Usually this resolves itself after several minutes, but sometimes I have
>>> to hard-reset the host. When this happens, all daemons are marked as down
>>> and I cannot interact with the host at all. I don't know what causes this,
>>> but I think it happens primarily on the hosts where my MDS run, and it
>>> seems to be triggered by events such as cluster rebalances or MDS
>>> restarts, or just randomly.
>>>
>>> I found a few reports about similar issues on the bug tracker and mailing
>>> list, but they are all very unspecific, unanswered, or more than 6 years
>>> old.
>>>
>>> Is there any way I can debug this? I already upgraded to Squid, but that
>>> didn't solve the problem. I also had massive issues with this during the
>>> upgrade, particularly at the end when the MDS were upgraded. I had to set
>>> the noout flag and then literally sit next to it to resume the upgrade
>>> every few minutes until it finally went through, because random MDS hosts
>>> went intermittently dark all the time.
>>>
>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>
>>> Any ideas? Thanks!
>>> Janek
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
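For the soft-lockup messages themselves, the kernel's lockup detector can be made more informative via standard sysctls. A sketch of an /etc/sysctl.d fragment — the file name is illustrative, but the sysctls are stock kernel knobs, nothing Ceph- or Dell-specific:

```ini
# /etc/sysctl.d/90-lockup-debug.conf -- illustrative file name
# Dump backtraces from every CPU (not just the stuck one) when a
# soft lockup is detected, to see what the other CPUs were doing:
kernel.softlockup_all_cpu_backtrace = 1

# Soft lockups are reported after roughly 2 * watchdog_thresh seconds
# (default thresh is 10). Leave as-is unless chasing false positives:
# kernel.watchdog_thresh = 10

# Optionally panic on a soft lockup so kdump can capture a vmcore
# for post-mortem analysis (use with care on production hosts):
# kernel.softlockup_panic = 1
```

Apply with `sysctl --system` (or a reboot); the extra backtraces land in the kernel log and on the IPMI console alongside the existing soft-lockup messages.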