Maybe. You might also check messages for "MCE" and/or "machine check"
entries and see. I have seen PCI errors on my SAS2008 under high I/O, so I
assume the same thing could happen when the video card is generating heavy
traffic. I removed all of the cards, cleaned their connectors, vacuumed out
the slots and then blew them out with air, and made sure to seat the cards
firmly and screw them in. That fixed a crash that had been going on for
many months. I only found the cause because I happened to be looking at
dmesg for something else and saw an MCE error that decoded to the PCI bus.
Mine is an AMD; your error may be the equivalent on Intel.
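
If it helps, here is a minimal sketch of that kind of log check in Python.
The keyword pattern, the function name find_hardware_error_lines, and the
script name are my own assumptions for illustration, not a standard tool;
adjust the pattern to whatever your kernel actually logs.

```python
import re
import sys

# Keywords come from this thread ("MCE", "machine check", "[Hardware Error]",
# "BERT:"). The exact pattern is an illustrative assumption, not a complete
# list of what a kernel may print.
PATTERN = re.compile(
    r"(?i:\bmce\b|machine check)|\[Hardware Error\]|\bBERT:"
)


def find_hardware_error_lines(lines):
    """Return the lines that mention an MCE, a machine check, or a BERT record."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    # Each argument is a saved log file, e.g. a copy of /var/log/messages
    # or the output of `dmesg > dmesg.txt`.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for hit in find_hardware_error_lines(handle):
                print(f"{path}: {hit}")
```

Run it as something like "python3 scan_hw_errors.py dmesg.txt" (the file
names are hypothetical). It only filters lines; decoding what an MCE
actually points at still has to be done separately.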

My PCI error had an MTBF of around a month, and about 95% of the time it
occurred during the weekly RAID check, under high I/O.

On Tue, Jan 11, 2022 at 4:30 PM Eyal Lebedinsky <fed...@eyal.emu.id.au>
wrote:

>
>
> On 11/01/2022 23.57, Roger Heflin wrote:
> > Well, usually a real hardware error (uncorrectable memory MCE, or CPU
> > memory MCE, or PCI MCE) will cause an immediate reset (no reset button
> > needed).
> >
> > That error could be a result of the reset button being pressed, and
>
> Looks like it was the reset. While the lockup happened more times, I may
> have hit the reset in only two cases.
>
> As I recall the hard lockup usually happens when I watch a video (mythtv
> or, rarely, youtube).
> May be related?
>
> > not really a hardware error.  I work with enterprise vendors' hw and
> > they classify the stupidest things as errors (on a reboot they
> > classify the nics and fiber channel cards losing link as hardware
> > errors--and this happens every boot on device init, so the boot
> > "errors" are 100x-1000x more common than the actual real life link
> > downs--so their false alarm rate is horrible).
> >
> > What kind of MB/HW is it?
> >
> > On Tue, Jan 11, 2022 at 5:16 AM Eyal Lebedinsky <fed...@eyal.emu.id.au> wrote:
> >>
> >> I just had the system lock-up hard, requiring hitting the reset button.
> >>
> >> I now see in the system log:
> >>
> >> Jan 11 19:28:54 e7 kernel: BERT: Error records from previous boot:
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]: event severity: fatal
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:  Error 0, type: fatal
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   section_type: Firmware Error Record Reference
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   Firmware Error Record Type: SOC Firmware Error Record Type1 (Legacy CrashLog Support)
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   Revision: 0
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   Record Identifier: 100300100000000
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> >> ... continue until
> >> Jan 11 19:28:54 e7 kernel: [Hardware Error]:   00000c00: ffffffff ffffffff ffffffff ffffffff  ................
> >>
> >> I understand that this is related to the preceding crash. If so, what does it tell me?
> >>
> >> Doing a search suggests that if this happens rarely then it can be ignored. True?
> >> I now see that I also had it last Oct but not earlier. This hard lock-up happens at times (more than twice for sure)
> >> and if I can do something about it then I would.
> >>
> >> TIA
> >>
> >> --
> >> Eyal Lebedinsky (fed...@eyal.emu.id.au)
>
> --
> Eyal Lebedinsky (fed...@eyal.emu.id.au)
> _______________________________________________
> users mailing list -- users@lists.fedoraproject.org
> To unsubscribe send an email to users-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
> Do not reply to spam on the list, report it:
> https://pagure.io/fedora-infrastructure
>