Well I suspected the PS, but the guy I spoke with at Titan said some other things would fail before the SSD if that was the problem.
The power should be pretty stable, and I did connect the machine to a good transient-suppressor strip. Anyway, there was no lightning when it died, which was sometime in the past 24 hours. I had been checking the smartmontools output every few days, and it showed no errors and temperatures below 37C.

Titan has suggested installing a SATA SSD (eliminating the M.2), and I'm going to try that. He also suggested it might be a software issue, e.g. that something might be erasing the partition table on the drive (I don't have another machine handy to verify this), but this seems really unlikely. I just installed F35 and a moderate set of scientific packages, no proprietary software. The only access is via ssh inside a VPN, and I have the only account.

On Tue, Feb 22, 2022 at 10:47 AM George N. White III <gnw...@gmail.com> wrote:

> On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbeck...@gmail.com> wrote:
>
>> Thanks Richard. Yes, I talked with Titan; they suggested trying the
>> pcie-m.2 adapter. I will try them again.
>> I have not checked for BIOS updates. Not sure how to go about that (last
>> time I did that it required an MS-DOS floppy disk).
>>
>> Haven't tried the SSDs in another device because I don't have one. But
>> the fact that replacing the SSD causes it to work, where it wasn't working
>> before, tells me they were damaged. I have at least once powered the
>> workstation off and on, and the BIOS did not find any SSD to boot from.
>> So a power cycle didn't fix it, but replacing the SSD did.
>>
>> I will try Titan again later today, but just looking for ideas.
>>
>
> With this history, I'd probably replace the workstation power supply. I
> would also scan the system board for capacitors with bulging tops or
> overheated components.
>
> Are there any externally powered devices connected to the workstation
> (other than the monitor)?
>
> Are you in an area with frequent lightning storms? How stable is your
> power? Is the system connected to a UPS?
>
> I had a similar experience with spinning disks in a system that contained
> a drive-bay radio receiver and was connected to a satellite dish and GPS
> receiver on the roof, and an antenna controller. Everything was powered by
> a high quality UPS. I added a heavy wire connecting the antenna controller
> case to the workstation case and the failures stopped.
>
> I gather you now have space for two m.2 SSDs. If you haven't discarded
> the non-working devices, it would be interesting to see if any are
> detected and what smartmontools says about them, but you also have the
> option to put /var on a separate drive. Smartmontools can monitor a drive
> and report any problems it detects, but you may also want to run
> self-tests periodically.
>
>>
>> Thanks,
>> Neal
>>
>> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1...@gmail.com>
>> wrote:
>>
>>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbeck...@gmail.com> wrote:
>>>
>>>> I know this is a bit OT, but you guys are great at answering all
>>>> questions.
>>>>
>>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC
>>>> cpu). After about 1 year it stopped working. I could ssh to it, and
>>>> almost any command would return Input/Output error. Unfortunately
>>>> journalctl gave input/output error so I can't see logs. cat
>>>> /proc/partitions did not show any nvme device (the root device) on which
>>>> the OS was installed.
>>>>
>>>> I replaced the SSD with a samsung 980 pro. Reinstalled fedora. It
>>>> then worked a few weeks, then the exact same symptoms.
>>>>
>>>> I replaced the SSD with another samsung 980 pro, this time with
>>>> heatsink. Reinstalled fedora. It worked a few weeks. Then same symptoms.
>>>>
>>>> Then I replaced with a 4th samsung 980 pro, but this time instead of
>>>> using the M.2 socket I used a pcie-m.2 adapter (in case something was
>>>> wrong with the m.2 socket). Also added a surge protector outlet for good
>>>> measure. Reinstalled. Watched the smartctl. No errors. Temperature was
>>>> always low.
>>>>
>>>> Now it's failed again, exactly same symptoms.
>>>>
>>>> Any ideas?
>>>>
>>>
>>> I remember your other email about a month or so ago and thought it was
>>> really strange. Have you tried the drives in another system to confirm
>>> they're truly dead?
>>>
>>> I would check for BIOS updates just for good measure. Other than that,
>>> have you had any communication with Titan about it?
>>>
>>> Thanks,
>>> Richard
>>> _______________________________________________
>>> users mailing list -- users@lists.fedoraproject.org
>>> To unsubscribe send an email to users-le...@lists.fedoraproject.org
>>> Fedora Code of Conduct:
>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives:
>>> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
>>> Do not reply to spam on the list, report it:
>>> https://pagure.io/fedora-infrastructure
>>>
>>
>>
>> --
>> *Those who don't understand recursion are doomed to repeat it*
>>
>
>
> --
> George N. White III
>

--
*Those who don't understand recursion are doomed to repeat it*
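One way to tell the two theories in this thread apart (drive dropping off the bus vs. something erasing the partition table) is to look at what /proc/partitions still lists once the symptoms appear. The following is only a rough sketch, not something anyone in the thread ran: the function name classify_nvme_state and its three return labels are made up for illustration, and it assumes the usual four-column /proc/partitions layout and the nvmeXnY / nvmeXnYpZ naming. Note that the kernel may keep listing partitions it read earlier until the table is re-read, so "device-with-partitions" does not prove the on-disk table is intact.

```python
import re
from pathlib import Path


def classify_nvme_state(partitions_text: str) -> str:
    """Classify NVMe visibility from the text of /proc/partitions.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    names = []
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        # The header row is skipped because "major" is not a number.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])

    nvme = [name for name in names if name.startswith("nvme")]
    if not nvme:
        # Matches the symptom reported in the thread: no nvme entry at all.
        return "no-device"

    # Whole disks look like nvme0n1; partitions look like nvme0n1p1.
    has_partition = any(re.fullmatch(r"nvme\d+n\d+p\d+", name) for name in nvme)
    return "device-with-partitions" if has_partition else "device-without-partitions"


if __name__ == "__main__":
    proc_partitions = Path("/proc/partitions")
    if proc_partitions.exists():
        print(classify_nvme_state(proc_partitions.read_text()))
```

If this reports "no-device", as the original symptom description suggests it would, the kernel no longer sees the drive at all, which points more toward hardware (power, slot, controller) than toward a wiped partition table. A wiped table would more likely leave the whole-disk entry in place.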