Well, I suspected the power supply, but the guy I spoke with at Titan said
other things would fail before the SSD if that was the problem.

The power should be pretty stable, and I did connect to a good transient
suppressor strip.  Anyway there was no lightning when it died, which was in
the past 24 hours.  I had been watching smartmon every few days and it
showed no error and temps <37C.
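
Since the symptom each time was that the nvme device vanished from
/proc/partitions, here is a minimal sketch of a check that could run
periodically alongside smartmon while the system is still healthy.  The
function name nvme_devices is just illustrative, and it assumes the usual
four-column /proc/partitions layout (major, minor, #blocks, name).

```python
from pathlib import Path


def nvme_devices(partitions_text: str) -> list[str]:
    """Return names of NVMe block devices listed in /proc/partitions text."""
    names = []
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four columns: major, minor, #blocks, name.
        # The header row is skipped because its first field is not a number.
        if len(fields) == 4 and fields[0].isdigit() and fields[3].startswith("nvme"):
            names.append(fields[3])
    return names


if __name__ == "__main__":
    path = Path("/proc/partitions")
    found = nvme_devices(path.read_text()) if path.exists() else []
    print("NVMe devices visible:", ", ".join(found) if found else "none")
```

Once the root device is gone a new process may not even start, so this is
only useful if something already running reports the result to a place that
does not live on the failing drive.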

Titan has suggested installing a SATA SSD (eliminating the m.2) and I'm going
to try that.  He suggested it might be a software issue, e.g. that something
might be erasing the partition table on the drive (I don't have another
machine handy to verify this), but this seems really unlikely.  I just
installed F35 and a moderate set of scientific packages, no proprietary
software.  The only access is via ssh inside a VPN, and I have the only
account.

On Tue, Feb 22, 2022 at 10:47 AM George N. White III <gnw...@gmail.com>
wrote:

> On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbeck...@gmail.com> wrote:
>
>> Thanks Richard.  Yes, I talked with Titan; they suggested trying the
>> pcie-m.2 adapter.  I will try them again.
>> I have not checked for bios updates.  Not sure how to go about that (last
>> time I did that it required an msdos floppy disc).
>>
>> Haven't tried the SSDs in another device because I don't have one.  But
>> the fact that replacing the SSD causes it to work, where it wasn't working
>> before, tells me they were damaged.  I have power-cycled the workstation
>> at least once, and the BIOS did not find any SSD to boot from.  So a power
>> cycle didn't fix it, but replacing the SSD did.
>>
>> I will try Titan again later today, but just looking for ideas.
>>
>
> With this history, I'd probably replace the workstation power supply.   I
> would also scan the system board for capacitors with bulging tops or
> overheated components.
>
> Are there any externally powered devices connected to the workstation
> (other than the monitor)?
>
> Are you in an area with frequent lightning storms?  How stable is your
> power?  Is the system
> connected to a UPS?
>
> I had a similar experience with spinning disks in a system that contained
> a drive-bay radio receiver
> and was connected to a satellite dish and GPS receiver on the roof, and an
> antenna controller.  Everything
> was powered by a high quality UPS.  I added a heavy wire connecting the
> antenna controller case to the
> workstation case and the failures stopped.
>
> I gather you now have space for two m.2 SSDs.   If you haven't discarded
> the non-working devices, it would be interesting to see if any are detected
> and what smartmontools says about them, but you also have the option to put
> /var on a separate drive.  Smartmontools can monitor a drive and report any
> problems it detects, but you may also want to run self-tests periodically.
>
>
>
>>
>> Thanks,
>> Neal
>>
>> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1...@gmail.com>
>> wrote:
>>
>>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbeck...@gmail.com> wrote:
>>>
>>>> I know this is a bit OT, but you guys are great at answering all
>>>> questions.
>>>>
>>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC
>>>> cpu).  After about 1 year it stopped working.  I could ssh to it, and
>>>> almost any command would return Input/Output error.  Unfortunately
>>>> journalctl gave input/output error so I can't see logs.  cat
>>>> /proc/partitions did not show any nvme device (the root device) on which
>>>> the OS was installed.
>>>>
>>>> I replaced the SSD with a samsung 980 pro.  Reinstalled fedora.  It
>>>> then worked a few weeks, then the exact same symptoms.
>>>>
>>>> I replaced the SSD with another samsung 980 pro, this time with
>>>> heatsink.  Reinstalled fedora.  It worked a few weeks.  Then same symptoms.
>>>>
>>>> Then I replaced with a 4th samsung 980 pro, but this time instead of
>>>> using the M.2 socket I used a pcie-m.2 adapter (in case something was wrong
>>>> with the m.2 socket).  Also added a surge protector outlet for good
>>>> measure. Reinstalled.  Watched the smartctl.  No errors.  Temperature was
>>>> always low.
>>>>
>>>> Now it's failed again, exactly same symptoms.
>>>>
>>>> Any ideas?
>>>>
>>>
>>> I remember your other email about a month or so ago and thought it was
>>> really strange. Have you tried the drives in another system to confirm
>>> they're truly dead?
>>>
>>> I would check for BIOS updates just for good measure. Other than that,
>>> have you had any communication with Titan about it?
>>>
>>> Thanks,
>>> Richard
>>> _______________________________________________
>>> users mailing list -- users@lists.fedoraproject.org
>>> To unsubscribe send an email to users-le...@lists.fedoraproject.org
>>> Fedora Code of Conduct:
>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives:
>>> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
>>> Do not reply to spam on the list, report it:
>>> https://pagure.io/fedora-infrastructure
>>>
>>
>>
>> --
>> *Those who don't understand recursion are doomed to repeat it*
>>
>
>
> --
> George N. White III
>
>


-- 
*Those who don't understand recursion are doomed to repeat it*