On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbeck...@gmail.com> wrote:

> Thanks Richard.  Yes, I talked with Titan; they suggested trying the
> pcie-m.2 adapter.  I will try them again.
> I have not checked for bios updates.  Not sure how to go about that (last
> time I did that it required an msdos floppy disc).
>
> Haven't tried the SSDs in another device because I don't have one.  But
> the fact that replacing the SSD makes the system work again, when it wasn't
> working before, tells me they were damaged.  I have power-cycled the
> workstation at least once, and the BIOS did not find any SSD to boot from.
> So a power cycle didn't fix it, but replacing the SSD did.
>
> I will try Titan again later today, but just looking for ideas.
>

With this history, I'd probably replace the workstation power supply.  I
would also scan the system board for capacitors with bulging tops or
overheated components.

Are there any externally powered devices connected to the workstation
(other than the monitor)?

Are you in an area with frequent lightning storms?  How stable is your
power?  Is the system connected to a UPS?

I had a similar experience with spinning disks in a system that contained a
drive-bay radio receiver and was connected to a satellite dish and GPS
receiver on the roof, and to an antenna controller.  Everything was powered
by a high-quality UPS.  I added a heavy wire connecting the antenna
controller case to the workstation case and the failures stopped.

I gather you now have space for two M.2 SSDs.  If you haven't discarded
the non-working devices, it would be interesting to see whether any are
detected and what smartmontools says about them.  You also have the option
to put /var on a separate drive.  smartmontools can monitor a drive and
report any problems it detects, but you may also want to run self-tests
periodically.
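
As a rough illustration of the kind of check I mean, here is a minimal
Python sketch that reads the JSON report from "smartctl -j -a" and flags a
few NVMe health fields.  The device path /dev/nvme0, the function names and
the choice of fields are my assumptions, not anything specific to your
system, and the JSON key names are the ones I recall from smartmontools 7.x,
so check them against the output on your own machine.

```python
import json
import shutil
import subprocess

DEVICE = "/dev/nvme0"  # assumed device path; adjust for your system


def summarize_nvme_health(report: dict) -> list:
    """Return a list of warning strings found in one smartctl JSON report."""
    warnings = []

    # Overall health verdict as reported by the drive itself.
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART health check failed")

    # NVMe health log; key names assumed from smartmontools 7.x JSON output.
    log = report.get("nvme_smart_health_information_log", {})
    if log.get("critical_warning", 0) != 0:
        warnings.append("critical_warning flags set: %d" % log["critical_warning"])
    if log.get("media_errors", 0) > 0:
        warnings.append("media errors: %d" % log["media_errors"])
    if log.get("percentage_used", 0) >= 100:
        warnings.append("rated endurance used up")
    return warnings


def read_smartctl_json(device: str) -> dict:
    """Run 'smartctl -j -a DEVICE' (usually needs root) and parse the JSON."""
    # smartctl's exit status is a bitmask, so a non-zero value is not
    # treated as a failure of the command here.
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    if shutil.which("smartctl") is None:
        print("smartctl not found; install smartmontools first")
    else:
        try:
            problems = summarize_nvme_health(read_smartctl_json(DEVICE))
        except (OSError, json.JSONDecodeError) as exc:
            print("could not read a report from %s: %s" % (DEVICE, exc))
        else:
            print("\n".join(problems) if problems else "no warnings reported")
```

Something like this could be run from cron or a systemd timer, although the
smartd daemon that ships with smartmontools already covers routine
monitoring.  A short self-test can be started by hand with
"smartctl -t short" on the device, if the drive and your smartmontools
version support it.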



>
> Thanks,
> Neal
>
> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1...@gmail.com> wrote:
>
>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbeck...@gmail.com> wrote:
>>
>>> I know this is a bit OT, but you guys are great at answering all
>>> questions.
>>>
>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC
>>> cpu).  After about 1 year it stopped working.  I could ssh to it, and
>>> almost any command would return Input/Output error.  Unfortunately
>>> journalctl gave input/output error so I can't see logs.  cat
>>> /proc/partitions did not show any nvme device (the root device) on which
>>> the OS was installed.
>>>
>>> I replaced the SSD with a samsung 980 pro.  Reinstalled fedora.  It then
>>> worked a few weeks, then the exact same symptoms.
>>>
>>> I replaced the SSD with another samsung 980 pro, this time with
>>> heatsink.  Reinstalled fedora.  It worked a few weeks.  Then same symptoms.
>>>
>>> Then I replaced it with a 4th Samsung 980 Pro, but this time instead of
>>> using the M.2 socket I used a PCIe-to-M.2 adapter (in case something was
>>> wrong with the M.2 socket).  Also added a surge protector outlet for good
>>> measure.  Reinstalled.  Watched the smartctl output.  No errors.
>>> Temperature was always low.
>>>
>>> Now it's failed again, exactly same symptoms.
>>>
>>> Any ideas?
>>>
>>
>> I remember your other email about a month or so ago and thought it was
>> really strange. Have you tried the drives in another system to confirm
>> they're truly dead?
>>
>> I would check for BIOS updates just for good measure. Other than that,
>> have you had any communication with Titan about it?
>>
>> Thanks,
>> Richard
>> _______________________________________________
>> users mailing list -- users@lists.fedoraproject.org
>> To unsubscribe send an email to users-le...@lists.fedoraproject.org
>> Fedora Code of Conduct:
>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives:
>> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
>> Do not reply to spam on the list, report it:
>> https://pagure.io/fedora-infrastructure
>>
>
>
> --
> *Those who don't understand recursion are doomed to repeat it*
>


-- 
George N. White III
