I replaced the 2 Crucial BX drives I was losing with an older, smaller
MX drive, and that one has been working just fine for a couple of
months.  Thinking about my issue and Neal's issue, here is what springs
to mind.

So in my case, if mine was a power supply issue, it would have to be
that something about the new SSDs is excessively sensitive to power or
ground loops.  The thought of a power supply/SATA issue burning the
device did occur to me.  And the failure I have is heavily reported in
the 1-star reviews for the Crucial device, with several people having
more than 1 failure and returning the device for a refund.  The people
who have the failure seem to be able to repeat it, and I assume the
drives work just fine for others.  So it would seem that some
component used in recent SSDs may be super sensitive to something,
either on the power supply side or the SATA port side, or the design
has an internal grounding issue and is sensitive to ground loops in a
way that the older devices are not (I have 2 older SSDs and 8 hard
drives that have been running in said machine for months to years just
fine).  I would think that an NVMe device would be well grounded to
the motherboard/case.  In my case my SSDs were in a plastic drive
holder, so the only ground would have been via the SATA connection and
the power supply.  If the drive design has components expecting a
screw-hole ground that won't exist in some cases, they could see
floating voltages, and that might damage something.

How was your NVMe drive mounted in your case?  On mine the normal
screw holes were not connected to ground (plastic drive case), so the
"chassis" of the drive would not have been externally grounded.  Had
said drive chassis not had a direct connection to the power or SATA
ground, that could have ended up with floating voltages on the drive
chassis and any components tied to it internally.

And ground loops are tricky.  I have a wind meter on my roof hooked to
a device that counts its rotations, and that serial port device would
randomly stop working, requiring a reset of the usb-to-serial
communication to get it to function again (I had a cron job to
reload/reset the usb nightly because it was happening often enough).
I guessed it was a ground loop, so years ago I ran a ground wire to
house ground and grounded the hw device doing the counting, and that
solved the issue.
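
For anyone wanting to do the same kind of nightly reset, here is a
minimal Python sketch of one way to do it, not the exact job I ran.  It
assumes a Linux box, root privileges, and the standard sysfs
unbind/bind files under /sys/bus/usb/drivers/usb.  The function name
and the bus id "1-1.2" are made up for illustration; the real id for
the adapter would have to be looked up under /sys/bus/usb/devices.

```python
import time
from pathlib import Path

# Standard sysfs location of the generic USB device driver on Linux.
USB_DRIVER_DIR = Path("/sys/bus/usb/drivers/usb")


def rebind_usb_device(bus_id: str,
                      driver_dir: Path = USB_DRIVER_DIR,
                      settle_seconds: float = 1.0) -> None:
    """Unbind and rebind one USB device, roughly like replugging it.

    bus_id is the sysfs name of the device (hypothetical example: "1-1.2").
    """
    # Detach the device from the USB driver.
    (driver_dir / "unbind").write_text(bus_id)
    # Give the kernel a moment before attaching it again.
    time.sleep(settle_seconds)
    # Reattach it; the usb-to-serial driver should then probe it afresh.
    (driver_dir / "bind").write_text(bus_id)


if __name__ == "__main__":
    # Placeholder bus id, an assumption for this sketch; needs root.
    rebind_usb_device("1-1.2")
```

Run from root's crontab once a night, this would approximate the
workaround.  It only hides the symptom, of course; the ground wire was
the actual fix.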

On Tue, Feb 22, 2022 at 9:47 AM George N. White III <gnw...@gmail.com> wrote:
>
> On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbeck...@gmail.com> wrote:
>>
>> Thanks Richard.  Yes, I talked with Titan; they suggested trying the 
>> pcie-m.2 adapter.  I will try them again.
>> I have not checked for bios updates.  Not sure how to go about that (last 
>> time I did that it required an msdos floppy disc).
>>
>> Haven't tried the SSDs in another device because I don't have one.  But the
>> fact that replacing the SSD causes it to work, where it wasn't working
>> before, tells me they were damaged.  I have at least once powered the
>> workstation off and on, and the bios did not find any ssd to boot from.  So
>> a power cycle didn't fix it, but replacing the ssd did.
>>
>> I will try Titan again later today, but just looking for ideas.
>
>
> With this history, I'd probably replace the workstation power supply.  I
> would also scan the system board for capacitors with bulging tops or
> overheated components.
>
> Are there any externally powered devices connected to the workstation (other 
> than the monitor)?
>
> Are you in an area with frequent lightning storms?  How stable is your power?
> Is the system connected to a UPS?
>
> I had a similar experience with spinning disks in a system that contained a 
> drive-bay radio receiver
> and was connected to a satellite dish and GPS receiver on the roof, and an 
> antenna controller.  Everything
> was powered by a high quality UPS.  I added a heavy wire connecting the 
> antenna controller case to the
> workstation case and the failures stopped.
>
> I gather you now have space for two M.2 SSDs.  If you haven't discarded the
> non-working devices, it would be interesting to see if any are detected and
> what smartmontools says about them, but you also have the option to put /var
> on a separate drive.  Smartmontools can monitor a drive and report any
> problems it detects, but you may also want to run self-tests periodically.
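
On the smartmontools point, a minimal Python sketch of a periodic
health check.  It assumes smartmontools 7.0 or later (for the -j JSON
output) and that the report carries the overall result under
smart_status.passed, which is what I would expect but have not checked
on every drive type.  The function names are mine, and /dev/nvme0 is
only a placeholder device path.

```python
import json
import subprocess


def read_smart_report(device: str) -> dict:
    """Run smartctl on one device and return its JSON report as a dict."""
    # -H: overall health, -A: attributes, -j: JSON output.
    # smartctl's exit status is a bitmask of conditions, so a non-zero
    # status is not treated as a failure to run here.
    proc = subprocess.run(
        ["smartctl", "-H", "-A", "-j", device],
        capture_output=True, text=True,
    )
    return json.loads(proc.stdout)


def smart_passed(report: dict):
    """Return True or False for the overall health check, None if absent."""
    return report.get("smart_status", {}).get("passed")


if __name__ == "__main__":
    # Placeholder device path, an assumption for this sketch; needs root.
    result = smart_passed(read_smart_report("/dev/nvme0"))
    print("overall health passed:", result)
```

A job like this only reports what the drive admits to.  A drive that
drops off the bus entirely, as in this thread, would more likely show
up as smartctl failing to open the device at all.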
>
>
>>
>>
>> Thanks,
>> Neal
>>
>> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1...@gmail.com> wrote:
>>>
>>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbeck...@gmail.com> wrote:
>>>>
>>>> I know this is a bit OT, but you guys are great at answering all questions.
>>>>
>>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC cpu). 
>>>>  After about 1 year it stopped working.  I could ssh to it, and almost any 
>>>> command would return Input/Output error.  Unfortunately journalctl gave 
>>>> input/output error so I can't see logs.  cat /proc/partitions did not show 
>>>> any nvme device (the root device) on which the OS was installed.
>>>>
>>>> I replaced the SSD with a samsung 980 pro.  Reinstalled fedora.  It then 
>>>> worked a few weeks, then the exact same symptoms.
>>>>
>>>> I replaced the SSD with another samsung 980 pro, this time with heatsink.  
>>>> Reinstalled fedora.  It worked a few weeks.  Then same symptoms.
>>>>
>>>> Then I replaced with a 4th samsung 980 pro, but this time instead of using 
>>>> the M.2 socket I used a pcie-m.2 adapter (in case something was wrong with 
>>>> the m.2 socket).  Also added a surge protector outlet for good measure. 
>>>> Reinstalled.  Watched the smartctl.  No errors.  Temperature was always 
>>>> low.
>>>>
>>>> Now it's failed again, exactly same symptoms.
>>>>
>>>> Any ideas?
>>>
>>>
>>> I remember your other email about a month or so ago and thought it was 
>>> really strange. Have you tried the drives in another system to confirm 
>>> they're truly dead?
>>>
>>> I would check for BIOS updates just for good measure. Other than that, have 
>>> you had any communication with Titan about it?
>>>
>>> Thanks,
>>> Richard
>>> _______________________________________________
>>> users mailing list -- users@lists.fedoraproject.org
>>> To unsubscribe send an email to users-le...@lists.fedoraproject.org
>>> Fedora Code of Conduct: 
>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives: 
>>> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
>>> Do not reply to spam on the list, report it: 
>>> https://pagure.io/fedora-infrastructure
>>
>>
>>
>> --
>> Those who don't understand recursion are doomed to repeat it
>
>
>
> --
> George N. White III
>