On 2024-07-26 8:25 a.m., Richard Shaw wrote:
On Thu, Jul 25, 2024 at 6:29 PM Jeffrey Walton <noloa...@gmail.com> wrote:

    On Thu, Jul 25, 2024 at 2:15 PM Richard Shaw
    <hobbes1...@gmail.com> wrote:
    >
    > I recently had the Fedora install on my laptop go sideways
    (Ryzen 5 4500U w/ nvme disk).
    >
    > The filesystem was going readonly so I installed System Rescue
    CD to a thumb drive to investigate. Sure enough I had 4
    unrecoverable errors.
    >
    > I don't keep anything critical on it so I decided to just
    reinstall with Fedora 40. Installation went fine, but I did notice
    weird dnf output on my first update, though everything SEEMED fine...
    >
    > I rebooted after the update and tried to log in, but after a
    minute or two the system froze. I rebooted again and, sure enough,
    `dmesg | grep BTRFS` showed an error.
    >
    > Booting System Rescue CD again, neither `btrfs check
    --check-data-csum` nor, after mounting, `btrfs scrub` showed any errors.
    >
    > So who's right? And if there is an error, what's causing it?
    I've checked the drive with smartctl and even let the factory HP
    firmware diag tools run in a loop overnight checking everything
    without error.

    The (1) irrecoverable disk errors from the original install, and (2)
    the errors from the current install, and (3) the errors from dnf
    indicate (to me) you have a failed NVMe drive. I used to see the
    symptoms all the time when using SDcards in ARM dev boards. I would
    put a swap file on the dev board (due to lack of resources), and the
    drives would fail within about 6 months with the symptoms you
    describe.

    Now the interesting part (to me) is, (4) lack of errors reported by
    some tools. That indicates to me a Chinese drive that misreports drive
    size and statistics. They usually show up as thumb drives, but I
    ran into one sold as an SSD years ago. Also see
    <https://www.google.com/search?q=counterfeit+drive+misreport+size>.

    All in all, I would replace the NVMe drive with a new one from a
    trusted source. Not Amazon or eBay.


It's the drive that came with the laptop, so it's unlikely to be a cheap/phony drive, but the mystery does get deeper...

1. I was able to see the same results even when I booted from an F40 Live USB. I'm thinking that the system caught the problem quickly enough that the error didn't actually get written to the disk.

2. I consistently see the problem at about 30 seconds (from dmesg) if I boot the 6.9.9 or 6.9.10 kernels that were installed via updates. If I boot 6.8.5, the kernel that shipped with F40, I can't reproduce the problem.

Of course that's strange because if this was a widespread issue there would be tons of people complaining.
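
For anyone who wants to compare the kernels side by side, here is a minimal Python sketch that pulls the BTRFS error and warning lines, with their seconds-since-boot timestamps, out of saved dmesg output. It assumes the default dmesg prefix of the form `[   31.204512] ` and that btrfs messages begin with `BTRFS` followed by a severity word; the function name `btrfs_problems` is made up for illustration and nothing here comes from the btrfs tools themselves.

```python
import re

# Assumes the default dmesg prefix: "[   31.204512] message text"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Severity words to keep; "info" lines are deliberately skipped.
SEVERITIES = ("error", "warning", "critical")


def btrfs_problems(dmesg_text):
    """Return (seconds_since_boot, message) for BTRFS error/warning lines."""
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # line has no timestamp prefix in the assumed format
        message = match.group(2)
        lowered = message.lower()
        if message.startswith("BTRFS") and any(s in lowered for s in SEVERITIES):
            hits.append((float(match.group(1)), message))
    return hits
```

Saving the output of `dmesg` to a file after booting each kernel and running both files through this function would show whether the first reported problem really lands at the same point after boot on 6.9.9 and 6.9.10, and whether 6.8.5 stays clean.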

Odds are that you have bad RAM or are running the processor clock higher than it can handle. I also had this kind of issue when I had a bad video card, but the system generally froze or crashed and left the drive in an unrecoverable state. The tools for fixing a btrfs partition are generally lacking in Fedora, and the tools that come with btrfs are also useless when the failing partition is your active root partition. I don't know if SUSE has better tools, but it's a huge problem for Fedora recoverability.

So, question for the Fedora filesystem team: When are we going to see fixed btrfs tools?

--
John Mellor john.mel...@gmail.com     519-721-6671