On Wed, 13 Jan 2021 at 05:41, Sreyan Chakravarty <sreya...@gmail.com> wrote:

> On Tue, Jan 12, 2021 at 9:16 AM Chris Murphy <li...@colorremedies.com>
> wrote:
> >
> >
> > -x has more information that might be relevant including firmware
> > revision and some additional logs for recent drive reported errors
> > which usually are benign. But might be clues.
> >
> > These two attributes I'm not familiar with
> > 187 Reported_Uncorrect      0x0032   100   096   000    Old_age   Always       -       4294967301
> > 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       98785820672
> >
> > But the value is well above threshold for both so I'm not worried about it.
> >
> >
>
> Here is the output of:
>
> # smartctl -Ax /dev/sda
>
> https://pastebin.com/raw/GrgrQrSf
>
> I have no idea what it means.
>

You are not alone. Most people stop reading at the line:

SMART overall-health self-assessment test result: PASSED
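
The two large raw values quoted above (attributes 187 and 188) are easier
to read when split into 16-bit words. This is only a minimal sketch: it
assumes the drive firmware packs several 16-bit counters into the 48-bit
raw field, as some vendors do. That assumption is not confirmed for this
drive, and the helper name split_raw_words is made up for illustration.

```python
def split_raw_words(raw: int) -> tuple[int, int, int]:
    """Split a 48-bit SMART raw value into three 16-bit words (high, mid, low).

    ASSUMPTION: the firmware packs separate counters into the raw field.
    Check the vendor's documentation before trusting this reading.
    """
    if not 0 <= raw < 1 << 48:
        raise ValueError("SMART raw values are 48 bits wide")
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# Raw values copied from the smartctl output quoted earlier in this thread.
for name, raw in (
    ("Reported_Uncorrect", 4294967301),
    ("Command_Timeout", 98785820672),
):
    print(name, hex(raw), split_raw_words(raw))
```

Read this way, the values look like a few small counters rather than one
enormous count, but what each word means (if anything) is up to the vendor.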

Before retiring I worked in remote sensing, which is a data-intensive
activity. HDD failures were a major issue. One sure way to kill a
drive was to start a batch job that filled a disk and then kept hammering
the drive over a long weekend when I was off somewhere without network
access. I could usually get warranty replacements for failed drives by
submitting the smartctl reports. We used XFS, first on SGI IRIX and
then on Linux when it became available, with striped arrays for
throughput with I/O-bound processes. XFS was designed to avoid lengthy
filesystem repair times, so getting a system back after a drive failure
just meant waiting for the tape robot to find and restore the backup tapes.

HDDs are mechanical, so they are subject to wear. With heavy use they tend
to die shortly after end-of-warranty. I started replacing drives at
end-of-warranty, which, along with measures to reduce runaway batch jobs,
greatly reduced the number of failures. Your drive has been used for 1671
hours and 1491 power-on cycles. Mechanical device wear is often highest at
startup, so this is probably getting close to the design lifetime of a
consumer laptop HDD.
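
To put those two numbers side by side, here is a small sketch of the
arithmetic, using only the hours and cycle counts from your report. The
function name is made up for illustration.

```python
def hours_per_power_cycle(power_on_hours: int, power_cycles: int) -> float:
    """Average running time per power-on cycle, in hours."""
    if power_cycles <= 0:
        raise ValueError("power_cycles must be positive")
    return power_on_hours / power_cycles


# Figures taken from the smartctl report discussed in this thread.
avg = hours_per_power_cycle(1671, 1491)
print(f"{avg:.2f} hours per power cycle")
```

That works out to a little over an hour of running time per start, so the
drive has seen many startups relative to its total running time.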

There are workloads (image processing, numerical modelling) where recovering
the work done since the last backup just means restarting a batch job and is
generally easier than trying to repair a filesystem with a bunch of
partially written HDF5 files.

Given the age of your HDD, I would replace it. If your laptop came with
Windows, you should be able to install Windows 10 on a small partition in
order to upgrade the BIOS and maybe run the drive vendor's diagnostics. You
may want to revisit your choices of drive technology, filesystem, backup and
recovery strategy, etc. with your use case in mind.


> This is the problem with SMART tests, they are so esoteric that it is
> difficult for a common user to make sense of it.
>
> Let me know what you think, if you see any glaring faults.
>
>
You are to be commended for helping the btrfs developers investigate one of
the rare situations that make filesystems such a hard problem. My experience
indicates your HDD is involved, either by old age or some BIOS or drive
firmware glitch, so your best way forward is to make sure your BIOS is
current and replace the drive with one suited to your use case.


-- 
George N. White III