Thanks, but are there two bad drives? Sorry, I am confused. The disk is 
likely no longer in warranty. The drive with /home is new (I think), and so 
is the /mnt/backup drive (which holds an rsync-based backup that I keep so 
that I can actually browse these files, and which also serves as a more 
reliable copy). Beyond those, I have a / drive that is a smaller SSD. That 
one also used to be raided, but the other / drive died and I never got 
around to replacing it.

So, my question is: is it only the RAID drive /dev/sda that is bad, or is 
there something else that you can see in the report?

Many thanks, and best wishes,
Ranjan


On Fri Aug18'23 02:58:30PM, Roger Heflin wrote:
> From: Roger Heflin <rogerhef...@gmail.com>
> Date: Fri, 18 Aug 2023 14:58:30 -0500
> To: Community support for Fedora users <users@lists.fedoraproject.org>
> Reply-To: Community support for Fedora users <users@lists.fedoraproject.org>
> Subject: Re: slowness with kernel 6.4.10 and software raid
>
> OK.  You have around 4000 sectors that are bad and have been reallocated.
>
> You have around 1000 that are offline uncorrectable (reads failed).
>
> And you have a desktop drive with a bad-sector timeout of who knows
> exactly what.  I would guess at least 30 seconds; it could be higher,
> but it must be lower than the SCSI timeout of the device.
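>
> (A side note: since this Barracuda does not support scterc at all, per
> the output below, the other knob you can turn is the kernel's per-device
> command timeout, so the kernel does not give up and reset the drive while
> the drive is still internally retrying.  A rough sketch, with 180 seconds
> as just an example value:
>
>   # show the current SCSI command timeout, in seconds (default is 30)
>   cat /sys/block/sda/device/timeout
>   # raise it above the drive's worst-case internal retry time
>   echo 180 > /sys/block/sda/device/timeout
>
> This does not survive a reboot, so it would need re-applying at boot.)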
>
> Given the power-on hours, the disk is out of warranty (I think).  If
> the disk were still in warranty, you could get the vendor to replace it.
>
> So whatever that timeout is, when you hit a single bad sector the disk
> will keep re-reading it for that long, then report that the sector
> cannot be read; mdraid will then read it from the other mirror and
> re-write it.
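>
> (If the check that is already running never cleans everything up, md
> can also be told to scrub the whole array by hand; a sketch, assuming
> the usual sysfs knobs behave the same for your IMSM container:
>
>   # see what the array is doing now (idle, check, repair, ...)
>   cat /sys/block/md126/md/sync_action
>   # walk the whole array, rewriting unreadable sectors from the mirror
>   echo repair > /sys/block/md126/md/sync_action
>
> Progress shows up in /proc/mdstat as usual.)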
>
> This disk could eventually fail to read every bad sector, and mdraid
> could re-write them all, and that may fix it.  Or it could fix some of
> them on this pass, some more on the next pass, and never fix all of
> them, because sda simply sucks.
>
> The best idea would be to buy a new disk, but this time do not buy a
> desktop drive or an SMR drive.  There is a webpage someplace that
> lists which disks are not SMR disks, and other webpages list which
> disks have a settable timeout (WD Red Plus and/or Seagate IronWolf,
> and likely others).
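>
> (On a disk that does support it, the timeout is set with smartctl; the
> values are in tenths of a second, so for example a 7-second limit for
> both reads and writes would be:
>
>   smartctl -l scterc,70,70 /dev/sdX
>
> On many drives the setting resets at power-off, so it has to be
> re-applied at each boot.)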
>
> Suitable disks will likely be classified as enterprise and/or NAS
> disks, but whatever you look at, check the vendor's list to see
> whether the disk is SMR or not.  Note that WD Red is SMR while WD Red
> Plus is not, and SMR sometimes does not play nice with RAID.
>
> On Fri, Aug 18, 2023 at 2:05 PM Ranjan Maitra <mlmai...@gmx.com> wrote:
> >
> > On Fri Aug18'23 01:39:08PM, Roger Heflin wrote:
> > > From: Roger Heflin <rogerhef...@gmail.com>
> > > Date: Fri, 18 Aug 2023 13:39:08 -0500
> > > To: Community support for Fedora users <users@lists.fedoraproject.org>
> > > Reply-To: Community support for Fedora users 
> > > <users@lists.fedoraproject.org>
> > > Subject: Re: slowness with kernel 6.4.10 and software raid
> > >
> > > The above makes it very clear what is happening.  What kind of disks
> > > are these?  And did you set the scterc timeout?  You can see it via
> > > "smartctl -l scterc /dev/sda", and then repeat on the other disk.
> > >
> > > Setting the timeout as low as you can will improve this situation
> > > somewhat, but it appears that sda has a number of bad sectors on it.
> > >
> > > A full output of "smartctl --xall /dev/sda" would also be useful, to
> > > see how bad it is.
> > >
> > > Short answer is you probably need a new device for sda.
> > >
> >
> > Thanks!
> >
> > I tried:
> >
> > # smartctl -l scterc /dev/sda
> >  smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.4.10-200.fc38.x86_64] (local 
> > build)
> >  Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > SCT Error Recovery Control command not supported
> >
> > # smartctl --xall /dev/sda
> >
> >   smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.4.10-200.fc38.x86_64] 
> > (local build)
> >   Copyright (C) 2002-23, Bruce Allen, Christian Franke, 
> > www.smartmontools.org
> >
> >   === START OF INFORMATION SECTION ===
> >   Model Family:     Seagate Barracuda 7200.14 (AF)
> >   Device Model:     ST2000DM001-1ER164
> >   Serial Number:    Z4Z5F3LE
> >   LU WWN Device Id: 5 000c50 091167f04
> >   Firmware Version: CC27
> >   User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> >   Sector Sizes:     512 bytes logical, 4096 bytes physical
> >   Rotation Rate:    7200 rpm
> >   Form Factor:      3.5 inches
> >   Device is:        In smartctl database 7.3/5528
> >   ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
> >   SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
> >   Local Time is:    Fri Aug 18 14:01:28 2023 CDT
> >   SMART support is: Available - device has SMART capability.
> >   SMART support is: Enabled
> >   AAM feature is:   Unavailable
> >   APM level is:     128 (minimum power consumption without standby)
> >   Rd look-ahead is: Enabled
> >   Write cache is:   Enabled
> >   DSN feature is:   Unavailable
> >   ATA Security is:  Disabled, NOT FROZEN [SEC1]
> >   Wt Cache Reorder: Unavailable
> >
> >   === START OF READ SMART DATA SECTION ===
> >   SMART overall-health self-assessment test result: PASSED
> >
> >   General SMART Values:
> >   Offline data collection status:  (0x00)       Offline data collection 
> > activity
> >                                         was never started.
> >                                         Auto Offline Data Collection: 
> > Disabled.
> >   Self-test execution status:      (   0)       The previous self-test 
> > routine completed
> >                                         without error or no self-test has 
> > ever
> >                                         been run.
> >   Total time to complete Offline
> >   data collection:              (   80) seconds.
> >   Offline data collection
> >   capabilities:                          (0x73) SMART execute Offline 
> > immediate.
> >                                         Auto Offline data collection on/off 
> > support.
> >                                         Suspend Offline collection upon new
> >                                         command.
> >                                         No Offline surface scan supported.
> >                                         Self-test supported.
> >                                         Conveyance Self-test supported.
> >                                         Selective Self-test supported.
> >   SMART capabilities:            (0x0003)       Saves SMART data before 
> > entering
> >                                         power-saving mode.
> >                                         Supports SMART auto save timer.
> >   Error logging capability:        (0x01)       Error logging supported.
> >                                         General Purpose Logging supported.
> >   Short self-test routine
> >   recommended polling time:      (   1) minutes.
> >   Extended self-test routine
> >   recommended polling time:      ( 212) minutes.
> >   Conveyance self-test routine
> >   recommended polling time:      (   2) minutes.
> >   SCT capabilities:            (0x1085) SCT Status supported.
> >
> >   SMART Attributes Data Structure revision number: 10
> >   Vendor Specific SMART Attributes with Thresholds:
> >   ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >     1 Raw_Read_Error_Rate     POSR--   116   092   006    -    106200704
> >     3 Spin_Up_Time            PO----   096   096   000    -    0
> >     4 Start_Stop_Count        -O--CK   100   100   020    -    97
> >     5 Reallocated_Sector_Ct   PO--CK   097   097   010    -    3960
> >     7 Seek_Error_Rate         POSR--   084   060   030    -    333268033
> >     9 Power_On_Hours          -O--CK   062   062   000    -    34085
> >    10 Spin_Retry_Count        PO--C-   100   100   097    -    0
> >    12 Power_Cycle_Count       -O--CK   100   100   020    -    96
> >   183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
> >   184 End-to-End_Error        -O--CK   100   100   099    -    0
> >   187 Reported_Uncorrect      -O--CK   001   001   000    -    384
> >   188 Command_Timeout         -O--CK   100   098   000    -    3 71 72
> >   189 High_Fly_Writes         -O-RCK   065   065   000    -    35
> >   190 Airflow_Temperature_Cel -O---K   063   055   045    -    37 (Min/Max 
> > 37/42)
> >   191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> >   192 Power-Off_Retract_Count -O--CK   100   100   000    -    19
> >   193 Load_Cycle_Count        -O--CK   001   001   000    -    294513
> >   194 Temperature_Celsius     -O---K   037   045   000    -    37 (0 18 0 0 
> > 0)
> >   197 Current_Pending_Sector  -O--C-   094   080   000    -    1064
> >   198 Offline_Uncorrectable   ----C-   094   080   000    -    1064
> >   199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> >   240 Head_Flying_Hours       ------   100   253   000    -    
> > 31366h+32m+19.252s
> >   241 Total_LBAs_Written      ------   100   253   000    -    22394883074
> >   242 Total_LBAs_Read         ------   100   253   000    -    258335971674
> >                               ||||||_ K auto-keep
> >                               |||||__ C event count
> >                               ||||___ R error rate
> >                               |||____ S speed/performance
> >                               ||_____ O updated online
> >                               |______ P prefailure warning
> >
> >   General Purpose Log Directory Version 1
> >   SMART           Log Directory Version 1 [multi-sector log support]
> >   Address    Access  R/W   Size  Description
> >   0x00       GPL,SL  R/O      1  Log Directory
> >   0x01           SL  R/O      1  Summary SMART error log
> >   0x02           SL  R/O      5  Comprehensive SMART error log
> >   0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> >   0x06           SL  R/O      1  SMART self-test log
> >   0x07       GPL     R/O      1  Extended self-test log
> >   0x09           SL  R/W      1  Selective self-test log
> >   0x10       GPL     R/O      1  NCQ Command Error log
> >   0x11       GPL     R/O      1  SATA Phy Event Counters log
> >   0x21       GPL     R/O      1  Write stream error log
> >   0x22       GPL     R/O      1  Read stream error log
> >   0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
> >   0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> >   0xa1       GPL,SL  VS      20  Device vendor specific log
> >   0xa2       GPL     VS    4496  Device vendor specific log
> >   0xa8       GPL,SL  VS     129  Device vendor specific log
> >   0xa9       GPL,SL  VS       1  Device vendor specific log
> >   0xab       GPL     VS       1  Device vendor specific log
> >   0xb0       GPL     VS    5176  Device vendor specific log
> >   0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> >   0xc0       GPL,SL  VS       1  Device vendor specific log
> >   0xc1       GPL,SL  VS      10  Device vendor specific log
> >   0xc3       GPL,SL  VS       8  Device vendor specific log
> >   0xe0       GPL,SL  R/W      1  SCT Command/Status
> >   0xe1       GPL,SL  R/W      1  SCT Data Transfer
> >
> >   SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> >   Device Error Count: 384 (device log contains only the most recent 20 
> > errors)
> >         CR     = Command Register
> >         FEATR  = Features Register
> >         COUNT  = Count (was: Sector Count) Register
> >         LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
> >         LH     = LBA High (was: Cylinder High) Register    ]   LBA
> >         LM     = LBA Mid (was: Cylinder Low) Register      ] Register
> >         LL     = LBA Low (was: Sector Number) Register     ]
> >         DV     = Device (was: Device/Head) Register
> >         DC     = Device Control Register
> >         ER     = Error register
> >         ST     = Status register
> >   Powered_Up_Time is measured from power on, and printed as
> >   DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> >   SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> >
> >   Error 384 [3] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b9 20 00 00  Error: UNC at LBA = 0xa312b920 
> > = 2735913248
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b9 20 40 00 16d+06:35:59.162  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b9 18 40 00 16d+06:35:59.154  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b9 10 40 00 16d+06:35:59.154  READ FPDMA 
> > QUEUED
> >     61 00 00 00 08 00 00 a3 12 b9 10 40 00 16d+06:35:59.154  WRITE FPDMA 
> > QUEUED
> >     ef 00 10 00 02 00 00 00 00 00 00 a0 00 16d+06:35:59.154  SET FEATURES 
> > [Enable SATA feature]
> >
> >   Error 383 [2] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b9 10 00 00  Error: UNC at LBA = 0xa312b910 
> > = 2735913232
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b9 10 40 00 16d+06:35:53.336  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b9 08 40 00 16d+06:35:53.335  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b9 00 40 00 16d+06:35:53.335  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 f8 40 00 16d+06:35:53.335  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 f0 40 00 16d+06:35:53.331  READ FPDMA 
> > QUEUED
> >
> >   Error 382 [1] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 e8 00 00  Error: UNC at LBA = 0xa312b8e8 
> > = 2735913192
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 e8 40 00 16d+06:35:49.468  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 e0 40 00 16d+06:35:49.460  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 d8 40 00 16d+06:35:49.460  READ FPDMA 
> > QUEUED
> >     61 00 00 00 08 00 00 a3 12 b8 d8 40 00 16d+06:35:49.460  WRITE FPDMA 
> > QUEUED
> >     ef 00 10 00 02 00 00 00 00 00 00 a0 00 16d+06:35:49.459  SET FEATURES 
> > [Enable SATA feature]
> >
> >   Error 381 [0] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 d8 00 00  Error: UNC at LBA = 0xa312b8d8 
> > = 2735913176
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 d8 40 00 16d+06:35:45.676  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 d0 40 00 16d+06:35:45.673  READ FPDMA 
> > QUEUED
> >     ef 00 10 00 02 00 00 00 00 00 00 a0 00 16d+06:35:45.673  SET FEATURES 
> > [Enable SATA feature]
> >     27 00 00 00 00 00 00 00 00 00 00 e0 00 16d+06:35:45.673  READ NATIVE 
> > MAX ADDRESS EXT [OBS-ACS-3]
> >     ec 00 00 00 00 00 00 00 00 00 00 a0 00 16d+06:35:45.672  IDENTIFY DEVICE
> >
> >   Error 380 [19] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 c8 00 00  Error: UNC at LBA = 0xa312b8c8 
> > = 2735913160
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 c8 40 00 16d+06:35:39.283  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 c0 40 00 16d+06:35:39.282  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 b8 40 00 16d+06:35:39.282  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 b0 40 00 16d+06:35:39.270  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 a8 40 00 16d+06:35:39.270  READ FPDMA 
> > QUEUED
> >
> >   Error 379 [18] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 a8 00 00  Error: UNC at LBA = 0xa312b8a8 
> > = 2735913128
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 a8 40 00 16d+06:35:35.558  READ FPDMA 
> > QUEUED
> >     61 00 00 05 78 00 00 65 ac 20 00 40 00 16d+06:35:35.557  WRITE FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 a0 40 00 16d+06:35:35.540  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 98 40 00 16d+06:35:35.532  READ FPDMA 
> > QUEUED
> >     ef 00 10 00 02 00 00 00 00 00 00 a0 00 16d+06:35:35.532  SET FEATURES 
> > [Enable SATA feature]
> >
> >   Error 378 [17] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 90 00 00  Error: UNC at LBA = 0xa312b890 
> > = 2735913104
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 90 40 00 16d+06:35:31.406  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 88 40 00 16d+06:35:31.406  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 80 40 00 16d+06:35:31.405  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 78 40 00 16d+06:35:31.398  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 70 40 00 16d+06:35:31.397  READ FPDMA 
> > QUEUED
> >
> >   Error 377 [16] occurred at disk power-on lifetime: 34042 hours (1418 days 
> > + 10 hours)
> >     When the command that caused the error occurred, the device was active 
> > or idle.
> >
> >     After command completion occurred, registers were:
> >     ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >     -- -- -- == -- == == == -- -- -- -- --
> >     40 -- 53 00 00 00 00 a3 12 b8 70 00 00  Error: UNC at LBA = 0xa312b870 
> > = 2735913072
> >
> >     Commands leading to the command that caused the error were:
> >     CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >     -- == -- == -- == == == -- -- -- -- --  ---------------  
> > --------------------
> >     60 00 00 00 08 00 00 a3 12 b8 70 40 00 16d+06:35:27.414  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 68 40 00 16d+06:35:27.413  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 60 40 00 16d+06:35:27.402  READ FPDMA 
> > QUEUED
> >     60 00 00 00 08 00 00 a3 12 b8 58 40 00 16d+06:35:27.401  READ FPDMA 
> > QUEUED
> >     61 00 00 00 08 00 00 a3 12 b8 58 40 00 16d+06:35:27.401  WRITE FPDMA 
> > QUEUED
> >
> >   SMART Extended Self-test Log Version: 1 (1 sectors)
> >   Num  Test_Description    Status                  Remaining  
> > LifeTime(hours)  LBA_of_first_error
> >   # 1  Short offline       Completed: read failure       90%     29204      
> >    771754056
> >   # 2  Short offline       Completed without error       00%        19      
> >    -
> >   # 3  Short offline       Completed without error       00%         0      
> >    -
> >
> >   SMART Selective self-test log data structure revision number 1
> >    SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >       1        0        0  Not_testing
> >       2        0        0  Not_testing
> >       3        0        0  Not_testing
> >       4        0        0  Not_testing
> >       5        0        0  Not_testing
> >   Selective self-test flags (0x0):
> >     After scanning selected spans, do NOT read-scan remainder of disk.
> >   If Selective self-test is pending on power-up, resume after 0 minute 
> > delay.
> >
> >   SCT Status Version:                  3
> >   SCT Version (vendor specific):       522 (0x020a)
> >   Device State:                        Active (0)
> >   Current Temperature:                    37 Celsius
> >   Power Cycle Min/Max Temperature:     37/41 Celsius
> >   Lifetime    Min/Max Temperature:     18/45 Celsius
> >   Under/Over Temperature Limit Count:   0/0
> >
> >   SCT Data Table command not supported
> >
> >   SCT Error Recovery Control command not supported
> >
> >   Device Statistics (GP/SMART Log 0x04) not supported
> >
> >   Pending Defects log (GP Log 0x0c) not supported
> >
> >   SATA Phy Event Counters (GP Log 0x11)
> >   ID      Size     Value  Description
> >   0x000a  2          102  Device-to-host register FISes sent due to a 
> > COMRESET
> >   0x0001  2            0  Command failed due to ICRC error
> >   0x0003  2            0  R_ERR response for device-to-host data FIS
> >   0x0004  2            0  R_ERR response for host-to-device data FIS
> >   0x0006  2            0  R_ERR response for device-to-host non-data FIS
> >   0x0007  2            0  R_ERR response for host-to-device non-data FIS
> >
> > Many thanks,
> > Ranjan
> >
> >
> > > On Fri, Aug 18, 2023 at 1:30 PM Ranjan Maitra <mlmai...@gmx.com> wrote:
> > > >
> > > > Thanks, Roger!
> > > >
> > > >
> > > > On Fri Aug18'23 12:23:23PM, Roger Heflin wrote:
> > > > > From: Roger Heflin <rogerhef...@gmail.com>
> > > > > Date: Fri, 18 Aug 2023 12:23:23 -0500
> > > > > To: Community support for Fedora users <users@lists.fedoraproject.org>
> > > > > Reply-To: Community support for Fedora users 
> > > > > <users@lists.fedoraproject.org>
> > > > > Subject: Re: slowness with kernel 6.4.10 and software raid
> > > > >
> > > > > Is it moving at all, or just stopped?  If just stopped: it appears
> > > > > that md126 is using external:/md127 for something, and md127 looks
> > > > > wrong (both disks are spares), but I don't know what md127 should
> > > > > look like in this external-metadata case.
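> > > > >
> > > > > You could dump what md itself thinks of both with mdadm; something
> > > > > like:
> > > > >
> > > > >   mdadm --detail /dev/md126
> > > > >   mdadm --detail /dev/md127
> > > > >   # and the IMSM metadata as recorded on the member disks
> > > > >   mdadm --examine /dev/sda /dev/sdc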
> > > >
> > > > It is moving, slowly. It is a 2 TB drive, but this is weird.
> > > >
> > > > >
> > > > > I would suggest checking the logs with "grep md12[67]
> > > > > /var/log/messages" (and the older rotated messages files, if the
> > > > > reboot was not this week) to see what is going on.
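> > > > >
> > > > > Something like this should cover the rotated files too, and the
> > > > > journal if it is kept persistently:
> > > > >
> > > > >   grep md12[67] /var/log/messages /var/log/messages-*
> > > > >   # kernel messages from the previous boot, if journald keeps them
> > > > >   journalctl -k -b -1 | grep md12[67]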
> > > >
> > > > Good idea! Here is the result from
> > > >
> > > > $ grep md126  /var/log/messages
> > > >
> > > >
> > > >   Aug 14 15:02:30 localhost mdadm[1035]: Rebuild60 event detected on md 
> > > > device /dev/md126
> > > >   Aug 16 14:21:20 localhost kernel: md/raid1:md126: active with 2 out 
> > > > of 2 mirrors
> > > >   Aug 16 14:21:20 localhost kernel: md126: detected capacity change 
> > > > from 0 to 3711741952
> > > >   Aug 16 14:21:20 localhost kernel: md126: p1
> > > >   Aug 16 14:21:23 localhost systemd[1]: Condition check resulted in 
> > > > dev-md126p1.device - /dev/md126p1 being skipped.
> > > >   Aug 16 14:21:28 localhost systemd-fsck[942]: /dev/md126p1: clean, 
> > > > 7345384/115998720 files, 409971205/463967488 blocks
> > > >   Aug 16 14:21:31 localhost kernel: EXT4-fs (md126p1): mounted 
> > > > filesystem 932eb81c-2ab4-4e6e-b093-46e43dbd6c28 r/w with ordered data 
> > > > mode. Quota mode: none.
> > > >   Aug 16 14:21:31 localhost mdadm[1033]: NewArray event detected on md 
> > > > device /dev/md126
> > > >   Aug 16 14:21:31 localhost mdadm[1033]: RebuildStarted event detected 
> > > > on md device /dev/md126
> > > >   Aug 16 14:21:31 localhost kernel: md: data-check of RAID array md126
> > > >   Aug 16 19:33:18 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735900352
> > > >   Aug 16 19:33:22 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735900864
> > > >   Aug 16 19:33:28 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900496 on sda)
> > > >   Aug 16 19:33:36 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900568 on sda)
> > > >   Aug 16 19:33:41 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900576 on sda)
> > > >   Aug 16 19:33:50 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900624 on sda)
> > > >   Aug 16 19:34:00 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900640 on sda)
> > > >   Aug 16 19:34:10 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900688 on sda)
> > > >   Aug 16 19:34:18 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900712 on sda)
> > > >   Aug 16 19:34:28 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900792 on sda)
> > > >   Aug 16 19:34:32 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735900352 to other mirror: sdc
> > > >   Aug 16 19:34:37 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900872 on sda)
> > > >   Aug 16 19:34:45 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900920 on sda)
> > > >   Aug 16 19:34:54 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735900992 on sda)
> > > >   Aug 16 19:34:54 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735900864 to other mirror: sdc
> > > >   Aug 16 19:35:07 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735905704
> > > >   Aug 16 19:35:11 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735905960
> > > >   Aug 16 19:35:18 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735905768 on sda)
> > > >   Aug 16 19:35:19 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735905704 to other mirror: sdc
> > > >   Aug 16 19:35:24 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735906120 on sda)
> > > >   Aug 16 19:35:33 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735906192 on sda)
> > > >   Aug 16 19:35:39 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735906448 on sda)
> > > >   Aug 16 19:35:40 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735905960 to other mirror: sdc
> > > >   Aug 16 19:35:45 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735906472
> > > >   Aug 16 19:35:49 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735906504 on sda)
> > > >   Aug 16 19:35:52 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735906472 to other mirror: sdc
> > > >   Aug 16 19:36:03 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735908008
> > > >   Aug 16 19:36:08 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908232 on sda)
> > > >   Aug 16 19:36:16 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908344 on sda)
> > > >   Aug 16 19:36:21 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908424 on sda)
> > > >   Aug 16 19:36:21 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735908008 to other mirror: sda
> > > >   Aug 16 19:36:30 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735908008
> > > >   Aug 16 19:36:37 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908296 on sda)
> > > >   Aug 16 19:36:38 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735908008 to other mirror: sdc
> > > >   Aug 16 19:36:42 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735908776
> > > >   Aug 16 19:36:42 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735909032
> > > >   Aug 16 19:36:46 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908784 on sda)
> > > >   Aug 16 19:36:50 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735908944 on sda)
> > > >   Aug 16 19:36:50 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735908776 to other mirror: sdc
> > > >   Aug 16 19:36:55 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735909312 on sda)
> > > >   Aug 16 19:37:00 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735909360 on sda)
> > > >   Aug 16 19:37:04 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735909400 on sda)
> > > >   Aug 16 19:37:11 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735909520 on sda)
> > > >   Aug 16 19:37:11 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735909032 to other mirror: sdc
> > > >   Aug 16 19:37:21 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735910056
> > > >   Aug 16 19:37:21 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735910568
> > > >   Aug 16 19:37:25 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735910064 on sda)
> > > >   Aug 16 19:37:31 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735910080 on sda)
> > > >   Aug 16 19:38:00 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735910128 on sda)
> > > >   Aug 16 19:38:08 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735910240 on sda)
> > > >   Aug 16 19:38:12 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735910056 to other mirror: sdc
> > > >   Aug 16 19:38:15 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735910568 to other mirror: sdc
> > > >   Aug 16 19:38:23 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735911080
> > > >   Aug 16 19:38:23 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735911592
> > > >   Aug 16 19:38:27 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735911520 on sda)
> > > >   Aug 16 19:38:27 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735911080 to other mirror: sdc
> > > >   Aug 16 19:38:28 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735911592 to other mirror: sdc
> > > >   Aug 16 19:38:33 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735912104
> > > >   Aug 16 19:38:37 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735912184 on sda)
> > > >   Aug 16 19:38:45 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735912240 on sda)
> > > >   Aug 16 19:38:49 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735912248 on sda)
> > > >   Aug 16 19:38:59 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735912288 on sda)
> > > >   Aug 16 19:39:05 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735912104 to other mirror: sdc
> > > >   Aug 16 19:39:10 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735912872
> > > >   Aug 16 19:39:14 localhost kernel: md/raid1:md126: sda: rescheduling 
> > > > sector 2735913128
> > > >   Aug 16 19:39:25 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735912976 on sda)
> > > >   Aug 16 19:39:33 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735913048 on sda)
> > > >   Aug 16 19:39:37 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735913072 on sda)
> > > >   Aug 16 19:39:41 localhost kernel: md/raid1:md126: redirecting sector 
> > > > 2735912872 to other mirror: sdc
> > > >   Aug 16 19:39:45 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735913128 on sda)
> > > >   Aug 16 19:39:55 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735913176 on sda)
> > > >   Aug 16 19:40:05 localhost kernel: md/raid1:md126: read error 
> > > > corrected (8 sectors at 2735913232 on sda)
> > > >
> > > >
> > > > And here is what I get from:
> > > >
> > > > $ grep  md127  /var/log/messages
> > > >
> > > >
> > > >   Aug 16 14:16:38 localhost systemd[1]: mdmon@md127.service: 
> > > > Deactivated successfully.
> > > >   Aug 16 14:16:38 localhost systemd[1]: mdmon@md127.service: Unit 
> > > > process 884 (mdmon) remains running after unit stopped.
> > > >   Aug 16 14:16:38 localhost systemd[1]: Stopped mdmon@md127.service - 
> > > > MD Metadata Monitor on /dev/md127.
> > > >   Aug 16 14:16:38 localhost audit[1]: SERVICE_STOP pid=1 uid=0 
> > > > auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 
> > > > msg='unit=mdmon@md127 comm="systemd" exe="/usr/lib/systemd/systemd" 
> > > > hostname=? addr=? terminal=? res=success'
> > > >   Aug 16 14:16:38 localhost systemd[1]: mdmon@md127.service: Consumed 
> > > > 41.719s CPU time.
> > > >   Aug 16 14:21:20 localhost systemd[1]: Starting mdmon@md127.service - 
> > > > MD Metadata Monitor on /dev/md127...
> > > >   Aug 16 14:21:20 localhost systemd[1]: Started mdmon@md127.service - 
> > > > MD Metadata Monitor on /dev/md127.
> > > >   Aug 16 14:21:20 localhost audit[1]: SERVICE_START pid=1 uid=0 
> > > > auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 
> > > > msg='unit=mdmon@md127 comm="systemd" exe="/usr/lib/systemd/systemd" 
> > > > hostname=? addr=? terminal=? res=success'
> > > >
> > > > >
> > > > > Also, if you have a prior good reboot in the messages files, maybe
> > > > > include that and see what happened differently between the two.
> > > >
> > > > Yeah, I do not know where to find this. I looked into 
> > > > /var/log/messages, but it looks like it starts on August 13, which was 
> > > > a surprise to me, and the last non-responsive instance for me was last 
> > > > week (August 10, I think, when I booted into the 6.4 kernel). I did 
> > > > reboot in frustration on August 16.
> > > >
> > > > Thanks,
> > > > Ranjan
> > > >
> > > >
> > > > >
> > > > > On Fri, Aug 18, 2023 at 7:46 AM Ranjan Maitra <mlmai...@gmx.com> 
> > > > > wrote:
> > > > > >
> > > > > > On Thu Aug17'23 10:37:29PM, Samuel Sieb wrote:
> > > > > > > From: Samuel Sieb <sam...@sieb.net>
> > > > > > > Date: Thu, 17 Aug 2023 22:37:29 -0700
> > > > > > > To: users@lists.fedoraproject.org
> > > > > > > Reply-To: Community support for Fedora users 
> > > > > > > <users@lists.fedoraproject.org>
> > > > > > > Subject: Re: slowness with kernel 6.4.10 and software raid
> > > > > > >
> > > > > > > On 8/17/23 21:38, Ranjan Maitra wrote:
> > > > > > > > $ cat /proc/mdstat
> > > > > > > >   Personalities : [raid1]
> > > > > > > >   md126 : active raid1 sda[1] sdc[0]
> > > > > > > >         1855870976 blocks super external:/md127/0 [2/2] [UU]
> > > > > > > >         [=>...................]  check =  8.8% 
> > > > > > > > (165001216/1855870976) finish=45465.2min speed=619K/sec
> > > > > > > >
> > > > > > > >   md127 : inactive sda[1](S) sdc[0](S)
> > > > > > > >         10402 blocks super external:imsm
> > > > > > > >
> > > > > > > >   unused devices: <none>
> > > > > > > >
> > > > > > > > I am not sure what it is doing, and I am a bit concerned that 
> > > > > > > > this will go on at this rate for about 20 days. There is no 
> > > > > > > > knowing what will happen after that, or whether this problem 
> > > > > > > > will recur with another reboot.
> > > > > > >
> > > > > > > After a certain amount of time, mdraid verifies the data by
> > > > > > > scanning the entire array.  If you reboot, it will continue from
> > > > > > > where it left off.  But that is *really* slow, so you should find
> > > > > > > out what's going on there.
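> > > > > > >
> > > > > > > One quick thing to rule out is the md rate limits, the usual
> > > > > > > first suspect for a slow check (though 619K/sec smells more like
> > > > > > > read errors than throttling):
> > > > > > >
> > > > > > >   # global resync rate floor and ceiling, in KB/s
> > > > > > >   cat /proc/sys/dev/raid/speed_limit_min
> > > > > > >   cat /proc/sys/dev/raid/speed_limit_max
> > > > > > >   # raise the floor if the check is being starved (example value)
> > > > > > >   echo 50000 > /proc/sys/dev/raid/speed_limit_min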
> > > > > >
> > > > > > Yes, I know, just not sure what to do. Thanks very much!
> > > > > >
> > > > > > Any suggestion is appreciated!
> > > > > >
> > > > > > Best wishes,
> > > > > > Ranjan
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue
