On 1/20/24 08:25, Tim Woodall wrote:
> Some time ago I wrote about a data corruption issue. I've still not
> managed to track it down ...

Please post a console session that demonstrates, or at least documents, the data corruption.


Please cut and paste complete console sessions into your posts -- prompt, command entered, output displayed. Redact sensitive information.


It helps if your prompt contains useful information. I set PS1 in $HOME/.profile as follows:

2024-01-20 11:31:58 dpchrist@laalaa ~
$ grep PS1 .profile | grep -v '#'
export PS1='\n\D{%Y-%m-%d %H:%M:%S} ${USER}@\h \w\n\$ '


> On the server that has no issues:
> sda: Sector size (logical/physical): 512 bytes / 512 bytes
> sdb: Sector size (logical/physical): 512 bytes / 512 bytes

Attempting to diagnose issues without all the facts is an exercise in futility.


Please post console sessions that document the make and model of your disks, their partition tables, your md RAID configurations, and your LVM configurations.
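
Off the top of my head, something like the following should capture most of that information (run as root; /dev/sda, /dev/sdb, and md0 are the names from your post, so adjust to suit):

  # smartctl -i /dev/sda
  # smartctl -i /dev/sdb
  # fdisk -l /dev/sda /dev/sdb
  # cat /proc/mdstat
  # mdadm --detail /dev/md0
  # pvs ; vgs ; lvs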


> These are then gpt partitioned, a small BIOS boot and EFI partition and
> then a big "Linux filesystem" partition that is part of a mdadm raid
>
> md0 : active raid1 sda3[3] sdb3[2]
>
> On the server that has performance issues and I get occasional data
> corruption (both reading and writing) under heavy (disk) load:
>
> sda: Sector size (logical/physical): 512 bytes / 512 bytes
> sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

Putting a 512/512 (logical/physical) sector disk and a 512/4096 sector disk into the same mirror is unconventional. I suppose there are kernel developers who could definitively explain the consequences, but I am not one of them. The KISS solution is to use matching disks in RAID.


> All the
> partitions start on a 4k boundary but the big partition is not an exact
> multiple of 4k.

I align my partitions to 1 MiB boundaries and suggest that you do the same.
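
To check alignment, parted(8) can print start sectors and run its own alignment check (the device and partition number below are guesses based on your md0 line; adjust to suit). On a 512-byte logical sector disk, a 1 MiB boundary is a start sector that is a multiple of 2048:

  # parted /dev/sdb unit s print
  # parted /dev/sdb align-check optimal 3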


> ... the "heavy load" filesystem that triggered the issue ...

Please post a console session that demonstrates how data corruption is related to I/O throughput.
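
One way to document it would be to write a file of known content, checksum it, re-read it under heavy disk load, and show the mismatch. A rough sketch (file names are placeholders; dropping the caches needs root so that the re-read actually hits the disk):

  $ dd if=/dev/urandom of=testfile bs=1M count=1024
  $ sha256sum testfile > testfile.sha256
    ... start your heavy I/O workload ...
  # echo 3 > /proc/sys/vm/drop_caches
  $ sha256sum -c testfile.sha256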


> There are a LOT of
> partitions and filesystems in a complicated layered LVM setup ...

Complexity is the enemy of data integrity and system reliability. I suggest simplifying where it makes sense, but do not over-simplify.


> Booted on the problem machine but physical disk still on the OK machine:
> real    0m35.731s
> user    0m5.291s
> sys     0m4.677s
>
> Booted on the good machine but physical disk still on the problem
> machine:
> real    0m57.721s
> user    0m5.446s
> sys     0m4.783s

Please provide host names.


Please post a console session that demonstrates how data corruption affects VM boot time.


> The SMART attributes from the problem machine:
> sda:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE
>    5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail
> Always       -       0
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age
> Always       -       54
> 179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail
> Always       -       0
> 181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age
> Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age
> Always       -       0
> 183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail
> Always       -       0
> 187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age
> Always       -       0
> 190 Airflow_Temperature_Cel 0x0032   067   049   000    Old_age
> Always       -       33
> 195 ECC_Error_Rate          0x001a   200   200   000    Old_age
> Always       -       0
> 199 CRC_Error_Count         0x003e   100   100   000    Old_age
> Always       -       0

Those look good.


>    9 Power_On_Hours          0x0032   096   096   000    Old_age
> Always       -       18280
> 177 Wear_Leveling_Count     0x0013   087   087   000    Pre-fail
> Always       -       129
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age
> Always       -       62154466086

Please compare those to the SSD specifications.
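
For example, assuming Total_LBAs_Written counts 512-byte sectors (common, but verify against your model's documentation), the raw value above works out to roughly 32 TB written, which you can compare against the drive's rated TBW endurance:

  $ echo '62154466086 * 512 / 1000^4' | bc -l
  31.82308663603200000000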


> 235 POR_Recovery_Count      0x0012   099   099   000    Old_age
> Always       -       39

https://www.overclock.net/threads/what-does-por-recovery-count-mean-in-samsung-magician.1491466/


I see a similar statistic on my Intel SSD 520 Series drives:

 12 Power_Cycle_Count       -O--CK   099   099   000    -    1996
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    1994


Linux does not seem to shut down the drives the way they expect to be shut down.


> sdb:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE
>    1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail
> Always       -       0
>    5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age
> Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> Always       -       50
> 171 Program_Fail_Count      0x0032   100   100   000    Old_age
> Always       -       0
> 172 Erase_Fail_Count        0x0032   100   100   000    Old_age
> Always       -       0
> 183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age
> Always       -       0
> 184 Error_Correction_Count  0x0032   100   100   000    Old_age
> Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age
> Always       -       0
> 194 Temperature_Celsius     0x0022   074   052   000    Old_age
> Always       -       26 (Min/Max 0/48)
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age
> Always       -       0
> 197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age
> Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age
> Offline      -       0
> 206 Write_Error_Rate        0x000e   100   100   000    Old_age
> Always       -       0
> 210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age
> Always       -       0

Those look good.


> 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age
> Always       -       1

I believe that indicates a SATA communications problem. I suggest using SATA cables that are rated for SATA III 6 Gbps and have locking connectors. If you are in doubt, buy new cables that are clearly labeled with their rating.


>    9 Power_On_Hours          0x0032   100   100   000    Old_age
> Always       -       18697
> 173 Ave_Block-Erase_Count   0x0032   067   067   000    Old_age
> Always       -       433
> 180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail
> Always       -       45
> 246 Total_LBAs_Written      0x0032   100   100   000    Old_age
> Always       -       63148678276
> 247 Host_Program_Page_Count 0x0032   100   100   000    Old_age
> Always       -       1879223820
> 248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age
> Always       -       1922002147

Please compare those to SSD specifications.
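
As a rough check on sdb (again assuming Total_LBAs_Written counts 512-byte sectors), that is about 32 TB written. And if, purely for illustration, the NAND were rated for around 1,300 program/erase cycles, 433 average block erases would be roughly a third of its life, which lines up with attribute 202 quoted next:

  $ echo '63148678276 * 512 / 1000^4' | bc -l
  32.33212327731200000000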


> 202 Percent_Lifetime_Remain 0x0030   067   067   001    Old_age
> Offline      -       33

That value is not encouraging, but it is an estimate, not a hard error count. I would monitor it over time.


> 174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age
> Always       -       12

Same comments as above.


An underlying theme is "data integrity". AIUI, only btrfs and ZFS have integrity checking built in; md, LVM, and ext[234] do not, and Linux dm-integrity has not reached Debian stable yet. I suggest that you implement periodic runs of BSD mtree(8) to monitor your file systems for corruption:

https://manpages.debian.org/bullseye/mtree-netbsd/mtree.8.en.html
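
A minimal sketch of what I have in mind, assuming the NetBSD mtree from the mtree-netbsd package (keyword names and options vary between mtree implementations, so check the manual page above; the paths are placeholders):

  # mtree -c -K sha256 -p /home > /var/local/mtree-home.spec
    ... later ...
  # mtree -f /var/local/mtree-home.spec -p /home

The first run records a specification including SHA-256 digests; later runs report any files that have changed.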


Another underlying theme is system monitoring and failure prediction. It is good to run SMART self-tests and review SMART reports on a regular basis. I do this manually, have too many disks, and am doing a lousy job. I need to learn smartd(8).
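
For reference, a smartd.conf(5) entry along these lines (device name and schedule are only examples) polls the drive, runs a short self-test daily at 02:00 and a long self-test on Saturdays at 03:00, and mails warnings to root:

  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

On Debian the configuration file is /etc/smartd.conf; make sure the smartd/smartmontools service is enabled.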


There have been a few posts recently by people who are running consumer SSDs in RAID 24x7. After 2+ years, the SSDs start having problems and produce scary SMART reports. AIUI, consumer drives are rated for 40 hours/week. Running them 24x7 is like "dog years": multiply wall clock time by 24 * 7 / 40 = 4.2 to get equivalent usage time. In this case, 2 years at 24x7 is equivalent to 8.4 years of 40 hours/week usage. If you want to run disks 24x7 and have them last 5 years under a given I/O load, get disks rated for that duty cycle and workload.


David
