On 1/20/24 08:25, Tim Woodall wrote:
> Some time ago I wrote about a data corruption issue. I've still not
> managed to track it down ...
Please post a console session that demonstrates, or at least documents,
the data corruption.
Please cut and paste complete console sessions into your posts --
prompt, command entered, output displayed. Redact sensitive information.
It helps if your prompt contains useful information. I set PS1 in
$HOME/.profile as follows:
2024-01-20 11:31:58 dpchrist@laalaa ~
$ grep PS1 .profile | grep -v '#'
export PS1='\n\D{%Y-%m-%d %H:%M:%S} ${USER}@\h \w\n\$ '
> On the server that has no issues:
> sda: Sector size (logical/physical): 512 bytes / 512 bytes
> sdb: Sector size (logical/physical): 512 bytes / 512 bytes
Attempting to diagnose issues without all the facts is an exercise in
futility.
Please post console sessions that document the make and model of your
disks, their partition tables, your md RAID configurations, and your LVM
configurations.
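For example (an untested sketch; device and array names are taken from your
post, so adjust as needed, and run as root):

  smartctl -i /dev/sda
  smartctl -i /dev/sdb
  lsblk -o NAME,MODEL,SIZE,LOG-SEC,PHY-SEC
  fdisk -l /dev/sda /dev/sdb
  cat /proc/mdstat
  mdadm --detail /dev/md0
  pvs ; vgs ; lvs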
> These are then gpt partitioned, a small BIOS boot and EFI partition and
> then a big "Linux filesystem" partition that is part of a mdadm raid
>
> md0 : active raid1 sda3[3] sdb3[2]
>
> On the server that has performance issues and I get occasional data
> corruption (both reading and writing) under heavy (disk) load:
>
> sda: Sector size (logical/physical): 512 bytes / 512 bytes
> sdb: Sector size (logical/physical): 512 bytes / 4096 bytes
Putting a sector size 512/512 disk and a sector size 512/4096 disk into
the same mirror is unconventional. I suppose there are kernel
developers who could definitively explain the consequences, but I am not
one of them. The KISS solution is to use matching disks in RAID.
> All the
> partitions start on a 4k boundary but the big partition is not an exact
> multiple of 4k.
I align my partitions to 1 MiB boundaries and suggest that you do the same.
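parted can verify alignment; for example (untested here, and the partition
number is an example based on your sda3):

  parted /dev/sda unit s print
  parted /dev/sda align-check optimal 3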
> ... the "heavy load" filesystem that triggered the issue ...
Please post a console session that demonstrates how data corruption is
related to I/O throughput.
> There are a LOT of
> partitions and filesystems in a complicated layered LVM setup ...
Complexity is the enemy of data integrity and system reliability. I
suggest simplifying where it makes sense, but do not over-simplify.
> Booted on the problem machine but physical disk still on the OK machine:
> real 0m35.731s
> user 0m5.291s
> sys 0m4.677s
>
> Booted on the good machine but physical disk still on the problem
> machine:
> real 0m57.721s
> user 0m5.446s
> sys 0m4.783s
Please provide host names.
Please post a console session that demonstrates how data corruption
affects VM boot time.
> The SMART attributes from the problem machine:
> sda:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
> UPDATED WHEN_FAILED RAW_VALUE
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail
> Always - 0
> 12 Power_Cycle_Count 0x0032 099 099 000 Old_age
> Always - 54
> 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail
> Always - 0
> 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age
> Always - 0
> 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age
> Always - 0
> 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail
> Always - 0
> 187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age
> Always - 0
> 190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age
> Always - 33
> 195 ECC_Error_Rate 0x001a 200 200 000 Old_age
> Always - 0
> 199 CRC_Error_Count 0x003e 100 100 000 Old_age
> Always - 0
Those look good.
> 9 Power_On_Hours 0x0032 096 096 000 Old_age
> Always - 18280
> 177 Wear_Leveling_Count 0x0013 087 087 000 Pre-fail
> Always - 129
> 241 Total_LBAs_Written 0x0032 099 099 000 Old_age
> Always - 62154466086
Please compare those to the SSD specifications.
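If Total_LBAs_Written counts 512-byte units, which is the usual
convention, then 62154466086 * 512 bytes is roughly 31.8 TB written.
Compare that, and the Wear_Leveling_Count, against the rated endurance
(TBW) in the drive's data sheet.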
> 235 POR_Recovery_Count 0x0012 099 099 000 Old_age
> Always - 39
https://www.overclock.net/threads/what-does-por-recovery-count-mean-in-samsung-magician.1491466/
I see a similar statistic on my Intel SSD 520 Series drives:
12 Power_Cycle_Count -O--CK 099 099 000 - 1996
174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 1994
Linux does not seem to shut down the drives the way they want to be shut
down.
> sdb:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
> UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail
> Always - 0
> 5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age
> Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
> Always - 50
> 171 Program_Fail_Count 0x0032 100 100 000 Old_age
> Always - 0
> 172 Erase_Fail_Count 0x0032 100 100 000 Old_age
> Always - 0
> 183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age
> Always - 0
> 184 Error_Correction_Count 0x0032 100 100 000 Old_age
> Always - 0
> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age
> Always - 0
> 194 Temperature_Celsius 0x0022 074 052 000 Old_age
> Always - 26 (Min/Max 0/48)
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age
> Always - 0
> 197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age
> Always - 0
> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age
> Offline - 0
> 206 Write_Error_Rate 0x000e 100 100 000 Old_age
> Always - 0
> 210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age
> Always - 0
Those look good.
> 199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age
> Always - 1
I believe that indicates a SATA communication error between the drive
and the controller. I suggest using SATA cables that are rated for
SATA III 6 Gbps and have locking connectors. If you are in doubt, buy
new cables that are clearly labeled as SATA III.
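After swapping cables, you can check whether the counter keeps
climbing; for example:

  smartctl -A /dev/sdb | grep -i crc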
> 9 Power_On_Hours 0x0032 100 100 000 Old_age
> Always - 18697
> 173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age
> Always - 433
> 180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail
> Always - 45
> 246 Total_LBAs_Written 0x0032 100 100 000 Old_age
> Always - 63148678276
> 247 Host_Program_Page_Count 0x0032 100 100 000 Old_age
> Always - 1879223820
> 248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age
> Always - 1922002147
Please compare those to SSD specifications.
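By the same 512-byte assumption, 63148678276 * 512 bytes is roughly
32.3 TB written on sdb.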
> 202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age
> Offline - 33
That value is not encouraging, but it is an estimate, not a hard error
count. I would monitor it over time.
> 174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age
> Always - 12
Same comments as above.
An underlying theme is "data integrity". AIUI only btrfs and ZFS have
integrity checking built-in; AIUI md, LVM, and ext[234] do not. Linux
dm-integrity has not reached Debian stable yet. I suggest that you
implement periodic runs of BSD mtree(8) to monitor your file systems
for corruption:
https://manpages.debian.org/bullseye/mtree-netbsd/mtree.8.en.html
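A minimal sketch, assuming you baseline /home with SHA-256 digests and
keep the specification outside the file system being monitored (paths
and keywords are examples; see the man page):

  # build a specification of the current state
  mtree -c -K sha256 -p /home > /var/local/home.mtree
  # later, report any files that no longer match
  mtree -p /home -f /var/local/home.mtree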
Another underlying theme is system monitoring and failure prediction.
It is good to run SMART self-tests and review SMART reports on a regular basis. I
do this manually, have too many disks, and am doing a lousy job. I need
to learn smartd(8).
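A minimal /etc/smartd.conf sketch, based on the examples in
smartd.conf(5) (device names and schedule are illustrative):

  # -a: monitor all attributes; -o/-S: enable offline testing and
  # attribute autosave; -s: short self-test daily at 02:00, long
  # self-test Saturdays at 03:00; -m: mail root on trouble
  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
  /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

Then restart the smartmontools/smartd service so it rereads the
configuration.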
There have been a few posts recently by people who are running consumer
SSDs in RAID 24x7. After 2+ years, the SSDs start having problems and
produce scary SMART reports. AIUI consumer drives are rated for 40
hours/week. Running them 24x7 is like "dog years" -- multiply wall
clock time by 24 * 7 / 40 to get equivalent usage time. In this case, 2
years at 24x7 is equivalent to 8.4 years of 40 hours/week usage. If you
want to run disks 24x7 and have them last 5 years with a certain I/O
load, get disks rated for that.
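For example, applying that to the sda figures above: 18280
Power_On_Hours is only about 2.1 years of continuous operation, but at
the 40 hours/week rating it is 18280 / 40 = 457 weeks, or roughly 8.8
years of rated usage.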
David