Bug#1076372: linux-image-6.11.5+debian+tj: Diagnostic steps

Stefan Wed, 30 Oct 2024 17:21:26 -0700

Hi,

first, answers to your questions / remarks / reminders / notes:


1. I installed the latest mainboard firmware (/ UEFI / BIOS).
2. Both tested NVMe SSDs have a capacity of 4TB. The qualifying storage
   list does not contain any 4TB SSD from Lexar or Kingston.
3. The corruption occurs locally, but maybe only if large amounts of
   data are transferred (also see 4.). The easiest way to reproduce the
   bug is to use the tool `f3`, which writes and then verifies files
   filled with a pseudo-random pattern (it is intended to detect fake
   memory cards).
4. Strangely enough, I probably generated thousands of corrupted files,
   but never noticed any file system errors. That's why my first idea
   was that it is a filesystem issue.
5. Lexar SSD with > 6.1 kernels in primary M.2 socket (front side)
   produces write errors; Kingston SSD with 6.1 kernel in primary M.2
   socket produces read errors.
6. The errors (read + write) occur in bulk, i.e. the 1GB files (read
   + written by f3) are either o.k. or have larger defective portions.
   At least when write errors occur, the sizes of these portions are
   often (but not always) multiples of 128KB.
7. The Asrock X600M(-STX) is chipset-less, i.e. the CPU AMD 8700G runs
   in SOC mode.
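
The write + verify idea from 3. and the 128KB observation from 6. can
be sketched as follows. This is only an illustration under stated
assumptions, not f3's actual implementation: the pseudo-random
generator, the function names (`write_pattern`, `verify_pattern`,
`is_128k_multiple`) and the 4 KiB comparison granularity are all
hypothetical choices made for the example.

```python
import os
import random

CHUNK = 1 << 20            # bytes generated and written per step (1 MiB)
SECTOR = 4096              # comparison granularity (an assumption)
BLOCK_128K = 128 * 1024    # the size mentioned in point 6


def write_pattern(path, seed, size, chunk=CHUNK):
    """Fill `path` with `size` pseudo-random bytes derived from `seed`."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for base in range(0, size, chunk):
            f.write(rng.randbytes(min(chunk, size - base)))
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device


def verify_pattern(path, seed, size, chunk=CHUNK, sector=SECTOR):
    """Regenerate the pattern and return [(offset, length)] of bad ranges.

    `chunk` must match the value used for writing, so that the generator
    is called with the same sequence of sizes.
    """
    rng = random.Random(seed)
    ranges = []
    with open(path, "rb") as f:
        for base in range(0, size, chunk):
            n = min(chunk, size - base)
            expected = rng.randbytes(n)
            actual = f.read(n)
            for off in range(0, n, sector):
                if expected[off:off + sector] == actual[off:off + sector]:
                    continue
                start, length = base + off, min(sector, n - off)
                if ranges and ranges[-1][0] + ranges[-1][1] == start:
                    # Adjacent to the previous bad sector: extend that range.
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((start, length))
    return ranges


def is_128k_multiple(length):
    """True if a bad range's length is a whole number of 128 KiB blocks."""
    return length % BLOCK_128K == 0
```

Calling `verify_pattern()` with the same seed and size after
`write_pattern()` gives a list of bad ranges whose lengths can then be
checked with `is_128k_multiple()`. Note that a read directly after the
write may be served from the page cache rather than the SSD, so a real
test would need a file larger than RAM or dropped caches in between.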

Conclusion:

A. I disagree that it is an SSD-specific issue. For example, the older
   Lexar SSD ran in the previous PC without any issues and works in the
   secondary (rear) M.2 socket with both tested kernels. On the other
   hand, both tested SSDs produce issues in the primary socket with
   some kernels.
B. I think that the bug is either CPU-specific (see 7.) or
   mainboard-specific and, because there are issues with both tested
   SSDs, that someone did a bad job testing the hardware.
C. Because some symptoms are quite weird (4., 5.), it may be something
   unusual, like a module writing into memory that belongs to another
   module.

Testing:

At the moment, and also for the next few months, testing is difficult
because the PC is installed remotely and in full use. (For testing I
need to swap the SSDs ...) But I'll try to test the latest mainline
kernels (LTS and stable) in November in order to verify that it is not
a Debian issue. (At least the LTS kernel can be tested remotely.)


Regards Stefan



On 30.10.24 at 22:42, Tj wrote:

Package: linux-image-6.11.5+debian+tj
Followup-For: Bug #1076372
X-Debbugs-Cc: tj.iam...@proton.me

Following up on the kernel team discussion this evening, of which I
only caught the tail end, I've reviewed this report and have the
following suggestions and observations.

It would be good to see complete-from-boot kernel logs for good/bad
results linked to which M2 slot each device is in.
That mobo (AsRock X600M-STX) has 3 M.2, 2 for SSDs (+1 for WiFi),
M2_1 Gen5x4 on front of PCB, and M2_2 Gen4x4 on rear.
AsRock also publish a qualifying storage device list and the Lexar
LNM790 appears to be on it

https://www.asrock.com/Nettop/AMD/DeskMini%20X600%20Series/index.asp#Storage

(and presumably implicitly requires most recent UEFI - there have
been several very recent updates).

https://www.asrock.com/Nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS

Also of use would be to know if the corruption occurs for locally
generated data. The report states that the data is received from the
network, so with current knowledge the issue could be on the network
side.
Also, with my forensics hat on, being shown the expected vs. corrupted
data might give clues as to what type of cause it is. For example,
I've dealt with situations where a single register bit would flip
occasionally and the data stream would be scrambled until the bit
flipped back, at which point the data was unscrambled again. Or it
could be bits/words being lost entirely, so that the remaining data is
stored at a different offset to the original but is still there.
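
As a rough illustration of such an expected vs. corrupted comparison,
here is a minimal sketch that tells the two cases apart: data changed
in place (e.g. flipped bits) versus data that is still present but at
a different offset because bytes were lost. The function names, the
64-byte probe window and the 4096-byte search limit are hypothetical
choices made for the example, and the two buffers are assumed to have
the same length.

```python
def first_mismatch(expected, actual):
    """Index of the first differing byte, or None if the buffers agree."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return None


def find_shift(expected, actual, start, max_shift=4096, probe=64):
    """Return k if the data read at `start` equals the expected data found
    k bytes further on (i.e. k bytes appear to be lost), else None."""
    window = bytes(actual[start:start + probe])
    if len(window) < probe:
        return None
    for k in range(1, max_shift + 1):
        if bytes(expected[start + k:start + k + probe]) == window:
            return k
    return None


def flipped_bits(expected, actual):
    """Number of bits that differ between the two buffers."""
    return sum(bin(e ^ a).count("1") for e, a in zip(expected, actual))


def classify(expected, actual):
    """Return (kind, offset, detail) for a pair of equally long buffers.

    kind is "identical", "shifted" (detail = bytes lost) or
    "in-place" (detail = number of differing bits).
    """
    start = first_mismatch(expected, actual)
    if start is None:
        return ("identical", None, None)
    k = find_shift(expected, actual, start)
    if k is not None:
        return ("shifted", start, k)
    return ("in-place", start, flipped_bits(expected, actual))
```

The shift search is only meaningful for non-repetitive data such as a
pseudo-random test pattern; on repetitive content it could report a
shift that is not real. The byte-wise loops are slow and are meant for
a corrupted excerpt rather than a whole 1GB file.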
