Hi Yu,

Thanks for your follow-ups.

On Tue, May 06, 2025 at 09:25:50AM +0800, Yu Kuai wrote:
> Hi,
>
> On 2025/05/06 4:59, Antoine Beaupré wrote:
> > On 2025-05-05 22:36:07, Salvatore Bonaccorso wrote:
> > > Hi Antoine,
> > >
> > > On Mon, May 05, 2025 at 02:50:32PM -0400, Antoine Beaupré wrote:
> > > > On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
> > > > > On Mon, May 05, 2025 at 04:00:31PM +0200, Salvatore Bonaccorso wrote:
> > > > > > Hi Moritz,
> > > > > >
> > > > > > On Mon, May 05, 2025 at 01:47:15PM +0200, Moritz Mühlenhoff wrote:
> > > > > > > On Wed, Apr 30, 2025 at 05:55:20PM +0200, Salvatore Bonaccorso wrote:
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > We got a regression report in Debian after the update from 6.1.133 to
> > > > > > > > 6.1.135. Melvin is reporting that discard/trim through a RAID10 array
> > > > > > > > stalls indefinitely. The full report is inlined below and originates
> > > > > > > > from https://bugs.debian.org/1104460 .
> > > > > > >
> > > > > > > JFTR, we ran into the same problem with a few Wikimedia servers running
> > > > > > > 6.1.135 and RAID 10: The servers started to lock up once fstrim.service
> > > > > > > got started. Full oops messages are available at
> > > > > > > https://phabricator.wikimedia.org/P75746
> > > > > >
> > > > > > Thanks for these additional data points. Assuming you won't be able to
> > > > > > test the other stable series where the commit d05af90d6218
> > > > > > ("md/raid10: fix missing discard IO accounting") went in, might you at
> > > > > > least be able to test the 6.1.y branch with the commit reverted again
> > > > > > and manually trigger the issue?
> > > > > >
> > > > > > If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
> > > > > > with the patch reverted.
> > > > > >
> > > > > So one additional data point, as several Debian users were reporting
> > > > > back being affected: One user did upgrade to 6.12.25 (where the
> > > > > commit was backported as well) and is not able to reproduce the issue
> > > > > there.
> > > >
> > > > That would be me.
> > > >
> > > > I can reproduce the issue as outlined by Moritz above fairly reliably in
> > > > 6.1.135 (Debian package 6.1.0-34-amd64). The reproducer is simple, on a
> > > > RAID-10 host:
> > > >
> > > > 1. reboot
> > > > 2. systemctl start fstrim.service
> > > >
> > > > We're tracking the issue internally in:
> > > >
> > > > https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146
> > > >
> > > > I've managed to work around the issue by upgrading to the Debian package
> > > > from testing/unstable (6.12.25), as Salvatore indicated above. There,
> > > > fstrim doesn't cause any crash and completes successfully. In stable, it
> > > > just hangs there forever. The kernel doesn't completely panic and the
> > > > machine is otherwise somewhat still functional: my existing SSH
> > > > connection keeps working, for example, but new ones fail. And an `apt
> > > > install` of another kernel hangs forever.
> > >
> > > So likely at least in 6.1.y there are missing pre-requisites causing
> > > the behaviour.
> > >
> > > If you can test 6.1.135-1 with the commit
> > > 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 reverted then you can fetch
> > > built packages at:
> > >
> > > https://people.debian.org/~carnil/tmp/linux/1104460/
>
> Can you also test with 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 not
> reverted, and also cherry-pick c567c86b90d4715081adfe5eb812141a5b6b4883?

Thank you.

Antoine, Moritz,

https://people.debian.org/~carnil/tmp/linux/1104460-2/ contains a build
with 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 *not* reverted and with
c567c86b90d4715081adfe5eb812141a5b6b4883 cherry-picked. Can you test this
one as well?

Regards,
Salvatore