On 16.02.21 12:18, Brian Foster wrote:
On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
On 13.01.21 22:53, Dave Chinner wrote:
[...]
I agree that a throttling fix is needed, but I'm trying to
understand the scope and breadth of the problem first instead of
jumping the gun and making the wrong fix for the wrong reasons that
just papers over the underlying problems that the throttling
On 09.02.21 01:46, Guoqing Jiang wrote:
Great. I will send a formal patch with your reported-by and tested-by.
Yes, that's fine.
Thanks a lot for your help!
Donald
Thanks,
Guoqing
Dear Guoqing,
On 08.02.21 15:53, Guoqing Jiang wrote:
On 2/8/21 12:38, Donald Buczek wrote:
5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like
this.
/* resync has finished, collect result */
mddev_unlock(mddev);
md_unregister_t
On 02.02.21 16:42, Guoqing Jiang wrote:
Hi Donald,
On 1/26/21 17:05, Donald Buczek wrote:
Dear Guoqing,
On 26.01.21 15:06, Guoqing Jiang wrote:
On 1/26/21 13:58, Donald Buczek wrote:
Hmm, how about wake the waiter up in the while loop of raid5d?
@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
Dear Guoqing,
a colleague of mine was able to produce the issue inside a vm and was able to
find a procedure to run the vm into the issue within minutes (not unreliably
after hours on a physical system as before). This of course helped to pinpoint
the problem.
My current theory of what is ha
Dear Guoqing,
On 26.01.21 15:06, Guoqing Jiang wrote:
On 1/26/21 13:58, Donald Buczek wrote:
Hmm, how about wake the waiter up in the while loop of raid5d?
@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
md_check_recovery(mddev
On 26.01.21 12:14, Guoqing Jiang wrote:
Hi Donald,
On 1/26/21 10:50, Donald Buczek wrote:
[...]
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev
Dear Guoqing,
On 26.01.21 01:44, Guoqing Jiang wrote:
Hi Donald,
On 1/25/21 22:32, Donald Buczek wrote:
On 25.01.21 09:54, Donald Buczek wrote:
Dear Guoqing,
a colleague of mine was able to produce the issue inside a vm and was able to
find a procedure to run the vm into the issue
On 25.01.21 09:54, Donald Buczek wrote:
Dear Guoqing,
a colleague of mine was able to produce the issue inside a vm and was able to
find a procedure to run the vm into the issue within minutes (not unreliably
after hours on a physical system as before). This of course helped to pinpoint
Dear Guoqing,
On 20.01.21 17:33, Guoqing Jiang wrote:
Hi Donald,
On 1/19/21 12:30, Donald Buczek wrote:
Dear md-raid people,
I've reported a problem in this thread in December:
"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process
seems
Dear md-raid people,
I've reported a problem in this thread in December:
"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process
seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel
versions." It was clear,
On 07.01.21 23:19, Dave Chinner wrote:
On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
On 02.01.21 23:44, Dave Chinner wrote:
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
On 31.12.20 22:59, Dave Chinner wrote:
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald
On 02.01.21 23:44, Dave Chinner wrote:
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
On 31.12.20 22:59, Dave Chinner wrote:
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
On 30.12.20 23:16, Dave Chinner wrote:
One could argue that, but one should also
On 31.12.20 22:59, Dave Chinner wrote:
Hey, funny, your email could celebrate New Year a second time :-)
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
On 30.12.20 23:16, Dave Chinner wrote:
On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
Threads, which
On 30.12.20 23:16, Dave Chinner wrote:
On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
Threads, which committed items to the CIL, wait in the xc_push_wait
waitqueue when used_space in the push context goes over a limit. These
threads need to be woken when the CIL is pushed.
The
These threads are
not woken.
Always wake all CIL push waiters. Test with waitqueue_active() as an
optimization. This is possible, because we hold the xc_push_lock
spinlock, which prevents additions to the waitqueue.
Signed-off-by: Donald Buczek
---
fs/xfs/xfs_log_cil.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..d620de8e21
On 27.12.20 18:34, Donald Buczek wrote:
On 18.12.20 19:35, Donald Buczek wrote:
On 18.12.20 16:35, Brian Foster wrote:
On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
On 17.12.20 20:43, Brian Foster wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear
On 18.12.20 19:35, Donald Buczek wrote:
On 18.12.20 16:35, Brian Foster wrote:
On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
On 17.12.20 20:43, Brian Foster wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear xfs developer,
I was doing some testing on
On 21.12.20 13:22, Donald Buczek wrote:
On 18.12.20 22:49, Dave Chinner wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs
filesystems on md raid6 raids.
The stress test was
Dear Guoqing,
I think now that this is not an issue for md. I've driven a system into that
situation again and have clear indication, that this is a problem of the member
block device driver.
With md0 in the described erroneous state (md0_raid6 busy looping, echo idle >
.../sync_action blocked
On 18.12.20 22:49, Dave Chinner wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs
filesystems on md raid6 raids.
The stress test was essentially `cp -a`ing a Linux source repository
On 18.12.20 16:35, Brian Foster wrote:
On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
On 17.12.20 20:43, Brian Foster wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB
On 17.12.20 20:43, Brian Foster wrote:
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs
filesystems on md raid6 raids.
The stress test was essentially `cp -a`ing a Linux source repository
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs
filesystems on md raid6 raids.
The stress test was essentially `cp -a`ing a Linux source repository with two
threads in parallel on each filesystem.
After about one hour, the processes to one filesystem (
Dear Guoqing,
On 12/3/20 2:55 AM, Guoqing Jiang wrote:
Hi Donald,
On 12/2/20 18:28, Donald Buczek wrote:
Dear Guoqing,
unfortunately the patch didn't fix the problem (unless I messed it up with my
logging). This is what I used:
--- a/drivers/md/md.c
+++ b/drivers/md
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30
I guess, md_bitmap_cond_end_sync+0x12d is the
`wait_event(bitmap->mddev->recovery_wait,atomic_re
On 30.11.20 03:06, Guoqing Jiang wrote:
On 11/28/20 13:25, Donald Buczek wrote:
Dear Linux mdraid people,
we are using raid6 on several servers. Occasionally we had failures, where a
mdX_raid6 process seems to go into a busy loop and all I/O to the md device
blocks. We've seen th
There doesn't seem to be any further progress.
I've taken a function_graph trace of the looping md1_raid6 process:
https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)
Maybe this helps to get an idea what might be going on?
Best
Donald
--
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433
On 8/12/20 8:49 AM, Donald Buczek wrote:
On 8/4/20 12:11 AM, Dave Chinner wrote:
On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:
On 01.08.20 00:32, Dave Chinner wrote:
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
Dear Linux people,
we have a backup server
If there is a problem with the smartpqi driver in Linux 5.9 and if fixes are
available, they should, of course, go upstream ( No Regression Policy ). I'll
be happy to test any fixes.
Best
Donald
Thanks,
Don
-Original Message-
From: Donald Buczek [mailto:buc...@m
On 8/4/20 12:11 AM, Dave Chinner wrote:
On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:
On 01.08.20 00:32, Dave Chinner wrote:
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
Dear Linux people,
we have a backup server with two xfs filesystems on 101.9TB md-raid6
On 01.08.20 12:25, Donald Buczek wrote:
So if I understand you correctly, this is expected behavior with this kind of
load and conceptual changes are already scheduled for kernel 5.9. I don't
understand most of it, but isn't it true that with those planned changes the
impact might
On 01.08.20 00:32, Dave Chinner wrote:
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
Dear Linux people,
we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices
(16 * 7.3 T disks) each, Current Linux version is 5.4.54.
realtime =none extsz=4096 blocks=0, rtextents=0
root:done:/home/buczek/linux_problems/shrinker_semaphore/# uname -a
Linux done.molgen.mpg.de 5.4.54.mx64.339 #1 SMP Wed Jul 29 16:24:46 CEST 2020
x86_64 GNU/Linux
On 8/20/19 11:21 PM, Donald Buczek wrote:
Dear Linux folks,
I'm investigating a problem, that the crash utility fails to work with our
crash dumps:
buczek@kreios:/mnt$ crash vmlinux crash.vmcore
crash 7.2.6
Copyright (C) 2002-2019 Red Hat, Inc.
Copyright (C) 2004,
e+0x36/0x80
[1707636.819360] [] ? do_exit+0x8fd/0xae0
[1707636.819663] [] ? rewind_stack_do_exit+0x17/0x20
```
Please find the messages until forceful reboot attached.
Kind regards,
Paul
if no pages can be isolated from
the LRU list. This is a pathological case but other reports from Donald
Buczek have shown that we might indeed hit such a path:
clusterd-989 [009] 118023.654491: mm_vmscan_direct_reclaim_end:
nr_reclaimed=193
kswapd1-86[001] dN
On 12/02/16 10:14, Donald Buczek wrote:
On 11/30/16 12:43, Donald Buczek wrote:
On 11/30/16 12:09, Michal Hocko wrote:
[CCing Paul]
On Wed 30-11-16 11:28:34, Donald Buczek wrote:
[...]
shrink_active_list gets and releases the spinlock and calls
cond_resched().
This should give other tasks a chance to run. Just as an experiment
On 11/30/16 12:09, Michal Hocko wrote:
[CCing Paul]
On Wed 30-11-16 11:28:34, Donald Buczek wrote:
[...]
shrink_active_list gets and releases the spinlock and calls cond_resched().
This should give other tasks a chance to run. Just as an experiment, I'm
trying
--- a/mm/vmscan.c
+++
process context?
From looking at the monitoring graphs, there was always enough CPU
resources available. The machine has 12x E5-2630 @ 2.30GHz. So that
shouldn’t have been a problem.
Kind regards,
Paul Menzel
On 24.11.2016 11:15, Michal Hocko wrote:
On Mon 21-11-16 16:35:53, Donald Buczek wrote:
[...]
Hello,
thanks a lot for looking into this!
Let me add some information from the reporting site:
* We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
2016) and it doesn
tched to 4.8 but not as frequently as on the two backup servers.
Usually there's "shrink_node" and "kswapd" on the top of the stack.
Often "xfs_reclaim_inodes" variants on top of that.
Donald
On 17.11.2016 15:55, Paul Menzel wrote:
Dear Linux folks,
On 11/16/16 22:24, Donald Buczek wrote:
The relevant commit is 703b5fa which includes
The commit message summary is *fs/dcache.c: Save one 32-bit multiply
in dcache lookup*.
static inline unsigned long end_name_hash(unsigned
47 matches