severity 440676 important
thanks

Jeffrey B. Green:
> > A very similar behaviour can be reproduced with i386 as well (with the
> > Etch versions of the Linux kernel and nfs-kernel-server):
> > We have a setup where a block device is replicated with drbd. On this
> > device an ext3 partition has been created, which is exported over NFS.
> > Reading from the NFS share works fairly well; however, concurrent
> > writes to the share lead to lockups. The client processes copying data
> > to the share stall, and sometimes the system locks up, requiring a
> > hard reboot.
> >
> > This is reproducible with both NFS over UDP and NFS over TCP.
> Actually, the incident is primarily on i386 machines. Sorry about the
> report being a bit deceptive by submitting from my standard work
> machine, which is a powerpc. The powerpc kernel may not exhibit this
> behavior.

Could you please test the attached patch from RHEL5 and report if it
resolves the problem for you? Instructions on how to compile a modified
kernel can be found at:
http://wiki.debian.org/DebianKernelCustomCompilation

Cheers,
        Moritz

-- 
Moritz Muehlenhoff      [EMAIL PROTECTED]            fon: +49 421 22 232- 0
Development             Linux for Your Business     fax: +49 421 22 232-99
Univention GmbH         http://www.univention.de/   mobil: +49 175 22 999 23
From: Steve Dickson <[EMAIL PROTECTED]>
Subject: [RHEL5][PATCH] NFS: system stall on NFS stress under high memory pressure
Date: Fri, 15 Dec 2006 08:41:22 -0500
Bugzilla: 213137
Message-Id: <[EMAIL PROTECTED]>
Changelog: NFS: system stall on NFS stress under high memory pressure

The following three attached patches solve an NFS hang that a number of
upstream and RHEL5 users saw. The hang, introduced in the 2.6.11 kernel,
was caused by an RPC task being continuously ignored due to a certain
combination of task states. This state combination only seems to happen
under memory pressure.

The upstream email thread is at:
http://sourceforge.net/mailarchive/message.php?msg_id=37252769

The bz is:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213137

These patches have been tested by a number of upstream people, as well as
locally on a RHEL5 B2 kernel.

steved.

The sunrpc scheduler contains a race condition that can leave an RPC task
neither running nor on any wait queue. The race takes place between
rpc_make_runnable (called from rpc_wake_up_task) and __rpc_execute under
the following conditions:

First, __rpc_execute calls tk_action, which puts the task on some wait
queue. The task is dequeued by another process before __rpc_execute
continues. If __rpc_execute picks up execution exactly after
rpc_make_runnable has set the task's `running' bit but before it has
cleared the `queued' bit, __rpc_execute clears `running', and both
functions then fall through, each under the false assumption that
somebody else took over the job.

Swapping rpc_test_and_set_running with rpc_clear_queued in
rpc_make_runnable closes that hole. This introduces another possible
race, which can be handled by checking for `queued' after setting the
`running' bit.

The bug was noticed on a 4-way x86_64 system under Xen with an NFSv4
server on the same physical machine, apparently one of the few setups
that can hit this race condition at all.

Cc: Trond Myklebust <[EMAIL PROTECTED]>
Cc: J. Bruce Fields <[EMAIL PROTECTED]>
Signed-off-by: Christophe Saout <[EMAIL PROTECTED]>

--- linux-2.6.18.i686/net/sunrpc/sched.c.orig	2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.i686/net/sunrpc/sched.c	2006-12-05 07:28:24.251992000 -0500
@@ -302,13 +302,15 @@ EXPORT_SYMBOL(__rpc_wait_for_completion_
  */
 static void rpc_make_runnable(struct rpc_task *task)
 {
-	int do_ret;
-
 	BUG_ON(task->tk_timeout_fn);
-	do_ret = rpc_test_and_set_running(task);
 	rpc_clear_queued(task);
-	if (do_ret)
+	if (rpc_test_and_set_running(task))
+		return;
+	/* We might have raced */
+	if (RPC_IS_QUEUED(task)) {
+		rpc_clear_running(task);
 		return;
+	}
 
 	if (RPC_IS_ASYNC(task)) {
 		int status;
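
To make the interleaving concrete, here is a minimal user-space sketch of
the fixed ordering using C11 atomics. The names (make_runnable, running,
queued) are illustrative stand-ins for the kernel's RPC_TASK_RUNNING and
RPC_TASK_QUEUED bitops, not the actual sunrpc API:

#include <stdatomic.h>
#include <stdbool.h>

struct task {
	atomic_bool running;	/* stands in for RPC_TASK_RUNNING */
	atomic_bool queued;	/* stands in for RPC_TASK_QUEUED  */
};

static void make_runnable(struct task *t)
{
	/* Clear `queued' before claiming `running', mirroring the patched
	 * rpc_make_runnable(): this removes the window in which the task
	 * is marked running while still appearing to be queued. */
	atomic_store(&t->queued, false);
	if (atomic_exchange(&t->running, true))
		return;	/* somebody else is already executing the task */
	/* We might have raced with a concurrent sleep that re-queued the
	 * task between the two operations above; back out instead of
	 * running a task that is once again on a wait queue. */
	if (atomic_load(&t->queued)) {
		atomic_store(&t->running, false);
		return;
	}
	/* ... hand the task to the executor here ... */
}

One of the two parties must always take responsibility for the task; the
re-check after setting `running' is what guarantees that on the wakeup
side.
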
Fix a second potential rpc_wakeup race.

Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>
-----------------------------------------------------------

--- linux-2.6.18.i686/fs/nfs/nfs4proc.c.orig	2006-12-06 10:00:51.316430000 -0500
+++ linux-2.6.18.i686/fs/nfs/nfs4proc.c	2006-12-12 12:07:21.607308000 -0500
@@ -636,7 +636,7 @@ static int _nfs4_proc_open_confirm(struc
 		smp_wmb();
 	} else
 		status = data->rpc_status;
-	rpc_release_task(task);
+	rpc_put_task(task);
 	return status;
 }
 
@@ -742,7 +742,7 @@ static int _nfs4_proc_open(struct nfs4_o
 		smp_wmb();
 	} else
 		status = data->rpc_status;
-	rpc_release_task(task);
+	rpc_put_task(task);
 	if (status != 0)
 		return status;
 
@@ -3059,7 +3059,7 @@ static int _nfs4_proc_delegreturn(struct
 		if (status == 0)
 			nfs_post_op_update_inode(inode, &data->fattr);
 	}
-	rpc_release_task(task);
+	rpc_put_task(task);
 	return status;
 }
 
@@ -3306,7 +3306,7 @@ static int nfs4_proc_unlck(struct nfs4_s
 	if (IS_ERR(task))
 		goto out;
 	status = nfs4_wait_for_completion_rpc_task(task);
-	rpc_release_task(task);
+	rpc_put_task(task);
 out:
 	return status;
 }
 
@@ -3422,7 +3422,7 @@ static void nfs4_lock_release(void *call
 		task = nfs4_do_unlck(&data->fl, data->ctx, data->lsp,
 				data->arg.lock_seqid);
 		if (!IS_ERR(task))
-			rpc_release_task(task);
+			rpc_put_task(task);
 		dprintk("%s: cancelling lock!\n", __FUNCTION__);
 	} else
 		nfs_free_seqid(data->arg.lock_seqid);
 
@@ -3464,7 +3464,7 @@ static int _nfs4_do_setlk(struct nfs4_st
 			ret = -EAGAIN;
 	} else
 		data->cancelled = 1;
-	rpc_release_task(task);
+	rpc_put_task(task);
 	dprintk("%s: done, ret = %d!\n", __FUNCTION__, ret);
 	return ret;
 }

--- linux-2.6.18.i686/include/linux/sunrpc/sched.h.orig	2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.i686/include/linux/sunrpc/sched.h	2006-12-12 12:07:21.836308000 -0500
@@ -180,13 +180,6 @@ struct rpc_call_ops {
 	} while (0)
 
 #define RPC_IS_ACTIVATED(t)	(test_bit(RPC_TASK_ACTIVE, &(t)->tk_runstate))
-#define rpc_set_active(t)	(set_bit(RPC_TASK_ACTIVE, &(t)->tk_runstate))
-#define rpc_clear_active(t)	\
-	do { \
-		smp_mb__before_clear_bit(); \
-		clear_bit(RPC_TASK_ACTIVE, &(t)->tk_runstate); \
-		smp_mb__after_clear_bit(); \
-	} while(0)
 
 /*
  * Task priorities.
@@ -257,6 +250,7 @@ struct rpc_task *rpc_new_child(struct rp
 void		rpc_init_task(struct rpc_task *task, struct rpc_clnt *clnt,
 				int flags, const struct rpc_call_ops *ops,
 				void *data);
+void		rpc_put_task(struct rpc_task *);
 void		rpc_release_task(struct rpc_task *);
 void		rpc_exit_task(struct rpc_task *);
 void		rpc_killall_tasks(struct rpc_clnt *);

--- linux-2.6.18.i686/net/sunrpc/clnt.c.orig	2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.i686/net/sunrpc/clnt.c	2006-12-12 12:07:22.474014000 -0500
@@ -455,10 +455,9 @@ int rpc_call_sync(struct rpc_clnt *clnt,
 
 	BUG_ON(flags & RPC_TASK_ASYNC);
 
-	status = -ENOMEM;
 	task = rpc_new_task(clnt, flags, &rpc_default_ops, NULL);
 	if (task == NULL)
-		goto out;
+		return -ENOMEM;
 
 	/* Mask signals on RPC calls _and_ GSS_AUTH upcalls */
 	rpc_task_sigmask(task, &oldset);
@@ -467,15 +466,17 @@ int rpc_call_sync(struct rpc_clnt *clnt,
 	/* Set up the call info struct and execute the task */
 	status = task->tk_status;
-	if (status == 0) {
-		atomic_inc(&task->tk_count);
-		status = rpc_execute(task);
-		if (status == 0)
-			status = task->tk_status;
+	if (status != 0) {
+		rpc_release_task(task);
+		goto out;
 	}
-	rpc_restore_sigmask(&oldset);
-	rpc_release_task(task);
+	atomic_inc(&task->tk_count);
+	status = rpc_execute(task);
+	if (status == 0)
+		status = task->tk_status;
+	rpc_put_task(task);
 out:
+	rpc_restore_sigmask(&oldset);
 	return status;
 }

--- linux-2.6.18.i686/net/sunrpc/sched.c.orig	2006-12-12 12:00:42.800928000 -0500
+++ linux-2.6.18.i686/net/sunrpc/sched.c	2006-12-12 12:07:23.389718000 -0500
@@ -273,12 +273,28 @@ static int rpc_wait_bit_interruptible(vo
 	return 0;
 }
 
+static void rpc_set_active(struct rpc_task *task)
+{
+	if (test_and_set_bit(RPC_TASK_ACTIVE, &task->tk_runstate) != 0)
+		return;
+	spin_lock(&rpc_sched_lock);
+#ifdef RPC_DEBUG
+	task->tk_magic = RPC_TASK_MAGIC_ID;
+	task->tk_pid = rpc_task_id++;
+#endif
+	/* Add to global list of all tasks */
+	list_add_tail(&task->tk_task, &all_tasks);
+	spin_unlock(&rpc_sched_lock);
+}
+
 /*
  * Mark an RPC call as having completed by clearing the 'active' bit
  */
-static inline void rpc_mark_complete_task(struct rpc_task *task)
+static void rpc_mark_complete_task(struct rpc_task *task)
 {
-	rpc_clear_active(task);
+	smp_mb__before_clear_bit();
+	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
+	smp_mb__after_clear_bit();
 	wake_up_bit(&task->tk_runstate, RPC_TASK_ACTIVE);
 }
 
@@ -352,9 +368,6 @@ static void __rpc_sleep_on(struct rpc_wa
 		return;
 	}
 
-	/* Mark the task as being activated if so needed */
-	rpc_set_active(task);
-
 	__rpc_add_wait_queue(q, task);
 
 	BUG_ON(task->tk_callback != NULL);
@@ -365,6 +378,9 @@ static void __rpc_sleep_on(struct rpc_wa
 void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
 				rpc_action action, rpc_action timer)
 {
+	/* Mark the task as being activated if so needed */
+	rpc_set_active(task);
+
 	/*
 	 * Protect the queue operations.
 	 */
@@ -694,8 +710,6 @@ static int __rpc_execute(struct rpc_task
 	}
 
 	dprintk("RPC: %4d, return %d, status %d\n", task->tk_pid, status, task->tk_status);
-	/* Wake up anyone who is waiting for task completion */
-	rpc_mark_complete_task(task);
 	/* Release all resources associated with the task */
 	rpc_release_task(task);
 	return status;
@@ -809,15 +823,6 @@ void rpc_init_task(struct rpc_task *task
 		task->tk_flags |= RPC_TASK_NOINTR;
 	}
 
-#ifdef RPC_DEBUG
-	task->tk_magic = RPC_TASK_MAGIC_ID;
-	task->tk_pid = rpc_task_id++;
-#endif
-	/* Add to global list of all tasks */
-	spin_lock(&rpc_sched_lock);
-	list_add_tail(&task->tk_task, &all_tasks);
-	spin_unlock(&rpc_sched_lock);
-
 	BUG_ON(task->tk_ops == NULL);
 
 	/* starting timestamp */
@@ -870,16 +875,35 @@ cleanup:
 	goto out;
 }
 
-void rpc_release_task(struct rpc_task *task)
+
+void rpc_put_task(struct rpc_task *task)
 {
 	const struct rpc_call_ops *tk_ops = task->tk_ops;
 	void *calldata = task->tk_calldata;
 
+	if (!atomic_dec_and_test(&task->tk_count))
+		return;
+	/* Release resources */
+	if (task->tk_rqstp)
+		xprt_release(task);
+	if (task->tk_msg.rpc_cred)
+		rpcauth_unbindcred(task);
+	if (task->tk_client) {
+		rpc_release_client(task->tk_client);
+		task->tk_client = NULL;
+	}
+	if (task->tk_flags & RPC_TASK_DYNAMIC)
+		rpc_free_task(task);
+	if (tk_ops->rpc_release)
+		tk_ops->rpc_release(calldata);
+}
+EXPORT_SYMBOL(rpc_put_task);
+
+void rpc_release_task(struct rpc_task *task)
+{
 #ifdef RPC_DEBUG
 	BUG_ON(task->tk_magic != RPC_TASK_MAGIC_ID);
 #endif
-	if (!atomic_dec_and_test(&task->tk_count))
-		return;
 	dprintk("RPC: %4d release task\n", task->tk_pid);
 
 	/* Remove from global task list */
@@ -892,23 +916,13 @@ void rpc_release_task(struct rpc_task *t
 	/* Synchronously delete any running timer */
 	rpc_delete_timer(task);
 
-	/* Release resources */
-	if (task->tk_rqstp)
-		xprt_release(task);
-	if (task->tk_msg.rpc_cred)
-		rpcauth_unbindcred(task);
-	if (task->tk_client) {
-		rpc_release_client(task->tk_client);
-		task->tk_client = NULL;
-	}
-
 #ifdef RPC_DEBUG
 	task->tk_magic = 0;
 #endif
-	if (task->tk_flags & RPC_TASK_DYNAMIC)
-		rpc_free_task(task);
-	if (tk_ops->rpc_release)
-		tk_ops->rpc_release(calldata);
+	/* Wake up anyone who is waiting for task completion */
+	rpc_mark_complete_task(task);
+
+	rpc_put_task(task);
 }
 
 /**

--- linux-2.6.18.i686/net/sunrpc/sunrpc_syms.c.orig	2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.i686/net/sunrpc/sunrpc_syms.c	2006-12-12 12:07:23.419718000 -0500
@@ -33,7 +33,6 @@ EXPORT_SYMBOL(rpciod_down);
 EXPORT_SYMBOL(rpciod_up);
 EXPORT_SYMBOL(rpc_new_task);
 EXPORT_SYMBOL(rpc_wake_up_status);
-EXPORT_SYMBOL(rpc_release_task);
 
 /* RPC client functions */
 EXPORT_SYMBOL(rpc_create_client);
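
The core idea of the hunks above is to split task teardown into a
reference-counted "put" (safe to call from any completion path) and a
"release" that performs the one-time bookkeeping. The same pattern in
miniature, sketched with C11 atomics and illustrative names rather than
the real sunrpc structures:

#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	atomic_int refcount;	/* plays the role of tk_count */
	/* ... owned resources ... */
};

/* Drop one reference; tear down only when the last reference goes away.
 * This mirrors the new rpc_put_task(). */
static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return;
	/* last reference gone: release resources, then free */
	free(o);
}

/* One-time teardown by the owner: unlink from global bookkeeping and
 * wake waiters, then drop the owner's reference. This mirrors the
 * slimmed-down rpc_release_task(), which now ends in rpc_put_task(). */
static void obj_release(struct obj *o)
{
	/* ... remove from global task list, mark complete, wake waiters ... */
	obj_put(o);
}
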
Use RCU to ensure that we can safely call rpc_finish_wakeup after we've
called __rpc_do_wake_up_task. If not, there is a theoretical race in
which the rpc_task finishes executing and gets freed first.

Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>
---

--- linux-2.6.18.i686/fs/nfs/read.c.orig	2006-12-06 10:00:51.064520000 -0500
+++ linux-2.6.18.i686/fs/nfs/read.c	2006-12-12 13:17:54.873458000 -0500
@@ -67,13 +67,19 @@ struct nfs_read_data *nfs_readdata_alloc
 	return p;
 }
 
-static void nfs_readdata_free(struct nfs_read_data *p)
+static void nfs_readdata_rcu_free(struct rcu_head *head)
 {
+	struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
 	mempool_free(p, nfs_rdata_mempool);
 }
 
+static void nfs_readdata_free(struct nfs_read_data *rdata)
+{
+	call_rcu_bh(&rdata->task.u.tk_rcu, nfs_readdata_rcu_free);
+}
+
 void nfs_readdata_release(void *data)
 {
 	nfs_readdata_free(data);

--- linux-2.6.18.i686/fs/nfs/write.c.orig	2006-12-06 10:00:51.622225000 -0500
+++ linux-2.6.18.i686/fs/nfs/write.c	2006-12-12 13:17:55.826163000 -0500
@@ -102,13 +102,19 @@ struct nfs_write_data *nfs_commit_alloc(
 	return p;
 }
 
-void nfs_commit_free(struct nfs_write_data *p)
+void nfs_commit_rcu_free(struct rcu_head *head)
 {
+	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
 	mempool_free(p, nfs_commit_mempool);
 }
 
+void nfs_commit_free(struct nfs_write_data *wdata)
+{
+	call_rcu_bh(&wdata->task.u.tk_rcu, nfs_commit_rcu_free);
+}
+
 struct nfs_write_data *nfs_writedata_alloc(size_t len)
 {
 	unsigned int pagecount = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -131,13 +137,19 @@ struct nfs_write_data *nfs_writedata_all
 	return p;
 }
 
-static void nfs_writedata_free(struct nfs_write_data *p)
+static void nfs_writedata_rcu_free(struct rcu_head *head)
 {
+	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
 	mempool_free(p, nfs_wdata_mempool);
 }
 
+static void nfs_writedata_free(struct nfs_write_data *wdata)
+{
+	call_rcu_bh(&wdata->task.u.tk_rcu, nfs_writedata_rcu_free);
+}
+
 void nfs_writedata_release(void *wdata)
 {
 	nfs_writedata_free(wdata);
@@ -261,7 +273,7 @@ static int nfs_writepage_sync(struct nfs
 io_error:
 	nfs_end_data_update(inode);
 	end_page_writeback(page);
-	nfs_writedata_free(wdata);
+	nfs_writedata_release(wdata);
 	return written ? written : result;
 }
@@ -1048,7 +1060,7 @@ out_bad:
 	while (!list_empty(&list)) {
 		data = list_entry(list.next, struct nfs_write_data, pages);
 		list_del(&data->pages);
-		nfs_writedata_free(data);
+		nfs_writedata_release(data);
 	}
 	nfs_mark_request_dirty(req);
 	nfs_clear_page_writeback(req);

--- linux-2.6.18.i686/include/linux/sunrpc/sched.h.orig	2006-12-12 12:07:21.836308000 -0500
+++ linux-2.6.18.i686/include/linux/sunrpc/sched.h	2006-12-12 13:17:56.303868000 -0500
@@ -11,6 +11,7 @@
 #include <linux/timer.h>
 #include <linux/sunrpc/types.h>
+#include <linux/rcupdate.h>
 #include <linux/spinlock.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
@@ -85,6 +86,7 @@ struct rpc_task {
 	union {
 		struct work_struct	tk_work;	/* Async task work queue */
 		struct rpc_wait		tk_wait;	/* RPC wait */
+		struct rcu_head		tk_rcu;		/* for task deletion */
 	} u;
 
 	unsigned short		tk_timeouts;	/* maj timeouts */

--- linux-2.6.18.i686/include/linux/nfs_fs.h.orig	2006-12-12 11:57:47.355640000 -0500
+++ linux-2.6.18.i686/include/linux/nfs_fs.h	2006-12-12 13:17:56.070073000 -0500
@@ -437,11 +437,6 @@ extern int nfs_updatepage(struct file *
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
 
-#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
-struct nfs_write_data *nfs_commit_alloc(void);
-void nfs_commit_free(struct nfs_write_data *p);
-#endif
-
 /*
  * Try to write back everything synchronously (but check the
  * return value!)
@@ -449,6 +444,8 @@ void nfs_commit_free(struct nfs_write_da
 extern int nfs_sync_inode_wait(struct inode *, unsigned long, unsigned int, int);
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
 extern int nfs_commit_inode(struct inode *, int);
+extern struct nfs_write_data *nfs_commit_alloc(void);
+extern void nfs_commit_free(struct nfs_write_data *wdata);
 extern void nfs_commit_release(void *wdata);
 #else
 static inline int

--- linux-2.6.18.i686/net/sunrpc/sched.c.orig	2006-12-12 12:07:23.389718000 -0500
+++ linux-2.6.18.i686/net/sunrpc/sched.c	2006-12-12 13:17:56.963868000 -0500
@@ -444,16 +444,19 @@ __rpc_default_timer(struct rpc_task *tas
  */
 void rpc_wake_up_task(struct rpc_task *task)
 {
+	rcu_read_lock_bh();
 	if (rpc_start_wakeup(task)) {
 		if (RPC_IS_QUEUED(task)) {
 			struct rpc_wait_queue *queue = task->u.tk_wait.rpc_waitq;
 
-			spin_lock_bh(&queue->lock);
+			/* Note: we're already in a bh-safe context */
+			spin_lock(&queue->lock);
 			__rpc_do_wake_up_task(task);
-			spin_unlock_bh(&queue->lock);
+			spin_unlock(&queue->lock);
 		}
 		rpc_finish_wakeup(task);
 	}
+	rcu_read_unlock_bh();
 }
 
 /*
@@ -516,14 +519,16 @@ struct rpc_task * rpc_wake_up_next(struc
 	struct rpc_task	*task = NULL;
 
 	dprintk("RPC: wake_up_next(%p \"%s\")\n", queue, rpc_qname(queue));
-	spin_lock_bh(&queue->lock);
+	rcu_read_lock_bh();
+	spin_lock(&queue->lock);
 	if (RPC_IS_PRIORITY(queue))
 		task = __rpc_wake_up_next_priority(queue);
 	else {
 		task_for_first(task, &queue->tasks[0])
 			__rpc_wake_up_task(task);
 	}
-	spin_unlock_bh(&queue->lock);
+	spin_unlock(&queue->lock);
+	rcu_read_unlock_bh();
 
 	return task;
 }
@@ -539,7 +544,8 @@ void rpc_wake_up(struct rpc_wait_queue *
 	struct rpc_task *task, *next;
 	struct list_head *head;
 
-	spin_lock_bh(&queue->lock);
+	rcu_read_lock_bh();
+	spin_lock(&queue->lock);
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		list_for_each_entry_safe(task, next, head, u.tk_wait.list)
@@ -548,7 +554,8 @@ void rpc_wake_up(struct rpc_wait_queue *
 			break;
 		head--;
 	}
-	spin_unlock_bh(&queue->lock);
+	spin_unlock(&queue->lock);
+	rcu_read_unlock_bh();
 }
 
 /**
@@ -563,7 +570,8 @@ void rpc_wake_up_status(struct rpc_wait_
 	struct rpc_task *task, *next;
 	struct list_head *head;
 
-	spin_lock_bh(&queue->lock);
+	rcu_read_lock_bh();
+	spin_lock(&queue->lock);
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		list_for_each_entry_safe(task, next, head, u.tk_wait.list) {
@@ -574,7 +582,8 @@ void rpc_wake_up_status(struct rpc_wait_
 			break;
 		head--;
 	}
-	spin_unlock_bh(&queue->lock);
+	spin_unlock(&queue->lock);
+	rcu_read_unlock_bh();
 }
 
 /*
@@ -838,8 +847,9 @@ rpc_alloc_task(void)
 	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
 }
 
-static void rpc_free_task(struct rpc_task *task)
+static void rpc_free_task(struct rcu_head *rcu)
 {
+	struct rpc_task *task = container_of(rcu, struct rpc_task, u.tk_rcu);
 	dprintk("RPC: %4d freeing task\n", task->tk_pid);
 	mempool_free(task, rpc_task_mempool);
 }
@@ -893,7 +903,7 @@ void rpc_put_task(struct rpc_task *task)
 		task->tk_client = NULL;
 	}
 	if (task->tk_flags & RPC_TASK_DYNAMIC)
-		rpc_free_task(task);
+		call_rcu_bh(&task->u.tk_rcu, rpc_free_task);
 	if (tk_ops->rpc_release)
 		tk_ops->rpc_release(calldata);
 }
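
The RCU conversion above follows a standard kernel pattern: embed a
struct rcu_head in the object and defer the actual free with
call_rcu_bh(), so the memory can only disappear after every CPU has left
its bh-disabled RCU read-side critical section. A condensed sketch of the
pattern (kernel context and the 2.6.18-era call_rcu_bh() API assumed;
struct foo and the function names are illustrative):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* storage reused for the deferred free */
};

static void foo_rcu_free(struct rcu_head *head)
{
	/* Recover the enclosing object from its embedded rcu_head. */
	struct foo *f = container_of(head, struct foo, rcu);
	kfree(f);
}

static void foo_free(struct foo *f)
{
	/* A reader that entered rcu_read_lock_bh() before this call may
	 * still dereference f; kfree() runs only after all such readers
	 * have left their read-side critical sections. */
	call_rcu_bh(&f->rcu, foo_rcu_free);
}

This is also why the wakeup paths switch from spin_lock_bh() to
rcu_read_lock_bh() plus a plain spin_lock(): the bh-disabled RCU read
section is what keeps the rpc_task alive between __rpc_do_wake_up_task()
and rpc_finish_wakeup().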