Re: [PATCH] nfsd: prevent NULL ptr derefs on fault injection
On Tue, Nov 27, 2012 at 11:31:11AM -0500, Sasha Levin wrote: > A recent patch series has moved hashtable initialization to when the net > struct is initialized. > > When injecting faults, we tried accessing the hashtables even if the struct > wasn't really initialized (nfsd wasn't in use) - this caused a NULL ptr > deref. Thanks, adding Bryan to cc.--b. > > A simple test would be: > > echo 1 > /sys/kernel/debug/nfsd/forget_locks > > Signed-off-by: Sasha Levin > --- > fs/nfsd/netns.h | 3 +++ > fs/nfsd/nfs4state.c | 9 + > 2 files changed, 12 insertions(+) > > diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h > index 227b93e..c5806a57 100644 > --- a/fs/nfsd/netns.h > +++ b/fs/nfsd/netns.h > @@ -83,5 +83,8 @@ struct nfsd_net { > struct delayed_work laundromat_work; > }; > > +/* Simple check to find out if a given net was properly initialized */ > +#define nfsd_netns_ready(nn) ((nn)->sessionid_hashtbl) > + > extern int nfsd_net_id; > #endif /* __NFSD_NETNS_H__ */ > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index e75872f..0e7428c 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -4598,6 +4598,9 @@ void nfsd_forget_clients(u64 num) > int count = 0; > struct nfsd_net *nn = net_generic(current->nsproxy->net_ns, > nfsd_net_id); > > + if (!nfsd_netns_ready(nn)) > + return; > + > nfs4_lock_state(); > list_for_each_entry_safe(clp, next, &nn->client_lru, cl_lru) { > expire_client(clp); > @@ -4643,6 +4646,9 @@ void nfsd_forget_locks(u64 num) > int count; > struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > > + if (!nfsd_netns_ready(nn)) > + return; > + > nfs4_lock_state(); > count = nfsd_release_n_owners(num, false, release_lockowner_sop, nn); > nfs4_unlock_state(); > @@ -4655,6 +4661,9 @@ void nfsd_forget_openowners(u64 num) > int count; > struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > > + if (!nfsd_netns_ready(nn)) > + return; > + > nfs4_lock_state(); > count = nfsd_release_n_owners(num, true, release_openowner_sop, nn); > nfs4_unlock_state(); > -- > 1.8.0 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] nfsd4: remove state lock from nfs4_state_shutdown
OK, applying.--b. On Wed, Nov 21, 2012 at 06:07:38PM +0300, Stanislav Kinsbursky wrote: > Protection of __nfs4_state_shutdown() with nfs4_lock_state() looks redundant. > > This function is called by the last NFSd thread on it's exit and state lock > protects actually two functions (del_recall_lru is protected by recall_lock): > 1) nfsd4_client_tracking_exit > 2) __nfs4_state_shutdown_net > > "nfsd4_client_tracking_exit" doesn't require state lock protection, because > it's > state can be modified only by tracker callbacks. > Here a re they: > 1) create: is called only from nfsd4_proc_compound. > 2) remove: is called from either nfsd4_proc_compound or nfs4_laundromat. > 3) check: is called only from nfsd4_proc_compound. > 4) grace_done; called only from nfs4_laundromat. > > nfsd4_proc_compound is called onll by NFSd kthread, which is exiting right > now. > nfs4_laundromat is called by laundry_wq. But laundromat_work was canceled > already. > > "__nfs4_state_shutdown_net" also doesn't require state lock protection, > because all NFSd kthreads are dead, and no race can happen with NFSd start, > because "nfsd_up" flag is still set. > Moreover, all Nfsd shutdown is protected with global nfsd_mutex. > > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfsd/nfs4state.c |2 -- > 1 files changed, 0 insertions(+), 2 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index e75872f..d41cc71 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -4926,9 +4926,7 @@ nfs4_state_shutdown(void) > cancel_delayed_work_sync(&nn->laundromat_work); > destroy_workqueue(laundry_wq); > locks_end_grace(&nn->nfsd4_manager); > - nfs4_lock_state(); > __nfs4_state_shutdown(net); > - nfs4_unlock_state(); > nfsd4_destroy_callback_queue(); > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] nfsd: more NFSv4 state containerization
On Mon, Nov 26, 2012 at 03:21:49PM +0300, Stanislav Kinsbursky wrote: > This patch set makes NFSv4 state created and destroyed per net and thus > completes it's containerization (the only exceprtion is client_mutex, but this > one a hard nut to crack). > > Note: patch set depend on previously sent patch named "nfsd4: remove state > lock from nfs4_state_shutdown". Looks OK.--b. > > The following series implements... > > --- > > Stanislav Kinsbursky (5): > nfsd: make client_lock per net > nfsd: make delegations shutdown network namespace aware > nfsd: cleanup NFSd state shutdown a bit > nfsd: cleanup NFSd state start a bit > nfsd: call state init and shutdown twice > > > fs/nfsd/netns.h |3 + > fs/nfsd/nfs4state.c | 137 > +-- > fs/nfsd/nfsd.h |4 + > fs/nfsd/nfssvc.c| 15 +- > 4 files changed, 99 insertions(+), 60 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] nfsd: containerise NFSv4 recovery state
On Mon, Nov 26, 2012 at 04:16:18PM +0300, Stanislav Kinsbursky wrote: > This patch set makes rec_file and in_grace global static variables allocated > per network namespace. > Both of them are a part of client tracking engine, which is containerised > already. Thanks, applying for 3.8.--b. > > The following series implements... > > --- > > Stanislav Kinsbursky (2): > nfsd: recovery - make rec_file per net > nfsd: recovery - make in_grace per net > > > fs/nfsd/netns.h |3 ++ > fs/nfsd/nfs4recover.c | 79 > - > 2 files changed, 42 insertions(+), 40 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] nfsd: make NFSv4 lease time per net
On Tue, Nov 27, 2012 at 02:11:44PM +0300, Stanislav Kinsbursky wrote: > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > index dab350d..4930981 100644 > --- a/fs/nfsd/nfsctl.c > +++ b/fs/nfsd/nfsctl.c > @@ -912,7 +912,8 @@ static ssize_t nfsd4_write_time(struct file *file, char > *buf, size_t size, time_ > */ > static ssize_t write_leasetime(struct file *file, char *buf, size_t size) > { > - return nfsd4_write_time(file, buf, size, &nfsd4_lease); > + struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > + return nfsd4_write_time(file, buf, size, &nn->nfsd4_lease); This is called in the context of whatever process writes to nfsv4leasetime, so should be using its network namespace, right? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] nfsd: make NFSv4 grace time per net
On Tue, Nov 27, 2012 at 02:11:49PM +0300, Stanislav Kinsbursky wrote: > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > index 4930981..7e41bd1 100644 > --- a/fs/nfsd/nfsctl.c > +++ b/fs/nfsd/nfsctl.c > @@ -928,7 +928,8 @@ static ssize_t write_leasetime(struct file *file, char > *buf, size_t size) > */ > static ssize_t write_gracetime(struct file *file, char *buf, size_t size) > { > - return nfsd4_write_time(file, buf, size, &nfsd4_grace); > + struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > + return nfsd4_write_time(file, buf, size, &nn->nfsd4_grace); > } Same comment as for lease time.--b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] nfsd: remove redundant declarations
On Tue, Nov 27, 2012 at 02:42:20PM +0300, Stanislav Kinsbursky wrote: > This is a cleanup patch. Functions nfsd_pool_stats_open() and > nfsd_pool_stats_release() are declared in fs/nfsd/nfsd.h. OK, applying. --b. > > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfsd/nfsctl.c |3 --- > 1 files changed, 0 insertions(+), 3 deletions(-) > > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > index 7e41bd1..d902f83 100644 > --- a/fs/nfsd/nfsctl.c > +++ b/fs/nfsd/nfsctl.c > @@ -186,9 +186,6 @@ static struct file_operations supported_enctypes_ops = { > }; > #endif /* CONFIG_SUNRPC_GSS or CONFIG_SUNRPC_GSS_MODULE */ > > -extern int nfsd_pool_stats_open(struct inode *inode, struct file *file); > -extern int nfsd_pool_stats_release(struct inode *inode, struct file *file); > - > static const struct file_operations pool_stats_operations = { > .open = nfsd_pool_stats_open, > .read = seq_read, > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] nfsd: make NFSv4 lease time per net
On Wed, Nov 28, 2012 at 07:12:03PM +0400, Stanislav Kinsbursky wrote: > 28.11.2012 19:09, J. Bruce Fields пишет: > >On Tue, Nov 27, 2012 at 02:11:44PM +0300, Stanislav Kinsbursky wrote: > >>diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > >>index dab350d..4930981 100644 > >>--- a/fs/nfsd/nfsctl.c > >>+++ b/fs/nfsd/nfsctl.c > >>@@ -912,7 +912,8 @@ static ssize_t nfsd4_write_time(struct file *file, char > >>*buf, size_t size, time_ > >> */ > >> static ssize_t write_leasetime(struct file *file, char *buf, size_t size) > >> { > >>- return nfsd4_write_time(file, buf, size, &nfsd4_lease); > >>+ struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > >>+ return nfsd4_write_time(file, buf, size, &nn->nfsd4_lease); > > > >This is called in the context of whatever process writes to > >nfsv4leasetime, so should be using its network namespace, right? > > > > This is, actually, a interim solution to preserve existent logic. > I.e. I'm going to convert "nfsd" filesystem into per-net one (like > rpc_pipefs). I, actually, already done it in my tree. > Thus proper network namespace will be taken from nfsd superblock. OK, remind me how that works? It's mounted just once, but each network namespace gets a different view of the filesystem? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] nfsd: make NFSv4 lease time per net
On Wed, Nov 28, 2012 at 07:35:20PM +0400, Stanislav Kinsbursky wrote: > 28.11.2012 19:15, J. Bruce Fields пишет: > >On Wed, Nov 28, 2012 at 07:12:03PM +0400, Stanislav Kinsbursky wrote: > >>28.11.2012 19:09, J. Bruce Fields пишет: > >>>On Tue, Nov 27, 2012 at 02:11:44PM +0300, Stanislav Kinsbursky wrote: > >>>>diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > >>>>index dab350d..4930981 100644 > >>>>--- a/fs/nfsd/nfsctl.c > >>>>+++ b/fs/nfsd/nfsctl.c > >>>>@@ -912,7 +912,8 @@ static ssize_t nfsd4_write_time(struct file *file, > >>>>char *buf, size_t size, time_ > >>>> */ > >>>> static ssize_t write_leasetime(struct file *file, char *buf, size_t > >>>> size) > >>>> { > >>>>- return nfsd4_write_time(file, buf, size, &nfsd4_lease); > >>>>+ struct nfsd_net *nn = net_generic(&init_net, nfsd_net_id); > >>>>+ return nfsd4_write_time(file, buf, size, &nn->nfsd4_lease); > >>> > >>>This is called in the context of whatever process writes to > >>>nfsv4leasetime, so should be using its network namespace, right? > >>> > >> > >>This is, actually, a interim solution to preserve existent logic. > >>I.e. I'm going to convert "nfsd" filesystem into per-net one (like > >>rpc_pipefs). I, actually, already done it in my tree. > >>Thus proper network namespace will be taken from nfsd superblock. > > > >OK, remind me how that works? It's mounted just once, but each network > >namespace gets a different view of the filesystem? > > > > Nope. It's a single mount point, but per-namespace (network, in our case) - > not globally. > Pointer to namespace will be placed on sb->s_fs_info. OK, anyway, applying these. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET] idr: implement idr_alloc() and convert existing users
On Mon, Feb 04, 2013 at 09:11:28AM -0800, Tejun Heo wrote: > On Mon, Feb 04, 2013 at 09:10:31AM -0800, Tejun Heo wrote: > > Yeah, that should be easily convertable to the new interface. How > > should we route these changes? Your patch can go through the usual > > nfs channel and the conversion and deprecation patches can be held off > > until after -rc1. Does that sound okay? > > Ooh, BTW, the cyclic allocation is broken. It's prone to -ENOSPC > after the first wraparound. There are several cyclic users in the > kernel and I think it probably would be best to implement cyclic > support in idr. Are you looking at this, by the way, or do you have any suggestions? Wondering if it's something I should try to fix or if I should look into converting to some other data structure here. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Sat, Feb 02, 2013 at 07:57:19AM -0500, J. Bruce Fields wrote: > On Sat, Feb 02, 2013 at 01:04:03PM +1100, Stephen Rothwell wrote: > > Hi, > > > > After merging the nfsd tree, today's linux-next build (x86_64 > > allmodconfig) failed like this: > > > > fs/nfs/dns_resolve.c: In function 'nfs_dns_resolver_cache_init': > > fs/nfs/dns_resolve.c:375:4: error: 'struct cache_detail' has no member > > named 'cache_upcall' > > fs/nfs/dns_resolve.c:375:35: warning: left-hand operand of comma expression > > has no effect [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > > [-Wunused-value] > > > > Caused by commit aab982cb5dfb ("SUNRPC: remove cache_detail->cache_upcall > > callback"). > > Yes, I knew why we'd introduced cache_upcall, so I'm not sure how I > overlooked that. It must have slipped through testing because I didn't > set CONFIG_NFS_USE_KERNEL_DNS. > > We may just be able to revert that patch I can take care of that by > tomorrow. Stanislav, any objections to this? --b. commit 8a55a995d3bada0cd0fc67be92a9e389edf709f8 Author: J. Bruce Fields Date: Sat Feb 2 07:49:01 2013 -0500 Revert "SUNRPC: remove cache_detail->cache_upcall callback" This reverts commit aab982cb5dfb13405236009a6fa24807ee93f62c "SUNRPC: remove cache_detail->cache_upcall callback", which overlooked a cache_upcall user in fs/nfs/dns_resolve.c. Also note 517eb5600b455b608fb4cf38e403909a82c76100 "SUNRPC: introduce cache_detail->cache_request callback" missed initializing ->cache_request in the same file. Reported-by: Stephen Rothwell Cc: Stanislav Kinsbursky Signed-off-by: J. Bruce Fields diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c index 0ff1f4c..794b237 100644 --- a/fs/nfs/dns_resolve.c +++ b/fs/nfs/dns_resolve.c @@ -373,6 +373,7 @@ int nfs_dns_resolver_cache_init(struct net *net) cd->name = "dns_resolve", cd->cache_put = nfs_dns_ent_put, cd->cache_upcall = nfs_dns_upcall, + cd->cache_request = nfs_dns_request; cd->cache_parse = nfs_dns_parse, cd->cache_show = nfs_dns_show, cd->match = nfs_dns_match, diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c index 30a572d..b1e367d 100644 --- a/fs/nfsd/export.c +++ b/fs/nfsd/export.c @@ -67,6 +67,11 @@ static void expkey_request(struct cache_detail *cd, (*bpp)[-1] = '\n'; } +static int expkey_upcall(struct cache_detail *cd, struct cache_head *h) +{ + return sunrpc_cache_pipe_upcall(cd, h); +} + static struct svc_expkey *svc_expkey_update(struct cache_detail *cd, struct svc_expkey *new, struct svc_expkey *old); static struct svc_expkey *svc_expkey_lookup(struct cache_detail *cd, struct svc_expkey *); @@ -240,6 +245,7 @@ static struct cache_detail svc_expkey_cache_template = { .hash_size = EXPKEY_HASHMAX, .name = "nfsd.fh", .cache_put = expkey_put, + .cache_upcall = expkey_upcall, .cache_request = expkey_request, .cache_parse= expkey_parse, .cache_show = expkey_show, @@ -333,6 +339,11 @@ static void svc_export_request(struct cache_detail *cd, (*bpp)[-1] = '\n'; } +static int svc_export_upcall(struct cache_detail *cd, struct cache_head *h) +{ + return sunrpc_cache_pipe_upcall(cd, h); +} + static struct svc_export *svc_export_update(struct svc_export *new, struct svc_export *old); static struct svc_export *svc_export_lookup(struct svc_export *); @@ -702,6 +713,7 @@ static struct cache_detail svc_export_cache_template = { .hash_size = EXPORT_HASHMAX, .name = "nfsd.export", .cache_put = svc_export_put, + .cache_upcall = svc_export_upcall, .cache_request = svc_export_request, .cache_parse= svc_export_parse, .cache_show = svc_export_show, diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c index d9402ea..d863cd8 100644 --- a/fs/nfsd/nfs4idmap.c +++ b/fs/nfsd/nfs4idmap.c @@ -140,6 +140,12 @@ idtoname_request(struct cache_detail *cd, struc
Re: [PATCHSET] idr: implement idr_alloc() and convert existing users
On Sat, Feb 02, 2013 at 05:20:01PM -0800, Tejun Heo wrote: > * Bruce, I couldn't convert nfsd. Can you please help? More on it > later. ... > I converted all in-kernel users except nfsd and staging drivers. nfsd > splits preloading and actual id allocation in a way that per-cpu > preloading can't be used. I couldn't follow the control flow to > verify whether the current code is correct either. I think the best > way would be allocating ID upfront without installing the handle and > then later using idr_replace() to install the pointer when the ID > actually gets used. Bruce, would something like that be possible? Actually, I'm not even sure if that's necessary, we can probably just do it all at the start. I'll try to have a patch doing that tomorrow. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET] idr: implement idr_alloc() and convert existing users
On Sun, Feb 03, 2013 at 12:02:41PM -0500, J. Bruce Fields wrote: > On Sat, Feb 02, 2013 at 05:20:01PM -0800, Tejun Heo wrote: > > * Bruce, I couldn't convert nfsd. Can you please help? More on it > > later. > ... > > I converted all in-kernel users except nfsd and staging drivers. nfsd > > splits preloading and actual id allocation in a way that per-cpu > > preloading can't be used. I couldn't follow the control flow to > > verify whether the current code is correct either. I think the best > > way would be allocating ID upfront without installing the handle and > > then later using idr_replace() to install the pointer when the ID > > actually gets used. Bruce, would something like that be possible? > > Actually, I'm not even sure if that's necessary, we can probably just > do it all at the start. > > I'll try to have a patch doing that tomorrow. So, something like this. --b. commit 9847160469b40345e1d78a7cbf9536761bb47d91 Author: J. Bruce Fields Date: Sun Feb 3 12:23:01 2013 -0500 nfsd4: simplify idr allocation We don't really need to preallocate at all; just allocate and initialize everything at once, but leave the sc_type field initially 0 to prevent finding the stateid till it's fully initialized. Signed-off-by: J. Bruce Fields diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 4db46aa..485b1f7 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -261,33 +261,46 @@ static inline int get_new_stid(struct nfs4_stid *stid) return new_stid; } -static void init_stid(struct nfs4_stid *stid, struct nfs4_client *cl, unsigned char type) +static struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl, struct +kmem_cache *slab) { - stateid_t *s = &stid->sc_stateid; + struct idr *stateids = &cl->cl_stateids; + static int min_stateid = 0; + struct nfs4_stid *stid; int new_id; - stid->sc_type = type; + stid = kmem_cache_alloc(slab, GFP_KERNEL); + if (!stid) + return NULL; + + if (!idr_pre_get(stateids, GFP_KERNEL)) + goto out_free; + if (idr_get_new_above(stateids, stid, min_stateid, &new_id)) + goto out_free; stid->sc_client = cl; - s->si_opaque.so_clid = cl->cl_clientid; - new_id = get_new_stid(stid); - s->si_opaque.so_id = (u32)new_id; + stid->sc_type = 0; + stid->sc_stateid.si_opaque.so_id = new_id; + stid->sc_stateid.si_opaque.so_clid = cl->cl_clientid; /* Will be incremented before return to client: */ - s->si_generation = 0; -} + stid->sc_stateid.si_generation = 0; -static struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl, struct kmem_cache *slab) -{ - struct idr *stateids = &cl->cl_stateids; - - if (!idr_pre_get(stateids, GFP_KERNEL)) - return NULL; /* -* Note: if we fail here (or any time between now and the time -* we actually get the new idr), we won't need to undo the idr -* preallocation, since the idr code caps the number of -* preallocated entries. +* It shouldn't be a problem to reuse an opaque stateid value. +* I don't think it is for 4.1. But with 4.0 I worry that, for +* example, a stray write retransmission could be accepted by +* the server when it should have been rejected. Therefore, +* adopt a trick from the sctp code to attempt to maximize the +* amount of time until an id is reused, by ensuring they always +* "increase" (mod INT_MAX): */ - return kmem_cache_alloc(slab, GFP_KERNEL); + + min_stateid = new_id+1; + if (min_stateid == INT_MAX) + min_stateid = 0; + return stid; +out_free: + kfree(stid); + return NULL; } static struct nfs4_ol_stateid * nfs4_alloc_stateid(struct nfs4_client *clp) @@ -316,7 +329,7 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct sv dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab)); if (dp == NULL) return dp; - init_stid(&dp->dl_stid, clp, NFS4_DELEG_STID); + dp->dl_stid.sc_type = NFS4_DELEG_STID; /* * delegation seqid's are never incremented. The 4.1 special * meaning of seqid 0 isn't meaningful, really, but let's avoid @@ -337,13 +350,21 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct sv return dp; } +void free_stid(struct nfs4_stid *s, struct kmem_cache *slab) +{ + struct idr *stateids = &s->sc_client->cl_stateids; + + idr_remove(stateids, s->sc_stateid.si_opaque.so_id); + kmem_cache_free(slab, s); +} + void nfs4_put
Re: [PATCH v2 0/6] SUNRPC: rework cache upcall to avoid NFSd root
On Mon, Feb 04, 2013 at 02:02:29PM +0300, Stanislav Kinsbursky wrote: > swapping > > The main idea of this patch set is to call cache request not on kthread > upcall, but on userspace daemon cache_read call. This fixes the problem with > gaining of wrong dentry path after calling d_path() in kthread root context > (svc_export_request() callback), which always work in init root context, but > containers can work in "root jail" - i.e. have it's own nested root. > > v2: > 1) NFS DNS cache update wasn't done in the firest version. So this patch set > does preparation cleanup of the NFS DNS cache routines. > 2) Also, this patch set doesn't remove cache_upcall helper anymore, because > it's still required for NFS DNS cache. Argh--I really prefer incremental patches once I've already committed something, but OK. Backing out the old patches, I'll take a look at these. The first two should probably get an ACK from Trond. --b. > > The following series implements... > > --- > > Stanislav Kinsbursky (6): > NFS: use SUNRPC cache creation and destruction helper for DNS cache > NFS; simlify and clean cache library > SUNRPC: introduce cache_detail->cache_request callback > SUNRPC: rework cache upcall logic > SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() > function > SUNRPC: move cache_detail->cache_request callback call to cache_read() > > > fs/nfs/cache_lib.c| 12 ++- > fs/nfs/cache_lib.h|2 - > fs/nfs/dns_resolve.c | 65 > +++-- > fs/nfsd/export.c | 14 +--- > fs/nfsd/nfs4idmap.c | 16 + > include/linux/sunrpc/cache.h | 10 +++--- > net/sunrpc/auth_gss/svcauth_gss.c |8 + > net/sunrpc/cache.c| 49 > net/sunrpc/svcauth_unix.c | 14 +--- > 9 files changed, 69 insertions(+), 121 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] SUNRPC: remove cache_detail->cache_upcall callback
On Mon, Feb 04, 2013 at 11:50:02AM -0500, Jeff Layton wrote: > On Tue, 15 Jan 2013 11:09:36 +0300 > Stanislav Kinsbursky wrote: > > > This callback is redundant since all that its' implementations are doing is > > calling sunrpc_cache_pipe_upcall() with proper function address argument. > > This > > function address is now stored on cache_detail structure and thus all the > > code > > can be simplified. > > > > Signed-off-by: Stanislav Kinsbursky > > --- > > fs/nfsd/export.c | 12 > > fs/nfsd/nfs4idmap.c | 14 -- > > include/linux/sunrpc/cache.h |3 --- > > net/sunrpc/auth_gss/svcauth_gss.c |7 --- > > net/sunrpc/cache.c|6 +++--- > > net/sunrpc/svcauth_unix.c | 12 > > 6 files changed, 3 insertions(+), 51 deletions(-) > > > > diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c > > index 63ee3bc..6f030a5 100644 > > --- a/fs/nfsd/export.c > > +++ b/fs/nfsd/export.c > > @@ -67,11 +67,6 @@ static void expkey_request(struct cache_detail *cd, > > (*bpp)[-1] = '\n'; > > } > > > > -static int expkey_upcall(struct cache_detail *cd, struct cache_head *h) > > -{ > > - return sunrpc_cache_pipe_upcall(cd, h, cd->cache_request); > > -} > > - > > static struct svc_expkey *svc_expkey_update(struct cache_detail *cd, > > struct svc_expkey *new, > > struct svc_expkey *old); > > static struct svc_expkey *svc_expkey_lookup(struct cache_detail *cd, > > struct svc_expkey *); > > @@ -245,7 +240,6 @@ static struct cache_detail svc_expkey_cache_template = { > > .hash_size = EXPKEY_HASHMAX, > > .name = "nfsd.fh", > > .cache_put = expkey_put, > > - .cache_upcall = expkey_upcall, > > .cache_request = expkey_request, > > .cache_parse= expkey_parse, > > .cache_show = expkey_show, > > @@ -338,11 +332,6 @@ static void svc_export_request(struct cache_detail *cd, > > (*bpp)[-1] = '\n'; > > } > > > > -static int svc_export_upcall(struct cache_detail *cd, struct cache_head *h) > > -{ > > - return sunrpc_cache_pipe_upcall(cd, h, cd->cache_request); > > -} > > - > > static struct svc_export *svc_export_update(struct svc_export *new, > > struct svc_export *old); > > static struct svc_export *svc_export_lookup(struct svc_export *); > > @@ -712,7 +701,6 @@ static struct cache_detail svc_export_cache_template = { > > .hash_size = EXPORT_HASHMAX, > > .name = "nfsd.export", > > .cache_put = svc_export_put, > > - .cache_upcall = svc_export_upcall, > > .cache_request = svc_export_request, > > .cache_parse= svc_export_parse, > > .cache_show = svc_export_show, > > diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c > > index 9033dfd..d9402ea 100644 > > --- a/fs/nfsd/nfs4idmap.c > > +++ b/fs/nfsd/nfs4idmap.c > > @@ -140,12 +140,6 @@ idtoname_request(struct cache_detail *cd, struct > > cache_head *ch, char **bpp, > > } > > > > static int > > -idtoname_upcall(struct cache_detail *cd, struct cache_head *ch) > > -{ > > - return sunrpc_cache_pipe_upcall(cd, ch, cd->cache_request); > > -} > > - > > -static int > > idtoname_match(struct cache_head *ca, struct cache_head *cb) > > { > > struct ent *a = container_of(ca, struct ent, h); > > @@ -192,7 +186,6 @@ static struct cache_detail idtoname_cache_template = { > > .hash_size = ENT_HASHMAX, > > .name = "nfs4.idtoname", > > .cache_put = ent_put, > > - .cache_upcall = idtoname_upcall, > > .cache_request = idtoname_request, > > .cache_parse= idtoname_parse, > > .cache_show = idtoname_show, > > @@ -322,12 +315,6 @@ nametoid_request(struct cache_detail *cd, struct > > cache_head *ch, char **bpp, > > } > > > > static int > > -nametoid_upcall(struct cache_detail *cd, struct cache_head *ch) > > -{ > > - return sunrpc_cache_pipe_upcall(cd, ch, cd->cache_request); > > -} > > - > > -static int > > nametoid_match(struct cache_head *ca, struct cache_head *cb) > > { > > struct ent *a = container_of(ca, struct ent, h); > > @@ -366,7 +353,6 @@ static struct cache_detail nametoid_cache_template = { > > .hash_size = ENT_HASHMAX, > > .name = "nfs4.nametoid", > > .cache_put = ent_put, > > - .cache_upcall = nametoid_upcall, > > .cache_request = nametoid_request, > > .cache_parse= nametoid_parse, > > .cache_show = nametoid_show, > > diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h > > index 4f1c858..249e73a 100644 > > --- a/include/linux/sunrpc/cache.h > > +++ b/include/linux/sunrpc/cache.h > > @@ -80,9 +80,6 @@ struct cache_detail { > > char*name; > > void(*cache_put)(struct kref *); > > > > - int (*cache_upcall)(struct cache_detail *, > > -
Re: [PATCHSET] idr: implement idr_alloc() and convert existing users
On Mon, Feb 04, 2013 at 09:10:31AM -0800, Tejun Heo wrote: > Hey, > > On Sun, Feb 03, 2013 at 07:15:58PM -0500, J. Bruce Fields wrote: > > On Sun, Feb 03, 2013 at 12:02:41PM -0500, J. Bruce Fields wrote: > > > On Sat, Feb 02, 2013 at 05:20:01PM -0800, Tejun Heo wrote: > > > > * Bruce, I couldn't convert nfsd. Can you please help? More on it > > > > later. > > > ... > > > > I converted all in-kernel users except nfsd and staging drivers. nfsd > > > > splits preloading and actual id allocation in a way that per-cpu > > > > preloading can't be used. I couldn't follow the control flow to > > > > verify whether the current code is correct either. I think the best > > > > way would be allocating ID upfront without installing the handle and > > > > then later using idr_replace() to install the pointer when the ID > > > > actually gets used. Bruce, would something like that be possible? > > > > > > Actually, I'm not even sure if that's necessary, we can probably just > > > do it all at the start. > > > > > > I'll try to have a patch doing that tomorrow. > > > > So, something like this. > > Yeah, that should be easily convertable to the new interface. How > should we route these changes? Your patch can go through the usual > nfs channel and the conversion and deprecation patches can be held off > until after -rc1. Does that sound okay? Whatever's easiest for you. That sounds fine to me. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET] idr: implement idr_alloc() and convert existing users
On Mon, Feb 04, 2013 at 09:11:28AM -0800, Tejun Heo wrote: > On Mon, Feb 04, 2013 at 09:10:31AM -0800, Tejun Heo wrote: > > Yeah, that should be easily convertable to the new interface. How > > should we route these changes? Your patch can go through the usual > > nfs channel and the conversion and deprecation patches can be held off > > until after -rc1. Does that sound okay? > > Ooh, BTW, the cyclic allocation is broken. It's prone to -ENOSPC > after the first wraparound. There are several cyclic users in the > kernel and I think it probably would be best to implement cyclic > support in idr. Ugh. OK, well it'll take some work to wrap that around, so it's probably not urgent for nfsd. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Tue, Feb 05, 2013 at 03:45:31PM +0400, Pavel Shilovsky wrote: > 2013/1/31 J. Bruce Fields : > > On Thu, Jan 17, 2013 at 08:52:59PM +0400, Pavel Shilovsky wrote: > >> If O_DENYMAND flag is specified, O_DENYREAD/WRITE/MAND flags are > >> translated to flock's flags: > >> > >> !O_DENYREAD -> LOCK_READ > >> !O_DENYWRITE -> LOCK_WRITE > >> O_DENYMAND -> LOCK_MAND > >> > >> and set through flock_lock_file on a file. > >> > >> This change only affects opens that use O_DENYMAND flag - all other > >> native Linux opens don't care about these flags. It allow us to > >> enable this feature for applications that need it (e.g. NFS and > >> Samba servers that export the same directory for Windows clients, > >> or Wine applications that access the same files simultaneously). > > > > The use of an is_conflict callback seems unnecessarily convoluted. > > > > If we need two different behaviors, let's just use another flag (or an > > extra boolean argument if we need to, or something). > > Ok, we can pass "bool is_mand" to flock_lock_file that will pass it > further to flock_locks_conflict. > > > > > The only caller for this new deny_lock_file is in the nfs code--I'm a > > little unclear why that is. > > deny_lock_file is called not only in the nfs code but also in 2 places > of fs/namei.c -- that enable this logic for VFS. Oops, apologies, I overlooked those somehow. What prevents somebody else from grabbing a lock on a newly-created file before we grab our own lock? I couldn't tell on a quick look whether we hold some lock that prevents that. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/8] Add O_DENY* support for VFS and CIFS/NFS
On Tue, Feb 05, 2013 at 03:33:21PM +0400, Pavel Shilovsky wrote: > 2013/1/31 J. Bruce Fields : > > On Thu, Jan 17, 2013 at 08:52:09PM +0400, Pavel Shilovsky wrote: > >> This patchset adds support of O_DENY* flags for Linux fs layer. These > >> flags can be used by any application that needs share reservations to > >> organize a file access. VFS already has some sort of this capability - now > >> it's done through flock/LOCK_MAND mechanis, but that approach is > >> non-atomic. This patchset build new capabilities on top of the existing > >> one but doesn't bring any changes into the flock call semantic. > >> > >> These flags can be used by NFS (built-in-kernel) and CIFS (Samba) servers > >> and Wine applications through VFS (for local filesystems) or CIFS/NFS > >> modules. This will help when e.g. Samba and NFS server share the same > >> directory for Windows and Linux users or Wine applications use Samba/NFS > >> share to access the same data from different clients. > >> > >> According to the previous discussions the most problematic question is how > >> to prevent situations like DoS attacks where e.g /lib/liba.so file can be > >> open with DENYREAD, or smth like this. That's why one extra flag > >> O_DENYMAND is added. It indicates to underlying layer that an application > >> want to use O_DENY* flags semantic. It allows us not affect native Linux > >> applications (that don't use O_DENYMAND flag) - so, these flags (and the > >> semantic of open syscall that they bring) are used only for those > >> applications that really want it proccessed that way. > > > > Maybe that's good enough. A mount flag might be simpler and give > > consistent enforcement for all users. > > > >> > >> So, we have four new flags: > >> O_DENYREAD - to prevent other opens with read access, > >> O_DENYWRITE - to prevent other opens with write access, > >> O_DENYDELETE - to prevent delete operations (this flag is not implemented > >> in VFS and NFS part and only suitable for CIFS module), > >> O_DENYMAND - to switch on/off three flags above. > > > > It would be useful to have some really careful documentation of how > > these are meant to work. Maybe try updating the open man page? > > Yes, that's a good idea. Do you mean smth like this? > > O_DENYMAND - used to inforce a mandatory share reservation scheme of > the file access. If this flag is passed, > the open fails with -ETXTBSY in following cases: > > 1) if O_DENYREAD flag is specified and there is another open with > O_DENYMAND flag and READ access to the file; > 2) if O_DENYWRITE flag is specified and there is another open with > O_DENYMAND flag and WRITE access to the file; > 3) if READ access is requested and there is another open with > O_DENYMAND and O_DENYREAD flags; > 4) if WRITE access is requested and there is another open with > O_DENYMAND and O_DENYWRITE flags; > > Also, if O_DENYDELETE flag is specified and the open succeded, any > further unlink operation will fail with -ETXTBSY untill this open is > closed. Now this flag is processed by CIFS filesystems only. Do you need to document interactions with flock() as well? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Thu, Feb 07, 2013 at 01:53:46PM +0400, Pavel Shilovsky wrote: > 2013/2/5 J. Bruce Fields : > > On Tue, Feb 05, 2013 at 03:45:31PM +0400, Pavel Shilovsky wrote: > >> 2013/1/31 J. Bruce Fields : > >> > On Thu, Jan 17, 2013 at 08:52:59PM +0400, Pavel Shilovsky wrote: > >> >> If O_DENYMAND flag is specified, O_DENYREAD/WRITE/MAND flags are > >> >> translated to flock's flags: > >> >> > >> >> !O_DENYREAD -> LOCK_READ > >> >> !O_DENYWRITE -> LOCK_WRITE > >> >> O_DENYMAND -> LOCK_MAND > >> >> > >> >> and set through flock_lock_file on a file. > >> >> > >> >> This change only affects opens that use O_DENYMAND flag - all other > >> >> native Linux opens don't care about these flags. It allow us to > >> >> enable this feature for applications that need it (e.g. NFS and > >> >> Samba servers that export the same directory for Windows clients, > >> >> or Wine applications that access the same files simultaneously). > >> > > >> > The use of an is_conflict callback seems unnecessarily convoluted. > >> > > >> > If we need two different behaviors, let's just use another flag (or an > >> > extra boolean argument if we need to, or something). > >> > >> Ok, we can pass "bool is_mand" to flock_lock_file that will pass it > >> further to flock_locks_conflict. > >> > >> > > >> > The only caller for this new deny_lock_file is in the nfs code--I'm a > >> > little unclear why that is. > >> > >> deny_lock_file is called not only in the nfs code but also in 2 places > >> of fs/namei.c -- that enable this logic for VFS. > > > > Oops, apologies, I overlooked those somehow. > > > > What prevents somebody else from grabbing a lock on a newly-created file > > before we grab our own lock? > > > > I couldn't tell on a quick look whether we hold some lock that prevents > > that. > > Nothing prevents it. If somebody grabbed a share mode lock on a file > before we call deny_lock_file, we simply close this file and return > -ETXTBSY. But leave the newly-created file there--ugh. > We can't grab it before atomic_open because we don't have an > inode there. If you can get the lock while still holding the directory i_mutex can't you prevent anyone else from looking up the new file until you've gotten the lock? --b. > Anyway, we can't make it atomic for VFS without big code > changes, but for CIFS and NFS it is already atomic with the discussed > patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Thu, Feb 07, 2013 at 06:32:38PM +0400, Pavel Shilovsky wrote: > 2013/2/7 J. Bruce Fields : > > On Thu, Feb 07, 2013 at 01:53:46PM +0400, Pavel Shilovsky wrote: > >> Nothing prevents it. If somebody grabbed a share mode lock on a file > >> before we call deny_lock_file, we simply close this file and return > >> -ETXTBSY. > > > > But leave the newly-created file there--ugh. > > > >> We can't grab it before atomic_open because we don't have an > >> inode there. > > > > If you can get the lock while still holding the directory i_mutex can't > > you prevent anyone else from looking up the new file until you've gotten > > the lock? > > > > Hm..., seems you are right, I missed this part: > mutex_lock > lookup_open -> atomic_open -> deny_lock_file > mutex_unlock > > that means that nobody can open and of course set flock on the newly > created file (because flock is done through file descriptor). So, it > should be fine to call flock after f_ops->atomic_open in atomic_open > function. Thanks. Whether that works may also depend on how the new dentry is set up? If it's hashed before you call flock then I suppose it's already visible to others. Not knowing that code as well as I should, I might test by introducing an artificial delay there and trying to reproduce the race. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Thu, Feb 07, 2013 at 08:00:13PM +0400, Pavel Shilovsky wrote: > 2013/2/7 J. Bruce Fields : > > On Thu, Feb 07, 2013 at 06:32:38PM +0400, Pavel Shilovsky wrote: > >> 2013/2/7 J. Bruce Fields : > >> > On Thu, Feb 07, 2013 at 01:53:46PM +0400, Pavel Shilovsky wrote: > >> >> Nothing prevents it. If somebody grabbed a share mode lock on a file > >> >> before we call deny_lock_file, we simply close this file and return > >> >> -ETXTBSY. > >> > > >> > But leave the newly-created file there--ugh. > >> > > >> >> We can't grab it before atomic_open because we don't have an > >> >> inode there. > >> > > >> > If you can get the lock while still holding the directory i_mutex can't > >> > you prevent anyone else from looking up the new file until you've gotten > >> > the lock? > >> > > >> > >> Hm..., seems you are right, I missed this part: > >> mutex_lock > >> lookup_open -> atomic_open -> deny_lock_file > >> mutex_unlock > >> > >> that means that nobody can open and of course set flock on the newly > >> created file (because flock is done through file descriptor). So, it > >> should be fine to call flock after f_ops->atomic_open in atomic_open > >> function. Thanks. > > > > Whether that works may also depend on how the new dentry is set up? If > > it's hashed before you call flock then I suppose it's already visible to > > others. > > It seems it should be hashed in f_ops->atomic_open() (at least cifs > and nfs do it this way). In do_last when we do an ordinary open, we > don't hit parent i_mutex if lookup is succeeded through lookup_fast. > lookup_fast can catch newly created dentry and set it's share mode > before atomic_open codepath hits deny_lock_file. > > Also, I noted that: atomic open does f_ops->atomic_open and then it > processes may_open check; if may_open fails, the file is closed and > open returns with a error code (but file is created anyway). That would be a bug, I think. E.g. "man 3posix open": No files shall be created or modified if the function returns -1. Looking at the code... See the references to FILE_CREATED in atomic_open--looks like that's trying to prevent may_open from failing in this case. > I think > there is no difference between this case and the situation with > deny_lock_file there. Looks to me like it would be a bug in either case. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Thu, Feb 07, 2013 at 08:50:16PM +0400, Pavel Shilovsky wrote: > 2013/2/7 J. Bruce Fields : > > That would be a bug, I think. E.g. "man 3posix open": > > > > No files shall be created or modified if the function returns > > -1. > > > > Looking at the code... See the references to FILE_CREATED in > > atomic_open--looks like that's trying to prevent may_open from failing > > in this case. > > > >> I think > >> there is no difference between this case and the situation with > >> deny_lock_file there. > > > > Looks to me like it would be a bug in either case. > > Then we returned from lookup_open in do_last we go to 'opened' lable. > Then we have a 3(!) chances to return -1 while a file is created > (open_check_o_direct, ima_file_check, handle_truncate I don't know about the first two, but handle_truncate won't be hit since will_truncate is false. > ). In this case > these places are bugs too. > > We can call vfs_unlink if we failed after a file was created, but > possible affects need to be investigated. We definitely don't want to try to undo the create with an unlink. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG at net/sunrpc/svc_xprt.c:921
On Thu, Feb 07, 2013 at 07:56:51AM -0500, Tom Horsley wrote: > I noticed some previous messages with this subject, but the > walkback I'm getting doesn't match exactly the ones shown > in the threads I saw, so I figured I'd send this in. > > This happens on both my Fedora 18 and Fedora 17 partitions > when mounting filesystems from very old servers that > need the proto=udp option to talk. > > The redhat bugzilla is here: > > https://bugzilla.redhat.com/show_bug.cgi?id=908451 > > There is a photo of the walkback in that bugzilla: > > svc_delete_xprt+0x12a > svc_recv+0x101 > ? nfs_callback_authenticate+0x50 > nfs4_callback_svc+0x3b > kthread+0xc0 > ? ftrace_raw_event_xen_mmu_flush_tlb_others+0x50 > ? kthread_create_on_node+0x120 > ret_from_fork+0x7c > ? kthread_create_on_node+0x120 OK, yes this is the if (test_and_set_bit(XPT_DEAD, &xprt->xpt_flags)) BUG(); in svc_delete_xprt() and is a known (but unfixed) problem. Stanislav has some patches, that should be fixed soon --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/6] NFS: use SUNRPC cache creation and destruction helper for DNS cache
On Mon, Feb 04, 2013 at 02:02:35PM +0300, Stanislav Kinsbursky wrote: > This cache was the first containerized and doesn't use net-aware cache > creation and destruction helpers. > This is a cleanup patch which just makes code looks clearer and reduce amount > of lines of code. This looks like uncontroversial cleanup, and should probably go in with the rest through my tree, so I'm inclined to just merge it. But an "ACK" from Trond would be reassuring.--b. > > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfs/dns_resolve.c | 60 > +- > 1 files changed, 25 insertions(+), 35 deletions(-) > > diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c > index ca4b11e..5015447 100644 > --- a/fs/nfs/dns_resolve.c > +++ b/fs/nfs/dns_resolve.c > @@ -351,48 +351,39 @@ ssize_t nfs_dns_resolve_name(struct net *net, char > *name, > } > EXPORT_SYMBOL_GPL(nfs_dns_resolve_name); > > +static struct cache_detail nfs_dns_resolve_template = { > + .owner = THIS_MODULE, > + .hash_size = NFS_DNS_HASHTBL_SIZE, > + .name = "dns_resolve", > + .cache_put = nfs_dns_ent_put, > + .cache_upcall = nfs_dns_upcall, > + .cache_parse= nfs_dns_parse, > + .cache_show = nfs_dns_show, > + .match = nfs_dns_match, > + .init = nfs_dns_ent_init, > + .update = nfs_dns_ent_update, > + .alloc = nfs_dns_ent_alloc, > +}; > + > + > int nfs_dns_resolver_cache_init(struct net *net) > { > - int err = -ENOMEM; > + int err; > struct nfs_net *nn = net_generic(net, nfs_net_id); > - struct cache_detail *cd; > - struct cache_head **tbl; > > - cd = kzalloc(sizeof(struct cache_detail), GFP_KERNEL); > - if (cd == NULL) > - goto err_cd; > - > - tbl = kzalloc(NFS_DNS_HASHTBL_SIZE * sizeof(struct cache_head *), > - GFP_KERNEL); > - if (tbl == NULL) > - goto err_tbl; > - > - cd->owner = THIS_MODULE, > - cd->hash_size = NFS_DNS_HASHTBL_SIZE, > - cd->hash_table = tbl, > - cd->name = "dns_resolve", > - cd->cache_put = nfs_dns_ent_put, > - cd->cache_upcall = nfs_dns_upcall, > - cd->cache_parse = nfs_dns_parse, > - cd->cache_show = nfs_dns_show, > - cd->match = nfs_dns_match, > - cd->init = nfs_dns_ent_init, > - cd->update = nfs_dns_ent_update, > - cd->alloc = nfs_dns_ent_alloc, > - > - nfs_cache_init(cd); > - err = nfs_cache_register_net(net, cd); > + nn->nfs_dns_resolve = cache_create_net(&nfs_dns_resolve_template, net); > + if (IS_ERR(nn->nfs_dns_resolve)) > + return PTR_ERR(nn->nfs_dns_resolve); > + > + nfs_cache_init(nn->nfs_dns_resolve); > + err = nfs_cache_register_net(net, nn->nfs_dns_resolve); > if (err) > goto err_reg; > - nn->nfs_dns_resolve = cd; > return 0; > > err_reg: > - nfs_cache_destroy(cd); > - kfree(cd->hash_table); > -err_tbl: > - kfree(cd); > -err_cd: > + nfs_cache_destroy(nn->nfs_dns_resolve); > + cache_destroy_net(nn->nfs_dns_resolve, net); > return err; > } > > @@ -403,8 +394,7 @@ void nfs_dns_resolver_cache_destroy(struct net *net) > > nfs_cache_unregister_net(net, cd); > nfs_cache_destroy(cd); > - kfree(cd->hash_table); > - kfree(cd); > + cache_destroy_net(nn->nfs_dns_resolve, net); > } > > static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event, > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/6] NFS; simlify and clean cache library
On Mon, Feb 04, 2013 at 02:02:40PM +0300, Stanislav Kinsbursky wrote: > This is a cleanup patch. > Such helpers like nfs_cache_init() and nfs_cache_destroy() are redundant, > because they are just a wrappers around sunrpc_init_cache_detail() and > sunrpc_destroy_cache_detail() respectively. > So let's remove them completely and move corresponding logic to > nfs_cache_register_net() and nfs_cache_unregister_net() respectively (since > they are called together anyway). Ditto for this one--Trond, any objection? --b. > > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfs/cache_lib.c | 12 +++- > fs/nfs/cache_lib.h |2 -- > fs/nfs/dns_resolve.c |6 +- > 3 files changed, 4 insertions(+), 16 deletions(-) > > diff --git a/fs/nfs/cache_lib.c b/fs/nfs/cache_lib.c > index 862a2f1..5f7b053 100644 > --- a/fs/nfs/cache_lib.c > +++ b/fs/nfs/cache_lib.c > @@ -128,10 +128,13 @@ int nfs_cache_register_net(struct net *net, struct > cache_detail *cd) > struct super_block *pipefs_sb; > int ret = 0; > > + sunrpc_init_cache_detail(cd); > pipefs_sb = rpc_get_sb_net(net); > if (pipefs_sb) { > ret = nfs_cache_register_sb(pipefs_sb, cd); > rpc_put_sb_net(net); > + if (ret) > + sunrpc_destroy_cache_detail(cd); > } > return ret; > } > @@ -151,14 +154,5 @@ void nfs_cache_unregister_net(struct net *net, struct > cache_detail *cd) > nfs_cache_unregister_sb(pipefs_sb, cd); > rpc_put_sb_net(net); > } > -} > - > -void nfs_cache_init(struct cache_detail *cd) > -{ > - sunrpc_init_cache_detail(cd); > -} > - > -void nfs_cache_destroy(struct cache_detail *cd) > -{ > sunrpc_destroy_cache_detail(cd); > } > diff --git a/fs/nfs/cache_lib.h b/fs/nfs/cache_lib.h > index 317db95..4116d2c 100644 > --- a/fs/nfs/cache_lib.h > +++ b/fs/nfs/cache_lib.h > @@ -23,8 +23,6 @@ extern struct nfs_cache_defer_req > *nfs_cache_defer_req_alloc(void); > extern void nfs_cache_defer_req_put(struct nfs_cache_defer_req *dreq); > extern int nfs_cache_wait_for_upcall(struct nfs_cache_defer_req *dreq); > > -extern void nfs_cache_init(struct cache_detail *cd); > -extern void nfs_cache_destroy(struct cache_detail *cd); > extern int nfs_cache_register_net(struct net *net, struct cache_detail *cd); > extern void nfs_cache_unregister_net(struct net *net, struct cache_detail > *cd); > extern int nfs_cache_register_sb(struct super_block *sb, > diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c > index 5015447..499834b 100644 > --- a/fs/nfs/dns_resolve.c > +++ b/fs/nfs/dns_resolve.c > @@ -375,14 +375,12 @@ int nfs_dns_resolver_cache_init(struct net *net) > if (IS_ERR(nn->nfs_dns_resolve)) > return PTR_ERR(nn->nfs_dns_resolve); > > - nfs_cache_init(nn->nfs_dns_resolve); > err = nfs_cache_register_net(net, nn->nfs_dns_resolve); > if (err) > goto err_reg; > return 0; > > err_reg: > - nfs_cache_destroy(nn->nfs_dns_resolve); > cache_destroy_net(nn->nfs_dns_resolve, net); > return err; > } > @@ -390,10 +388,8 @@ err_reg: > void nfs_dns_resolver_cache_destroy(struct net *net) > { > struct nfs_net *nn = net_generic(net, nfs_net_id); > - struct cache_detail *cd = nn->nfs_dns_resolve; > > - nfs_cache_unregister_net(net, cd); > - nfs_cache_destroy(cd); > + nfs_cache_unregister_net(net, nn->nfs_dns_resolve); > cache_destroy_net(nn->nfs_dns_resolve, net); > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Fri, Feb 08, 2013 at 09:41:05AM +0400, Stanislav Kinsbursky wrote: > 03.02.2013 18:41, J. Bruce Fields пишет: > >On Sat, Feb 02, 2013 at 07:57:19AM -0500, J. Bruce Fields wrote: > >>On Sat, Feb 02, 2013 at 01:04:03PM +1100, Stephen Rothwell wrote: > >>>Hi, > >>> > >>>After merging the nfsd tree, today's linux-next build (x86_64 > >>>allmodconfig) failed like this: > >>> > >>>fs/nfs/dns_resolve.c: In function 'nfs_dns_resolver_cache_init': > >>>fs/nfs/dns_resolve.c:375:4: error: 'struct cache_detail' has no member > >>>named 'cache_upcall' > >>>fs/nfs/dns_resolve.c:375:35: warning: left-hand operand of comma > >>>expression has no effect [-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>>fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > >>>[-Wunused-value] > >>> > >>>Caused by commit aab982cb5dfb ("SUNRPC: remove cache_detail->cache_upcall > >>>callback"). > >> > >>Yes, I knew why we'd introduced cache_upcall, so I'm not sure how I > >>overlooked that. It must have slipped through testing because I didn't > >>set CONFIG_NFS_USE_KERNEL_DNS. > >> > >>We may just be able to revert that patch I can take care of that by > >>tomorrow. > > > >Stanislav, any objections to this? > > > > The only objection is that I've sent you the patch set witch fixes all these > problems already: > > "[PATCH v2 0/6] SUNRPC: rework cache upcall to avoid NFSd root". :) > > The only reason why I removed cache_upcall at all was that all it's > users (except NFS DNS cache - my mistake) just call > sunrpc_cache_pipe_upcall and thus these wrapper looked redundant to > me. > Second patch set leaves cache_upcall only for NFS DNS cache (since this > upcall is not just a wrapper around sunrpc_cache_pipe_upcall). > And second patch set implies the first one will be dropped. > I can, actually, send one more (incremental this time) patch set to fix the > problem, if you wish. No, the replacement series is fine. I gave Trond another poke and then they should get committed.--b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation
On Fri, Feb 01, 2013 at 02:28:21PM +0300, Stanislav Kinsbursky wrote: > After "NFS" (SUNRPC + NFSd actually) containerization work some basic > principles of SUNRPC service initialization and deinitialization has been > changed: now one service can be shared between different network namespaces > and network "resources" can be attached or detached from the running service. > This leads to races, described here: > > https://bugzilla.redhat.com/show_bug.cgi?id=904870 > > and which this small patch set is aimed to solve by using per-cpu rw semphores > to sync per-net resources processing and shutdown. Sorry for the slow response. I think this is probably correct. But I think we got into this mess because the server shutdown logic is too complicated. So I'd prefer to find a way to fix the problem by simplifying things rather than by adding another lock. Do you see anything wrong with the following? --b commit e8202f39f84b8863337f55159dd18478b9ccb616 Author: J. Bruce Fields Date: Sun Feb 10 16:08:11 2013 -0500 svcrpc: fix and simplify server shutdown Simplify server shutdown, and make it correct whether or not there are still threads running (as will happen in the case we're only shutting down the service in one network namespace). Do that by doing what we'd do in normal circumstances: just CLOSE each socket, then enqueue it. Since there may not be threads to handle the resulting queued xprts, also run a simplified version of the svc_recv() loop run by a server to clean up any closed xprts afterwards. Signed-off-by: J. Bruce Fields diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c index 024a241..a98e818 100644 --- a/net/sunrpc/svc_xprt.c +++ b/net/sunrpc/svc_xprt.c @@ -966,12 +966,12 @@ static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, s if (xprt->xpt_net != net) continue; set_bit(XPT_CLOSE, &xprt->xpt_flags); - set_bit(XPT_BUSY, &xprt->xpt_flags); + svc_xprt_enqueue(xprt); } spin_unlock(&serv->sv_lock); } -static void svc_clear_pools(struct svc_serv *serv, struct net *net) +static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net) { struct svc_pool *pool; struct svc_xprt *xprt; @@ -986,42 +986,31 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net) if (xprt->xpt_net != net) continue; list_del_init(&xprt->xpt_ready); + spin_unlock_bh(&pool->sp_lock); + return xprt; } spin_unlock_bh(&pool->sp_lock); } + return NULL; } -static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) +static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net) { struct svc_xprt *xprt; - struct svc_xprt *tmp; - LIST_HEAD(victims); - spin_lock(&serv->sv_lock); - list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) { - if (xprt->xpt_net != net) - continue; - list_move(&xprt->xpt_list, &victims); - } - spin_unlock(&serv->sv_lock); - - list_for_each_entry_safe(xprt, tmp, &victims, xpt_list) + while ((xprt = svc_dequeue_net(serv, net))) { + if (!test_bit(XPT_CLOSE, &xprt->xpt_flags)) + pr_err("found un-closed xprt on service shutdown\n"); svc_delete_xprt(xprt); + } } void svc_close_net(struct svc_serv *serv, struct net *net) { - svc_close_list(serv, &serv->sv_tempsocks, net); svc_close_list(serv, &serv->sv_permsocks, net); - - svc_clear_pools(serv, net); - /* -* At this point the sp_sockets lists will stay empty, since -* svc_xprt_enqueue will not add new entries without taking the -* sp_lock and checking XPT_BUSY. -*/ - svc_clear_list(serv, &serv->sv_tempsocks, net); - svc_clear_list(serv, &serv->sv_permsocks, net); + svc_clean_up_xprts(serv, net); + svc_close_list(serv, &serv->sv_tempsocks, net); + svc_clean_up_xprts(serv, net); } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation
On Mon, Feb 11, 2013 at 10:18:18AM +0400, Stanislav Kinsbursky wrote: > This one looks a bit complicated and confusing to me. Probably because > I'm not that familiar with service transports processing logic. So, > as I can see, we now try to run over all per-net pool-assigned > transports, remove them from "ready" queue and delete one by one. > Then we try to enqueue all temporary sockets. But where in enqueueing > of permanent sockets? I.e. how does they be destroyed with this patch? > Then we once again try to run over all per-net pool-assigned > transports, remove them from "ready" queue and delete one by one. Why > twice? I.e. why not just lose them, then enqueue them and > svc_clean_up_xprts()? I think you missed the first svc_close_list?: > > svc_close_list(serv, &serv->sv_permsocks, net); > >+svc_clean_up_xprts(serv, net); > >+svc_close_list(serv, &serv->sv_tempsocks, net); > >+svc_clean_up_xprts(serv, net); The idea is that before we'd like to close all the listeners first, so that they aren't busy creating more tempsocks while we're trying to close them. I overlooked a race, though: if another thread was already handling an accept for one of the listeners then it might not get closed by that first svc_clean_up_xprts. I guess we could do something like: delay = 0; again: numclosed = svc_close_list(serv, &serv->sv_permsocks, net); numclosed += svc_close_list(serv, &serv->sv_tempsocks, net); if (numclosed) { svc_clean_up_xprts(serv, net); msleep(delay++); goto again; } Seems a little cheesy, but if we don't care much about shutdown performance in a rare corner case, maybe it's the simplest way out? --b -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation
On Mon, Feb 11, 2013 at 11:37:15AM -0500, J. Bruce Fields wrote: > On Mon, Feb 11, 2013 at 10:18:18AM +0400, Stanislav Kinsbursky wrote: > > This one looks a bit complicated and confusing to me. Probably because > > I'm not that familiar with service transports processing logic. So, > > as I can see, we now try to run over all per-net pool-assigned > > transports, remove them from "ready" queue and delete one by one. > > Then we try to enqueue all temporary sockets. But where in enqueueing > > of permanent sockets? I.e. how does they be destroyed with this patch? > > Then we once again try to run over all per-net pool-assigned > > transports, remove them from "ready" queue and delete one by one. Why > > twice? I.e. why not just lose them, then enqueue them and > > svc_clean_up_xprts()? > > I think you missed the first svc_close_list?: > > > > svc_close_list(serv, &serv->sv_permsocks, net); > > >+ svc_clean_up_xprts(serv, net); > > >+ svc_close_list(serv, &serv->sv_tempsocks, net); > > >+ svc_clean_up_xprts(serv, net); > > The idea is that before we'd like to close all the listeners first, so > that they aren't busy creating more tempsocks while we're trying to > close them. > > I overlooked a race, though: if another thread was already handling an > accept for one of the listeners then it might not get closed by that > first svc_clean_up_xprts. > > I guess we could do something like: > > delay = 0; > > again: > numclosed = svc_close_list(serv, &serv->sv_permsocks, net); > numclosed += svc_close_list(serv, &serv->sv_tempsocks, net); > if (numclosed) { > svc_clean_up_xprts(serv, net); > msleep(delay++); > goto again; > } > > Seems a little cheesy, but if we don't care much about shutdown > performance in a rare corner case, maybe it's the simplest way out? That ends up looking like this.--b. commit 8468ca5003356bbf5d6157807d4daed075fd438f Author: J. Bruce Fields Date: Sun Feb 10 16:08:11 2013 -0500 svcrpc: fix rpc server shutdown races Rewrite server shutdown to remove the assumption that there are no longer any threads running (no longer true, for example, when shutting down the service in one network namespace while it's still running in others). Do that by doing what we'd do in normal circumstances: just CLOSE each socket, then enqueue it. Since there may not be threads to handle the resulting queued xprts, also run a simplified version of the svc_recv() loop run by a server to clean up any closed xprts afterwards. Signed-off-by: J. Bruce Fields diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index dbf12ac..2d34b6b 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -515,15 +515,6 @@ EXPORT_SYMBOL_GPL(svc_create_pooled); void svc_shutdown_net(struct svc_serv *serv, struct net *net) { - /* -* The set of xprts (contained in the sv_tempsocks and -* sv_permsocks lists) is now constant, since it is modified -* only by accepting new sockets (done by service threads in -* svc_recv) or aging old ones (done by sv_temptimer), or -* configuration changes (excluded by whatever locking the -* caller is using--nfsd_mutex in the case of nfsd). So it's -* safe to traverse those lists and shut everything down: -*/ svc_close_net(serv, net); if (serv->sv_shutdown) diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c index 0b67409..0bd0b6f 100644 --- a/net/sunrpc/svc_xprt.c +++ b/net/sunrpc/svc_xprt.c @@ -948,21 +948,24 @@ void svc_close_xprt(struct svc_xprt *xprt) } EXPORT_SYMBOL_GPL(svc_close_xprt); -static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) +static int svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) { struct svc_xprt *xprt; + int ret = 0; spin_lock(&serv->sv_lock); list_for_each_entry(xprt, xprt_list, xpt_list) { if (xprt->xpt_net != net) continue; + ret++; set_bit(XPT_CLOSE, &xprt->xpt_flags); - set_bit(XPT_BUSY, &xprt->xpt_flags); + svc_xprt_enqueue(xprt); } spin_unlock(&serv->sv_lock); + return ret; } -static void svc_clear_pools(struct svc_serv *serv, struct net *net) +static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net) { struct svc_pool *pool; struct svc_xprt *xprt; @@ -977,42 +980,49 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net
Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation
On Tue, Feb 12, 2013 at 01:52:32PM +0400, Stanislav Kinsbursky wrote: > 12.02.2013 00:58, J. Bruce Fields пишет: > > > void svc_close_net(struct svc_serv *serv, struct net *net) > > { > >-svc_close_list(serv, &serv->sv_tempsocks, net); > >-svc_close_list(serv, &serv->sv_permsocks, net); > >- > >-svc_clear_pools(serv, net); > >-/* > >- * At this point the sp_sockets lists will stay empty, since > >- * svc_xprt_enqueue will not add new entries without taking the > >- * sp_lock and checking XPT_BUSY. > >- */ > >-svc_clear_list(serv, &serv->sv_tempsocks, net); > >-svc_clear_list(serv, &serv->sv_permsocks, net); > >+int closed; > >+int delay = 0; > >+ > >+again: > >+closed = svc_close_list(serv, &serv->sv_permsocks, net); > >+closed += svc_close_list(serv, &serv->sv_tempsocks, net); > >+if (closed) { > >+svc_clean_up_xprts(serv, net); > >+msleep(delay++); > >+goto again; > >+} > > Frankly, this hunk above makes me feel sick... :( > But I have no better idea right now... > Maybe make this hunk a bit less weird (this is from my POW only, of course), > like this: > > > + while (svc_close_list(serv, &serv->sv_permsocks, net) + > > + svc_close_list(serv, &serv->sv_tempsocks, net)) { > > + svc_clean_up_xprts(serv, net); > > + msleep(delay++); > > + } > > ? OK, that's a little more compact at least. --b. > > Anyway, thanks! > > Acked-by: Stanislav Kinsbursky > > -- > Best regards, > Stanislav Kinsbursky -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG at net/sunrpc/svc_xprt.c:921 (another one)
On Sun, Jan 20, 2013 at 05:51:12PM -0500, Mark Lord wrote: > Got it again, this time on a different system > running mostly the same software. Mark, Paweł, Tom, could any of you confirm whether this helps? --b. diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index dbf12ac..2d34b6b 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -515,15 +515,6 @@ EXPORT_SYMBOL_GPL(svc_create_pooled); void svc_shutdown_net(struct svc_serv *serv, struct net *net) { - /* -* The set of xprts (contained in the sv_tempsocks and -* sv_permsocks lists) is now constant, since it is modified -* only by accepting new sockets (done by service threads in -* svc_recv) or aging old ones (done by sv_temptimer), or -* configuration changes (excluded by whatever locking the -* caller is using--nfsd_mutex in the case of nfsd). So it's -* safe to traverse those lists and shut everything down: -*/ svc_close_net(serv, net); if (serv->sv_shutdown) diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c index b8e47fa..ca71056 100644 --- a/net/sunrpc/svc_xprt.c +++ b/net/sunrpc/svc_xprt.c @@ -856,7 +856,6 @@ static void svc_age_temp_xprts(unsigned long closure) struct svc_serv *serv = (struct svc_serv *)closure; struct svc_xprt *xprt; struct list_head *le, *next; - LIST_HEAD(to_be_aged); dprintk("svc_age_temp_xprts\n"); @@ -877,25 +876,15 @@ static void svc_age_temp_xprts(unsigned long closure) if (atomic_read(&xprt->xpt_ref.refcount) > 1 || test_bit(XPT_BUSY, &xprt->xpt_flags)) continue; - svc_xprt_get(xprt); - list_move(le, &to_be_aged); + list_del_init(le); set_bit(XPT_CLOSE, &xprt->xpt_flags); set_bit(XPT_DETACHED, &xprt->xpt_flags); - } - spin_unlock_bh(&serv->sv_lock); - - while (!list_empty(&to_be_aged)) { - le = to_be_aged.next; - /* fiddling the xpt_list node is safe 'cos we're XPT_DETACHED */ - list_del_init(le); - xprt = list_entry(le, struct svc_xprt, xpt_list); - dprintk("queuing xprt %p for closing\n", xprt); /* a thread will dequeue and close it soon */ svc_xprt_enqueue(xprt); - svc_xprt_put(xprt); } + spin_unlock_bh(&serv->sv_lock); mod_timer(&serv->sv_temptimer, jiffies + svc_conn_age_period * HZ); } @@ -959,21 +948,24 @@ void svc_close_xprt(struct svc_xprt *xprt) } EXPORT_SYMBOL_GPL(svc_close_xprt); -static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) +static int svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) { struct svc_xprt *xprt; + int ret = 0; spin_lock(&serv->sv_lock); list_for_each_entry(xprt, xprt_list, xpt_list) { if (xprt->xpt_net != net) continue; + ret++; set_bit(XPT_CLOSE, &xprt->xpt_flags); - set_bit(XPT_BUSY, &xprt->xpt_flags); + svc_xprt_enqueue(xprt); } spin_unlock(&serv->sv_lock); + return ret; } -static void svc_clear_pools(struct svc_serv *serv, struct net *net) +static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net) { struct svc_pool *pool; struct svc_xprt *xprt; @@ -988,42 +980,46 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net) if (xprt->xpt_net != net) continue; list_del_init(&xprt->xpt_ready); + spin_unlock_bh(&pool->sp_lock); + return xprt; } spin_unlock_bh(&pool->sp_lock); } + return NULL; } -static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net) +static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net) { struct svc_xprt *xprt; - struct svc_xprt *tmp; - LIST_HEAD(victims); - - spin_lock(&serv->sv_lock); - list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) { - if (xprt->xpt_net != net) - continue; - list_move(&xprt->xpt_list, &victims); - } - spin_unlock(&serv->sv_lock); - list_for_each_entry_safe(xprt, tmp, &victims, xpt_list) + while ((xprt = svc_dequeue_net(serv, net))) { + set_bit(XPT_CLOSE, &xprt->xpt_flags); svc_delete_xprt(xprt); + } } +/* + * Server threads may still be running (especially in the case where the + * service is still running in other network namespaces). + * + * So we shut down sockets the same way we would on a running server, by + * setting XPT_CLOSE, enqueuing, an
Re: [PATCH linux-next] lockd: nlmclnt_reclaim(): avoid stack overflow
On Tue, Feb 12, 2013 at 12:33:15PM -0700, Tim Gardner wrote: > Even though nlmclnt_reclaim() is only one call into the stack frame, > 928 bytes on the stack seems like a lot. Recode to dynamically > allocate the request structure once from within the reclaimer task, > then pass this pointer into nlmclnt_reclaim() for reuse on > subsequent calls. > > smatch analysis: > > fs/lockd/clntproc.c:620 nlmclnt_reclaim() warn: 'reqst' puts > 928 bytes on stack > > Also remove redundant assignment of 0 after memset. > > Cc: Trond Myklebust > Cc: "J. Bruce Fields" > Cc: linux-...@vger.kernel.org > Signed-off-by: Tim Gardner > --- > fs/lockd/clntlock.c |8 +++- > fs/lockd/clntproc.c |6 ++ > include/linux/lockd/lockd.h |3 ++- > 3 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/fs/lockd/clntlock.c b/fs/lockd/clntlock.c > index 4885b53..5dd23ef 100644 > --- a/fs/lockd/clntlock.c > +++ b/fs/lockd/clntlock.c > @@ -220,10 +220,15 @@ reclaimer(void *ptr) > { > struct nlm_host *host = (struct nlm_host *) ptr; > struct nlm_wait *block; > + struct nlm_rqst *req; > struct file_lock *fl, *next; > u32 nsmstate; > struct net *net = host->net; > > + req = kmalloc(sizeof(*req), GFP_KERNEL); > + if (!req) > + return -ENOMEM; > + It caught my eye that this function has never actually returned anything but 0 before But OK I don't think there's a problem there: it could already fail to reclaim individual locks due to allocation or other failures, and could fail to run entirely if kthread_run failed. But maybe we should log the same "Locks for %s won't be reclaimed!" error in this case that we would in the case the kthread_run() in nlmclnt_recovery fails. --b. > allow_signal(SIGKILL); > > down_write(&host->h_rwsem); > @@ -253,7 +258,7 @@ restart: >*/ > if (signalled()) > continue; > - if (nlmclnt_reclaim(host, fl) != 0) > + if (nlmclnt_reclaim(host, fl, req) != 0) > continue; > list_add_tail(&fl->fl_u.nfs_fl.list, &host->h_granted); > if (host->h_nsmstate != nsmstate) { > @@ -279,5 +284,6 @@ restart: > /* Release host handle after use */ > nlmclnt_release_host(host); > lockd_down(net); > + kfree(req); > return 0; > } > diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c > index 54f9e6c..b43114c 100644 > --- a/fs/lockd/clntproc.c > +++ b/fs/lockd/clntproc.c > @@ -615,17 +615,15 @@ out_unlock: > * RECLAIM: Try to reclaim a lock > */ > int > -nlmclnt_reclaim(struct nlm_host *host, struct file_lock *fl) > +nlmclnt_reclaim(struct nlm_host *host, struct file_lock *fl, > + struct nlm_rqst *req) > { > - struct nlm_rqst reqst, *req; > int status; > > - req = &reqst; > memset(req, 0, sizeof(*req)); > locks_init_lock(&req->a_args.lock.fl); > locks_init_lock(&req->a_res.lock.fl); > req->a_host = host; > - req->a_flags = 0; > > /* Set up the argument struct */ > nlmclnt_setlockargs(req, fl); > diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h > index f5a051a..a395f1e 100644 > --- a/include/linux/lockd/lockd.h > +++ b/include/linux/lockd/lockd.h > @@ -212,7 +212,8 @@ int nlmclnt_block(struct nlm_wait *block, > struct nlm_rqst *req, long timeout) > __be32 nlmclnt_grant(const struct sockaddr *addr, > const struct nlm_lock *lock); > void nlmclnt_recovery(struct nlm_host *); > -intnlmclnt_reclaim(struct nlm_host *, struct file_lock *); > +intnlmclnt_reclaim(struct nlm_host *, struct file_lock *, > + struct nlm_rqst *); > void nlmclnt_next_cookie(struct nlm_cookie *); > > /* > -- > 1.7.9.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH linux-next] lockd: nlmsvc_mark_resources(): avoid stack overflow
On Tue, Feb 12, 2013 at 12:48:58PM -0700, Tim Gardner wrote: > Dynamically allocate the NLM host structure in order to avoid stack overflow. > nlmsvc_mark_resources() is several call levels deep in a stack > that has a number of large variables. 512 bytes seems like a lot > on the stack at this point. > > smatch analysis: > > fs/lockd/svcsubs.c:366 nlmsvc_mark_resources() warn: 'hint' puts > 512 bytes on stack > > Cc: Trond Myklebust > Cc: "J. Bruce Fields" > Cc: linux-...@vger.kernel.org > Signed-off-by: Tim Gardner > --- > fs/lockd/svcsubs.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c > index b904f41..f3abb7f 100644 > --- a/fs/lockd/svcsubs.c > +++ b/fs/lockd/svcsubs.c > @@ -363,11 +363,15 @@ nlmsvc_is_client(void *data, struct nlm_host *dummy) > void > nlmsvc_mark_resources(struct net *net) > { > - struct nlm_host hint; > + struct nlm_host *hint = kzalloc(sizeof(*hint), GFP_KERNEL); > > - dprintk("lockd: nlmsvc_mark_resources for net %p\n", net); > - hint.net = net; > - nlm_traverse_files(&hint, nlmsvc_mark_host, NULL); > + if (hint) { > + dprintk("lockd: nlmsvc_mark_resources for net %p\n", net); > + hint->net = net; > + nlm_traverse_files(hint, nlmsvc_mark_host, NULL); > + } Silently neglecting to do this looks like a bad idea. It's strange that we're passing in an nlm_host when all we actually use is the struct net*. Why not just change this to pass in the net instead? --b. > + > + kfree(hint); > } > > /* > -- > 1.7.9.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
nfsd bugfix for 3.9
Please pull another nfsd bugfix from the for-3.9 branch at git://linux-nfs.org/~bfields/linux.git for-3.9 (An xdr decoding error--thanks, Toralf Förster, and Trinity!) --b. J. Bruce Fields (1): nfsd4: reject "negative" acl lengths fs/nfsd/nfs4xdr.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the final tree (nfsd tree related)
On Wed, Apr 03, 2013 at 07:10:54AM -0400, Jeff Layton wrote: > On Wed, 3 Apr 2013 17:42:19 +1100 > Stephen Rothwell wrote: > > > Hi all, > > > > After merging the final tree, today's linux-next build (arm defconfig) > > failed like this: > > > > fs/built-in.o: In function `nfsd_reply_cache_stats_show': > > super.c:(.text+0x87308): undefined reference to `__udivdi3' > > > > Probably caused by commit 187da2f90879 ("nfsd: keep track of the max and > > average time to search the cache") which adds a divide by u64 > > (num_searches). > > > > I have reverted that commit and the following one for today. > > Thanks, known problem... > > Looks like Bruce's tree has an older version of that patch series. I > think we just need to get him to drop that one and merge the new one. Arrgh, sorry--could you remind me which is the new one? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the final tree (nfsd tree related)
On Wed, Apr 03, 2013 at 07:38:57AM -0400, Jeff Layton wrote: > On Wed, 3 Apr 2013 07:33:01 -0400 > "J. Bruce Fields" wrote: > > > On Wed, Apr 03, 2013 at 07:10:54AM -0400, Jeff Layton wrote: > > > On Wed, 3 Apr 2013 17:42:19 +1100 > > > Stephen Rothwell wrote: > > > > > > > Hi all, > > > > > > > > After merging the final tree, today's linux-next build (arm defconfig) > > > > failed like this: > > > > > > > > fs/built-in.o: In function `nfsd_reply_cache_stats_show': > > > > super.c:(.text+0x87308): undefined reference to `__udivdi3' > > > > > > > > Probably caused by commit 187da2f90879 ("nfsd: keep track of the max and > > > > average time to search the cache") which adds a divide by u64 > > > > (num_searches). > > > > > > > > I have reverted that commit and the following one for today. > > > > > > Thanks, known problem... > > > > > > Looks like Bruce's tree has an older version of that patch series. I > > > think we just need to get him to drop that one and merge the new one. > > > > Arrgh, sorry--could you remind me which is the new one? > > > > It was the one I sent on 3/19. Those patches (plus a couple more) are > also in the current nfsd-3.10 branch of my git tree too, so it may be > easiest to just pick them from there. I hate rewriting that branch, but OK, done: does my for-3.10 look right to you now? (It's still missing some of your latest patches.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] nfsd4: Fix NULL dereference in legacy_recdir_name_error()
On Wed, Apr 03, 2013 at 06:27:26PM +0200, Takashi Iwai wrote: > The recent rewrite of NFSv4 recovery client tracking options per net > (commit 9a9c6478) introduced Oops when it faces an error for recdir > generation. Thanks. Looks like that could hit a lot of people actually, so I'll pass that along for 3.9 soon.--b. > > NFSD: unable to generate recoverydir name (-2). > NFSD: disabling legacy clientid tracking. Reboot recovery will not function > correctly! > BUG: unable to handle kernel NULL pointer dereference at 07a8 > IP: [] nfsd4_client_tracking_exit+0x17/0x70 [nfsd] > PGD 0 > Oops: [#1] PREEMPT SMP > Modules linked in: nfsd fuse nfsv3 nfs_acl nfsv4 auth_rpcgss nfs lockd > sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave > snd_hda_codec_hdmi snd_hda_codec_realtek intel_powerclamp acpi_cpufreq mperf > coretemp ghash_clmulni_intel aesni_intel kvm_intel snd_hda_intel ablk_helper > snd_hda_codec snd_hwdep kvm snd_pcm cryptd lrw aes_x86_64 snd_timer xts > gf128mul e1000e snd sr_mod iTCO_wdt microcode cdrom usb_storage dcdbas > iTCO_vendor_support i2c_i801 cdc_acm sg ptp lpc_ich mei soundcore pps_core > mfd_core snd_page_alloc pciehp pci_hotplug autofs4 btrfs raid6_pq > zlib_deflate xor libcrc32c i915 crc32c_intel drm_kms_helper drm xhci_hcd > i2c_algo_bit thermal button video processor thermal_sys scsi_dh_rdac > scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh > CPU 1 > Pid: 19567, comm: nfsd Not tainted 3.9.0-rc5-test+ #3 Dell Inc. OptiPlex > 9010/0M9KCM > RIP: 0010:[] [] > nfsd4_client_tracking_exit+0x17/0x70 [nfsd] > RSP: 0018:880181099c28 EFLAGS: 00010202 > RAX: 8801810900c0 RBX: 0004 RCX: 0006 > RDX: 0007 RSI: 0046 RDI: > RBP: 880181099c38 R08: 000a R09: 039f > R10: R11: 039e R12: > R13: 81a87280 R14: 88014c819220 R15: 88020b75d200 > FS: () GS:88021e24() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 07a8 CR3: 01a0d000 CR4: 001407e0 > DR0: DR1: DR2: > DR3: DR6: 0ff0 DR7: 0400 > Process nfsd (pid: 19567, threadinfo 880181098000, task > 8801810900c0) > Stack: >fffe 88020b75d200 880181099c58 a060c75c >81a87280 880002ba7000 880181099cc8 a060cb37 >880181099d20 88014c819220 0001 88020b75d200 > Call Trace: >[] legacy_recdir_name_error+0x3c/0x40 [nfsd] >[] nfsd4_create_clid_dir+0xe7/0x200 [nfsd] >[] ? nfs4_preprocess_seqid_op+0x63/0x160 [nfsd] >[] nfsd4_client_record_create+0x5f/0x80 [nfsd] >[] nfsd4_open_confirm+0x12f/0x1b0 [nfsd] >[] nfsd4_proc_compound+0x55f/0x770 [nfsd] >[] nfsd_dispatch+0xdd/0x220 [nfsd] >[] svc_process_common+0x328/0x6d0 [sunrpc] >[] svc_process+0x10c/0x160 [sunrpc] >[] nfsd+0xbf/0x130 [nfsd] >[] ? nfsd_destroy+0x90/0x90 [nfsd] >[] kthread+0xbb/0xc0 >[] ? kthread_create_on_node+0x130/0x130 >[] ret_from_fork+0x7c/0xb0 >[] ? kthread_create_on_node+0x130/0x130 > Code: e0 49 8b 84 24 48 01 00 00 e9 25 ff ff ff 66 0f 1f 44 00 00 55 48 89 > e5 41 54 49 89 fc 53 8b 1d 44 b4 00 00 e8 bb a9 a5 e0 85 db <49> 8b 84 24 a8 > 07 00 00 74 43 3b 18 77 3f 83 eb 01 48 63 db 48 > RIP [] nfsd4_client_tracking_exit+0x17/0x70 [nfsd] >RSP > CR2: 07a8 > ---[ end trace 5dd4307598e98cef ]--- > > This patch fixes it by passing the proper net instance instead of > NULL. > > Signed-off-by: Takashi Iwai > Cc: [v3.8+] > --- > fs/nfsd/nfs4recover.c | 11 +-- > 1 file changed, 5 insertions(+), 6 deletions(-) > > diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c > index 899ca26..ae0d5c9 100644 > --- a/fs/nfsd/nfs4recover.c > +++ b/fs/nfsd/nfs4recover.c > @@ -146,7 +146,7 @@ out_no_tfm: > * then disable recovery tracking. > */ > static void > -legacy_recdir_name_error(int error) > +legacy_recdir_name_error(struct net *net, int error) > { > printk(KERN_ERR "NFSD: unable to generate recoverydir " > "name (%d).\n", error); > @@ -160,8 +160,7 @@ legacy_recdir_name_error(int error) > printk(KERN_ERR "NFSD: disabling legacy clientid tracking. " > "Reboot recovery will not function correctly!\n"); > > - /* the argument is ignored by the legacy exit function */ > - nfsd4_client_tracking_exit(NULL); > + nfsd4_client_tracking_exit(net); > } > } > > @@ -184,7 +183,7 @@ nfsd4_create_clid_dir(struct nfs4_client *clp) > > status = nfs4_make_rec_clidname(dname, &clp->cl_name); > if (status) > - return legacy_recdir_name_error(status); > + r
Re: [PATCH v2] nfsd: fix bug on nfs4 stateid deallocation
On Wed, Apr 03, 2013 at 06:58:43PM +0800, Yanchuan Nian wrote: > On Mon, Apr 01, 2013 at 09:50:44PM -0400, J. Bruce Fields wrote: > > On Wed, Mar 13, 2013 at 11:04:54PM +0800, Yanchuan Nian wrote: > > > 2013/3/11 J. Bruce Fields > > > > > > > On Mon, Mar 11, 2013 at 08:46:14AM +0800, ycn...@gmail.com wrote: > > > > > NFS4_OO_PURGE_CLOSE is not handled properly. To avoid memory leak, > > > > > nfs4 > > > > > stateid which is pointed by oo_last_closed_stid is freed in > > > > nfsd4_close(), > > > > > but NFS4_OO_PURGE_CLOSE isn't cleared meanwhile. So the stateid > > > > > released > > > > in > > > > > THIS close procedure may be freed immediately in the coming encoding > > > > function. > > > > > > > > OK, makes sense. This code is confusing, I wonder if there's some way > > > > we could make it simpler. > > > > > > > > > > The only purpose of NFS4_OO_PURGE_CLOSE is to decide whether the stateid > > > pointed by last-closed-stateid should be freed or not. I wonder if this > > > flag is necessary. oo_last_closed_stid is set only in CLOSE, so in other > > > procedures, if the pointer is not NULL, it must be set in previous > > > procedure, we can free it directly. In CLOSE procedure, we also free it > > > directly before asigning the new released stateid to it in nfsd4_close(), > > > and then we can ignore it in nfsd4_encode_close(). What do you think? > > > > Yeah, something like that would be simpler, I think. Maybe the > > following? (On top of your patch.) > > This patch may cause memory leak in NFSv4.1. If the stateid released > is the last one in nfs4_openowner, nfs4_openowner will be deallocated > immediately, and NULL will be assigned to cstate->replay_owner at the > same time. In this situation encode_seqid_op_tail() cannot be called. > To avoid this problem, we can free the stateid just before or after > nfs4_opwnowner deallocation. Yes, thanks for catching that!: nfsd4_close_open_stateid(stp); - close->cl_closed_stateid = stp; + if (cstate->minorversion) { + unhash_stid(&stp->st_stid); + free_generic_stateid(stp); + } else + close->cl_closed_stateid = stp; if (list_empty(&oo->oo_owner.so_stateids)) { if (cstate->minorversion) { > Another question, cl_stateid in struct nfsd4_close will be returned to > the client, so original comment is right even though applying this > patch. Can this struct be commented like this. > > struct nfsd4_close { > u32 cl_seqid; /* request */ > stateid_t cl_stateid; /* request+response */ > struct nfs4_ol_stateid *cl_closed_stateid; /* used during processing > */ Agreed; done. The result is the following (untested). --b. diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 3e1101a..9bd6653 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -674,7 +674,7 @@ static void unhash_openowner(struct nfs4_openowner *oo) } } -static void release_last_closed_stateid(struct nfs4_openowner *oo) +void release_last_closed_stateid(struct nfs4_openowner *oo) { struct nfs4_ol_stateid *s = oo->oo_last_closed_stid; @@ -3783,26 +3783,6 @@ out: return status; } -void nfsd4_purge_closed_stateid(struct nfs4_stateowner *so) -{ - struct nfs4_openowner *oo; - struct nfs4_ol_stateid *s; - - if (!so->so_is_open_owner) - return; - oo = openowner(so); - s = oo->oo_last_closed_stid; - if (!s) - return; - if (!(oo->oo_flags & NFS4_OO_PURGE_CLOSE)) { - /* Release the last_closed_stid on the next seqid bump: */ - oo->oo_flags |= NFS4_OO_PURGE_CLOSE; - return; - } - oo->oo_flags &= ~NFS4_OO_PURGE_CLOSE; - release_last_closed_stateid(oo); -} - static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s) { unhash_open_stateid(s); @@ -3839,9 +3819,11 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, memcpy(&close->cl_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t)); nfsd4_close_open_stateid(stp); - release_last_closed_stateid(oo); - oo->oo_flags &= ~NFS4_OO_PURGE_CLOSE; - oo->oo_last_closed_stid = stp; + if (cstate->minorversion) { + unhash_stid(&stp->st_stid); + free_generic_stateid(stp); + } else + close->cl_closed_stateid = stp; if (list_empty(&oo->oo_owne
Re: [PATCH v1 4/6] nfsd: convert nfs4_alloc_stid to use idr_alloc_cyclic
ACK.--b. On Wed, Mar 27, 2013 at 09:18:06AM -0400, Jeff Layton wrote: > Signed-off-by: Jeff Layton > Cc: "J. Bruce Fields" > Cc: linux-...@vger.kernel.org > --- > fs/nfsd/nfs4state.c | 9 ++--- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 2e27430..2efb034 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -234,7 +234,6 @@ static struct nfs4_stid *nfs4_alloc_stid(struct > nfs4_client *cl, struct > kmem_cache *slab) > { > struct idr *stateids = &cl->cl_stateids; > - static int min_stateid = 0; > struct nfs4_stid *stid; > int new_id; > > @@ -242,7 +241,7 @@ kmem_cache *slab) > if (!stid) > return NULL; > > - new_id = idr_alloc(stateids, stid, min_stateid, 0, GFP_KERNEL); > + new_id = idr_alloc_cyclic(stateids, stid, 0, 0, GFP_KERNEL); > if (new_id < 0) > goto out_free; > stid->sc_client = cl; > @@ -261,10 +260,6 @@ kmem_cache *slab) >* amount of time until an id is reused, by ensuring they always >* "increase" (mod INT_MAX): >*/ > - > - min_stateid = new_id+1; > - if (min_stateid == INT_MAX) > - min_stateid = 0; > return stid; > out_free: > kfree(stid); > @@ -1287,7 +1282,7 @@ static struct nfs4_client *create_client(struct > xdr_netobj name, > spin_unlock(&nn->client_lock); > return NULL; > } > - idr_init(&clp->cl_stateids); > + idr_init_cyclic(&clp->cl_stateids, 0); > atomic_set(&clp->cl_refcount, 0); > clp->cl_cb_state = NFSD4_CB_UNKNOWN; > INIT_LIST_HEAD(&clp->cl_idhash); > -- > 1.7.11.7 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/6] nfsd: convert nfs4_alloc_stid to use idr_alloc_cyclic
Err, and ack v2.--b. On Wed, Mar 27, 2013 at 03:29:36PM -0400, Jeff Layton wrote: > Signed-off-by: Jeff Layton > Cc: "J. Bruce Fields" > Cc: linux-...@vger.kernel.org > --- > fs/nfsd/nfs4state.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 2e27430..417c848 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -234,7 +234,6 @@ static struct nfs4_stid *nfs4_alloc_stid(struct > nfs4_client *cl, struct > kmem_cache *slab) > { > struct idr *stateids = &cl->cl_stateids; > - static int min_stateid = 0; > struct nfs4_stid *stid; > int new_id; > > @@ -242,7 +241,7 @@ kmem_cache *slab) > if (!stid) > return NULL; > > - new_id = idr_alloc(stateids, stid, min_stateid, 0, GFP_KERNEL); > + new_id = idr_alloc_cyclic(stateids, stid, 0, 0, GFP_KERNEL); > if (new_id < 0) > goto out_free; > stid->sc_client = cl; > @@ -261,10 +260,6 @@ kmem_cache *slab) >* amount of time until an id is reused, by ensuring they always >* "increase" (mod INT_MAX): >*/ > - > - min_stateid = new_id+1; > - if (min_stateid == INT_MAX) > - min_stateid = 0; > return stid; > out_free: > kfree(stid); > -- > 1.7.11.7 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SUNRPC/cache: add module_put() on error path in cache_open()
Thanks, applying.--b. On Sat, Mar 23, 2013 at 12:36:44AM +0400, Alexey Khoroshilov wrote: > If kmalloc() fails in cache_open(), module cd->owner left locked. > The patch adds module_put(cd->owner) on this path. > > Found by Linux Driver Verification project (linuxtesting.org). > > Signed-off-by: Alexey Khoroshilov > --- > net/sunrpc/cache.c |4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c > index 25d58e76..1d3c514 100644 > --- a/net/sunrpc/cache.c > +++ b/net/sunrpc/cache.c > @@ -986,8 +986,10 @@ static int cache_open(struct inode *inode, struct file > *filp, > nonseekable_open(inode, filp); > if (filp->f_mode & FMODE_READ) { > rp = kmalloc(sizeof(*rp), GFP_KERNEL); > - if (!rp) > + if (!rp) { > + module_put(cd->owner); > return -ENOMEM; > + } > rp->offset = 0; > rp->q.reader = 1; > atomic_inc(&cd->readers); > -- > 1.7.9.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] nfsd: fix bug on nfs4 stateid deallocation
On Thu, Apr 04, 2013 at 10:04:03PM +0800, Yanchuan Nian wrote: > On Wed, Apr 03, 2013 at 02:55:08PM -0400, J. Bruce Fields wrote: > > On Wed, Apr 03, 2013 at 06:58:43PM +0800, Yanchuan Nian wrote: > > > On Mon, Apr 01, 2013 at 09:50:44PM -0400, J. Bruce Fields wrote: > > > > On Wed, Mar 13, 2013 at 11:04:54PM +0800, Yanchuan Nian wrote: > > > > > 2013/3/11 J. Bruce Fields > > > > > > > > > > > On Mon, Mar 11, 2013 at 08:46:14AM +0800, ycn...@gmail.com wrote: > > > > > > > NFS4_OO_PURGE_CLOSE is not handled properly. To avoid memory > > > > > > > leak, nfs4 > > > > > > > stateid which is pointed by oo_last_closed_stid is freed in > > > > > > nfsd4_close(), > > > > > > > but NFS4_OO_PURGE_CLOSE isn't cleared meanwhile. So the stateid > > > > > > > released > > > > > > in > > > > > > > THIS close procedure may be freed immediately in the coming > > > > > > > encoding > > > > > > function. > > > > > > > > > > > > OK, makes sense. This code is confusing, I wonder if there's some > > > > > > way > > > > > > we could make it simpler. > > > > > > > > > > > > > > > > The only purpose of NFS4_OO_PURGE_CLOSE is to decide whether the > > > > > stateid > > > > > pointed by last-closed-stateid should be freed or not. I wonder if > > > > > this > > > > > flag is necessary. oo_last_closed_stid is set only in CLOSE, so in > > > > > other > > > > > procedures, if the pointer is not NULL, it must be set in previous > > > > > procedure, we can free it directly. In CLOSE procedure, we also free > > > > > it > > > > > directly before asigning the new released stateid to it in > > > > > nfsd4_close(), > > > > > and then we can ignore it in nfsd4_encode_close(). What do you think? > > > > > > > > Yeah, something like that would be simpler, I think. Maybe the > > > > following? (On top of your patch.) > > > > > > > > This patch may cause memory leak in NFSv4.1. If the stateid released > > > is the last one in nfs4_openowner, nfs4_openowner will be deallocated > > > immediately, and NULL will be assigned to cstate->replay_owner at the > > > same time. In this situation encode_seqid_op_tail() cannot be called. > > > To avoid this problem, we can free the stateid just before or after > > > nfs4_opwnowner deallocation. > > > > Yes, thanks for catching that!: > > > > nfsd4_close_open_stateid(stp); > > - close->cl_closed_stateid = stp; > > + if (cstate->minorversion) { > > + unhash_stid(&stp->st_stid); > > + free_generic_stateid(stp); > > + } else > > + close->cl_closed_stateid = stp; > > > > if (list_empty(&oo->oo_owner.so_stateids)) { > > if (cstate->minorversion) { > > > > > Another question, cl_stateid in struct nfsd4_close will be returned to > > > the client, so original comment is right even though applying this > > > patch. Can this struct be commented like this. > > > > > > struct nfsd4_close { > > > u32 cl_seqid; /* request */ > > > stateid_t cl_stateid; /* request+response */ > > > struct nfs4_ol_stateid *cl_closed_stateid; /* used during > > > processing */ > > > > Agreed; done. The result is the following (untested). > > Yes, it works on my machine. Just the comment of cl_stateid in struct > nfsd4_close is still a little misleading. Whoops, fixed. Actually, I find encode_seqid_op_tail completely confusing, and I don't see why it's necessary--it would seem more straightforward to do the seqid bump in the procedure itself. How about the following? --b. commit e58235072acd52ecab71d498b2b06633a2a0c376 Author: J. Bruce Fields Date: Mon Apr 1 16:37:12 2013 -0400 nfsd4: cleanup handling of nfsv4.0 closed stateid's Closed stateid's are kept around a little while to handle close replays in the 4.0 case. So we stash them in the last-used stateid in the oo_last_closed_stateid field of the open owner. We can free that in encode_seqid_op_tail once the seqid on the open owner is next incremented. But we don't want to do that on the close itself; s
Re: [PATCH 15/19] sunrpc: don't warn for unused variable 'buf'
On Sat, Jan 26, 2013 at 01:34:56PM +, Arnd Bergmann wrote: > On Saturday 26 January 2013, Russell King - ARM Linux wrote: > > On Fri, Jan 25, 2013 at 11:45:25PM +, Arnd Bergmann wrote: > > > On Friday 25 January 2013, Myklebust, Trond wrote: > > > > > -Original Message- > > > > > From: Arnd Bergmann [mailto:a...@arndb.de] > > > > > Marking it as __maybe_unused avoids a harmless gcc warning. > > > > > > > > Alternatively, just declare it using the RPC_IFDEBUG() macro. > > > > > > Right, makes sense: that's more consistent with other functions > > > doing the same thing. Thanks for taking a look. > > > > NAK. > > > > There is already a fix queued up as a result of a previous report I > > sent, but for some reason (which I didn't question) it was decided > > not to queue it for -rc. > > > > See Bruce's reply on lkml: 20130108212816.ga24...@fieldses.org Apologies, I've seen so many "stop sending me post-rc1 patches that don't fix serious crashes!" flames. I guess obviousl compile fixes should be an exception--if nothing else it'd save a lot of duplicated work as this is something like the 3rd patch I've seen for this. --b. > > Ok, makes sense. Then again, if that fix is queued for 3.9, maybe > it still makes sense to take the simpler fix into 3.8, and remove > it in 3.9 along with the other instances of RPC_IFDEBUG. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] SUNRPC: protect transport processing with rw sem
On Tue, Jan 29, 2013 at 02:03:30PM +0300, Stanislav Kinsbursky wrote: > There could be a service transport, which is processed by service thread and > racing in the same time with per-net service shutdown like listed below: > > CPU#0:CPU#1: > > svc_recvsvc_close_net > svc_get_next_xprt (list_del_init(xpt_ready)) > svc_close_list (set XPT_BUSY and XPT_CLOSE) > svc_clear_pools(xprt was gained on CPU#0 already) > svc_delete_xprt (set XPT_DEAD) > svc_handle_xprt (is XPT_CLOSE => svc_delete_xprt() > BUG() > > There could be different solutions of the problem. > Probably, the patch doesn't implement the best one, but I hope the simple one. > IOW, it protects critical section (dequeuing of pending transport and > enqueuing it back to the pool) by per-service rw semaphore, It's actually per-thread (per-struct svc_rqst) here. > taken for read. > On per-net transports shutdown, this semaphore have to be taken for write. There's no down_write in this patch. Did you forget this part? The server rpc code goes to some care not to write to any global structure, to prevent server threads running on multiple cores from bouncing cache lines between them. But my understanding is that even down_read() does modify the semaphore. So we might want something like the percpu semaphore describe in Documentation/percpu-rw-semaphore.txt. --b. > > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfs/callback.c |2 ++ > include/linux/sunrpc/svc.h |2 ++ > net/sunrpc/svc.c |2 ++ > net/sunrpc/svc_xprt.c | 24 ++-- > 4 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c > index 5088b57..76ba260 100644 > --- a/fs/nfs/callback.c > +++ b/fs/nfs/callback.c > @@ -393,7 +393,9 @@ void nfs_callback_down(int minorversion, struct net *net) > struct nfs_callback_data *cb_info = &nfs_callback_info[minorversion]; > > mutex_lock(&nfs_callback_mutex); > + down_write(&cb_info->rqst->rq_sem); > nfs_callback_down_net(minorversion, cb_info->serv, net); > + up_write(&cb_info->rqst->rq_sem); > cb_info->users--; > if (cb_info->users == 0 && cb_info->task != NULL) { > kthread_stop(cb_info->task); > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h > index 1f0216b..8145009 100644 > --- a/include/linux/sunrpc/svc.h > +++ b/include/linux/sunrpc/svc.h > @@ -278,6 +278,8 @@ struct svc_rqst { >* cache pages */ > wait_queue_head_t rq_wait;/* synchronization */ > struct task_struct *rq_task; /* service thread */ > + > + struct rw_semaphore rq_sem; > }; > > #define SVC_NET(svc_rqst)(svc_rqst->rq_xprt->xpt_net) > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c > index b9ba2a8..71a53c1 100644 > --- a/net/sunrpc/svc.c > +++ b/net/sunrpc/svc.c > @@ -642,6 +642,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool > *pool, int node) > if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node)) > goto out_thread; > > + init_rwsem(&rqstp->rq_sem); > + > return rqstp; > out_thread: > svc_exit_thread(rqstp); > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c > index 5a9d40c..e75b20c 100644 > --- a/net/sunrpc/svc_xprt.c > +++ b/net/sunrpc/svc_xprt.c > @@ -470,6 +470,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp) > rqstp->rq_res.head[0].iov_len = 0; > svc_reserve(rqstp, 0); > rqstp->rq_xprt = NULL; > + up_read(&rqstp->rq_sem); > > svc_xprt_put(xprt); > } > @@ -624,6 +625,7 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst > *rqstp, long timeout) >*/ > rqstp->rq_chandle.thread_wait = 5*HZ; > > + down_read(&rqstp->rq_sem); > spin_lock_bh(&pool->sp_lock); > xprt = svc_xprt_dequeue(pool); > if (xprt) { > @@ -640,7 +642,8 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst > *rqstp, long timeout) > if (pool->sp_task_pending) { > pool->sp_task_pending = 0; > spin_unlock_bh(&pool->sp_lock); > - return ERR_PTR(-EAGAIN); > + xprt = ERR_PTR(-EAGAIN); > + goto out_err; > } > /* No data pending. Go to sleep */ > svc_thread_enqueue(pool, rqstp); > @@ -661,16 +664,19 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst > *rqstp, long timeout) > if (kthread_should_stop()) { > set_current_state(TASK_RUNNING); > spin_unlock_bh(&pool->sp_lock); > - return ERR_PTR(-EINTR); > + xprt = ERR_PTR(-EINTR); > + goto out_err; > } > > add
Re: [PATCH 11/14] nfs: idr_destroy() no longer needs idr_remove_all()
On Fri, Jan 25, 2013 at 05:31:09PM -0800, Tejun Heo wrote: > idr_destroy() can destroy idr by itself and idr_remove_all() is being > deprecated. Drop reference to idr_remove_all(). Note that the code > wasn't completely correct before because idr_remove() on all entries > doesn't necessarily release all idr_layers which could lead to memory > leak. Seems fine, but actually this is client-side so I think you meant the cc to be to Trond. --b. > > Signed-off-by: Tejun Heo > Cc: "J. Bruce Fields" > Cc: linux-...@vger.kernel.org > --- > This patch depends on an earlier idr patch and given the trivial > nature of the patch, I think it would be best to route these together > through -mm. Please holler if there's any objection. > > Thanks. > > fs/nfs/client.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/nfs/client.c b/fs/nfs/client.c > index 9f3c664..84d8eae 100644 > --- a/fs/nfs/client.c > +++ b/fs/nfs/client.c > @@ -197,7 +197,6 @@ error_0: > EXPORT_SYMBOL_GPL(nfs_alloc_client); > > #if IS_ENABLED(CONFIG_NFS_V4) > -/* idr_remove_all is not needed as all id's are removed by nfs_put_client */ > void nfs_cleanup_cb_ident_idr(struct net *net) > { > struct nfs_net *nn = net_generic(net, nfs_net_id); > -- > 1.8.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] SUNRPC: protect transport processing with rw sem
On Wed, Jan 30, 2013 at 09:42:14AM +0400, Stanislav Kinsbursky wrote: > 30.01.2013 02:57, J. Bruce Fields пишет: > >On Tue, Jan 29, 2013 at 02:03:30PM +0300, Stanislav Kinsbursky wrote: > >>There could be a service transport, which is processed by service thread and > >>racing in the same time with per-net service shutdown like listed below: > >> > >>CPU#0:CPU#1: > >> > >>svc_recvsvc_close_net > >>svc_get_next_xprt (list_del_init(xpt_ready)) > >> svc_close_list (set XPT_BUSY and XPT_CLOSE) > >> svc_clear_pools(xprt was gained on CPU#0 > >> already) > >> svc_delete_xprt (set XPT_DEAD) > >>svc_handle_xprt (is XPT_CLOSE => svc_delete_xprt() > >>BUG() > >> > >>There could be different solutions of the problem. > >>Probably, the patch doesn't implement the best one, but I hope the simple > >>one. > >>IOW, it protects critical section (dequeuing of pending transport and > >>enqueuing it back to the pool) by per-service rw semaphore, > > > >It's actually per-thread (per-struct svc_rqst) here. > > > > Yes, sure. > > >>taken for read. > >>On per-net transports shutdown, this semaphore have to be taken for write. > > > >There's no down_write in this patch. Did you forget this part? > > > > See "fs/nfs/callback.c" part Whoops, sorry; got it.--b. > > >The server rpc code goes to some care not to write to any global > >structure, to prevent server threads running on multiple cores from > >bouncing cache lines between them. > > > > This is just an idea. I.e. I wasn't trying to polish the patch - just to > share the vision. > > >But my understanding is that even down_read() does modify the semaphore. > >So we might want something like the percpu semaphore describe in > >Documentation/percpu-rw-semaphore.txt. > > > > Sure, I'll have a look. > > > -- > Best regards, > Stanislav Kinsbursky -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/8] Add O_DENY* support for VFS and CIFS/NFS
On Thu, Jan 17, 2013 at 08:52:09PM +0400, Pavel Shilovsky wrote: > This patchset adds support of O_DENY* flags for Linux fs layer. These flags > can be used by any application that needs share reservations to organize a > file access. VFS already has some sort of this capability - now it's done > through flock/LOCK_MAND mechanis, but that approach is non-atomic. This > patchset build new capabilities on top of the existing one but doesn't bring > any changes into the flock call semantic. > > These flags can be used by NFS (built-in-kernel) and CIFS (Samba) servers and > Wine applications through VFS (for local filesystems) or CIFS/NFS modules. > This will help when e.g. Samba and NFS server share the same directory for > Windows and Linux users or Wine applications use Samba/NFS share to access > the same data from different clients. > > According to the previous discussions the most problematic question is how to > prevent situations like DoS attacks where e.g /lib/liba.so file can be open > with DENYREAD, or smth like this. That's why one extra flag O_DENYMAND is > added. It indicates to underlying layer that an application want to use > O_DENY* flags semantic. It allows us not affect native Linux applications > (that don't use O_DENYMAND flag) - so, these flags (and the semantic of open > syscall that they bring) are used only for those applications that really > want it proccessed that way. Maybe that's good enough. A mount flag might be simpler and give consistent enforcement for all users. > > So, we have four new flags: > O_DENYREAD - to prevent other opens with read access, > O_DENYWRITE - to prevent other opens with write access, > O_DENYDELETE - to prevent delete operations (this flag is not implemented in > VFS and NFS part and only suitable for CIFS module), > O_DENYMAND - to switch on/off three flags above. It would be useful to have some really careful documentation of how these are meant to work. Maybe try updating the open man page? --b. > > The #1 patch bring infrastructure to let us add this capability easier. #2 > patch adds flags to fcntl while #3 patch implement VFS part. Patches #4, #5, > #6 are related to CIFS-specific changes. #7 and #8 describe NFS and NFSD > parts. > > Also, I created the preliminary patch for Samba that replaces the existing > use of flock/LOCK_MAND mechanism with O_DENY* flags: > http://git.etersoft.ru/people/piastry/packages/?p=samba.git;a=commitdiff;h=f116c478bf9a1bc3985e9a719fb20d854914d67a > > Pavel Shilovsky (8): > locks: make flock_lock_file take is_conflict callback parm > fcntl: Introduce new O_DENY* open flags > vfs: Add O_DENYREAD/WRITE flags support for open syscall > CIFS: Add O_DENY* open flags support > CIFS: Use NT_CREATE_ANDX command for forcemand mounts > CIFS: Translate SHARING_VIOLATION to -ETXTBSY error code for SMB2 > NFSv4: Add O_DENY* open flags support > NFSD: Pass share reservations flags to VFS > > fs/cifs/cifsacl.c| 10 ++-- > fs/cifs/cifsglob.h | 12 +++- > fs/cifs/cifsproto.h | 9 +-- > fs/cifs/cifssmb.c| 47 --- > fs/cifs/dir.c| 14 +++-- > fs/cifs/file.c | 18 -- > fs/cifs/inode.c | 11 ++-- > fs/cifs/link.c | 10 ++-- > fs/cifs/readdir.c| 2 +- > fs/cifs/smb1ops.c| 15 ++--- > fs/cifs/smb2file.c | 10 ++-- > fs/cifs/smb2inode.c | 4 +- > fs/cifs/smb2maperror.c | 2 +- > fs/cifs/smb2ops.c| 10 ++-- > fs/cifs/smb2pdu.c| 6 +- > fs/cifs/smb2proto.h | 14 +++-- > fs/fcntl.c | 5 +- > fs/locks.c | 121 > --- > fs/namei.c | 10 +++- > fs/nfs/nfs4xdr.c | 24 ++-- > fs/nfsd/nfs4state.c | 46 ++- > include/linux/fs.h | 6 ++ > include/uapi/asm-generic/fcntl.h | 14 + > 23 files changed, 324 insertions(+), 96 deletions(-) > > -- > 1.8.1.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/8] vfs: Add O_DENYREAD/WRITE flags support for open syscall
On Thu, Jan 17, 2013 at 08:52:59PM +0400, Pavel Shilovsky wrote: > If O_DENYMAND flag is specified, O_DENYREAD/WRITE/MAND flags are > translated to flock's flags: > > !O_DENYREAD -> LOCK_READ > !O_DENYWRITE -> LOCK_WRITE > O_DENYMAND -> LOCK_MAND > > and set through flock_lock_file on a file. > > This change only affects opens that use O_DENYMAND flag - all other > native Linux opens don't care about these flags. It allow us to > enable this feature for applications that need it (e.g. NFS and > Samba servers that export the same directory for Windows clients, > or Wine applications that access the same files simultaneously). The use of an is_conflict callback seems unnecessarily convoluted. If we need two different behaviors, let's just use another flag (or an extra boolean argument if we need to, or something). The only caller for this new deny_lock_file is in the nfs code--I'm a little unclear why that is. --b. > > Signed-off-by: Pavel Shilovsky > --- > fs/locks.c | 106 > +++-- > fs/namei.c | 10 - > include/linux/fs.h | 6 +++ > 3 files changed, 118 insertions(+), 4 deletions(-) > > diff --git a/fs/locks.c b/fs/locks.c > index 9edfec4..ebe9a30 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -605,12 +605,86 @@ static int posix_locks_conflict(struct file_lock > *caller_fl, struct file_lock *s > return (locks_conflict(caller_fl, sys_fl)); > } > > -/* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific > +static unsigned int > +deny_flags_to_cmd(unsigned int flags) > +{ > + unsigned int cmd = LOCK_MAND; > + > + if (!(flags & O_DENYREAD)) > + cmd |= LOCK_READ; > + if (!(flags & O_DENYWRITE)) > + cmd |= LOCK_WRITE; > + > + return cmd; > +} > + > +/* > + * locks_mand_conflict - Determine if there's a share reservation conflict > + * @caller_fl: lock we're attempting to acquire > + * @sys_fl: lock already present on system that we're checking against > + * > + * Check to see if there's a share_reservation conflict. LOCK_READ/LOCK_WRITE > + * tell us whether the reservation allows other readers and writers. > + * > + * We only check against other LOCK_MAND locks, so applications that want to > + * use share mode locking will only conflict against one another. "normal" > + * applications that open files won't be affected by and won't themselves > + * affect the share reservations. > + */ > +static int > +locks_mand_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) > +{ > + unsigned char caller_type = caller_fl->fl_type; > + unsigned char sys_type = sys_fl->fl_type; > + fmode_t caller_fmode = caller_fl->fl_file->f_mode; > + fmode_t sys_fmode = sys_fl->fl_file->f_mode; > + > + /* they can only conflict if they're both LOCK_MAND */ > + if (!(caller_type & LOCK_MAND) || !(sys_type & LOCK_MAND)) > + return 0; > + > + if (!(caller_type & LOCK_READ) && (sys_fmode & FMODE_READ)) > + return 1; > + if (!(caller_type & LOCK_WRITE) && (sys_fmode & FMODE_WRITE)) > + return 1; > + if (!(sys_type & LOCK_READ) && (caller_fmode & FMODE_READ)) > + return 1; > + if (!(sys_type & LOCK_WRITE) && (caller_fmode & FMODE_WRITE)) > + return 1; > + > + return 0; > +} > + > +/* > + * Determine if lock sys_fl blocks lock caller_fl. O_DENY* flags specific > + * checking before calling the locks_conflict(). > + */ > +static int > +deny_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) > +{ > + /* > + * FLOCK locks referring to the same filp do not conflict with > + * each other. > + */ > + if (!IS_FLOCK(sys_fl)) > + return 0; > + if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND)) > + return locks_mand_conflict(caller_fl, sys_fl); > + if (caller_fl->fl_file == sys_fl->fl_file) > + return 0; > + > + return locks_conflict(caller_fl, sys_fl); > +} > + > +/* > + * Determine if lock sys_fl blocks lock caller_fl. FLOCK specific > * checking before calling the locks_conflict(). > */ > -static int flock_locks_conflict(struct file_lock *caller_fl, struct > file_lock *sys_fl) > +static int > +flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) > { > - /* FLOCK locks referring to the same filp do not conflict with > + /* > + * FLOCK locks referring to the same filp do not conflict with >* each other. >*/ > if (!IS_FLOCK(sys_fl) || (caller_fl->fl_file == sys_fl->fl_file)) > @@ -789,6 +863,32 @@ out: > return error; > } > > +/* > + * Determine if a file is allowed to be opened with specified access and deny > + * modes. Lock the file and return 0 if checks passed, otherwise return a > error > + * code. > + */ > +int > +deny_lock_file(struct file *filp) > +{ > + struct file_lock *lock; > + int error = 0;
Re: [PATCH v2 4/8] CIFS: Add O_DENY* open flags support
On Thu, Jan 17, 2013 at 08:53:00PM +0400, Pavel Shilovsky wrote: > Make CIFSSMBOpen take share_flags as a parm that allows us > to pass new O_DENY* flags to the server. > > Signed-off-by: Pavel Shilovsky > --- > fs/cifs/cifsacl.c | 10 ++ > fs/cifs/cifsglob.h | 12 +++- > fs/cifs/cifsproto.h | 9 + > fs/cifs/cifssmb.c | 47 +-- > fs/cifs/dir.c | 13 - > fs/cifs/file.c | 12 > fs/cifs/inode.c | 11 ++- > fs/cifs/link.c | 10 +- > fs/cifs/readdir.c | 2 +- > fs/cifs/smb1ops.c | 15 --- > fs/cifs/smb2file.c | 10 +- > fs/cifs/smb2inode.c | 4 ++-- > fs/cifs/smb2ops.c | 10 ++ > fs/cifs/smb2pdu.c | 6 +++--- > fs/cifs/smb2proto.h | 14 -- > 15 files changed, 107 insertions(+), 78 deletions(-) > > diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c > index 0fb15bb..a42c77f 100644 > --- a/fs/cifs/cifsacl.c > +++ b/fs/cifs/cifsacl.c > @@ -1184,8 +1184,9 @@ static struct cifs_ntsd *get_cifs_acl_by_path(struct > cifs_sb_info *cifs_sb, > create_options |= CREATE_OPEN_BACKUP_INTENT; > > rc = CIFSSMBOpen(xid, tcon, path, FILE_OPEN, READ_CONTROL, > - create_options, &fid, &oplock, NULL, cifs_sb->local_nls, > - cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR); > + FILE_SHARE_ALL, create_options, &fid, &oplock, > + NULL, cifs_sb->local_nls, > + cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR); > if (!rc) { > rc = CIFSSMBGetCIFSACL(xid, tcon, fid, &pntsd, pacllen); > CIFSSMBClose(xid, tcon, fid); > @@ -1245,8 +1246,9 @@ int set_cifs_acl(struct cifs_ntsd *pnntsd, __u32 acllen, > access_flags = WRITE_DAC; > > rc = CIFSSMBOpen(xid, tcon, path, FILE_OPEN, access_flags, > - create_options, &fid, &oplock, NULL, cifs_sb->local_nls, > - cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR); > + FILE_SHARE_ALL, create_options, &fid, &oplock, > + NULL, cifs_sb->local_nls, > + cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR); It would be easier to tell what you changed in the above two chunks if you didn't change the whitespace at the same time. --b. > if (rc) { > cERROR(1, "Unable to open file to set ACL"); > goto out; > diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h > index 3763624..8c1a27f 100644 > --- a/fs/cifs/cifsglob.h > +++ b/fs/cifs/cifsglob.h > @@ -313,7 +313,7 @@ struct smb_version_operations { > struct cifs_sb_info *); > /* open a file for non-posix mounts */ > int (*open)(const unsigned int, struct cifs_tcon *, const char *, int, > - int, int, struct cifs_fid *, __u32 *, FILE_ALL_INFO *, > + int, int, int, struct cifs_fid *, __u32 *, FILE_ALL_INFO *, > struct cifs_sb_info *); > /* set fid protocol-specific info */ > void (*set_fid)(struct cifsFileInfo *, struct cifs_fid *, __u32); > @@ -948,6 +948,16 @@ struct cifsFileInfo { > struct work_struct oplock_break; /* work for oplock breaks */ > }; > > +#define CIFS_DENY_RW_FLAGS_SHIFT 22 > +#define CIFS_DENY_DEL_FLAG_SHIFT 23 > + > +static inline int cifs_get_share_flags(unsigned int flags) > +{ > + return (flags & O_DENYMAND) ? > + (((~(flags >> CIFS_DENY_RW_FLAGS_SHIFT)) & 3) | > + ((~(flags >> CIFS_DENY_DEL_FLAG_SHIFT)) & 4)) : FILE_SHARE_ALL; > +} > + > struct cifs_io_parms { > __u16 netfid; > #ifdef CONFIG_CIFS_SMB2 > diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h > index 5144e9f..3f5a46a 100644 > --- a/fs/cifs/cifsproto.h > +++ b/fs/cifs/cifsproto.h > @@ -361,10 +361,11 @@ extern int CIFSSMBQueryReparseLinkInfo(const unsigned > int xid, > const struct nls_table *nls_codepage); > #endif /* temporarily unused until cifs_symlink fixed */ > extern int CIFSSMBOpen(const unsigned int xid, struct cifs_tcon *tcon, > - const char *fileName, const int disposition, > - const int access_flags, const int omode, > - __u16 *netfid, int *pOplock, FILE_ALL_INFO *, > - const struct nls_table *nls_codepage, int remap); > + const char *file_name, const int disposition, > + const int access_flags, const int share_flags, > + const int omode, __u16 *netfid, int *oplock, > + FILE_ALL_INFO *, const struct nls_table *nls_codepage, > + int remap); > extern int SMBLegacyOpen(const unsigned int xid, struct cifs_tcon *tcon, > const char *fileName, const int disposition, >
Re: [RFC PATCH 0/4] SUNRPC: rework cache upcall to avoid NFSd root swapping
On Tue, Jan 15, 2013 at 11:09:23AM +0300, Stanislav Kinsbursky wrote: > The main idea of this patch set is to call cache request not on kthread > upcall, but on userspace daemon cache_read call. This fixes the problem with > gaining of wrong dentry path after calling d_path() in kthread root context > (svc_export_request() callback), which always work in init root context, but > containers can work in "root jail" - i.e. have it's own nested root. Sorry for the delay. This looks good to me--committing pending some testing. What's left now for basic containerized nfsd support? --b. > > The following series implements... > > --- > > Stanislav Kinsbursky (4): > SUNRPC: introduce cache_detail->cache_request callback > SUNRPC: remove cache_detail->cache_upcall callback > SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() > function > SUNRPC: move cache_detail->cache_request callback call to cache_read() > > > fs/nfs/dns_resolve.c |2 +- > fs/nfsd/export.c | 14 ++-- > fs/nfsd/nfs4idmap.c | 16 ++--- > include/linux/sunrpc/cache.h | 11 +++-- > net/sunrpc/auth_gss/svcauth_gss.c |8 +-- > net/sunrpc/cache.c| 44 > - > net/sunrpc/svcauth_unix.c | 14 ++-- > 7 files changed, 36 insertions(+), 73 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Sat, Feb 02, 2013 at 01:04:03PM +1100, Stephen Rothwell wrote: > Hi, > > After merging the nfsd tree, today's linux-next build (x86_64 > allmodconfig) failed like this: > > fs/nfs/dns_resolve.c: In function 'nfs_dns_resolver_cache_init': > fs/nfs/dns_resolve.c:375:4: error: 'struct cache_detail' has no member named > 'cache_upcall' > fs/nfs/dns_resolve.c:375:35: warning: left-hand operand of comma expression > has no effect [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > fs/nfs/dns_resolve.c:375:35: warning: value computed is not used > [-Wunused-value] > > Caused by commit aab982cb5dfb ("SUNRPC: remove cache_detail->cache_upcall > callback"). Yes, I knew why we'd introduced cache_upcall, so I'm not sure how I overlooked that. It must have slipped through testing because I didn't set CONFIG_NFS_USE_KERNEL_DNS. We may just be able to revert that patch I can take care of that by tomorrow. > Also, why don't those statements end in semicolons? Looks like that whole chunk was cut-and-pasted from a previous static initializer in 1b340d0118da1d7c60c664f17d7c8fce2bb1cd9d "NFS: DNS resolver cache per network namespace context introduced". > > I have used the nsfd tree from next-20130128 for today. Thanks. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG at net/sunrpc/svc_xprt.c:921
On Mon, Feb 25, 2013 at 03:45:07PM -0500, Mark Lord wrote: > On 13-01-17 08:53 AM, J. Bruce Fields wrote: > > On Thu, Jan 17, 2013 at 08:11:52AM -0500, Mark Lord wrote: > >> On 13-01-14 11:17 AM, Mark Lord wrote: > >>> > >>> Here's the code with the BUG() at net/sunrpc/svc_xprt.c line 921: > >>> > >>> /* > >>> * Remove a dead transport > >>> */ > >>> static void svc_delete_xprt(struct svc_xprt *xprt) > >>> { > >>> struct svc_serv *serv = xprt->xpt_server; > >>> struct svc_deferred_req *dr; > >>> > >>> /* Only do this once */ > >>> if (test_and_set_bit(XPT_DEAD, &xprt->xpt_flags)) > >>> BUG(); > >> > > Saw this again today on 3.7.9 -- dunno if your changes are in that kernel yet > though. Nope. The nfsd changes for 3.9 should get merged in a few days and then backported to stable kernels not much later. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
nfsd changes for 3.9
Please pull nfsd changes for 3.9 from the for-3.9 branch at: git://linux-nfs.org/~bfields/linux.git for-3.9 Miscellaneous bugfixes, plus: - An overhaul of the DRC cache by Jeff Layton. The main effect is just to make it larger. This decreases the chances of intermittent errors especially in the UDP case. But we'll need to watch for any reports of performance regressions. - Containerized nfsd: with some limitations, we now support per-container nfs-service, thanks to extensive work from Stanislav Kinsbursky over the last year. --b. Andriy Skulysh (1): sunrpc: Fix lockd sleeping until timeout Fengguang Wu (1): nfsd4: free_stid can be static J. Bruce Fields (7): nfsd4: simplify nfsd4_encode_fattr interface slightly svcrpc: silence "unused variable" warning in !RPC_DEBUG case nfsd4: require version 4 when enabling or disabling minorversion nfsd4: simplify idr allocation svcrpc: make svc_age_temp_xprts enqueue under sv_lock svcrpc: fix rpc server shutdown races SUNRPC: make AF_LOCAL connect synchronous Jeff Layton (23): nfsd: fix IPv6 address handling in the DRC nfsd: remove unneeded spinlock in nfsd_cache_update nfsd: get rid of RC_INTR nfsd: create a dedicated slabcache for DRC entries nfsd: add alloc and free functions for DRC entries nfsd: remove redundant test from nfsd_reply_cache_free nfsd: clean up and clarify the cache expiration code nfsd: break out hashtable search into separate function nfsd: initialize the exp->ex_uuid field in svc_export_init nfsd: always move DRC entries to the end of LRU list when updating timestamp nfsd: track the number of DRC entries in the cache nfsd: dynamically allocate DRC entries nfsd: remove the cache_disabled flag nfsd: when updating an entry with RC_NOCACHE, just free it nfsd: add recurring workqueue job to clean the cache nfsd: register a shrinker for DRC cache entries sunrpc: copy scope ID in __rpc_copy_addr6 sunrpc: move address copy/cmp/convert routines and prototypes from clnt.h to addr.h sunrpc: fix comment in struct xdr_buf definition sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer nfsd: keep a checksum of the first 256 bytes of request nfsd: fix comments on nfsd_cache_lookup nfsd: fix compiler warning about ambiguous types in nfsd_cache_csum Stanislav Kinsbursky (11): nfsd: fix unused "nn" variable warning in free_client() NFS: use SUNRPC cache creation and destruction helper for DNS cache NFS: simplify and clean cache library SUNRPC: introduce cache_detail->cache_request callback SUNRPC: rework cache upcall logic SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() function SUNRPC: move cache_detail->cache_request callback call to cache_read() nfsd: containerize NFSd filesystem nfsd: use proper net while reading "exports" file nfsd: disable usermode helper client tracker in container nfsd: enable NFSv4 state in containers Tim Gardner (1): lockd: nlmclnt_reclaim(): avoid stack overflow Yanchuan Nian (3): nfsd: Pass correct slot number to nfsd4_put_drc_mem() nfsd: Don't unlock the state while it's not locked nfsd: Remove write permission from file content majianpeng (2): nfsd: Fix memleak in svc_export_put nfsd: Fix memleak fs/lockd/clntlock.c | 14 +- fs/lockd/clntproc.c | 6 +- fs/lockd/host.c | 1 + fs/lockd/mon.c | 1 + fs/lockd/svcsubs.c | 2 +- fs/nfs/cache_lib.c | 12 +- fs/nfs/cache_lib.h | 2 - fs/nfs/dns_resolve.c| 67 +++ fs/nfs/nfs4client.c | 1 + fs/nfs/nfs4filelayoutdev.c | 1 + fs/nfs/nfs4namespace.c | 1 + fs/nfs/super.c | 1 + fs/nfsd/cache.h | 17 +- fs/nfsd/export.c| 16 +- fs/nfsd/fault_inject.c | 2 +- fs/nfsd/nfs4idmap.c | 16 +- fs/nfsd/nfs4proc.c | 7 +- fs/nfsd/nfs4recover.c | 6 + fs/nfsd/nfs4state.c | 102 ++- fs/nfsd/nfs4xdr.c | 21 +-- fs/nfsd/nfscache.c | 354 fs/nfsd/nfsctl.c| 81 ++--- fs/nfsd/nfssvc.c| 6 +- fs/nfsd/xdr4.h | 2 +- include/linux/lockd/lockd.h | 3 +- include/linux/sunrpc/addr.h | 170 + include/linux/sunrpc/ca
Re: [PATCH 1/6] fs/nfsd/nfs4idmap.c: adjust inconsistent IS_ERR and PTR_ERR
On Sat, Aug 25, 2012 at 09:57:04PM +0200, Julia Lawall wrote: > From: Julia Lawall > > Change the call to PTR_ERR to access the value just tested by IS_ERR. Applying for 3.7, thanks.--b. > > The semantic match that finds this problem is as follows: > (http://coccinelle.lip6.fr/) > > // > @@ > expression e,e1; > @@ > > ( > if (IS_ERR(e)) { ... PTR_ERR(e) ... } > | > if (IS_ERR(e=e1)) { ... PTR_ERR(e) ... } > | > *if (IS_ERR(e)) > { ... > * PTR_ERR(e1) >... } > ) > // > > Signed-off-by: Julia Lawall > > --- > fs/nfsd/nfs4idmap.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c > index fdc91a6..11df4ac 100644 > --- a/fs/nfsd/nfs4idmap.c > +++ b/fs/nfsd/nfs4idmap.c > @@ -478,7 +478,7 @@ nfsd_idmap_init(struct net *net) > goto destroy_idtoname_cache; > nn->nametoid_cache = cache_create_net(&nametoid_cache_template, net); > if (IS_ERR(nn->nametoid_cache)) { > - rv = PTR_ERR(nn->idtoname_cache); > + rv = PTR_ERR(nn->nametoid_cache); > goto unregister_idtoname_cache; > } > rv = cache_register_net(nn->nametoid_cache, net); > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
On Wed, Aug 29, 2012 at 10:10:10AM -0400, Namjae Jeon wrote: > This commit adds FILEID_INVALID = 0xff in fid_type to > indicate invalid fid_type OK, applying for 3.7. Looks like this shows up in a lot of filesystems too as just "255". Are you planning to patch up the filesystems afterwards? --b. > > It avoids using magic number 255 > > Signed-off-by: Namjae Jeon > Signed-off-by: Vivek Trivedi > --- > fs/exportfs/expfs.c |4 ++-- > fs/fhandle.c |2 +- > fs/nfsd/nfsfh.c |4 ++-- > include/linux/exportfs.h |5 + > 4 files changed, 10 insertions(+), 5 deletions(-) > > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c > index 29ab099..f1f1c59 100644 > --- a/fs/exportfs/expfs.c > +++ b/fs/exportfs/expfs.c > @@ -322,10 +322,10 @@ static int export_encode_fh(struct inode *inode, struct > fid *fid, > > if (parent && (len < 4)) { > *max_len = 4; > - return 255; > + return FILEID_INVALID; > } else if (len < 2) { > *max_len = 2; > - return 255; > + return FILEID_INVALID; > } > > len = 2; > diff --git a/fs/fhandle.c b/fs/fhandle.c > index a48e4a1..78a7879 100644 > --- a/fs/fhandle.c > +++ b/fs/fhandle.c > @@ -52,7 +52,7 @@ static long do_sys_name_to_handle(struct path *path, > handle_bytes = handle_dwords * sizeof(u32); > handle->handle_bytes = handle_bytes; > if ((handle->handle_bytes > f_handle.handle_bytes) || > - (retval == 255) || (retval == -ENOSPC)) { > + (retval == FILEID_INVALID) || (retval == -ENOSPC)) { > /* As per old exportfs_encode_fh documentation >* we could return ENOSPC to indicate overflow >* But file system returned 255 always. So handle > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c > index 032af38..814afaa 100644 > --- a/fs/nfsd/nfsfh.c > +++ b/fs/nfsd/nfsfh.c > @@ -572,7 +572,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, > struct dentry *dentry, > > if (inode) > _fh_update(fhp, exp, dentry); > - if (fhp->fh_handle.fh_fileid_type == 255) { > + if (fhp->fh_handle.fh_fileid_type == FILEID_INVALID) { > fh_put(fhp); > return nfserr_opnotsupp; > } > @@ -603,7 +603,7 @@ fh_update(struct svc_fh *fhp) > goto out; > > _fh_update(fhp, fhp->fh_export, dentry); > - if (fhp->fh_handle.fh_fileid_type == 255) > + if (fhp->fh_handle.fh_fileid_type == FILEID_INVALID) > return nfserr_opnotsupp; > } > out: > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h > index 12291a7..0e14525 100644 > --- a/include/linux/exportfs.h > +++ b/include/linux/exportfs.h > @@ -83,6 +83,11 @@ enum fid_type { >* 64 bit parent inode number. >*/ > FILEID_NILFS_WITH_PARENT = 0x62, > + > + /* > + * Filesystems must not use 0xff file ID. > + */ > + FILEID_INVALID = 0xff, > }; > > struct fid { > -- > 1.7.9.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] nfsd: remove duplicate init in nfsd4_cb_recall
On Wed, Aug 29, 2012 at 10:10:53AM -0400, Namjae Jeon wrote: > remove duplicate init in nfsd4_cb_recall Thanks, applying for 3.7.--b. > > Signed-off-by: Namjae Jeon > Signed-off-by: Vivek Trivedi > --- > fs/nfsd/nfs4callback.c |1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index 4c7bd35..bdf29c9 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -1028,7 +1028,6 @@ void nfsd4_cb_recall(struct nfs4_delegation *dp) > cb->cb_msg.rpc_cred = callback_cred; > > cb->cb_ops = &nfsd4_cb_recall_ops; > - dp->dl_retries = 1; > > INIT_LIST_HEAD(&cb->cb_per_client); > cb->cb_done = true; > -- > 1.7.9.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/5] fat: eliminate orphaned inode number allocation
On Wed, Sep 05, 2012 at 02:07:40AM +0900, OGAWA Hirofumi wrote: > OGAWA Hirofumi writes: > > > Namjae Jeon writes: > > > >> From: Namjae Jeon > >> > >> Maintain a list of inode(i_pos) numbers of orphaned inodes (i.e the > >> inodes that have been unlinked but still having open file > >> descriptors).At file/directory creation time, skip using such i_pos > >> values.Removal of the i_pos from the list is done during inode eviction. > > > > What happens if the directory (has busy entries) was completely removed? > > > > > > And Al's point is important for NFS too. If you want stable ino for NFS, > > you never can't change it. > > s/never can't/never can/ If vfat exports aren't fixable, maybe we should just remove that feature? I'm afraid that having unfixable half-working vfat exports is just an attractive nuisance that causes users and developers to waste their time --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] net: socket bind to file descriptor introduced
On Mon, Aug 20, 2012 at 02:18:13PM +0400, Stanislav Kinsbursky wrote: > 16.08.2012 07:03, Eric W. Biederman пишет: > >Stanislav Kinsbursky writes: > > > >>This patch set introduces new socket operation and new system call: > >>sys_fbind(), which allows to bind socket to opened file. > >>File to bind to can be created by sys_mknod(S_IFSOCK) and opened by > >>open(O_PATH). > >> > >>This system call is especially required for UNIX sockets, which has name > >>lenght limitation. > >> > >>The following series implements... > > > >Hmm. I just realized this patchset is even sillier than I thought. > > > >Stanislav is the problem you are ultimately trying to solve nfs clients > >in a container connecting to the wrong user space rpciod? > > > > Hi, Eric. > The problem you mentioned was the reason why I started to think about this. > But currently I believe, that limitations in unix sockets connect or > bind should be removed, because it will be useful it least for CRIU > project. > > >Aka net/sunrpc/xprtsock.c:xs_setup_local only taking an absolute path > >and then creating a delayed work item to actually open the unix domain > >socket? > > > >The straight correct and straight forward thing to do appears to be: > >- Capture the root from current->fs in xs_setup_local. > >- In xs_local_finish_connect change current->fs.root to the captured > > version of root before kernel_connect, and restore current->fs.root > > after kernel_connect. > > > >It might not be a bad idea to implement open on unix domain sockets in > >a filesystem as create(AF_LOCAL)+connect() which would allow you to > >replace __sock_create + kernel_connect with a simple file_open_root. > > > > I like the idea of introducing new family (AF_LOCAL_AT for example) > and new sockaddr for connecting or binding from specified root. The > only thing I'm worrying is passing file descriptor to unix bind or > connect routine. Because this approach doesn't provide easy way to > use such family and sockaddr in kernel (like in NFS example). > > >But I think the simple scheme of: > >struct path old_root; > >old_root = current->fs.root; > >kernel_connect(...); > >current->fs.root = old_root; > > > >Is more than sufficient and will remove the need for anything > >except a purely local change to get nfs clients to connect from > >containers. > > > > That was my first idea. So is this what you're planning on doing now? > And probably it would be worth to change all > fs_struct to support sockets with relative path. > What do you think about it? I didn't understand the question. Are you suggesting that changes to fs_struct would be required to make this work? I don't see why. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/5] fat: eliminate orphaned inode number allocation
On Wed, Sep 05, 2012 at 04:02:13AM +0900, OGAWA Hirofumi wrote: > "J. Bruce Fields" writes: > > > On Wed, Sep 05, 2012 at 02:07:40AM +0900, OGAWA Hirofumi wrote: > >> OGAWA Hirofumi writes: > >> > >> > Namjae Jeon writes: > >> > > >> >> From: Namjae Jeon > >> >> > >> >> Maintain a list of inode(i_pos) numbers of orphaned inodes (i.e the > >> >> inodes that have been unlinked but still having open file > >> >> descriptors).At file/directory creation time, skip using such i_pos > >> >> values.Removal of the i_pos from the list is done during inode eviction. > >> > > >> > What happens if the directory (has busy entries) was completely removed? > >> > > >> > > >> > And Al's point is important for NFS too. If you want stable ino for NFS, > >> > you never can't change it. > >> > >> s/never can't/never can/ > > > > If vfat exports aren't fixable, maybe we should just remove that > > feature? > > > > I'm afraid that having unfixable half-working vfat exports is just an > > attractive nuisance that causes users and developers to waste their > > time > > In historically, it was introduced by Neil Brown, when nfs export > interface was rewritten (I'm not sure what was intended). > > Personally, I'm ok to remove it though, it is really personal > opinion. The state would be rather I don't have strong opinion to > remove. Neil, any opinion? If we can document circumstances under which nfs exports of fat filesystems are reliable, fine. Otherwise I'd rather just be clear that we don't support it. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/2] issues with NFS filesystems as lower layer
On Tue, Sep 11, 2012 at 10:56:52PM +0200, Miklos Szeredi wrote: > "J. Bruce Fields" writes: > > >> > Secondly when using an NFSv3 R/O lower layer the filesystem permissions > >> > check refuses permission to write to the inode which prevents us from > >> > copying it up even though we have a writable upper layer. (With an ext4 > >> > lower layer the inode check will succeed if the inode is writable even > >> > if the filesystem is not.) It is not clear what the right solution is > >> > here. One approach is to check the inode permissions only (avoiding the > >> > filesystem specific permissions op), but it is not clear we can rely on > >> > these for all underlying filesystems. Perhaps this check should only be > >> > used for NFS. > > > > Then couldn't you for example end up circumventing ACLs on the > > underlying file to access data cached by reads from another user on the > > same system? > > Ignoring ACL's should always give less access, isn't that right? Not necessarily. (It's up to the server--and if anything servers probably want to err on the side of returning mode bits that are an upper, not a lower, bound on the permissions.) > > Is it possible to arrange that the check for a readonly filesystem be > > done only by the vfs and not also by ->permission? > > You'd need to modify NFS servers for that to work, no? It's possible > but not practical. Oh, OK, I guess I assumed you were dealing with an NFS filesystem that had been mounted readonly on the NFS client. If it's a read-write mount of a filesystem that's read-only on the server side: well, there is at least an error for that case: the server should return NFSERR_ROFS, and you should see EROFS--could you do the copy-up only in the case you get that error? --b. > > Thanks, > Miklos > > > > > > > --b. > > > >> > Perhaps it needs to be a mount option. The second patch > >> > (for discussion) following this email implements this, using the inode > >> > permissions when the lowerlayer is read-only. This seems to work as > >> > expected in my limited testing. > >> > >> I fear that will create an inconsistency between the read-only and the > >> non-read-only case, even though both should behave the same. > >> > >> I think the cleanest would be to create a mount option to always use > >> generic_permission (on both the lower and the upper fs). That would > >> give us two, slightly different, operating modes but each would be > >> self consistent. > >> > >> Thanks, > >> Miklos > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > >> the body of a message to majord...@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/5] fat: allocate persistent inode numbers
On Wed, Sep 12, 2012 at 11:12:56PM +0900, Namjae Jeon wrote: > 2012/9/12 OGAWA Hirofumi : > > Namjae Jeon writes: > > > I think that it is unfixable because we can not know i_pos of inode > changed by rename. > And even though we know it, there is no rebuild inode routine in -mm. > And It even can not fix in our patches. > >>> > > And are you tried https://lkml.org/lkml/2012/6/29/381 patches? It sounds > > like to improve performance by enabling lookupcache. > We checked this patches when facing estale issue in -mm. > But It is no use, these patches just retry system call one more when > estale error. > >>> > >>> What happens if client retried from lookup() after -ESTALE? (client NFS > >>> doesn't have the name of entry anymore?) > >> Need to rebuild inode routine because inode cache is already evicted on > >> Server. > >>> > >>> I'm assuming the retry means - it restarts from building the NFS file > >>> handle. I might be just wrong here though. > >> As I remember, just retry in VFS of NFS client..I heard this patch is > >> needed for > >> a very specific set of circumstances where an entry goes stale once > >> between the lookup and the actual operation(s). > >> It is not related with current issues(inode cache eviction on server). > > > > Supposing, the server/client state is after cold boot, and client try to > > rename at first without any cache on client/server. > > > > Even if this state, does the server return ESTALE? If it doesn't return > > ESTALE, I can't understand why it is really unfixable. > Hi OGAWA. > Server will not return ESTALE in this case. because the client does > not have any information for files yet. It does if the client mounted before the server rebooted. NFS is designed so that servers can reboot without causing clients to fail. (Applications will just see a delay during the reboot.) It probably isn't possible to this work in the case of fat. But from fat's point of view there probably isn't much difference between a filehandle lookup after a reboot and a filehandle lookup after the inode's gone from cache. I really don't see what you can do to help here. Won't anything that allows looking up an uncached inode by filehandle also risk finding the wrong file? (If looking up the same filehandle ever results in finding a *different* file from before, that's a bug. Probably a more dangerous bug than an ESTALE--in the ESTALE case the failure is obvious whereas in the case where you get the wrong file, you may silently corrupt data.) --b. > I mean NFS client does not have any old NFS FH(containing old inode > number) for this. > > > > > If it returns ESTALE, why does it return? I'm assuming the previous code > > path is the cached FH path. > The main point for observation is the file handle-which is used for > all the NFS operation. > So for all the NFS operation(read/write) which makes use of the > NFS file handle in between if there is a change in inode number > It will result in ESTALE. > Changing inode number on rename happened at NFS server by inode cache > eviction with memory pressure. > > lookupcache is used at NFS client to reduce number of LOOKUP operations. > But , we can still get ESTALE if inode number at NFS Server change > after LOOKUP, although lookupcache is disable. > > LOOKUP return NFS FH->[inode number changed at NFS Server] -> > But we still use old NFS FH returned from LOOKUP for any file > operation(write,read,etc..) > -> ESTALE will be returned. > > Thanks! > > -- > > OGAWA Hirofumi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/2] issues with NFS filesystems as lower layer
On Wed, Sep 12, 2012 at 05:20:17PM +0200, Miklos Szeredi wrote: > "J. Bruce Fields" writes: > > > On Tue, Sep 11, 2012 at 10:56:52PM +0200, Miklos Szeredi wrote: > >> "J. Bruce Fields" writes: > >> > >> >> > Secondly when using an NFSv3 R/O lower layer the filesystem > >> >> > permissions > >> >> > check refuses permission to write to the inode which prevents us from > >> >> > copying it up even though we have a writable upper layer. (With an > >> >> > ext4 > >> >> > lower layer the inode check will succeed if the inode is writable > >> >> > even > >> >> > if the filesystem is not.) It is not clear what the right solution is > >> >> > here. One approach is to check the inode permissions only (avoiding > >> >> > the > >> >> > filesystem specific permissions op), but it is not clear we can rely > >> >> > on > >> >> > these for all underlying filesystems. Perhaps this check should only > >> >> > be > >> >> > used for NFS. > >> > > >> > Then couldn't you for example end up circumventing ACLs on the > >> > underlying file to access data cached by reads from another user on the > >> > same system? > >> > >> Ignoring ACL's should always give less access, isn't that right? > > > > Not necessarily. > > > > (It's up to the server--and if anything servers probably want to err on > > the side of returning mode bits that are an upper, not a lower, bound on > > the permissions.) > > Okay, I looked at the POSIX ACL access check algorithm and now > understand things better. > > Now i think that enforcing ACL's on the overlay is not possible if ACL's > are not copied up together with the data. And the NFSv4 case is different (it doesn't use "posix" acls), etc., etc. But forget all that, even mode bits aren't much use since you don't understand how the server might be mapping your uid, so you don't know whether you're the owner, or a member of the group. On the other hand, for some cases maybe what you want is to just forget all that and trust the file owners and mode bits, so maybe that's useful at least as an option. > > Oh, OK, I guess I assumed you were dealing with an NFS filesystem that > > had been mounted readonly on the NFS client. > > > > If it's a read-write mount of a filesystem that's read-only on the > > server side: well, there is at least an error for that case: the server > > should return NFSERR_ROFS, and you should see EROFS--could you do the > > copy-up only in the case you get that error? > > If the server returns EROFS then all you know is that the filesystem is > read-only. But that's not what we are interested in. We are interested > in whether the access would be allowed *if* the filesystem *wasn't* > read-only. The server doesn't tell us that. Yeah. I wonder if it would be reasonable in the future to let servers tell us that. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/5] fat: allocate persistent inode numbers
On Thu, Sep 13, 2012 at 02:03:51AM +0900, OGAWA Hirofumi wrote: > "J. Bruce Fields" writes: > > >> > Supposing, the server/client state is after cold boot, and client try to > >> > rename at first without any cache on client/server. > >> > > >> > Even if this state, does the server return ESTALE? If it doesn't return > >> > ESTALE, I can't understand why it is really unfixable. > >> Hi OGAWA. > >> Server will not return ESTALE in this case. because the client does > >> not have any information for files yet. > > > > It does if the client mounted before the server rebooted. NFS is > > designed so that servers can reboot without causing clients to fail. > > (Applications will just see a delay during the reboot.) > > > > It probably isn't possible to this work in the case of fat. > > > > But from fat's point of view there probably isn't much difference > > between a filehandle lookup after a reboot and a filehandle lookup after > > the inode's gone from cache. > > This is talking about to retry by client side, not server side solution. > What happens if client retry after got ESTALE? (Yes, this would not be > the solution for all NFS clients. But, I guess this solution can be for > linux NFS client.) > > > I really don't see what you can do to help here. Won't anything that > > allows looking up an uncached inode by filehandle also risk finding the > > wrong file? > > On other view (as server side solution), we are thinking there is > possible to make the stable filehandle on FAT if we disabled some > operations (e.g. rename(), unlink()) which change inode location in FAT. > > Yes, it would be stable, but supporting limited operations. > > This is server side solution, and we comparing it with client solution. Is that useful to anyone? > >> > If it returns ESTALE, why does it return? I'm assuming the previous code > >> > path is the cached FH path. > >> The main point for observation is the file handle-which is used for > >> all the NFS operation. > >> So for all the NFS operation(read/write) which makes use of the > >> NFS file handle in between if there is a change in inode number > >> It will result in ESTALE. > >> Changing inode number on rename happened at NFS server by inode cache > >> eviction with memory pressure. > >> > >> lookupcache is used at NFS client to reduce number of LOOKUP operations. > >> But , we can still get ESTALE if inode number at NFS Server change > >> after LOOKUP, although lookupcache is disable. > >> > >> LOOKUP return NFS FH->[inode number changed at NFS Server] -> > >> But we still use old NFS FH returned from LOOKUP for any file > >> operation(write,read,etc..) > >> -> ESTALE will be returned. > > Yes. And I'm expecting as client side solution, > > -> ESTALE will be returned -> discard old FH -> restart from LOOKUP -> > make cached inode -> use returned new FH. > > Yeah, I know this is unstable (there is no perfect solution for now), You may end up with a totally different file, of course: client: server: open "/foo/bar" rename "/foo/baz"->"/foo/bar" write to file And now we're writing to the file that was originally named /foo/baz when we should have gotten ESTALE. --b. > but if this works, this doesn't limit any operations. > > We would want to compare client solution (-mm) and server solution > (stable ino). Or I'd like to know which my knowledges/understanding are > wrong here. > -- > OGAWA Hirofumi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/5] fat: allocate persistent inode numbers
On Thu, Sep 13, 2012 at 02:38:11AM +0900, OGAWA Hirofumi wrote: > "J. Bruce Fields" writes: > > >> On other view (as server side solution), we are thinking there is > >> possible to make the stable filehandle on FAT if we disabled some > >> operations (e.g. rename(), unlink()) which change inode location in FAT. > >> > >> Yes, it would be stable, but supporting limited operations. > >> > >> This is server side solution, and we comparing it with client solution. > > > > Is that useful to anyone? > > Good question. I'm not sure though, Namjae is asking. And I was asked > about stable read-only export in past. > > >> >> LOOKUP return NFS FH->[inode number changed at NFS Server] -> > >> >> But we still use old NFS FH returned from LOOKUP for any file > >> >> operation(write,read,etc..) > >> >> -> ESTALE will be returned. > >> > >> Yes. And I'm expecting as client side solution, > >> > >> -> ESTALE will be returned -> discard old FH -> restart from LOOKUP -> > >> make cached inode -> use returned new FH. > >> > >> Yeah, I know this is unstable (there is no perfect solution for now), > > > > You may end up with a totally different file, of course: > > > > client: server: > > > > open "/foo/bar" > > rename "/foo/baz"->"/foo/bar" > > write to file > > > > And now we're writing to the file that was originally named /foo/baz > > when we should have gotten ESTALE. > > I see. So, client can't solve the ESTALE if inode cache was evicted, > right? (without application changes) I don't see how. As another server-side workaround: maybe they could try tuning the inode caching to make eviction less likely? Grepping around... Documentation/sysctl/vm.txt mentions a vfs_cache_pressure parameter. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/5] fat: allocate persistent inode numbers
On Thu, Sep 13, 2012 at 05:33:02PM +0900, OGAWA Hirofumi wrote: > Namjae Jeon writes: > > >> I see. So, client can't solve the ESTALE if inode cache was evicted, > >> right? (without application changes) > > > > There can be situation where we may get not only ESTALE but EIO also. > > > > For example, > > --- > > fd = open(“foo.txt”); > > while (1) { > >sleep(1); > >write(fd..); > > } > > > > > > Here “write” may fail when inode number of “foo.txt” is changed at > > server due to cache eviction under memory pressure. > > When we tried a similar test, we found that “write” is retuning “EIO” > > instead of “ESTALE” > > > > - > > #> ./write_test_dbg bbb 1000 0 > > FILE : bbb, SIZE : 1048576000 , FSYNC : OFF , RECORD_SIZE = 4096 > > 106264 -rwxr-xr-x 1 root 0 0 Jan 1 00:14 bbb > > write failed after 60080128 bytes:, errno = 5: Input/output error > > - > > > > As we get EIO instead of ESTALE, it may be difficult to decide when > > "restart from LOOKUP” in such situation. > > Also, as per Bruce opinion, we can not avoid ESTALE from inode number > > change in rebooted server case. > > In reboot case, it is worst as it may attempt to write in a different > > file if NFS handle at NFS client match with inode number of some other > > file at NFS server. > > I see. > > >> Grepping around... Documentation/sysctl/vm.txt mentions a > >> vfs_cache_pressure parameter. > >> Yeah. And dirty hack will be possible to adjust sb->s_shrink.batch. > > I am worrying if it could lead to OOM condition on embedded > > system(short memory(DRAM) and support 3TB HDD disk of big size.) > > > > Please let me know if any issues or queries. > > So, now I think stable inode number may be useful if there are users of > it. And I guess those functionality is no collisions with -mm. And I > suppose we can add two modes for "nfs" option (e.g. nfs=1 and nfs=2). > > If nfs=1, works like current -mm without no limited operations. Apologies, I haven't been following the conversation carefully: remind me what "works like current -mm" means? --b. > If nfs=2, try to make stable FH and limit some operations > > (option name doesn't matter here.) > > Does this work fine? > -- > OGAWA Hirofumi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/5] fat: allocate persistent inode numbers
On Thu, Sep 13, 2012 at 11:24:30PM +0900, Namjae Jeon wrote: > 2012/9/13, OGAWA Hirofumi : > > "J. Bruce Fields" writes: > > > >>> >> Grepping around... Documentation/sysctl/vm.txt mentions a > >>> >> vfs_cache_pressure parameter. > >>> >> Yeah. And dirty hack will be possible to adjust sb->s_shrink.batch. > >>> > I am worrying if it could lead to OOM condition on embedded > >>> > system(short memory(DRAM) and support 3TB HDD disk of big size.) > >>> > > >>> > Please let me know if any issues or queries. > >>> > >>> So, now I think stable inode number may be useful if there are users of > >>> it. And I guess those functionality is no collisions with -mm. And I > >>> suppose we can add two modes for "nfs" option (e.g. nfs=1 and nfs=2). > >>> > >>> If nfs=1, works like current -mm without no limited operations. > >> > >> Apologies, I haven't been following the conversation carefully: remind > >> me what "works like current -mm" means? > > > > Current -mm means the best-effort work only if inode cache is not > > evicted. I.e. if there is no inode cache anymore on server, server > > would return ESTALE. So I guess the behavior would not be stable > > relatively. > Hi OGAWA. > Sorry for late response. > Okay, I will resend patchset include your suggeston.(-o nfs=2) > Do you mind adding busy list patch to avoid unlink issue ? > And in case of rename, FAT retrun EBUSY while opening file. > We can limit only rename. The server doesn't necessarily know whether a client has the file open, so does that really help? --b. > Let me know your opinion. > > Thanks OGAWA! > > > > > Thanks. > > > >>> If nfs=2, try to make stable FH and limit some operations > >>> > >>> (option name doesn't matter here.) > >>> > >>> Does this work fine? > > -- > > OGAWA Hirofumi > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] strings: helper for maximum decimal encoding of an unsigned integer
On Fri, Sep 14, 2012 at 08:30:14AM -0400, Jim Rees wrote: > But I still like my way better. Yeah. It looks less mysterious once you realize it's just a straightforward application of the usual coincidence 2^10 = 1024 ~ 1000 = 10^3 How about this? Also with some more defines because I like typing char buf[ULONG_STR_MAX + 1]; better than char bug[base10len(unsigned long) + 1]; --b. commit 59a620640cd05d3d29e678ff893cfe266091fba7 Author: J. Bruce Fields Date: Wed Aug 15 17:41:47 2012 -0400 strings: helper for maximum decimal encoding of an unsigned integer I've seen a couple examples recently where we've gotten this wrong. Maybe something like this would help? Suggested-by: Jim Rees Signed-off-by: J. Bruce Fields diff --git a/include/linux/string.h b/include/linux/string.h index ffe0442..38da7a0 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -126,6 +126,27 @@ extern void argv_free(char **argv); extern bool sysfs_streq(const char *s1, const char *s2); extern int strtobool(const char *s, bool *res); +/* + * length of the decimal representation of an integer, not including any + * sign or null termination. Just an approximation, but it's right for + * unsigned types of size 1 to 26 bytes: + */ +#define base10len(i) (sizeof(i) * 8 * 3 / 10 + 1) + +/* Actually a slight overestimate in the signed 64-bit case: */ + +#define UCHAR_STR_MAX base10len(char) +#define CHAR_STR_MAX base10len(char) + 1 +#define UINT_STR_MAX base10len(int) +#define INT_STR_MAXbase10len(int) + 1 +#define ULONG_STR_MAX base10len(long) +#define LONG_STR_MAX base10len(long) + 1 + +#define U32_STR_MAXbase10len(u32) +#define S32_STR_MAXbase10len(u32) + 1 +#define U64_STR_MAXbase10len(u64) +#define S64_STR_MAXbase10len(u64) + 1 + #ifdef CONFIG_BINARY_PRINTF int vbin_printf(u32 *bin_buf, size_t size, const char *fmt, va_list args); int bstr_printf(char *buf, size_t size, const char *fmt, const u32 *bin_buf); diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index 2afd2a8..d80d482 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -1409,7 +1409,7 @@ static ssize_t read_flush(struct file *file, char __user *buf, size_t count, loff_t *ppos, struct cache_detail *cd) { - char tbuf[20]; + char tbuf[ULONG_STR_MAX + 1]; unsigned long p = *ppos; size_t len; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
nfsd patches for 2.6.25
The following changes are available from the for-linus branch of the git repository at: git://linux-nfs.org/~bfields/linux.git for-linus The largest single change is the addition of rdma support for nfsd, and the preceding major refactoring of the server rpc code to add a transport api. The rest is mostly smaller cleanup and bugfixes. --b. Chuck Lever (10): SUNRPC: Prevent length underflow in read_flush() SUNRPC: Use unsigned string lengths in xdr_decode_string_inplace NLM: Fix sign of length of NLM variable length strings NFSD: Use unsigned length argument for decode_filename NFSD: File name length signage in nfsd request argument structures NFSD: Adjust filename length argument of nfsd_lookup NFSD: Use unsigned length argument for decode_pathname NFSD: Fix mixed sign comparison in nfs3svc_decode_symlinkargs NFSD: Path name length signage in nfsd request argument structures SUNRPC: RPC program information is stored in unsigned integers Frank Filz (1): nfsd: Allow AIX client to read dir containing mountpoints J. Bruce Fields (34): nfsd4: probe callback channel only once nfsd: move callback rpc_client creation into separate thread knfsd: fix broken length check in nfs4idmap.c knfsd: fix cache.c comment nfsd: Fix handling of negative lengths in read_buf() knfsd: cleanup nfsd4 properly on module init failure nfsd: cleanup nfsd module initialization cleanup nfsd: fail module init on reply cache init failure knfsd: cache unregistration needn't return error nfsd: select CONFIG_PROC_FS in nfsv4 and gss server cases nfsd: fail init on /proc/fs/nfs/exports creation failure nfsd: move cache proc (un)registration to separate function knfsd: allow cache_register to return error on failure nfsd: move nfsd/auth.h into fs/nfsd nfsd: minor fs/nfsd/auth.h cleanup nfsd4: kill some unneeded setclientid comments nfsd: eliminate final bogus case from setclientid logic nfsd: uniquify cl_confirm values nfsd4: kill unnecessary same_name() in setclientid_confirm nfsd4: remove unnecessary cl_verifier check from setclientid_confirm nfsd4: kill unneeded cl_confirm check nfsd: fix encode_entryplus_baggage() indentation nfsd4: make current_clientid local nfsd4: miscellaneous nfs4state.c style fixes nfsd4: recognize callback channel failure earlier nfsd4: fix bad seqid on lock request incompatible with open mode nfsd: allow root to set uid and gid on create nfsd: fix rsi_cache reference count leak sunrpc: gss: simplify rsi_parse logic nfsd4: clean up access_valid, deny_valid checks. svcrpc: ensure gss DESTROY tokens free contexts from cache knfsd: don't bother mapping putrootfh enoent to eperm lockd: minor log message fix nfsd: more careful input validation in nfsctl write methods Jeff Layton (2): SUNRPC: spin svc_rqst initialization to its own function NLM: tear down RPC clients in nlm_shutdown_hosts Neil Brown (1): knfsd: change mailing list for nfsd in MAINTAINERS Oleg Drokin (3): lockd: fix reference count leaks in async locking case lockd: fix a leak in nlmsvc_testlock asynchronous request handling Leak in nlmsvc_testlock for async GETFL case Prasad P (1): nfsd: Fix inconsistent assignment Tom Tucker (47): svc: Add an svc transport class svc: Make svc_sock the tcp/udp transport svc: Change the svc_sock in the rqstp structure to a transport svc: Add a max payload value to the transport svc: Move sk_sendto and sk_recvfrom to svc_xprt_class svc: Add transport specific xpo_release function svc: Add per-transport delete functions svc: Add xpo_prep_reply_hdr svc: Add a transport function that checks for write space svc: Move close processing to a single place svc: Add xpo_accept transport function svc: Remove unnecessary call to svc_sock_enqueue svc: Move connection limit checking to its own function svc: Add a generic transport svc_create_xprt function svc: Change services to use new svc_create_xprt service svc: Change sk_inuse to a kref svc: Move sk_flags to the svc_xprt structure svc: Move sk_server and sk_pool to svc_xprt svc: Make close transport independent svc: Move sk_reserved to svc_xprt svc: Make the enqueue service transport neutral and export it. svc: Make svc_send transport neutral svc: Change svc_sock_received to svc_xprt_received and export it svc: Move accept call to svc_xprt_received to common code svc: Remove sk_lastrecv svc: Move the authinfo cache to svc_xprt. svc: Make deferral processing xprt independent svc: Move the sockaddr information to svc_xprt svc: Make svc_sock_release svc_xprt_release svc: Mak
Re: Integration of SCST in the mainstream Linux kernel
On Mon, Feb 04, 2008 at 11:44:31AM -0800, Linus Torvalds wrote: ... > Pure user-space solutions work, but tend to eventually be turned into > kernel-space if they are simple enough and really do have throughput and > latency considerations (eg nfsd), and aren't quite complex and crazy > enough to have a large impedance-matching problem even for basic IO stuff > (eg samba). ... > So just going by what has happened in the past, I'd assume that iSCSI > would eventually turn into "connecting/authentication in user space" with > "data transfers in kernel space". But only if it really does end up > mattering enough. We had a totally user-space NFS daemon for a long time, > and it was perfectly fine until people really started caring. I'd assumed the move was primarily because of the difficulty of getting correct semantics on a shared filesystem--if you're content with NFS-only access to your filesystem, then you can probably do everything in userspace, but once you start worrying about getting stable filehandles, consistent file locking, etc., from a real disk filesystem with local users, then you require much closer cooperation from the kernel. And I seem to recall being told that sort of thing was the motivation more than performance, but I wasn't there (and I haven't seen performance comparisons). --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] net: socket bind to file descriptor introduced
On Tue, Sep 04, 2012 at 03:00:07PM -0400, bfields wrote: > On Mon, Aug 20, 2012 at 02:18:13PM +0400, Stanislav Kinsbursky wrote: > > 16.08.2012 07:03, Eric W. Biederman пишет: > > >Stanislav Kinsbursky writes: > > > > > >>This patch set introduces new socket operation and new system call: > > >>sys_fbind(), which allows to bind socket to opened file. > > >>File to bind to can be created by sys_mknod(S_IFSOCK) and opened by > > >>open(O_PATH). > > >> > > >>This system call is especially required for UNIX sockets, which has name > > >>lenght limitation. > > >> > > >>The following series implements... > > > > > >Hmm. I just realized this patchset is even sillier than I thought. > > > > > >Stanislav is the problem you are ultimately trying to solve nfs clients > > >in a container connecting to the wrong user space rpciod? > > > > > > > Hi, Eric. > > The problem you mentioned was the reason why I started to think about this. > > But currently I believe, that limitations in unix sockets connect or > > bind should be removed, because it will be useful it least for CRIU > > project. > > > > >Aka net/sunrpc/xprtsock.c:xs_setup_local only taking an absolute path > > >and then creating a delayed work item to actually open the unix domain > > >socket? > > > > > >The straight correct and straight forward thing to do appears to be: > > >- Capture the root from current->fs in xs_setup_local. > > >- In xs_local_finish_connect change current->fs.root to the captured > > > version of root before kernel_connect, and restore current->fs.root > > > after kernel_connect. > > > > > >It might not be a bad idea to implement open on unix domain sockets in > > >a filesystem as create(AF_LOCAL)+connect() which would allow you to > > >replace __sock_create + kernel_connect with a simple file_open_root. > > > > > > > I like the idea of introducing new family (AF_LOCAL_AT for example) > > and new sockaddr for connecting or binding from specified root. The > > only thing I'm worrying is passing file descriptor to unix bind or > > connect routine. Because this approach doesn't provide easy way to > > use such family and sockaddr in kernel (like in NFS example). > > > > >But I think the simple scheme of: > > >struct path old_root; > > >old_root = current->fs.root; > > >kernel_connect(...); > > >current->fs.root = old_root; > > > > > >Is more than sufficient and will remove the need for anything > > >except a purely local change to get nfs clients to connect from > > >containers. > > > > > > > That was my first idea. > > So is this what you're planning on doing now? What ever happened to this? --b. > > > And probably it would be worth to change all > > fs_struct to support sockets with relative path. > > What do you think about it? > > I didn't understand the question. Are you suggesting that changes to > fs_struct would be required to make this work? I don't see why. > > --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression?
On Tue, Oct 23, 2012 at 03:07:59PM +0100, Nix wrote: > On 23 Oct 2012, J. Bruce Fields uttered the following: > > > On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote: > >> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of > >> smooth operation on 3.6.1: one of the NFS changes that went into one of > >> the two latest stable kernels appears to be lethal after around half an > >> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since > >> my camera was recharging and there is no netconsole from that box > >> because it is where the netconsole logs go, so I'll have to reproduce it > >> later today). The machine is an NFSv3 server only at present, with no > >> NFSv4 running (though NFSv4 is built in). > > > > Note recent clients may try to negotiate NFSv4 by default, so it's > > possible to use it without knowing. > > Every NFS import from all this server's clients has 'vers=3' forced on > (for now, until I get around to figuring out what if anything needs to > be done to move to NFSv4: it may be the answer is 'nothing' but I tread > quite carefully with this server, since my $HOME is there). > > /proc/fs/nfsfs/volumes on the clients confirms that everything is v3. > > > You didn't change anything else about your server or clients recently? > > Nope (other than upgrading the clients to 3.6.3 in concert). Running > 3.6.1 here on that server now and 3.6.3 on all the clients, and no crash > in over a day. > > nfs-utils is a bit old, 1.2.6-rc6, I should probably upgrade it... nfs-utils shouldn't be capable of oopsing the kernel, so from my (selfish) point of view I'd actually rather you stick with whatever you have and try to reproduce the oops. (It's unlikely to make any difference anyway.) --b. > > I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3 > > commits, but of course I could be missing something. > > I'll try rebooting into it again soon, and get an oops report if it > should happen again (and hopefully less filesystem corruption this time, > right after a reboot into a new kernel was clearly the wrong time to > move a bunch of directories around). > > Sorry for the continuing lack of useful information... I'll try to fix > that shortly (once with a report of a false alarm, since I really want > this stable kernel: my server is affected by the tx stalls fixed by > 3468e9d and I'm getting tired of the frequent 1s lags talking to it: I > could just apply that one fix, but I'd rather track this down properly). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote: > On 23 Oct 2012, J. Bruce Fields uttered the following: > > nfs-utils shouldn't be capable of oopsing the kernel, so from my > > (selfish) point of view I'd actually rather you stick with whatever you > > have and try to reproduce the oops. > > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I > have runs of NULs in my logs now, which keep eating the oopses): > > [while in 3.6.1] > [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). > There's a risk of filesystem corruption in case of system crash. > [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). > There's a risk of filesystem corruption in case of system crash. > [ 89.648152] [ cut here ] > [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() > [ 89.648614] Hardware name: empty > [ 89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: > microcode] > [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 > [ 89.649610] Call Trace: > [ 89.649832] [] warn_slowpath_common+0x83/0x9b > [ 89.650063] [] warn_slowpath_null+0x1a/0x1c > [ 89.650292] [] drop_nlink+0x25/0x42 > [ 89.650533] [] ext4_dec_count+0x26/0x28 > [ 89.650763] [] ext4_rename+0x4ec/0x7b4 > [ 89.650993] [] ? vfs_rename+0xbe/0x3b7 > [ 89.651224] [] vfs_rename+0x27c/0x3b7 > [ 89.651454] [] sys_renameat+0x1b1/0x228 > [ 89.651682] [] ? fsnotify+0x226/0x249 > [ 89.651911] [] ? security_inode_permission+0x1e/0x20 > [ 89.652145] [] ? vfs_write+0x116/0x142 > [ 89.652372] [] sys_rename+0x1b/0x1e > [ 89.652601] [] system_call_fastpath+0x16/0x1b > [...] > [while having just booted into 3.6.1 after some time in 3.6.3: the FS > was clean, and fscked on the previous boot into 3.6.3 after a previous > instance of this bug] > Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): > mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit > 13828) > > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. > > > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to > turn off panic_on_oops when I come home from the holidays). > > It's a lockd problem, and probably happens during delivery of mail over > NFS (my mailserver load soars when it happens): > > [ 813.110354] [ cut here ] > [ 813.110585] kernel BUG at fs/lockd/mon.c:150! So nsm_mon_unmon() is being passed a NULL client. There are three container patches between 3.6.1 and 3.6.3: lockd: per-net NSM client creation and destruction helpers introduced lockd: use rpc client's cl_nodename for id encoding lockd: create and use per-net NSM RPC clients on MON/UNMON requests and that last does change nsm_monitor's call to nsm_mon_unmon, so that's almost certainly it Looks like there's some confusion about whether nsm_client_get() returns NULL or an error? --b. > [ 813.110878] invalid opcode: [#1] SMP > [ 813.73] Modules linked in: firewire_ohci firewire_core [last unloaded: > microcode] > [ 813.111727] CPU 1 > [ 813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty > empty/S7010 > [ 813.112388] RIP: 0010:[] [] > nsm_mon_unmon+0x64/0x98 > [ 813.112840] RSP: 0018:88062163dcf0 EFLAGS: 00010246 > [ 813.113069] RAX: 88062163dcf8 RBX: RCX: > > [ 813.113303] RDX: 88062163dd68 RSI: 0002 RDI: > 88062163dd40 > [ 813.113537] RBP: 88062163dd50 R08: R09: > 8120c463 > [ 813.113771] R10: 8120c463 R11: 0001 R12: > > [ 813.114006] R13: 88061f067e40 R14: 88059c763a00 R15: > 88062163de28 > [ 813.114241] FS: () GS:88063fc4() > knlGS: > [ 813.114651] CS: 0010 DS: ES: CR0: 8005003b > [ 813.114882] CR2: ff600400 CR3: 01a0b000 CR4: > 07e0 > [ 813.115116] DR0: DR1: DR2: > > [ 813.115350] DR3: DR6: 0ff0 DR7: > 0400 > [ 813.115584] Process lockd (pid: 1040, threadinfo 88062163c000, task > 880621611640) > [ 813.115994] Stack: > [ 813.116210] 88062163dd90 8805f4e044b1 0003
Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
On Tue, Oct 23, 2012 at 12:46:21PM -0400, J. Bruce Fields wrote: > On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote: > > On 23 Oct 2012, J. Bruce Fields uttered the following: > > > nfs-utils shouldn't be capable of oopsing the kernel, so from my > > > (selfish) point of view I'd actually rather you stick with whatever you > > > have and try to reproduce the oops. > > > > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was > > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss > > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I > > have runs of NULs in my logs now, which keep eating the oopses): > > > > [while in 3.6.1] > > [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = > > 0). There's a risk of filesystem corruption in case of system crash. > > [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = > > 0). There's a risk of filesystem corruption in case of system crash. > > [ 89.648152] [ cut here ] > > [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() > > [ 89.648614] Hardware name: empty > > [ 89.648833] Modules linked in: firewire_ohci firewire_core [last > > unloaded: microcode] > > [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 > > [ 89.649610] Call Trace: > > [ 89.649832] [] warn_slowpath_common+0x83/0x9b > > [ 89.650063] [] warn_slowpath_null+0x1a/0x1c > > [ 89.650292] [] drop_nlink+0x25/0x42 > > [ 89.650533] [] ext4_dec_count+0x26/0x28 > > [ 89.650763] [] ext4_rename+0x4ec/0x7b4 > > [ 89.650993] [] ? vfs_rename+0xbe/0x3b7 > > [ 89.651224] [] vfs_rename+0x27c/0x3b7 > > [ 89.651454] [] sys_renameat+0x1b1/0x228 > > [ 89.651682] [] ? fsnotify+0x226/0x249 > > [ 89.651911] [] ? security_inode_permission+0x1e/0x20 > > [ 89.652145] [] ? vfs_write+0x116/0x142 > > [ 89.652372] [] sys_rename+0x1b/0x1e > > [ 89.652601] [] system_call_fastpath+0x16/0x1b > > [...] > > [while having just booted into 3.6.1 after some time in 3.6.3: the FS > > was clean, and fscked on the previous boot into 3.6.3 after a previous > > instance of this bug] > > Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): > > mb_free_blocks:1300: group 65, block 2143748:freeing already freed block > > (bit 13828) > > > > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to > > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. > > > > > > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to > > turn off panic_on_oops when I come home from the holidays). > > > > It's a lockd problem, and probably happens during delivery of mail over > > NFS (my mailserver load soars when it happens): > > > > [ 813.110354] [ cut here ] > > [ 813.110585] kernel BUG at fs/lockd/mon.c:150! > > So nsm_mon_unmon() is being passed a NULL client. > > There are three container patches between 3.6.1 and 3.6.3: > > lockd: per-net NSM client creation and destruction helpers introduced > lockd: use rpc client's cl_nodename for id encoding > lockd: create and use per-net NSM RPC clients on MON/UNMON requests > > and that last does change nsm_monitor's call to nsm_mon_unmon, so that's > almost certainly it > > Looks like there's some confusion about whether nsm_client_get() returns > NULL or an error? The return from nsm_client_get() is either from nsm_create() or from ln->nsm_clnt. nsm_create's return is from rpc_create, and it doesn't look possible for rpc_creat to return NULL. So probably we have some case where, while holding ln->nsm_clnt_lock, you can see ln->nsm_users nonzero, but ln->nsm_clnt NULL ? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
On Tue, Oct 23, 2012 at 06:36:15PM +0100, Nix wrote: > On 23 Oct 2012, n...@esperi.org.uk uttered the following: > > > On 23 Oct 2012, Trond Myklebust spake thusly: > >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote: > >>> Looks like there's some confusion about whether nsm_client_get() returns > >>> NULL or an error? > >> > >> nsm_client_get() looks extremely racy in the case where ln->nsm_users == > >> 0. Since we never recheck the value of ln->nsm_users after taking > >> nsm_create_mutex, what is stopping 2 different threads from both setting > >> ln->nsm_clnt and re-initialising ln->nsm_users? > > > > Yep. At the worst possible time: > > > > spin_lock(&ln->nsm_clnt_lock); > > if (ln->nsm_users) { > > if (--ln->nsm_users) > > ln->nsm_clnt = NULL; > > (1) shutdown = !ln->nsm_users; > > } > > spin_unlock(&ln->nsm_clnt_lock); > > > > If a thread reinitializes nsm_users at point (1), after the assignment, > > we could well end up with ln->nsm_clnt NULL and shutdown false. A bit > > later, nsm_mon_unmon gets called with a NULL clnt, and boom. > > Possible fix if so, utterly untested so far (will test when I can face > yet another reboot and fs-corruption-recovery-hell cycle, in a few > hours), may ruin performance, violate locking hierarchies, and consume > kittens: Right, mutexes can't be taken while holding spinlocks. Keep the kittens well away from the computer. --b. > > diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c > index e4fb3ba..da91cdf 100644 > --- a/fs/lockd/mon.c > +++ b/fs/lockd/mon.c > @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > spin_unlock(&ln->nsm_clnt_lock); > goto out; > } > - spin_unlock(&ln->nsm_clnt_lock); > > mutex_lock(&nsm_create_mutex); > clnt = nsm_create(net); > @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > ln->nsm_users = 1; > } > mutex_unlock(&nsm_create_mutex); > + spin_unlock(&ln->nsm_clnt_lock); > out: > return clnt; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] nfs: fix wrong object type in lockowner_slab
On Wed, Oct 24, 2012 at 02:44:19PM +0800, ycn...@gmail.com wrote: > From: Yanchuan Nian > > The object type in the cache of lockowner_slab is wrong, and it is better to > fix it. > > Signed-off-by: Yanchuan Nian > --- > fs/nfsd/nfs4state.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index cc894ed..1609eb2 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -2325,7 +2325,7 @@ nfsd4_init_slabs(void) > if (openowner_slab == NULL) > goto out_nomem; > lockowner_slab = kmem_cache_create("nfsd4_lockowners", > - sizeof(struct nfs4_openowner), 0, 0, NULL); > + sizeof(struct nfs4_lockowner), 0, 0, NULL); Thanks! The first three fields of each are the same, the rest is: struct nfs4_openowner { ... pointer time_t unsigned char } struct nfs4_lockowner { ... struct list_head } Is it always true that sizeof(struct nfs4_openowner) >= sizeof9struct nfs4_lockowner) ? I think so, in which case this isn't urgent Applying for 3.8. --b. > if (lockowner_slab == NULL) > goto out_nomem; > file_slab = kmem_cache_create("nfsd4_files", > -- > 1.7.4.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Wed, Feb 13, 2008 at 12:48:41AM +, Al Viro wrote: > On Tue, Feb 12, 2008 at 07:16:50PM -0500, J. Bruce Fields wrote: > > > Ahem... Use of git-cherry-pick preserves commit information just fine. > > > > Not by default, at least (note they said "commiters", not "authors"): > > That's why you give it -r. What version of git have you tried this with? At least on my version, -r is a no-op, and the results are the same; the author is still kept and the maintainer changed. I thought it had been that way for ages. --b. [EMAIL PROTECTED]:~/local/linux-2.6$ git --version git version 1.5.4.rc2.60.gb2e62 [EMAIL PROTECTED]:~/local/linux-2.6$ git cherry-pick -r v2.6.25-rc1 Finished one cherry-pick. Created commit 0277143: Linux 2.6.25-rc1 1 files changed, 2 insertions(+), 2 deletions(-) [EMAIL PROTECTED]:~/local/linux-2.6$ git cat-file -p HEAD tree 4018d5d93f857d946dd89acbb4e45c9da04eadaf parent b6ce068a1285a24185b01be8a49021827516b3e1 author Linus Torvalds <[EMAIL PROTECTED]> 1202681894 -0800 committer J. Bruce Fields <[EMAIL PROTECTED]> 1202863894 -0500 Linux 2.6.25-rc1 .. and I really need to call it something else. Maybe it is time to bring back the weasel series, since weasels always make me feel good about a kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Tue, Feb 12, 2008 at 11:54:01PM +, Al Viro wrote: > On Tue, Feb 12, 2008 at 03:51:07PM -0800, David Miller wrote: > > From: Jeff Garzik <[EMAIL PROTECTED]> > > Date: Tue, 12 Feb 2008 11:31:48 -0500 > > > > > I understand the desire to want a nice and clean history, but the > > > frequency here really has a negative impact on your downstreams. > > > > Ok, fair enough. Any alternative suggestions on how to remove turds > > without them being in the history? > > > > > It also totally screws the commit statistics, wiping me and John and the > > > committers we have preserved out, replacing everybody's committer with > > > David Miller. > > > > I am well aware of this downside, sorry. > > Ahem... Use of git-cherry-pick preserves commit information just fine. Not by default, at least (note they said "commiters", not "authors"): [EMAIL PROTECTED]:~/local/linux-2.6$ git cherry-pick v2.6.25-rc1 Finished one cherry-pick. Created commit c445dc0: Linux 2.6.25-rc1 1 files changed, 2 insertions(+), 2 deletions(-) [EMAIL PROTECTED]:~/local/linux-2.6$ git show --pretty=raw HEAD commit c445dc0a8b11877d6038dc528254733348c9f110 tree 4018d5d93f857d946dd89acbb4e45c9da04eadaf parent b6ce068a1285a24185b01be8a49021827516b3e1 author Linus Torvalds <[EMAIL PROTECTED]> 1202681894 -0800 committer J. Bruce Fields <[EMAIL PROTECTED]> 1202861481 -0500 --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Tue, Feb 12, 2008 at 05:31:10PM -0800, Linus Torvalds wrote: > The importance of merging (rather, not screwing up history in general) > becomes really obvious when things go tits-up. Then they go tits-up > *without* screwing up the history of the trees that were hopefully tested > individually. > > If you re-base things that others developed, you lose that. Imagine if I > merged first Greg's tree (by rebasing), and then there was some > fundamental thing that didn't cause a conflict, but just made something > not work, when I rebased yours on top. Think about what happens. > > Now I've merged (say) 1500 networking-related commits by rebasing, but > because I rebased on top of Greg's tree that I had also rebased, > absolutely *none* of that has been tested in any shape of form. I'd not > use most of the things I pulled, so I'd never see it, I'd just push out > something that was very different from *both* trees I pulled, with no way > to really blame the merge - because it doesn't even exist. > > So as a result, some *random* commit that was actually fine on its own has > now become a bug, just because it was re-written. If there was a "fundamental thing that didn't cause a conflict", then the two trees in question probably didn't touch the same code, so would probably merge cleanly, for the same reason that one rebased onto the other cleanly. But depending on what the "fundamental thing" was, the merge might still introduce the same bug, right? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Tue, Feb 12, 2008 at 05:20:51PM -0800, David Miller wrote: > From: Linus Torvalds <[EMAIL PROTECTED]> > Date: Tue, 12 Feb 2008 16:44:47 -0800 (PST) > > > gitk --merge > ... > > This is something where I actually think git could and should do better: > > git has the capability to act as more of a "quilt replacement", but > > because it wasn't part of the original design, we never actualy exposed > > the simple queue management commands to do this (stgit does things like > > that, though). > > > > So if you haven't pushed out, right now you'd have to do this stupid > > thing: > > Thanks for all the useful info. > > But as soon as I've applied any patches to my tree I've "pushed out". > So this scheme doesn't work for me. The first thing I do when I have > changes to apply is clone a tree locally and on master.kernel.org, > then I apply that first patch locally and push it out to master. > > What would be really cool is if you could do the rebase thing, push > that to a remote tree you were already pushing into and others could > pull from that and all the right things happen. > > A rebase is just a series of events, and those could propagate to > people who are pulling from the tree. I'm pretty sure GIT could > support this. git checkout -b new-tree old-tree # hack on new-tree: rebase, drop bad commits, whatever git merge -s ours old-tree git push your-public-repo new-tree:public-tree Now public-tree has a merge commit on the top with two parents, public-tree^1 and public-tree^2. public-tree^1 is the tip of the new cleaned-up history, and public-tree^2 points to the old stuff. I sometimes use that privately as a way to keep track of the history of a patch-series, but I assume Linus would shoot anyone that tried to submit such a monstrosity upstream --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Tue, Feb 12, 2008 at 11:31:48AM -0500, Jeff Garzik wrote: > David Miller wrote: >> I rebase my tree all the time, at least once or twice per >> week. Why? >> >> Firstly, to remove crap. When you have "great idea A" then "oh shit A >> won't work, revert that" there is zero sense in keeping both >> changesets around. >> >> Secondly, I want to fix up the rejects caused by conflicts with >> upstream bug fixes and the like (and there are tons when the tree gets >> to 1500 or so patches like the networking did). I don't want git to >> merge the thing by hand, I want to see what the conflict is and make >> sure the "obvious" resolution is OK and the most efficient way I know >> how to do that is to suck my tree apart as patches, then suck them >> back into a fresh tree. > > FWIW, that is annoying and painful for us downstream jobbers, since it > isn't really how git was meant to be used. You use it more like a patch > queue, where commits are very fluid. > > Unfortunately, if there is any synchronization lag between me and you -- > not uncommon -- then I cannot commit changes on top of the changes just > sent, in my own local tree. Why? Because you rebase so often, I cannot > even locally commit dependent patches due to the end result merge > getting so nasty. > > I understand the desire to want a nice and clean history, but the > frequency here really has a negative impact on your downstreams. > > It also totally screws the commit statistics, wiping me and John and the > committers we have preserved out, replacing everybody's committer with > David Miller. But the "author" is still preserved, right? Why do you need the committer name to be preserved? (I'm not denying that there could be reasons, I'm just curious what they are.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Tue, Feb 12, 2008 at 09:43:10PM -0800, Linus Torvalds wrote: > So just the fact that the right commit gets blamed when somebody does a > "git bisect" is I think a big issue. It's just fundamentally more fair to > everybody. And it means that the people who push their work to me can > really choose to stand behind it, knowing that whatever happens, their > work won't get diluted by bad luck or others' incompetence. > > And no, maybe most people don't feel things like that matters. But I do > think it's important. The obvious advantage to rebasing in this case is that the blame (misplaced though it may be), at least lands on a commit that made a single small change, likely making the problem easier to diagnose. (As opposed to the case of a large merge, where all you may know is that somewhere in the hundreds of commits done on one side of the merge there was a conflict with the hundreds of commits on the other side.) I think a lot of people would see rebasing as an acceptable tradeof that gives up a small amount of accuracy in assigning blame to individuals in return for a large increase in ability to debug problems. I suppose one response to that would be that it's important that people learn how to work in parallel, that failures to do so are particularly important failures in the process, and that it's therefore worth it to make sure that such failures are always identified specifically as merge failures. It would be nice if merges, like patches, were broken up into somewhat smaller units. There's an understandable desire to wait to the last minute to actually commit to one's commits, but a willingness to do so a little earlier might avoid some of the problems that seem to come from having a lot of large merges happen all at once. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] make nfsd_create_setattr() static
On Wed, Feb 13, 2008 at 11:30:26PM +0200, Adrian Bunk wrote: > This patch makes the needlessly global nfsd_create_setattr() static. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> Thanks, applied.--b. > > --- > 7cfb17a22a0cf76477fbf1fbe6bf586108a13a44 diff --git a/fs/nfsd/vfs.c > b/fs/nfsd/vfs.c > index cc75e4f..035204f 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -1151,7 +1151,7 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, > } > #endif /* CONFIG_NFSD_V3 */ > > -__be32 > +static __be32 > nfsd_create_setattr(struct svc_rqst *rqstp, struct svc_fh *resfhp, > struct iattr *iap) > { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: first tree
On Thu, Feb 14, 2008 at 05:27:57PM -0500, Trond Myklebust wrote: > > On Fri, 2008-02-15 at 00:35 +1100, Stephen Rothwell wrote: > > > > Also, more trees please ... :-) > > Please add the 'linux-next' branch of > > git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git And 'nfsd-next' at git://git.linux-nfs.org/~bfields/linux.git nfsd-next --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
On Thu, Feb 14, 2008 at 07:35:03PM +0200, Benny Halevy wrote: > One idea that I thought about when debating rebase vs. merge (and it's > far far from being fully baked) is versioned commits. The gist of it > is that patches are assigned an hash identifier like today when they > are first committed into the tree, but, and this is the main change: > if they mutate, e.g. by a rebase, or even git commit --amend, their > version is bumped up rather than SHA changed. The SHA1 is uniquely determined by the contents of that commit--commit and author names and times, changelog message, snapshot of the tree at that point, and parents--hence, recursively, by the entire history leading up to that commit. Naming objects by their content in that way has a lot of advantages--for example, you can sign an entire history by signing just the SHA1 of the commit at its tip. So you can't break that link between the names of commits and their contents without ending up with a fundamentally different (and probably weaker, in some sense), system. I suspect there's an unavoidable tradeoff--if you want to be able to reliably and efficiently determine how two branches are related, then you can't just throw away their (possibly messy) history. The best you may be able to do, if you want the advantages of both rebasing and merging, is to maintain on your own the messier meta-history as a superset of the simpler history that you end up submitting. --b. > This way all versions of the commit would be accessible and addressable using > their respective SHA *and* version. I think that this can help keep > the tree's history in a more intuitive way (since the patches' base identifier > won't change, just its version number), and you get a bonus of seeing each > commit's history, who changed it, and what was the change. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv3] locks: prevent side-effects of locks_release_private before file_lock is initialized
On Fri, Jul 27, 2012 at 12:42:52AM -0400, Filipe Brandenburger wrote: > When calling fcntl(fd, F_SETLEASE, lck) [with lck=F_WRLCK or F_RDLCK], > the custom signal or owner (if any were previously set using F_SETSIG > or F_SETOWN fcntls) would be reset when F_SETLEASE was called for the > second time on the same file descriptor. > > This bug is a regression of 2.6.37 and is described here: > https://bugzilla.kernel.org/show_bug.cgi?id=43336 > > This patch reverts a commit from Oct 2004 (with subject "nfs4 lease: > move the f_delown processing") which originally introduced the > lm_release_private callback. Looks fine, thanks. I think can also do something like the following (on top of your patch). --b. commit 96d6d59ceaeaacba4088862f3c57fcd011f52832 Author: J. Bruce Fields Date: Fri Jul 27 16:18:00 2012 -0400 locks: move lease-specific code out of locks_delete_lock No point putting something only used by one caller into common code. Signed-off-by: J. Bruce Fields diff --git a/fs/locks.c b/fs/locks.c index 86668dd..541075a 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -570,12 +570,6 @@ static void locks_delete_lock(struct file_lock **thisfl_p) fl->fl_next = NULL; list_del_init(&fl->fl_link); - fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); - if (fl->fl_fasync != NULL) { - printk(KERN_ERR "locks_delete_lock: fasync == %p\n", fl->fl_fasync); - fl->fl_fasync = NULL; - } - if (fl->fl_nspid) { put_pid(fl->fl_nspid); fl->fl_nspid = NULL; @@ -1150,6 +1144,11 @@ int lease_modify(struct file_lock **before, int arg) f_delown(filp); filp->f_owner.signum = 0; + fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); + if (fl->fl_fasync != NULL) { + printk(KERN_ERR "locks_delete_lock: fasync == %p\n", fl->fl_fasync); + fl->fl_fasync = NULL; + } locks_delete_lock(before); } return 0; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 09/15] NFSd: make nfsd4_manager allocated per network namespace context.
On Wed, Jul 25, 2012 at 04:56:58PM +0400, Stanislav Kinsbursky wrote: > Signed-off-by: Stanislav Kinsbursky > --- > fs/nfsd/netns.h |2 ++ > fs/nfsd/nfs4state.c | 32 +++- > 2 files changed, 21 insertions(+), 13 deletions(-) > > diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h > index 3936563..e99767d 100644 > --- a/fs/nfsd/netns.h > +++ b/fs/nfsd/netns.h > @@ -34,6 +34,8 @@ struct nfsd_net { > > struct cache_detail *idtoname_cache; > struct cache_detail *nametoid_cache; > + > + struct lock_manager nfsd4_manager; > }; > > extern int nfsd_net_id; > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index ab0a02a..fad2408 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -45,6 +45,8 @@ > #include "vfs.h" > #include "current_stateid.h" > > +#include "netns.h" > + > #define NFSDDBG_FACILITYNFSDDBG_PROC > > /* Globals */ > @@ -3115,22 +3117,21 @@ out: > return status; > } > > -static struct lock_manager nfsd4_manager = { > -}; > - > static bool grace_ended; > > static void > -nfsd4_end_grace(void) > +nfsd4_end_grace(struct net *net) > { > + struct nfsd_net *nn = net_generic(net, nfsd_net_id); > + > /* do nothing if grace period already ended */ > if (grace_ended) > return; That doesn't look right But, OK, looking through later patches I see you make grace_ended per-namespace later. OK. --b. > > dprintk("NFSD: end of grace period\n"); > grace_ended = true; > - nfsd4_record_grace_done(&init_net, boot_time); > - locks_end_grace(&nfsd4_manager); > + nfsd4_record_grace_done(net, boot_time); > + locks_end_grace(&nn->nfsd4_manager); > /* >* Now that every NFSv4 client has had the chance to recover and >* to see the (possibly new, possibly shorter) lease time, we > @@ -3153,7 +3154,7 @@ nfs4_laundromat(void) > nfs4_lock_state(); > > dprintk("NFSD: laundromat service - starting\n"); > - nfsd4_end_grace(); > + nfsd4_end_grace(&init_net); > INIT_LIST_HEAD(&reaplist); > spin_lock(&client_lock); > list_for_each_safe(pos, next, &client_lru) { > @@ -4687,6 +4688,8 @@ set_max_delegations(void) > int > nfs4_state_start(void) > { > + struct net *net = &init_net; > + struct nfsd_net *nn = net_generic(net, nfsd_net_id); > int ret; > > /* > @@ -4696,10 +4699,10 @@ nfs4_state_start(void) >* to that instead and then do most of the rest of this on a per-net >* basis. >*/ > - get_net(&init_net); > - nfsd4_client_tracking_init(&init_net); > + get_net(net); > + nfsd4_client_tracking_init(net); > boot_time = get_seconds(); > - locks_start_grace(&nfsd4_manager); > + locks_start_grace(&nn->nfsd4_manager); > grace_ended = false; > printk(KERN_INFO "NFSD: starting %ld-second grace period\n", > nfsd4_grace); > @@ -4722,8 +4725,8 @@ nfs4_state_start(void) > out_free_laundry: > destroy_workqueue(laundry_wq); > out_recovery: > - nfsd4_client_tracking_exit(&init_net); > - put_net(&init_net); > + nfsd4_client_tracking_exit(net); > + put_net(net); > return ret; > } > > @@ -4764,9 +4767,12 @@ __nfs4_state_shutdown(void) > void > nfs4_state_shutdown(void) > { > + struct net *net = &init_net; > + struct nfsd_net *nn = net_generic(net, nfsd_net_id); > + > cancel_delayed_work_sync(&laundromat_work); > destroy_workqueue(laundry_wq); > - locks_end_grace(&nfsd4_manager); > + locks_end_grace(&nn->nfsd4_manager); > nfs4_lock_state(); > __nfs4_state_shutdown(); > nfs4_unlock_state(); > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/15] Lockd: grace period containerization
On Wed, Jul 25, 2012 at 04:55:45PM +0400, Stanislav Kinsbursky wrote: > Bruce, I feel this patch set is ready for inclusion. > > v2: > 1) Rebase on Bruce's "for-3.6" branch. > > This patch set makes grace period and hosts reclaiming network namespace > aware. On a quick skim--yes, that looks reasonable to me. It doesn't help with active/active cluster exports, because in that case we need some additional coordination between nfsd's. But it looks good enough to handle the case where each filesystem is exported from at most one server at a time, which is more than we currently handle. It's a little late for 3.6. Also I get the impression Al Viro has some lockd rework in progress, which we may want to wait for. So I'll likely look again into queueing this up for 3.7 once 3.6-rc1 is out. --b. > > Main ideas: > 1) moving of > > unsigned long next_gc; > unsigned long nrhosts; > > struct delayed_work grace_period_end; > struct lock_manager lockd_manager; > struct list_head grace_list; > > to per-net Lockd data. > > 2) moving of > > struct lock_manager nfsd4_manager; > > to per-net NFSd data. > > 3) shutdown + gc of NLM hosts done now network namespace aware. > > 4) restart_grace() now works only for init_net. > > The following series implements... > > --- > > Stanislav Kinsbursky (15): > LockD: mark host per network namespace on garbage collect > LockD: make garbage collector network namespace aware. > LockD: manage garbage collection timeout per networks namespace > LockD: manage used host count per networks namespace > Lockd: host complaining function introduced > Lockd: add more debug to host shutdown functions > LockD: manage grace period per network namespace > LockD: make lockd manager allocated per network namespace > NFSd: make nfsd4_manager allocated per network namespace context. > SUNRPC: service request network namespace helper introduced > LockD: manage grace list per network namespace > LockD: pass actual network namespace to grace period management > functions > Lockd: move grace period management from lockd() to per-net functions > NFSd: make grace end flag per network namespace > NFSd: make boot_time variable per network namespace > > > fs/lockd/grace.c| 16 +-- > fs/lockd/host.c | 92 ++ > fs/lockd/netns.h|7 +++ > fs/lockd/svc.c | 43 ++ > fs/lockd/svc4proc.c | 13 +++-- > fs/lockd/svclock.c | 16 +++ > fs/lockd/svcproc.c | 15 -- > fs/lockd/svcsubs.c | 19 +--- > fs/nfs/callback_xdr.c |4 +- > fs/nfsd/export.c|4 +- > fs/nfsd/netns.h |4 ++ > fs/nfsd/nfs4idmap.c |4 +- > fs/nfsd/nfs4proc.c | 18 --- > fs/nfsd/nfs4state.c | 104 > --- > fs/nfsd/state.h |4 +- > include/linux/fs.h |5 +- > include/linux/lockd/lockd.h |6 +- > include/linux/sunrpc/svc.h |2 + > 18 files changed, 231 insertions(+), 145 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv3] locks: prevent side-effects of locks_release_private before file_lock is initialized
On Fri, Jul 27, 2012 at 04:45:52PM -0400, J. Bruce Fields wrote: > On Fri, Jul 27, 2012 at 12:42:52AM -0400, Filipe Brandenburger wrote: > > When calling fcntl(fd, F_SETLEASE, lck) [with lck=F_WRLCK or F_RDLCK], > > the custom signal or owner (if any were previously set using F_SETSIG > > or F_SETOWN fcntls) would be reset when F_SETLEASE was called for the > > second time on the same file descriptor. > > > > This bug is a regression of 2.6.37 and is described here: > > https://bugzilla.kernel.org/show_bug.cgi?id=43336 > > > > This patch reverts a commit from Oct 2004 (with subject "nfs4 lease: > > move the f_delown processing") which originally introduced the > > lm_release_private callback. > > Looks fine, thanks. I think can also do something like the following > (on top of your patch). (Committing this as well.)--b. > > --b. > > commit 96d6d59ceaeaacba4088862f3c57fcd011f52832 > Author: J. Bruce Fields > Date: Fri Jul 27 16:18:00 2012 -0400 > > locks: move lease-specific code out of locks_delete_lock > > No point putting something only used by one caller into common code. > > Signed-off-by: J. Bruce Fields > > diff --git a/fs/locks.c b/fs/locks.c > index 86668dd..541075a 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -570,12 +570,6 @@ static void locks_delete_lock(struct file_lock > **thisfl_p) > fl->fl_next = NULL; > list_del_init(&fl->fl_link); > > - fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); > - if (fl->fl_fasync != NULL) { > - printk(KERN_ERR "locks_delete_lock: fasync == %p\n", > fl->fl_fasync); > - fl->fl_fasync = NULL; > - } > - > if (fl->fl_nspid) { > put_pid(fl->fl_nspid); > fl->fl_nspid = NULL; > @@ -1150,6 +1144,11 @@ int lease_modify(struct file_lock **before, int arg) > > f_delown(filp); > filp->f_owner.signum = 0; > + fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); > + if (fl->fl_fasync != NULL) { > + printk(KERN_ERR "locks_delete_lock: fasync == %p\n", > fl->fl_fasync); > + fl->fl_fasync = NULL; > + } > locks_delete_lock(before); > } > return 0; > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/15] Lockd: grace period containerization
On Mon, Jul 30, 2012 at 02:03:57PM +0400, Stanislav Kinsbursky wrote: > 28.07.2012 01:54, J. Bruce Fields пишет: > >On Wed, Jul 25, 2012 at 04:55:45PM +0400, Stanislav Kinsbursky wrote: > >>Bruce, I feel this patch set is ready for inclusion. > >> > >>v2: > >>1) Rebase on Bruce's "for-3.6" branch. > >> > >>This patch set makes grace period and hosts reclaiming network namespace > >>aware. > > > >On a quick skim--yes, that looks reasonable to me. > > > >It doesn't help with active/active cluster exports, because in that case > >we need some additional coordination between nfsd's. > > > >But it looks good enough to handle the case where each filesystem is > >exported from at most one server at a time, which is more than we > >currently handle. > > > >It's a little late for 3.6. Also I get the impression Al Viro has some > >lockd rework in progress, which we may want to wait for. > > > >So I'll likely look again into queueing this up for 3.7 once 3.6-rc1 is > >out. > > > Ok. > Will Al Viro's lockd rework be a part of 3.6 kernel? Actually I think it mostly won't be. And this looks pretty safe, really. I've gone ahead and merged it. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
nfsd changes for 3.6
Please pull nfsd (and some lockd and a couple locks/lease) changes from: git://linux-nfs.org/~bfields/linux.git nfsd-next This has been an unusually quiet cycle--mostly bugfixes and cleanup. The one large piece is Stanislav's work to containerize the server's grace period--but that in itself is just one more step in a not-yet-complete project to allow fully containerized nfs service. There are a number of outstanding delegation, container, v4 state, and gss patches that aren't quite ready yet; 3.7 may be wilder. --b. Chuck Lever (1): NFSD: TEST_STATEID should not return NFS4ERR_STALE_STATEID Eldad Zack (2): sunrpc/cache.h: fix coding style sunrpc/cache.h: replace simple_strtoul Filipe Brandenburger (1): locks: prevent side-effects of locks_release_private before file_lock is initialized J. Bruce Fields (10): nfsd4: remove unnecessary comment nfsd4: nfsd4_lock() cleanup nfsd4: process_open2 cleanup nfsd4: our filesystems are normally case sensitive nfsd4: release openowners on free in >=4.1 case nfsd: allow owner_override only for regular files nfsd: share some function prototypes nfsd: add get_uint for u32's locks: move lease-specific code out of locks_delete_lock nfsd4: fix missing fault_inject.h include NeilBrown (1): SUNRPC/cache: fix reporting of expired cache entries in 'content' file. Stanislav Kinsbursky (18): NFSd: fix locking in nfsd_forget_delegations() NFSd: introduce nfsd_destroy() helper NFSd: set nfsd_serv to NULL after service destruction LockD: mark host per network namespace on garbage collect LockD: make garbage collector network namespace aware. LockD: manage garbage collection timeout per networks namespace LockD: manage used host count per networks namespace Lockd: host complaining function introduced Lockd: add more debug to host shutdown functions LockD: manage grace period per network namespace LockD: make lockd manager allocated per network namespace NFSd: make nfsd4_manager allocated per network namespace context. SUNRPC: service request network namespace helper introduced LockD: manage grace list per network namespace LockD: pass actual network namespace to grace period management functions Lockd: move grace period management from lockd() to per-net functions NFSd: make grace end flag per network namespace NFSd: make boot_time variable per network namespace Vivek Trivedi (1): nfsd4: fix cr_principal comparison check in same_creds Weston Andros Adamson (1): nfsd: probe the back channel on new connections fs/lockd/grace.c | 16 ++-- fs/lockd/host.c | 92 --- fs/lockd/netns.h |7 ++ fs/lockd/svc.c | 43 + fs/lockd/svc4proc.c | 13 +-- fs/lockd/svclock.c | 16 ++-- fs/lockd/svcproc.c | 15 ++-- fs/lockd/svcsubs.c | 19 ++-- fs/locks.c | 28 +++--- fs/nfs/callback_xdr.c|4 +- fs/nfsd/export.c | 10 +-- fs/nfsd/netns.h |4 + fs/nfsd/nfs4callback.c |1 - fs/nfsd/nfs4idmap.c |4 +- fs/nfsd/nfs4proc.c | 18 ++-- fs/nfsd/nfs4state.c | 201 +- fs/nfsd/nfs4xdr.c|2 +- fs/nfsd/nfsctl.c |8 +- fs/nfsd/nfsd.h | 13 +++ fs/nfsd/nfssvc.c | 24 ++--- fs/nfsd/state.h |5 +- fs/nfsd/vfs.c| 10 ++- include/linux/fs.h |5 +- include/linux/lockd/lockd.h |6 +- include/linux/sunrpc/cache.h | 34 +-- include/linux/sunrpc/svc.h |2 + net/sunrpc/cache.c |5 +- 27 files changed, 368 insertions(+), 237 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfsd changes for 3.6
On Tue, Jul 31, 2012 at 10:29:48AM -0400, bfields wrote: > Please pull nfsd (and some lockd and a couple locks/lease) changes from: By the way, for a few years now I've been semi-regularly picking up locks.c changes for my tree. I wonder if I should be. At the least we should probably remove this entry from MAINTAINERS? And by default I suppose that stuff reverts to Al, who rumor has it will have a fair number of locks.c patches next time around. Or I'm happy to claim to be maintainer of that file if somebody prefers, it doesn't matter much to me. --b. diff --git a/MAINTAINERS b/MAINTAINERS index 3e30a3a..305ac2c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2767,15 +2767,6 @@ F: include/scsi/fc/ F: include/scsi/libfc.h F: include/scsi/libfcoe.h -FILE LOCKING (flock() and fcntl()/lockf()) -M: Matthew Wilcox -L: linux-fsde...@vger.kernel.org -S: Maintained -F: include/linux/fcntl.h -F: include/linux/fs.h -F: fs/fcntl.c -F: fs/locks.c - FILESYSTEMS (VFS and infrastructure) M: Alexander Viro L: linux-fsde...@vger.kernel.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/5] fat: allocate persistent inode numbers
On Mon, Sep 24, 2012 at 09:32:00PM +0900, OGAWA Hirofumi wrote: > Namjae Jeon writes: > > > 2012/9/24, OGAWA Hirofumi : > >> Namjae Jeon writes: > >> > I see. fileid seems to be stat.ino on nfsd4. inode->i_ino is actually > just a hash key of inode hash (exception is only in audit, iirc). > > So, what happens if we set "stat->ino = i_pos" on fat_getattr(). > > int fat_getattr(struct vfsmount *mnt, struct dentry *dentry, struct > kstat > *stat) > { > struct inode *inode = dentry->d_inode; > generic_fillattr(inode, stat); > stat->blksize = MSDOS_SB(inode->i_sb)->cluster_size; > if (opts->nfs == FAT_NFS_LIMITED) { > /* Use i_pos for ino. This is used as fileid of nfs. */ > stat->ino = MSDOS_SB(inode->i_sb)->i_pos; > >> > >>stat->ino = fat_i_pos_read(MSDOS_SB(inode->i_sb), inode); > >> > >> Ouch, I forgot to use fat_i_pos_read(). > >> > > There is some unclear thing. > > When I see first mail, I think maybe you don't want to use i_pos for > > inode->ino. > > FAT allocate inode->ino from i_unique on server side and If NFS client > > use i_pos for inode->ino in fat_get_attr, inode numbers on each > > client/server will still be mismatched. > > > > Would you plz give me hint ? > > ->i_ino is long. It can't hold i_pos fully on 32bit arch, so we can't > use ->i_no to store i_pos, and changing ->i_ino is unnecessary. If > getattr() returned i_pos as ino, nobody see ->i_ino anymore except > internal of kernel. The NFS server must always return the same inode number for the same filehandle. To do otherwise is a bug. > Furthermore I think there is no issue even if server and client didn't > have same ino. Because client just uses FH (nfs4 seems to be using > stat.ino though). The client may expose a different inode number to userspace, but it's probably the server-provided inode number that it's checking. (And even if the Linux client didn't currently happen to do that check, this would still be a bug.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/5] fat: allocate persistent inode numbers
On Tue, Sep 25, 2012 at 01:16:45AM +0900, OGAWA Hirofumi wrote: > "J. Bruce Fields" writes: > > >> > There is some unclear thing. > >> > When I see first mail, I think maybe you don't want to use i_pos for > >> > inode->ino. > >> > FAT allocate inode->ino from i_unique on server side and If NFS client > >> > use i_pos for inode->ino in fat_get_attr, inode numbers on each > >> > client/server will still be mismatched. > >> > > >> > Would you plz give me hint ? > >> > >> ->i_ino is long. It can't hold i_pos fully on 32bit arch, so we can't > >> use ->i_no to store i_pos, and changing ->i_ino is unnecessary. If > >> getattr() returned i_pos as ino, nobody see ->i_ino anymore except > >> internal of kernel. > > > > The NFS server must always return the same inode number for the same > > filehandle. To do otherwise is a bug. > > > >> Furthermore I think there is no issue even if server and client didn't > >> have same ino. Because client just uses FH (nfs4 seems to be using > >> stat.ino though). > > > > The client may expose a different inode number to userspace, but it's > > probably the server-provided inode number that it's checking. > > > > (And even if the Linux client didn't currently happen to do that check, > > this would still be a bug.) > > In this context, inode number != inode->i_ino, right? It should be > kstat.ino, and in FAT case, it will return i_pos always. Otherwise 64bit > inode number would not work. > > So, I think we are doing right thing for now. Oh, OK. On a quick check, yes, the numbers the server returns to clients are taken from either kstat.ino or the ino argument of the filldir function. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] SUNRPC: set desired file system root before connecting local transports
On Wed, Oct 10, 2012 at 02:32:28PM +0400, Stanislav Kinsbursky wrote: > 10.10.2012 05:23, J. Bruce Fields пишет: > >On Tue, Oct 09, 2012 at 03:47:42PM -0700, Eric W. Biederman wrote: > >>"J. Bruce Fields" writes: > >> > >>>On Tue, Oct 09, 2012 at 01:20:48PM -0700, Eric W. Biederman wrote: > >>>>"Myklebust, Trond" writes: > >>>> > >>>>>On Tue, 2012-10-09 at 15:35 -0400, J. Bruce Fields wrote: > >>>>>>Cc'ing Eric since I seem to recall he suggested doing it this way? > >>>> > >>>>Yes. On second look setting fs->root won't work. We need to change fs. > >>>>The problem is that by default all kernel threads share fs so changing > >>>>fs->root will have non-local consequences. > >>> > >>>Oh, huh. And we can't "unshare" it somehow? > >> > >>I don't fully understand how nfs uses kernel threads and work queues. > >>My general understanding is work queues reuse their kernel threads > >>between different users. So it is mostly a don't pollute your > >>environment thing. If there was a dedicated kernel thread for each > >>environment this would be trivial. > >> > >>What I was suggesting here is changing task->fs instead of > >>task->fs.root. That should just require task_lock(). > > > >Oh, OK, got it--if that works, great. > > > > The main problem with swapping fs struct is actually the same as in > root swapping. I.e. routines for copy fs_struct are not exported. > It could be done on place, but I don't think, that Al Viro would > support such implementation. > Trond? It seems like we got stalled here Could you go ahead and try a patch, and see what people think? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v7 09/16] SUNRPC/cache: use new hashtable implementation
On Mon, Oct 29, 2012 at 11:13:43AM -0400, Mathieu Desnoyers wrote: > * Linus Torvalds (torva...@linux-foundation.org) wrote: > > On Mon, Oct 29, 2012 at 5:42 AM, Mathieu Desnoyers > > wrote: > > > > > > So defining e.g.: > > > > > > #include > > > > > > #define DFR_HASH_BITS (PAGE_SHIFT - ilog2(BITS_PER_LONG)) > > > > > > would keep the intended behavior in all cases: use one page for the hash > > > array. > > > > Well, since that wasn't true before either because of the long-time > > bug you point out, clearly the page size isn't all that important. I > > think it's more important to have small and simple code, and "9" is > > certainly that, compared to playing ilog2 games with not-so-obvious > > things. > > > > Because there's no reason to believe that '9' is in any way a worse > > random number than something page-shift-related, is there? And getting > > away from *previous* overly-complicated size calculations that had > > been broken because they were too complicated and random, sounds like > > a good idea. > > Good point. I agree that unless we really care about the precise number > of TLB entries and cache lines used by this hash table, we might want to > stay away from page-size and pointer-size based calculation. > > It might not hurt to explain this in the patch changelog though. I'd also be happy to take that as a separate patch now. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 11/32] vfs: make do_unlinkat retry on ESTALE errors
On Sat, Oct 27, 2012 at 08:33:18AM -0400, Jeff Layton wrote: > Signed-off-by: Jeff Layton > --- > fs/namei.c | 12 ++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 7c9bb50..467b9f1 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3446,9 +3446,13 @@ static long do_unlinkat(int dfd, const char __user > *pathname) > struct filename *name; > struct dentry *dentry; > struct nameidata nd; > - struct inode *inode = NULL; > + struct inode *inode; > + unsigned int try = 0; > + unsigned int lookup_flags = LOOKUP_PARENT; > > - name = user_path_parent(dfd, pathname, &nd, 0); > +retry: > + inode = NULL; So, you fail after "inode" was set (say vfs_unlink returned an error) the first time, then before "inode" was set (lookup_hash returns an error), and you end up incorrectly doing another iput() the second time through if you don't reset inode here? (I think I made the same mistake in another patch, actually) --b. > + name = user_path_parent(dfd, pathname, &nd, try); > if (IS_ERR(name)) > return PTR_ERR(name); > > @@ -3486,6 +3490,10 @@ exit2: > exit1: > path_put(&nd.path); > putname(name); > + if (retry_estale(error, try++)) { > + lookup_flags |= LOOKUP_REVAL; > + goto retry; > + } > return error; > > slashes: > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 11/32] vfs: make do_unlinkat retry on ESTALE errors
On Tue, Oct 30, 2012 at 12:33:55PM -0400, Jeff Layton wrote: > On Tue, 30 Oct 2012 12:14:29 -0400 > "J. Bruce Fields" wrote: > > > On Sat, Oct 27, 2012 at 08:33:18AM -0400, Jeff Layton wrote: > > > Signed-off-by: Jeff Layton > > > --- > > > fs/namei.c | 12 ++-- > > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > index 7c9bb50..467b9f1 100644 > > > --- a/fs/namei.c > > > +++ b/fs/namei.c > > > @@ -3446,9 +3446,13 @@ static long do_unlinkat(int dfd, const char __user > > > *pathname) > > > struct filename *name; > > > struct dentry *dentry; > > > struct nameidata nd; > > > - struct inode *inode = NULL; > > > + struct inode *inode; > > > + unsigned int try = 0; > > > + unsigned int lookup_flags = LOOKUP_PARENT; > > > > > > - name = user_path_parent(dfd, pathname, &nd, 0); > > > +retry: > > > + inode = NULL; > > > > So, you fail after "inode" was set (say vfs_unlink returned an error) > > the first time, then before "inode" was set (lookup_hash returns an > > error), and you end up incorrectly doing another iput() the second time > > through if you don't reset inode here? > > > > (I think I made the same mistake in another patch, actually) > > > > --b. > > > > Correct. That's a new delta in this patch, btw. The original patch > didn't do that and it was causing a busy inodes on umount bug in > testing. > > It would occasionally hit an ESTALE error in this function and > because "inode" wasn't reset to NULL, it would do a double-put of the > inode and cause the counter to underflow. > > It might be good to restructure this code to make those sorts of bugs > less likely, but the error handling in here is already so hairy that I > decided to punt on that for now... Understood. I might find it just a little more obvious why we're doing this if the assignment was next to the final iput: if (inode) iput(inode); inode = NULL; ... --b. > > > > + name = user_path_parent(dfd, pathname, &nd, try); > > > if (IS_ERR(name)) > > > return PTR_ERR(name); > > > > > > @@ -3486,6 +3490,10 @@ exit2: > > > exit1: > > > path_put(&nd.path); > > > putname(name); > > > + if (retry_estale(error, try++)) { > > > + lookup_flags |= LOOKUP_REVAL; > > > + goto retry; > > > + } > > > return error; > > > > > > slashes: > > > -- > > > 1.7.11.7 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > > the body of a message to majord...@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Issues with a rather unusual configured NFS server
On Tue, Aug 13, 2013 at 05:53:14PM -0400, bfields wrote: > On Mon, Aug 12, 2013 at 04:36:40PM +0200, Jan Kara wrote: > > On Sun 11-08-13 11:48:49, Toralf Förster wrote: > > > so that the server either crashes (if it is a user mode linux image) or > > > at least its reboot functionality got broken > > > - if the NFS server is hammered with scary NFS calls using a fuzzy tool > > > running at a remote NFS client under a non-privileged user id. > > > > > > It can re reproduced, if > > > - the NFS share is an EXT3 or EXT4 directory > > > - and it is created at file located at tempfs and mounted via loop > > > device > > > - and the NFS server is forced to umount the NFS share > > > - and the server forced to restart the NSF service afterwards > > > - and trinity is used > > > > > > I could find a scenario for an automated bisect. 2 times it brought this > > > commit > > > commit 68a3396178e6688ad7367202cdf0af8ed03c8727 > > > Author: J. Bruce Fields > > > Date: Thu Mar 21 11:21:50 2013 -0400 > > > > > > nfsd4: shut down more of delegation earlier > > Thanks for the report. I think I see the problem--after this commit > nfs4_set_delegation() failures result in nfs4_put_delegation being > called, but nfs4_put_delegation doesn't free the nfs4_file that has > already been set by alloc_init_deleg(). > > Let me think about how to fix that Sorry for the slow response--can you check whether this fixes the problem? --b. commit 624a0ee0375940ce4aa36330b0b5a70af6d2b6f5 Author: J. Bruce Fields Date: Thu Aug 15 16:55:26 2013 -0400 nfsd4: fix leak of inode reference on delegation failure This fixes a regression from 68a3396178e6688ad7367202cdf0af8ed03c8727 "nfsd4: shut down more of delegation earlier". After that commit, nfs4_set_delegation() failures result in nfs4_put_delegation being called, but nfs4_put_delegation doesn't free the nfs4_file that has already been set by alloc_init_deleg(). This can result in an oops on later unmounting the exported filesystem. Note also delaying the fi_had_conflict check we're able to return a better error (hence give 4.1 clients a better idea why the delegation failed; though note CONFLICT isn't an exact match here, as that's supposed to indicate a current conflict, but all we know here is that there was one recently). Reported-by: Toralf Förster Signed-off-by: J. Bruce Fields diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index eb9cf81..0874998 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -368,11 +368,8 @@ static struct nfs4_delegation * alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct svc_fh *current_fh) { struct nfs4_delegation *dp; - struct nfs4_file *fp = stp->st_file; dprintk("NFSD alloc_init_deleg\n"); - if (fp->fi_had_conflict) - return NULL; if (num_delegations > max_delegations) return NULL; dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab)); @@ -389,8 +386,7 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct sv INIT_LIST_HEAD(&dp->dl_perfile); INIT_LIST_HEAD(&dp->dl_perclnt); INIT_LIST_HEAD(&dp->dl_recall_lru); - get_nfs4_file(fp); - dp->dl_file = fp; + dp->dl_file = NULL; dp->dl_type = NFS4_OPEN_DELEGATE_READ; fh_copy_shallow(&dp->dl_fh, ¤t_fh->fh_handle); dp->dl_time = 0; @@ -3044,22 +3040,35 @@ static int nfs4_setlease(struct nfs4_delegation *dp) return 0; } -static int nfs4_set_delegation(struct nfs4_delegation *dp) +static int nfs4_set_delegation(struct nfs4_delegation *dp, struct nfs4_file *fp) { - struct nfs4_file *fp = dp->dl_file; + int status; - if (!fp->fi_lease) - return nfs4_setlease(dp); + if (fp->fi_had_conflict) + return -EAGAIN; + get_nfs4_file(fp); + dp->dl_file = fp; + if (!fp->fi_lease) { + status = nfs4_setlease(dp); + if (status) + goto out_free; + return 0; + } spin_lock(&recall_lock); if (fp->fi_had_conflict) { spin_unlock(&recall_lock); - return -EAGAIN; + status = -EAGAIN; + goto out_free; } atomic_inc(&fp->fi_delegees); list_add(&dp->dl_perfile, &fp->fi_delegations); spin_unlock(&recall_lock); list_add(&dp->dl_perclnt, &dp->dl_stid.sc_client->cl_delegations); ret
Re: [uml-devel] Issues with a rather unusual configured NFS server
On Thu, Aug 29, 2013 at 11:57:45AM +0200, richard -rw- weinberger wrote: > On Wed, Aug 28, 2013 at 7:21 PM, Toralf Förster > wrote: > > On 08/27/2013 08:06 PM, J. Bruce Fields wrote: > >> On Tue, Aug 13, 2013 at 05:53:14PM -0400, bfields wrote: > >>> On Mon, Aug 12, 2013 at 04:36:40PM +0200, Jan Kara wrote: > >>>> On Sun 11-08-13 11:48:49, Toralf Förster wrote: > >>>>> so that the server either crashes (if it is a user mode linux image) or > >>>>> at least its reboot functionality got broken > >>>>> - if the NFS server is hammered with scary NFS calls using a fuzzy tool > >>>>> running at a remote NFS client under a non-privileged user id. > >>>>> > >>>>> It can re reproduced, if > >>>>>- the NFS share is an EXT3 or EXT4 directory > >>>>>- and it is created at file located at tempfs and mounted via loop > >>>>> device > >>>>>- and the NFS server is forced to umount the NFS share > >>>>>- and the server forced to restart the NSF service afterwards > >>>>>- and trinity is used > >>>>> > >>>>> I could find a scenario for an automated bisect. 2 times it brought > >>>>> this commit > >>>>> commit 68a3396178e6688ad7367202cdf0af8ed03c8727 > >>>>> Author: J. Bruce Fields > >>>>> Date: Thu Mar 21 11:21:50 2013 -0400 > >>>>> > >>>>> nfsd4: shut down more of delegation earlier > >>> > >>> Thanks for the report. I think I see the problem--after this commit > >>> nfs4_set_delegation() failures result in nfs4_put_delegation being > >>> called, but nfs4_put_delegation doesn't free the nfs4_file that has > >>> already been set by alloc_init_deleg(). > >>> > >>> Let me think about how to fix that > >> > >> Sorry for the slow response--can you check whether this fixes the > >> problem? > >> > > Yes. > > > > With the attached patch the problem can't be reproduced any longer with > > the prepared test case and current git kernels. > > BTW: Is nobody else fuzz testing NFS? I don't know. Toralf's reports are the only ones I recall off the top of my head, but I may have forgotten others. > Or are these bugs just more likely to hit on UML? That's also possible. > This is not the first NFS issue found by Toralf using UML and Trinity. Yep. The testing is definitely appreciated. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/