Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients
What happens if statd is restarted? Sent from my iPhone On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky wrote: > This is a bug fix for https://bugzilla.redhat.com/show_bug.cgi?id=830862. > > The problem is that with NFSv4 mount in container (with separated mount > namesapce) and active lock on it, dying child reaped of this container will > try to umount NFS and doing this will try to create RPC client to send > unmonitor request to statd. > But creation of RCP client requires valid current->nsproxy (for operation with > utsname()) and during umount on child reaper exit it's equal to zero. > > Proposed solution is to introduce refrence-counter per-net NSM client, which > is created on fist monitor call and destroyed after the lst monitor call. > > The following series implements... > > --- > > Stanislav Kinsbursky (3): > lockd: use rpc client's cl_nodename for id encoding > lockd: per-net NSM client creation and destruction helpers introduced > lockd: create and use per-net NSM RPC clients on MON/UNMON requests > > > fs/lockd/mon.c | 91 +++--- > fs/lockd/netns.h |4 ++ > fs/lockd/svc.c |1 + > 3 files changed, 77 insertions(+), 19 deletions(-) > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] lockd: use rpc client's cl_nodename for id encoding
Hi- On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky wrote: > Taking hostname from uts namespace if not safe, because this cuold be > performind during umount operation on child reaper death. And in this case > current->nsproxy is NULL already. > > Signed-off-by: Stanislav Kinsbursky > Cc: > --- > fs/lockd/mon.c | 14 -- > 1 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c > index 7ef14b3..c6186fb 100644 > --- a/fs/lockd/mon.c > +++ b/fs/lockd/mon.c > @@ -426,11 +426,12 @@ static void encode_mon_name(struct xdr_stream *xdr, > const struct nsm_args *argp) > * (via the NLMPROC_SM_NOTIFY call) that the state of host "mon_name" > * has changed. > */ > -static void encode_my_id(struct xdr_stream *xdr, const struct nsm_args *argp) > +static void encode_my_id(struct xdr_stream *xdr, const struct nsm_args *argp, > + char *nodename) > { > __be32 *p; > > - encode_nsm_string(xdr, utsname()->nodename); > + encode_nsm_string(xdr, nodename); > p = xdr_reserve_space(xdr, 4 + 4 + 4); > *p++ = cpu_to_be32(argp->prog); > *p++ = cpu_to_be32(argp->vers); > @@ -441,10 +442,11 @@ static void encode_my_id(struct xdr_stream *xdr, const > struct nsm_args *argp) > * The "mon_id" argument specifies the non-private arguments > * of an NSMPROC_MON or NSMPROC_UNMON call. > */ > -static void encode_mon_id(struct xdr_stream *xdr, const struct nsm_args > *argp) > +static void encode_mon_id(struct xdr_stream *xdr, const struct nsm_args > *argp, > + char *nodename) > { > encode_mon_name(xdr, argp); > - encode_my_id(xdr, argp); > + encode_my_id(xdr, argp, nodename); > } > > /* > @@ -463,14 +465,14 @@ static void encode_priv(struct xdr_stream *xdr, const > struct nsm_args *argp) > static void nsm_xdr_enc_mon(struct rpc_rqst *req, struct xdr_stream *xdr, > const struct nsm_args *argp) > { > - encode_mon_id(xdr, argp); > + encode_mon_id(xdr, argp, req->rq_task->tk_client->cl_nodename); IMO you should get the cl_nodename in nsm_mon_unmon() from clnt->cl_nodename, and pass it in as part of *argp . The choice of which nodename to use is clearly a decision for an "upper layer" not a choice for the XDR functions. Long ago I had patches which fixed this layering violation for a very similar purpose as yours, but they were never applied. > encode_priv(xdr, argp); > } > > static void nsm_xdr_enc_unmon(struct rpc_rqst *req, struct xdr_stream *xdr, > const struct nsm_args *argp) > { > - encode_mon_id(xdr, argp); > + encode_mon_id(xdr, argp, req->rq_task->tk_client->cl_nodename); Ditto. > } > > static int nsm_xdr_dec_stat_res(struct rpc_rqst *rqstp, > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients
On Sep 14, 2012, at 1:38 PM, Myklebust, Trond wrote: > On Fri, 2012-09-14 at 13:01 -0400, Chuck Lever wrote: >> What happens if statd is restarted? > > Nothing unusual. Why? The NSM upcall transport is a potential application for TCP + softconn, now that a persistent rpc_clnt is used. It just depends on what failure mode we'd like to optimize for. > >> Sent from my iPhone >> >> On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky >> wrote: >> >>> This is a bug fix for https://bugzilla.redhat.com/show_bug.cgi?id=830862. >>> >>> The problem is that with NFSv4 mount in container (with separated mount >>> namesapce) and active lock on it, dying child reaped of this container will >>> try to umount NFS and doing this will try to create RPC client to send >>> unmonitor request to statd. >>> But creation of RCP client requires valid current->nsproxy (for operation >>> with >>> utsname()) and during umount on child reaper exit it's equal to zero. >>> >>> Proposed solution is to introduce refrence-counter per-net NSM client, which >>> is created on fist monitor call and destroyed after the lst monitor call. >>> >>> The following series implements... >>> >>> --- >>> >>> Stanislav Kinsbursky (3): >>> lockd: use rpc client's cl_nodename for id encoding >>> lockd: per-net NSM client creation and destruction helpers introduced >>> lockd: create and use per-net NSM RPC clients on MON/UNMON requests >>> >>> >>> fs/lockd/mon.c | 91 >>> +++--- >>> fs/lockd/netns.h |4 ++ >>> fs/lockd/svc.c |1 + >>> 3 files changed, 77 insertions(+), 19 deletions(-) >>> >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > trond.mykleb...@netapp.com > www.netapp.com -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/27] NFS: Use local caching [try #2]
Hi David- On Jan 29, 2008, at 10:25 PM, David Howells wrote: Chuck Lever <[EMAIL PROTECTED]> wrote: This patch really ought to be broken into more manageable atomic changes to make it easier to review, and to provide more fine-grained explanation and rationalization for each specific change via individual patch descriptions. Hmmm I broke the patch up as Trond stipulated - at least, I thought I had. In many ways this request doesn't make sense. You can't do NFS caching without all the appropriate bits, so logically they should be one patch. Breaking it up won't help git-bisect since the option to enable all this is the last (or nearly last) patch. In addition to adding a new feature, you are changing existing code. If any one of the changes you made breaks existing behavior, having them all in small atomic patches makes it practical to bisect and find the problem. In addition it makes it worlds easier to review by people who are not so familiar with your fscache implementation. And smaller patches means the ratio of patch descriptions to code changes can be much higher. It does make sense to introduce the files under fs/fsc in a single patch. But when you are changing code that is already being used, more care needs to be taken. This should no longer be necessary. The latest mount.nfs subcommand from nfs-utils supports text-based mounts when running on kernels 2.6.23 and later. Okay. I'll update my patches to reflect this. Note, however, I've got someone reporting a bug that seems to show otherwise. I'll have to investigate this more next week. The very latest version (post 1.1.1) is required today for text-based NFS mounts. (That is, the bleeding edge version you get by cloning the nfs-utils git repo). And it only works on kernels later than 2.6.22 -- if that particular user is testing fscache on 2.6.22 or older, then only the legacy binary NFS mount system call API is supported. Add comments like this in a separate clean up patch. +/* + * Notification that a PTE pointing to an NFS page is about to be made + * writable, implying that someone is about to modify the page through a + * shared-writable mapping + */ What does that have to do with local disk caching? +struct nfs_fh_auxdata { + struct timespec i_mtime; + struct timespec i_ctime; + loff_t i_size; +}; It might be useful to explain here why you need to supplement the mtime, ctime, and size fields that already exist in an NFS inode. Supplement? I don't understand. Why is it necessary to add additional mtime, ctime and size fields for NFS inodes? Similar metadata is already stored in nfsi. All I'm asking for is some documentation of what these fields do that the existing time stamps and size fields in nfsi don't. Explain why the NFS fsc implementation needs this data structure. + key->port = clp->cl_addr.sin_port; Not sure why you are using the server's port here. In almost every case the server side port number will be 2049, so it really doesn't add any uniquification. The reason lies is "in almost every case". It's possible to configure it such that a server is running two separate NFS servers on different ports. We should explore whether it is typical or even possible that such a configuration exports the same file handles on different ports, and whether that really matters to the client. I strongly recommend you use the existing IPv6 address conversion macros for this instead of open-coding yet another way of mapping an IPv4 address to an IPv6 address. However, since AF_INET6 support is being introduced in the NFS client in 2.6.24, I recommend you take a look at these source files after Trond has pushed his NFS_ALL for 2.6.24. I'll look at them. I always do this: I meant 2.6.25, not 2.6.24. By the time you return, basic IPv6 support for NFSv4 should be in 2.6.25-rc1's NFS client (not server). Not that it is bug-free, but an implementation is now there. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix {clear,copy}_user_page() declarations in page.h
Clean up: eliminate some compiler noise on x86 when building with strict warnings enabled, introduced by commit 345b904c. In file included from include2/asm/thread_info_64.h:12, from include2/asm/thread_info.h:4, from /home/cel/src/linux/nfs-2.6/include/linux/thread_info.h:35, from /home/cel/src/linux/nfs-2.6/include/linux/preempt.h:9, from /home/cel/src/linux/nfs-2.6/include/linux/spinlock.h:49, from /home/cel/src/linux/nfs-2.6/include/linux/mmzone.h:7, from /home/cel/src/linux/nfs-2.6/include/linux/gfp.h:4, from /home/cel/src/linux/nfs-2.6/include/linux/slab.h:14, from /home/cel/src/linux/nfs-2.6/fs/nfsd/nfs4acl.c:40: include2/asm/page.h:55: warning: `inline' is not at beginning of declaration include2/asm/page.h:61: warning: `inline' is not at beginning of declaration Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Cc: Jeremy Fitzhardinge <[EMAIL PROTECTED]> --- include/asm-x86/page.h |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/asm-x86/page.h b/include/asm-x86/page.h index 1cb7c51..a05b289 100644 --- a/include/asm-x86/page.h +++ b/include/asm-x86/page.h @@ -52,13 +52,13 @@ extern int page_is_ram(unsigned long pagenr); struct page; -static void inline clear_user_page(void *page, unsigned long vaddr, +static inline void clear_user_page(void *page, unsigned long vaddr, struct page *pg) { clear_page(page); } -static void inline copy_user_page(void *to, void *from, unsigned long vaddr, +static inline void copy_user_page(void *to, void *from, unsigned long vaddr, struct page *topage) { copy_page(to, from); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.10-rc3 NFSv3 mount issues
On May 30, 2013, at 4:19 PM, Jim Schutt wrote: > Hi, > > I've been trying to test 3.10-rc3 on some diskless clients, and found > that I can no longer mount my root file system via NFSv3. > > I poked around looking at NFS changes for 3.10, and found these two > commits: > > d497ab9751 "NFSv3: match sec= flavor against server list" > 4580a92d44 "NFS: Use server-recommended security flavor by default (NFSv3)" > > If I revert both of these commits from 3.10-rc3, then my diskless > client can mount its root file system. > > The busybox mount command fails like this, when using 3.10-rc3: > > / # mount -t nfs -o ro,nolock,vers=3,proto=tcp > 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt > mount: mounting 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x on /mnt failed: > Invalid argument > > The commit messages for both these commits seem to say that mounting > with the "sys=sec" option should work, but unfortunately, my busybox doesn't > seem to understand the "sec=" mount option: > > / # mount -t nfs -o ro,nolock,vers=3,proto=tcp,sec=sys > 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt > mount: invalid number 'sys' > > My NFS server is based on RHEL6, and is not using any "sec=" option > in its export for this file system. I did try exporting with "sec=sys", > but it didn't seem to make any difference either. > > So far, this seems like a regression to me > Any ideas what I might be doing wrong? How can I > help make this work again? 3.10-rc3 appears to be missing the fix for this. See: http://marc.info/?l=linux-nfs&m=136855668104598&w=2 Trond, can we get this applied? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY
On Apr 25, 2013, at 9:49 AM, bfie...@fieldses.org wrote: > On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote: >> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote: >> >>> My position is that we simply have no idea what order of magnitude even >>> delay should be. And that in such a situation exponential backoff such >>> as implemented in the synchronous case seems the reasonable default as >>> it guarantees at worst doubling the delay while still bounding the >>> long-term average frequency of retries. >> >> So we start with a 15 second delay, and then go to 60 seconds? > > I agree that a server should normally be doing the wait on its own if > the wait would be on the order of an rpc round trip. > > So I'd be inclined to start with a delay that was an order of magnitude > or two more than a round trip. > > And I'd expect NFS isn't common on networks with 1-second latencies. > > So the 1/10 second we're using in the synchronous case sounds closer to > the right ballpark to me. The RPC layer already keeps RPC round trip statistics, so the client doesn't have to guess with a "one size fits all" number. I'm all for keeping client recovery time short. But after following this argument, I think 10xRTT is crazy short. Aggressive retransmits can lead to data corruption, and RTT on a fast server is going to be on the order of a millisecond. And what about RDMA, where RTT is about 20usecs? A better answer might be to start at one second then exponentially back off to the minimum of 0.25x the lease time and 0.25x the RPC retransmit time out. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY
On Apr 25, 2013, at 2:19 PM, "bfie...@fieldses.org" wrote: > On Thu, Apr 25, 2013 at 02:10:36PM +, Myklebust, Trond wrote: >> On Thu, 2013-04-25 at 09:49 -0400, bfie...@fieldses.org wrote: >>> On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote: >>>> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote: >>>> >>>>> My position is that we simply have no idea what order of magnitude even >>>>> delay should be. And that in such a situation exponential backoff such >>>>> as implemented in the synchronous case seems the reasonable default as >>>>> it guarantees at worst doubling the delay while still bounding the >>>>> long-term average frequency of retries. >>>> >>>> So we start with a 15 second delay, and then go to 60 seconds? >>> >>> I agree that a server should normally be doing the wait on its own if >>> the wait would be on the order of an rpc round trip. >>> >>> So I'd be inclined to start with a delay that was an order of magnitude >>> or two more than a round trip. >>> >>> And I'd expect NFS isn't common on networks with 1-second latencies. >>> >>> So the 1/10 second we're using in the synchronous case sounds closer to >>> the right ballpark to me. >> >> OK, then. Now all I need is actual motivation for changing the existing >> code other than handwaving arguments about "polling is better than flat >> waits". >> What actual use cases are impacting us now, other than the AIX design >> decision to force CLOSE to retry at least once before succeeding? > > Nah, I've got nothing, and I agree that the AIX problem is there bug. > > Just for fun I looked at re-checked the Linux server cases. As far as I > can tell they are: > > - delegations: returned immediately on detection of any > conflict. The current behavior in the sync case looks > reasonable to me. > - allocation failures: not really sure it's the best error, but > it seems to be all the protocol offers. We probably don't > care much what the client does in this case. > - some rare cases that would probably indicate bugs (e.g., > attempting to destroy a client while other rpc's from that > client are running.) Again we don't care what the client does > here. > - the 4.1 slot-inuse case. > > We also by default map four errors (ETIMEDOUT, EAGAIN, EWOULDBLOCK, > ENOMEM) to delay. I thought I remembered one of those being used by > some HFS system, but can't actually find an example now. A quick grep > doesn't show anything interesting. It's worth mentioning that servers that have frozen state (say, in preparation for Transparent State Migration) may use NFS4ERR_DELAY to prevent clients from modifying open or lock state until that state has transitioned to a destination server. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY
On Apr 25, 2013, at 2:46 PM, "bfie...@fieldses.org" wrote: > On Thu, Apr 25, 2013 at 02:40:11PM -0400, Chuck Lever wrote: >> >> On Apr 25, 2013, at 2:19 PM, "bfie...@fieldses.org" >> wrote: >> >>> On Thu, Apr 25, 2013 at 02:10:36PM +, Myklebust, Trond wrote: >>>> On Thu, 2013-04-25 at 09:49 -0400, bfie...@fieldses.org wrote: >>>>> On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote: >>>>>> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote: >>>>>> >>>>>>> My position is that we simply have no idea what order of magnitude even >>>>>>> delay should be. And that in such a situation exponential backoff such >>>>>>> as implemented in the synchronous case seems the reasonable default as >>>>>>> it guarantees at worst doubling the delay while still bounding the >>>>>>> long-term average frequency of retries. >>>>>> >>>>>> So we start with a 15 second delay, and then go to 60 seconds? >>>>> >>>>> I agree that a server should normally be doing the wait on its own if >>>>> the wait would be on the order of an rpc round trip. >>>>> >>>>> So I'd be inclined to start with a delay that was an order of magnitude >>>>> or two more than a round trip. >>>>> >>>>> And I'd expect NFS isn't common on networks with 1-second latencies. >>>>> >>>>> So the 1/10 second we're using in the synchronous case sounds closer to >>>>> the right ballpark to me. >>>> >>>> OK, then. Now all I need is actual motivation for changing the existing >>>> code other than handwaving arguments about "polling is better than flat >>>> waits". >>>> What actual use cases are impacting us now, other than the AIX design >>>> decision to force CLOSE to retry at least once before succeeding? >>> >>> Nah, I've got nothing, and I agree that the AIX problem is there bug. >>> >>> Just for fun I looked at re-checked the Linux server cases. As far as I >>> can tell they are: >>> >>> - delegations: returned immediately on detection of any >>> conflict. The current behavior in the sync case looks >>> reasonable to me. >>> - allocation failures: not really sure it's the best error, but >>> it seems to be all the protocol offers. We probably don't >>> care much what the client does in this case. >>> - some rare cases that would probably indicate bugs (e.g., >>> attempting to destroy a client while other rpc's from that >>> client are running.) Again we don't care what the client does >>> here. >>> - the 4.1 slot-inuse case. >>> >>> We also by default map four errors (ETIMEDOUT, EAGAIN, EWOULDBLOCK, >>> ENOMEM) to delay. I thought I remembered one of those being used by >>> some HFS system, but can't actually find an example now. A quick grep >>> doesn't show anything interesting. >> >> It's worth mentioning that servers that have frozen state (say, in >> preparation for Transparent State Migration) may use NFS4ERR_DELAY to >> prevent clients from modifying open or lock state until that state has >> transitioned to a destination server. > > I thought they'd decided they'll be forced to find a different way to do > that? > > (The issue being that it only works if you're using 4.1, and if the > session state itself isn't part of the state to be transferred. > Otherwise you're forced to modify the state anyway since NFS4ERR_DELAY > is seqid-modifying.) The answer is not to return NFS4ERR_DELAY on seqid-modifying operations. The source server can return NFS4ERR_DELAY to the client's migration recovery operations (the GETATTR(fs_locations) request) for example. Or, the server could return it on the initial PUTFH operation in a compound containing seqid-modifying operations. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 28, 2013, at 9:24 PM, Stephen Rothwell wrote: > Hi J., > > After merging the nfsd tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc': > net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of > function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration] > > Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server > RPCGSS authentication"). gss_mech_get_by_OID() made static to > net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC: > Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs > tree that you did not merge). > > I don't know how to fix this, so I have used the nfsd tree from > next-20130426 for today. Bruce, it might make sense for me to submit the three server-side RPC GSS patches, and then you can rebase the gssproxy work on top of those. Let me know how you would like to proceed. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields" wrote: > On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote: >> >> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell wrote: >> >>> Hi J., >>> >>> After merging the nfsd tree, today's linux-next build (powerpc >>> ppc64_defconfig) failed like this: >>> >>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc': >>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of >>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration] >>> >>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server >>> RPCGSS authentication"). gss_mech_get_by_OID() made static to >>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC: >>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs >>> tree that you did not merge). >>> >>> I don't know how to fix this, so I have used the nfsd tree from >>> next-20130426 for today. >> >> Bruce, it might make sense for me to submit the three server-side RPC GSS >> patches, and then you can rebase the gssproxy work on top of those. Let me >> know how you would like to proceed. > > I'm happy to take those patches whenever you consider them ready. Would > that fix the problem? Someone would need to modify the gssproxy patches to use the new interfaces. > Also: it looks like 030d794bf498 "SUNRPC: Introduce > rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his > nfs-for-next. I'm not sure what that means--is it safe to rebase on top > of *that*? That doesn't seem right to me. > I was hoping I could consider the gss-proxy work committed at this point > and pile any fixes on top, but... whatever works for you guys, I guess. > > --b. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 12:29 PM, Simo Sorce wrote: > On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote: >> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields" wrote: >> >>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote: >>>> >>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell >>>> wrote: >>>> >>>>> Hi J., >>>>> >>>>> After merging the nfsd tree, today's linux-next build (powerpc >>>>> ppc64_defconfig) failed like this: >>>>> >>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc': >>>>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of >>>>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration] >>>>> >>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server >>>>> RPCGSS authentication"). gss_mech_get_by_OID() made static to >>>>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC: >>>>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs >>>>> tree that you did not merge). >>>>> >>>>> I don't know how to fix this, so I have used the nfsd tree from >>>>> next-20130426 for today. >>>> >>>> Bruce, it might make sense for me to submit the three server-side RPC GSS >>>> patches, and then you can rebase the gssproxy work on top of those. Let >>>> me know how you would like to proceed. >>> >>> I'm happy to take those patches whenever you consider them ready. Would >>> that fix the problem? >> >> Someone would need to modify the gssproxy patches to use the new interfaces. >> >>> Also: it looks like 030d794bf498 "SUNRPC: Introduce >>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his >>> nfs-for-next. I'm not sure what that means--is it safe to rebase on top >>> of *that*? >> >> That doesn't seem right to me. > > GSS-Proxy patches are 1 year old and we've been delayed once already to > accomodate the containers work, maybe it's time for your patches to be > rebased on gssproxy ones ? :-) Don't sweat it. IMO this is a simple merge problem, unlike the containers work. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 12:21 PM, Trond Myklebust wrote: > On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote: >> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields" wrote: >> >>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote: >>>> >>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell >>>> wrote: >>>> >>>>> Hi J., >>>>> >>>>> After merging the nfsd tree, today's linux-next build (powerpc >>>>> ppc64_defconfig) failed like this: >>>>> >>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc': >>>>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of >>>>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration] >>>>> >>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server >>>>> RPCGSS authentication"). gss_mech_get_by_OID() made static to >>>>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC: >>>>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs >>>>> tree that you did not merge). >>>>> >>>>> I don't know how to fix this, so I have used the nfsd tree from >>>>> next-20130426 for today. >>>> >>>> Bruce, it might make sense for me to submit the three server-side RPC GSS >>>> patches, and then you can rebase the gssproxy work on top of those. Let >>>> me know how you would like to proceed. >>> >>> I'm happy to take those patches whenever you consider them ready. Would >>> that fix the problem? >> >> Someone would need to modify the gssproxy patches to use the new interfaces. >> >>> Also: it looks like 030d794bf498 "SUNRPC: Introduce >>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his >>> nfs-for-next. I'm not sure what that means--is it safe to rebase on top >>> of *that*? >> >> That doesn't seem right to me. > > I've now pulled the rpcsec_gss changes into the nfs-for-next. The main > reason why they were not pulled in earlier was due to uncertainty what > to do about the increase in "AUTH_GSS upcall timed out." syslog > warnings. Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and rpcauth_get_pseudoflavor() APIs, which are replacements for direct calls into the GSS mech switch. These APIs are a little more generic, and more robust in the face of unloaded GSS kernel modules. Instead of gss_mech_get_by_OID(), I suspect you want rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields" wrote: > On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote: >> >> On Apr 29, 2013, at 12:21 PM, Trond Myklebust >> wrote: >> >>> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote: >>>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields" >>>> wrote: >>>> >>>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote: >>>>>> >>>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell >>>>>> wrote: >>>>>> >>>>>>> Hi J., >>>>>>> >>>>>>> After merging the nfsd tree, today's linux-next build (powerpc >>>>>>> ppc64_defconfig) failed like this: >>>>>>> >>>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function >>>>>>> 'gss_proxy_save_rsc': net/sunrpc/auth_gss/svcauth_gss.c:1182:3: >>>>>>> error: implicit declaration of function 'gss_mech_get_by_OID' >>>>>>> [-Werror=implicit-function-declaration] >>>>>>> >>>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for >>>>>>> server RPCGSS authentication"). gss_mech_get_by_OID() made >>>>>>> static to net/sunrpc/auth_gss/gss_mech_switch.c by commit >>>>>>> 9568c5e9a61d ("SUNRPC: Introduce rpcauth_get_pseudoflavor()") in >>>>>>> the nfs tree (part of the nfs tree that you did not merge). >>>>>>> >>>>>>> I don't know how to fix this, so I have used the nfsd tree from >>>>>>> next-20130426 for today. >>>>>> >>>>>> Bruce, it might make sense for me to submit the three server-side >>>>>> RPC GSS patches, and then you can rebase the gssproxy work on top >>>>>> of those. Let me know how you would like to proceed. >>>>> >>>>> I'm happy to take those patches whenever you consider them ready. >>>>> Would that fix the problem? >>>> >>>> Someone would need to modify the gssproxy patches to use the new >>>> interfaces. >>>> >>>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce >>>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his >>>>> nfs-for-next. I'm not sure what that means--is it safe to rebase >>>>> on top of *that*? >>>> >>>> That doesn't seem right to me. >>> >>> I've now pulled the rpcsec_gss changes into the nfs-for-next. The >>> main reason why they were not pulled in earlier was due to >>> uncertainty what to do about the increase in "AUTH_GSS upcall timed >>> out." syslog warnings. >> >> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and >> rpcauth_get_pseudoflavor() APIs, which are replacements for direct >> calls into the GSS mech switch. These APIs are a little more generic, >> and more robust in the face of unloaded GSS kernel modules. >> >> Instead of gss_mech_get_by_OID(), I suspect you want >> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code. > > It's doing > > status = -EOPNOTSUPP; > gm = gss_mech_get_by_OID(&ud->mech_oid); > if (!gm) > goto out; > status = -EINVAL; > status = gss_import_sec_context(ud->out_handle.data, > ud->out_handle.len, > gm, &rsci.mechctx, > &expiry, GFP_KERNEL); > if (status) > goto out; > > So we need a way to get from an OID and some mechanism-specific data to > a context. > > Looks to me like we just want to re-export gss_mech_get_by_OID(). The reason for the new wrappers is to load the kernel modules properly before trying to discover the OID -> mechanism mapping. Where are you calling it from? If it's from outside of the GSS module, how do you guarantee the rpc_gss_auth module is loaded? What if the GSS mechanism for that OID isn't loaded? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 1:59 PM, "J. Bruce Fields" wrote: > On Mon, Apr 29, 2013 at 01:47:16PM -0400, Chuck Lever wrote: >> >> On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields" wrote: >> >>> On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote: >>>> >>>> On Apr 29, 2013, at 12:21 PM, Trond Myklebust >>>> wrote: >>>> >>>>> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote: >>>>>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields" >>>>>> wrote: >>>>>> >>>>>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote: >>>>>>>> >>>>>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi J., >>>>>>>>> >>>>>>>>> After merging the nfsd tree, today's linux-next build (powerpc >>>>>>>>> ppc64_defconfig) failed like this: >>>>>>>>> >>>>>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function >>>>>>>>> 'gss_proxy_save_rsc': net/sunrpc/auth_gss/svcauth_gss.c:1182:3: >>>>>>>>> error: implicit declaration of function 'gss_mech_get_by_OID' >>>>>>>>> [-Werror=implicit-function-declaration] >>>>>>>>> >>>>>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for >>>>>>>>> server RPCGSS authentication"). gss_mech_get_by_OID() made >>>>>>>>> static to net/sunrpc/auth_gss/gss_mech_switch.c by commit >>>>>>>>> 9568c5e9a61d ("SUNRPC: Introduce rpcauth_get_pseudoflavor()") in >>>>>>>>> the nfs tree (part of the nfs tree that you did not merge). >>>>>>>>> >>>>>>>>> I don't know how to fix this, so I have used the nfsd tree from >>>>>>>>> next-20130426 for today. >>>>>>>> >>>>>>>> Bruce, it might make sense for me to submit the three server-side >>>>>>>> RPC GSS patches, and then you can rebase the gssproxy work on top >>>>>>>> of those. Let me know how you would like to proceed. >>>>>>> >>>>>>> I'm happy to take those patches whenever you consider them ready. >>>>>>> Would that fix the problem? >>>>>> >>>>>> Someone would need to modify the gssproxy patches to use the new >>>>>> interfaces. >>>>>> >>>>>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce >>>>>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his >>>>>>> nfs-for-next. I'm not sure what that means--is it safe to rebase >>>>>>> on top of *that*? >>>>>> >>>>>> That doesn't seem right to me. >>>>> >>>>> I've now pulled the rpcsec_gss changes into the nfs-for-next. The >>>>> main reason why they were not pulled in earlier was due to >>>>> uncertainty what to do about the increase in "AUTH_GSS upcall timed >>>>> out." syslog warnings. >>>> >>>> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and >>>> rpcauth_get_pseudoflavor() APIs, which are replacements for direct >>>> calls into the GSS mech switch. These APIs are a little more generic, >>>> and more robust in the face of unloaded GSS kernel modules. >>>> >>>> Instead of gss_mech_get_by_OID(), I suspect you want >>>> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code. >>> >>> It's doing >>> >>> status = -EOPNOTSUPP; >>> gm = gss_mech_get_by_OID(&ud->mech_oid); >>> if (!gm) >>> goto out; >>> status = -EINVAL; >>> status = gss_import_sec_context(ud->out_handle.data, >>> ud->out_handle.len, >>> gm, &rsci.mechctx, >>> &expiry, GFP_KERNEL); >>> if (status) >>> goto out; >>> >>> So we need a way to get from an OID and some mechanism-specific data to >>> a context. >>> >>> Looks to me like we just want to re-export gss_mech_get_by_OID(). >> >> The reason for the new wrappers is to load the kernel modules properly >> before trying to discover the OID -> mechanism mapping. >> >> Where are you calling it from? If it's from outside of the GSS module, how >> do you guarantee the rpc_gss_auth module is loaded? What if the GSS >> mechanism for that OID isn't loaded? > > Sorry, I should have said just "remove static from", not > "re-export"--it's in the same module. So there should be no problem > here, right? OK, gotcha. Architecturally outside of the mech switch I'd like to see OIDs passed around embedded in GSS tuples rather than by themselves. An alternative would be to use gss_mech_get_by_name(), which is already visible, loads GSS mechanism modules automatically, and returns struct gss_api_mech *. For NFS, we should already have a clean mapping of mechanism name to pseudoflavor to GSS tuple. Looks like rsc_parse() already uses this API. Do you have gssproxy patches published in a git tree somewhere I could review? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the nfsd tree
On Apr 29, 2013, at 2:57 PM, J. Bruce Fields wrote: > On Mon, Apr 29, 2013 at 02:30:33PM -0400, Chuck Lever wrote: >> >> On Apr 29, 2013, at 1:59 PM, "J. Bruce Fields" >> wrote: >> >>> On Mon, Apr 29, 2013 at 01:47:16PM -0400, Chuck Lever wrote: >>>> >>>> On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields" >>>> wrote: >>>> >>>>> On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote: >>>>>> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and >>>>>> rpcauth_get_pseudoflavor() APIs, which are replacements for >>>>>> direct calls into the GSS mech switch. These APIs are a little >>>>>> more generic, and more robust in the face of unloaded GSS kernel >>>>>> modules. >>>>>> >>>>>> Instead of gss_mech_get_by_OID(), I suspect you want >>>>>> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy >>>>>> code. >>>>> >>>>> It's doing >>>>> >>>>> status = -EOPNOTSUPP; gm = >>>>> gss_mech_get_by_OID(&ud->mech_oid); if (!gm) goto out; >>>>> status = -EINVAL; status = >>>>> gss_import_sec_context(ud->out_handle.data, >>>>> ud->out_handle.len, gm, &rsci.mechctx, &expiry, >>>>> GFP_KERNEL); if (status) goto out; >>>>> >>>>> So we need a way to get from an OID and some mechanism-specific >>>>> data to a context. >>>>> >>>>> Looks to me like we just want to re-export gss_mech_get_by_OID(). >>>> >>>> The reason for the new wrappers is to load the kernel modules >>>> properly before trying to discover the OID -> mechanism mapping. >>>> >>>> Where are you calling it from? If it's from outside of the GSS >>>> module, how do you guarantee the rpc_gss_auth module is loaded? >>>> What if the GSS mechanism for that OID isn't loaded? >>> >>> Sorry, I should have said just "remove static from", not >>> "re-export"--it's in the same module. So there should be no problem >>> here, right? >> >> OK, gotcha. Architecturally outside of the mech switch I'd like to >> see OIDs passed around embedded in GSS tuples rather than by >> themselves. > > I'm not sure what you mean. When I accept a gss context I don't yet > know what service or qop it's going to be used with, I only know the > mechanism OID. RPC server GSS support didn't need the gss_mech_get_by_OID() interface before gssproxy, so I'm trying to understand why it is needed now. But it sounds like you do need it now, so go ahead and make gss_mech_get_by_OID() global within the AUTH_GSS module. > >> An alternative would be to use gss_mech_get_by_name(), which is >> already visible, loads GSS mechanism modules automatically, and >> returns struct gss_api_mech *. For NFS, we should already have a >> clean mapping of mechanism name to pseudoflavor to GSS tuple. Looks >> like rsc_parse() already uses this API. > > We don't have a name here, only an OID. > >> Do you have gssproxy patches published in a git tree somewhere I could >> review? > > It's in my for-3.10 branch. > > Which is more or less what I plan to submit for 3.10, so I'd prefer not > to rebase it at this point. > > It looks like just removing "static" would resolve the immediate > conflict, is that right? And then I'd happily help deal with cleaning > up the API as followup work. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients
On Sep 17, 2012, at 6:49 AM, Stanislav Kinsbursky wrote: > 14.09.2012 23:10, Chuck Lever пишет: >> >> On Sep 14, 2012, at 1:38 PM, Myklebust, Trond wrote: >> >>> On Fri, 2012-09-14 at 13:01 -0400, Chuck Lever wrote: >>>> What happens if statd is restarted? >>> >>> Nothing unusual. Why? >> >> The NSM upcall transport is a potential application for TCP + softconn, now >> that a persistent rpc_clnt is used. It just depends on what failure mode >> we'd like to optimize for. >> > > I don't understand, where the problem is. > Could you be more specific, please? I'm suggesting an enhancement. The change is to use TCP for the NSM upcall transport, and set RPC_TASK_SOFTCONN on the individual RPCs. The advantage of this is that the kernel could discover when statd is not running and fail the upcall immediately, rather than waiting possibly many seconds for each upcall RPC to time out. The client already has a check in the mount.nfs command to see that statd is running, likely to avoid this lengthly timeout. Since the client already has long-standing logic to avoid it, I think the benefit would be mostly on the server side. But this change can be done at some later point. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9
Hi Gianluca- On Feb 6, 2008, at 1:25 PM, Gianluca Alberici wrote: Hello all, Thanks to Chuck's help i finally decided to proceed to a git bisect and found the bad patch. Is there anybody that has an idea why it breaks userspace nfs servers as we have seen ? Sorry for emailing directly Chuck Lever and Andrew Morton but i really wanted to thank Chuck for his precious help and thought that /akpm/ having signed this commit maybe he's going to figure out whats wrong easily The commit you found is a plausible source of the trouble (based on our current theory about the problem). What isn't quite clear to me is whether this commit causes your user- space server to start failing suddenly, or it causes the client to start sending the special non-standard time stamps in the SETATTR request. My guess is the latter, but I want to confirm this guess against reality :-) Are you running the client and server concurrently on the same system? If so, it would be helpful if you could run this test with a constant kernel version on one side while varying it on the other. If client and server are already on different systems, can you tell us which system and which kernel combinations caused the failure? A matrix of combinations might be: 1. server kernel is before 1c710c89, client kernel is before 1c710c89 2. server kernel is before 1c710c89, client kernel is after 1c710c89 3. server kernel is after 1c710c89, client kernel is before 1c710c89 4. server kernel is after 1c710c89, client kernel is after 1c710c89 Thanks. This is what i finally get from git: 1c710c896eb461895d3c399e15bb5f20b39c9073 is first bad commit commit 1c710c896eb461895d3c399e15bb5f20b39c9073 Author: Ulrich Drepper <[EMAIL PROTECTED]> Date: Tue May 8 00:33:25 2007 -0700 utimensat implementation Implement utimensat(2) which is an extension to futimesat(2) in that it a) supports nano-second resolution for the timestamps b) allows to selectively ignore the atime/mtime value c) allows to selectively use the current time for either atime or mtime d) supports changing the atime/mtime of a symlink itself along the lines of the BSD lutimes(3) functions [...] [EMAIL PROTECTED]: add missing i386 syscall table entry] Signed-off-by: Ulrich Drepper <[EMAIL PROTECTED]> Cc: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Michael Kerrisk <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> :04 04 3bedbc7fd919ba167b8e5f208a630261570853bb 927002a9423dcb51ba4f7bee53e60cdca6c1df43 M arch :04 04 fd688c5b534efd3111cbf1e1095d6ff631738325 3d0fbf20fb3da1cb380c92f5b2b39815897376d3 M fs :04 04 bfb1a907a9a842db4fa3543e12a8381d4e11b1eb 9c1d99324db12e066c0d17870fe48457809ad43b M include Thanks in advance, regards, Gianluca Hi Gianluca- On Jan 30, 2008, at 7:40 AM, Gianluca Alberici wrote: Hello again everybody Here follows the testbench: - I got two mirrors, same machine, same disk etc...chaged hostname, IP, and on the second i have recompiled kernel. - First: 2.6.21.7 on debian sarge - Second: 2.6.22 same system. - Onto both i got nfs-user-server and cfsd last versions - The export file is the same (localhost /opt/nfs (rw, async), stripping off the async option does not changes anything) - Mount options are exactly the same. The problem arises in the very same manner with both nfs and cfsd: NFS:setattr { ... ... RPC:call_decode { return 22; } ... return 22; } Again, there is nothing wrong with the RPC client or call_decode. The *server* is returning NFSERR_INVAL (22) to a SETATTR request; the RPC client is simply passing that along to the NFS client, as it is designed to do. I have tried these kernels: 2.6.16.11 works 2.6.20 works 2.6.21 works 2.6.21.7 works 2.6.22 doesnt work (contiguous to previous version) 2.6.23 doesnt work (same behavior as previous) 2.6.23.9 doesnt work (as above) 2.6.24rc7 doesnt work (as above) I would really like to do more, client or server side, if you ave any suggestions. Can we find out what is the change (doesnt matter if it is a buf or bug fix) that caused this problem ? The goal here is to identify the kernel change between 2.6.21 and 2.6.22 that makes the client generate SETATTR requests the user- space server chokes on. It may be a change in the NFS client, or it could be somewhere else in the file system stack, like the VFS. The usual procedure is to use "git bisect". It does a binary search on the kernel patches between the working kernel version and the kernel version that is known not to work. It works like this: 1. You clone a linux kernel git repository (if you don't have a git repository already) 2. You tell git bisect which kernel version is working, and which isn't. git bisect t
[PATCH 00/11] RFC: Kconfig changes for NFS_FS and NFSD
Scratching my head to find an appropriate mailing list for review of some proposed NFS-related Kconfig changes. These have already seen some light on linux-nfs, but Herbert Xu suggested lkml for review by Kconfig experts. In addition to updating the help text, I've tried to untangle the entry dependencies somewhat. I'm not really a Kconfig expert, so I'd appreciate some review of these patches by those more knowlegeable than me. -- corporate: -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 09/11] NFSD: Move "select NFSD_V2_ACL if NFSD_V3_ACL"
Clean up: since NFSD_V2_ACL is a boolean, it can be selected safely under the NFSD_V3_ACL entry (also a boolean). Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 32c84d9..21362e9 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1657,7 +1657,6 @@ config NFSD select LOCKD select SUNRPC select EXPORTFS - select NFSD_V2_ACL if NFSD_V3_ACL select NFS_ACL_SUPPORT if NFSD_V2_ACL select CRYPTO_MD5 if NFSD_V4 select CRYPTO if NFSD_V4 @@ -1699,6 +1698,7 @@ config NFSD_V3 config NFSD_V3_ACL bool "NFS server support for the NFSv3 ACL protocol extension" depends on NFSD_V3 + select NFSD_V2_ACL help Solaris NFS servers support an auxiliary NFSv3 ACL protocol that never became an official part of the NFS version 3 protocol. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 11/11] NLM: LOCKD fails to load if CONFIG_SYSCTL is not set
By the way, we've got another config-related nit here: http://bugzilla.linux-nfs.org/show_bug.cgi?id=156 You can build lockd without CONFIG_SYSCTL set, but then the module will fail to load. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |1 + fs/lockd/svc.c |4 2 files changed, 5 insertions(+), 0 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index f799964..b7aeed4 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1734,6 +1734,7 @@ config NFSD_V4 config LOCKD tristate + select SYSCTL config LOCKD_V4 bool diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c index 0822646..adb8e2c 100644 --- a/fs/lockd/svc.c +++ b/fs/lockd/svc.c @@ -516,8 +516,12 @@ module_param(nsm_use_hostnames, bool, 0644); static int __init init_nlm(void) { +#if CONFIG_SYSCTL nlm_sysctl_table = register_sysctl_table(nlm_sysctl_root); return nlm_sysctl_table ? 0 : -ENOMEM; +#else + return 0; +#endif } static void __exit exit_nlm(void) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 06/11] NFSD: Remove NFSv4 dependency on NFSv3
On Feb 8, 2008, at 3:20 PM, Sam Ravnborg wrote: On Fri, Feb 08, 2008 at 12:52:08PM -0500, Chuck Lever wrote: Clean up: Because NFSD_V4 "depends on" NFSD_V3, it appears as a child of the NFSD_V3 menu entry, and is not visible if NFSD_V3 is unselected. Replace the dependency on NFSD_V3 with a "select NFSD_V3". This makes NFSD_V4 look and work just like NFS_V3, while ensuring that NFSD_V3 is enabled if NFSD_V4 is. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 9ad62a9..4c16789 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1723,7 +1723,8 @@ config NFSD_V3_ACL config NFSD_V4 bool "NFS server support for NFS version 4 (EXPERIMENTAL)" - depends on NFSD && NFSD_V3 && EXPERIMENTAL + depends on NFSD && EXPERIMENTAL + select NFSD_V3 select RPCSEC_GSS_KRB5 help This option enables support in your system's NFS server for This use of select is questionable. In general it is bad to select a symbol with dependencies. In this case the dependencies of NFSD_V3 are duplicated for NFSD_V4 so we will not se erratic configurations but do you remember to update NFSD_V4 when you add a depends on NFSD_V3? But I see no other clean way to do it rithg now. This patch is entirely optional, and I can just drop it for now. Thanks for the review. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 06/11] NFSD: Remove NFSv4 dependency on NFSv3
Clean up: Because NFSD_V4 "depends on" NFSD_V3, it appears as a child of the NFSD_V3 menu entry, and is not visible if NFSD_V3 is unselected. Replace the dependency on NFSD_V3 with a "select NFSD_V3". This makes NFSD_V4 look and work just like NFS_V3, while ensuring that NFSD_V3 is enabled if NFSD_V4 is. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 9ad62a9..4c16789 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1723,7 +1723,8 @@ config NFSD_V3_ACL config NFSD_V4 bool "NFS server support for NFS version 4 (EXPERIMENTAL)" - depends on NFSD && NFSD_V3 && EXPERIMENTAL + depends on NFSD && EXPERIMENTAL + select NFSD_V3 select RPCSEC_GSS_KRB5 help This option enables support in your system's NFS server for -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 03/11] NFS: Update help text for CONFIG_NFS_FS
Clean up: refresh the help text for Kconfig items related to the NFS client. Remove obsolete URLs, and make the language consistent among the options. Also move the ROOT_NFS config option next to the options related to the NFS client. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig | 115 ++-- 1 files changed, 57 insertions(+), 58 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 9c2c24b..9427c73 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1517,10 +1517,6 @@ config UFS_FS The recently released UFS2 variant (used in FreeBSD 5.x) is READ-ONLY supported. - If you only intend to mount files from some other Unix over the - network using NFS, you don't need the UFS file system support (but - you need NFS file system support obviously). - Note that this option is generally not needed for floppies, since a good portable way to transport files and directories between unixes (and even other operating systems) is given by the tar program ("man @@ -1560,6 +1556,7 @@ menuconfig NETWORK_FILESYSTEMS Say Y here to get to see options for network filesystems and filesystem-related networking code, such as NFS daemon and RPCSEC security modules. + This option alone does not add any kernel code. If you say N, all options in this submenu will be skipped and @@ -1568,76 +1565,92 @@ menuconfig NETWORK_FILESYSTEMS if NETWORK_FILESYSTEMS config NFS_FS - tristate "NFS file system support" + tristate "NFS client support" depends on INET select LOCKD select SUNRPC select NFS_ACL_SUPPORT if NFS_V3_ACL help - If you are connected to some other (usually local) Unix computer - (using SLIP, PLIP, PPP or Ethernet) and want to mount files residing - on that computer (the NFS server) using the Network File Sharing - protocol, say Y. "Mounting files" means that the client can access - the files with usual UNIX commands as if they were sitting on the - client's hard disk. For this to work, the server must run the - programs nfsd and mountd (but does not need to have NFS file system - support enabled in its kernel). NFS is explained in the Network - Administrator's Guide, available from - <http://www.tldp.org/docs.html#guide>, on its man page: "man - nfs", and in the NFS-HOWTO. + Choose Y here if you want to access files residing on other + computers using Sun's Network File System protocol. To compile + this file system support as a module, choose M here: the module + will be called nfs. - A superior but less widely used alternative to NFS is provided by - the Coda file system; see "Coda file system support" below. + To mount file systems exported by NFS servers, you also need to + install the user space mount.nfs command which can be found in + the Linux nfs-utils package, available from http://linux-nfs.org/. + Information about using the mount command is available in the + mount(8) man page. More detail about the Linux NFS client + implementation is available via the nfs(5) man page. - If you say Y here, you should have said Y to TCP/IP networking also. - This option would enlarge your kernel by about 27 KB. + Below you can choose which versions of the NFS protocol are + available in the kernel to mount NFS servers. Support for NFS + version 2 (RFC 1094) is always available when NFS_FS is selected. - To compile this file system support as a module, choose M here: the - module will be called nfs. - - If you are configuring a diskless machine which will mount its root - file system over NFS at boot time, say Y here and to "Kernel - level IP autoconfiguration" above and to "Root file system on NFS" - below. You cannot compile this driver as a module in this case. - There are two packages designed for booting diskless machines over - the net: netboot, available from - <http://ftp1.sourceforge.net/netboot/>, and Etherboot, - available from <http://ftp1.sourceforge.net/etherboot/>. + To configure a system which mounts its root file system via NFS + at boot time, say Y here, select "Kernel level IP + autoconfiguration" in the NETWORK menu, and select "Root file + system on NFS" below. You cannot compile this file system as a + module in this case. - If you don't know what all this is about, say N. + If unsure, say N. config NFS_V3 - bool "Provide NFSv3 client su
[PATCH 07/11] NFSD: Use "depends on" for PROC_FS dependency
Recently a reverse dependency was added to fs/Kconfig to ensure that PROC_FS was enabled if NFSD_V4 was enabled. There is a guideline in Documentation/kbuild/kconfig-language.txt that states "In general use select only for non-visible symbols (no prompts anywhere) and for symbols with no dependencies." A quick grep around other Kconfig files reveals that no entry currently uses "select PROC_FS" -- every one uses "depends on". Thus CONFIG_NFSD_V4 should use "depends on PROC_FS" as well. For SUNRPC_GSS, it's a little more complex. Other entries can "select" SUNRPC_GSS, as it is non-visible. However, the guideline suggests an entry can't "select" it if it has a dependency (such as PROC_FS). So, we add forward dependencies on PROC_FS to RPCSEC_GSS_FOO instead. XXX: Both CONFIG_NFSV4 and CONFIG_NFSD_V4 select RPCSEC_GSS_KRB5, which is visible, which kconfig-language.txt also frowns upon. The intent was to enable at least one GSS mechanism if V4 was enabled. Perhaps we should make SUNRPC_GSS visible, and make the NFSv4 options visible only if SUNRPC_GSS is enabled. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |8 +++- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 4c16789..5f00ee7 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1662,8 +1662,6 @@ config NFSD select CRYPTO_MD5 if NFSD_V4 select CRYPTO if NFSD_V4 select FS_POSIX_ACL if NFSD_V4 - select PROC_FS if NFSD_V4 - select PROC_FS if SUNRPC_GSS help Choose Y here if you want to allow other computers to access files residing on this system using Sun's Network File System @@ -1723,7 +1721,7 @@ config NFSD_V3_ACL config NFSD_V4 bool "NFS server support for NFS version 4 (EXPERIMENTAL)" - depends on NFSD && EXPERIMENTAL + depends on NFSD && PROC_FS && EXPERIMENTAL select NFSD_V3 select RPCSEC_GSS_KRB5 help @@ -1796,7 +1794,7 @@ config SUNRPC_BIND34 config RPCSEC_GSS_KRB5 tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)" - depends on SUNRPC && EXPERIMENTAL + depends on SUNRPC && PROC_FS && EXPERIMENTAL select SUNRPC_GSS select CRYPTO select CRYPTO_MD5 @@ -1815,7 +1813,7 @@ config RPCSEC_GSS_KRB5 config RPCSEC_GSS_SPKM3 tristate "Secure RPC: SPKM3 mechanism (EXPERIMENTAL)" - depends on SUNRPC && EXPERIMENTAL + depends on SUNRPC && PROC_FS && EXPERIMENTAL select SUNRPC_GSS select CRYPTO select CRYPTO_MD5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 08/11] NFSD: Move "select FS_POSIX_ACL if NFSD_V4"
Clean up: since FS_POSIX_ACL is a non-visible boolean entry, it can be selected safely under the NFSD_V4 entry (also a boolean). Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 5f00ee7..32c84d9 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -411,7 +411,7 @@ config JFS_STATISTICS to be made available to the user in the /proc/fs/jfs/ directory. config FS_POSIX_ACL -# Posix ACL utility routines (for now, only ext2/ext3/jfs/reiserfs) +# Posix ACL utility routines (for now, only ext2/ext3/jfs/reiserfs/nfs) # # NOTE: you can implement Posix ACLs without these helpers (XFS does). # Never use this symbol for ifdefs. @@ -1661,7 +1661,6 @@ config NFSD select NFS_ACL_SUPPORT if NFSD_V2_ACL select CRYPTO_MD5 if NFSD_V4 select CRYPTO if NFSD_V4 - select FS_POSIX_ACL if NFSD_V4 help Choose Y here if you want to allow other computers to access files residing on this system using Sun's Network File System @@ -1723,6 +1722,7 @@ config NFSD_V4 bool "NFS server support for NFS version 4 (EXPERIMENTAL)" depends on NFSD && PROC_FS && EXPERIMENTAL select NFSD_V3 + select FS_POSIX_ACL select RPCSEC_GSS_KRB5 help This option enables support in your system's NFS server for -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/11] NFS: Always enable NFS direct I/O
Since O_DIRECT is a standard feature that is enabled in most distros, eliminate the CONFIG_NFS_DIRECTIO build option, and change the fs/nfs/Makefile to always build in the NFS direct I/O engine. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig| 24 fs/nfs/Makefile |3 +-- fs/nfs/file.c |6 -- fs/nfs/internal.h |5 - 4 files changed, 1 insertions(+), 37 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 987b5d7..4965fd8 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1638,30 +1638,6 @@ config NFS_V4 If unsure, say N. -config NFS_DIRECTIO - bool "Allow direct I/O on NFS files" - depends on NFS_FS - help - This option enables applications to perform uncached I/O on files - in NFS file systems using the O_DIRECT open() flag. When O_DIRECT - is set for a file, its data is not cached in the system's page - cache. Data is moved to and from user-level application buffers - directly. Unlike local disk-based file systems, NFS O_DIRECT has - no alignment restrictions. - - Unless your program is designed to use O_DIRECT properly, you are - much better off allowing the NFS client to manage data caching for - you. Misusing O_DIRECT can cause poor server performance or network - storms. This kernel build option defaults OFF to avoid exposing - system administrators unwittingly to a potentially hazardous - feature. - - For more details on NFS O_DIRECT, see fs/nfs/direct.c. - - If unsure, say N. This reduces the size of the NFS client, and - causes open() to return EINVAL if a file residing in NFS is - opened with the O_DIRECT flag. - config NFSD tristate "NFS server support" depends on INET diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile index f54c4d7..344bccb 100644 --- a/fs/nfs/Makefile +++ b/fs/nfs/Makefile @@ -8,7 +8,7 @@ EXTRA_CFLAGS = -W -Wall -Wextra -Wsign-compare \ obj-$(CONFIG_NFS_FS) += nfs.o nfs-y := client.o dir.o file.o getroot.o inode.o super.o nfs2xdr.o \ - pagelist.o proc.o read.o symlink.o unlink.o \ + direct.o pagelist.o proc.o read.o symlink.o unlink.o \ write.o namespace.o mount_clnt.o nfs-$(CONFIG_ROOT_NFS) += nfsroot.o nfs-$(CONFIG_NFS_V3) += nfs3proc.o nfs3xdr.o @@ -17,5 +17,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \ delegation.o idmap.o \ callback.o callback_xdr.o callback_proc.o \ nfs4namespace.o -nfs-$(CONFIG_NFS_DIRECTIO) += direct.o nfs-$(CONFIG_SYSCTL) += sysctl.o diff --git a/fs/nfs/file.c b/fs/nfs/file.c index ef57a5a..10e8b80 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -234,10 +234,8 @@ nfs_file_read(struct kiocb *iocb, const struct iovec *iov, ssize_t result; size_t count = iov_length(iov, nr_segs); -#ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) return nfs_file_direct_read(iocb, iov, nr_segs, pos); -#endif dfprintk(VFS, "nfs: read(%s/%s, [EMAIL PROTECTED])\n", dentry->d_parent->d_name.name, dentry->d_name.name, @@ -383,9 +381,7 @@ const struct address_space_operations nfs_file_aops = { .write_end = nfs_write_end, .invalidatepage = nfs_invalidate_page, .releasepage = nfs_release_page, -#ifdef CONFIG_NFS_DIRECTIO .direct_IO = nfs_direct_IO, -#endif .launder_page = nfs_launder_page, }; @@ -443,10 +439,8 @@ static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, ssize_t result; size_t count = iov_length(iov, nr_segs); -#ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) return nfs_file_direct_write(iocb, iov, nr_segs, pos); -#endif dfprintk(VFS, "nfs: write(%s/%s(%ld), [EMAIL PROTECTED])\n", dentry->d_parent->d_name.name, dentry->d_name.name, diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index fbe5ba4..13f43d6 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -113,13 +113,8 @@ extern void nfs_destroy_readpagecache(void); extern int __init nfs_init_writepagecache(void); extern void nfs_destroy_writepagecache(void); -#ifdef CONFIG_NFS_DIRECTIO extern int __init nfs_init_directcache(void); extern void nfs_destroy_directcache(void); -#else -#define nfs_init_directcache() (0) -#define nfs_destroy_directcache() do {} while(0) -#endif /* nfs2xdr.c */ extern int nfs_stat_to_errno(int); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/11] NFSD: Update help text for CONFIG_NFSD
Clean up: refresh the help text for Kconfig items related to the NFS server. Remove obsolete URLs, and make the language consistent among the options. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig | 76 +--- 1 files changed, 47 insertions(+), 29 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 9427c73..53e21eb 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1665,56 +1665,74 @@ config NFSD select PROC_FS if NFSD_V4 select PROC_FS if SUNRPC_GSS help - If you want your Linux box to act as an NFS *server*, so that other - computers on your local network which support NFS can access certain - directories on your box transparently, you have two options: you can - use the self-contained user space program nfsd, in which case you - should say N here, or you can say Y and use the kernel based NFS - server. The advantage of the kernel based solution is that it is - faster. + Choose Y here if you want to allow other computers to access + files residing on this system using Sun's Network File System + protocol. To compile the NFS server support as a module, + choose M here: the module will be called nfsd. - In either case, you will need support software; the respective - locations are given in the file in the - NFS section. + You may choose to use a user-space NFS server instead, in which + case you can choose N. - If you say Y here, you will get support for version 2 of the NFS - protocol (NFSv2). If you also want NFSv3, say Y to the next question - as well. + To export local file systems using NFS, you also need to install + user space programs which can be found in the Linux nfs-utils + package, available from http://linux-nfs.org/. More detail about + the Linux NFS server implementation is available via the + exports(5) man page. - Please read the NFS-HOWTO, available from - <http://www.tldp.org/docs.html#howto>. + Below you can choose which versions of the NFS protocol are + available to clients mounting the NFS server on this system. + Support for NFS version 2 (RFC 1094) is always available when + CONFIG_NFSD is selected. - To compile the NFS server support as a module, choose M here: the - module will be called nfsd. If unsure, say N. + If unsure, say N. config NFSD_V2_ACL bool depends on NFSD config NFSD_V3 - bool "Provide NFSv3 server support" + bool "NFS server support for NFS version 3" depends on NFSD help - If you would like to include the NFSv3 server as well as the NFSv2 - server, say Y here. If unsure, say Y. + This option enables support in your system's NFS server for + version 3 of the NFS protocol (RFC 1813). + + If unsure, say Y. config NFSD_V3_ACL - bool "Provide server support for the NFSv3 ACL protocol extension" + bool "NFS server support for the NFSv3 ACL protocol extension" depends on NFSD_V3 help - Implement the NFSv3 ACL protocol extension for manipulating POSIX - Access Control Lists on exported file systems. NFS clients should - be compiled with the NFSv3 ACL protocol extension; see the - CONFIG_NFS_V3_ACL option. If unsure, say N. + Solaris NFS servers support an auxiliary NFSv3 ACL protocol that + never became an official part of the NFS version 3 protocol. + This protocol extension allows applications on NFS clients to + manipulate POSIX Access Control Lists on files residing on NFS + servers. NFS servers enforce POSIX ACLs on local files whether + this protocol is available or not. + + This option enables support in your system's NFS server for the + NFSv3 ACL protocol extension allowing NFS clients to manipulate + POSIX ACLs on files exported by your system's NFS server. NFS + clients which support the Solaris NFSv3 ACL protocol can then + access and modify ACLs on your NFS server. + + To store ACLs on your NFS server, you also need to enable ACL- + related CONFIG options for your local file systems of choice. + + If unsure, say N. config NFSD_V4 - bool "Provide NFSv4 server support (EXPERIMENTAL)" + bool "NFS server support for NFS version 4 (EXPERIMENTAL)" depends on NFSD && NFSD_V3 && EXPERIMENTAL select RPCSEC_GSS_KRB5 help - If you would like to include the NFSv4 server as well as the NFSv2 - and NFSv3 servers, say Y here. This feature is experimental, and - shoul
[PATCH 10/11] NFSD: Remove redundant "select" clauses in fs/Kconfig
As far as I can tell, selecting the CRYPTO and CRYPTO_MD5 entries under CONFIG_NFSD is redundant, since CONFIG_NFSD_V4 already selects RPCSEC_GSS_KRB5, which selects these entries. Testing with "make menuconfig" shows that the entries under CRYPTO still properly reflect "Y" or "M" based on the setting of CONFIG_NFSD after this change is applied. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 21362e9..f799964 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1658,8 +1658,6 @@ config NFSD select SUNRPC select EXPORTFS select NFS_ACL_SUPPORT if NFSD_V2_ACL - select CRYPTO_MD5 if NFSD_V4 - select CRYPTO if NFSD_V4 help Choose Y here if you want to allow other computers to access files residing on this system using Sun's Network File System -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 05/11] SUNRPC: Update help Kconfig text
Clean up: refresh the help text for Kconfig items related to the sunrpc module. Remove obsolete URLs, and make the language consistent among the options. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig | 44 +--- 1 files changed, 29 insertions(+), 15 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 53e21eb..9ad62a9 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1766,17 +1766,29 @@ config SUNRPC_XPRT_RDMA depends on SUNRPC && INFINIBAND && EXPERIMENTAL default m help - Adds a client RPC transport for supporting kernel NFS over RDMA - mounts, including Infiniband and iWARP. Experimental. + This option enables the RPC client transport capability that + supports NFS over RDMA in the kernel's NFS client. To compile + RPC client RDMA transport support as a module, choose M here: + the module will be called xprtrdma. + + If unsure, say N. config SUNRPC_BIND34 bool "Support for rpcbind versions 3 & 4 (EXPERIMENTAL)" depends on SUNRPC && EXPERIMENTAL help - Provides kernel support for querying rpcbind servers via versions 3 - and 4 of the rpcbind protocol. The kernel automatically falls back - to version 2 if a remote rpcbind service does not support versions - 3 or 4. + RPC requests over IPv6 networks require support for larger + addresses when performing an RPC bind. Sun added support for + IPv6 addressing by creating two new versions of the rpcbind + protocol (RFC 1833). + + This option enables support in the kernel RPC client for + querying rpcbind servers via versions 3 and 4 of the rpcbind + protocol. The kernel automatically falls back to version 2 + if a remote rpcbind service does not support versions 3 or 4. + By themselves, these new versions do not provide support for + RPC over IPv6, but the new protocol versions are necessary to + support it. If unsure, say N to get traditional behavior (version 2 rpcbind requests only). @@ -1790,12 +1802,13 @@ config RPCSEC_GSS_KRB5 select CRYPTO_DES select CRYPTO_CBC help - Provides for secure RPC calls by means of a gss-api - mechanism based on Kerberos V5. This is required for - NFSv4. + Choose Y here to enable Secure RPC using the Kerberos version 5 + GSS-API mechanism (RFC 1964). - Note: Requires an auxiliary userspace daemon which may be found on - http://www.citi.umich.edu/projects/nfsv4/ + Secure RPC calls with Kerberos require an auxiliary user-space + daemon which may be found in the Linux nfs-utils package + available from http://linux-nfs.org/. In addition, user-space + Kerberos support should be installed. If unsure, say N. @@ -1809,11 +1822,12 @@ config RPCSEC_GSS_SPKM3 select CRYPTO_CAST5 select CRYPTO_CBC help - Provides for secure RPC calls by means of a gss-api - mechanism based on the SPKM3 public-key mechanism. + Choose Y here to enable Secure RPC using the SPKM3 public key + GSS-API mechansim (RFC 2025). - Note: Requires an auxiliary userspace daemon which may be found on - http://www.citi.umich.edu/projects/nfsv4/ + Secure RPC calls with SPKM3 require an auxiliary userspace + daemon which may be found in the Linux nfs-utils package + available from http://linux-nfs.org/. If unsure, say N. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 02/11] NFSD: Remove NFSD_TCP kernel build option
Likewise, distros usually leave CONFIG_NFSD_TCP enabled. TCP support in the Linux NFS server is stable enough that we can leave it on always. CONFIG_NFSD_TCP adds about 10 lines of code, and defaults to "Y" anyway. A run-time switch might be more appropriate if people feel they would like to disable NFSD's TCP support. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/Kconfig | 10 -- fs/nfsd/nfssvc.c |2 -- 2 files changed, 0 insertions(+), 12 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 4965fd8..9c2c24b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1646,7 +1646,6 @@ config NFSD select EXPORTFS select NFSD_V2_ACL if NFSD_V3_ACL select NFS_ACL_SUPPORT if NFSD_V2_ACL - select NFSD_TCP if NFSD_V4 select CRYPTO_MD5 if NFSD_V4 select CRYPTO if NFSD_V4 select FS_POSIX_ACL if NFSD_V4 @@ -1705,15 +1704,6 @@ config NFSD_V4 should only be used if you are interested in helping to test NFSv4. If unsure, say N. -config NFSD_TCP - bool "Provide NFS server over TCP support" - depends on NFSD - default y - help - If you want your NFS server to support TCP connections, say Y here. - TCP connections usually perform better than the default UDP when - the network is lossy or congested. If unsure, say Y. - config ROOT_NFS bool "Root file system on NFS" depends on NFS_FS=y && IP_PNP diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index 9647b0f..941041f 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -244,7 +244,6 @@ static int nfsd_init_socks(int port) if (error < 0) return error; -#ifdef CONFIG_NFSD_TCP error = lockd_up(IPPROTO_TCP); if (error >= 0) { error = svc_create_xprt(nfsd_serv, "tcp", port, @@ -254,7 +253,6 @@ static int nfsd_init_socks(int port) } if (error < 0) return error; -#endif return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix queued SIGIO
On Tue, 19 Sep 2000, Julian Anastasov wrote: > On Mon, 18 Sep 2000, Andi Kleen wrote: > > > > SI_SIGIO is not generated from kernel. The same is for the > > > other SI_ consts < 0 not defined with __SI_CODE. > > > > Ok, then you have already broken binary compatibility between 2.2 and 2.4 > > Looking in the old kernels, it seems the binary compatibility > was broken in 2.3.21 when si_code returns POLL_xxx events just like > mentioned in "The Single UNIX ® Specification, Version 2", > xsh/signal.h.html and not SI_SIGIO. > > SI_SIGIO in si_code for 2.2 does not return any information > about the events. I even see that Redhat maintains patch against 2.2 > to backport the POLL_xxx events from 2.3. Not sure after the changes > in 2.4.0-test1. Anyway, 2.2 looks unusable for me and I don't see other > way this problem to be fixed. The binary compatibility is impossible > to exist. The applications can support the both ways: the old SI_SIGIO > and the new POLL_xxx events (recompiled after test1) in si_code. because the 2.2 kernels are already "broken" in this regard, i can't see how binary compatibility between 2.2 and 2.4 could even be an issue. applications can't use this stuff in 2.2 without at least the RedHat patch. unless there's a problem implementing a glibc that runs on both 2.2 and 2.4, perhaps this should be revisited. also, the test at issue here (from line 363 of kernel/signal.c): /* If this was sent by a rt mechanism, try again. */ if (info->si_code != SI_USER) { ret = -EAGAIN; goto out; } has always been unclear as to its intent. it seems like there is overloading going on here -- if the real intent is to prevent users without credentials from sending "kernel-only" signals, then that should be the logic here. > The next step is somebody to implement event merging and to > allow receiving of many events with one call. For the next kernels. we just published a paper in the ALS proceedings describing our implementation of a new system call similar to sigtimedwait() that collects many events at once. - Chuck Lever -- corporate: <[EMAIL PROTECTED]> personal: <[EMAIL PROTECTED]> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: How to manage shared persistent local caching (FS-Cache) with NFS?
Hi David- On Dec 5, 2007, at 8:22 PM, David Howells wrote: Chuck Lever <[EMAIL PROTECTED]> wrote: I don't see how persistent local caching means we can no longer ignore (a) and (b) above. Can you amplify this a bit? How about I put it like this. There are two principal problems to be dealt with: (1) Reconnection. Imagine that the administrator requests a mount that uses part of a cache. The client machine is at some time later rebooted and the administrator requests the same mount again. Since the cache is meant to be persistent, the administrator is at liberty to expect that the second mount immediately begins to use the data that the first mount left in the cache. For this to occur, the second mount has to be able to determine which part of the cache the first mount was using and request to use the same piece of cache. To aid with this, FS-Cache has the concept of a 'key'. Each object in the cache is addressed by a unique key. NFS currently builds a key to the cache object for a file from: "NFS", the server IP address, port and NFS version and the file handle for that file. Why not use the fsid as well? The NFS client already uses the fsid to detect when it is crossing a server-side mount point. Fsids are supposed to be stable over server reboots (although sometimes they aren't, it could be made a condition of supporting FS-cache on clients). I also note the inclusion of server IP address in the key. For multi- homed servers, you have the same unavoidable cache aliasing issues if the client mounts the same server and export via different server network interfaces. (2) Cache coherency. Imagine that the administrator requests a mount that uses part of a cache. The administrator then makes a second mount that overlaps the first, maybe because it's a different part of the same server export or maybe it uses the same part, but with different parameters. Imagine further that a particular server file is accessible through both mountpoints. This means that the kernel, and therefore the user, has two views of the one file. If the kernel maintains these two views of the files as totally separate copies, then coherency is mostly not a kernel problem, it's an application problem - as it is now. However, if these two views are shared at any level - such as if they share an FS-Cache cache object - then coherency can be a problem. Is it a problem because, if there are multiple copies of the same remote file in its cache, then FS-cache doesn't know, upon reconnection, which item to match against a particular remote file? I think that's actually going to be a fairly typical situation -- you'll have conditions where some cache items will become orphaned, for example, so you're going to have to deal with that ambiguity as a part of normal operation. For example, if the FS-caching client is disconnected or powered off when a remote rename occurs that replaces a file it has cached, the client will have an orphaned item left over. Maybe this use case is only a garbage collection problem. The two simplest solutions to the coherency problem are (a) to enforce sharing at all levels (superblocks, inodes, cache objects), (b) to enforce non-sharing. In-between states are possible, but are much trickier and more complex. Note that cache coherency management can't be entirely avoided: upon reconnection a cache object has to be checked against the server to see whether it's still valid. How do you propose to do that? First, clearly, FS-cache has to know that it's the same object, so fsid and filehandle have to be the same (you refer to that as the "reconnection problem", but it may generally be a "cache aliasing problem"). I assume FS-cache has a record of the state of the remote file when it was last connected -- mtime, ctime, size, change attribute (I'll refer to this as the "reconciliation problem")? Does it, for instance, checksum both the cache item and the remote file to detect data differences? You have the same problem here as we have with file system search tools such as Beagle. Reconciling file contents after a reconnection event may be too expensive to consider for NFS, especially if a file is terabytes in size. Note that both these problems only really exist because the cache is persistent between mounts. If it were volatile between mounts, then (1) would not exist, and (2) can be ignored as it is now. Do you allow administrators to select whether the FS-cache is persistent? Or is it always unconditionally persistent? An adequate first pass at FS-cache can be done without guarantee
Re: How to manage shared persistent local caching (FS-Cache) with NFS?
Hi David- [ Some history snipped... ] On Dec 6, 2007, at 3:00 PM, David Howells wrote: Chuck Lever <[EMAIL PROTECTED]> wrote: Is it a problem because, if there are multiple copies of the same remote file in its cache, then FS-cache doesn't know, upon reconnection, which item to match against a particular remote file? There are multiple copies of the same remote file that are described by the same remote parameters. Same IP address, same port, same NFS version, same FSID, same FH. The difference may be a local connection parameter. Why not encode the local mounted-on directory in the key? A cryptographic hash of the directory's absolute pathname would be bounded in size. And the mounted-on directory is usually persistent across client reboots. That way you can use the directory name hash to distinguish the different views of the same remote object. An adequate first pass at FS-cache can be done without guaranteeing persistence. True. But it's not particularly interesting to me in such a case. There are a host of other issues that need exposure -- steady-state performance; Meaning what? Meaning your cache is at quota all the time, and to continue operation it must eject items constantly. This is a scenario where it pays to cache the read-mostly items on disk, and leave the frequently changing items in memory. The economics of disk caches is different than memory caches. Disk caches are much larger and cheaper, but their performance tanks when they have to track frequently changing files. Memory caches are smaller, but tracking frequently changing data is only a little more expensive than tracking data that doesn't change often. I have been measuring the performance improvement and degradation numbers, and I can say that if you've one client and one server, the server has all the files in memory, and there's gigabit ethernet between them, an on- disk cache really doesn't help. Basically, the consideration of whether to use a cache is a compromise between a host of factors. cache garbage collection Done. and reclamation; Done. cache item aliasing; Partly done. whether all files on a mount point should be cached on disk, or some in memory and some on disk; I've thought about that, but no-one seems particularly interested in discussing it. I think it's key to preventing FS-cache from making performance worse in many common scenarios. And what would it harm if FS-cache decides that certain items in its cache have become ambiguous or otherwise unusable after a reconnection event, thus it reclaims them instead of re-using them? It depends. At some point I'd like to make disconnected operation possible, and that means storing data to be written back in the cache. You can't necessarily just chuck that away. Disconnected operation for NFS is fraught with challenges. Access to data on servers is traditionally gated by the client's IP address, for example. The client may disconnect from the network, then reconnect using a different address where suddenly all of its accesses are rebuffed. NFS servers, not clients, traditionally determine the file's mtime and ctime, and its file handle. So file updates and file creation become problematic. The client has to reconcile the server's file handle, for files created offline, with its own when reconnecting. And, for disconnected operation, the cache is required to contain every item from the remote. You can't just drop items from the cache because they are inconvenient. I can't just say: "Well, it'll oops if you configure your NFS shares like that, so don't. It's not worth me implementing round it.". What causes that instability? Why can't you insulate against the instability but allow cache incoherence and aliased cache items? Insulate how? The only way to do that is to add something to the cache key that says that these two otherwise identical items are actually diffent things. That something might be the pathname of the mounted-on directory or of the file itself. I'm arguing that cache coherence isn't supported by the NFS protocol, so how can FS-cache *require* a facility to support persistent local caching that the protocol doesn't have in the first place? NFS has just enough to just about support a persistent local cache for unmodified files. It has unique file keys per server, and it has a (limited) amount of coherency data per file. That's not really the problem. The problem is that the client can create loads of different views of a remote export and the kernel treats them as if they're views of different remote exports. These views do not necessarily have *anything* to distinguish them at all (nosharecache option). Yes, they do. The combin
Re: [2.6 patch] fs/nfs/direct.c: remove dead code
commit b9148c6b should be reverted. It was recently forward-ported from some years-old patches, and is clearly not needed now. On Dec 11, 2007, at 5:21 PM, Adrian Bunk wrote: This code became dead after commit b9148c6b80d802dbc2a7530b29915a80432e50c7 (which BTW doesn't seem to have changed any behaviour) and can therefore be removed. Spotted by the Coverity checker. Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- --- linux-2.6/fs/nfs/direct.c.old 2007-12-02 21:54:53.0 +0100 +++ linux-2.6/fs/nfs/direct.c 2007-12-02 21:55:10.0 +0100 @@ -897,15 +897,12 @@ ssize_t nfs_file_direct_write(struct kio if (!count) goto out; /* return 0 */ retval = -EINVAL; if ((ssize_t) count < 0) goto out; - retval = 0; - if (!count) - goto out; retval = nfs_sync_mapping(mapping); if (retval) goto out; retval = nfs_direct_write(iocb, iov, nr_segs, pos, count); -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] enhanced ESTALE error handling
Hi Peter- On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote: Hi. Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments. The VFS already handles ESTALE. If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop. Is there some part of this logic that is no longer working? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] enhanced ESTALE error handling
On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote: Chuck Lever wrote: On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote: Chuck Lever wrote: Hi Peter- On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote: Hi. Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments. The VFS already handles ESTALE. If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop. Is there some part of this logic that is no longer working? The VFS does not fully handle ESTALE. An ESTALE error can occur during the second pathname resolution attempt. If an ESTALE occurs during the second resolution attempt, we should give up. When I addressed this issue two years ago, the two-try logic was the only acceptable solution because there's no way to guarantee the pathname resolution will ever finish unless we put a hard limit on it. I can probably imagine a situation where the pathname resolution would never finish, but I am not sure that it could ever happen in nature. Unless someone is doing something malicious. Or if the server is repeatedly returning ESTALE for some reason. There are lots of reasons, some of which are the 1 second resolution from some file systems on the server Which is a server bug, AFAICS. It's simply impossible to close all the windows that result from sloppy file time stamps without completely disabling client-side caching. The NFS protocol relies on file time stamps to manage cache coherence. If the server is lying about time stamps, there's no way the client can cache coherently. Server bug or not, it is something that the client has to live with. We can't get the server file system fixed, so it is something that we should find a way to live with. This support can help. We haven't identified a server-side solution yet, but that doesn't mean it doesn't exist. If we address the time stamp problem in the client, should we also go to lengths to address it in every other corner of the NFS client? Should we also address every other server bug we discover with a client side fix? Also, there was no support for ESTALE errors which occur during subsequent operations to the pathname resolution process. For example, during a mkdir(2) operation, the ESTALE can occur from the over the wire MKDIR operation after the LOOKUP operations have all succeeded. If the final operation fails after a pathname resolution, then it's a real error. Is there a fixed and valid recovery script for the client in this case that will allow the mkdir to proceed? Why do you think that it is an error? Because this is a problem that sometimes requires application-level recovery. Can we guarantee that retrying the mkdir is the right thing to do every time? It can easily occur if the directory in which the new directory is to be created disppears after it is looked up and before the MKDIR is issued. The recovery is to perform the lookup again. Have you tried this client against a file server when you unexport the filesystem under test? The server returns ESTALE no matter what the client does. Should the client continue to retry the request if the file system has been permanently taken offline? Admittedly, the NFS client could recover more cleanly from some of these problems, but given the architecture of the Linux VFS, it will be difficult to address some of the corner cases. Could you outline some of these corner cases that this proposal would not address, please? I think we have one right here: should the client retry a mkdir if gets an ESTALE? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] enhanced ESTALE error handling
On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote: Chuck Lever wrote: Hi Peter- On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote: Hi. Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments. The VFS already handles ESTALE. If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop. Is there some part of this logic that is no longer working? The VFS does not fully handle ESTALE. An ESTALE error can occur during the second pathname resolution attempt. If an ESTALE occurs during the second resolution attempt, we should give up. When I addressed this issue two years ago, the two-try logic was the only acceptable solution because there's no way to guarantee the pathname resolution will ever finish unless we put a hard limit on it. There are lots of reasons, some of which are the 1 second resolution from some file systems on the server Which is a server bug, AFAICS. It's simply impossible to close all the windows that result from sloppy file time stamps without completely disabling client-side caching. The NFS protocol relies on file time stamps to manage cache coherence. If the server is lying about time stamps, there's no way the client can cache coherently. and the window in between the revalidation and the actual use of the file handle associated with each dentry/inode pair. A use case or two would be useful to explore (on linux-nfs or linux- fsdevel, rather than lkml). Also, there was no support for ESTALE errors which occur during subsequent operations to the pathname resolution process. For example, during a mkdir(2) operation, the ESTALE can occur from the over the wire MKDIR operation after the LOOKUP operations have all succeeded. If the final operation fails after a pathname resolution, then it's a real error. Is there a fixed and valid recovery script for the client in this case that will allow the mkdir to proceed? Admittedly, the NFS client could recover more cleanly from some of these problems, but given the architecture of the Linux VFS, it will be difficult to address some of the corner cases. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] enhanced ESTALE error handling
Hi Peter- On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote: Chuck Lever wrote: On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote: Chuck Lever wrote: Hi Peter- On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote: and the window in between the revalidation and the actual use of the file handle associated with each dentry/inode pair. A use case or two would be useful to explore (on linux-nfs or linux-fsdevel, rather than lkml). I created a bunch of use cases in the gensyscall.c program that I attached to the original description of the problem and my proposed solution. It was very useful in generating many, many ESTALE errors over the wire from a variety of different over the wire operations, which were originally getting returned to the user level. The gensyscall.c program is what I would call a set of unit test, btw. This is not the same as a use case, which would include information about the application environment, its users, a detailed description of current system behavior, and some discussion of alternatives for improving it (including doing nothing). A test case is written in a programming language, a use case is written in a natural language. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 21/26] mount options: partially fix nfs
Hi Miklos- Miklos Szeredi wrote: From: Miklos Szeredi <[EMAIL PROTECTED]> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing. NFS lists only some options in /proc/mounts on purpose: only the essential options are mentioned there to keep clutter down. The three you've added here are for all intents and purposes deprecated, which is why they are not supported. NFS lists a more complete set of mount options for a mount point in /proc/self/mountstats. See nfs_show_stats(). Since your cover letter does not explain why you are changing this code, can you refer me to a description of why you are doing this? More below. Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]> --- Index: linux/fs/nfs/super.c === --- linux.orig/fs/nfs/super.c 2008-01-19 11:56:34.0 +0100 +++ linux/fs/nfs/super.c2008-01-21 20:41:30.0 +0100 @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc } nfs_info[] = { { NFS_MOUNT_SOFT, ",soft", ",hard" }, { NFS_MOUNT_INTR, ",intr", ",nointr" }, + { NFS_MOUNT_POSIX, ",posix", "" }, { NFS_MOUNT_NOCTO, ",nocto", "" }, { NFS_MOUNT_NOAC, ",noac", "" }, { NFS_MOUNT_NONLM, ",nolock", "" }, @@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc }; const struct proc_nfs_info *nfs_infop; struct nfs_client *clp = nfss->nfs_client; + unsigned int default_namelen = + clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN : + clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN; seq_printf(m, ",vers=%d", clp->rpc_ops->version); seq_printf(m, ",rsize=%d", nfss->rsize); seq_printf(m, ",wsize=%d", nfss->wsize); + if (nfss->bsize != 0) + seq_printf(m, ",bsize=%d", nfss->bsize); + if (nfss->namelen != default_namelen) + seq_printf(m, ",namelen=%d", nfss->namelen); if (nfss->acregmin != 3*HZ || showdefaults) seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ); if (nfss->acregmax != 60*HZ || showdefaults) @@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ); seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries); seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor)); + + /* +* Missing options: +* port= Probably should be supported. +* addr= This one is already supported; see nfs_show_options(). +* clientaddr= This one isn't, and should be... would be useful for tracking down certain NFSv4 problems. +* mounthost= +* mountaddr= > + * mountport= > + * mountvers= > + * mountproto= And these mount* options are for the kernel's new mount protocol client. They aren't really useful for understanding steady-state NFS client behavior, they only effect mount-time behavior. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [PATCH 24/27] NFS: Use local caching [try #2]
Some comments below. This patch really ought to be broken into more manageable atomic changes to make it easier to review, and to provide more fine-grained explanation and rationalization for each specific change via individual patch descriptions. David Howells wrote: The attached patch makes it possible for the NFS filesystem to make use of the network filesystem local caching service (FS-Cache). To be able to use this, an updated mount program is required. This can be obtained from: http://people.redhat.com/steved/fscache/util-linux/ This should no longer be necessary. The latest mount.nfs subcommand from nfs-utils supports text-based mounts when running on kernels 2.6.23 and later. To mount an NFS filesystem to use caching, add an "fsc" option to the mount: mount warthog:/ /a -o fsc I hope you intend to provide updates to nfs(5) that describe the new mount options you introduce in this and later patches. You don't mention it, but I assume that "nofsc" is the default behavior. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- fs/nfs/Makefile |1 fs/nfs/client.c |5 + fs/nfs/file.c | 37 fs/nfs/fscache-def.c | 289 + fs/nfs/fscache.c | 391 + fs/nfs/fscache.h | 148 + fs/nfs/inode.c| 47 + fs/nfs/read.c | 28 +++ fs/nfs/super.c|3 fs/nfs/sysctl.c |1 include/linux/nfs_fs.h|9 + include/linux/nfs_fs_sb.h | 18 ++ 12 files changed, 968 insertions(+), 9 deletions(-) create mode 100644 fs/nfs/fscache-def.c create mode 100644 fs/nfs/fscache.c create mode 100644 fs/nfs/fscache.h diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile index df0f41e..073d04c 100644 --- a/fs/nfs/Makefile +++ b/fs/nfs/Makefile @@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \ nfs4namespace.o nfs-$(CONFIG_NFS_DIRECTIO) += direct.o nfs-$(CONFIG_SYSCTL) += sysctl.o +nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o diff --git a/fs/nfs/client.c b/fs/nfs/client.c index a6f6254..bcdc5d0 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -43,6 +43,7 @@ #include "delegation.h" #include "iostat.h" #include "internal.h" +#include "fscache.h" #define NFSDBG_FACILITY NFSDBG_CLIENT @@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname, clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED; #endif + nfs_fscache_get_client_cookie(clp); + return clp; error_3: @@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp) nfs4_shutdown_client(clp); + nfs_fscache_release_client_cookie(clp); + /* -EIO all pending I/O */ if (!IS_ERR(clp->cl_rpcclient)) rpc_shutdown_client(clp->cl_rpcclient); diff --git a/fs/nfs/file.c b/fs/nfs/file.c index b3bb89f..d492cd7 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -35,6 +35,7 @@ #include "delegation.h" #include "internal.h" #include "iostat.h" +#include "fscache.h" #define NFSDBG_FACILITY NFSDBG_FILE @@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping, return status < 0 ? status : copied; } +/* + * Partially or wholly invalidate a page + * - Release the private state associated with a page if undergoing complete + * page invalidation + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + */ Add comments like this in a separate clean up patch. static void nfs_invalidate_page(struct page *page, unsigned long offset) { if (offset != 0) return; /* Cancel any unstarted writes on this page */ nfs_wb_page_cancel(page->mapping->host, page); + + nfs_fscache_invalidate_page(page, page->mapping->host); } +/* + * Release the private state associated with a page + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + * - Return true (may release) or false (may not) + */ static int nfs_release_page(struct page *page, gfp_t gfp) { /* If PagePrivate() is set, then the page is not freeable */ - return 0; + if (PagePrivate(page)) + return 0; + return nfs_fscache_release_page(page, gfp); } +/* + * Attempt to clear the private state associated with a page when an error + * occurs that requires the cached contents of an inode to be written back or + * destroyed + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + * - Return 0 if successful, -error otherwise + */ static int nfs_launder_page(struct page *page) { + wait_on_page_fscache_write(page); return nfs_wb_page(page->mapping->host, page); } @@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = {
Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9
be fixed on the server. That is sometimes difficult if the server is not maintained, for instance. - I have found other strange behaviors of the new NFS filesystem support that i am investigating. All are bound to the user of old userspace servers onto the new NFS (since 2.6.22). What to do ? Report the problems on the [EMAIL PROTECTED] mailing list, or document them in the official bug databases (either the Linux kernel bugzilla or the NFSv4 bug tracker at http://bugzilla.linux-nfs.org/ I'm not sure what the NFS client's policy is regarding support for userspace servers. But I'd certainly hope that it is "don't break them". The general policy is that if a server behaves in ways that don't conform to the NFS spec, then the Linux NFS client probably won't work with it. If the client works with a broken server today, there is no guarantee that it will continue to work. Which would make this an NFS client regression. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: How to manage shared persistent local caching (FS-Cache) with NFS?
w and is used by the database engine. Another mount point is ro and is used for back-up utilities, like RMAN. Another example is local software distribution. One mount point is ro, and is accessed by normal users. Another mount point accesses the same export rw, and is used by administrators who provide updates for the software. As useful as the feature is, one can also argue that mounting the same export multiple times is infrequent in most normal use cases. Practically speaking, why do we really need to worry about it? The real problem here is that the NFS protocol itself does not support strong cache coherence. I don't see why the Linux kernel must fix that problem. The only real problem with the first scenario is that you may have more than one copy of a file in the persistent cache. How often will that be the case? Since the local persistence cache is probably disk- based and thus large relative to memory, what's the problem with using a little extra space? The problems you ascribe to your second and third caching scenarios (deadlocking and reconnection) are, however, real and substantial. You don't have these issues when caching each mount point separately, right? It seems to me that implementing the first scenario is (a) straightforward, (b) has fewer runtime risks (ie deadlocks), (c) doesn't take away features that some people still use, and (d) solves more than 80% of the issues here (80/20 rule of thumb). Lastly, there's already a mount option that allows admins to control whether the page and attribute caches are shared -- "sharecache". Is this mount option not adequate for persistent caching? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] VFS: extend /proc/mounts
On Jan 17, 2008, at 3:55 AM, Miklos Szeredi wrote: Hey, I just found /proc/X/mountstats. How does this fit in to the big picture? It seems to show some counters for NFS mounts, no other filesystem uses it. Format looks rather less nice, than /proc/X/mounts (why do we need long english sentences under /proc?). I introduced /proc/self/mountstats because we need a way for non- block-device-based file systems to report I/O statistics. Everything else I tried was rejected, and apparently what we ended up with was reviewed by only a handful of people, so no one else likes it or uses it. It can go away for all I care, as long as we retain some flexible mechanism for non-block-based file systems to report I/O stats. As far as I am aware, there are only two user utilities that understand and parse this data, and I maintain both. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 21/26] mount options: partially fix nfs
On Jan 25, 2008, at 4:39 AM, Miklos Szeredi wrote: Miklos Szeredi wrote: From: Miklos Szeredi <[EMAIL PROTECTED]> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing. NFS lists only some options in /proc/mounts on purpose: only the essential options are mentioned there to keep clutter down. The three you've added here are for all intents and purposes deprecated, which is why they are not supported. NFS lists a more complete set of mount options for a mount point in /proc/self/mountstats. See nfs_show_stats(). Since your cover letter does not explain why you are changing this code, can you refer me to a description of why you are doing this? Descritption is in the 01/26 patch. More below. Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]> --- Index: linux/fs/nfs/super.c === --- linux.orig/fs/nfs/super.c 2008-01-19 11:56:34.0 +0100 +++ linux/fs/nfs/super.c2008-01-21 20:41:30.0 +0100 @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc } nfs_info[] = { { NFS_MOUNT_SOFT, ",soft", ",hard" }, { NFS_MOUNT_INTR, ",intr", ",nointr" }, + { NFS_MOUNT_POSIX, ",posix", "" }, { NFS_MOUNT_NOCTO, ",nocto", "" }, { NFS_MOUNT_NOAC, ",noac", "" }, { NFS_MOUNT_NONLM, ",nolock", "" }, @@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc }; const struct proc_nfs_info *nfs_infop; struct nfs_client *clp = nfss->nfs_client; + unsigned int default_namelen = + clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN : + clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN; seq_printf(m, ",vers=%d", clp->rpc_ops->version); seq_printf(m, ",rsize=%d", nfss->rsize); seq_printf(m, ",wsize=%d", nfss->wsize); + if (nfss->bsize != 0) + seq_printf(m, ",bsize=%d", nfss->bsize); + if (nfss->namelen != default_namelen) + seq_printf(m, ",namelen=%d", nfss->namelen); if (nfss->acregmin != 3*HZ || showdefaults) seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ); if (nfss->acregmax != 60*HZ || showdefaults) @@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout- >to_initval / HZ); seq_printf(m, ",retrans=%u", nfss->client->cl_timeout- >to_retries); seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client- >cl_auth->au_flavor)); + + /* +* Missing options: +* port= Probably should be supported. +* addr= This one is already supported; see nfs_show_options(). Right, thanks. +* clientaddr= This one isn't, and should be... would be useful for tracking down certain NFSv4 problems. +* mounthost= +* mountaddr= +* mountport= +* mountvers= +* mountproto= And these mount* options are for the kernel's new mount protocol client. They aren't really useful for understanding steady-state NFS client behavior, they only effect mount-time behavior. All mount options should be shown, which are needed to reconstruct a previous mount. Ah, OK. I'm happy to implement logic to display the all missing options. I should have updated nfs_show_mount_options() when I wrote the NFS mount option parser. Let me know your preference. For example, if you copy options out from /proc/mount, umount the filesystem, and then create a new mount with the copied options, you should get the same mount. For NFS, umount also needs to read some of the options in order to determine how mountd is to connect to the server for the unmount. (That's why we have addr= in the first place). -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 21/26] mount options: partially fix nfs
On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote: All mount options should be shown, which are needed to reconstruct a previous mount. Ah, OK. I'm happy to implement logic to display the all missing options. I should have updated nfs_show_mount_options() when I wrote the NFS mount option parser. Let me know your preference. You are more familiar with NFS, so I think it would be better if you updated nfs_show_mount_options(). Could you also queue my patch (updated) or incorporate it into a combined fix? Yes. I'll have time in a day or two to get this finished. Thanks, Miklos Subject: mount options: partially fix nfs From: Miklos Szeredi <[EMAIL PROTECTED]> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing. Changes: - display namelen= unconditionally - addr= isn't missing after all Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]> CC: Trond Myklebust <[EMAIL PROTECTED]> --- Index: linux/fs/nfs/super.c === --- linux.orig/fs/nfs/super.c 2008-01-25 15:44:56.0 +0100 +++ linux/fs/nfs/super.c2008-01-25 15:57:32.0 +0100 @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc } nfs_info[] = { { NFS_MOUNT_SOFT, ",soft", ",hard" }, { NFS_MOUNT_INTR, ",intr", ",nointr" }, + { NFS_MOUNT_POSIX, ",posix", "" }, { NFS_MOUNT_NOCTO, ",nocto", "" }, { NFS_MOUNT_NOAC, ",noac", "" }, { NFS_MOUNT_NONLM, ",nolock", "" }, @@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc seq_printf(m, ",vers=%d", clp->rpc_ops->version); seq_printf(m, ",rsize=%d", nfss->rsize); seq_printf(m, ",wsize=%d", nfss->wsize); + seq_printf(m, ",namelen=%d", nfss->namelen); + if (nfss->bsize != 0) + seq_printf(m, ",bsize=%d", nfss->bsize); if (nfss->acregmin != 3*HZ || showdefaults) seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ); if (nfss->acregmax != 60*HZ || showdefaults) @@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout- >to_initval / HZ); seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries); seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client- >cl_auth->au_flavor)); + + /* +* Missing options: +* port= +* mountport= +* mountvers= +* mountproto= +* clientaddr= +* mounthost= +* mountaddr= +*/ } /* -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] VFS: New /proc file /proc/self/mountstats
Create a new file under /proc/self, called mountstats, where mounted file systems can export information (configuration options, performance counters, and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options. This mechanism does not violate namespace security, and is safe to use while other processes are unmounting file systems. Version: Mon, 14 Mar 2005 17:06:04 -0500 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/namespace.c | 66 + fs/proc/base.c | 40 +++ include/linux/fs.h |1 3 files changed, 107 insertions(+) diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/namespace.c 01-mountstats/fs/namespace.c --- 00-stock/fs/namespace.c 2005-03-02 02:38:13.0 -0500 +++ 01-mountstats/fs/namespace.c2005-03-14 15:24:51.565085000 -0500 @@ -265,6 +265,72 @@ struct seq_operations mounts_op = { .show = show_vfsmnt }; +/* iterator */ +static void *ms_start(struct seq_file *m, loff_t *pos) +{ + struct namespace *n = m->private; + struct list_head *p; + loff_t l = *pos; + + down_read(&n->sem); + list_for_each(p, &n->list) + if (!l--) + return list_entry(p, struct vfsmount, mnt_list); + return NULL; +} + +static void *ms_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct namespace *n = m->private; + struct list_head *p = ((struct vfsmount *)v)->mnt_list.next; + (*pos)++; + return p==&n->list ? NULL : list_entry(p, struct vfsmount, mnt_list); +} + +static void ms_stop(struct seq_file *m, void *v) +{ + struct namespace *n = m->private; + up_read(&n->sem); +} + +static int show_vfsstat(struct seq_file *m, void *v) +{ + struct vfsmount *mnt = v; + int err = 0; + + /* device */ + if (mnt->mnt_devname) { + seq_puts(m, "device "); + mangle(m, mnt->mnt_devname); + } else + seq_puts(m, "no device"); + + /* mount point */ + seq_puts(m, " mounted on "); + seq_path(m, mnt, mnt->mnt_root, " \t\n\\"); + seq_putc(m, ' '); + + /* file system type */ + seq_puts(m, "with fstype "); + mangle(m, mnt->mnt_sb->s_type->name); + + /* optional statistics */ + if (mnt->mnt_sb->s_op->show_stats) { + seq_putc(m, ' '); + err = mnt->mnt_sb->s_op->show_stats(m, mnt); + } + + seq_putc(m, '\n'); + return err; +} + +struct seq_operations mountstats_op = { + .start = ms_start, + .next = ms_next, + .stop = ms_stop, + .show = show_vfsstat, +}; + /** * may_umount_tree - check if a mount tree is busy * @mnt: root of mount tree diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/proc/base.c 01-mountstats/fs/proc/base.c --- 00-stock/fs/proc/base.c 2005-03-02 02:38:12.0 -0500 +++ 01-mountstats/fs/proc/base.c2005-03-14 15:24:51.571085000 -0500 @@ -60,6 +60,7 @@ enum pid_directory_inos { PROC_TGID_STATM, PROC_TGID_MAPS, PROC_TGID_MOUNTS, + PROC_TGID_MOUNTSTATS, PROC_TGID_WCHAN, #ifdef CONFIG_SCHEDSTATS PROC_TGID_SCHEDSTAT, @@ -91,6 +92,7 @@ enum pid_directory_inos { PROC_TID_STATM, PROC_TID_MAPS, PROC_TID_MOUNTS, + PROC_TID_MOUNTSTATS, PROC_TID_WCHAN, #ifdef CONFIG_SCHEDSTATS PROC_TID_SCHEDSTAT, @@ -134,6 +136,7 @@ static struct pid_entry tgid_base_stuff[ E(PROC_TGID_ROOT, "root",S_IFLNK|S_IRWXUGO), E(PROC_TGID_EXE, "exe", S_IFLNK|S_IRWXUGO), E(PROC_TGID_MOUNTS,"mounts", S_IFREG|S_IRUGO), + E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO), #ifdef CONFIG_SECURITY E(PROC_TGID_ATTR, "attr",S_IFDIR|S_IRUGO|S_IXUGO), #endif @@ -164,6 +167,7 @@ static struct pid_entry tid_base_stuff[] E(PROC_TID_ROOT, "root",S_IFLNK|S_IRWXUGO), E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO), E(PROC_TID_MOUNTS, "mounts", S_IFREG|S_IRUGO), + E(PROC_TID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO), #ifdef CONFIG_SECURITY E(PROC_TID_ATTR, "attr",S_IFDIR|S_IRUGO|S_IXUGO), #endif @@ -528,6 +532,38 @@ static struct file_operations proc_mount .release= mounts_release, }; +extern struct seq_operations mountstats_op; +static int mountstats_open(struct inode *inode, struct file *file) +{ + struct task_struct *task = proc_task(inode); + int ret = seq_open(file, &mountstats_op); + + if (!ret) { + struct seq_file *m = file->p
[PATCH 0/2] RFC: exporting per-superblock statistics to user space
We still have a need to provide "iostat" like statistics for NFS clients. Following are a couple of patches, against 2.6.11.3, which prototype an approach for providing this kind of data to user programs. I'd like some comment on the approach. 01-mountstats.patch adds a new file called /proc/self/mountstats and a new file system method called show_stats. this just replicates /proc/mounts and the show_options hook. 02-nfs-iostat.patch teachs the NFS client to use the new show_stats hook as a demonstration. Note that this approach addresses previously voiced concerns about exporting per-superblock stats to user space. 1. Processes can't see stats for file systems mounted outside their namespace. 2. Reading the stats file is serialized with mount and unmount operations. 3. The approach doesn't use /sys or kobjects. 4. There are no lifetime issues tied to file systems loaded as a module. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] NFS: add I/O performance counters
Add an extensible per-superblock performance counter facility to the NFS client. This facility mimics the counters available for block devices and for networking. Expose these new counters via /proc/self/mountstats. Version: Mon, 14 Mar 2005 17:06:12 -0500 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- fs/nfs/dir.c |8 ++ fs/nfs/direct.c|5 + fs/nfs/file.c | 20 +++-- fs/nfs/inode.c | 126 +++-- fs/nfs/pagelist.c | 12 ++- fs/nfs/read.c |7 ++ fs/nfs/write.c | 10 ++ include/linux/nfs_fs_sb.h |5 + include/linux/nfs_iostat.h | 80 +++ 9 files changed, 256 insertions(+), 17 deletions(-) diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/dir.c 02-nfs-iostat/fs/nfs/dir.c --- 01-mountstats/fs/nfs/dir.c 2005-03-02 02:38:09.0 -0500 +++ 02-nfs-iostat/fs/nfs/dir.c 2005-03-14 15:28:34.011484000 -0500 @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -428,6 +429,8 @@ static int nfs_readdir(struct file *filp lock_kernel(); + nfs_inc_stats(inode, NFS_VFS_GETDENTS); + res = nfs_revalidate_inode(NFS_SERVER(inode), inode); if (res < 0) { unlock_kernel(); @@ -584,6 +587,7 @@ static int nfs_lookup_revalidate(struct parent = dget_parent(dentry); lock_kernel(); dir = parent->d_inode; + nfs_inc_stats(dir, NFS_DENTRY_REVALIDATE); inode = dentry->d_inode; if (nd && !(nd->flags & LOOKUP_CONTINUE) && (nd->flags & LOOKUP_OPEN)) @@ -712,6 +716,7 @@ static struct dentry *nfs_lookup(struct dfprintk(VFS, "NFS: lookup(%s/%s)\n", dentry->d_parent->d_name.name, dentry->d_name.name); + nfs_inc_stats(dir, NFS_VFS_LOOKUP); res = ERR_PTR(-ENAMETOOLONG); if (dentry->d_name.len > NFS_SERVER(dir)->namelen) @@ -1116,6 +1121,7 @@ static int nfs_sillyrename(struct inode dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n", dentry->d_parent->d_name.name, dentry->d_name.name, atomic_read(&dentry->d_count)); + nfs_inc_stats(dir, NFS_SILLY_RENAME); #ifdef NFS_PARANOIA if (!dentry->d_inode) @@ -1500,6 +1506,8 @@ int nfs_permission(struct inode *inode, struct rpc_cred *cred; int res; + nfs_inc_stats(inode, NFS_VFS_ACCESS); + if (mask == 0) return 0; diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/direct.c 02-nfs-iostat/fs/nfs/direct.c --- 01-mountstats/fs/nfs/direct.c 2005-03-02 02:38:25.0 -0500 +++ 02-nfs-iostat/fs/nfs/direct.c 2005-03-14 15:26:16.401349000 -0500 @@ -47,6 +47,7 @@ #include #include +#include #include #include @@ -354,6 +355,8 @@ static ssize_t nfs_direct_read_seg(struc result = nfs_direct_read_wait(dreq, clnt->cl_intr); rpc_clnt_sigunmask(clnt, &oldset); + nfs_add_stats(inode, NFS_WIRE_READ_BYTES, result); + nfs_add_stats(inode, NFS_DIRECT_READ_BYTES, result); return result; } @@ -576,6 +579,8 @@ static ssize_t nfs_direct_write(struct i if (result < size) break; } + nfs_add_stats(inode, NFS_WIRE_WRITTEN_BYTES, tot_bytes); + nfs_add_stats(inode, NFS_DIRECT_WRITTEN_BYTES, tot_bytes); return tot_bytes; } diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/file.c 02-nfs-iostat/fs/nfs/file.c --- 01-mountstats/fs/nfs/file.c 2005-03-02 02:38:38.0 -0500 +++ 02-nfs-iostat/fs/nfs/file.c 2005-03-14 15:42:52.446804000 -0500 @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -86,18 +87,15 @@ static int nfs_check_flags(int flags) static int nfs_file_open(struct inode *inode, struct file *filp) { - struct nfs_server *server = NFS_SERVER(inode); - int (*open)(struct inode *, struct file *); int res; res = nfs_check_flags(filp->f_flags); if (res) return res; + nfs_inc_stats(inode, NFS_VFS_OPEN); lock_kernel(); - /* Do NFSv4 open() call */ - if ((open = server->rpc_ops->file_open) != NULL) - res = open(inode, filp); + res = NFS_SERVER(inode)->rpc_ops->file_open(inode, filp); unlock_kernel(); return res; } @@ -105,6 +103,7 @@ nfs_file_open(struct inode *inode, struc static int nfs_file_release(struct inode *inode, struct file *filp) { + nfs_inc_stats(inode, NFS_VFS_CLOSE); return NFS_PROTO(inode)->file_release(inode, filp); } @@ -123,6 +122,7 @@ nfs_file_flush(struct file *file) if ((file->f_mode & FMODE_WRITE) == 0) return
Re: 2.6.11 oops in skb_drop_fraglist
Andrew Morton wrote: Chuck Lever <[EMAIL PROTECTED]> wrote: testing NFS client workloads on a dual Pentium-III system running 2.6.11 with some NFS patches. i hit this oops while doing simple-minded ftps and tars. the system locks up once or twice a day under this workload. this is the first time i had the console and captured the oops output. Chuck, I didn't see any followup to this. Is it still happening in current kernels? i have not been able to reproduce it with the aforementioned NFS patches removed. i'm now convinced it was a bug in one of the NFS patches i had applied, even though none of them come near the fraglist stuff, but i haven't had a chance to nail it down. i had implemented a patch to cause the RPC client to reuse the port number when reconnecting to the server after the server drops the connection... this is a standard practice for other RPC implementations. i suspect it was that patch that was causing the trouble. thanks for the follow-up! begin:vcard fn:Chuck Lever n:Lever;Charles org:Network Appliance, Incorporated;Linux NFS Client Development adr:535 West William Street, Suite 3100;;Center for Information Technology Integration;Ann Arbor;MI;48103-4943;USA email;internet:[EMAIL PROTECTED] title:Member of Technical Staff tel;work:+1 734 763-4415 tel;fax:+1 734 763 4434 tel;home:+1 734 668-1089 x-mozilla-html:FALSE url:http://www.monkey.org/~cel/ version:2.1 end:vcard
2.6.11 oops in skb_drop_fraglist
testing NFS client workloads on a dual Pentium-III system running 2.6.11 with some NFS patches. i hit this oops while doing simple-minded ftps and tars. the system locks up once or twice a day under this workload. this is the first time i had the console and captured the oops output. Unable to handle kernel NULL pointer dereference at virtual address 0001 printing eip: c02fc752 *pde = Oops: [#1] SMP CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010202 (2.6.11-CITI_NFS4_ALL-1) EIP is at skb_drop_fraglist+0x22/0x50 eax: f6fe26e0 ebx: 0001 ecx: f6fe26e0 edx: 0001 esi: f6f29240 edi: 0004 ebp: f697ed24 esp: c04cadd8 ds: 007b es: 007b ss: 0068 Process swapper (pid: 0, threadinfo=c04ca000 task=c03efb60) Stack: 0001 c02fc7fa f6f29240 f697ecc0 c02fc838 c02fc8ba ccbf2ab0 0b50 c0326a16 f6f87740 f6f29240 f6f29240 c0321c4a 0020 f6f87778 f6f87740 f697ecc0 f69d1560 00625a79 c04cae6c 0012 f697ecc0 Call Trace: [] skb_release_data+0x5a/0x90 [] kfree_skbmem+0x8/0x20 [] __kfree_skb+0x6a/0xf0 [] tcp_transmit_skb+0x406/0x720 [] tcp_clean_rtx_queue+0x17a/0x410 [] tcp_ack+0xf6/0x580 [] tcp_rcv_established+0x409/0x7f0 [] apic_timer_interrupt+0x1c/0x24 [] tcp_v4_do_rcv+0x110/0x120 [] tcp_v4_rcv+0x6bf/0x940 [] ip_local_deliver+0xc2/0x1f0 [] ip_rcv+0x336/0x450 [] alloc_skb+0x41/0xf0 [] netif_receive_skb+0x136/0x1a0 [] e1000_clean_rx_irq+0x15e/0x4a0 [] __kfree_skb+0x6a/0xf0 [] e1000_clean+0xba/0xf0 [] net_rx_action+0x6f/0x100 [] __do_softirq+0xb9/0xd0 [] do_softirq+0x4a/0x60 c02fc730: 53 push %ebx c02fc731: 8b 80 94 00 00 00 mov0x94(%eax),%eax c02fc737: 8b 58 0cmov0xc(%eax),%ebx c02fc73a: c7 40 0c 00 00 00 00movl $0x0,0xc(%eax) c02fc741: eb 0d jmpc02fc750 c02fc743: 90 nop c02fc744: 90 nop c02fc745: 90 nop c02fc746: 90 nop c02fc747: 90 nop c02fc748: 90 nop c02fc749: 90 nop c02fc74a: 90 nop c02fc74b: 90 nop c02fc74c: 90 nop c02fc74d: 90 nop c02fc74e: 90 nop c02fc74f: 90 nop c02fc750: 89 da mov%ebx,%edx c02fc752: 8b 1b mov(%ebx),%ebx c02fc754: 8b 82 84 00 00 00 mov0x84(%edx),%eax c02fc75a: 48 dec%eax c02fc75b: 75 13 jnec02fc770 c02fc75d: f0 83 44 24 00 00 lock addl $0x0,0x0(%esp,1) c02fc763: 89 d0 mov%edx,%eax c02fc765: e8 e6 00 00 00 call c02fc850 <__kfree_skb> c02fc76a: 85 db test %ebx,%ebx c02fc76c: 75 e2 jnec02fc750 c02fc76e: 5b pop%ebx c02fc76f: c3 ret c02fc770: f0 ff 8a 84 00 00 00lock decl 0x84(%edx) c02fc777: 0f 94 c0sete %al c02fc77a: 84 c0 test %al,%al c02fc77c: 74 ec je c02fc76a c02fc77e: eb e3 jmpc02fc763 begin:vcard fn:Chuck Lever n:Lever;Charles org:Network Appliance, Incorporated;Linux NFS Client Development adr:535 West William Street, Suite 3100;;Center for Information Technology Integration;Ann Arbor;MI;48103-4943;USA email;internet:[EMAIL PROTECTED] title:Member of Technical Staff tel;work:+1 734 763-4415 tel;fax:+1 734 763 4434 tel;home:+1 734 668-1089 x-mozilla-html:FALSE url:http://www.monkey.org/~cel/ version:2.1 end:vcard
RE: [NFS] [PATCH] New patch to flush out dirty mmap()ed NFS pages in 2.4.4
the default behavior is that close() waits for all write-backs to be committed to the server's disk. you might add support for the "nocto" mount option so that waiting is skipped for shared mmap'd files, but then what happens to data that is pinned on the client because a write-back failed after close() returns to the application? what's the application domain Linus is trying to optimize? > Linus was not too keen on the patches that circulated last week. In > his concept of shared mmap(), he wants it to ignore the usual > requirement we have on normal files whereby we flush out the pages on > file close. The problem is, though, that we need at least to schedule > the writes using the correct credentials, and thus a compromise was to > do this when closing the file... > > The following patch attempts to implement a fix along Linus' > specification. Features are: > >- Remove inode operation force_delete(). The latter is toxic to all > mmap(), as it can cause the inode to get killed before the dirty > pages get scheduled. > >- Schedule dirty pages upon last fput() of the file. > >- Always write out all dirty pages to the server when > locking/unlocking. > >- Add a write_inode() method in order to allow bdflush() and > friends to force out writes when the user calls sys_sync(), when > umounting, or when memory pressure is high. > >- Since we in any case have to add the write_inode() method, scrap > the NFS special O_SYNC code, so we can just use the generic stuff > (which will be faster for large writes). > > Comments? > > Cheers, > Trond > > diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/file.c > linux-2.4.4-mmap/fs/nfs/file.c > --- linux-2.4.4-fixes/fs/nfs/file.c Fri Feb 9 20:29:44 2001 > +++ linux-2.4.4-mmap/fs/nfs/file.cSat May 12 21:31:39 2001 > @@ -39,6 +39,7 @@ > static ssize_t nfs_file_write(struct file *, const char *, > size_t, loff_t *); > static int nfs_file_flush(struct file *); > static int nfs_fsync(struct file *, struct dentry *dentry, int > datasync); > +static int nfs_file_release(struct inode *, struct file *); > > struct file_operations nfs_file_operations = { > read: nfs_file_read, > @@ -46,7 +47,7 @@ > mmap: nfs_file_mmap, > open: nfs_open, > flush: nfs_file_flush, > - release:nfs_release, > + release:nfs_file_release, > fsync: nfs_fsync, > lock: nfs_lock, > }; > @@ -87,6 +88,13 @@ > return status; > } > > +int > +nfs_file_release(struct inode *inode, struct file *file) > +{ > + filemap_fdatasync(inode->i_mapping); > + return nfs_release(inode,file); > +} > + > static ssize_t > nfs_file_read(struct file * file, char * buf, size_t count, loff_t *ppos) > { > @@ -283,9 +291,11 @@ >* Flush all pending writes before doing anything >* with locks.. >*/ > - down(&filp->f_dentry->d_inode->i_sem); > + filemap_fdatasync(inode->i_mapping); > + down(&inode->i_sem); > status = nfs_wb_all(inode); > - up(&filp->f_dentry->d_inode->i_sem); > + up(&inode->i_sem); > + filemap_fdatawait(inode->i_mapping); > if (status < 0) > return status; > > @@ -300,10 +310,12 @@ >*/ > out_ok: > if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) { > - down(&filp->f_dentry->d_inode->i_sem); > + filemap_fdatasync(inode->i_mapping); > + down(&inode->i_sem); > nfs_wb_all(inode); /* we may have slept */ > + up(&inode->i_sem); > + filemap_fdatawait(inode->i_mapping); > nfs_zap_caches(inode); > - up(&filp->f_dentry->d_inode->i_sem); > } > return status; > } > diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/inode.c > linux-2.4.4-mmap/fs/nfs/inode.c > --- linux-2.4.4-fixes/fs/nfs/inode.c Wed Apr 25 23:58:17 2001 > +++ linux-2.4.4-mmap/fs/nfs/inode.c Sat May 12 23:54:16 2001 > @@ -45,6 +45,7 @@ > static void nfs_invalidate_inode(struct inode *); > > static void nfs_read_inode(struct inode *); > +static void nfs_write_inode(struct inode *,int); > static void nfs_delete_inode(struct inode *); > static void nfs_put_super(struct super_block *); > static void nfs_umount_begin(struct super_block *); > @@ -52,7 +53,7 @@ > > static struct super_operations nfs_sops = { > read_inode: nfs_read_inode, > - put_inode: force_delete, > + write_inode:nfs_write_inode, > delete_inode: nfs_delete_inode, > put_super: nfs_put_super, > statfs: nfs_statfs, > @@ -113,6 +114,14 @@ > NFS_CACHEINV(inode); > NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode); > NFS_ATTRTIMEO_UPDATE(inode) = jiffies; > +} > + > +static void > +nfs_write_inode(struct inode *inode, int sync) > +{ > + int flags = sync ? FLUSH_WAIT : 0; > + > + nfs_
Re: [PATCH] fix broken handling of port=... in NFS option parsing
ACK. Al Viro wrote: Obviously broken on little-endian; fortunately, the option is not frequently used... Signed-off-by: Al Viro <[EMAIL PROTECTED]> --- diff --git a/fs/nfs/super.c b/fs/nfs/super.c index b34b7a7..b2a851c 100644 --- a/fs/nfs/super.c +++ b/fs/nfs/super.c @@ -732,7 +732,7 @@ static int nfs_parse_mount_options(char *raw, return 0; if (option < 0 || option > 65535) return 0; - mnt->nfs_server.address.sin_port = htonl(option); + mnt->nfs_server.address.sin_port = htons(option); break; case Opt_rsize: if (match_int(args, &mnt->rsize)) begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
DMA CRC errors with SiS chipset and notebook drive
For various reasons I want a silent PC, so I'm using a notebook hard drive in my desktop system. I've recycled an ancient Foxconn Socket 754 mainboard in it with this IDE controller: 00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE] (rev 01) (prog-if 80 [Master]) Subsystem: Foxconn International, Inc. Unknown device 0c92 Flags: bus master, medium devsel, latency 128 I/O ports at 01f0 [size=8] I/O ports at 03f4 [size=1] I/O ports at 0170 [size=8] I/O ports at 0374 [size=1] I/O ports at 4000 [size=16] On boot, the kernel reports: SCSI subsystem initialized libata version 2.21 loaded. pata_sis :00:02.5: version 0.5.1 scsi0 : pata_sis scsi1 : pata_sis ata1: PATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x00 014000 irq 14 ata2: PATA max UDMA/133 cmd 0x00010170 ctl 0x00010376 bmdma 0x00 014008 irq 15 ata1.00: ATA-6: ST94813A, 3.04, max UDMA/100 ata1.00: 78140160 sectors, multi 16: LBA48 ata1.01: ATAPI: TSSTcorpCD/DVDW SH-S182D, SB00, max UDMA/33 ata1.00: configured for UDMA/100 ata1.01: configured for UDMA/33 scsi 0:0:0:0: Direct-Access ATA ST94813A 3.04 PQ: 0 ANSI: 5 sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sda3 sd 0:0:0:0: [sda] Attached SCSI disk scsi 0:0:1:0: CD-ROMTSSTcorp CD/DVDW SH-S182D SB00 PQ: 0 ANSI: 5 Then later in the log, I see: Aug 9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 Aug 9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64) Aug 9 07:41:29 monet kernel: ata1.00: cmd c8/00:10:73:15:00/00:00:00:00:00/e1 tag 0 cdb 0x12 data 8192 in Aug 9 07:41:29 monet kernel: res 51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error) Aug 9 07:41:29 monet kernel: ata1: soft resetting port Aug 9 07:41:29 monet kernel: ata1.00: configured for UDMA/100 Aug 9 07:41:29 monet kernel: ata1.01: configured for UDMA/33 Aug 9 07:41:29 monet kernel: ata1: EH complete And eventually: Aug 9 07:41:29 monet kernel: ata1.00: limiting speed to UDMA/66:PIO4 Aug 9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 Aug 9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64) Aug 9 07:41:29 monet kernel: ata1.00: cmd c8/00:08:fb:7c:a1/00:00:00:00:00/e0 tag 0 cdb 0x12 data 4096 in Aug 9 07:41:29 monet kernel: res 51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error) Aug 9 07:41:29 monet kernel: ata1: soft resetting port Aug 9 07:41:29 monet kernel: ata1.00: configured for UDMA/66 Aug 9 07:41:29 monet kernel: ata1.01: configured for UDMA/33 Aug 9 07:41:29 monet kernel: ata1: EH complete After that, no more DMA error messages (until the next system reboot). I've tried this with a brand-new Seagate Momentus 5400.2 and a fairly new Western Digital WD800VE. According to the disk specifications available on the vendor web sites, both drives are capable of UDMA/100 speeds. This error occurs whether the drive is on the primary IDE controller by itself, or whether it's sharing the controller with an optical drive (ata1.01 in the above listing). I've changed the disk to the secondary controller, I've swapped IDE cables, and changed out the 44-40 pin notebook drive adapter; no change. The system is running 2.6.22.1-41.fc7 (x86-64, of course), but I've seen this on 2.6.21-based kernels as well. I've noticed similar errors with a Dell Latitude D600 (x86, not SiS), also running recent Fedora 7 kernels. Searching the web seems to indicate that either the drive or the controller is bad, but since the error appears in so many different hardware configurations, I'm beginning to think this is a problem with libata, as that is an obvious common element. This e-mail is mostly an experience report, but I'm also wondering if I should be concerned about data on the drive, or have I missed the real problem entirely? Addendum: The internal SMART error log on the Seagate shows these errors repeatedly: Error 103 occurred at disk power-on lifetime: 296 hours (12 days + 8 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 e0 Error: ICRC, ABRT at LBA = 0x = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- c8 00
Re: DMA CRC errors with SiS chipset and notebook drive
Alan Cox wrote: Aug 9 07:41:29 monet kernel: ata1.00: limiting speed to UDMA/66:PIO4 Aug 9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 Aug 9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64) Aug 9 07:41:29 monet kernel: ata1.00: cmd c8/00:08:fb:7c:a1/00:00:00:00:00/e0 tag 0 cdb 0x12 data 4096 in Aug 9 07:41:29 monet kernel: res 51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error) Aug 9 07:41:29 monet kernel: ata1: soft resetting port Aug 9 07:41:29 monet kernel: ata1.00: configured for UDMA/66 Aug 9 07:41:29 monet kernel: ata1.01: configured for UDMA/33 Aug 9 07:41:29 monet kernel: ata1: EH complete Drops to UDMA33 for 40 wire cable. It drops from UDMA/100 to UDMA/66 for the disk, but stays at UDMA/33 for the optical drive. If I have only the disk on that controller, the disk still drops to UDMA/66. Looks to me like your cabling is out of spec or you have electrical noise problems. I've tried different cabling solutions, including round and flat IDE cables. What might cause "electrical noise"? Are 44-40 pin adapters reliable? begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel version:2.1 end:vcard
Re: request for patches: showing mount options
Miklos: Some mount options are never passed to the kernel, and thus can't appear in /proc/mounts. Examples include user, users, and _netdev for NFS. Miklos Szeredi wrote: [please consider pruning the CC list if discussing some aspect, which doesn't concern all] I've done an audit of all filesystems with regards to showing mount options in /proc//mounts. Unfortunately most of them show none or only a part of all accepted options (for details see list of filesystems at the end of the mail). This is currently not a big problem, because mount(8) stores the given options in /etc/mtab. However we want to get rid of mtab, and this requires, that the option showing be fixed up. It would be easiest if this was done by the VFS instead of having to deal with it in filesystems. However there are differences in how filesytems handle options during mount and remount, and it would be impossible to take this into account in all cases. If you are CC-ed, and responsible for one of these filesystems, please take a moment to fully implement the ->show_options() method. In most cases it should be an easy task. If for some reason you are unable to do this, please let me know and I'll fix it up. Here are some guidelines for showing options. I'll also add these to Documentation/filesystems/vfs.txt + If a filesystem accepts mount options, it must define show_options() + to show all the currently active options. The rules are: + + - options MUST be shown which are not default or their values differ + from the default + + - options MAY be shown which are enabled by default or have their + default value + + Options used only internally between a mount helper and the kernel + (such as file descriptors), or which only have an effect during the + mounting (such as ones controlling the creation of a journal) are exempt + from the above rules. Thanks, Miklos --- legend: all - fs has options, but doesn't define ->show_options() some - fs defines ->show_options(), but some options are not shown noopt - fs does not have options good - fs shows all options patch - I have a patch 9p some adfsall (maintainer?) affsall afs all autofs all autofs4 some befsall bfs noopt cifssome (odd parser) codanoopt configfsnoopt cramfs noopt debugfs noopt devpts patch ecryptfssome efs noopt ext2patch ext3patch ext4patch fat some freevxfsnoopt fusepatch gfs2good hfs good hfsplus good hostfs patch hpfsall hppfs noopt hugetlbfs all isofs all (maintainer?) jffs2 noopt jfs some minix noopt msdos ->fat ncpfs all (FS_BINARY_MOUNTDATA?) nfs some nfsdnoopt ntfsgood (odd parser) ocfs2 all openpromfs noopt procnoopt qnx4noopt ramfs noopt reiserfsall romfs noopt smbfs good (odd parser) (maintainer?) sysfs noopt sysvnoopt udf all ufs all vfat->fat xfs some (odd parser) mm/shmem.cpatch drivers/oprofile/oprofilefs.c noopt drivers/infiniband/hw/ipath/ipath_fs.cnoopt drivers/misc/ibmasm/ibmasmfs.cnoopt drivers/usb/core (usbfs) noopt drivers/usb/gadget (gadgetfs) noopt drivers/isdn/capi/capifs.cnoopt kernel/cpuset.c noopt fs/binfmt_misc.c noopt net/sunrpc/rpc_pipe.c noopt arch/powerpc/platforms/cell/spufs all arch/s390/hypfs all ipc/mqueue.c noopt security (securityfs) noopt security/selinux/selinuxfs.c noopt in -mm: reiser4some (odd parser) kernel/container.c good (odd parser) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: request for patches: showing mount options
Miklos Szeredi wrote: Some mount options are never passed to the kernel, and thus can't appear in /proc/mounts. Examples include user, users, and _netdev for NFS. These options control *who* may mount and *when* to mount. They are not a property of the mount itself and are not added to /etc/mtab. There's a "user=ID" option that is added to /etc/mtab in case of user mounts. This identifies the owner of the mount, so that it can be unmounted by that user. There are patches in -mm that enable the kernel to store this info. Do you have other examples in mind? [no]quota comes to mind; also auto, [no]owner, [no]group, and quiet/loud, but these may fall into the same category you mention above. Aside: It's a confusing artifact of the mount CLI that these options control who/when but are passed to the mount command in the same way the other options are. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: request for patches: showing mount options
Miklos Szeredi wrote: Some mount options are never passed to the kernel, and thus can't appear in /proc/mounts. Examples include user, users, and _netdev for NFS. These options control *who* may mount and *when* to mount. They are not a property of the mount itself and are not added to /etc/mtab. There's a "user=ID" option that is added to /etc/mtab in case of user mounts. This identifies the owner of the mount, so that it can be unmounted by that user. There are patches in -mm that enable the kernel to store this info. Do you have other examples in mind? There are a few more cases for NFS mount. After a successful mount, the NFS mount command tucks some options into /etc/mtab that reflect which mountd was used for the mount, and what protocol version and port was used for the mount request. Those options are not passed to the kernel, and do not appear in /proc/mounts today. See nfs(5)'s discussion of the mountport, mounthost, mountprog, and mountvers options. However, the trend for NFS is to push mount option parsing into the kernel. Thus all options will be passed to the kernel, and at that point it should be able to reflect the mount* options in /proc/mounts. But it doesn't do that quite yet. I'm wondering if there are other such cases in other file systems. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel version:2.1 end:vcard
Re: request for patches: showing mount options
Miklos Szeredi wrote: After a successful mount, the NFS mount command tucks some options into /etc/mtab that reflect which mountd was used for the mount, and what protocol version and port was used for the mount request. Those options are not passed to the kernel, and do not appear in /proc/mounts today. See nfs(5)'s discussion of the mountport, mounthost, mountprog, and mountvers options. However, the trend for NFS is to push mount option parsing into the kernel. Thus all options will be passed to the kernel, and at that point it should be able to reflect the mount* options in /proc/mounts. But it doesn't do that quite yet. Trond, do you have a roadmap for this? Well I'm actually doing the coding, and Trond is playing more of an architectural role. We have a first implementation of in-kernel mount option parsing in 2.6.23-rc now. I'm currently working on the user-space piece of this. (And actually, now is a great time to review the new kernel part, while it is still quite young.) However, the NFS mount user-space pieces have undergone radical change recently. The mount.nfs helper was split from the mount command just last year, and is only now starting to go into distributions. This is very old code that has been hacked on for over a decade, so it is taking a little while to rediscover its history and modernize it before we move forward. I expect that both the kernel part and the user-space part will evolve together over the next few months as we clarify the full set of requirements. The requirements for this effort now include: + making new mount options simple to implement; + removing ABI dependencies between mount.nfs and the kernel NFS client; + an eventual merge of the nfs and nfs4 file system types; + improved error handling and reporting during the mount process; + support for NFS over IPv6. I think there is also some talk about fully supporting SELinux as well, but I haven't been following that closely. The removal of /etc/mtab in favor of /proc/mounts is a new requirement, and is not as trivial as you might hope. Internally the NFS client represents the mount options as a binary data structure, and it contains only the information that has traditionally been passed into the kernel by the current mount command. The user-space-only options are not passed to the kernel nor stored in the data structure. Adding facilities to store information about every possible mount option, including the user-space-only ones, will take a bit of time, but is possible, if not straightforward. We just have to understand all the dependencies. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel version:2.1 end:vcard
Re: request for patches: showing mount options
Miklos Szeredi wrote: After a successful mount, the NFS mount command tucks some options into /etc/mtab that reflect which mountd was used for the mount, and what protocol version and port was used for the mount request. Those options are not passed to the kernel, and do not appear in /proc/mounts today. See nfs(5)'s discussion of the mountport, mounthost, mountprog, and mountvers options. However, the trend for NFS is to push mount option parsing into the kernel. Thus all options will be passed to the kernel, and at that point it should be able to reflect the mount* options in /proc/mounts. But it doesn't do that quite yet. Trond, do you have a roadmap for this? Well I'm actually doing the coding, and Trond is playing more of an architectural role. OK, what your estimage for this then? I don't have an estimate. This is all very slippery because once I get into a part of the code, we discover a lot of issues that we didn't expect. The NFS mount stuff is largely historical; we've all forgotten (or never really knew) how it works. It would be nice to have all this stuff in 2.6.24, which doesn't leave a lot of time. Yes, that would be nice, but there's a lot of stuff that needs to get done before this. NFS IPv6, for example, is a higher priority than refactoring to remove /etc/mtab -- the US government has a new requirement in 2008 for IPv6 support in any software that it purchases, and NFS may be a stumbling block for distributors if it doesn't have it. So I'd say "no way" for 2.6.24, but it's really Trond's call to make. But if it's just those four options you mentioned, it should be managable. I do not think there needs to be some generic code to hande userspace-only options, it would be perfectly fine just to parse, store and show them like all the other options. Like you, I don't expect it will be difficult to implement, but I have too many balls in the air to make any promises at the moment. Plus, we can't really predict when distributors will feel the in-kernel mount parsing stuff will be ready for their users. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel version:2.1 end:vcard
[PATCH] Fix the sign of the result of a conditional expression
In include/asm-x86_64/bitops.h, the find_{first,next,first_zero,next_zero}_bit macros return a result type that depends on the width of the "size" argument. The type of both arms of a conditional expression should always be the same. I changed the return type of __scanbit() to match the return type of the x86_64 find_*_bit() functions. -- corporate: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Eliminate result signage problem in asm-x86_64/bitops.h
The return type of __scanbit() doesn't match the return type of find_{first,next}_bit(). Thus when you construct something like this: boolean ? __scanbit() : find_first_bit() you get an unsigned long result if "boolean" is true, and a signed long result if "boolean" is false. In file included from /home/cel/src/linux/include/linux/mmzone.h:15, from /home/cel/src/linux/include/linux/gfp.h:4, from /home/cel/src/linux/include/linux/slab.h:14, from /home/cel/src/linux/include/linux/percpu.h:5, from /home/cel/src/linux/include/linux/rcupdate.h:41, from /home/cel/src/linux/include/linux/dcache.h:10, from /home/cel/src/linux/include/linux/fs.h:275, from /home/cel/src/linux/fs/nfs/sysctl.c:9: /home/cel/src/linux/include/linux/nodemask.h: In function ‘__first_node’: /home/cel/src/linux/include/linux/nodemask.h:229: warning: signed and unsigned type in conditional expression /home/cel/src/linux/include/linux/nodemask.h: In function ‘__next_node’: /home/cel/src/linux/include/linux/nodemask.h:235: warning: signed and unsigned type in conditional expression /home/cel/src/linux/include/linux/nodemask.h: In function ‘__first_unset_node’: /home/cel/src/linux/include/linux/nodemask.h:253: warning: signed and unsigned type in conditional expression Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> --- include/asm-x86_64/bitops.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/asm-x86_64/bitops.h b/include/asm-x86_64/bitops.h index d4dbbe5..1d7d9b4 100644 --- a/include/asm-x86_64/bitops.h +++ b/include/asm-x86_64/bitops.h @@ -260,7 +260,7 @@ extern long find_first_bit(const unsigned long * addr, unsigned long size); extern long find_next_bit(const unsigned long * addr, long size, long offset); /* return index of first bet set in val or max when no bit is set */ -static inline unsigned long __scanbit(unsigned long val, unsigned long max) +static inline long __scanbit(unsigned long val, unsigned long max) { asm("bsfq %1,%0 ; cmovz %2,%0" : "=&r" (val) : "r" (val), "r" (max)); return val; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Eliminate result signage problem in asm-x86_64/bitops.h
I apologize for sending a separate cover letter for a single patch. Andi Kleen wrote: On Wed, Aug 15, 2007 at 05:02:47PM -0400, Chuck Lever wrote: The return type of __scanbit() doesn't match the return type of find_{first,next}_bit(). Thus when you construct something like this: boolean ? __scanbit() : find_first_bit() Why would you want to write this? What is boolean? Do they have different arguments? So here's the definition of the x86_64 find_first_bit() macro, straight from include/x86_64/bitops.h: #define find_first_bit(addr,size) \ ((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \ (__scanbit(*(unsigned long *)addr,(size))) : \ find_first_bit(addr,size))) In this case "boolean" is: __builtin_constant_p(size) && (size) <= BITS_PER_LONG the first arm of the conditional is: __scanbit(*(unsigned long *)addr,(size)) the second arm of the conditional is: find_first_bit(addr,size) (this is the "function" version of find_first_bit, not the macro that's being defined. The naming here is unfortunately confusing). Thus, roughly speaking, when the type of "size" is smaller than a long, the macro's return type evaluates to unsigned long. If "size" is larger than a long, the macro's return type evaluates to signed long. By making the return type of __scanbit() an unsigned long, both arms of the conditional evaluate to the same result type. It's on my todo list for some time to special case f_f_b() and friends for smaller arguments. Would that eliminate this construct? Well, I can only assume what you mean by this, but I think that would address the problem. My real interest here is to eliminate a whole lot of compiler noise when I enable -Wsign-compare for certain parts of the kernel. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel version:2.1 end:vcard
[PATCH 4/5] NET: Make ts_recent_stamp unsigned
The get_seconds() function returns an unsigned long. To prevent incorrect comparison results between values saved in ts_recent_stamp and later invocations of get_seconds(), change the type of ts_recent_stamp to match the return type of get_seconds(). Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> --- include/linux/tcp.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index bac17c5..129ddb4 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -206,7 +206,7 @@ struct tcp_sack_block { struct tcp_options_received { /* PAWS/RTTM data */ - longts_recent_stamp;/* Time we stored ts_recent (for aging) */ + unsigned long ts_recent_stamp;/* Time we stored ts_recent (for aging) */ u32 ts_recent; /* Time stamp to echo next */ u32 rcv_tsval; /* Time stamp value */ u32 rcv_tsecr; /* Time stamp echo reply*/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] NET: Treat the sign of the result of skb_headroom() consistently
In some places, the result of skb_headroom() is compared to an unsigned integer, and in others, the result is compared to a signed integer. Make the comparisons consistent and correct. Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> --- include/linux/skbuff.h |4 ++-- net/ipv4/ip_gre.c |2 +- net/ipv4/ip_output.c |2 +- net/ipv4/ipip.c|2 +- net/ipv4/ipvs/ip_vs_xmit.c |2 +- net/ipv4/tcp_input.c |2 +- net/ipv6/ip6_output.c |2 +- net/ipv6/ip6_tunnel.c |2 +- net/ipv6/sit.c |2 +- 9 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 91f0bac..57257d2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -994,7 +994,7 @@ static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len) * * Return the number of bytes of free space at the head of an &sk_buff. */ -static inline int skb_headroom(const struct sk_buff *skb) +static inline unsigned int skb_headroom(const struct sk_buff *skb) { return skb->data - skb->head; } @@ -1347,7 +1347,7 @@ static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev, * Returns true if modifying the header part of the cloned buffer * does not requires the data to be copied. */ -static inline int skb_clone_writable(struct sk_buff *skb, int len) +static inline int skb_clone_writable(struct sk_buff *skb, unsigned int len) { return !skb_header_cloned(skb) && skb_headroom(skb) + len <= skb->hdr_len; diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index f151900..c3568ab 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -674,7 +674,7 @@ static int ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev) struct rtable *rt; /* Route to the other host */ struct net_device *tdev;/* Device to other host */ struct iphdr *iph; /* Our new IP header */ - intmax_headroom;/* The extra header space needed */ + unsigned int max_headroom; /* The extra header space needed */ intgre_hlen; __be32 dst; intmtu; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index f508835..e5f7dc2 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -161,7 +161,7 @@ static inline int ip_finish_output2(struct sk_buff *skb) struct dst_entry *dst = skb->dst; struct rtable *rt = (struct rtable *)dst; struct net_device *dev = dst->dev; - int hh_len = LL_RESERVED_SPACE(dev); + unsigned int hh_len = LL_RESERVED_SPACE(dev); if (rt->rt_type == RTN_MULTICAST) IP_INC_STATS(IPSTATS_MIB_OUTMCASTPKTS); diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 5cd5bbe..8c2b2b0 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -515,7 +515,7 @@ static int ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev) struct net_device *tdev;/* Device to other host */ struct iphdr *old_iph = ip_hdr(skb); struct iphdr *iph; /* Our new IP header */ - intmax_headroom;/* The extra header space needed */ + unsigned int max_headroom; /* The extra header space needed */ __be32 dst = tiph->daddr; intmtu; diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c index d0a92de..7c074e3 100644 --- a/net/ipv4/ipvs/ip_vs_xmit.c +++ b/net/ipv4/ipvs/ip_vs_xmit.c @@ -325,7 +325,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp, __be16 df = old_iph->frag_off; sk_buff_data_t old_transport_header = skb->transport_header; struct iphdr *iph; /* Our new IP header */ - intmax_headroom;/* The extra header space needed */ + unsigned int max_headroom; /* The extra header space needed */ intmtu; EnterFunction(10); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9288220..3dbbb44 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3909,7 +3909,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, while (before(start, end)) { struct sk_buff *nskb; - int header = skb_headroom(skb); + unsigned int header = skb_headroom(skb); int copy = SKB_MAX_ORDER(header, 0); /* Too big header? This can happen with IPv6. */ diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 13565df..653fc0a 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -171,7 +171,7 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl, u32 m
[PATCH 5/5] NET: Remove unneeded implicit type case when calling tcp_minshall_update()
The tcp_minshall_update() function is called in exactly one place, and is passed an unsigned integer for the mss_len argument. Make the sign of the argument match the sign of the passed variable in order to eliminate an unneeded implicit type cast and a mixed sign comparison in tcp_minshall_update(). Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> --- include/net/tcp.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 92049e6..d695cea 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -803,7 +803,7 @@ static inline int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight) return left <= tcp_max_burst(tp); } -static inline void tcp_minshall_update(struct tcp_sock *tp, int mss, +static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss, const struct sk_buff *skb) { if (skb->len < mss) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/5] NET: Make ts_recent_stamp unsigned
David Miller wrote: From: Chuck Lever <[EMAIL PROTECTED]> Date: Tue, 23 Oct 2007 11:44:28 -0400 The get_seconds() function returns an unsigned long. To prevent incorrect comparison results between values saved in ts_recent_stamp and later invocations of get_seconds(), change the type of ts_recent_stamp to match the return type of get_seconds(). Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> I see two potential problems with this patch: 1) If you update struct tcp_options_received you should also update struct tcp_timewait_sock similarly. The fact that you missed this suggests that you didn't grep the tree to see how else this variable is used and this makes me extra concerned about this patch's correctness. Perhaps the result of wishful thinking on my part. I was hoping for a small and self-contained change. 2) There are computations in the TCP stack using this member that probably care about the signedness, such as: net/ipv4/tcp_ipv4.c: get_seconds() - tcptw->tw_ts_recent_stamp > 1))) { include/net/tcp.h: if (get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_24DAYS) include/net/tcp.h: if (get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_24DAYS) We should make sure we understand what is expected here, and why it would still be correct after making both ts_recent_stamp members unsigned. Agreed. I wonder how one could construct a series of mixed case time stamp comparisons *on purpose* (and without documentation of this assumption) that produces consistently correct results. From the invocations of get_seconds() that I sampled, the design of these comparisons seems to assume that both sides of the comparison are non-negative. However, they do not seem to account for time crossing zero. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [2.6 patch] make sunrpc/xprtsock.c:xs_setup_{udp,tcp}() static
Adrian Bunk wrote: xs_setup_{udp,tcp}() can now become static. ACK. Sorry this was overlooked. Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- include/linux/sunrpc/xprtsock.h |6 -- net/sunrpc/xprtsock.c |4 ++-- 2 files changed, 2 insertions(+), 8 deletions(-) 833a31c8caef70589f33be8e3a1fc9d8e01ce3c2 diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h index 2c6c2c2..c2a46c4 100644 --- a/include/linux/sunrpc/xprtsock.h +++ b/include/linux/sunrpc/xprtsock.h @@ -9,12 +9,6 @@ #ifdef __KERNEL__ -/* - * Socket transport setup operations - */ -struct rpc_xprt *xs_setup_udp(struct xprt_create *args); -struct rpc_xprt *xs_setup_tcp(struct xprt_create *args); - intinit_socket_xprt(void); void cleanup_socket_xprt(void); diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 02298f5..2f630a5 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -1828,7 +1828,7 @@ static struct rpc_xprt *xs_setup_xprt(struct xprt_create *args, * @args: rpc transport creation arguments * */ -struct rpc_xprt *xs_setup_udp(struct xprt_create *args) +static struct rpc_xprt *xs_setup_udp(struct xprt_create *args) { struct sockaddr *addr = args->dstaddr; struct rpc_xprt *xprt; @@ -1894,7 +1894,7 @@ struct rpc_xprt *xs_setup_udp(struct xprt_create *args) * @args: rpc transport creation arguments * */ -struct rpc_xprt *xs_setup_tcp(struct xprt_create *args) +static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args) { struct sockaddr *addr = args->dstaddr; struct rpc_xprt *xprt; begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [NFS] [PATCH 001 of 9] knfsd: nfsd4: fix non-terminated string
Ming Zhang wrote: On Tue, 2007-02-13 at 10:44 +1100, NeilBrown wrote: From: J. Bruce Fields <[EMAIL PROTECTED]> The server name is expected to be a null-terminated string, so we can't pass in the raw client identifier. What's more, the client identifier is just a binary, not necessarily printable, blob. Let's just use the ip address instead. The server name appears to exist just to help debugging by making some printk's more informative. Note that the string is copies into the rpc client structure, so the pointer to the local variable does not outlive the function call. Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]> Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./fs/nfsd/nfs4callback.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff .prev/fs/nfsd/nfs4callback.c ./fs/nfsd/nfs4callback.c --- .prev/fs/nfsd/nfs4callback.c2007-02-13 09:50:26.0 +1100 +++ ./fs/nfsd/nfs4callback.c2007-02-13 10:00:59.0 +1100 @@ -387,7 +387,6 @@ nfsd4_probe_callback(struct nfs4_client .address = (struct sockaddr *)&addr, .addrsize = sizeof(addr), .timeout= &timeparms, - .servername = clp->cl_name.data, .program= program, .version= nfs_cb_version[1]->number, .authflavor = RPC_AUTH_UNIX,/* XXX: need AUTH_GSS... */ @@ -397,6 +396,7 @@ nfsd4_probe_callback(struct nfs4_client .rpc_proc = &nfs4_cb_procedures[NFSPROC4_CLNT_CB_NULL], .rpc_argp = clp, }; + char clientname[16]; int status; if (atomic_read(&cb->cb_set)) @@ -419,6 +419,11 @@ nfsd4_probe_callback(struct nfs4_client memset(program->stats, 0, sizeof(cb->cb_stat)); program->stats->program = program; + /* Just here to make some printk's more useful: */ + snprintf(clientname, sizeof(clientname), + "%u.%u.%u.%u", NIPQUAD(addr.sin_addr)); can use NIPQUAD_FMT here instead of "%u.%u.%u.%u". btw, will the ip address here possibly be an ipv6 address? Some patches are in the works to build in IPv6 support. See the patch series at http://oss.oracle.com/~cel/linux-2.6/2.6.19/patches/ begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture Linux Projects Group email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: 2.6.20-git8 fails compile -- net/built-in.o __ipv6_addr_type
David Miller wrote: From: Pete Clements <[EMAIL PROTECTED]> Date: Mon, 12 Feb 2007 20:10:13 -0500 (EST) 2.6.20-git8 fails compile: CHK include/linux/compile.h UPD include/linux/compile.h CC init/version.o LD init/built-in.o LD .tmp_vmlinux1 net/built-in.o: In function `svc_udp_recvfrom': svcsock.c:(.text+0x61be4): undefined reference to `__ipv6_addr_type' make: *** [.tmp_vmlinux1] Error 1 Chuck, you will need to somehow make CONFIG_SUNRPC "depend" upon IPV6 so that if IPV6 is modular SUNRPC can only be built modular. Otherwise the symbols won't resolve correctly. Everybody hits this problem when they add ipv6 support to something. :-) Interestingly, doing a build with ALLYESCONFIG, ALLMODCONFIG, and ALLNOCONFIG doesn't catch this type of error. I just did a copy+paste and that brought in ipv6_addr_type. So I'm not convinced it's really needed here. David, can you take a look at the code in svcsock.c right around the ipv6_addr_type() call and let me know if we can avoid that call outright? begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture Linux Projects Group email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.
Roland Dreier wrote: > They are mostly from Chuck Level and make preparating for IPv6 support > in the NFS server. > They are *not* for 2.6.20, but should be ok for .21. Out of curiousity, does this patch series reduce the delta between the NFS/RDMA tree and mainline Linux? In other words does this bring NFS/RDMA closer to merging? Hi Roland- The client side support for an RPC/RDMA module is almost completely integrated into mainline. There is still a minimal set of patches required to support alternate transports in loadable kernel modules which Trond has indicated he will integrate when the RPC/RDMA transport is ready to be integrated. At this time I'm not aware of a plan to integrate server-side support for NFS/RDMA. Perhaps the NetApp RDMA developers could respond. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture Linux Projects Group email;internet:chuck dot lever at nospam oracle dot com title:Principle Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [NFS] [PATCH 002 of 13] knfsd: SUNRPC: allow creating an RPC service without registering with portmapper
Oops. This one looks old. Let me see if I can dig up the latest. On 12/8/06, Trond Myklebust <[EMAIL PROTECTED]> wrote: On Fri, 2006-12-08 at 23:01 +1100, NeilBrown wrote: > From: Chuck Lever <[EMAIL PROTECTED]> > Sometimes we need to create an RPC service but not register it with the > local portmapper. NFSv4 delegation callback, for example. > > Change the svc_makesock() API to allow optionally creating temporary or > permanent sockets, optionally registering with the local portmapper, and > make it return the ephemeral port of the new socket. NAK. This one is still buggy. The NFSv4 callback server should _NOT_ be registering its listening socket on the RPC server 'temporary' list. Trond > Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> > Cc: Aurelien Charbon <[EMAIL PROTECTED]> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]> > > ### Diffstat output > ./fs/lockd/svc.c | 26 -- > ./fs/nfs/callback.c | 20 +--- > ./fs/nfsd/nfssvc.c |6 -- > ./include/linux/sunrpc/svcsock.h |2 +- > ./net/sunrpc/svcsock.c |6 -- > 5 files changed, 34 insertions(+), 26 deletions(-) > > diff .prev/fs/lockd/svc.c ./fs/lockd/svc.c > --- .prev/fs/lockd/svc.c 2006-12-08 13:36:33.0 +1100 > +++ ./fs/lockd/svc.c 2006-12-08 13:36:33.0 +1100 > @@ -223,23 +223,29 @@ static int find_socket(struct svc_serv * > return found; > } > > +/* > + * Make any sockets that are needed but not present. > + * If nlm_udpport or nlm_tcpport were set as module > + * options, make those sockets unconditionally > + */ > static int make_socks(struct svc_serv *serv, int proto) > { > - /* Make any sockets that are needed but not present. > - * If nlm_udpport or nlm_tcpport were set as module > - * options, make those sockets unconditionally > - */ > - static int warned; > + static int warned; > int err = 0; > + > if (proto == IPPROTO_UDP || nlm_udpport) > if (!find_socket(serv, IPPROTO_UDP)) > - err = svc_makesock(serv, IPPROTO_UDP, nlm_udpport); > - if (err == 0 && (proto == IPPROTO_TCP || nlm_tcpport)) > + err = svc_makesock(serv, IPPROTO_UDP, nlm_udpport, > + SVC_SOCK_DEFAULTS); > + if (err >= 0 && (proto == IPPROTO_TCP || nlm_tcpport)) > if (!find_socket(serv, IPPROTO_TCP)) > - err= svc_makesock(serv, IPPROTO_TCP, nlm_tcpport); > - if (!err) > + err = svc_makesock(serv, IPPROTO_TCP, nlm_tcpport, > + SVC_SOCK_DEFAULTS); > + > + if (err >= 0) { > warned = 0; > - else if (warned++ == 0) > + err = 0; > + } else if (warned++ == 0) > printk(KERN_WARNING > "lockd_up: makesock failed, error=%d\n", err); > return err; > > diff .prev/fs/nfs/callback.c ./fs/nfs/callback.c > --- .prev/fs/nfs/callback.c 2006-12-08 13:36:33.0 +1100 > +++ ./fs/nfs/callback.c 2006-12-08 13:36:33.0 +1100 > @@ -106,7 +106,6 @@ static void nfs_callback_svc(struct svc_ > int nfs_callback_up(void) > { > struct svc_serv *serv; > - struct svc_sock *svsk; > int ret = 0; > > lock_kernel(); > @@ -119,17 +118,14 @@ int nfs_callback_up(void) > ret = -ENOMEM; > if (!serv) > goto out_err; > - /* FIXME: We don't want to register this socket with the portmapper */ > - ret = svc_makesock(serv, IPPROTO_TCP, nfs_callback_set_tcpport); > - if (ret < 0) > + > + ret = svc_makesock(serv, IPPROTO_TCP, nfs_callback_set_tcpport, > + (SVC_SOCK_ANONYMOUS | SVC_SOCK_TEMPORARY)); > + if (ret <= 0) > goto out_destroy; > - if (!list_empty(&serv->sv_permsocks)) { > - svsk = list_entry(serv->sv_permsocks.next, > - struct svc_sock, sk_list); > - nfs_callback_tcpport = ntohs(inet_sk(svsk->sk_sk)->sport); > - dprintk ("Callback port = 0x%x\n", nfs_callback_tcpport); > - } else > - BUG(); > + nfs_callback_tcpport = ret; > + dprintk("Callback port = 0x%x\n", nfs_callback_tcpport); > + > ret = svc_create_thread(nfs_callback_svc, serv); > if (ret < 0) > goto out_destroy; > @@ -140,6 +136,8 @@ out: > unlock_kernel(); > return ret; > out_destroy: > +
Re: [NFS] [PATCH 010 of 14] knfsd: SUNRPC: add a "generic" function to see if the peer uses a secure port
On 12/12/06, Andrew Morton <[EMAIL PROTECTED]> wrote: On Wed, 13 Dec 2006 10:59:27 +1100 NeilBrown <[EMAIL PROTECTED]> wrote: > From: Chuck Lever <[EMAIL PROTECTED]> > The only reason svcsock.c looks at a sockaddr's port is to check whether > the remote peer is connecting from a privileged port. Refactor this check > to hide processing that is specific to address format. > > Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> > Cc: Aurelien Charbon <[EMAIL PROTECTED]> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]> > > ### Diffstat output > ./net/sunrpc/svcsock.c | 20 +--- > 1 file changed, 17 insertions(+), 3 deletions(-) > > diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c > --- .prev/net/sunrpc/svcsock.c2006-12-13 10:32:15.0 +1100 > +++ ./net/sunrpc/svcsock.c2006-12-13 10:32:17.0 +1100 > @@ -926,6 +926,20 @@ svc_tcp_data_ready(struct sock *sk, int > wake_up_interruptible(sk->sk_sleep); > } > > +static inline int svc_port_is_privileged(struct sockaddr *sin) > +{ > + switch (sin->sa_family) { > + case AF_INET: > + return ntohs(((struct sockaddr_in *)sin)->sin_port) < 1024; > +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) > + case AF_INET6: > + return ntohs(((struct sockaddr_in6 *)sin)->sin6_port) < 1024; > +#endif > + default: > + return 0; > + } > +} I'm a bit surprised to see this test implemented in sunrpc - it's the sort of thing which core networking should implement? The check is open-coded in each socket type's bind callout, and includes a capability check which I believe the NFS server doesn't require. And should that "1024" be PROT_SOCK? All I can say is "Doh!" I'll send Neil a replacement with this fixed. -- "We who cut mere stones must always be envisioning cathedrals" -- Quarry worker's creed - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [NFS] [PATCH 2.6.19-rc6] sunrpc: fix race condition
On 11/27/06, Chris Caputo <[EMAIL PROTECTED]> wrote: From: Chris Caputo <[EMAIL PROTECTED]> [PATCH 2.6.19-rc6] sunrpc: fix race condition Patch linux-2.6.10-01-rpc_workqueue.dif introduced a race condition into net/sunrpc/sched.c in kernels 2.6.11-rc1 through 2.6.19-rc6. The race scenario is as follows... Given: RPC_TASK_QUEUED, RPC_TASK_RUNNING and RPC_TASK_ASYNC are set. __rpc_execute() (no spinlock)rpc_make_runnable() (queue spinlock held) -- do_ret = rpc_test_and_set_running(task); rpc_clear_running(task); if (RPC_IS_ASYNC(task)) { if (RPC_IS_QUEUED(task)) return 0; rpc_clear_queued(task); if (do_ret) return; Thus both threads return and the task is abandoned forever. In my test NFS client usage (~200 Mb/s at ~3,000 RPC calls/s) this race condition has resulted in processes getting permanently stuck in 'D' state often in less than 15 minutes of uptime. The following patch fixes the problem by returning to use of a spinlock in __rpc_execute(). Signed-off-by: Chris Caputo <[EMAIL PROTECTED]> --- diff -up a/net/sunrpc/sched.c b/net/sunrpc/sched.c --- a/net/sunrpc/sched.c2006-11-27 08:41:07.0 + +++ b/net/sunrpc/sched.c2006-11-27 11:14:21.0 + @@ -587,6 +587,7 @@ EXPORT_SYMBOL(rpc_exit_task); static int __rpc_execute(struct rpc_task *task) { int status = 0; + struct rpc_wait_queue *queue; dprintk("RPC: %4d rpc_execute flgs %x\n", task->tk_pid, task->tk_flags); @@ -631,22 +632,27 @@ static int __rpc_execute(struct rpc_task lock_kernel(); task->tk_action(task); unlock_kernel(); + /* micro-optimization to avoid spinlock */ + if (!RPC_IS_QUEUED(task)) + continue; } /* -* Lockless check for whether task is sleeping or not. +* Check whether task is sleeping. */ - if (!RPC_IS_QUEUED(task)) - continue; - rpc_clear_running(task); + queue = task->u.tk_wait.rpc_waitq; + spin_lock_bh(&queue->lock); if (RPC_IS_ASYNC(task)) { - /* Careful! we may have raced... */ - if (RPC_IS_QUEUED(task)) - return 0; - if (rpc_test_and_set_running(task)) + if (RPC_IS_QUEUED(task)) { + rpc_clear_running(task); + spin_unlock_bh(&queue->lock); return 0; + } + spin_unlock_bh(&queue->lock); continue; } + rpc_clear_running(task); + spin_unlock_bh(&queue->lock); /* sync task: sleep here */ dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid); The reason the spin lock was removed from the scheduler is because once the BKL is removed from the RPC and NFS clients, the locks in the RPC scheduler become contented. Each RPC request passes through this part of the scheduler an average of 12 times (probably more if a bind or credential refresh is required), so the locking overhead becomes critical. As you are working this fix, can you make sure to test with a heavy RPC workload against a fast server and make sure that lock contention remains reasonable? -- "We who cut mere stones must always be envisioning cathedrals" -- Quarry worker's creed - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
Al Viro wrote: On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote: ... or, alternatively, add a subfield to the first field (which would entail escaping whatever separator we choose): /dev/md6 /export ext3 rw,data=ordered 0 0 /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0 /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0 Hell, no. The first field is in principle impossible to parse unless you know the fs type. How about making a new file with sane format? From the very beginning. E.g. mountpoint + ID + relative path + type + options, where ID uniquely identifies superblock (e.g. numeric st_dev) and backing device (if any) is sitting among the options... To support NFS client performance statistics, I recently added /proc/self/mountstats. That might be a place to add details about --move and --bind mounts without changing the format of /proc/mounts. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel/ version:2.1 end:vcard
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
H. Peter Anvin wrote: Chuck Lever wrote: To support NFS client performance statistics, I recently added /proc/self/mountstats. That might be a place to add details about --move and --bind mounts without changing the format of /proc/mounts. I just looked at /proc/self/mountstats; it seems to have no more information than /proc/self/mounts, but in an even more annoying format. Either I'm missing something, this file doesn't add anything at all. The advantage is that it doesn't have strong user space dependencies on its format like /proc/mounts does. If you have NFS mount points, you will see that it includes a great deal of additional information about each mount. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel/ version:2.1 end:vcard
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
H. Peter Anvin wrote: Chuck Lever wrote: The advantage is that it doesn't have strong user space dependencies on its format like /proc/mounts does. If you have NFS mount points, you will see that it includes a great deal of additional information about each mount. OK, I see now: device raidtest:/export mounted on /net/raidtest/export with fstype nfs statvers=1.0 opts: rw,vers=3,rsize=131072,wsize=131072,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys age:5 caps: caps=0x9,wtmult=4096,dtsize=4096,bsize=0,namelen=255 sec:flavor=1,pseudoflavor=1 events: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 bytes: 0 0 0 0 0 0 0 0 RPC iostats version: 1.0 p/v: 13/3 (nfs) xprt: tcp 686 0 2 0 5 8 8 0 8 0 per-op statistics NULL: 0 0 0 0 0 0 0 0 GETATTR: 2 2 0 264 224 1 0 1 SETATTR: 0 0 0 0 0 0 0 0 LOOKUP: 0 0 0 0 0 0 0 0 ACCESS: 1 1 0 116 120 0 0 0 READLINK: 0 0 0 0 0 0 0 0 READ: 0 0 0 0 0 0 0 0 WRITE: 0 0 0 0 0 0 0 0 CREATE: 0 0 0 0 0 0 0 0 MKDIR: 0 0 0 0 0 0 0 0 SYMLINK: 0 0 0 0 0 0 0 0 MKNOD: 0 0 0 0 0 0 0 0 REMOVE: 0 0 0 0 0 0 0 0 RMDIR: 0 0 0 0 0 0 0 0 RENAME: 0 0 0 0 0 0 0 0 LINK: 0 0 0 0 0 0 0 0 READDIR: 0 0 0 0 0 0 0 0 READDIRPLUS: 0 0 0 0 0 0 0 0 FSSTAT: 1 1 0 132 84 0 1 1 FSINFO: 1 1 0 132 80 0 0 0 PATHCONF: 0 0 0 0 0 0 0 0 COMMIT: 0 0 0 0 0 0 0 0 This format is just awful for parsing. It's pretty clearly totally ad-hoc. It's not even self-consistent (it uses different separators, etc, in the same file!) It's reasonably compact for human consumption, but it doesn't show what the arrays mean. Heck, XML would have been better than this mess... Sigh. So where where you when I asked for review time and again? I have a couple of simple Python scripts that can parse this without any difficulty. I resent your tone. Quite a bit. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel/ version:2.1 end:vcard
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Hi Chris- John Stoffel wrote: As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. Ie. being able to setup two pools, one being RAID6, the other being RAID1, where all currently accessed files are in the RAID1 setup, but if un-used get migrated to the RAID6 area. And of course some way for efficient backups and more importantly RESTORES of data which is segregated like this. I like the way dump and restore was handled in AFS (and now ZFS and NetApp). There is a simple command to flatten a file system and send it to another system, which can receive it and re-expand it. The dump/restore process uses snapshots and can easily send incremental backups which are significantly smaller than 0-level. This is somewhat better than rsync, because you don't need checksums to discover what data has changed -- you already have the new data segregated into copied-on-write blocks. NetApp happens to use the standard NDMP protocol for sending the flattened file system. NetApp uses it for synchronous replication, volume migration, and back up to nearline storage and tape. AFS used "vol dump" and "vol restore" for migration, replication, and back-up. ZFS has the "zfs send" and "zfs receive" commands that do basically the same (Eric Kustarz recently published a blog entry that described how these work). And of course, all file system objects are able to be sent this way: streams, xattrs, ACLs, and so on are all supported. Note also that NFSv4 supports the idea of migrated or replicated file objects. All that is needed to support it is a mechanism on the servers to actually move the data. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Chris Mason wrote: On Thu, Jun 14, 2007 at 02:20:26PM -0400, Chuck Lever wrote: NetApp happens to use the standard NDMP protocol for sending the flattened file system. NetApp uses it for synchronous replication, volume migration, and back up to nearline storage and tape. AFS used "vol dump" and "vol restore" for migration, replication, and back-up. ZFS has the "zfs send" and "zfs receive" commands that do basically the same (Eric Kustarz recently published a blog entry that described how these work). And of course, all file system objects are able to be sent this way: streams, xattrs, ACLs, and so on are all supported. Note also that NFSv4 supports the idea of migrated or replicated file objects. All that is needed to support it is a mechanism on the servers to actually move the data. Stringing the replication together with the underlying FS would be neat. Is there a way to deal with a master/slave setup, where the slave may be out of date? Among the implementations I'm aware of, there is a varying degree of integration into the physical file system. In general, it depends on how far out of date the slave is, and how closely the slave is supposed to be synchronized to the master. A hot backup file system, for example, should be data-consistent within a few seconds of the master. A snapshot is used to initialize a slave, followed by a live stream of updates to the master being sent to slaves. Such a mechanism already exists on NetApp filers because they gather changes in NVRAM before committing them to the local file system. Simply put, these changes can also be bundled and sent to a local hot backup filer that is attached via Infiniband, or over the network to a remote hot backup filer. For AFS, replication is done by maintaining a rw and ro copy of a volume on the designated master server. Changes are made to the rw copy over time. When admins want to push out a new version to replicas on another server, the ro copy on the master is replaced with a new snapshot, then this is pushed to the slaves. The replicas are always ro and are used mostly for load balancing; clients contact the closest or fastest server containing a replica of the volume they want to access. They always have a complete copy of the volume (ie no COW on the slaves). I think you have designed into btrfs a lot of opportunity to implement this kind of data virtualization and management... I'm excited to see what can be done. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel/ version:2.1 end:vcard
Re: [bisect] NFS regression breaks X
Trond Myklebust wrote: On Wed, 2007-05-09 at 15:52 -0700, Andrew Morton wrote: It's a bit rough that Jeff spent a large amount of time hunting down an already-known bug. That's normally my job :( The bug was reported by Florin Iucha (on lkml!) on Saturday. It has only just been debugged, and I was in fact in the middle of marshalling the fixes. This five-week-old diff only ever appeared in 2.6.21-mm1, which was released four days ago. It was then whizzed into mainline. We thus lost five weeks public testing which would probably have saved Jeff his pain. What went wrong? Probably my fault. I've had a couple of weeks of heavy travel due to various circumstances that were beyond my control, and so I had little time in which to test the stuff and push it out. Another factor that is affecting us is the slow but gradual collapse of the OSDL NFSv4 regression testing effort. I had expected that many of these issues would be caught by the OSDL test harness. I learned only yesterday that it was no longer available, so I am making an effort to broaden my personal test regime. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE url:http://oss.oracle.com/~cel/ version:2.1 end:vcard
Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)
Mike Snitzer wrote: On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote: Roland Dreier wrote: > > They are mostly from Chuck Level and make preparating for IPv6 support > > in the NFS server. > > They are *not* for 2.6.20, but should be ok for .21. > > Out of curiousity, does this patch series reduce the delta between the > NFS/RDMA tree and mainline Linux? In other words does this bring > NFS/RDMA closer to merging? Hi Roland- The client side support for an RPC/RDMA module is almost completely integrated into mainline. There is still a minimal set of patches required to support alternate transports in loadable kernel modules which Trond has indicated he will integrate when the RPC/RDMA transport is ready to be integrated. Hi Chuck, I must be missing something because I don't see _any_ trace of the core RPC over RDMA support (xprtrdma et al), your RPC Transport Switch, or any of the other supporting changes in mainline. Could you, or others, please clarify the plan for merging RPC/RDMA? The RPC transport switch patches are almost fully integrated into mainline. The xprtrdma piece is what is not there yet. At this time I'm not aware of a plan to integrate server-side support for NFS/RDMA. Perhaps the NetApp RDMA developers could respond. Do you mean the NFS/RDMA server code from Tom Tucker of Open Grid Computing? Why wouldn't this code be included? Tom's code hasn't been reviewed by the community. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)
Mike Snitzer wrote: On 4/13/07, Chuck Lever <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote: >> Roland Dreier wrote: >> > > They are mostly from Chuck Level and make preparating for IPv6 >> support >> > > in the NFS server. >> > > They are *not* for 2.6.20, but should be ok for .21. >> > >> > Out of curiousity, does this patch series reduce the delta between the >> > NFS/RDMA tree and mainline Linux? In other words does this bring >> > NFS/RDMA closer to merging? >> >> Hi Roland- >> >> The client side support for an RPC/RDMA module is almost completely >> integrated into mainline. There is still a minimal set of patches >> required to support alternate transports in loadable kernel modules >> which Trond has indicated he will integrate when the RPC/RDMA transport >> is ready to be integrated. > > Hi Chuck, > > I must be missing something because I don't see _any_ trace of the > core RPC over RDMA support (xprtrdma et al), your RPC Transport > Switch, or any of the other supporting changes in mainline. Could > you, or others, please clarify the plan for merging RPC/RDMA? The RPC transport switch patches are almost fully integrated into mainline. The xprtrdma piece is what is not there yet. OK, has the xprtrdma piece been reviewed by the community? What, if anything, is preventing the code from being included in -mm for wider testing? Some of the main NFS client developers (Trond, the CITI-UM developers, myself) have reviewed previous versions of this code, and have recommended a number of changes. I haven't heard anything recently from the NetApp engineers working on this. We are waiting for them to publish again, but I think they are waiting for the last of the transport switch patches to trickle into mainline (which should be ready with the release of 2.6.21). I know they have been working with a handful of test sites and have achieved good performance results. Has xprtrdma inclusion/progress been discussed on some other mailing list? Not yet, but the place that is likely to occur is [EMAIL PROTECTED] begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)
Mike Snitzer wrote: On 4/13/07, Chuck Lever <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote: >> Roland Dreier wrote: >> > > They are mostly from Chuck Level and make preparating for IPv6 >> support >> > > in the NFS server. >> > > They are *not* for 2.6.20, but should be ok for .21. >> > >> > Out of curiousity, does this patch series reduce the delta between the >> > NFS/RDMA tree and mainline Linux? In other words does this bring >> > NFS/RDMA closer to merging? >> >> Hi Roland- >> >> The client side support for an RPC/RDMA module is almost completely >> integrated into mainline. There is still a minimal set of patches >> required to support alternate transports in loadable kernel modules >> which Trond has indicated he will integrate when the RPC/RDMA transport >> is ready to be integrated. > > Hi Chuck, > > I must be missing something because I don't see _any_ trace of the > core RPC over RDMA support (xprtrdma et al), your RPC Transport > Switch, or any of the other supporting changes in mainline. Could > you, or others, please clarify the plan for merging RPC/RDMA? The RPC transport switch patches are almost fully integrated into mainline. The xprtrdma piece is what is not there yet. OK, has the xprtrdma piece been reviewed by the community? What, if anything, is preventing the code from being included in -mm for wider testing? OK, there is something holding up the client-side piece of NFS/RDMA. There is some hackery required for the NFS client to recognize that the RDMA transport should be used instead of the standard socket transport. We did discuss this a bit at Connectathon '07 six weeks ago, but my impression is that more discussion is required. I'm working on some changes to NFS mounting that would move NFS mount option parsing into the kernel. This would make it very simple to add RDMA transport support, but it's rather a while in coming. begin:vcard fn:Chuck Lever n:Lever;Chuck org:Oracle Corporation;Corporate Architecture: Linux Projects Group adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA email;internet:chuck dot lever at nospam oracle dot com title:Principal Member of Staff tel;work:+1 248 614 5091 x-mozilla-html:FALSE version:2.1 end:vcard
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
Trond Myklebust wrote: On Wed, 2007-04-18 at 20:52 -0500, Florin Iucha wrote: On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote: Do you have a copy of wireshark or ethereal on hand? If so, could you take a look at whether or not any NFS traffic is going between the client and server once the hang happens? I used the following command tcpdump -w nfs-traffic -i eth0 -vv -tt dst port nfs to capture http://iucha.net/nfs/21-rc7-nfs4/nfs-traffic.bz2 I started the capture before starting the copy and left it to run for a few minutes after the traffic slowed to a crawl. The iostat and vmstat are at: http://iucha.net/nfs/21-rc7-nfs4/iostat http://iucha.net/nfs/21-rc7-nfs4/vmstat It seems that my original problem report had a big mistake! There is no hang, but at some point the write slows down to a trickle (from 40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log. Yeah. You only captured the outgoing traffic to the server, but already it looks as if there were 'interesting' things going on. In frames 29346 to 29350, the traffic stops altogether for 5 seconds (I only see keepalives) then it starts up again. Ditto for frames 40477-40482 (another 5 seconds). ... Then at around frame 92072, the client starts to send a bunch of RSTs. Aha I'll bet that reverting the appended patch fixes the problem. The assumption Chuck makes is that if _no_ request bytes have been sent, yet the request is on the 'receive list' then it must be a resend is patently false in the case where the send queue just happens to be full. There are other places in the RPC client where "zero bytes sent" implies that the request has been sent. The real problem here is that zeroing the "bytes sent" field is overloaded. Perhaps instead of looking at the number of bytes sent, the logic in the last hunk of this patch should check which queue the request is sitting on. --- commit 43d78ef2ba5bec26d0315859e8324bfc0be23766 Author: Chuck Lever <[EMAIL PROTECTED]> Date: Tue Feb 6 18:26:11 2007 -0500 NFS: disconnect before retrying NFSv4 requests over TCP RFC3530 section 3.1.1 states an NFSv4 client MUST NOT send a request twice on the same connection unless it is the NULL procedure. Section 3.1.1 suggests that the client should disconnect and reconnect if it wants to retry a request. Implement this by adding an rpc_clnt flag that an ULP can use to specify that the underlying transport should be disconnected on a major timeout. The NFSv4 client asserts this new flag, and requests no retries after a minor retransmit timeout. Note that disconnecting on a retransmit is in general not safe to do if the RPC client does not reuse the TCP port number when reconnecting. See http://bugzilla.linux-nfs.org/show_bug.cgi?id=6 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]> diff --git a/fs/nfs/client.c b/fs/nfs/client.c index a3191f0..c46e94f 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -394,7 +394,8 @@ static void nfs_init_timeout_values(struct rpc_timeout *to, int proto, static int nfs_create_rpc_client(struct nfs_client *clp, int proto, unsigned int timeo, unsigned int retrans, - rpc_authflavor_t flavor) + rpc_authflavor_t flavor, + int flags) { struct rpc_timeout timeparms; struct rpc_clnt *clnt = NULL; @@ -407,6 +408,7 @@ static int nfs_create_rpc_client(struct nfs_client *clp, int proto, .program= &nfs_program, .version= clp->rpc_ops->version, .authflavor = flavor, + .flags = flags, }; if (!IS_ERR(clp->cl_rpcclient)) @@ -548,7 +550,7 @@ static int nfs_init_client(struct nfs_client *clp, const struct nfs_mount_data * * - RFC 2623, sec 2.3.2 */ error = nfs_create_rpc_client(clp, proto, data->timeo, data->retrans, - RPC_AUTH_UNIX); + RPC_AUTH_UNIX, 0); if (error < 0) goto error; nfs_mark_client_ready(clp, NFS_CS_READY); @@ -868,7 +870,8 @@ static int nfs4_init_client(struct nfs_client *clp, /* Check NFS protocol revision and initialize RPC op vector */ clp->rpc_ops = &nfs_v4_clientops; - error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour); + error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour, + RPC_CLNT_CREATE_DISCRTRY); i
Re: [PATCH] fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
> On Jul 22, 2018, at 4:50 AM, nixiaoming wrote: > > dummy = be32_to_cpup(p++); > dummy = be32_to_cpup(p++); > Assigning value to "dummy" here, but that stored value > is overwritten before it can be used. > > delete invalid assignment statements in nfsd4_decode_exchange_id > > Signed-off-by: n00202754 > --- > fs/nfsd/nfs4xdr.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index a96843c..8e78541 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -1392,8 +1392,8 @@ nfsd4_decode_exchange_id(struct nfsd4_compoundargs > *argp, > > /* ssp_window and ssp_num_gss_handles */ > READ_BUF(8); > - dummy = be32_to_cpup(p++); > - dummy = be32_to_cpup(p++); > + be32_to_cpup(p++); > + be32_to_cpup(p++); If these values are not used, what's the point of byte swapping them? Surely "p += 2;" should be enough. > break; > default: > goto xdr_error; -- Chuck Lever
Re: [PATCH] fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
> On Jul 22, 2018, at 2:33 PM, Trond Myklebust wrote: > > On Sun, 2018-07-22 at 14:12 -0400, Chuck Lever wrote: >>> On Jul 22, 2018, at 4:50 AM, nixiaoming >>> wrote: >>> >>> dummy = be32_to_cpup(p++); >>> dummy = be32_to_cpup(p++); >>> Assigning value to "dummy" here, but that stored value >>> is overwritten before it can be used. >>> >>> delete invalid assignment statements in nfsd4_decode_exchange_id >>> >>> Signed-off-by: n00202754 >>> --- >>> fs/nfsd/nfs4xdr.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>> index a96843c..8e78541 100644 >>> --- a/fs/nfsd/nfs4xdr.c >>> +++ b/fs/nfsd/nfs4xdr.c >>> @@ -1392,8 +1392,8 @@ nfsd4_decode_exchange_id(struct >>> nfsd4_compoundargs *argp, >>> >>> /* ssp_window and ssp_num_gss_handles */ >>> READ_BUF(8); >>> - dummy = be32_to_cpup(p++); >>> - dummy = be32_to_cpup(p++); >>> + be32_to_cpup(p++); >>> + be32_to_cpup(p++); >> >> If these values are not used, what's the point of byte swapping them? >> Surely "p += 2;" should be enough. >> >> >>> break; >>> default: >>> goto xdr_error; >> > > Given that the value of 'p' isn't used either, why not just delete > those two lines altogether? Sounds OK, READ_BUF is tracking progress through the buffer, and it already updates "p" as a side-effect. Might there be some nearby instances of open-coded "p" updates that could also be removed, for similar reasons? -- Chuck Lever
Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
> On Jul 25, 2018, at 9:27 AM, Krzysztof Kozlowski wrote: > > On 18 June 2018 at 18:20, Chuck Lever wrote: >> >> The extra serialization appears to have a reproducible performance >> impact on RDMA, which no longer takes the reserve_lock when allocating >> a slot. >> >> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would >> only work for socket-based transports. Would it be OK if RDMA had its >> own XID allocation mechanism? > > Hi, > > On recent next the issue appeared again. My boards with NFSv4 root > timeout on 80% of boots. This time my NFS server is faster - Pi3 B+ > :). > > Is this know? Should I start long bisect or maybe you can point me to > possible causes? Hi Krzysztof, I don't know of any recent changes. Bisecting would be a good place to start. -- Chuck Lever
Re: [PATCH] NFSD: hide unused svcxdr_dupstr()
> On Jan 19, 2018, at 9:54 AM, Arnd Bergmann wrote: > > There is now only one caller left for svcxdr_dupstr() and this is inside > of an #ifdef, so we can get a warning when the option is disabled: > > fs/nfsd/nfs4xdr.c:241:1: error: 'svcxdr_dupstr' defined but not used > [-Werror=unused-function] > > This adds another #ifdef around the definition. > > Fixes: e40d99e6183e ("NFSD: Clean up symlink argument XDR decoders") > Signed-off-by: Arnd Bergmann Seems OK to me, and sorry for the noise. Reviewed-by: Chuck Lever > --- > fs/nfsd/nfs4xdr.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index f51c9cccaf78..374a62af6034 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -237,6 +237,7 @@ svcxdr_tmpalloc(struct nfsd4_compoundargs *argp, u32 len) > * Note null-terminating in place usually isn't safe since the > * buffer might end on a page boundary. > */ > +#ifdef CONFIG_NFSD_V4_SECURITY_LABEL > static char * > svcxdr_dupstr(struct nfsd4_compoundargs *argp, void *buf, u32 len) > { > @@ -248,6 +249,7 @@ svcxdr_dupstr(struct nfsd4_compoundargs *argp, void *buf, > u32 len) > p[len] = '\0'; > return p; > } > +#endif > > /** > * savemem - duplicate a chunk of memory for later processing > -- > 2.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever
Re: [PATCH] sunrpc: fix a missing check of xdr_inline_decode
> On Dec 25, 2018, at 10:18 PM, Kangjie Lu wrote: > > xdr_inline_decode() could fail. NAK: xdr_inline_decode cannot fail if its second argument is zero. > When it fails, the return value is NULL > and should not be dereferenced. > The fix checks if xdr_inline_decode fails, and if so, returns. > > Signed-off-by: Kangjie Lu > --- > net/sunrpc/xprtrdma/backchannel.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/sunrpc/xprtrdma/backchannel.c > b/net/sunrpc/xprtrdma/backchannel.c > index e5b367a3e517..bd9be5272ef4 100644 > --- a/net/sunrpc/xprtrdma/backchannel.c > +++ b/net/sunrpc/xprtrdma/backchannel.c > @@ -285,6 +285,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt, > __be32 *p; > > p = xdr_inline_decode(&rep->rr_stream, 0); > + if (unlikely(!p)) > + goto out_overflow; > size = xdr_stream_remaining(&rep->rr_stream); > > #ifdef RPCRDMA_BACKCHANNEL_DEBUG > -- > 2.17.2 (Apple Git-113) > -- Chuck Lever
Re: [PATCH] sunrpc: remove redundant code
> On Dec 25, 2018, at 10:24 PM, Kangjie Lu wrote: > > If no bytes to decode, just use "xdr->p" instead of calling > xdr_inline_decode to get it. The fix cleans up the code. > > Signed-off-by: Kangjie Lu > --- > net/sunrpc/xprtrdma/rpc_rdma.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c > index 9f53e0240035..2ef86be49bd8 100644 > --- a/net/sunrpc/xprtrdma/rpc_rdma.c > +++ b/net/sunrpc/xprtrdma/rpc_rdma.c > @@ -1123,7 +1123,6 @@ rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct > rpcrdma_rep *rep, > { > struct xdr_stream *xdr = &rep->rr_stream; > u32 writelist, replychunk, rpclen; > - char *base; > > /* Decode the chunk lists */ > if (decode_read_list(xdr)) > @@ -1138,10 +1137,9 @@ rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct > rpcrdma_rep *rep, > return -EIO; > > /* Build the RPC reply's Payload stream in rqst->rq_rcv_buf */ > - base = (char *)xdr_inline_decode(xdr, 0); > rpclen = xdr_stream_remaining(xdr); > r_xprt->rx_stats.fixup_copy_count += > - rpcrdma_inline_fixup(rqst, base, rpclen, writelist & 3); > + rpcrdma_inline_fixup(rqst, xdr->p, rpclen, writelist & 3); I used xdr_inline_decode(xdr, 0) here because I didn't want to embed specific knowledge about struct xdr_stream into rpcrdma_decode_msg. IOW, adhere to the API contract. Given the need for adding a type cast to this function call, as reported by the kbuild robot, I don't see that this patch improves things significantly. > r_xprt->rx_stats.total_rdma_reply += writelist; > return rpclen + xdr_align_size(writelist); > -- > 2.17.2 (Apple Git-113) > -- Chuck Lever
Re: [PATCH net-next] svcrdma: Fix an uninitialized variable false warning
> On Dec 20, 2018, at 4:49 AM, YueHaibing wrote: > > smatch warning this: > net/sunrpc/xprtrdma/svc_rdma_rw.c:351 svc_rdma_post_chunk_ctxt() error: > uninitialized symbol 'bad_wr' > net/sunrpc/xprtrdma/verbs.c:1569 rpcrdma_post_recvs() error: uninitialized > symbol 'bad_wr' > > 'bad_wr' is initialized in ib_post_send. But smatch > doesn't know that and warns this. > > Signed-off-by: YueHaibing > --- > net/sunrpc/xprtrdma/svc_rdma_rw.c | 2 +- > net/sunrpc/xprtrdma/verbs.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c > b/net/sunrpc/xprtrdma/svc_rdma_rw.c > index dc19517..0954b25 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c > @@ -308,7 +308,7 @@ static int svc_rdma_post_chunk_ctxt(struct > svc_rdma_chunk_ctxt *cc) > struct svcxprt_rdma *rdma = cc->cc_rdma; > struct svc_xprt *xprt = &rdma->sc_xprt; > struct ib_send_wr *first_wr; > - const struct ib_send_wr *bad_wr; > + const struct ib_send_wr *bad_wr = NULL; > struct list_head *tmp; > struct ib_cqe *cqe; > int ret; > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c > index 3ddba94..37be70f 100644 > --- a/net/sunrpc/xprtrdma/verbs.c > +++ b/net/sunrpc/xprtrdma/verbs.c > @@ -1518,7 +1518,7 @@ void > rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp) > { > struct rpcrdma_buffer *buf = &r_xprt->rx_buf; > - struct ib_recv_wr *wr, *bad_wr; > + struct ib_recv_wr *wr, *bad_wr = NULL; > int needed, count, rc; > > rc = 0; > -- > 2.7.0 Does this need Fixes: d34ac5cd3a73 ("RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const") ??? Bart, any comments? -- Chuck Lever
Re: [PATCH 22/23] SUNRPC: simplify auth_unix.
>cred->fsgid; > - for (i = 0; i < groups; i++) > - cred->uc_gids[i] = acred->cred->group_info->gid[i]; > - if (i < UNX_NGROUPS) > - cred->uc_gids[i] = INVALID_GID; > - > - return &cred->uc_base; > -} > - > -static void > -unx_free_cred(struct unx_cred *unx_cred) > -{ > - dprintk("RPC: unx_free_cred %p\n", unx_cred); > - put_cred(unx_cred->uc_base.cr_cred); > - kfree(unx_cred); > + rpcauth_init_cred(ret, acred, auth, &unix_credops); > + ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE; > + return ret; > } > > static void > unx_free_cred_callback(struct rcu_head *head) > { > - struct unx_cred *unx_cred = container_of(head, struct unx_cred, > uc_base.cr_rcu); > - unx_free_cred(unx_cred); > + struct rpc_cred *rpc_cred = container_of(head, struct rpc_cred, cr_rcu); > + dprintk("RPC: unx_free_cred %p\n", rpc_cred); > + put_cred(rpc_cred->cr_cred); > + mempool_free(rpc_cred, unix_pool); > } > > static void > @@ -115,30 +73,32 @@ unx_destroy_cred(struct rpc_cred *cred) > } > > /* > - * Match credentials against current process creds. > - * The root_override argument takes care of cases where the caller may > - * request root creds (e.g. for NFS swapping). > + * Match credentials against current the auth_cred. > */ > static int > -unx_match(struct auth_cred *acred, struct rpc_cred *rcred, int flags) > +unx_match(struct auth_cred *acred, struct rpc_cred *cred, int flags) > { > - struct unx_cred *cred = container_of(rcred, struct unx_cred, uc_base); > unsigned int groups = 0; > unsigned int i; > > + if (cred->cr_cred == acred->cred) > + return 1; > > - if (!uid_eq(cred->uc_uid, acred->cred->fsuid) || !gid_eq(cred->uc_gid, > acred->cred->fsgid)) > + if (!uid_eq(cred->cr_cred->fsuid, acred->cred->fsuid) || > !gid_eq(cred->cr_cred->fsgid, acred->cred->fsgid)) > return 0; > > if (acred->cred && acred->cred->group_info != NULL) > groups = acred->cred->group_info->ngroups; > if (groups > UNX_NGROUPS) > groups = UNX_NGROUPS; > + if (cred->cr_cred->group_info == NULL) > + return groups == 0; > + if (groups != cred->cr_cred->group_info->ngroups) > + return 0; > + > for (i = 0; i < groups ; i++) > - if (!gid_eq(cred->uc_gids[i], acred->cred->group_info->gid[i])) > + if (!gid_eq(cred->cr_cred->group_info->gid[i], > acred->cred->group_info->gid[i])) > return 0; > - if (groups < UNX_NGROUPS && gid_valid(cred->uc_gids[groups])) > - return 0; > return 1; > } > > @@ -150,9 +110,10 @@ static __be32 * > unx_marshal(struct rpc_task *task, __be32 *p) > { > struct rpc_clnt *clnt = task->tk_client; > - struct unx_cred *cred = container_of(task->tk_rqstp->rq_cred, struct > unx_cred, uc_base); > + struct rpc_cred *cred = task->tk_rqstp->rq_cred; > __be32 *base, *hold; > int i; > + struct group_info *gi = cred->cr_cred->group_info; > > *p++ = htonl(RPC_AUTH_UNIX); > base = p++; > @@ -163,11 +124,12 @@ unx_marshal(struct rpc_task *task, __be32 *p) >*/ > p = xdr_encode_array(p, clnt->cl_nodename, clnt->cl_nodelen); > > - *p++ = htonl((u32) from_kuid(&init_user_ns, cred->uc_uid)); > - *p++ = htonl((u32) from_kgid(&init_user_ns, cred->uc_gid)); > + *p++ = htonl((u32) from_kuid(&init_user_ns, cred->cr_cred->fsuid)); > + *p++ = htonl((u32) from_kgid(&init_user_ns, cred->cr_cred->fsgid)); > hold = p++; > - for (i = 0; i < UNX_NGROUPS && gid_valid(cred->uc_gids[i]); i++) > - *p++ = htonl((u32) from_kgid(&init_user_ns, cred->uc_gids[i])); > + if (gi) > + for (i = 0; i < UNX_NGROUPS && i < gi->ngroups; i++) > + *p++ = htonl((u32) from_kgid(&init_user_ns, > gi->gid[i])); > *hold = htonl(p - hold - 1);/* gid array length */ > *base = htonl((p - base - 1) << 2); /* cred length */ > > @@ -214,12 +176,13 @@ unx_validate(struct rpc_task *task, __be32 *p) > > int __init rpc_init_authunix(void) > { > - return rpcauth_init_credcache(&unix_auth); > + unix_pool = mempool_create_kmalloc_pool(16, sizeof(struct rpc_cred)); > + return unix_pool ? 0 : -ENOMEM; > } > > void rpc_destroy_authunix(void) > { > - rpcauth_destroy_credcache(&unix_auth); > + mempool_destroy(unix_pool); > } > > const struct rpc_authops authunix_ops = { > @@ -228,9 +191,7 @@ const struct rpc_authops authunix_ops = { > .au_name= "UNIX", > .create = unx_create, > .destroy= unx_destroy, > - .hash_cred = unx_hash_cred, > .lookup_cred= unx_lookup_cred, > - .crcreate = unx_create_cred, > }; > > static > > -- Chuck Lever
Re: [PATCH 22/23] SUNRPC: simplify auth_unix.
> On Nov 7, 2018, at 8:41 PM, NeilBrown wrote: > > On Wed, Nov 07 2018, Chuck Lever wrote: > >> Hi Neil- >> >> >>> On Nov 6, 2018, at 11:12 PM, NeilBrown wrote: >>> >>> 1/ discard 'struct unx_cred'. We don't need any data that >>> is not already in 'struct rpc_cred'. >>> 2/ Don't keep these creds in a hash table. When a credential >>> is needed, simply allocate it. When not needed, discard it. >>> This can easily be faster than performing a lookup on >>> a shared hash table. > > Thanks for the review Chuck! > >> >> What's the basis for this claim? A memory allocation disables and >> enables IRQs. That definitely hits a resource that is globally >> shared. > > My basis is not rock solid, but I was convinced :-) > > kmem_cache_alloc() does disable local irqs when slab.c is used. > slub.c doesn't disable them in the fast path which I *think* should be > reasonably common. > slob always takes a spinlock as well as disabling interrupts. > > I think slob is only recommended for tiny machines, and slub is > generally preferred, so I think that when performance matters, it will > still be delivered. > > It isn't clear to me why you consider a local irq to be "globally > shared" - assuming that is what you mean. Globally-shared in this case can be construed somewhat narrowly. If we assume slub, kmem_cache_alloc() touches resources that are used by all tasks running on that CPU, at least in the slow path. > Disabling local interrupts is not without cost, but I don't think the > cost increases with the number of CPUs, while the cost of accessing > shared memory (even without a spinlock) does. Point taken, but having a single mempool for all RPC transports and users is also going to be a shared resource that can bottleneck. I guess that's an argument in favor of building creds on demand rather than keeping them in a hash table, isn't it. :-) >> In addition, the comment near unx_marshal suggests we should >> cache the marshaled on-the-wire version of the credential instead >> of building it in the RPC Call buffer every time. That would >> require keeping the creds around. > > That comment has been there since 2.1.32 and has not be acted on. There > seems little reason to expect that to change. Caching doesn't seem to > have been found to be necessary in practice. > >> >> Have you measured a significant difference in throughput with >> this patch? Have you considered improving the lookup speed of >> the hash table by making the buckets into rb-trees, for example? > > No, I haven't measured. > I might have briefly considered changing to an rb-tree, but as the > current hashtable doesn't actually contain anything of value, I would > have quickly discarded the idea. > > If I wanted to further improve performance, I would look at ways to > bypass the "lookup_cred" step completely. > unx_marshal only needs the generic "struct cred", so there should be no > need to have a 'struct rpc_cred' at all. > > If this series is accepted, I'll (hopefully) look into seeing how > practical that is. > > Thanks again, > NeilBrown > > >> >> >>> As the lookup can happen during write-out, use a mempool >>> to ensure forward progress. >>> This means that we cannot compare two credentials for >>> equality by comparing the pointers, but we never do that anyway. >>> >>> Signed-off-by: NeilBrown >>> --- >>> net/sunrpc/auth.c |1 >>> net/sunrpc/auth_unix.c | 101 >>> +++- >>> 2 files changed, 32 insertions(+), 70 deletions(-) >>> >>> diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c >>> index 867ea9834bde..a07a7c59d3a4 100644 >>> --- a/net/sunrpc/auth.c >>> +++ b/net/sunrpc/auth.c >>> @@ -651,6 +651,7 @@ rpcauth_init_cred(struct rpc_cred *cred, const struct >>> auth_cred *acred, >>> INIT_LIST_HEAD(&cred->cr_lru); >>> refcount_set(&cred->cr_count, 1); >>> cred->cr_auth = auth; >>> + cred->cr_flags = 0; >>> cred->cr_ops = ops; >>> cred->cr_expire = jiffies; >>> cred->cr_cred = get_cred(acred->cred); >>> diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c >>> index bff113a411e0..387f6b3ffbea 100644 >>> --- a/net/sunrpc/auth_unix.c >>> +++ b/net/sunrpc/auth_unix.c >>> @@ -11,16 +11,11
Re: [PATCH 03/20] filelock: the results of the coccinelle conversion
On Tue, Jan 16, 2024 at 02:45:59PM -0500, Jeff Layton wrote: > This is the direct result of the changes generated by coccinelle. There > is still about 1/4 of the callsites that need to be touched by hand > here. > > Signed-off-by: Jeff Layton For the changes in include/linux/lockd/lockd.h, fs/lockd/, and fs/nfsd/: Acked-by: Chuck Lever > --- > fs/9p/vfs_file.c| 38 ++--- > fs/afs/flock.c | 55 +++--- > fs/ceph/locks.c | 66 +++ > fs/dlm/plock.c | 44 ++--- > fs/fuse/file.c | 14 +- > fs/gfs2/file.c | 16 +- > fs/lockd/clnt4xdr.c | 6 +- > fs/lockd/clntlock.c | 2 +- > fs/lockd/clntproc.c | 60 --- > fs/lockd/clntxdr.c | 6 +- > fs/lockd/svclock.c | 10 +- > fs/lockd/svcsubs.c | 20 +-- > fs/lockd/xdr.c | 6 +- > fs/lockd/xdr4.c | 6 +- > fs/locks.c | 406 > +++- > fs/nfs/delegation.c | 2 +- > fs/nfs/file.c | 22 +-- > fs/nfs/nfs3proc.c | 2 +- > fs/nfs/nfs4proc.c | 35 ++-- > fs/nfs/nfs4state.c | 4 +- > fs/nfs/nfs4xdr.c| 8 +- > fs/nfs/write.c | 4 +- > fs/nfsd/filecache.c | 4 +- > fs/nfsd/nfs4layouts.c | 15 +- > fs/nfsd/nfs4state.c | 73 > fs/ocfs2/locks.c| 12 +- > fs/ocfs2/stack_user.c | 2 +- > fs/smb/client/cifssmb.c | 8 +- > fs/smb/client/file.c| 72 > fs/smb/client/smb2file.c| 2 +- > fs/smb/server/smb2pdu.c | 44 ++--- > fs/smb/server/vfs.c | 12 +- > include/linux/lockd/lockd.h | 8 +- > 33 files changed, 554 insertions(+), 530 deletions(-) > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > index 11cd8d23f6f2..f35ac7cb782e 100644 > --- a/fs/9p/vfs_file.c > +++ b/fs/9p/vfs_file.c > @@ -107,7 +107,7 @@ static int v9fs_file_lock(struct file *filp, int cmd, > struct file_lock *fl) > > p9_debug(P9_DEBUG_VFS, "filp: %p lock: %p\n", filp, fl); > > - if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) { > + if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_core.fl_type != > F_UNLCK) { > filemap_write_and_wait(inode->i_mapping); > invalidate_mapping_pages(&inode->i_data, 0, -1); > } > @@ -127,7 +127,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, > struct file_lock *fl) > fid = filp->private_data; > BUG_ON(fid == NULL); > > - BUG_ON((fl->fl_flags & FL_POSIX) != FL_POSIX); > + BUG_ON((fl->fl_core.fl_flags & FL_POSIX) != FL_POSIX); > > res = locks_lock_file_wait(filp, fl); > if (res < 0) > @@ -136,7 +136,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, > struct file_lock *fl) > /* convert posix lock to p9 tlock args */ > memset(&flock, 0, sizeof(flock)); > /* map the lock type */ > - switch (fl->fl_type) { > + switch (fl->fl_core.fl_type) { > case F_RDLCK: > flock.type = P9_LOCK_TYPE_RDLCK; > break; > @@ -152,7 +152,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, > struct file_lock *fl) > flock.length = 0; > else > flock.length = fl->fl_end - fl->fl_start + 1; > - flock.proc_id = fl->fl_pid; > + flock.proc_id = fl->fl_core.fl_pid; > flock.client_id = fid->clnt->name; > if (IS_SETLKW(cmd)) > flock.flags = P9_LOCK_FLAGS_BLOCK; > @@ -207,12 +207,12 @@ static int v9fs_file_do_lock(struct file *filp, int > cmd, struct file_lock *fl) >* incase server returned error for lock request, revert >* it locally >*/ > - if (res < 0 && fl->fl_type != F_UNLCK) { > - fl_type = fl->fl_type; > - fl->fl_type = F_UNLCK; > + if (res < 0 && fl->fl_core.fl_type != F_UNLCK) { > + fl_type = fl->fl_core.fl_type; > + fl->fl_core.fl_type = F_UNLCK; > /* Even if this fails we want to return the remote error */ > locks_lock_file_wait(filp, fl); > - fl->fl_type = fl_type; > + fl->fl_core.fl_type = fl_type; > } > if (flock.client_id != fid->clnt->name) > kfree(flock.client_id); > @@ -234,7 +234,7 @@ static int v9fs_file_getlock(struct file *filp, struct > file_lock *fl) >* if we have a conflicting lock l
Re: [PATCH 00/20] filelock: split struct file_lock into file_lock and file_lease structs
> fs/nfs/nfs4file.c | 2 +- > fs/nfs/nfs4proc.c | 39 +- > fs/nfs/nfs4state.c | 6 +- > fs/nfs/nfs4trace.h | 4 +- > fs/nfs/nfs4xdr.c| 8 +- > fs/nfs/write.c | 8 +- > fs/nfsd/filecache.c | 4 +- > fs/nfsd/nfs4callback.c | 2 +- > fs/nfsd/nfs4layouts.c | 34 +- > fs/nfsd/nfs4state.c | 98 ++--- > fs/ocfs2/locks.c| 12 +- > fs/ocfs2/stack_user.c | 2 +- > fs/smb/client/cifsfs.c | 2 +- > fs/smb/client/cifssmb.c | 8 +- > fs/smb/client/file.c| 74 ++-- > fs/smb/client/smb2file.c| 2 +- > fs/smb/server/smb2pdu.c | 44 +-- > fs/smb/server/vfs.c | 14 +- > include/linux/filelock.h| 58 ++- > include/linux/fs.h | 5 +- > include/linux/lockd/lockd.h | 8 +- > include/trace/events/afs.h | 4 +- > include/trace/events/filelock.h | 54 +-- > 48 files changed, 1119 insertions(+), 825 deletions(-) > --- > base-commit: 052d534373b7ed33712a63d5e17b2b6cdbce84fd > change-id: 20240116-flsplit-bdb46824db68 > > Best regards, > -- > Jeff Layton > -- Chuck Lever
Re: [PATCH 04/20] filelock: fixups after the coccinelle changes
On Tue, Jan 16, 2024 at 02:46:00PM -0500, Jeff Layton wrote: > The coccinelle script doesn't catch quite everythng (particularly with > embedded structs). These are some by-hand fixups after the split of > common fields into struct file_lock_core. > > Signed-off-by: Jeff Layton For the changes in fs/lockd/ and fs/nfsd/: Acked-by: Chuck Lever > --- > fs/ceph/locks.c | 8 ++--- > fs/lockd/clnt4xdr.c | 8 ++--- > fs/lockd/clntproc.c | 6 ++-- > fs/lockd/clntxdr.c | 8 ++--- > fs/lockd/svc4proc.c | 10 +++--- > fs/lockd/svclock.c | 54 + > fs/lockd/svcproc.c | 10 +++--- > fs/lockd/svcsubs.c | 4 +-- > fs/lockd/xdr.c | 8 ++--- > fs/lockd/xdr4.c | 8 ++--- > fs/locks.c | 67 > + > fs/nfs/delegation.c | 2 +- > fs/nfs/nfs4state.c | 2 +- > fs/nfs/nfs4trace.h | 4 +-- > fs/nfs/write.c | 4 +-- > fs/nfsd/nfs4callback.c | 2 +- > fs/nfsd/nfs4state.c | 4 +-- > fs/smb/client/file.c| 2 +- > fs/smb/server/vfs.c | 2 +- > include/trace/events/afs.h | 4 +-- > include/trace/events/filelock.h | 32 ++-- > 21 files changed, 126 insertions(+), 123 deletions(-) > > diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c > index ee12f9864980..55be5d231e38 100644 > --- a/fs/ceph/locks.c > +++ b/fs/ceph/locks.c > @@ -386,9 +386,9 @@ void ceph_count_locks(struct inode *inode, int > *fcntl_count, int *flock_count) > ctx = locks_inode_context(inode); > if (ctx) { > spin_lock(&ctx->flc_lock); > - list_for_each_entry(lock, &ctx->flc_posix, fl_list) > + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list) > ++(*fcntl_count); > - list_for_each_entry(lock, &ctx->flc_flock, fl_list) > + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list) > ++(*flock_count); > spin_unlock(&ctx->flc_lock); > } > @@ -455,7 +455,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode, > return 0; > > spin_lock(&ctx->flc_lock); > - list_for_each_entry(lock, &ctx->flc_posix, fl_list) { > + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list) { > ++seen_fcntl; > if (seen_fcntl > num_fcntl_locks) { > err = -ENOSPC; > @@ -466,7 +466,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode, > goto fail; > ++l; > } > - list_for_each_entry(lock, &ctx->flc_flock, fl_list) { > + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list) { > ++seen_flock; > if (seen_flock > num_flock_locks) { > err = -ENOSPC; > diff --git a/fs/lockd/clnt4xdr.c b/fs/lockd/clnt4xdr.c > index ed00bd2869a7..083a3b1bf288 100644 > --- a/fs/lockd/clnt4xdr.c > +++ b/fs/lockd/clnt4xdr.c > @@ -243,7 +243,7 @@ static void encode_nlm4_holder(struct xdr_stream *xdr, > u64 l_offset, l_len; > __be32 *p; > > - encode_bool(xdr, lock->fl.fl_type == F_RDLCK); > + encode_bool(xdr, lock->fl.fl_core.fl_type == F_RDLCK); > encode_int32(xdr, lock->svid); > encode_netobj(xdr, lock->oh.data, lock->oh.len); > > @@ -357,7 +357,7 @@ static void nlm4_xdr_enc_testargs(struct rpc_rqst *req, > const struct nlm_lock *lock = &args->lock; > > encode_cookie(xdr, &args->cookie); > - encode_bool(xdr, lock->fl.fl_type == F_WRLCK); > + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK); > encode_nlm4_lock(xdr, lock); > } > > @@ -380,7 +380,7 @@ static void nlm4_xdr_enc_lockargs(struct rpc_rqst *req, > > encode_cookie(xdr, &args->cookie); > encode_bool(xdr, args->block); > - encode_bool(xdr, lock->fl.fl_type == F_WRLCK); > + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK); > encode_nlm4_lock(xdr, lock); > encode_bool(xdr, args->reclaim); > encode_int32(xdr, args->state); > @@ -403,7 +403,7 @@ static void nlm4_xdr_enc_cancargs(struct rpc_rqst *req, > > encode_cookie(xdr, &args->cookie); > encode_bool(xdr, args->block); > - encode_bool(xdr, lock->fl.fl_type == F_WRLCK); > + encode_bool(xdr, lock->fl.fl_core.fl_type == F_W
Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs
On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote: > Long ago, file locks used to hang off of a singly-linked list in struct > inode. Because of this, when leases were added, they were added to the > same list and so they had to be tracked using the same sort of > structure. > > Several years ago, we added struct file_lock_context, which allowed us > to use separate lists to track different types of file locks. Given > that, leases no longer need to be tracked using struct file_lock. > > That said, a lot of the underlying infrastructure _is_ the same between > file leases and locks, so we can't completely separate everything. > > This patchset first splits a group of fields used by both file locks and > leases into a new struct file_lock_core, that is then embedded in struct > file_lock. Coccinelle was then used to convert a lot of the callers to > deal with the move, with the remaining 25% or so converted by hand. > > It then converts several internal functions in fs/locks.c to work > with struct file_lock_core. Lastly, struct file_lock is split into > struct file_lock and file_lease, and the lease-related APIs converted to > take struct file_lease. > > After the first few patches (which I left split up for easier review), > the set should be bisectable. I'll plan to squash the first few > together to make sure the resulting set is bisectable before merge. > > Finally, I left the coccinelle scripts I used in tree. I had heard it > was preferable to merge those along with the patches that they > generate, but I wasn't sure where they go. I can either move those to a > more appropriate location or we can just drop that commit if it's not > needed. > > Signed-off-by: Jeff Layton v2 looks nicer. I would add a few list handling primitives, as I see enough instances of list_for_each_entry, list_for_each_entry_safe, list_first_entry, and list_first_entry_or_null on fl_core.flc_list to make it worth having those. Also, there doesn't seem to be benefit for API consumers to have to understand the internal structure of struct file_lock/lease to reach into fl_core. Having accessor functions for common fields like fl_type and fl_flags could be cleaner. For the series: Reviewed-by: Chuck Lever For the nfsd and lockd parts: Acked-by: Chuck Lever > --- > Changes in v2: > - renamed file_lock_core fields to have "flc_" prefix > - used macros to more easily do the change piecemeal > - broke up patches into per-subsystem ones > - Link to v1: > https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org > > --- > Jeff Layton (41): > filelock: rename some fields in tracepoints > filelock: rename fl_pid variable in lock_get_status > dlm: rename fl_flags variable in dlm_posix_unlock > nfs: rename fl_flags variable in nfs4_proc_unlck > nfsd: rename fl_type and fl_flags variables in nfsd4_lock > lockd: rename fl_flags and fl_type variables in nlmclnt_lock > 9p: rename fl_type variable in v9fs_file_do_lock > afs: rename fl_type variable in afs_next_locker > filelock: drop the IS_* macros > filelock: split common fields into struct file_lock_core > filelock: add coccinelle scripts to move fields to struct file_lock_core > filelock: have fs/locks.c deal with file_lock_core directly > filelock: convert some internal functions to use file_lock_core instead > filelock: convert more internal functions to use file_lock_core > filelock: make posix_same_owner take file_lock_core pointers > filelock: convert posix_owner_key to take file_lock_core arg > filelock: make locks_{insert,delete}_global_locks take file_lock_core > arg > filelock: convert locks_{insert,delete}_global_blocked > filelock: make __locks_delete_block and __locks_wake_up_blocks take > file_lock_core > filelock: convert __locks_insert_block, conflict and deadlock checks to > use file_lock_core > filelock: convert fl_blocker to file_lock_core > filelock: clean up locks_delete_block internals > filelock: reorganize locks_delete_block and __locks_insert_block > filelock: make assign_type helper take a file_lock_core pointer > filelock: convert locks_wake_up_blocks to take a file_lock_core pointer > filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx > filelock: convert locks_translate_pid to take file_lock_core > filelock: convert seqfile handling to use file_lock_core > 9p: adapt to breakup of struct file_lock > afs: adapt to breakup of struct file_lock > ceph: adapt to breakup of struct file_lock > dlm: adapt to breakup of struct file_lock > gfs2: adapt
Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro
On Thu, Feb 22, 2024 at 12:28:28PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > I'm working on restructuring the __string* macros so that it doesn't need > to recalculate the string twice. That is, it will save it off when > processing __string() and the __assign_str() will not need to do the work > again as it currently does. > > Currently __string_len(item, src, len) doesn't actually use "src", but my > changes will require src to be correct as that is where the __assign_str() > will get its value from. > > The event class nfsd_clid_class has: > > __string_len(name, name, clp->cl_name.len) > > But the second "name" does not exist and causes my changes to fail to > build. That second parameter should be: clp->cl_name.data. > > Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for > nfsd_clid_class") > Signed-off-by: Steven Rostedt (Google) > --- > fs/nfsd/trace.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h > index d1e8cf079b0f..2cd57033791f 100644 > --- a/fs/nfsd/trace.h > +++ b/fs/nfsd/trace.h > @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class, > __array(unsigned char, addr, sizeof(struct sockaddr_in6)) > __field(unsigned long, flavor) > __array(unsigned char, verifier, NFS4_VERIFIER_SIZE) > - __string_len(name, name, clp->cl_name.len) > + __string_len(name, clp->cl_name.data, clp->cl_name.len) > ), > TP_fast_assign( > __entry->cl_boot = clp->cl_clientid.cl_boot; > -- > 2.43.0 > Do you want me to take this through the nfsd tree, or would you like an Ack from me so you can handle it as part of your clean up? Just in case: Acked-by: Chuck Lever -- Chuck Lever
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The rpcgss_context trace event acceptor field is a dynamically sized > string that records the "data" parameter. But this parameter is also > dependent on the "len" field to determine the size of the data. > > It needs to use __string_len() helper macro where the length can be passed > in. It also incorrectly uses strncpy() to save it instead of > __assign_str(). As these macros can change, it is not wise to open code > them in trace events. > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"), > __assign_str() can be used for both __string() and __string_len() fields. > Before that commit, __assign_str_len() is required to be used. This needs > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: > Rework __assign_str() and __string() to not duplicate getting the string") > is the commit that makes __string_str_len() obsolete). > > Cc: sta...@vger.kernel.org > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") > Signed-off-by: Steven Rostedt (Google) Acked-by: Chuck Lever > --- > include/trace/events/rpcgss.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h > index ba2d96a1bc2f..f50fcafc69de 100644 > --- a/include/trace/events/rpcgss.h > +++ b/include/trace/events/rpcgss.h > @@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context, > __field(unsigned int, timeout) > __field(u32, window_size) > __field(int, len) > - __string(acceptor, data) > + __string_len(acceptor, data, len) > ), > > TP_fast_assign( > @@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context, > __entry->timeout = timeout; > __entry->window_size = window_size; > __entry->len = len; > - strncpy(__get_str(acceptor), data, len); > + __assign_str(acceptor, data); > ), > > TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s", > -- > 2.43.0 > -- Chuck Lever
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, Apr 10, 2024 at 01:07:41PM -0400, Steven Rostedt wrote: > On Wed, 10 Apr 2024 12:38:53 -0400 > Chuck Lever wrote: > > > On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote: > > > From: "Steven Rostedt (Google)" > > > > > > The rpcgss_context trace event acceptor field is a dynamically sized > > > string that records the "data" parameter. But this parameter is also > > > dependent on the "len" field to determine the size of the data. > > > > > > It needs to use __string_len() helper macro where the length can be passed > > > in. It also incorrectly uses strncpy() to save it instead of > > > __assign_str(). As these macros can change, it is not wise to open code > > > them in trace events. > > > > > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"), > > > __assign_str() can be used for both __string() and __string_len() fields. > > > Before that commit, __assign_str_len() is required to be used. This needs > > > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: > > > Rework __assign_str() and __string() to not duplicate getting the string") > > > is the commit that makes __string_str_len() obsolete). > > > > > > Cc: sta...@vger.kernel.org > > > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") > > > Signed-off-by: Steven Rostedt (Google) > > > > Acked-by: Chuck Lever > > > > Thanks, but feel free to take it if you want. Unless you rather have me > take it through my tree? Well I guess I could test it for you. I'll take it for nfsd v6.9-rc. -- Chuck Lever