from:"Chuck Lever"

Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients

2012-09-14 Thread Chuck Lever

What happens if statd is restarted?

Sent from my iPhone

On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky  
wrote:

> This is a bug fix for https://bugzilla.redhat.com/show_bug.cgi?id=830862.
> 
> The problem is that with NFSv4 mount in container (with separated mount
> namesapce) and active lock on it, dying child reaped of this container will
> try to umount NFS and doing this will try to create RPC client to send
> unmonitor request to statd.
> But creation of RCP client requires valid current->nsproxy (for operation with
> utsname()) and during umount on child reaper exit it's equal to zero.
> 
> Proposed solution is to introduce refrence-counter per-net NSM client, which
> is created on fist monitor call and destroyed after the lst monitor call.
> 
> The following series implements...
> 
> ---
> 
> Stanislav Kinsbursky (3):
>  lockd: use rpc client's cl_nodename for id encoding
>  lockd: per-net NSM client creation and destruction helpers introduced
>  lockd: create and use per-net NSM RPC clients on MON/UNMON requests
> 
> 
> fs/lockd/mon.c   |   91 +++---
> fs/lockd/netns.h |4 ++
> fs/lockd/svc.c   |1 +
> 3 files changed, 77 insertions(+), 19 deletions(-)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] lockd: use rpc client's cl_nodename for id encoding

2012-09-14 Thread Chuck Lever

Hi-

On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky wrote:

> Taking hostname from uts namespace if not safe, because this cuold be
> performind during umount operation on child reaper death. And in this case
> current->nsproxy is NULL already.
> 
> Signed-off-by: Stanislav Kinsbursky 
> Cc: 
> ---
> fs/lockd/mon.c |   14 --
> 1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
> index 7ef14b3..c6186fb 100644
> --- a/fs/lockd/mon.c
> +++ b/fs/lockd/mon.c
> @@ -426,11 +426,12 @@ static void encode_mon_name(struct xdr_stream *xdr, 
> const struct nsm_args *argp)
>  * (via the NLMPROC_SM_NOTIFY call) that the state of host "mon_name"
>  * has changed.
>  */
> -static void encode_my_id(struct xdr_stream *xdr, const struct nsm_args *argp)
> +static void encode_my_id(struct xdr_stream *xdr, const struct nsm_args *argp,
> +  char *nodename)
> {
>   __be32 *p;
> 
> - encode_nsm_string(xdr, utsname()->nodename);
> + encode_nsm_string(xdr, nodename);
>   p = xdr_reserve_space(xdr, 4 + 4 + 4);
>   *p++ = cpu_to_be32(argp->prog);
>   *p++ = cpu_to_be32(argp->vers);
> @@ -441,10 +442,11 @@ static void encode_my_id(struct xdr_stream *xdr, const 
> struct nsm_args *argp)
>  * The "mon_id" argument specifies the non-private arguments
>  * of an NSMPROC_MON or NSMPROC_UNMON call.
>  */
> -static void encode_mon_id(struct xdr_stream *xdr, const struct nsm_args 
> *argp)
> +static void encode_mon_id(struct xdr_stream *xdr, const struct nsm_args 
> *argp,
> +   char *nodename)
> {
>   encode_mon_name(xdr, argp);
> - encode_my_id(xdr, argp);
> + encode_my_id(xdr, argp, nodename);
> }
> 
> /*
> @@ -463,14 +465,14 @@ static void encode_priv(struct xdr_stream *xdr, const 
> struct nsm_args *argp)
> static void nsm_xdr_enc_mon(struct rpc_rqst *req, struct xdr_stream *xdr,
>   const struct nsm_args *argp)
> {
> - encode_mon_id(xdr, argp);
> + encode_mon_id(xdr, argp, req->rq_task->tk_client->cl_nodename);

IMO you should get the cl_nodename in nsm_mon_unmon() from clnt->cl_nodename, 
and pass it in as part of *argp .  The choice of which nodename to use is 
clearly a decision for an "upper layer" not a choice for the XDR functions.

Long ago I had patches which fixed this layering violation for a very similar 
purpose as yours, but they were never applied.

>   encode_priv(xdr, argp);
> }
> 
> static void nsm_xdr_enc_unmon(struct rpc_rqst *req, struct xdr_stream *xdr,
> const struct nsm_args *argp)
> {
> - encode_mon_id(xdr, argp);
> + encode_mon_id(xdr, argp, req->rq_task->tk_client->cl_nodename);

Ditto.

> }
> 
> static int nsm_xdr_dec_stat_res(struct rpc_rqst *rqstp,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients

2012-09-14 Thread Chuck Lever


On Sep 14, 2012, at 1:38 PM, Myklebust, Trond wrote:

> On Fri, 2012-09-14 at 13:01 -0400, Chuck Lever wrote:
>> What happens if statd is restarted?
> 
> Nothing unusual. Why?

The NSM upcall transport is a potential application for TCP + softconn, now 
that a persistent rpc_clnt is used.  It just depends on what failure mode we'd 
like to optimize for.

> 
>> Sent from my iPhone
>> 
>> On Sep 14, 2012, at 10:25 AM, Stanislav Kinsbursky 
>>  wrote:
>> 
>>> This is a bug fix for https://bugzilla.redhat.com/show_bug.cgi?id=830862.
>>> 
>>> The problem is that with NFSv4 mount in container (with separated mount
>>> namesapce) and active lock on it, dying child reaped of this container will
>>> try to umount NFS and doing this will try to create RPC client to send
>>> unmonitor request to statd.
>>> But creation of RCP client requires valid current->nsproxy (for operation 
>>> with
>>> utsname()) and during umount on child reaper exit it's equal to zero.
>>> 
>>> Proposed solution is to introduce refrence-counter per-net NSM client, which
>>> is created on fist monitor call and destroyed after the lst monitor call.
>>> 
>>> The following series implements...
>>> 
>>> ---
>>> 
>>> Stanislav Kinsbursky (3):
>>> lockd: use rpc client's cl_nodename for id encoding
>>> lockd: per-net NSM client creation and destruction helpers introduced
>>> lockd: create and use per-net NSM RPC clients on MON/UNMON requests
>>> 
>>> 
>>> fs/lockd/mon.c   |   91 
>>> +++---
>>> fs/lockd/netns.h |4 ++
>>> fs/lockd/svc.c   |1 +
>>> 3 files changed, 77 insertions(+), 19 deletions(-)
>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> trond.mykleb...@netapp.com
> www.netapp.com

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 24/27] NFS: Use local caching [try #2]

2008-01-30 Thread Chuck Lever


Hi David-

On Jan 29, 2008, at 10:25 PM, David Howells wrote:

Chuck Lever <[EMAIL PROTECTED]> wrote:

This patch really ought to be broken into more manageable atomic
changes to make it easier to review, and to provide more fine-grained
explanation and rationalization for each specific change via
individual patch descriptions.


Hmmm  I broke the patch up as Trond stipulated - at least, I  
thought I

had.

In many ways this request doesn't make sense.  You can't do NFS  
caching
without all the appropriate bits, so logically they should be one  
patch.
Breaking it up won't help git-bisect since the option to enable all  
this is

the last (or nearly last) patch.


In addition to adding a new feature, you are changing existing code.   
If any one of the changes you made breaks existing behavior, having  
them all in small atomic patches makes it practical to bisect and  
find the problem.


In addition it makes it worlds easier to review by people who are not  
so familiar with your fscache implementation.  And smaller patches  
means the ratio of patch descriptions to code changes can be much  
higher.


It does make sense to introduce the files under fs/fsc in a single  
patch.  But when you are changing code that is already being used,  
more care needs to be taken.



This should no longer be necessary.  The latest mount.nfs subcommand
from nfs-utils supports text-based mounts when running on kernels
2.6.23 and later.


Okay.  I'll update my patches to reflect this.  Note, however, I've  
got

someone reporting a bug that seems to show otherwise.  I'll have to
investigate this more next week.


The very latest version (post 1.1.1) is required today for text-based  
NFS mounts.  (That is, the bleeding edge version you get by cloning  
the nfs-utils git repo).


And it only works on kernels later than 2.6.22 -- if that particular  
user is testing fscache on 2.6.22 or older, then only the legacy  
binary NFS mount system call API is supported.



Add comments like this in a separate clean up patch.





+/*
+ * Notification that a PTE pointing to an NFS page is about to be made
+ * writable, implying that someone is about to modify the page  
through a

+ * shared-writable mapping
+ */

What does that have to do with local disk caching?


+struct nfs_fh_auxdata {
+   struct timespec i_mtime;
+   struct timespec i_ctime;
+   loff_t  i_size;
+};


It might be useful to explain here why you need to supplement the
mtime, ctime, and size fields that already exist in an NFS inode.


Supplement?  I don't understand.


Why is it necessary to add additional mtime, ctime and size fields  
for NFS inodes?  Similar metadata is already stored in nfsi.


All I'm asking for is some documentation of what these fields do that  
the existing time stamps and size fields in nfsi don't.  Explain why  
the NFS fsc implementation needs this data structure.



+   key->port = clp->cl_addr.sin_port;


Not sure why you are using the server's port here.  In almost every
case the server side port number will be 2049, so it really doesn't
add any uniquification.


The reason lies is "in almost every case".  It's possible to  
configure it
such that a server is running two separate NFS servers on different  
ports.


We should explore whether it is typical or even possible that such a  
configuration exports the same file handles on different ports, and  
whether that really matters to the client.



I strongly recommend you use the existing IPv6 address conversion
macros for this instead of open-coding yet another way of mapping an
IPv4 address to an IPv6 address.

However, since AF_INET6 support is being introduced in the NFS client
in 2.6.24, I recommend you take a look at these source files after
Trond has pushed his NFS_ALL for 2.6.24.


I'll look at them.


I always do this:  I meant 2.6.25, not 2.6.24.

By the time you return, basic IPv6 support for NFSv4 should be in  
2.6.25-rc1's NFS client (not server).  Not that it is bug-free, but  
an implementation is now there.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Fix {clear,copy}_user_page() declarations in page.h

2008-02-14 Thread Chuck Lever

Clean up: eliminate some compiler noise on x86 when building with strict
warnings enabled, introduced by commit 345b904c.

In file included from include2/asm/thread_info_64.h:12,
 from include2/asm/thread_info.h:4,
 from
/home/cel/src/linux/nfs-2.6/include/linux/thread_info.h:35,
 from
/home/cel/src/linux/nfs-2.6/include/linux/preempt.h:9,
 from
/home/cel/src/linux/nfs-2.6/include/linux/spinlock.h:49,
 from /home/cel/src/linux/nfs-2.6/include/linux/mmzone.h:7,
 from /home/cel/src/linux/nfs-2.6/include/linux/gfp.h:4,
 from /home/cel/src/linux/nfs-2.6/include/linux/slab.h:14,
 from /home/cel/src/linux/nfs-2.6/fs/nfsd/nfs4acl.c:40:
include2/asm/page.h:55: warning: `inline' is not at beginning of
declaration
include2/asm/page.h:61: warning: `inline' is not at beginning of
declaration

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
Cc: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---

 include/asm-x86/page.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-x86/page.h b/include/asm-x86/page.h
index 1cb7c51..a05b289 100644
--- a/include/asm-x86/page.h
+++ b/include/asm-x86/page.h
@@ -52,13 +52,13 @@ extern int page_is_ram(unsigned long pagenr);
 
 struct page;
 
-static void inline clear_user_page(void *page, unsigned long vaddr,
+static inline void clear_user_page(void *page, unsigned long vaddr,
struct page *pg)
 {
clear_page(page);
 }
 
-static void inline copy_user_page(void *to, void *from, unsigned long vaddr,
+static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *topage)
 {
copy_page(to, from);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 3.10-rc3 NFSv3 mount issues

2013-05-30 Thread Chuck Lever


On May 30, 2013, at 4:19 PM, Jim Schutt  wrote:

> Hi,
> 
> I've been trying to test 3.10-rc3 on some diskless clients, and found
> that I can no longer mount my root file system via NFSv3.
> 
> I poked around looking at NFS changes for 3.10, and found these two
> commits:
> 
> d497ab9751 "NFSv3: match sec= flavor against server list"
> 4580a92d44 "NFS: Use server-recommended security flavor by default (NFSv3)"
> 
> If I revert both of these commits from 3.10-rc3, then my diskless
> client can mount its root file system.
> 
> The busybox mount command fails like this, when using 3.10-rc3:
> 
> / # mount  -t nfs -o ro,nolock,vers=3,proto=tcp 
> 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
> mount: mounting 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x on /mnt failed: 
> Invalid argument
> 
> The commit messages for both these commits seem to say that mounting
> with the "sys=sec" option should work, but unfortunately, my busybox doesn't
> seem to understand the "sec=" mount option:
> 
> / # mount  -t nfs -o ro,nolock,vers=3,proto=tcp,sec=sys 
> 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
> mount: invalid number 'sys'
> 
> My NFS server is based on RHEL6, and is not using any "sec=" option
> in its export for this file system.  I did try exporting with "sec=sys",
> but it didn't seem to make any difference either.
> 
> So far, this seems like a regression to me 
> Any ideas what I might be doing wrong?  How can I
> help make this work again?

3.10-rc3 appears to be missing the fix for this.  See:

  http://marc.info/?l=linux-nfs&m=136855668104598&w=2

Trond, can we get this applied?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY

2013-04-25 Thread Chuck Lever

On Apr 25, 2013, at 9:49 AM, bfie...@fieldses.org wrote:

> On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote:
>> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote:
>> 
>>> My position is that we simply have no idea what order of magnitude even
>>> delay should be.  And that in such a situation exponential backoff such
>>> as implemented in the synchronous case seems the reasonable default as
>>> it guarantees at worst doubling the delay while still bounding the
>>> long-term average frequency of retries.
>> 
>> So we start with a 15 second delay, and then go to 60 seconds?
> 
> I agree that a server should normally be doing the wait on its own if
> the wait would be on the order of an rpc round trip.
> 
> So I'd be inclined to start with a delay that was an order of magnitude
> or two more than a round trip.
> 
> And I'd expect NFS isn't common on networks with 1-second latencies.
> 
> So the 1/10 second we're using in the synchronous case sounds closer to
> the right ballpark to me.

The RPC layer already keeps RPC round trip statistics, so the client doesn't 
have to guess with a "one size fits all" number.

I'm all for keeping client recovery time short.  But after following this 
argument, I think 10xRTT is crazy short.  Aggressive retransmits can lead to 
data corruption, and RTT on a fast server is going to be on the order of a 
millisecond.  And what about RDMA, where RTT is about 20usecs? 

A better answer might be to start at one second then exponentially back off to 
the minimum of 0.25x the lease time and 0.25x the RPC retransmit time out.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY

2013-04-25 Thread Chuck Lever


On Apr 25, 2013, at 2:19 PM, "bfie...@fieldses.org"  
wrote:

> On Thu, Apr 25, 2013 at 02:10:36PM +, Myklebust, Trond wrote:
>> On Thu, 2013-04-25 at 09:49 -0400, bfie...@fieldses.org wrote:
>>> On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote:
>>>> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote:
>>>> 
>>>>> My position is that we simply have no idea what order of magnitude even
>>>>> delay should be.  And that in such a situation exponential backoff such
>>>>> as implemented in the synchronous case seems the reasonable default as
>>>>> it guarantees at worst doubling the delay while still bounding the
>>>>> long-term average frequency of retries.
>>>> 
>>>> So we start with a 15 second delay, and then go to 60 seconds?
>>> 
>>> I agree that a server should normally be doing the wait on its own if
>>> the wait would be on the order of an rpc round trip.
>>> 
>>> So I'd be inclined to start with a delay that was an order of magnitude
>>> or two more than a round trip.
>>> 
>>> And I'd expect NFS isn't common on networks with 1-second latencies.
>>> 
>>> So the 1/10 second we're using in the synchronous case sounds closer to
>>> the right ballpark to me.
>> 
>> OK, then. Now all I need is actual motivation for changing the existing
>> code other than handwaving arguments about "polling is better than flat
>> waits".
>> What actual use cases are impacting us now, other than the AIX design
>> decision to force CLOSE to retry at least once before succeeding?
> 
> Nah, I've got nothing, and I agree that the AIX problem is there bug.
> 
> Just for fun I looked at re-checked the Linux server cases.  As far as I
> can tell they are:
> 
>   - delegations: returned immediately on detection of any
> conflict.  The current behavior in the sync case looks
> reasonable to me.
>   - allocation failures: not really sure it's the best error, but
> it seems to be all the protocol offers.  We probably don't
> care much what the client does in this case.
>   - some rare cases that would probably indicate bugs (e.g.,
> attempting to destroy a client while other rpc's from that
> client are running.)  Again we don't care what the client does
> here.
>   - the 4.1 slot-inuse case.
> 
> We also by default map four errors (ETIMEDOUT, EAGAIN, EWOULDBLOCK,
> ENOMEM) to delay.  I thought I remembered one of those being used by
> some HFS system, but can't actually find an example now.  A quick grep
> doesn't show anything interesting.

It's worth mentioning that servers that have frozen state (say, in preparation 
for Transparent State Migration) may use NFS4ERR_DELAY to prevent clients from 
modifying open or lock state until that state has transitioned to a destination 
server.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] NFSv4: Use exponential backoff delay for NFS4_ERRDELAY

2013-04-25 Thread Chuck Lever


On Apr 25, 2013, at 2:46 PM, "bfie...@fieldses.org"  
wrote:

> On Thu, Apr 25, 2013 at 02:40:11PM -0400, Chuck Lever wrote:
>> 
>> On Apr 25, 2013, at 2:19 PM, "bfie...@fieldses.org"  
>> wrote:
>> 
>>> On Thu, Apr 25, 2013 at 02:10:36PM +, Myklebust, Trond wrote:
>>>> On Thu, 2013-04-25 at 09:49 -0400, bfie...@fieldses.org wrote:
>>>>> On Thu, Apr 25, 2013 at 01:30:58PM +, Myklebust, Trond wrote:
>>>>>> On Thu, 2013-04-25 at 09:29 -0400, bfie...@fieldses.org wrote:
>>>>>> 
>>>>>>> My position is that we simply have no idea what order of magnitude even
>>>>>>> delay should be.  And that in such a situation exponential backoff such
>>>>>>> as implemented in the synchronous case seems the reasonable default as
>>>>>>> it guarantees at worst doubling the delay while still bounding the
>>>>>>> long-term average frequency of retries.
>>>>>> 
>>>>>> So we start with a 15 second delay, and then go to 60 seconds?
>>>>> 
>>>>> I agree that a server should normally be doing the wait on its own if
>>>>> the wait would be on the order of an rpc round trip.
>>>>> 
>>>>> So I'd be inclined to start with a delay that was an order of magnitude
>>>>> or two more than a round trip.
>>>>> 
>>>>> And I'd expect NFS isn't common on networks with 1-second latencies.
>>>>> 
>>>>> So the 1/10 second we're using in the synchronous case sounds closer to
>>>>> the right ballpark to me.
>>>> 
>>>> OK, then. Now all I need is actual motivation for changing the existing
>>>> code other than handwaving arguments about "polling is better than flat
>>>> waits".
>>>> What actual use cases are impacting us now, other than the AIX design
>>>> decision to force CLOSE to retry at least once before succeeding?
>>> 
>>> Nah, I've got nothing, and I agree that the AIX problem is there bug.
>>> 
>>> Just for fun I looked at re-checked the Linux server cases.  As far as I
>>> can tell they are:
>>> 
>>> - delegations: returned immediately on detection of any
>>>   conflict.  The current behavior in the sync case looks
>>>   reasonable to me.
>>> - allocation failures: not really sure it's the best error, but
>>>   it seems to be all the protocol offers.  We probably don't
>>>   care much what the client does in this case.
>>> - some rare cases that would probably indicate bugs (e.g.,
>>>   attempting to destroy a client while other rpc's from that
>>>   client are running.)  Again we don't care what the client does
>>>   here.
>>> - the 4.1 slot-inuse case.
>>> 
>>> We also by default map four errors (ETIMEDOUT, EAGAIN, EWOULDBLOCK,
>>> ENOMEM) to delay.  I thought I remembered one of those being used by
>>> some HFS system, but can't actually find an example now.  A quick grep
>>> doesn't show anything interesting.
>> 
>> It's worth mentioning that servers that have frozen state (say, in 
>> preparation for Transparent State Migration) may use NFS4ERR_DELAY to 
>> prevent clients from modifying open or lock state until that state has 
>> transitioned to a destination server.
> 
> I thought they'd decided they'll be forced to find a different way to do
> that?
> 
> (The issue being that it only works if you're using 4.1, and if the
> session state itself isn't part of the state to be transferred.
> Otherwise you're forced to modify the state anyway since NFS4ERR_DELAY
> is seqid-modifying.)

The answer is not to return NFS4ERR_DELAY on seqid-modifying operations.

The source server can return NFS4ERR_DELAY to the client's migration recovery 
operations (the GETATTR(fs_locations) request) for example.

Or, the server could return it on the initial PUTFH operation in a compound 
containing seqid-modifying operations.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 28, 2013, at 9:24 PM, Stephen Rothwell  wrote:

> Hi J.,
> 
> After merging the nfsd tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc':
> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of 
> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration]
> 
> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server
> RPCGSS authentication").  gss_mech_get_by_OID() made static to
> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC:
> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs
> tree that you did not merge).
> 
> I don't know how to fix this, so I have used the nfsd tree from
> next-20130426 for today.

Bruce, it might make sense for me to submit the three server-side RPC GSS 
patches, and then you can rebase the gssproxy work on top of those.  Let me 
know how you would like to proceed.


-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields"  wrote:

> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote:
>> 
>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell  wrote:
>> 
>>> Hi J.,
>>> 
>>> After merging the nfsd tree, today's linux-next build (powerpc
>>> ppc64_defconfig) failed like this:
>>> 
>>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc':
>>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of 
>>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration]
>>> 
>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server
>>> RPCGSS authentication").  gss_mech_get_by_OID() made static to
>>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC:
>>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs
>>> tree that you did not merge).
>>> 
>>> I don't know how to fix this, so I have used the nfsd tree from
>>> next-20130426 for today.
>> 
>> Bruce, it might make sense for me to submit the three server-side RPC GSS 
>> patches, and then you can rebase the gssproxy work on top of those.  Let me 
>> know how you would like to proceed.
> 
> I'm happy to take those patches whenever you consider them ready.  Would
> that fix the problem?

Someone would need to modify the gssproxy patches to use the new interfaces.

> Also: it looks like 030d794bf498 "SUNRPC: Introduce
> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his
> nfs-for-next.  I'm not sure what that means--is it safe to rebase on top
> of *that*?

That doesn't seem right to me.

> I was hoping I could consider the gss-proxy work committed at this point
> and pile any fixes on top, but... whatever works for you guys, I guess.
> 
> --b.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 12:29 PM, Simo Sorce  wrote:

> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote:
>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields"  wrote:
>> 
>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote:
>>>> 
>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell  
>>>> wrote:
>>>> 
>>>>> Hi J.,
>>>>> 
>>>>> After merging the nfsd tree, today's linux-next build (powerpc
>>>>> ppc64_defconfig) failed like this:
>>>>> 
>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc':
>>>>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of 
>>>>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration]
>>>>> 
>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server
>>>>> RPCGSS authentication").  gss_mech_get_by_OID() made static to
>>>>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC:
>>>>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs
>>>>> tree that you did not merge).
>>>>> 
>>>>> I don't know how to fix this, so I have used the nfsd tree from
>>>>> next-20130426 for today.
>>>> 
>>>> Bruce, it might make sense for me to submit the three server-side RPC GSS 
>>>> patches, and then you can rebase the gssproxy work on top of those.  Let 
>>>> me know how you would like to proceed.
>>> 
>>> I'm happy to take those patches whenever you consider them ready.  Would
>>> that fix the problem?
>> 
>> Someone would need to modify the gssproxy patches to use the new interfaces.
>> 
>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce
>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his
>>> nfs-for-next.  I'm not sure what that means--is it safe to rebase on top
>>> of *that*?
>> 
>> That doesn't seem right to me.
> 
> GSS-Proxy patches are 1 year old and we've been delayed once already to
> accomodate the containers work, maybe it's time for your patches to be
> rebased on gssproxy ones ? :-)

Don't sweat it.  IMO this is a simple merge problem, unlike the containers work.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 12:21 PM, Trond Myklebust  
wrote:

> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote:
>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields"  wrote:
>> 
>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote:
>>>> 
>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell  
>>>> wrote:
>>>> 
>>>>> Hi J.,
>>>>> 
>>>>> After merging the nfsd tree, today's linux-next build (powerpc
>>>>> ppc64_defconfig) failed like this:
>>>>> 
>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function 'gss_proxy_save_rsc':
>>>>> net/sunrpc/auth_gss/svcauth_gss.c:1182:3: error: implicit declaration of 
>>>>> function 'gss_mech_get_by_OID' [-Werror=implicit-function-declaration]
>>>>> 
>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server
>>>>> RPCGSS authentication").  gss_mech_get_by_OID() made static to
>>>>> net/sunrpc/auth_gss/gss_mech_switch.c by commit 9568c5e9a61d ("SUNRPC:
>>>>> Introduce rpcauth_get_pseudoflavor()") in the nfs tree (part of the nfs
>>>>> tree that you did not merge).
>>>>> 
>>>>> I don't know how to fix this, so I have used the nfsd tree from
>>>>> next-20130426 for today.
>>>> 
>>>> Bruce, it might make sense for me to submit the three server-side RPC GSS 
>>>> patches, and then you can rebase the gssproxy work on top of those.  Let 
>>>> me know how you would like to proceed.
>>> 
>>> I'm happy to take those patches whenever you consider them ready.  Would
>>> that fix the problem?
>> 
>> Someone would need to modify the gssproxy patches to use the new interfaces.
>> 
>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce
>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his
>>> nfs-for-next.  I'm not sure what that means--is it safe to rebase on top
>>> of *that*?
>> 
>> That doesn't seem right to me.
> 
> I've now pulled the rpcsec_gss changes into the nfs-for-next. The main
> reason why they were not pulled in earlier was due to uncertainty what
> to do about the increase in "AUTH_GSS upcall timed out." syslog
> warnings.

Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and 
rpcauth_get_pseudoflavor() APIs, which are replacements for direct calls into 
the GSS mech switch.  These APIs are a little more generic, and more robust in 
the face of unloaded GSS kernel modules.

Instead of gss_mech_get_by_OID(), I suspect you want 
rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields"  wrote:

> On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote:
>> 
>> On Apr 29, 2013, at 12:21 PM, Trond Myklebust
>>  wrote:
>> 
>>> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote:
>>>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields"
>>>>  wrote:
>>>> 
>>>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote:
>>>>>> 
>>>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell
>>>>>>  wrote:
>>>>>> 
>>>>>>> Hi J.,
>>>>>>> 
>>>>>>> After merging the nfsd tree, today's linux-next build (powerpc
>>>>>>> ppc64_defconfig) failed like this:
>>>>>>> 
>>>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function
>>>>>>> 'gss_proxy_save_rsc': net/sunrpc/auth_gss/svcauth_gss.c:1182:3:
>>>>>>> error: implicit declaration of function 'gss_mech_get_by_OID'
>>>>>>> [-Werror=implicit-function-declaration]
>>>>>>> 
>>>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for
>>>>>>> server RPCGSS authentication").  gss_mech_get_by_OID() made
>>>>>>> static to net/sunrpc/auth_gss/gss_mech_switch.c by commit
>>>>>>> 9568c5e9a61d ("SUNRPC: Introduce rpcauth_get_pseudoflavor()") in
>>>>>>> the nfs tree (part of the nfs tree that you did not merge).
>>>>>>> 
>>>>>>> I don't know how to fix this, so I have used the nfsd tree from
>>>>>>> next-20130426 for today.
>>>>>> 
>>>>>> Bruce, it might make sense for me to submit the three server-side
>>>>>> RPC GSS patches, and then you can rebase the gssproxy work on top
>>>>>> of those.  Let me know how you would like to proceed.
>>>>> 
>>>>> I'm happy to take those patches whenever you consider them ready.
>>>>> Would that fix the problem?
>>>> 
>>>> Someone would need to modify the gssproxy patches to use the new
>>>> interfaces.
>>>> 
>>>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce
>>>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his
>>>>> nfs-for-next.  I'm not sure what that means--is it safe to rebase
>>>>> on top of *that*?
>>>> 
>>>> That doesn't seem right to me.
>>> 
>>> I've now pulled the rpcsec_gss changes into the nfs-for-next. The
>>> main reason why they were not pulled in earlier was due to
>>> uncertainty what to do about the increase in "AUTH_GSS upcall timed
>>> out." syslog warnings.
>> 
>> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and
>> rpcauth_get_pseudoflavor() APIs, which are replacements for direct
>> calls into the GSS mech switch.  These APIs are a little more generic,
>> and more robust in the face of unloaded GSS kernel modules.
>> 
>> Instead of gss_mech_get_by_OID(), I suspect you want
>> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code.
> 
> It's doing
> 
>   status = -EOPNOTSUPP;
>   gm = gss_mech_get_by_OID(&ud->mech_oid);
>   if (!gm)
>   goto out;
>   status = -EINVAL;
>   status = gss_import_sec_context(ud->out_handle.data,
>   ud->out_handle.len,
>   gm, &rsci.mechctx,
>   &expiry, GFP_KERNEL);
>   if (status)
>   goto out;
> 
> So we need a way to get from an OID and some mechanism-specific data to
> a context.
> 
> Looks to me like we just want to re-export gss_mech_get_by_OID().

The reason for the new wrappers is to load the kernel modules properly before 
trying to discover the OID -> mechanism mapping.

Where are you calling it from?  If it's from outside of the GSS module, how do 
you guarantee the rpc_gss_auth module is loaded?  What if the GSS mechanism for 
that OID isn't loaded?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 1:59 PM, "J. Bruce Fields"  wrote:

> On Mon, Apr 29, 2013 at 01:47:16PM -0400, Chuck Lever wrote:
>> 
>> On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields"  wrote:
>> 
>>> On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote:
>>>> 
>>>> On Apr 29, 2013, at 12:21 PM, Trond Myklebust
>>>>  wrote:
>>>> 
>>>>> On Mon, 2013-04-29 at 12:05 -0400, Chuck Lever wrote:
>>>>>> On Apr 29, 2013, at 11:45 AM, "J. Bruce Fields"
>>>>>>  wrote:
>>>>>> 
>>>>>>> On Mon, Apr 29, 2013 at 10:53:37AM -0400, Chuck Lever wrote:
>>>>>>>> 
>>>>>>>> On Apr 28, 2013, at 9:24 PM, Stephen Rothwell
>>>>>>>>  wrote:
>>>>>>>> 
>>>>>>>>> Hi J.,
>>>>>>>>> 
>>>>>>>>> After merging the nfsd tree, today's linux-next build (powerpc
>>>>>>>>> ppc64_defconfig) failed like this:
>>>>>>>>> 
>>>>>>>>> net/sunrpc/auth_gss/svcauth_gss.c: In function
>>>>>>>>> 'gss_proxy_save_rsc': net/sunrpc/auth_gss/svcauth_gss.c:1182:3:
>>>>>>>>> error: implicit declaration of function 'gss_mech_get_by_OID'
>>>>>>>>> [-Werror=implicit-function-declaration]
>>>>>>>>> 
>>>>>>>>> Caused byc ommit 030d794bf498 ("SUNRPC: Use gssproxy upcall for
>>>>>>>>> server RPCGSS authentication").  gss_mech_get_by_OID() made
>>>>>>>>> static to net/sunrpc/auth_gss/gss_mech_switch.c by commit
>>>>>>>>> 9568c5e9a61d ("SUNRPC: Introduce rpcauth_get_pseudoflavor()") in
>>>>>>>>> the nfs tree (part of the nfs tree that you did not merge).
>>>>>>>>> 
>>>>>>>>> I don't know how to fix this, so I have used the nfsd tree from
>>>>>>>>> next-20130426 for today.
>>>>>>>> 
>>>>>>>> Bruce, it might make sense for me to submit the three server-side
>>>>>>>> RPC GSS patches, and then you can rebase the gssproxy work on top
>>>>>>>> of those.  Let me know how you would like to proceed.
>>>>>>> 
>>>>>>> I'm happy to take those patches whenever you consider them ready.
>>>>>>> Would that fix the problem?
>>>>>> 
>>>>>> Someone would need to modify the gssproxy patches to use the new
>>>>>> interfaces.
>>>>>> 
>>>>>>> Also: it looks like 030d794bf498 "SUNRPC: Introduce
>>>>>>> rpcauth_get_pseudoflavor()" is in Trond's linux-next, but not his
>>>>>>> nfs-for-next.  I'm not sure what that means--is it safe to rebase
>>>>>>> on top of *that*?
>>>>>> 
>>>>>> That doesn't seem right to me.
>>>>> 
>>>>> I've now pulled the rpcsec_gss changes into the nfs-for-next. The
>>>>> main reason why they were not pulled in earlier was due to
>>>>> uncertainty what to do about the increase in "AUTH_GSS upcall timed
>>>>> out." syslog warnings.
>>>> 
>>>> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and
>>>> rpcauth_get_pseudoflavor() APIs, which are replacements for direct
>>>> calls into the GSS mech switch.  These APIs are a little more generic,
>>>> and more robust in the face of unloaded GSS kernel modules.
>>>> 
>>>> Instead of gss_mech_get_by_OID(), I suspect you want
>>>> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy code.
>>> 
>>> It's doing
>>> 
>>> status = -EOPNOTSUPP;
>>> gm = gss_mech_get_by_OID(&ud->mech_oid);
>>> if (!gm)
>>> goto out;
>>> status = -EINVAL;
>>> status = gss_import_sec_context(ud->out_handle.data,
>>> ud->out_handle.len,
>>> gm, &rsci.mechctx,
>>> &expiry, GFP_KERNEL);
>>> if (status)
>>> goto out;
>>> 
>>> So we need a way to get from an OID and some mechanism-specific data to
>>> a context.
>>> 
>>> Looks to me like we just want to re-export gss_mech_get_by_OID().
>> 
>> The reason for the new wrappers is to load the kernel modules properly 
>> before trying to discover the OID -> mechanism mapping.
>> 
>> Where are you calling it from?  If it's from outside of the GSS module, how 
>> do you guarantee the rpc_gss_auth module is loaded?  What if the GSS 
>> mechanism for that OID isn't loaded?
> 
> Sorry, I should have said just "remove static from", not
> "re-export"--it's in the same module.  So there should be no problem
> here, right?

OK, gotcha.  Architecturally outside of the mech switch I'd like to see OIDs 
passed around embedded in GSS tuples rather than by themselves.

An alternative would be to use gss_mech_get_by_name(), which is already 
visible, loads GSS mechanism modules automatically, and returns struct 
gss_api_mech *.  For NFS, we should already have a clean mapping of mechanism 
name to pseudoflavor to GSS tuple.  Looks like rsc_parse() already uses this 
API.

Do you have gssproxy patches published in a git tree somewhere I could review?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the nfsd tree

2013-04-29 Thread Chuck Lever


On Apr 29, 2013, at 2:57 PM, J. Bruce Fields  wrote:

> On Mon, Apr 29, 2013 at 02:30:33PM -0400, Chuck Lever wrote:
>> 
>> On Apr 29, 2013, at 1:59 PM, "J. Bruce Fields" 
>> wrote:
>> 
>>> On Mon, Apr 29, 2013 at 01:47:16PM -0400, Chuck Lever wrote:
>>>> 
>>>> On Apr 29, 2013, at 1:38 PM, "J. Bruce Fields"
>>>>  wrote:
>>>> 
>>>>> On Mon, Apr 29, 2013 at 01:04:01PM -0400, Chuck Lever wrote:
>>>>>> Trond's nfs-for-next now has the new rpcauth_get_gssinfo() and
>>>>>> rpcauth_get_pseudoflavor() APIs, which are replacements for
>>>>>> direct calls into the GSS mech switch.  These APIs are a little
>>>>>> more generic, and more robust in the face of unloaded GSS kernel
>>>>>> modules.
>>>>>> 
>>>>>> Instead of gss_mech_get_by_OID(), I suspect you want
>>>>>> rpcauth_get_pseudoflavor(), but I haven't looked at the gssproxy
>>>>>> code.
>>>>> 
>>>>> It's doing
>>>>> 
>>>>>   status = -EOPNOTSUPP; gm =
>>>>>   gss_mech_get_by_OID(&ud->mech_oid); if (!gm) goto out;
>>>>>   status = -EINVAL; status =
>>>>>   gss_import_sec_context(ud->out_handle.data,
>>>>>   ud->out_handle.len, gm, &rsci.mechctx, &expiry,
>>>>>   GFP_KERNEL); if (status) goto out;
>>>>> 
>>>>> So we need a way to get from an OID and some mechanism-specific
>>>>> data to a context.
>>>>> 
>>>>> Looks to me like we just want to re-export gss_mech_get_by_OID().
>>>> 
>>>> The reason for the new wrappers is to load the kernel modules
>>>> properly before trying to discover the OID -> mechanism mapping.
>>>> 
>>>> Where are you calling it from?  If it's from outside of the GSS
>>>> module, how do you guarantee the rpc_gss_auth module is loaded?
>>>> What if the GSS mechanism for that OID isn't loaded?
>>> 
>>> Sorry, I should have said just "remove static from", not
>>> "re-export"--it's in the same module.  So there should be no problem
>>> here, right?
>> 
>> OK, gotcha.  Architecturally outside of the mech switch I'd like to
>> see OIDs passed around embedded in GSS tuples rather than by
>> themselves.
> 
> I'm not sure what you mean.  When I accept a gss context I don't yet
> know what service or qop it's going to be used with, I only know the
> mechanism OID.

RPC server GSS support didn't need the gss_mech_get_by_OID() interface before 
gssproxy, so I'm trying to understand why it is needed now.

But it sounds like you do need it now, so go ahead and make 
gss_mech_get_by_OID() global within the AUTH_GSS module. 

> 
>> An alternative would be to use gss_mech_get_by_name(), which is
>> already visible, loads GSS mechanism modules automatically, and
>> returns struct gss_api_mech *.  For NFS, we should already have a
>> clean mapping of mechanism name to pseudoflavor to GSS tuple.  Looks
>> like rsc_parse() already uses this API.
> 
> We don't have a name here, only an OID.
> 
>> Do you have gssproxy patches published in a git tree somewhere I could
>> review?
> 
> It's in my for-3.10 branch.
> 
> Which is more or less what I plan to submit for 3.10, so I'd prefer not
> to rebase it at this point.
> 
> It looks like just removing "static" would resolve the immediate
> conflict, is that right?  And then I'd happily help deal with cleaning
> up the API as followup work.



-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] lockd: use per-net refrence-counted NSM clients

2012-09-17 Thread Chuck Lever

On Sep 17, 2012, at 6:49 AM, Stanislav Kinsbursky wrote:

> 14.09.2012 23:10, Chuck Lever пишет:
>> 
>> On Sep 14, 2012, at 1:38 PM, Myklebust, Trond wrote:
>> 
>>> On Fri, 2012-09-14 at 13:01 -0400, Chuck Lever wrote:
>>>> What happens if statd is restarted?
>>> 
>>> Nothing unusual. Why?
>> 
>> The NSM upcall transport is a potential application for TCP + softconn, now 
>> that a persistent rpc_clnt is used.  It just depends on what failure mode 
>> we'd like to optimize for.
>> 
> 
> I don't understand, where the problem is.
> Could you be more specific, please?

I'm suggesting an enhancement.

The change is to use TCP for the NSM upcall transport, and set 
RPC_TASK_SOFTCONN on the individual RPCs.  The advantage of this is that the 
kernel could discover when statd is not running and fail the upcall 
immediately, rather than waiting possibly many seconds for each upcall RPC to 
time out.

The client already has a check in the mount.nfs command to see that statd is 
running, likely to avoid this lengthly timeout.  Since the client already has 
long-standing logic to avoid it, I think the benefit would be mostly on the 
server side.

But this change can be done at some later point.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9

2008-02-06 Thread Chuck Lever


Hi Gianluca-

On Feb 6, 2008, at 1:25 PM, Gianluca Alberici wrote:

Hello all,

Thanks to Chuck's help i finally decided to proceed to a git bisect  
and found the bad patch. Is there anybody that has an idea why it  
breaks userspace nfs servers as we have seen ? Sorry for emailing  
directly Chuck Lever and Andrew Morton but i really wanted to thank  
Chuck for his precious help and thought that /akpm/ having signed  
this commit maybe he's going to figure out whats wrong easily


The commit you found is a plausible source of the trouble (based on  
our current theory about the problem).


What isn't quite clear to me is whether this commit causes your user- 
space server to start failing suddenly, or it causes the client to  
start sending the special non-standard time stamps in the SETATTR  
request.  My guess is the latter, but I want to confirm this guess  
against reality  :-)


Are you running the client and server concurrently on the same  
system?  If so, it would be helpful if you could run this test with a  
constant kernel version on one side while varying it on the other.   
If client and server are already on different systems, can you tell  
us which system and which kernel combinations caused the failure?


A matrix of combinations might be:

1. server kernel is before 1c710c89, client kernel is before 1c710c89
2. server kernel is before 1c710c89, client kernel is after 1c710c89
3. server kernel is after 1c710c89, client kernel is before 1c710c89
4. server kernel is after 1c710c89, client kernel is after 1c710c89

Thanks.


This is what i finally get from git:

1c710c896eb461895d3c399e15bb5f20b39c9073 is first bad commit
commit 1c710c896eb461895d3c399e15bb5f20b39c9073
Author: Ulrich Drepper <[EMAIL PROTECTED]>
Date:   Tue May 8 00:33:25 2007 -0700

   utimensat implementation

   Implement utimensat(2) which is an extension to futimesat(2) in  
that it


   a) supports nano-second resolution for the timestamps
   b) allows to selectively ignore the atime/mtime value
   c) allows to selectively use the current time for either atime  
or mtime
   d) supports changing the atime/mtime of a symlink itself along  
the lines

  of the BSD lutimes(3) functions

[...]

   [EMAIL PROTECTED]: add missing i386 syscall table entry]
   Signed-off-by: Ulrich Drepper <[EMAIL PROTECTED]>
   Cc: Alexey Dobriyan <[EMAIL PROTECTED]>
   Cc: Michael Kerrisk <[EMAIL PROTECTED]>
   Cc: <[EMAIL PROTECTED]>
   Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
   Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

:04 04 3bedbc7fd919ba167b8e5f208a630261570853bb  
927002a9423dcb51ba4f7bee53e60cdca6c1df43 M  arch
:04 04 fd688c5b534efd3111cbf1e1095d6ff631738325  
3d0fbf20fb3da1cb380c92f5b2b39815897376d3 M  fs
:04 04 bfb1a907a9a842db4fa3543e12a8381d4e11b1eb  
9c1d99324db12e066c0d17870fe48457809ad43b M  include


Thanks in advance, regards,

Gianluca


Hi Gianluca-

On Jan 30, 2008, at 7:40 AM, Gianluca Alberici wrote:


Hello again everybody

Here follows the testbench:

- I got two mirrors, same machine, same disk etc...chaged  
hostname, IP, and on the second i have recompiled kernel.

- First: 2.6.21.7 on debian sarge
- Second: 2.6.22 same system.
- Onto both i got nfs-user-server and cfsd last versions
- The export file is the same (localhost /opt/nfs (rw, async),  
stripping off the async option does not changes anything)

- Mount options are exactly the same.

The problem arises in the very same manner with both nfs and cfsd:

NFS:setattr {
...
...
RPC:call_decode {
return 22;
}
...
return 22;
}



Again, there is nothing wrong with the RPC client or call_decode.  
The *server* is returning NFSERR_INVAL (22) to a SETATTR request;  
the RPC client is simply passing that along to the NFS client, as  
it is designed to do.



I have tried these kernels:

2.6.16.11 works
2.6.20 works
2.6.21 works
2.6.21.7 works
2.6.22 doesnt work (contiguous to previous version)
2.6.23 doesnt work (same behavior as previous)
2.6.23.9 doesnt work (as above)
2.6.24rc7 doesnt work (as above)

I would really like to do more, client or server side, if you ave  
any suggestions.
Can we find out what is the change (doesnt matter if it is a buf  
or bug fix) that caused this problem ?



The goal here is to identify the kernel change between 2.6.21 and  
2.6.22 that makes the client generate SETATTR requests the user- 
space server chokes on. It may be a change in the NFS client, or  
it could be somewhere else in the file system stack, like the VFS.


The usual procedure is to use "git bisect". It does a binary  
search on the kernel patches between the working kernel version  
and the kernel version that is known not to work. It works like this:


1. You clone a linux kernel git repository (if you don't have a git
repository already)

2. You tell git bisect which kernel version is working, and which  
isn't.
git bisect t

[PATCH 00/11] RFC: Kconfig changes for NFS_FS and NFSD

2008-02-08 Thread Chuck Lever

Scratching my head to find an appropriate mailing list for review of some
proposed NFS-related Kconfig changes.  These have already seen some light
on linux-nfs, but Herbert Xu suggested lkml for review by Kconfig experts.

In addition to updating the help text, I've tried to untangle the entry
dependencies somewhat.  I'm not really a Kconfig expert, so I'd appreciate
some review of these patches by those more knowlegeable than me.

-- 
corporate:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 09/11] NFSD: Move "select NFSD_V2_ACL if NFSD_V3_ACL"

2008-02-08 Thread Chuck Lever

Clean up: since NFSD_V2_ACL is a boolean, it can be selected safely
under the NFSD_V3_ACL entry (also a boolean).

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 32c84d9..21362e9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1657,7 +1657,6 @@ config NFSD
select LOCKD
select SUNRPC
select EXPORTFS
-   select NFSD_V2_ACL if NFSD_V3_ACL
select NFS_ACL_SUPPORT if NFSD_V2_ACL
select CRYPTO_MD5 if NFSD_V4
select CRYPTO if NFSD_V4
@@ -1699,6 +1698,7 @@ config NFSD_V3
 config NFSD_V3_ACL
bool "NFS server support for the NFSv3 ACL protocol extension"
depends on NFSD_V3
+   select NFSD_V2_ACL
help
  Solaris NFS servers support an auxiliary NFSv3 ACL protocol that
  never became an official part of the NFS version 3 protocol.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 11/11] NLM: LOCKD fails to load if CONFIG_SYSCTL is not set

2008-02-08 Thread Chuck Lever

By the way, we've got another config-related nit here:

http://bugzilla.linux-nfs.org/show_bug.cgi?id=156

You can build lockd without CONFIG_SYSCTL set, but then the module will
fail to load.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |1 +
 fs/lockd/svc.c |4 
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index f799964..b7aeed4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1734,6 +1734,7 @@ config NFSD_V4
 
 config LOCKD
tristate
+   select SYSCTL
 
 config LOCKD_V4
bool
diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0822646..adb8e2c 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -516,8 +516,12 @@ module_param(nsm_use_hostnames, bool, 0644);
 
 static int __init init_nlm(void)
 {
+#if CONFIG_SYSCTL
nlm_sysctl_table = register_sysctl_table(nlm_sysctl_root);
return nlm_sysctl_table ? 0 : -ENOMEM;
+#else
+   return 0;
+#endif
 }
 
 static void __exit exit_nlm(void)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 06/11] NFSD: Remove NFSv4 dependency on NFSv3

2008-02-08 Thread Chuck Lever


On Feb 8, 2008, at 3:20 PM, Sam Ravnborg wrote:

On Fri, Feb 08, 2008 at 12:52:08PM -0500, Chuck Lever wrote:
Clean up: Because NFSD_V4 "depends on" NFSD_V3, it appears as a  
child of

the NFSD_V3 menu entry, and is not visible if NFSD_V3 is unselected.

Replace the dependency on NFSD_V3 with a "select NFSD_V3".  This  
makes
NFSD_V4 look and work just like NFS_V3, while ensuring that  
NFSD_V3 is

enabled if NFSD_V4 is.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 9ad62a9..4c16789 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1723,7 +1723,8 @@ config NFSD_V3_ACL

 config NFSD_V4
bool "NFS server support for NFS version 4 (EXPERIMENTAL)"
-   depends on NFSD && NFSD_V3 && EXPERIMENTAL
+   depends on NFSD && EXPERIMENTAL
+   select NFSD_V3
select RPCSEC_GSS_KRB5
help
  This option enables support in your system's NFS server for


This use of select is questionable. In general it is bad to select
a symbol with dependencies.
In this case the dependencies of NFSD_V3 are duplicated for NFSD_V4
so we will not se erratic configurations but do you remember to
update NFSD_V4 when you add a depends on NFSD_V3?

But I see no other clean way to do it rithg now.


This patch is entirely optional, and I can just drop it for now.

Thanks for the review.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 06/11] NFSD: Remove NFSv4 dependency on NFSv3

2008-02-08 Thread Chuck Lever

Clean up: Because NFSD_V4 "depends on" NFSD_V3, it appears as a child of
the NFSD_V3 menu entry, and is not visible if NFSD_V3 is unselected.

Replace the dependency on NFSD_V3 with a "select NFSD_V3".  This makes
NFSD_V4 look and work just like NFS_V3, while ensuring that NFSD_V3 is
enabled if NFSD_V4 is.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 9ad62a9..4c16789 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1723,7 +1723,8 @@ config NFSD_V3_ACL
 
 config NFSD_V4
bool "NFS server support for NFS version 4 (EXPERIMENTAL)"
-   depends on NFSD && NFSD_V3 && EXPERIMENTAL
+   depends on NFSD && EXPERIMENTAL
+   select NFSD_V3
select RPCSEC_GSS_KRB5
help
  This option enables support in your system's NFS server for

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 03/11] NFS: Update help text for CONFIG_NFS_FS

2008-02-08 Thread Chuck Lever

Clean up: refresh the help text for Kconfig items related to the NFS
client.  Remove obsolete URLs, and make the language consistent among
the options.

Also move the ROOT_NFS config option next to the options related to the
NFS client.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |  115 ++--
 1 files changed, 57 insertions(+), 58 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 9c2c24b..9427c73 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1517,10 +1517,6 @@ config UFS_FS
   The recently released UFS2 variant (used in FreeBSD 5.x) is
   READ-ONLY supported.
 
- If you only intend to mount files from some other Unix over the
- network using NFS, you don't need the UFS file system support (but
- you need NFS file system support obviously).
-
  Note that this option is generally not needed for floppies, since a
  good portable way to transport files and directories between unixes
  (and even other operating systems) is given by the tar program ("man
@@ -1560,6 +1556,7 @@ menuconfig NETWORK_FILESYSTEMS
  Say Y here to get to see options for network filesystems and
  filesystem-related networking code, such as NFS daemon and
  RPCSEC security modules.
+
  This option alone does not add any kernel code.
 
  If you say N, all options in this submenu will be skipped and
@@ -1568,76 +1565,92 @@ menuconfig NETWORK_FILESYSTEMS
 if NETWORK_FILESYSTEMS
 
 config NFS_FS
-   tristate "NFS file system support"
+   tristate "NFS client support"
depends on INET
select LOCKD
select SUNRPC
select NFS_ACL_SUPPORT if NFS_V3_ACL
help
- If you are connected to some other (usually local) Unix computer
- (using SLIP, PLIP, PPP or Ethernet) and want to mount files residing
- on that computer (the NFS server) using the Network File Sharing
- protocol, say Y. "Mounting files" means that the client can access
- the files with usual UNIX commands as if they were sitting on the
- client's hard disk. For this to work, the server must run the
- programs nfsd and mountd (but does not need to have NFS file system
- support enabled in its kernel). NFS is explained in the Network
- Administrator's Guide, available from
- <http://www.tldp.org/docs.html#guide>, on its man page: "man
- nfs", and in the NFS-HOWTO.
+ Choose Y here if you want to access files residing on other
+ computers using Sun's Network File System protocol.  To compile
+ this file system support as a module, choose M here: the module
+ will be called nfs.
 
- A superior but less widely used alternative to NFS is provided by
- the Coda file system; see "Coda file system support" below.
+ To mount file systems exported by NFS servers, you also need to
+ install the user space mount.nfs command which can be found in
+ the Linux nfs-utils package, available from http://linux-nfs.org/.
+ Information about using the mount command is available in the
+ mount(8) man page.  More detail about the Linux NFS client
+ implementation is available via the nfs(5) man page.
 
- If you say Y here, you should have said Y to TCP/IP networking also.
- This option would enlarge your kernel by about 27 KB.
+ Below you can choose which versions of the NFS protocol are
+ available in the kernel to mount NFS servers.  Support for NFS
+ version 2 (RFC 1094) is always available when NFS_FS is selected.
 
- To compile this file system support as a module, choose M here: the
- module will be called nfs.
-
- If you are configuring a diskless machine which will mount its root
- file system over NFS at boot time, say Y here and to "Kernel
- level IP autoconfiguration" above and to "Root file system on NFS"
- below. You cannot compile this driver as a module in this case.
- There are two packages designed for booting diskless machines over
- the net: netboot, available from
- <http://ftp1.sourceforge.net/netboot/>, and Etherboot,
- available from <http://ftp1.sourceforge.net/etherboot/>.
+ To configure a system which mounts its root file system via NFS
+ at boot time, say Y here, select "Kernel level IP
+ autoconfiguration" in the NETWORK menu, and select "Root file
+ system on NFS" below.  You cannot compile this file system as a
+ module in this case.
 
- If you don't know what all this is about, say N.
+ If unsure, say N.
 
 config NFS_V3
-   bool "Provide NFSv3 client su

[PATCH 07/11] NFSD: Use "depends on" for PROC_FS dependency

2008-02-08 Thread Chuck Lever

Recently a reverse dependency was added to fs/Kconfig to ensure that
PROC_FS was enabled if NFSD_V4 was enabled.

There is a guideline in Documentation/kbuild/kconfig-language.txt that
states "In general use select only for non-visible symbols (no prompts
anywhere) and for symbols with no dependencies."

A quick grep around other Kconfig files reveals that no entry currently
uses "select PROC_FS" -- every one uses "depends on".  Thus CONFIG_NFSD_V4
should use "depends on PROC_FS" as well.

For SUNRPC_GSS, it's a little more complex.  Other entries can "select"
SUNRPC_GSS, as it is non-visible.  However, the guideline suggests an
entry can't "select" it if it has a dependency (such as PROC_FS).
So, we add forward dependencies on PROC_FS to RPCSEC_GSS_FOO instead.

XXX: Both CONFIG_NFSV4 and CONFIG_NFSD_V4 select RPCSEC_GSS_KRB5, which is
visible, which kconfig-language.txt also frowns upon.  The intent was to
enable at least one GSS mechanism if V4 was enabled.  Perhaps we should
make SUNRPC_GSS visible, and make the NFSv4 options visible only if
SUNRPC_GSS is enabled.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 4c16789..5f00ee7 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1662,8 +1662,6 @@ config NFSD
select CRYPTO_MD5 if NFSD_V4
select CRYPTO if NFSD_V4
select FS_POSIX_ACL if NFSD_V4
-   select PROC_FS if NFSD_V4
-   select PROC_FS if SUNRPC_GSS
help
  Choose Y here if you want to allow other computers to access
  files residing on this system using Sun's Network File System
@@ -1723,7 +1721,7 @@ config NFSD_V3_ACL
 
 config NFSD_V4
bool "NFS server support for NFS version 4 (EXPERIMENTAL)"
-   depends on NFSD && EXPERIMENTAL
+   depends on NFSD && PROC_FS && EXPERIMENTAL
select NFSD_V3
select RPCSEC_GSS_KRB5
help
@@ -1796,7 +1794,7 @@ config SUNRPC_BIND34
 
 config RPCSEC_GSS_KRB5
tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
-   depends on SUNRPC && EXPERIMENTAL
+   depends on SUNRPC && PROC_FS && EXPERIMENTAL
select SUNRPC_GSS
select CRYPTO
select CRYPTO_MD5
@@ -1815,7 +1813,7 @@ config RPCSEC_GSS_KRB5
 
 config RPCSEC_GSS_SPKM3
tristate "Secure RPC: SPKM3 mechanism (EXPERIMENTAL)"
-   depends on SUNRPC && EXPERIMENTAL
+   depends on SUNRPC && PROC_FS && EXPERIMENTAL
select SUNRPC_GSS
select CRYPTO
select CRYPTO_MD5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 08/11] NFSD: Move "select FS_POSIX_ACL if NFSD_V4"

2008-02-08 Thread Chuck Lever

Clean up: since FS_POSIX_ACL is a non-visible boolean entry, it can be
selected safely under the NFSD_V4 entry (also a boolean).

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 5f00ee7..32c84d9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -411,7 +411,7 @@ config JFS_STATISTICS
  to be made available to the user in the /proc/fs/jfs/ directory.
 
 config FS_POSIX_ACL
-# Posix ACL utility routines (for now, only ext2/ext3/jfs/reiserfs)
+# Posix ACL utility routines (for now, only ext2/ext3/jfs/reiserfs/nfs)
 #
 # NOTE: you can implement Posix ACLs without these helpers (XFS does).
 #  Never use this symbol for ifdefs.
@@ -1661,7 +1661,6 @@ config NFSD
select NFS_ACL_SUPPORT if NFSD_V2_ACL
select CRYPTO_MD5 if NFSD_V4
select CRYPTO if NFSD_V4
-   select FS_POSIX_ACL if NFSD_V4
help
  Choose Y here if you want to allow other computers to access
  files residing on this system using Sun's Network File System
@@ -1723,6 +1722,7 @@ config NFSD_V4
bool "NFS server support for NFS version 4 (EXPERIMENTAL)"
depends on NFSD && PROC_FS && EXPERIMENTAL
select NFSD_V3
+   select FS_POSIX_ACL
select RPCSEC_GSS_KRB5
help
  This option enables support in your system's NFS server for

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 01/11] NFS: Always enable NFS direct I/O

2008-02-08 Thread Chuck Lever

Since O_DIRECT is a standard feature that is enabled in most distros,
eliminate the CONFIG_NFS_DIRECTIO build option, and change the
fs/nfs/Makefile to always build in the NFS direct I/O engine.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig|   24 
 fs/nfs/Makefile   |3 +--
 fs/nfs/file.c |6 --
 fs/nfs/internal.h |5 -
 4 files changed, 1 insertions(+), 37 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 987b5d7..4965fd8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1638,30 +1638,6 @@ config NFS_V4
 
  If unsure, say N.
 
-config NFS_DIRECTIO
-   bool "Allow direct I/O on NFS files"
-   depends on NFS_FS
-   help
- This option enables applications to perform uncached I/O on files
- in NFS file systems using the O_DIRECT open() flag.  When O_DIRECT
- is set for a file, its data is not cached in the system's page
- cache.  Data is moved to and from user-level application buffers
- directly.  Unlike local disk-based file systems, NFS O_DIRECT has
- no alignment restrictions.
-
- Unless your program is designed to use O_DIRECT properly, you are
- much better off allowing the NFS client to manage data caching for
- you.  Misusing O_DIRECT can cause poor server performance or network
- storms.  This kernel build option defaults OFF to avoid exposing
- system administrators unwittingly to a potentially hazardous
- feature.
-
- For more details on NFS O_DIRECT, see fs/nfs/direct.c.
-
- If unsure, say N.  This reduces the size of the NFS client, and
- causes open() to return EINVAL if a file residing in NFS is
- opened with the O_DIRECT flag.
-
 config NFSD
tristate "NFS server support"
depends on INET
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index f54c4d7..344bccb 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -8,7 +8,7 @@ EXTRA_CFLAGS = -W -Wall -Wextra -Wsign-compare \
 obj-$(CONFIG_NFS_FS) += nfs.o
 
 nfs-y  := client.o dir.o file.o getroot.o inode.o super.o 
nfs2xdr.o \
-  pagelist.o proc.o read.o symlink.o unlink.o \
+  direct.o pagelist.o proc.o read.o symlink.o unlink.o 
\
   write.o namespace.o mount_clnt.o
 nfs-$(CONFIG_ROOT_NFS) += nfsroot.o
 nfs-$(CONFIG_NFS_V3)   += nfs3proc.o nfs3xdr.o
@@ -17,5 +17,4 @@ nfs-$(CONFIG_NFS_V4)  += nfs4proc.o nfs4xdr.o nfs4state.o 
nfs4renewd.o \
   delegation.o idmap.o \
   callback.o callback_xdr.o callback_proc.o \
   nfs4namespace.o
-nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index ef57a5a..10e8b80 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -234,10 +234,8 @@ nfs_file_read(struct kiocb *iocb, const struct iovec *iov,
ssize_t result;
size_t count = iov_length(iov, nr_segs);
 
-#ifdef CONFIG_NFS_DIRECTIO
if (iocb->ki_filp->f_flags & O_DIRECT)
return nfs_file_direct_read(iocb, iov, nr_segs, pos);
-#endif
 
dfprintk(VFS, "nfs: read(%s/%s, [EMAIL PROTECTED])\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
@@ -383,9 +381,7 @@ const struct address_space_operations nfs_file_aops = {
.write_end = nfs_write_end,
.invalidatepage = nfs_invalidate_page,
.releasepage = nfs_release_page,
-#ifdef CONFIG_NFS_DIRECTIO
.direct_IO = nfs_direct_IO,
-#endif
.launder_page = nfs_launder_page,
 };
 
@@ -443,10 +439,8 @@ static ssize_t nfs_file_write(struct kiocb *iocb, const 
struct iovec *iov,
ssize_t result;
size_t count = iov_length(iov, nr_segs);
 
-#ifdef CONFIG_NFS_DIRECTIO
if (iocb->ki_filp->f_flags & O_DIRECT)
return nfs_file_direct_write(iocb, iov, nr_segs, pos);
-#endif
 
dfprintk(VFS, "nfs: write(%s/%s(%ld), [EMAIL PROTECTED])\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index fbe5ba4..13f43d6 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -113,13 +113,8 @@ extern void nfs_destroy_readpagecache(void);
 extern int __init nfs_init_writepagecache(void);
 extern void nfs_destroy_writepagecache(void);
 
-#ifdef CONFIG_NFS_DIRECTIO
 extern int __init nfs_init_directcache(void);
 extern void nfs_destroy_directcache(void);
-#else
-#define nfs_init_directcache() (0)
-#define nfs_destroy_directcache() do {} while(0)
-#endif
 
 /* nfs2xdr.c */
 extern int nfs_stat_to_errno(int);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 04/11] NFSD: Update help text for CONFIG_NFSD

2008-02-08 Thread Chuck Lever

Clean up: refresh the help text for Kconfig items related to the NFS
server.  Remove obsolete URLs, and make the language consistent among
the options.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |   76 +---
 1 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 9427c73..53e21eb 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1665,56 +1665,74 @@ config NFSD
select PROC_FS if NFSD_V4
select PROC_FS if SUNRPC_GSS
help
- If you want your Linux box to act as an NFS *server*, so that other
- computers on your local network which support NFS can access certain
- directories on your box transparently, you have two options: you can
- use the self-contained user space program nfsd, in which case you
- should say N here, or you can say Y and use the kernel based NFS
- server. The advantage of the kernel based solution is that it is
- faster.
+ Choose Y here if you want to allow other computers to access
+ files residing on this system using Sun's Network File System
+ protocol.  To compile the NFS server support as a module,
+ choose M here: the module will be called nfsd.
 
- In either case, you will need support software; the respective
- locations are given in the file  in the
- NFS section.
+ You may choose to use a user-space NFS server instead, in which
+ case you can choose N.
 
- If you say Y here, you will get support for version 2 of the NFS
- protocol (NFSv2). If you also want NFSv3, say Y to the next question
- as well.
+ To export local file systems using NFS, you also need to install
+ user space programs which can be found in the Linux nfs-utils
+ package, available from http://linux-nfs.org/.  More detail about
+ the Linux NFS server implementation is available via the
+ exports(5) man page.
 
- Please read the NFS-HOWTO, available from
- <http://www.tldp.org/docs.html#howto>.
+ Below you can choose which versions of the NFS protocol are
+ available to clients mounting the NFS server on this system.
+ Support for NFS version 2 (RFC 1094) is always available when
+ CONFIG_NFSD is selected.
 
- To compile the NFS server support as a module, choose M here: the
- module will be called nfsd.  If unsure, say N.
+ If unsure, say N.
 
 config NFSD_V2_ACL
bool
depends on NFSD
 
 config NFSD_V3
-   bool "Provide NFSv3 server support"
+   bool "NFS server support for NFS version 3"
depends on NFSD
help
- If you would like to include the NFSv3 server as well as the NFSv2
- server, say Y here.  If unsure, say Y.
+ This option enables support in your system's NFS server for
+ version 3 of the NFS protocol (RFC 1813).
+
+ If unsure, say Y.
 
 config NFSD_V3_ACL
-   bool "Provide server support for the NFSv3 ACL protocol extension"
+   bool "NFS server support for the NFSv3 ACL protocol extension"
depends on NFSD_V3
help
- Implement the NFSv3 ACL protocol extension for manipulating POSIX
- Access Control Lists on exported file systems. NFS clients should
- be compiled with the NFSv3 ACL protocol extension; see the
- CONFIG_NFS_V3_ACL option.  If unsure, say N.
+ Solaris NFS servers support an auxiliary NFSv3 ACL protocol that
+ never became an official part of the NFS version 3 protocol.
+ This protocol extension allows applications on NFS clients to
+ manipulate POSIX Access Control Lists on files residing on NFS
+ servers.  NFS servers enforce POSIX ACLs on local files whether
+ this protocol is available or not.
+
+ This option enables support in your system's NFS server for the
+ NFSv3 ACL protocol extension allowing NFS clients to manipulate
+ POSIX ACLs on files exported by your system's NFS server.  NFS
+ clients which support the Solaris NFSv3 ACL protocol can then
+ access and modify ACLs on your NFS server.
+
+ To store ACLs on your NFS server, you also need to enable ACL-
+ related CONFIG options for your local file systems of choice.
+
+ If unsure, say N.
 
 config NFSD_V4
-   bool "Provide NFSv4 server support (EXPERIMENTAL)"
+   bool "NFS server support for NFS version 4 (EXPERIMENTAL)"
depends on NFSD && NFSD_V3 && EXPERIMENTAL
select RPCSEC_GSS_KRB5
help
- If you would like to include the NFSv4 server as well as the NFSv2
- and NFSv3 servers, say Y here.  This feature is experimental, and
- shoul

[PATCH 10/11] NFSD: Remove redundant "select" clauses in fs/Kconfig

2008-02-08 Thread Chuck Lever

As far as I can tell, selecting the CRYPTO and CRYPTO_MD5 entries under
CONFIG_NFSD is redundant, since CONFIG_NFSD_V4 already selects
RPCSEC_GSS_KRB5, which selects these entries.

Testing with "make menuconfig" shows that the entries under CRYPTO still
properly reflect "Y" or "M" based on the setting of CONFIG_NFSD after this
change is applied.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 21362e9..f799964 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1658,8 +1658,6 @@ config NFSD
select SUNRPC
select EXPORTFS
select NFS_ACL_SUPPORT if NFSD_V2_ACL
-   select CRYPTO_MD5 if NFSD_V4
-   select CRYPTO if NFSD_V4
help
  Choose Y here if you want to allow other computers to access
  files residing on this system using Sun's Network File System

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 05/11] SUNRPC: Update help Kconfig text

2008-02-08 Thread Chuck Lever

Clean up: refresh the help text for Kconfig items related to the sunrpc
module.  Remove obsolete URLs, and make the language consistent among
the options.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig |   44 +---
 1 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 53e21eb..9ad62a9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1766,17 +1766,29 @@ config SUNRPC_XPRT_RDMA
depends on SUNRPC && INFINIBAND && EXPERIMENTAL
default m
help
- Adds a client RPC transport for supporting kernel NFS over RDMA
- mounts, including Infiniband and iWARP. Experimental.
+ This option enables the RPC client transport capability that
+ supports NFS over RDMA in the kernel's NFS client.  To compile
+ RPC client RDMA transport support as a module, choose M here:
+ the module will be called xprtrdma.
+
+ If unsure, say N.
 
 config SUNRPC_BIND34
bool "Support for rpcbind versions 3 & 4 (EXPERIMENTAL)"
depends on SUNRPC && EXPERIMENTAL
help
- Provides kernel support for querying rpcbind servers via versions 3
- and 4 of the rpcbind protocol.  The kernel automatically falls back
- to version 2 if a remote rpcbind service does not support versions
- 3 or 4.
+ RPC requests over IPv6 networks require support for larger
+ addresses when performing an RPC bind.  Sun added support for
+ IPv6 addressing by creating two new versions of the rpcbind
+ protocol (RFC 1833).
+
+ This option enables support in the kernel RPC client for
+ querying rpcbind servers via versions 3 and 4 of the rpcbind
+ protocol.  The kernel automatically falls back to version 2
+ if a remote rpcbind service does not support versions 3 or 4.
+ By themselves, these new versions do not provide support for
+ RPC over IPv6, but the new protocol versions are necessary to
+ support it.
 
  If unsure, say N to get traditional behavior (version 2 rpcbind
  requests only).
@@ -1790,12 +1802,13 @@ config RPCSEC_GSS_KRB5
select CRYPTO_DES
select CRYPTO_CBC
help
- Provides for secure RPC calls by means of a gss-api
- mechanism based on Kerberos V5. This is required for
- NFSv4.
+ Choose Y here to enable Secure RPC using the Kerberos version 5
+ GSS-API mechanism (RFC 1964).
 
- Note: Requires an auxiliary userspace daemon which may be found on
-   http://www.citi.umich.edu/projects/nfsv4/
+ Secure RPC calls with Kerberos require an auxiliary user-space
+ daemon which may be found in the Linux nfs-utils package
+ available from http://linux-nfs.org/.  In addition, user-space
+ Kerberos support should be installed.
 
  If unsure, say N.
 
@@ -1809,11 +1822,12 @@ config RPCSEC_GSS_SPKM3
select CRYPTO_CAST5
select CRYPTO_CBC
help
- Provides for secure RPC calls by means of a gss-api
- mechanism based on the SPKM3 public-key mechanism.
+ Choose Y here to enable Secure RPC using the SPKM3 public key
+ GSS-API mechansim (RFC 2025).
 
- Note: Requires an auxiliary userspace daemon which may be found on
-   http://www.citi.umich.edu/projects/nfsv4/
+ Secure RPC calls with SPKM3 require an auxiliary userspace
+ daemon which may be found in the Linux nfs-utils package
+ available from http://linux-nfs.org/.
 
  If unsure, say N.
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 02/11] NFSD: Remove NFSD_TCP kernel build option

2008-02-08 Thread Chuck Lever

Likewise, distros usually leave CONFIG_NFSD_TCP enabled.

TCP support in the Linux NFS server is stable enough that we can leave it
on always.  CONFIG_NFSD_TCP adds about 10 lines of code, and defaults to
"Y" anyway.

A run-time switch might be more appropriate if people feel they would like
to disable NFSD's TCP support.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/Kconfig   |   10 --
 fs/nfsd/nfssvc.c |2 --
 2 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 4965fd8..9c2c24b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1646,7 +1646,6 @@ config NFSD
select EXPORTFS
select NFSD_V2_ACL if NFSD_V3_ACL
select NFS_ACL_SUPPORT if NFSD_V2_ACL
-   select NFSD_TCP if NFSD_V4
select CRYPTO_MD5 if NFSD_V4
select CRYPTO if NFSD_V4
select FS_POSIX_ACL if NFSD_V4
@@ -1705,15 +1704,6 @@ config NFSD_V4
  should only be used if you are interested in helping to test NFSv4.
  If unsure, say N.
 
-config NFSD_TCP
-   bool "Provide NFS server over TCP support"
-   depends on NFSD
-   default y
-   help
- If you want your NFS server to support TCP connections, say Y here.
- TCP connections usually perform better than the default UDP when
- the network is lossy or congested.  If unsure, say Y.
-
 config ROOT_NFS
bool "Root file system on NFS"
depends on NFS_FS=y && IP_PNP
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9647b0f..941041f 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -244,7 +244,6 @@ static int nfsd_init_socks(int port)
if (error < 0)
return error;
 
-#ifdef CONFIG_NFSD_TCP
error = lockd_up(IPPROTO_TCP);
if (error >= 0) {
error = svc_create_xprt(nfsd_serv, "tcp", port,
@@ -254,7 +253,6 @@ static int nfsd_init_socks(int port)
}
if (error < 0)
return error;
-#endif
return 0;
 }
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Fix queued SIGIO

2000-09-19 Thread Chuck Lever

On Tue, 19 Sep 2000, Julian Anastasov wrote:
> On Mon, 18 Sep 2000, Andi Kleen wrote:
> 
> > >   SI_SIGIO is not generated from kernel. The same is for the
> > > other SI_ consts < 0 not defined with __SI_CODE.
> >
> > Ok, then you have already broken binary compatibility between 2.2 and 2.4
> 
>   Looking in the old kernels, it seems the binary compatibility
> was broken in 2.3.21 when si_code returns POLL_xxx events just like
> mentioned in "The Single UNIX ® Specification, Version 2",
> xsh/signal.h.html and not SI_SIGIO.
> 
>   SI_SIGIO in si_code for 2.2 does not return any information
> about the events. I even see that Redhat maintains patch against 2.2
> to backport the POLL_xxx events from 2.3. Not sure after the changes
> in 2.4.0-test1. Anyway, 2.2 looks unusable for me and I don't see other
> way this problem to be fixed. The binary compatibility is impossible
> to exist. The applications can support the both ways: the old SI_SIGIO
> and the new POLL_xxx events (recompiled after test1) in si_code.

because the 2.2 kernels are already "broken" in this regard, i can't see
how binary compatibility between 2.2 and 2.4 could even be an
issue.  applications can't use this stuff in 2.2 without at least the
RedHat patch.

unless there's a problem implementing a glibc that runs on both 2.2 and
2.4, perhaps this should be revisited.

also, the test at issue here (from line 363 of kernel/signal.c):

/* If this was sent by a rt mechanism, try again.  */
if (info->si_code != SI_USER) {
ret = -EAGAIN;
goto out;
}

has always been unclear as to its intent.  it seems like there is
overloading going on here -- if the real intent is to prevent users
without credentials from sending "kernel-only" signals, then that should
be the logic here.

>   The next step is somebody to implement event merging and to
> allow receiving of many events with one call. For the next kernels.

we just published a paper in the ALS proceedings describing our
implementation of a new system call similar to sigtimedwait() that
collects many events at once.

- Chuck Lever
--
corporate:  <[EMAIL PROTECTED]>
personal:   <[EMAIL PROTECTED]>

The Linux Scalability project:
http://www.citi.umich.edu/projects/linux-scalability/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: How to manage shared persistent local caching (FS-Cache) with NFS?

2007-12-06 Thread Chuck Lever


Hi David-

On Dec 5, 2007, at 8:22 PM, David Howells wrote:

Chuck Lever <[EMAIL PROTECTED]> wrote:

I don't see how persistent local caching means we can no longer  
ignore (a)

and (b) above.  Can you amplify this a bit?


How about I put it like this.  There are two principal problems to  
be dealt

with:

 (1) Reconnection.

 Imagine that the administrator requests a mount that uses part  
of a cache.
 The client machine is at some time later rebooted and the  
administrator

 requests the same mount again.

 Since the cache is meant to be persistent, the administrator  
is at liberty
 to expect that the second mount immediately begins to use the  
data that

 the first mount left in the cache.

 For this to occur, the second mount has to be able to  
determine which part
 of the cache the first mount was using and request to use the  
same piece

 of cache.

 To aid with this, FS-Cache has the concept of a 'key'.  Each  
object in the
 cache is addressed by a unique key.  NFS currently builds a  
key to the
 cache object for a file from: "NFS", the server IP address,  
port and NFS

 version and the file handle for that file.


Why not use the fsid as well?  The NFS client already uses the fsid  
to detect when it is crossing a server-side mount point.  Fsids are  
supposed to be stable over server reboots (although sometimes they  
aren't, it could be made a condition of supporting FS-cache on clients).


I also note the inclusion of server IP address in the key.  For multi- 
homed servers, you have the same unavoidable cache aliasing issues if  
the client mounts the same server and export via different server  
network interfaces.



 (2) Cache coherency.

 Imagine that the administrator requests a mount that uses part  
of a
 cache.  The administrator then makes a second mount that  
overlaps the
 first, maybe because it's a different part of the same server  
export or

 maybe it uses the same part, but with different parameters.

 Imagine further that a particular server file is accessible  
through both
 mountpoints.  This means that the kernel, and therefore the  
user, has two

 views of the one file.

 If the kernel maintains these two views of the files as  
totally separate
 copies, then coherency is mostly not a kernel problem, it's an  
application

 problem - as it is now.

 However, if these two views are shared at any level - such as  
if they

 share an FS-Cache cache object - then coherency can be a problem.


Is it a problem because, if there are multiple copies of the same  
remote file in its cache, then FS-cache doesn't know, upon  
reconnection, which item to match against a particular remote file?


I think that's actually going to be a fairly typical situation --  
you'll have conditions where some cache items will become orphaned,  
for example, so you're going to have to deal with that ambiguity as a  
part of normal operation.


For example, if the FS-caching client is disconnected or powered off  
when a remote rename occurs that replaces a file it has cached, the  
client will have an orphaned item left over.  Maybe this use case is  
only a garbage collection problem.


 The two simplest solutions to the coherency problem are (a) to  
enforce
 sharing at all levels (superblocks, inodes, cache objects),  
(b) to enforce
 non-sharing.  In-between states are possible, but are much  
trickier and

 more complex.

 Note that cache coherency management can't be entirely  
avoided: upon
 reconnection a cache object has to be checked against the  
server to see

 whether it's still valid.


How do you propose to do that?

First, clearly, FS-cache has to know that it's the same object, so  
fsid and filehandle have to be the same (you refer to that as the  
"reconnection problem", but it may generally be a "cache aliasing  
problem").


I assume FS-cache has a record of the state of the remote file when  
it was last connected -- mtime, ctime, size, change attribute (I'll  
refer to this as the "reconciliation problem")?  Does it, for  
instance, checksum both the cache item and the remote file to detect  
data differences?


You have the same problem here as we have with file system search  
tools such as Beagle.  Reconciling file contents after a reconnection  
event may be too expensive to consider for NFS, especially if a file  
is terabytes in size.



Note that both these problems only really exist because the cache is
persistent between mounts.  If it were volatile between mounts,  
then (1) would not exist, and (2) can be ignored as it is now.


Do you allow administrators to select whether the FS-cache is  
persistent?  Or is it always unconditionally persistent?


An adequate first pass at FS-cache can be done without guarantee

Re: How to manage shared persistent local caching (FS-Cache) with NFS?

2007-12-07 Thread Chuck Lever


Hi David-

[ Some history snipped... ]

On Dec 6, 2007, at 3:00 PM, David Howells wrote:

Chuck Lever <[EMAIL PROTECTED]> wrote:
Is it a problem because, if there are multiple copies of the same  
remote file
in its cache, then FS-cache doesn't know, upon  reconnection,  
which item to

match against a particular remote file?


There are multiple copies of the same remote file that are  
described by the
same remote parameters.  Same IP address, same port, same NFS  
version, same

FSID, same FH.  The difference may be a local connection parameter.


Why not encode the local mounted-on directory in the key?  A  
cryptographic hash of the directory's absolute pathname would be  
bounded in size.  And the mounted-on directory is usually persistent  
across client reboots.


That way you can use the directory name hash to distinguish the  
different views of the same remote object.



An adequate first pass at FS-cache can be done without guaranteeing
persistence.


True.  But it's not particularly interesting to me in such a case.


There are a host of other issues that need exposure -- steady-state
performance;


Meaning what?


Meaning your cache is at quota all the time, and to continue  
operation it must eject items constantly.


This is a scenario where it pays to cache the read-mostly items on  
disk, and leave the frequently changing items in memory.


The economics of disk caches is different than memory caches.  Disk  
caches are much larger and cheaper, but their performance tanks when  
they have to track frequently changing files.  Memory caches are  
smaller, but tracking frequently changing data is only a little more  
expensive than tracking data that doesn't change often.


I have been measuring the performance improvement and degradation  
numbers, and
I can say that if you've one client and one server, the server has  
all the
files in memory, and there's gigabit ethernet between them, an on- 
disk cache

really doesn't help.

Basically, the consideration of whether to use a cache is a  
compromise between

a host of factors.


cache garbage collection


Done.


and reclamation;


Done.


cache item aliasing;


Partly done.

whether all files on a mount point should be cached on disk, or  
some in

memory and some on disk;


I've thought about that, but no-one seems particularly interested in
discussing it.


I think it's key to preventing FS-cache from making performance worse  
in many common scenarios.


And what would it harm if FS-cache decides that certain items in  
its cache
have become ambiguous or otherwise unusable after a  reconnection  
event, thus

it reclaims them instead of re-using them?


It depends.

At some point I'd like to make disconnected operation possible, and  
that means
storing data to be written back in the cache.  You can't  
necessarily just

chuck that away.


Disconnected operation for NFS is fraught with challenges.  Access to  
data on servers is traditionally gated by the client's IP address,  
for example.  The client may disconnect from the network, then  
reconnect using a different address where suddenly all of its  
accesses are rebuffed.


NFS servers, not clients, traditionally determine the file's mtime  
and ctime, and its file handle.  So file updates and file creation  
become problematic.  The client has to reconcile the server's file  
handle, for files created offline, with its own when reconnecting.


And, for disconnected operation, the cache is required to contain  
every item from the remote.  You can't just drop items from the cache  
because they are inconvenient.


I can't just say: "Well, it'll oops if you configure your NFS  
shares like

that,
so don't.  It's not worth me implementing round it.".


What causes that instability?  Why can't you insulate against the  
instability

but allow cache incoherence and aliased cache items?


Insulate how?  The only way to do that is to add something to the  
cache key
that says that these two otherwise identical items are actually  
diffent

things.


That something might be the pathname of the mounted-on directory or  
of the file itself.


I'm arguing that cache coherence isn't supported by the NFS  
protocol, so how
can FS-cache *require* a facility to support persistent local   
caching that

the protocol doesn't have in the first place?


NFS has just enough to just about support a persistent local cache for
unmodified files.  It has unique file keys per server, and it has a  
(limited)

amount of coherency data per file.  That's not really the problem.

The problem is that the client can create loads of different views  
of a remote
export and the kernel treats them as if they're views of different  
remote
exports.  These views do not necessarily have *anything* to  
distinguish them

at all (nosharecache option).


Yes, they do.  The combin

Re: [2.6 patch] fs/nfs/direct.c: remove dead code

2007-12-12 Thread Chuck Lever

commit b9148c6b should be reverted.  It was recently forward-ported  
from some years-old patches, and is clearly not needed now.


On Dec 11, 2007, at 5:21 PM, Adrian Bunk wrote:

This code became dead after commit  
b9148c6b80d802dbc2a7530b29915a80432e50c7
(which BTW doesn't seem to have changed any behaviour) and can  
therefore

be removed.

Spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---
--- linux-2.6/fs/nfs/direct.c.old   2007-12-02 21:54:53.0 +0100
+++ linux-2.6/fs/nfs/direct.c   2007-12-02 21:55:10.0 +0100
@@ -897,15 +897,12 @@ ssize_t nfs_file_direct_write(struct kio
if (!count)
goto out;   /* return 0 */

retval = -EINVAL;
if ((ssize_t) count < 0)
goto out;
-   retval = 0;
-   if (!count)
-   goto out;

retval = nfs_sync_mapping(mapping);
if (retval)
goto out;

retval = nfs_direct_write(iocb, iov, nr_segs, pos, count);



--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Chuck Lever


Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the  
resolution is restarted exactly once, and an additional flag is  
passed to the file system during each lookup that forces each  
component in the path to be revalidated on the server.  This has no  
possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?




--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Chuck Lever


On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the  
resolution is restarted exactly once, and an additional flag is  
passed to the file system during each lookup that forces each  
component in the path to be revalidated on the server.  This has  
no possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we  
should give up.  When I addressed this issue two years ago, the  
two-try logic was the only acceptable solution because there's no  
way to guarantee the pathname resolution will ever finish unless  
we put a hard limit on it.




I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.


Unless someone is doing something malicious.  Or if the server is  
repeatedly returning ESTALE for some reason.



There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close  
all the windows that result from sloppy file time stamps without  
completely disabling client-side caching.  The NFS protocol relies  
on file time stamps to manage cache coherence.  If the server is  
lying about time stamps, there's no way the client can cache  
coherently.




Server bug or not, it is something that the client has to live
with.  We can't get the server file system fixed, so it is
something that we should find a way to live with.  This support
can help.


We haven't identified a server-side solution yet, but that doesn't  
mean it doesn't exist.


If we address the time stamp problem in the client, should we also go  
to lengths to address it in every other corner of the NFS client?   
Should we also address every other server bug we discover with a  
client side fix?



Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then  
it's a real error.  Is there a fixed and valid recovery script for  
the client in this case that will allow the mkdir to proceed?




Why do you think that it is an error?


Because this is a problem that sometimes requires application-level  
recovery.  Can we guarantee that retrying the mkdir is the right  
thing to do every time?



It can easily occur if the directory in which the new directory
is to be created disppears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.


Have you tried this client against a file server when you unexport  
the filesystem under test?  The server returns ESTALE no matter what  
the client does.  Should the client continue to retry the request if  
the file system has been permanently taken offline?


Admittedly, the NFS client could recover more cleanly from some of  
these problems, but given the architecture of the Linux VFS, it  
will be difficult to address some of the corner cases.


Could you outline some of these corner cases that this proposal
would not address, please?


I think we have one right here: should the client retry a mkdir if  
gets an ESTALE?


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Chuck Lever


On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the  
resolution is restarted exactly once, and an additional flag is  
passed to the file system during each lookup that forces each  
component in the path to be revalidated on the server.  This has  
no possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we should  
give up.  When I addressed this issue two years ago, the two-try  
logic was the only acceptable solution because there's no way to  
guarantee the pathname resolution will ever finish unless we put a  
hard limit on it.



There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close all  
the windows that result from sloppy file time stamps without  
completely disabling client-side caching.  The NFS protocol relies on  
file time stamps to manage cache coherence.  If the server is lying  
about time stamps, there's no way the client can cache coherently.



and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.


A use case or two would be useful to explore (on linux-nfs or linux- 
fsdevel, rather than lkml).



Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then it's a  
real error.  Is there a fixed and valid recovery script for the  
client in this case that will allow the mkdir to proceed?


Admittedly, the NFS client could recover more cleanly from some of  
these problems, but given the architecture of the Linux VFS, it will  
be difficult to address some of the corner cases.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Chuck Lever


Hi Peter-

On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.


A use case or two would be useful to explore (on linux-nfs or  
linux-fsdevel, rather than lkml).


I created a bunch of use cases in the gensyscall.c program that
I attached to the original description of the problem and my
proposed solution.  It was very useful in generating many, many
ESTALE errors over the wire from a variety of different over the
wire operations, which were originally getting returned to the
user level.


The gensyscall.c program is what I would call a set of unit test, btw.

This is not the same as a use case, which would include information  
about the application environment, its users, a detailed description  
of current system behavior, and some discussion of alternatives for  
improving it (including doing nothing).


A test case is written in a programming language, a use case is  
written in a natural language.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 21/26] mount options: partially fix nfs

2008-01-24 Thread Chuck Lever

Hi Miklos-

Miklos Szeredi wrote:

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems.

Document several other options that are still missing.

NFS lists only some options in /proc/mounts on purpose: only the 
essential options are mentioned there to keep clutter down.  The three 
you've added here are for all intents and purposes deprecated, which is 
why they are not supported.

NFS lists a more complete set of mount options for a mount point in 
/proc/self/mountstats.  See nfs_show_stats().

Since your cover letter does not explain why you are changing this code, 
can you refer me to a description of why you are doing this?

More below.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/nfs/super.c
===
--- linux.orig/fs/nfs/super.c   2008-01-19 11:56:34.0 +0100
+++ linux/fs/nfs/super.c2008-01-21 20:41:30.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
} nfs_info[] = {
{ NFS_MOUNT_SOFT, ",soft", ",hard" },
{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+   { NFS_MOUNT_POSIX, ",posix", "" },
{ NFS_MOUNT_NOCTO, ",nocto", "" },
{ NFS_MOUNT_NOAC, ",noac", "" },
{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc
};
const struct proc_nfs_info *nfs_infop;
struct nfs_client *clp = nfss->nfs_client;
+   unsigned int default_namelen =
+   clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN :
+   clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN;

 	seq_printf(m, ",vers=%d", clp->rpc_ops->version);

seq_printf(m, ",rsize=%d", nfss->rsize);
seq_printf(m, ",wsize=%d", nfss->wsize);
+   if (nfss->bsize != 0)
+   seq_printf(m, ",bsize=%d", nfss->bsize);
+   if (nfss->namelen != default_namelen)
+   seq_printf(m, ",namelen=%d", nfss->namelen);
if (nfss->acregmin != 3*HZ || showdefaults)
seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc
seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval 
/ HZ);
seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
seq_printf(m, ",sec=%s", 
nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+
+   /*
+* Missing options:
+* port=

Probably should be supported.

+* addr=

This one is already supported; see nfs_show_options().

+* clientaddr=

This one isn't, and should be... would be useful for tracking down 
certain NFSv4 problems.

+* mounthost=
+* mountaddr=

> +   * mountport=
> +   * mountvers=
> +   * mountproto=

And these mount* options are for the kernel's new mount protocol client. 
 They aren't really useful for understanding steady-state NFS client 
behavior, they only effect mount-time behavior.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [PATCH 24/27] NFS: Use local caching [try #2]

2008-01-24 Thread Chuck Lever


Some comments below.

This patch really ought to be broken into more manageable atomic changes 
to make it easier to review, and to provide more fine-grained 
explanation and rationalization for each specific change via individual 
patch descriptions.


David Howells wrote:

The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

http://people.redhat.com/steved/fscache/util-linux/


This should no longer be necessary.  The latest mount.nfs subcommand 
from nfs-utils supports text-based mounts when running on kernels 2.6.23 
and later.



To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

mount warthog:/ /a -o fsc


I hope you intend to provide updates to nfs(5) that describe the new 
mount options you introduce in this and later patches.  You don't 
mention it, but I assume that "nofsc" is the default behavior.



Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/Makefile   |1 
 fs/nfs/client.c   |5 +

 fs/nfs/file.c |   37 
 fs/nfs/fscache-def.c  |  289 +
 fs/nfs/fscache.c  |  391 +
 fs/nfs/fscache.h  |  148 +
 fs/nfs/inode.c|   47 +
 fs/nfs/read.c |   28 +++
 fs/nfs/super.c|3 
 fs/nfs/sysctl.c   |1 
 include/linux/nfs_fs.h|9 +

 include/linux/nfs_fs_sb.h |   18 ++
 12 files changed, 968 insertions(+), 9 deletions(-)
 create mode 100644 fs/nfs/fscache-def.c
 create mode 100644 fs/nfs/fscache.c
 create mode 100644 fs/nfs/fscache.h


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..073d04c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)  += nfs4proc.o nfs4xdr.o nfs4state.o 
nfs4renewd.o \
   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index a6f6254..bcdc5d0 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -43,6 +43,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,

clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+	nfs_fscache_get_client_cookie(clp);

+
return clp;
 
 error_3:

@@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
 	nfs4_shutdown_client(clp);
 
+	nfs_fscache_release_client_cookie(clp);

+
/* -EIO all pending I/O */
if (!IS_ERR(clp->cl_rpcclient))
rpc_shutdown_client(clp->cl_rpcclient);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index b3bb89f..d492cd7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
 #include "delegation.h"
 #include "internal.h"
 #include "iostat.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_FILE
 
@@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,

return status < 0 ? status : copied;
 }
 
+/*

+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ *   page invalidation
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ */


Add comments like this in a separate clean up patch.


 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
nfs_wb_page_cancel(page->mapping->host, page);
+
+   nfs_fscache_invalidate_page(page, page->mapping->host);
 }
 
+/*

+ * Release the private state associated with a page
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return true (may release) or false (may not)
+ */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
/* If PagePrivate() is set, then the page is not freeable */
-   return 0;
+   if (PagePrivate(page))
+   return 0;
+   return nfs_fscache_release_page(page, gfp);
 }
 
+/*

+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
 static int nfs_launder_page(struct page *page)
 {
+   wait_on_page_fscache_write(page);
return nfs_wb_page(page->mapping->host, page);
 }
 
@@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = {

Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9

2007-12-26 Thread Chuck Lever

 be fixed on the server.  That is  
sometimes difficult if the server is not maintained, for instance.


- I have found other strange behaviors of the new NFS filesystem  
support

that i am investigating. All are bound to the user of
old userspace servers onto the new NFS (since 2.6.22). What to do ?


Report the problems on the [EMAIL PROTECTED] mailing list, or
document them in the official bug databases (either the Linux kernel  
bugzilla or the NFSv4 bug tracker at http://bugzilla.linux-nfs.org/



I'm not sure what the NFS client's policy is regarding support for
userspace servers.  But I'd certainly hope that it is "don't break  
them".


The general policy is that if a server behaves in ways that don't  
conform to the NFS spec, then the Linux NFS client probably won't  
work with it.  If the client works with a broken server today, there  
is no guarantee that it will continue to work.



Which would make this an NFS client regression.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: How to manage shared persistent local caching (FS-Cache) with NFS?

2007-12-05 Thread Chuck Lever

w and is used by the database engine.   
Another mount point is ro and is used for back-up utilities, like RMAN.


Another example is local software distribution.  One mount point is  
ro, and is accessed by normal users.  Another mount point accesses  
the same export rw, and is used by administrators who provide updates  
for the software.


As useful as the feature is, one can also argue that mounting the  
same export multiple times is infrequent in most normal use cases.   
Practically speaking, why do we really need to worry about it?


The real problem here is that the NFS protocol itself does not  
support strong cache coherence.  I don't see why the Linux kernel  
must fix that problem.


The only real problem with the first scenario is that you may have  
more than one copy of a file in the persistent cache.  How often will  
that be the case?  Since the local persistence cache is probably disk- 
based and thus large relative to memory, what's the problem with  
using a little extra space?


The problems you ascribe to your second and third caching scenarios  
(deadlocking and reconnection) are, however, real and substantial.   
You don't have these issues when caching each mount point separately,  
right?


It seems to me that implementing the first scenario is (a)  
straightforward, (b) has fewer runtime risks (ie deadlocks), (c)  
doesn't take away features that some people still use, and (d) solves  
more than 80% of the issues here (80/20 rule of thumb).


Lastly, there's already a mount option that allows admins to control  
whether the page and attribute caches are shared -- "sharecache".  Is  
this mount option not adequate for persistent caching?


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] VFS: extend /proc/mounts

2008-01-17 Thread Chuck Lever


On Jan 17, 2008, at 3:55 AM, Miklos Szeredi wrote:
Hey, I just found /proc/X/mountstats.  How does this fit in to the  
big

picture?


It seems to show some counters for NFS mounts, no other filesystem
uses it.  Format looks rather less nice, than /proc/X/mounts (why do
we need long english sentences under /proc?).



I introduced /proc/self/mountstats because we need a way for non- 
block-device-based file systems to report I/O statistics.  Everything  
else I tried was rejected, and apparently what we ended up with was  
reviewed by only a handful of people, so no one else likes it or uses  
it.


It can go away for all I care, as long as we retain some flexible  
mechanism for non-block-based file systems to report I/O stats.  As  
far as I am aware, there are only two user utilities that understand  
and parse this data, and I maintain both.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 21/26] mount options: partially fix nfs

2008-01-25 Thread Chuck Lever

On Jan 25, 2008, at 4:39 AM, Miklos Szeredi wrote:

Miklos Szeredi wrote:

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems.

Document several other options that are still missing.

NFS lists only some options in /proc/mounts on purpose: only the
essential options are mentioned there to keep clutter down.  The  
three
you've added here are for all intents and purposes deprecated,  
which is

why they are not supported.

NFS lists a more complete set of mount options for a mount point in
/proc/self/mountstats.  See nfs_show_stats().

Since your cover letter does not explain why you are changing this  
code,

can you refer me to a description of why you are doing this?

Descritption is in the 01/26 patch.

More below.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/nfs/super.c
===
--- linux.orig/fs/nfs/super.c   2008-01-19 11:56:34.0 +0100
+++ linux/fs/nfs/super.c2008-01-21 20:41:30.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
} nfs_info[] = {
{ NFS_MOUNT_SOFT, ",soft", ",hard" },
{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+   { NFS_MOUNT_POSIX, ",posix", "" },
{ NFS_MOUNT_NOCTO, ",nocto", "" },
{ NFS_MOUNT_NOAC, ",noac", "" },
{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc
};
const struct proc_nfs_info *nfs_infop;
struct nfs_client *clp = nfss->nfs_client;
+   unsigned int default_namelen =
+   clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN :
+   clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN;

seq_printf(m, ",vers=%d", clp->rpc_ops->version);
seq_printf(m, ",rsize=%d", nfss->rsize);
seq_printf(m, ",wsize=%d", nfss->wsize);
+   if (nfss->bsize != 0)
+   seq_printf(m, ",bsize=%d", nfss->bsize);
+   if (nfss->namelen != default_namelen)
+   seq_printf(m, ",namelen=%d", nfss->namelen);
if (nfss->acregmin != 3*HZ || showdefaults)
seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout- 
>to_initval / HZ);
 	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout- 
>to_retries);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client- 
>cl_auth->au_flavor));

+
+   /*
+* Missing options:
+* port=

Probably should be supported.

+* addr=

This one is already supported; see nfs_show_options().

Right, thanks.

+* clientaddr=

This one isn't, and should be... would be useful for tracking down
certain NFSv4 problems.

+* mounthost=
+* mountaddr=
+* mountport=
+* mountvers=
+* mountproto=

And these mount* options are for the kernel's new mount protocol  
client.

  They aren't really useful for understanding steady-state NFS client
behavior, they only effect mount-time behavior.

All mount options should be shown, which are needed to reconstruct a
previous mount.

Ah, OK.

I'm happy to implement logic to display the all missing options.  I  
should have updated nfs_show_mount_options() when I wrote the NFS  
mount option parser.

Let me know your preference.

For example, if you copy options out from /proc/mount, umount the
filesystem, and then create a new mount with the copied options, you
should get the same mount.

For NFS, umount also needs to read some of the options in order to  
determine how mountd is to connect to the server for the unmount.   
(That's why we have addr= in the first place).

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 21/26] mount options: partially fix nfs

2008-01-28 Thread Chuck Lever


On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote:

All mount options should be shown, which are needed to reconstruct a
previous mount.


Ah, OK.

I'm happy to implement logic to display the all missing options.  I
should have updated nfs_show_mount_options() when I wrote the NFS
mount option parser.

Let me know your preference.


You are more familiar with NFS, so I think it would be better if you
updated nfs_show_mount_options().

Could you also queue my patch (updated) or incorporate it into a
combined fix?


Yes.  I'll have time in a day or two to get this finished.


Thanks,
Miklos


Subject: mount options: partially fix nfs

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems.

Document several other options that are still missing.

Changes:

 - display namelen= unconditionally
 - addr= isn't missing after all

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---

Index: linux/fs/nfs/super.c
===
--- linux.orig/fs/nfs/super.c   2008-01-25 15:44:56.0 +0100
+++ linux/fs/nfs/super.c2008-01-25 15:57:32.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
} nfs_info[] = {
{ NFS_MOUNT_SOFT, ",soft", ",hard" },
{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+   { NFS_MOUNT_POSIX, ",posix", "" },
{ NFS_MOUNT_NOCTO, ",nocto", "" },
{ NFS_MOUNT_NOAC, ",noac", "" },
{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
seq_printf(m, ",vers=%d", clp->rpc_ops->version);
seq_printf(m, ",rsize=%d", nfss->rsize);
seq_printf(m, ",wsize=%d", nfss->wsize);
+   seq_printf(m, ",namelen=%d", nfss->namelen);
+   if (nfss->bsize != 0)
+   seq_printf(m, ",bsize=%d", nfss->bsize);
if (nfss->acregmin != 3*HZ || showdefaults)
seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout- 
>to_initval / HZ);

seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client- 
>cl_auth->au_flavor));

+
+   /*
+* Missing options:
+* port=
+* mountport=
+* mountvers=
+* mountproto=
+* clientaddr=
+* mounthost=
+* mountaddr=
+*/
 }

 /*


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] VFS: New /proc file /proc/self/mountstats

2005-03-17 Thread Chuck Lever

 Create a new file under /proc/self, called mountstats, where mounted file
 systems can export information (configuration options, performance counters,
 and so on).  Use a mechanism similar to /proc/mounts and s_ops->show_options.

 This mechanism does not violate namespace security, and is safe to use while
 other processes are unmounting file systems.

 Version: Mon, 14 Mar 2005 17:06:04 -0500
 
 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---
 
 fs/namespace.c |   66 +
 fs/proc/base.c |   40 +++
 include/linux/fs.h |1 
 3 files changed, 107 insertions(+)
 
 
diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/namespace.c 
01-mountstats/fs/namespace.c
--- 00-stock/fs/namespace.c 2005-03-02 02:38:13.0 -0500
+++ 01-mountstats/fs/namespace.c2005-03-14 15:24:51.565085000 -0500
@@ -265,6 +265,72 @@ struct seq_operations mounts_op = {
.show   = show_vfsmnt
 };
 
+/* iterator */
+static void *ms_start(struct seq_file *m, loff_t *pos)
+{
+   struct namespace *n = m->private;
+   struct list_head *p;
+   loff_t l = *pos;
+
+   down_read(&n->sem);
+   list_for_each(p, &n->list)
+   if (!l--)
+   return list_entry(p, struct vfsmount, mnt_list);
+   return NULL;
+}
+
+static void *ms_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct namespace *n = m->private;
+   struct list_head *p = ((struct vfsmount *)v)->mnt_list.next;
+   (*pos)++;
+   return p==&n->list ? NULL : list_entry(p, struct vfsmount, mnt_list);
+}
+
+static void ms_stop(struct seq_file *m, void *v)
+{
+   struct namespace *n = m->private;
+   up_read(&n->sem);
+}
+
+static int show_vfsstat(struct seq_file *m, void *v)
+{
+   struct vfsmount *mnt = v;
+   int err = 0;
+
+   /* device */
+   if (mnt->mnt_devname) {
+   seq_puts(m, "device ");
+   mangle(m, mnt->mnt_devname);
+   } else
+   seq_puts(m, "no device");
+
+   /* mount point */
+   seq_puts(m, " mounted on ");
+   seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
+   seq_putc(m, ' ');
+
+   /* file system type */
+   seq_puts(m, "with fstype ");
+   mangle(m, mnt->mnt_sb->s_type->name);
+
+   /* optional statistics */
+   if (mnt->mnt_sb->s_op->show_stats) {
+   seq_putc(m, ' ');
+   err = mnt->mnt_sb->s_op->show_stats(m, mnt);
+   }
+
+   seq_putc(m, '\n');
+   return err;
+}
+
+struct seq_operations mountstats_op = {
+   .start  = ms_start,
+   .next   = ms_next,
+   .stop   = ms_stop,
+   .show   = show_vfsstat,
+};
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/proc/base.c 
01-mountstats/fs/proc/base.c
--- 00-stock/fs/proc/base.c 2005-03-02 02:38:12.0 -0500
+++ 01-mountstats/fs/proc/base.c2005-03-14 15:24:51.571085000 -0500
@@ -60,6 +60,7 @@ enum pid_directory_inos {
PROC_TGID_STATM,
PROC_TGID_MAPS,
PROC_TGID_MOUNTS,
+   PROC_TGID_MOUNTSTATS,
PROC_TGID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
PROC_TGID_SCHEDSTAT,
@@ -91,6 +92,7 @@ enum pid_directory_inos {
PROC_TID_STATM,
PROC_TID_MAPS,
PROC_TID_MOUNTS,
+   PROC_TID_MOUNTSTATS,
PROC_TID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
PROC_TID_SCHEDSTAT,
@@ -134,6 +136,7 @@ static struct pid_entry tgid_base_stuff[
E(PROC_TGID_ROOT,  "root",S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE,   "exe", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_MOUNTS,"mounts",  S_IFREG|S_IRUGO),
+   E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR,  "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -164,6 +167,7 @@ static struct pid_entry tid_base_stuff[]
E(PROC_TID_ROOT,   "root",S_IFLNK|S_IRWXUGO),
E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO),
E(PROC_TID_MOUNTS, "mounts",  S_IFREG|S_IRUGO),
+   E(PROC_TID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
E(PROC_TID_ATTR,   "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -528,6 +532,38 @@ static struct file_operations proc_mount
.release= mounts_release,
 };
 
+extern struct seq_operations mountstats_op;
+static int mountstats_open(struct inode *inode, struct file *file)
+{
+   struct task_struct *task = proc_task(inode);
+   int ret = seq_open(file, &mountstats_op);
+
+   if (!ret) {
+   struct seq_file *m = file->p

[PATCH 0/2] RFC: exporting per-superblock statistics to user space

2005-03-17 Thread Chuck Lever

 We still have a need to provide "iostat" like statistics for NFS
 clients.  Following are a couple of patches, against 2.6.11.3, which
 prototype an approach for providing this kind of data to user programs.

 I'd like some comment on the approach.

 01-mountstats.patch adds a new file called /proc/self/mountstats and a
 new file system method called show_stats.  this just replicates
 /proc/mounts and the show_options hook.

 02-nfs-iostat.patch teachs the NFS client to use the new show_stats
 hook as a demonstration.

 Note that this approach addresses previously voiced concerns about
 exporting per-superblock stats to user space.

 1. Processes can't see stats for file systems mounted outside their
namespace.

 2. Reading the stats file is serialized with mount and unmount
operations.

 3. The approach doesn't use /sys or kobjects.

 4. There are no lifetime issues tied to file systems loaded as a
module.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] NFS: add I/O performance counters

2005-03-17 Thread Chuck Lever

 Add an extensible per-superblock performance counter facility to the NFS
 client.  This facility mimics the counters available for block devices and
 for networking.

 Expose these new counters via /proc/self/mountstats.

 Version: Mon, 14 Mar 2005 17:06:12 -0500
 
 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---
 
 fs/nfs/dir.c   |8 ++
 fs/nfs/direct.c|5 +
 fs/nfs/file.c  |   20 +++--
 fs/nfs/inode.c |  126 +++--
 fs/nfs/pagelist.c  |   12 ++-
 fs/nfs/read.c  |7 ++
 fs/nfs/write.c |   10 ++
 include/linux/nfs_fs_sb.h  |5 +
 include/linux/nfs_iostat.h |   80 +++
 9 files changed, 256 insertions(+), 17 deletions(-)
 
 
diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/dir.c 
02-nfs-iostat/fs/nfs/dir.c
--- 01-mountstats/fs/nfs/dir.c  2005-03-02 02:38:09.0 -0500
+++ 02-nfs-iostat/fs/nfs/dir.c  2005-03-14 15:28:34.011484000 -0500
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -428,6 +429,8 @@ static int nfs_readdir(struct file *filp
 
lock_kernel();
 
+   nfs_inc_stats(inode, NFS_VFS_GETDENTS);
+
res = nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (res < 0) {
unlock_kernel();
@@ -584,6 +587,7 @@ static int nfs_lookup_revalidate(struct 
parent = dget_parent(dentry);
lock_kernel();
dir = parent->d_inode;
+   nfs_inc_stats(dir, NFS_DENTRY_REVALIDATE);
inode = dentry->d_inode;
 
if (nd && !(nd->flags & LOOKUP_CONTINUE) && (nd->flags & LOOKUP_OPEN))
@@ -712,6 +716,7 @@ static struct dentry *nfs_lookup(struct 
 
dfprintk(VFS, "NFS: lookup(%s/%s)\n",
dentry->d_parent->d_name.name, dentry->d_name.name);
+   nfs_inc_stats(dir, NFS_VFS_LOOKUP);
 
res = ERR_PTR(-ENAMETOOLONG);
if (dentry->d_name.len > NFS_SERVER(dir)->namelen)
@@ -1116,6 +1121,7 @@ static int nfs_sillyrename(struct inode 
dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
dentry->d_parent->d_name.name, dentry->d_name.name, 
atomic_read(&dentry->d_count));
+   nfs_inc_stats(dir, NFS_SILLY_RENAME);
 
 #ifdef NFS_PARANOIA
 if (!dentry->d_inode)
@@ -1500,6 +1506,8 @@ int nfs_permission(struct inode *inode, 
struct rpc_cred *cred;
int res;
 
+   nfs_inc_stats(inode, NFS_VFS_ACCESS);
+
if (mask == 0)
return 0;
 
diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/direct.c 
02-nfs-iostat/fs/nfs/direct.c
--- 01-mountstats/fs/nfs/direct.c   2005-03-02 02:38:25.0 -0500
+++ 02-nfs-iostat/fs/nfs/direct.c   2005-03-14 15:26:16.401349000 -0500
@@ -47,6 +47,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 
@@ -354,6 +355,8 @@ static ssize_t nfs_direct_read_seg(struc
result = nfs_direct_read_wait(dreq, clnt->cl_intr);
rpc_clnt_sigunmask(clnt, &oldset);
 
+   nfs_add_stats(inode, NFS_WIRE_READ_BYTES, result);
+   nfs_add_stats(inode, NFS_DIRECT_READ_BYTES, result);
return result;
 }
 
@@ -576,6 +579,8 @@ static ssize_t nfs_direct_write(struct i
if (result < size)
break;
}
+   nfs_add_stats(inode, NFS_WIRE_WRITTEN_BYTES, tot_bytes);
+   nfs_add_stats(inode, NFS_DIRECT_WRITTEN_BYTES, tot_bytes);
return tot_bytes;
 }
 
diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/file.c 
02-nfs-iostat/fs/nfs/file.c
--- 01-mountstats/fs/nfs/file.c 2005-03-02 02:38:38.0 -0500
+++ 02-nfs-iostat/fs/nfs/file.c 2005-03-14 15:42:52.446804000 -0500
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -86,18 +87,15 @@ static int nfs_check_flags(int flags)
 static int
 nfs_file_open(struct inode *inode, struct file *filp)
 {
-   struct nfs_server *server = NFS_SERVER(inode);
-   int (*open)(struct inode *, struct file *);
int res;
 
res = nfs_check_flags(filp->f_flags);
if (res)
return res;
 
+   nfs_inc_stats(inode, NFS_VFS_OPEN);
lock_kernel();
-   /* Do NFSv4 open() call */
-   if ((open = server->rpc_ops->file_open) != NULL)
-   res = open(inode, filp);
+   res = NFS_SERVER(inode)->rpc_ops->file_open(inode, filp);
unlock_kernel();
return res;
 }
@@ -105,6 +103,7 @@ nfs_file_open(struct inode *inode, struc
 static int
 nfs_file_release(struct inode *inode, struct file *filp)
 {
+   nfs_inc_stats(inode, NFS_VFS_CLOSE);
return NFS_PROTO(inode)->file_release(inode, filp);
 }
 
@@ -123,6 +122,7 @@ nfs_file_flush(struct file *file)
 
if ((file->f_mode & FMODE_WRITE) == 0)
return

Re: 2.6.11 oops in skb_drop_fraglist

2005-03-21 Thread Chuck Lever

Andrew Morton wrote:
Chuck Lever <[EMAIL PROTECTED]> wrote:
testing NFS client workloads on a dual Pentium-III system running 2.6.11 
with some NFS patches.  i hit this oops while doing simple-minded ftps 
and tars.

the system locks up once or twice a day under this workload.  this is 
the first time i had the console and captured the oops output.


Chuck, I didn't see any followup to this.  Is it still happening in current
kernels?
i have not been able to reproduce it with the aforementioned NFS patches 
removed.  i'm now convinced it was a bug in one of the NFS patches i had 
applied, even though none of them come near the fraglist stuff, but i 
haven't had a chance to nail it down.

i had implemented a patch to cause the RPC client to reuse the port 
number when reconnecting to the server after the server drops the 
connection... this is a standard practice for other RPC implementations. 
 i suspect it was that patch that was causing the trouble.

thanks for the follow-up!
begin:vcard
fn:Chuck Lever
n:Lever;Charles
org:Network Appliance, Incorporated;Linux NFS Client Development
adr:535 West William Street, Suite 3100;;Center for Information Technology Integration;Ann Arbor;MI;48103-4943;USA
email;internet:[EMAIL PROTECTED]
title:Member of Technical Staff
tel;work:+1 734 763-4415
tel;fax:+1 734 763 4434
tel;home:+1 734 668-1089
x-mozilla-html:FALSE
url:http://www.monkey.org/~cel/
version:2.1
end:vcard

2.6.11 oops in skb_drop_fraglist

2005-03-10 Thread Chuck Lever

testing NFS client workloads on a dual Pentium-III system running 2.6.11 
with some NFS patches.  i hit this oops while doing simple-minded ftps 
and tars.

the system locks up once or twice a day under this workload.  this is 
the first time i had the console and captured the oops output.

Unable to handle kernel NULL pointer dereference at virtual address 0001
 printing eip:
c02fc752
*pde = 
Oops:  [#1]
SMP
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010202   (2.6.11-CITI_NFS4_ALL-1)
EIP is at skb_drop_fraglist+0x22/0x50
eax: f6fe26e0   ebx: 0001   ecx: f6fe26e0   edx: 0001
esi: f6f29240   edi: 0004   ebp: f697ed24   esp: c04cadd8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c04ca000 task=c03efb60)
Stack: 0001 c02fc7fa f6f29240 f697ecc0 c02fc838  c02fc8ba 
ccbf2ab0
   0b50 c0326a16 f6f87740 f6f29240 f6f29240 c0321c4a 0020 
f6f87778
   f6f87740 f697ecc0 f69d1560 00625a79 c04cae6c  0012 
f697ecc0
Call Trace:
 [] skb_release_data+0x5a/0x90
 [] kfree_skbmem+0x8/0x20
 [] __kfree_skb+0x6a/0xf0
 [] tcp_transmit_skb+0x406/0x720
 [] tcp_clean_rtx_queue+0x17a/0x410
 [] tcp_ack+0xf6/0x580
 [] tcp_rcv_established+0x409/0x7f0
 [] apic_timer_interrupt+0x1c/0x24
 [] tcp_v4_do_rcv+0x110/0x120
 [] tcp_v4_rcv+0x6bf/0x940
 [] ip_local_deliver+0xc2/0x1f0
 [] ip_rcv+0x336/0x450
 [] alloc_skb+0x41/0xf0
 [] netif_receive_skb+0x136/0x1a0
 [] e1000_clean_rx_irq+0x15e/0x4a0
 [] __kfree_skb+0x6a/0xf0
 [] e1000_clean+0xba/0xf0
 [] net_rx_action+0x6f/0x100
 [] __do_softirq+0xb9/0xd0
 [] do_softirq+0x4a/0x60

c02fc730:   53  push   %ebx
c02fc731:   8b 80 94 00 00 00   mov0x94(%eax),%eax
c02fc737:   8b 58 0cmov0xc(%eax),%ebx
c02fc73a:   c7 40 0c 00 00 00 00movl   $0x0,0xc(%eax)
c02fc741:   eb 0d   jmpc02fc750 

c02fc743:   90  nop
c02fc744:   90  nop
c02fc745:   90  nop
c02fc746:   90  nop
c02fc747:   90  nop
c02fc748:   90  nop
c02fc749:   90  nop
c02fc74a:   90  nop
c02fc74b:   90  nop
c02fc74c:   90  nop
c02fc74d:   90  nop
c02fc74e:   90  nop
c02fc74f:   90  nop
c02fc750:   89 da   mov%ebx,%edx
c02fc752:   8b 1b   mov(%ebx),%ebx
c02fc754:   8b 82 84 00 00 00   mov0x84(%edx),%eax
c02fc75a:   48  dec%eax
c02fc75b:   75 13   jnec02fc770 

c02fc75d:   f0 83 44 24 00 00   lock addl $0x0,0x0(%esp,1)
c02fc763:   89 d0   mov%edx,%eax
c02fc765:   e8 e6 00 00 00  call   c02fc850 <__kfree_skb>
c02fc76a:   85 db   test   %ebx,%ebx
c02fc76c:   75 e2   jnec02fc750 

c02fc76e:   5b  pop%ebx
c02fc76f:   c3  ret
c02fc770:   f0 ff 8a 84 00 00 00lock decl 0x84(%edx)
c02fc777:   0f 94 c0sete   %al
c02fc77a:   84 c0   test   %al,%al
c02fc77c:   74 ec   je c02fc76a 

c02fc77e:   eb e3   jmpc02fc763 



begin:vcard
fn:Chuck Lever
n:Lever;Charles
org:Network Appliance, Incorporated;Linux NFS Client Development
adr:535 West William Street, Suite 3100;;Center for Information Technology Integration;Ann Arbor;MI;48103-4943;USA
email;internet:[EMAIL PROTECTED]
title:Member of Technical Staff
tel;work:+1 734 763-4415
tel;fax:+1 734 763 4434
tel;home:+1 734 668-1089
x-mozilla-html:FALSE
url:http://www.monkey.org/~cel/
version:2.1
end:vcard

RE: [NFS] [PATCH] New patch to flush out dirty mmap()ed NFS pages in 2.4.4

2001-05-14 Thread Chuck Lever


the default behavior is that close() waits for all write-backs to
be committed to the server's disk.  you might add support for the
"nocto" mount option so that waiting is skipped for shared mmap'd
files, but then what happens to data that is pinned on the client
because a write-back failed after close() returns to the application?

what's the application domain Linus is trying to optimize?

> Linus was not too keen on the patches that circulated last week. In
> his concept of shared mmap(), he wants it to ignore the usual
> requirement we have on normal files whereby we flush out the pages on
> file close.  The problem is, though, that we need at least to schedule
> the writes using the correct credentials, and thus a compromise was to
> do this when closing the file...
>
> The following patch attempts to implement a fix along Linus'
> specification. Features are:
>
>- Remove inode operation force_delete(). The latter is toxic to all
>  mmap(), as it can cause the inode to get killed before the dirty
>  pages get scheduled.
>
>- Schedule dirty pages upon last fput() of the file.
>
>- Always write out all dirty pages to the server when
>  locking/unlocking.
>
>- Add a write_inode() method in order to allow bdflush() and
>  friends to force out writes when the user calls sys_sync(), when
>  umounting, or when memory pressure is high.
>
>- Since we in any case have to add the write_inode() method, scrap
>  the NFS special O_SYNC code, so we can just use the generic stuff
>  (which will be faster for large writes).
>
> Comments?
>
> Cheers,
>   Trond
>
> diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/file.c
> linux-2.4.4-mmap/fs/nfs/file.c
> --- linux-2.4.4-fixes/fs/nfs/file.c   Fri Feb  9 20:29:44 2001
> +++ linux-2.4.4-mmap/fs/nfs/file.cSat May 12 21:31:39 2001
> @@ -39,6 +39,7 @@
>  static ssize_t nfs_file_write(struct file *, const char *,
> size_t, loff_t *);
>  static int  nfs_file_flush(struct file *);
>  static int  nfs_fsync(struct file *, struct dentry *dentry, int
> datasync);
> +static int  nfs_file_release(struct inode *, struct file *);
>
>  struct file_operations nfs_file_operations = {
>   read:   nfs_file_read,
> @@ -46,7 +47,7 @@
>   mmap:   nfs_file_mmap,
>   open:   nfs_open,
>   flush:  nfs_file_flush,
> - release:nfs_release,
> + release:nfs_file_release,
>   fsync:  nfs_fsync,
>   lock:   nfs_lock,
>  };
> @@ -87,6 +88,13 @@
>   return status;
>  }
>
> +int
> +nfs_file_release(struct inode *inode, struct file *file)
> +{
> + filemap_fdatasync(inode->i_mapping);
> + return nfs_release(inode,file);
> +}
> +
>  static ssize_t
>  nfs_file_read(struct file * file, char * buf, size_t count, loff_t *ppos)
>  {
> @@ -283,9 +291,11 @@
>* Flush all pending writes before doing anything
>* with locks..
>*/
> - down(&filp->f_dentry->d_inode->i_sem);
> + filemap_fdatasync(inode->i_mapping);
> + down(&inode->i_sem);
>   status = nfs_wb_all(inode);
> - up(&filp->f_dentry->d_inode->i_sem);
> + up(&inode->i_sem);
> + filemap_fdatawait(inode->i_mapping);
>   if (status < 0)
>   return status;
>
> @@ -300,10 +310,12 @@
>*/
>   out_ok:
>   if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) {
> - down(&filp->f_dentry->d_inode->i_sem);
> + filemap_fdatasync(inode->i_mapping);
> + down(&inode->i_sem);
>   nfs_wb_all(inode);  /* we may have slept */
> + up(&inode->i_sem);
> + filemap_fdatawait(inode->i_mapping);
>   nfs_zap_caches(inode);
> - up(&filp->f_dentry->d_inode->i_sem);
>   }
>   return status;
>  }
> diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/inode.c
> linux-2.4.4-mmap/fs/nfs/inode.c
> --- linux-2.4.4-fixes/fs/nfs/inode.c  Wed Apr 25 23:58:17 2001
> +++ linux-2.4.4-mmap/fs/nfs/inode.c   Sat May 12 23:54:16 2001
> @@ -45,6 +45,7 @@
>  static void nfs_invalidate_inode(struct inode *);
>
>  static void nfs_read_inode(struct inode *);
> +static void nfs_write_inode(struct inode *,int);
>  static void nfs_delete_inode(struct inode *);
>  static void nfs_put_super(struct super_block *);
>  static void nfs_umount_begin(struct super_block *);
> @@ -52,7 +53,7 @@
>
>  static struct super_operations nfs_sops = {
>   read_inode: nfs_read_inode,
> - put_inode:  force_delete,
> + write_inode:nfs_write_inode,
>   delete_inode:   nfs_delete_inode,
>   put_super:  nfs_put_super,
>   statfs: nfs_statfs,
> @@ -113,6 +114,14 @@
>   NFS_CACHEINV(inode);
>   NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
>   NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
> +}
> +
> +static void
> +nfs_write_inode(struct inode *inode, int sync)
> +{
> + int flags = sync ? FLUSH_WAIT : 0;
> +
> + nfs_

Re: [PATCH] fix broken handling of port=... in NFS option parsing

2007-07-23 Thread Chuck Lever


ACK.

Al Viro wrote:

Obviously broken on little-endian; fortunately, the option is
not frequently used...

Signed-off-by: Al Viro <[EMAIL PROTECTED]>
---
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b34b7a7..b2a851c 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -732,7 +732,7 @@ static int nfs_parse_mount_options(char *raw,
return 0;
if (option < 0 || option > 65535)
return 0;
-   mnt->nfs_server.address.sin_port = htonl(option);
+   mnt->nfs_server.address.sin_port = htons(option);
break;
case Opt_rsize:
if (match_int(args, &mnt->rsize))
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

DMA CRC errors with SiS chipset and notebook drive

2007-08-09 Thread Chuck Lever


For various reasons I want a silent PC, so I'm using a notebook hard
drive in my desktop system.  I've recycled an ancient Foxconn Socket 754
mainboard in it with this IDE controller:

00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE] (rev
01) (prog-if 80 [Master])
Subsystem: Foxconn International, Inc. Unknown device 0c92
Flags: bus master, medium devsel, latency 128
I/O ports at 01f0 [size=8]
I/O ports at 03f4 [size=1]
I/O ports at 0170 [size=8]
I/O ports at 0374 [size=1]
I/O ports at 4000 [size=16]

On boot, the kernel reports:

SCSI subsystem initialized
libata version 2.21 loaded.
pata_sis :00:02.5: version 0.5.1
scsi0 : pata_sis
scsi1 : pata_sis
ata1: PATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 
bmdma 0x00

014000 irq 14
ata2: PATA max UDMA/133 cmd 0x00010170 ctl 0x00010376 
bmdma 0x00

014008 irq 15
ata1.00: ATA-6: ST94813A, 3.04, max UDMA/100
ata1.00: 78140160 sectors, multi 16: LBA48
ata1.01: ATAPI: TSSTcorpCD/DVDW SH-S182D, SB00, max UDMA/33
ata1.00: configured for UDMA/100
ata1.01: configured for UDMA/33
scsi 0:0:0:0: Direct-Access ATA  ST94813A 3.04 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO

 or FUA
sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO

 or FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
scsi 0:0:1:0: CD-ROMTSSTcorp CD/DVDW SH-S182D SB00 PQ: 0 ANSI: 5

Then later in the log, I see:

Aug  9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2
Aug  9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64)
Aug  9 07:41:29 monet kernel: ata1.00: cmd
c8/00:10:73:15:00/00:00:00:00:00/e1 tag 0 cdb 0x12 data 8192 in
Aug  9 07:41:29 monet kernel:  res
51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Aug  9 07:41:29 monet kernel: ata1: soft resetting port
Aug  9 07:41:29 monet kernel: ata1.00: configured for UDMA/100
Aug  9 07:41:29 monet kernel: ata1.01: configured for UDMA/33
Aug  9 07:41:29 monet kernel: ata1: EH complete

And eventually:

Aug  9 07:41:29 monet kernel: ata1.00: limiting speed to UDMA/66:PIO4
Aug  9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2
Aug  9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64)
Aug  9 07:41:29 monet kernel: ata1.00: cmd
c8/00:08:fb:7c:a1/00:00:00:00:00/e0 tag 0 cdb 0x12 data 4096 in
Aug  9 07:41:29 monet kernel:  res
51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Aug  9 07:41:29 monet kernel: ata1: soft resetting port
Aug  9 07:41:29 monet kernel: ata1.00: configured for UDMA/66
Aug  9 07:41:29 monet kernel: ata1.01: configured for UDMA/33
Aug  9 07:41:29 monet kernel: ata1: EH complete

After that, no more DMA error messages (until the next system reboot).

I've tried this with a brand-new Seagate Momentus 5400.2 and a fairly 
new Western Digital WD800VE.  According to the disk specifications 
available on the vendor web sites, both drives are capable of UDMA/100 
speeds.


This error occurs whether the drive is on the primary IDE controller by
itself, or whether it's sharing the controller with an optical drive
(ata1.01 in the above listing).  I've changed the disk to the secondary
controller, I've swapped IDE cables, and changed out the 44-40
pin notebook drive adapter; no change.

The system is running 2.6.22.1-41.fc7 (x86-64, of course), but I've seen 
this on 2.6.21-based kernels as well.  I've noticed similar errors with 
a Dell Latitude D600 (x86, not SiS), also running recent Fedora 7 kernels.


Searching the web seems to indicate that either the drive or the
controller is bad, but since the error appears in so many different
hardware configurations, I'm beginning to think this is a problem with
libata, as that is an obvious common element.

This e-mail is mostly an experience report, but I'm also wondering if I
should be concerned about data on the drive, or have I missed the real
problem entirely?

Addendum: The internal SMART error log on the Seagate shows these errors 
repeatedly:


Error 103 occurred at disk power-on lifetime: 296 hours (12 days + 8 hours)
  When the command that caused the error occurred, the device was
active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  c8 00

Re: DMA CRC errors with SiS chipset and notebook drive

2007-08-09 Thread Chuck Lever




Alan Cox wrote:

Aug  9 07:41:29 monet kernel: ata1.00: limiting speed to UDMA/66:PIO4
Aug  9 07:41:29 monet kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2
Aug  9 07:41:29 monet kernel: ata1.00: (BMDMA stat 0x64)
Aug  9 07:41:29 monet kernel: ata1.00: cmd
c8/00:08:fb:7c:a1/00:00:00:00:00/e0 tag 0 cdb 0x12 data 4096 in
Aug  9 07:41:29 monet kernel:  res
51/84:00:00:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Aug  9 07:41:29 monet kernel: ata1: soft resetting port
Aug  9 07:41:29 monet kernel: ata1.00: configured for UDMA/66
Aug  9 07:41:29 monet kernel: ata1.01: configured for UDMA/33
Aug  9 07:41:29 monet kernel: ata1: EH complete


Drops to UDMA33 for 40 wire cable.


It drops from UDMA/100 to UDMA/66 for the disk, but stays at UDMA/33 for 
the optical drive.  If I have only the disk on that controller, the disk 
still drops to UDMA/66.



Looks to me like your cabling is out of spec or you have electrical noise
problems.


I've tried different cabling solutions, including round and flat IDE 
cables.  What might cause "electrical noise"?  Are 44-40 pin adapters 
reliable?
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel
version:2.1
end:vcard

Re: request for patches: showing mount options

2007-07-27 Thread Chuck Lever


Miklos:

Some mount options are never passed to the kernel, and thus can't appear 
in /proc/mounts.  Examples include user, users, and _netdev for NFS.


Miklos Szeredi wrote:

[please consider pruning the CC list if discussing some aspect, which
doesn't concern all]

I've done an audit of all filesystems with regards to showing mount
options in /proc//mounts.  Unfortunately most of them show none
or only a part of all accepted options (for details see list of
filesystems at the end of the mail).

This is currently not a big problem, because mount(8) stores the given
options in /etc/mtab.  However we want to get rid of mtab, and this
requires, that the option showing be fixed up.

It would be easiest if this was done by the VFS instead of having to
deal with it in filesystems.  However there are differences in how
filesytems handle options during mount and remount, and it would be
impossible to take this into account in all cases.

If you are CC-ed, and responsible for one of these filesystems, please
take a moment to fully implement the ->show_options() method.  In most
cases it should be an easy task.

If for some reason you are unable to do this, please let me know and
I'll fix it up.

Here are some guidelines for showing options.  I'll also add these to
Documentation/filesystems/vfs.txt

+   If a filesystem accepts mount options, it must define show_options()
+   to show all the currently active options.  The rules are:
+
+ - options MUST be shown which are not default or their values differ
+   from the default
+
+ - options MAY be shown which are enabled by default or have their
+   default value
+
+   Options used only internally between a mount helper and the kernel
+   (such as file descriptors), or which only have an effect during the
+   mounting (such as ones controlling the creation of a journal) are exempt
+   from the above rules.

Thanks,
Miklos

---
legend:

  all - fs has options, but doesn't define ->show_options()
  some - fs defines ->show_options(), but some options are not shown
  noopt - fs does not have options
  good - fs shows all options
  patch - I have a patch


9p  some
adfsall (maintainer?)
affsall
afs all
autofs  all
autofs4 some
befsall
bfs noopt
cifssome (odd parser)
codanoopt
configfsnoopt
cramfs  noopt
debugfs noopt
devpts  patch
ecryptfssome
efs noopt
ext2patch
ext3patch
ext4patch
fat some
freevxfsnoopt
fusepatch
gfs2good
hfs good
hfsplus good
hostfs  patch
hpfsall
hppfs   noopt
hugetlbfs   all
isofs   all (maintainer?)
jffs2   noopt
jfs some
minix   noopt
msdos   ->fat
ncpfs   all (FS_BINARY_MOUNTDATA?)
nfs some
nfsdnoopt
ntfsgood (odd parser)
ocfs2   all
openpromfs  noopt
procnoopt
qnx4noopt
ramfs   noopt
reiserfsall
romfs   noopt
smbfs   good (odd parser) (maintainer?)
sysfs   noopt
sysvnoopt
udf all
ufs all
vfat->fat
xfs some (odd parser)

mm/shmem.cpatch
drivers/oprofile/oprofilefs.c noopt
drivers/infiniband/hw/ipath/ipath_fs.cnoopt
drivers/misc/ibmasm/ibmasmfs.cnoopt
drivers/usb/core (usbfs)  noopt
drivers/usb/gadget (gadgetfs) noopt
drivers/isdn/capi/capifs.cnoopt
kernel/cpuset.c   noopt
fs/binfmt_misc.c  noopt
net/sunrpc/rpc_pipe.c noopt
arch/powerpc/platforms/cell/spufs all
arch/s390/hypfs   all
ipc/mqueue.c  noopt
security (securityfs) noopt
security/selinux/selinuxfs.c  noopt

in -mm:

reiser4some (odd parser)
kernel/container.c good (odd parser)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: request for patches: showing mount options

2007-07-27 Thread Chuck Lever


Miklos Szeredi wrote:
Some mount options are never passed to the kernel, and thus can't appear 
in /proc/mounts.  Examples include user, users, and _netdev for NFS.


These options control *who* may mount and *when* to mount.  They are
not a property of the mount itself and are not added to /etc/mtab.

There's a "user=ID" option that is added to /etc/mtab in case of user
mounts.  This identifies the owner of the mount, so that it can be
unmounted by that user.  There are patches in -mm that enable the
kernel to store this info.

Do you have other examples in mind?


[no]quota comes to mind; also auto, [no]owner, [no]group, and 
quiet/loud, but these may fall into the same category you mention above.


Aside: It's a confusing artifact of the mount CLI that these options 
control who/when but are passed to the mount command in the same way the 
other options are.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: request for patches: showing mount options

2007-07-30 Thread Chuck Lever




Miklos Szeredi wrote:
Some mount options are never passed to the kernel, and thus can't appear 
in /proc/mounts.  Examples include user, users, and _netdev for NFS.

These options control *who* may mount and *when* to mount.  They are
not a property of the mount itself and are not added to /etc/mtab.

There's a "user=ID" option that is added to /etc/mtab in case of user
mounts.  This identifies the owner of the mount, so that it can be
unmounted by that user.  There are patches in -mm that enable the
kernel to store this info.

Do you have other examples in mind?


There are a few more cases for NFS mount.

After a successful mount, the NFS mount command tucks some options into 
/etc/mtab that reflect which mountd was used for the mount, and what 
protocol version and port was used for the mount request.  Those options 
are not passed to the kernel, and do not appear in /proc/mounts today. 
See nfs(5)'s discussion of the mountport, mounthost, mountprog, and 
mountvers options.


However, the trend for NFS is to push mount option parsing into the 
kernel.  Thus all options will be passed to the kernel, and at that 
point it should be able to reflect the mount* options in /proc/mounts. 
But it doesn't do that quite yet.


I'm wondering if there are other such cases in other file systems.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel
version:2.1
end:vcard

Re: request for patches: showing mount options

2007-07-31 Thread Chuck Lever


Miklos Szeredi wrote:
After a successful mount, the NFS mount command tucks some options into 
/etc/mtab that reflect which mountd was used for the mount, and what 
protocol version and port was used for the mount request.  Those options 
are not passed to the kernel, and do not appear in /proc/mounts today. 
See nfs(5)'s discussion of the mountport, mounthost, mountprog, and 
mountvers options.


However, the trend for NFS is to push mount option parsing into the 
kernel.  Thus all options will be passed to the kernel, and at that 
point it should be able to reflect the mount* options in /proc/mounts. 
But it doesn't do that quite yet.


Trond, do you have a roadmap for this?


Well I'm actually doing the coding, and Trond is playing more of an 
architectural role.


We have a first implementation of in-kernel mount option parsing in 
2.6.23-rc now.  I'm currently working on the user-space piece of this. 
(And actually, now is a great time to review the new kernel part, while 
it is still quite young.)


However, the NFS mount user-space pieces have undergone radical change 
recently.  The mount.nfs helper was split from the mount command just 
last year, and is only now starting to go into distributions.  This is 
very old code that has been hacked on for over a decade, so it is taking 
a little while to rediscover its history and modernize it before we move 
forward.


I expect that both the kernel part and the user-space part will evolve 
together over the next few months as we clarify the full set of 
requirements.  The requirements for this effort now include:


+ making new mount options simple to implement;

+ removing ABI dependencies between mount.nfs and the kernel NFS client;

+ an eventual merge of the nfs and nfs4 file system types;

+ improved error handling and reporting during the mount process;

+ support for NFS over IPv6.

I think there is also some talk about fully supporting SELinux as well, 
but I haven't been following that closely.


The removal of /etc/mtab in favor of /proc/mounts is a new requirement, 
and is not as trivial as you might hope.  Internally the NFS client 
represents the mount options as a binary data structure, and it contains 
only the information that has traditionally been passed into the kernel 
by the current mount command.  The user-space-only options are not 
passed to the kernel nor stored in the data structure.


Adding facilities to store information about every possible mount 
option, including the user-space-only ones, will take a bit of time, but 
is possible, if not straightforward.  We just have to understand all the 
dependencies.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel
version:2.1
end:vcard

Re: request for patches: showing mount options

2007-07-31 Thread Chuck Lever


Miklos Szeredi wrote:
After a successful mount, the NFS mount command tucks some options into 
/etc/mtab that reflect which mountd was used for the mount, and what 
protocol version and port was used for the mount request.  Those options 
are not passed to the kernel, and do not appear in /proc/mounts today. 
See nfs(5)'s discussion of the mountport, mounthost, mountprog, and 
mountvers options.


However, the trend for NFS is to push mount option parsing into the 
kernel.  Thus all options will be passed to the kernel, and at that 
point it should be able to reflect the mount* options in /proc/mounts. 
But it doesn't do that quite yet.

Trond, do you have a roadmap for this?
Well I'm actually doing the coding, and Trond is playing more of an 
architectural role.


OK, what your estimage for this then?


I don't have an estimate.  This is all very slippery because once I get 
into a part of the code, we discover a lot of issues that we didn't 
expect.  The NFS mount stuff is largely historical; we've all forgotten 
(or never really knew) how it works.



It would be nice to have all this stuff in 2.6.24, which doesn't leave
a lot of time.


Yes, that would be nice, but there's a lot of stuff that needs to get 
done before this.  NFS IPv6, for example, is a higher priority than 
refactoring to remove /etc/mtab -- the US government has a new 
requirement in 2008 for IPv6 support in any software that it purchases, 
and NFS may be a stumbling block for distributors if it doesn't have it.


So I'd say "no way" for 2.6.24, but it's really Trond's call to make.


But if it's just those four options you mentioned, it should be
managable.  I do not think there needs to be some generic code to
hande userspace-only options, it would be perfectly fine just to
parse, store and show them like all the other options.


Like you, I don't expect it will be difficult to implement, but I have 
too many balls in the air to make any promises at the moment.  Plus, we 
can't really predict when distributors will feel the in-kernel mount 
parsing stuff will be ready for their users.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel
version:2.1
end:vcard

[PATCH] Fix the sign of the result of a conditional expression

2007-08-15 Thread Chuck Lever

In include/asm-x86_64/bitops.h, the find_{first,next,first_zero,next_zero}_bit
macros return a result type that depends on the width of the "size" argument.
The type of both arms of a conditional expression should always be the same.

I changed the return type of __scanbit() to match the return type of the
x86_64 find_*_bit() functions.

-- 
corporate:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Eliminate result signage problem in asm-x86_64/bitops.h

2007-08-15 Thread Chuck Lever

The return type of __scanbit() doesn't match the return type of
find_{first,next}_bit().  Thus when you construct something like
this:

   boolean ? __scanbit() : find_first_bit()

you get an unsigned long result if "boolean" is true, and a signed
long result if "boolean" is false.

In file included from /home/cel/src/linux/include/linux/mmzone.h:15,
 from /home/cel/src/linux/include/linux/gfp.h:4,
 from /home/cel/src/linux/include/linux/slab.h:14,
 from /home/cel/src/linux/include/linux/percpu.h:5,
 from
/home/cel/src/linux/include/linux/rcupdate.h:41,
 from /home/cel/src/linux/include/linux/dcache.h:10,
 from /home/cel/src/linux/include/linux/fs.h:275,
 from /home/cel/src/linux/fs/nfs/sysctl.c:9:
/home/cel/src/linux/include/linux/nodemask.h: In function
‘__first_node’:
/home/cel/src/linux/include/linux/nodemask.h:229: warning: signed and
unsigned type in conditional expression
/home/cel/src/linux/include/linux/nodemask.h: In function
‘__next_node’:
/home/cel/src/linux/include/linux/nodemask.h:235: warning: signed and
unsigned type in conditional expression
/home/cel/src/linux/include/linux/nodemask.h: In function
‘__first_unset_node’:
/home/cel/src/linux/include/linux/nodemask.h:253: warning: signed and
unsigned type in conditional expression

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 include/asm-x86_64/bitops.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/asm-x86_64/bitops.h b/include/asm-x86_64/bitops.h
index d4dbbe5..1d7d9b4 100644
--- a/include/asm-x86_64/bitops.h
+++ b/include/asm-x86_64/bitops.h
@@ -260,7 +260,7 @@ extern long find_first_bit(const unsigned long * addr, 
unsigned long size);
 extern long find_next_bit(const unsigned long * addr, long size, long offset);
 
 /* return index of first bet set in val or max when no bit is set */
-static inline unsigned long __scanbit(unsigned long val, unsigned long max)
+static inline long __scanbit(unsigned long val, unsigned long max)
 {
asm("bsfq %1,%0 ; cmovz %2,%0" : "=&r" (val) : "r" (val), "r" (max));
return val;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Eliminate result signage problem in asm-x86_64/bitops.h

2007-08-15 Thread Chuck Lever


I apologize for sending a separate cover letter for a single patch.

Andi Kleen wrote:

On Wed, Aug 15, 2007 at 05:02:47PM -0400, Chuck Lever wrote:

The return type of __scanbit() doesn't match the return type of
find_{first,next}_bit().  Thus when you construct something like
this:

   boolean ? __scanbit() : find_first_bit()


Why would you want to write this?  What is boolean?
Do they have different arguments?


So here's the definition of the x86_64 find_first_bit() macro, straight 
from include/x86_64/bitops.h:


#define find_first_bit(addr,size) \
((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
  (__scanbit(*(unsigned long *)addr,(size))) : \
  find_first_bit(addr,size)))

In this case "boolean" is:

   __builtin_constant_p(size) && (size) <= BITS_PER_LONG

the first arm of the conditional is:

   __scanbit(*(unsigned long *)addr,(size))

the second arm of the conditional is:

   find_first_bit(addr,size)

(this is the "function" version of find_first_bit, not the macro that's 
being defined.  The naming here is unfortunately confusing).


Thus, roughly speaking, when the type of "size" is smaller than a long, 
the macro's return type evaluates to unsigned long.  If "size" is larger 
than a long, the macro's return type evaluates to signed long.


By making the return type of __scanbit() an unsigned long, both arms of 
the conditional evaluate to the same result type.



It's on my todo list for some time to special case
f_f_b() and friends for smaller arguments. Would
that eliminate this construct?


Well, I can only assume what you mean by this, but I think that would 
address the problem.


My real interest here is to eliminate a whole lot of compiler noise when 
I enable -Wsign-compare for certain parts of the kernel.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel
version:2.1
end:vcard

[PATCH 4/5] NET: Make ts_recent_stamp unsigned

2007-10-23 Thread Chuck Lever

The get_seconds() function returns an unsigned long.  To prevent incorrect
comparison results between values saved in ts_recent_stamp and later
invocations of get_seconds(), change the type of ts_recent_stamp to match
the return type of get_seconds().

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
---

 include/linux/tcp.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bac17c5..129ddb4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -206,7 +206,7 @@ struct tcp_sack_block {
 
 struct tcp_options_received {
 /* PAWS/RTTM data  */
-   longts_recent_stamp;/* Time we stored ts_recent (for aging) */
+   unsigned long ts_recent_stamp;/* Time we stored ts_recent (for aging) */
u32 ts_recent;  /* Time stamp to echo next  */
u32 rcv_tsval;  /* Time stamp value */
u32 rcv_tsecr;  /* Time stamp echo reply*/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/5] NET: Treat the sign of the result of skb_headroom() consistently

2007-10-23 Thread Chuck Lever

In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer.  Make
the comparisons consistent and correct.

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
---

 include/linux/skbuff.h |4 ++--
 net/ipv4/ip_gre.c  |2 +-
 net/ipv4/ip_output.c   |2 +-
 net/ipv4/ipip.c|2 +-
 net/ipv4/ipvs/ip_vs_xmit.c |2 +-
 net/ipv4/tcp_input.c   |2 +-
 net/ipv6/ip6_output.c  |2 +-
 net/ipv6/ip6_tunnel.c  |2 +-
 net/ipv6/sit.c |2 +-
 9 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91f0bac..57257d2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -994,7 +994,7 @@ static inline int pskb_may_pull(struct sk_buff *skb, 
unsigned int len)
  *
  * Return the number of bytes of free space at the head of an &sk_buff.
  */
-static inline int skb_headroom(const struct sk_buff *skb)
+static inline unsigned int skb_headroom(const struct sk_buff *skb)
 {
return skb->data - skb->head;
 }
@@ -1347,7 +1347,7 @@ static inline struct sk_buff *netdev_alloc_skb(struct 
net_device *dev,
  * Returns true if modifying the header part of the cloned buffer
  * does not requires the data to be copied.
  */
-static inline int skb_clone_writable(struct sk_buff *skb, int len)
+static inline int skb_clone_writable(struct sk_buff *skb, unsigned int len)
 {
return !skb_header_cloned(skb) &&
   skb_headroom(skb) + len <= skb->hdr_len;
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f151900..c3568ab 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -674,7 +674,7 @@ static int ipgre_tunnel_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct rtable *rt;  /* Route to the other host */
struct net_device *tdev;/* Device to other host 
*/
struct iphdr  *iph; /* Our new IP header */
-   intmax_headroom;/* The extra header space 
needed */
+   unsigned int max_headroom;  /* The extra header space 
needed */
intgre_hlen;
__be32 dst;
intmtu;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index f508835..e5f7dc2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -161,7 +161,7 @@ static inline int ip_finish_output2(struct sk_buff *skb)
struct dst_entry *dst = skb->dst;
struct rtable *rt = (struct rtable *)dst;
struct net_device *dev = dst->dev;
-   int hh_len = LL_RESERVED_SPACE(dev);
+   unsigned int hh_len = LL_RESERVED_SPACE(dev);
 
if (rt->rt_type == RTN_MULTICAST)
IP_INC_STATS(IPSTATS_MIB_OUTMCASTPKTS);
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 5cd5bbe..8c2b2b0 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -515,7 +515,7 @@ static int ipip_tunnel_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct net_device *tdev;/* Device to other host 
*/
struct iphdr  *old_iph = ip_hdr(skb);
struct iphdr  *iph; /* Our new IP header */
-   intmax_headroom;/* The extra header space 
needed */
+   unsigned int max_headroom;  /* The extra header space 
needed */
__be32 dst = tiph->daddr;
intmtu;
 
diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c
index d0a92de..7c074e3 100644
--- a/net/ipv4/ipvs/ip_vs_xmit.c
+++ b/net/ipv4/ipvs/ip_vs_xmit.c
@@ -325,7 +325,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn 
*cp,
__be16 df = old_iph->frag_off;
sk_buff_data_t old_transport_header = skb->transport_header;
struct iphdr  *iph; /* Our new IP header */
-   intmax_headroom;/* The extra header space 
needed */
+   unsigned int max_headroom;  /* The extra header space 
needed */
intmtu;
 
EnterFunction(10);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9288220..3dbbb44 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3909,7 +3909,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list,
 
while (before(start, end)) {
struct sk_buff *nskb;
-   int header = skb_headroom(skb);
+   unsigned int header = skb_headroom(skb);
int copy = SKB_MAX_ORDER(header, 0);
 
/* Too big header? This can happen with IPv6. */
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 13565df..653fc0a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -171,7 +171,7 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct 
flowi *fl,
u32 m

[PATCH 5/5] NET: Remove unneeded implicit type case when calling tcp_minshall_update()

2007-10-23 Thread Chuck Lever

The tcp_minshall_update() function is called in exactly one place, and is
passed an unsigned integer for the mss_len argument.  Make the sign of the
argument match the sign of the passed variable in order to eliminate an
unneeded implicit type cast and a mixed sign comparison in
tcp_minshall_update().

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
---

 include/net/tcp.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 92049e6..d695cea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -803,7 +803,7 @@ static inline int tcp_is_cwnd_limited(const struct sock 
*sk, u32 in_flight)
return left <= tcp_max_burst(tp);
 }
 
-static inline void tcp_minshall_update(struct tcp_sock *tp, int mss,
+static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
   const struct sk_buff *skb)
 {
if (skb->len < mss)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] NET: Make ts_recent_stamp unsigned

2007-10-24 Thread Chuck Lever

David Miller wrote:

From: Chuck Lever <[EMAIL PROTECTED]>
Date: Tue, 23 Oct 2007 11:44:28 -0400

The get_seconds() function returns an unsigned long.  To prevent incorrect
comparison results between values saved in ts_recent_stamp and later
invocations of get_seconds(), change the type of ts_recent_stamp to match
the return type of get_seconds().

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>

I see two potential problems with this patch:

1) If you update struct tcp_options_received you should also
   update struct tcp_timewait_sock similarly.

   The fact that you missed this suggests that you didn't
   grep the tree to see how else this variable is used and
   this makes me extra concerned about this patch's correctness.

Perhaps the result of wishful thinking on my part.  I was hoping for a 
small and self-contained change.

2) There are computations in the TCP stack using this member that
   probably care about the signedness, such as:

net/ipv4/tcp_ipv4.c: get_seconds() - 
tcptw->tw_ts_recent_stamp > 1))) {
include/net/tcp.h:  if (get_seconds() >= rx_opt->ts_recent_stamp + 
TCP_PAWS_24DAYS)
include/net/tcp.h:  if (get_seconds() >= rx_opt->ts_recent_stamp + 
TCP_PAWS_24DAYS)

   We should make sure we understand what is expected here, and
   why it would still be correct after making both ts_recent_stamp
   members unsigned.

Agreed.

I wonder how one could construct a series of mixed case time stamp 
comparisons *on purpose* (and without documentation of this assumption) 
that produces consistently correct results.

From the invocations of get_seconds() that I sampled, the design of 
these comparisons seems to assume that both sides of the comparison are 
non-negative.  However, they do not seem to account for time crossing zero.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [2.6 patch] make sunrpc/xprtsock.c:xs_setup_{udp,tcp}() static

2007-10-24 Thread Chuck Lever


Adrian Bunk wrote:

xs_setup_{udp,tcp}() can now become static.


ACK.  Sorry this was overlooked.


Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 include/linux/sunrpc/xprtsock.h |6 --
 net/sunrpc/xprtsock.c   |4 ++--
 2 files changed, 2 insertions(+), 8 deletions(-)

833a31c8caef70589f33be8e3a1fc9d8e01ce3c2 
diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h

index 2c6c2c2..c2a46c4 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -9,12 +9,6 @@
 
 #ifdef __KERNEL__
 
-/*

- * Socket transport setup operations
- */
-struct rpc_xprt *xs_setup_udp(struct xprt_create *args);
-struct rpc_xprt *xs_setup_tcp(struct xprt_create *args);
-
 intinit_socket_xprt(void);
 void   cleanup_socket_xprt(void);
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c

index 02298f5..2f630a5 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1828,7 +1828,7 @@ static struct rpc_xprt *xs_setup_xprt(struct xprt_create 
*args,
  * @args: rpc transport creation arguments
  *
  */
-struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
+static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
 {
struct sockaddr *addr = args->dstaddr;
struct rpc_xprt *xprt;
@@ -1894,7 +1894,7 @@ struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
  * @args: rpc transport creation arguments
  *
  */
-struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
+static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
 {
struct sockaddr *addr = args->dstaddr;
struct rpc_xprt *xprt;



begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [NFS] [PATCH 001 of 9] knfsd: nfsd4: fix non-terminated string

2007-02-14 Thread Chuck Lever


Ming Zhang wrote:

On Tue, 2007-02-13 at 10:44 +1100, NeilBrown wrote:

From: J. Bruce Fields <[EMAIL PROTECTED]>
The server name is expected to be a null-terminated string, so we can't
pass in the raw client identifier.

What's more, the client identifier is just a binary, not necessarily
printable, blob.  Let's just use the ip address instead.  The server
name appears to exist just to help debugging by making some printk's
more informative.

Note that the string is copies into the rpc client structure, so
the pointer to the local variable does not outlive the function call.

Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./fs/nfsd/nfs4callback.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff .prev/fs/nfsd/nfs4callback.c ./fs/nfsd/nfs4callback.c
--- .prev/fs/nfsd/nfs4callback.c2007-02-13 09:50:26.0 +1100
+++ ./fs/nfsd/nfs4callback.c2007-02-13 10:00:59.0 +1100
@@ -387,7 +387,6 @@ nfsd4_probe_callback(struct nfs4_client 
 		.address	= (struct sockaddr *)&addr,

.addrsize   = sizeof(addr),
.timeout= &timeparms,
-   .servername = clp->cl_name.data,
.program= program,
.version= nfs_cb_version[1]->number,
.authflavor = RPC_AUTH_UNIX,/* XXX: need 
AUTH_GSS... */
@@ -397,6 +396,7 @@ nfsd4_probe_callback(struct nfs4_client 
 		.rpc_proc   = &nfs4_cb_procedures[NFSPROC4_CLNT_CB_NULL],

.rpc_argp   = clp,
};
+   char clientname[16];
int status;
 
 	if (atomic_read(&cb->cb_set))
@@ -419,6 +419,11 @@ nfsd4_probe_callback(struct nfs4_client 
 	memset(program->stats, 0, sizeof(cb->cb_stat));

program->stats->program = program;
 
+	/* Just here to make some printk's more useful: */

+   snprintf(clientname, sizeof(clientname),
+   "%u.%u.%u.%u", NIPQUAD(addr.sin_addr));


can use NIPQUAD_FMT here instead of "%u.%u.%u.%u".

btw, will the ip address here possibly be an ipv6 address?


Some patches are in the works to build in IPv6 support.  See the patch 
series at http://oss.oracle.com/~cel/linux-2.6/2.6.19/patches/
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture Linux Projects Group
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: 2.6.20-git8 fails compile -- net/built-in.o __ipv6_addr_type

2007-02-14 Thread Chuck Lever

David Miller wrote:

From: Pete Clements <[EMAIL PROTECTED]>
Date: Mon, 12 Feb 2007 20:10:13 -0500 (EST)

2.6.20-git8 fails compile:

  CHK include/linux/compile.h
  UPD include/linux/compile.h
  CC  init/version.o
  LD  init/built-in.o
  LD  .tmp_vmlinux1
net/built-in.o: In function `svc_udp_recvfrom':
svcsock.c:(.text+0x61be4): undefined reference to `__ipv6_addr_type'
make: *** [.tmp_vmlinux1] Error 1

Chuck, you will need to somehow make CONFIG_SUNRPC "depend" upon IPV6
so that if IPV6 is modular SUNRPC can only be built modular.

Otherwise the symbols won't resolve correctly.

Everybody hits this problem when they add ipv6 support to
something. :-)

Interestingly, doing a build with ALLYESCONFIG, ALLMODCONFIG, and 
ALLNOCONFIG doesn't catch this type of error.

I just did a copy+paste and that brought in ipv6_addr_type.  So I'm not 
convinced it's really needed here.  David, can you take a look at the 
code in svcsock.c right around the ipv6_addr_type() call and let me know 
if we can avoid that call outright?
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture Linux Projects Group
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.

2007-02-02 Thread Chuck Lever

Roland Dreier wrote:

 > They are mostly from Chuck Level and make preparating for IPv6 support
 > in the NFS server.
 > They are *not* for 2.6.20, but should be ok for .21.

Out of curiousity, does this patch series reduce the delta between the
NFS/RDMA tree and mainline Linux?  In other words does this bring
NFS/RDMA closer to merging?

Hi Roland-

The client side support for an RPC/RDMA module is almost completely 
integrated into mainline.  There is still a minimal set of patches 
required to support alternate transports in loadable kernel modules 
which Trond has indicated he will integrate when the RPC/RDMA transport 
is ready to be integrated.

At this time I'm not aware of a plan to integrate server-side support 
for NFS/RDMA.  Perhaps the NetApp RDMA developers could respond.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture Linux Projects Group
email;internet:chuck dot lever at nospam oracle dot com
title:Principle Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [NFS] [PATCH 002 of 13] knfsd: SUNRPC: allow creating an RPC service without registering with portmapper

2006-12-08 Thread Chuck Lever


Oops.  This one looks old.  Let me see if I can dig up the latest.

On 12/8/06, Trond Myklebust <[EMAIL PROTECTED]> wrote:

On Fri, 2006-12-08 at 23:01 +1100, NeilBrown wrote:
> From: Chuck Lever <[EMAIL PROTECTED]>
> Sometimes we need to create an RPC service but not register it with the
> local portmapper.  NFSv4 delegation callback, for example.
>
> Change the svc_makesock() API to allow optionally creating temporary or
> permanent sockets, optionally registering with the local portmapper, and
> make it return the ephemeral port of the new socket.

NAK. This one is still buggy.

The NFSv4 callback server should _NOT_ be registering its listening
socket on the RPC server 'temporary' list.

Trond


> Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
> Cc: Aurelien Charbon <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
>
> ### Diffstat output
>  ./fs/lockd/svc.c |   26 --
>  ./fs/nfs/callback.c  |   20 +---
>  ./fs/nfsd/nfssvc.c   |6 --
>  ./include/linux/sunrpc/svcsock.h |2 +-
>  ./net/sunrpc/svcsock.c   |6 --
>  5 files changed, 34 insertions(+), 26 deletions(-)
>
> diff .prev/fs/lockd/svc.c ./fs/lockd/svc.c
> --- .prev/fs/lockd/svc.c  2006-12-08 13:36:33.0 +1100
> +++ ./fs/lockd/svc.c  2006-12-08 13:36:33.0 +1100
> @@ -223,23 +223,29 @@ static int find_socket(struct svc_serv *
>   return found;
>  }
>
> +/*
> + * Make any sockets that are needed but not present.
> + * If nlm_udpport or nlm_tcpport were set as module
> + * options, make those sockets unconditionally
> + */
>  static int make_socks(struct svc_serv *serv, int proto)
>  {
> - /* Make any sockets that are needed but not present.
> -  * If nlm_udpport or nlm_tcpport were set as module
> -  * options, make those sockets unconditionally
> -  */
> - static int  warned;
> + static int warned;
>   int err = 0;
> +
>   if (proto == IPPROTO_UDP || nlm_udpport)
>   if (!find_socket(serv, IPPROTO_UDP))
> - err = svc_makesock(serv, IPPROTO_UDP, nlm_udpport);
> - if (err == 0 && (proto == IPPROTO_TCP || nlm_tcpport))
> + err = svc_makesock(serv, IPPROTO_UDP, nlm_udpport,
> + SVC_SOCK_DEFAULTS);
> + if (err >= 0 && (proto == IPPROTO_TCP || nlm_tcpport))
>   if (!find_socket(serv, IPPROTO_TCP))
> - err= svc_makesock(serv, IPPROTO_TCP, nlm_tcpport);
> - if (!err)
> + err = svc_makesock(serv, IPPROTO_TCP, nlm_tcpport,
> + SVC_SOCK_DEFAULTS);
> +
> + if (err >= 0) {
>   warned = 0;
> - else if (warned++ == 0)
> + err = 0;
> + } else if (warned++ == 0)
>   printk(KERN_WARNING
>  "lockd_up: makesock failed, error=%d\n", err);
>   return err;
>
> diff .prev/fs/nfs/callback.c ./fs/nfs/callback.c
> --- .prev/fs/nfs/callback.c   2006-12-08 13:36:33.0 +1100
> +++ ./fs/nfs/callback.c   2006-12-08 13:36:33.0 +1100
> @@ -106,7 +106,6 @@ static void nfs_callback_svc(struct svc_
>  int nfs_callback_up(void)
>  {
>   struct svc_serv *serv;
> - struct svc_sock *svsk;
>   int ret = 0;
>
>   lock_kernel();
> @@ -119,17 +118,14 @@ int nfs_callback_up(void)
>   ret = -ENOMEM;
>   if (!serv)
>   goto out_err;
> - /* FIXME: We don't want to register this socket with the portmapper */
> - ret = svc_makesock(serv, IPPROTO_TCP, nfs_callback_set_tcpport);
> - if (ret < 0)
> +
> + ret = svc_makesock(serv, IPPROTO_TCP, nfs_callback_set_tcpport,
> + (SVC_SOCK_ANONYMOUS | SVC_SOCK_TEMPORARY));
> + if (ret <= 0)
>   goto out_destroy;
> - if (!list_empty(&serv->sv_permsocks)) {
> - svsk = list_entry(serv->sv_permsocks.next,
> - struct svc_sock, sk_list);
> - nfs_callback_tcpport = ntohs(inet_sk(svsk->sk_sk)->sport);
> - dprintk ("Callback port = 0x%x\n", nfs_callback_tcpport);
> - } else
> - BUG();
> + nfs_callback_tcpport = ret;
> + dprintk("Callback port = 0x%x\n", nfs_callback_tcpport);
> +
>   ret = svc_create_thread(nfs_callback_svc, serv);
>   if (ret < 0)
>   goto out_destroy;
> @@ -140,6 +136,8 @@ out:
>   unlock_kernel();
>   return ret;
>  out_destroy:
> +

Re: [NFS] [PATCH 010 of 14] knfsd: SUNRPC: add a "generic" function to see if the peer uses a secure port

2006-12-13 Thread Chuck Lever


On 12/12/06, Andrew Morton <[EMAIL PROTECTED]> wrote:

On Wed, 13 Dec 2006 10:59:27 +1100
NeilBrown <[EMAIL PROTECTED]> wrote:

> From: Chuck Lever <[EMAIL PROTECTED]>
> The only reason svcsock.c looks at a sockaddr's port is to check whether
> the remote peer is connecting from a privileged port.  Refactor this check
> to hide processing that is specific to address format.
>
> Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
> Cc: Aurelien Charbon <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
>
> ### Diffstat output
>  ./net/sunrpc/svcsock.c |   20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
> --- .prev/net/sunrpc/svcsock.c2006-12-13 10:32:15.0 +1100
> +++ ./net/sunrpc/svcsock.c2006-12-13 10:32:17.0 +1100
> @@ -926,6 +926,20 @@ svc_tcp_data_ready(struct sock *sk, int
>   wake_up_interruptible(sk->sk_sleep);
>  }
>
> +static inline int svc_port_is_privileged(struct sockaddr *sin)
> +{
> + switch (sin->sa_family) {
> + case AF_INET:
> + return ntohs(((struct sockaddr_in *)sin)->sin_port) < 1024;
> +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> + case AF_INET6:
> + return ntohs(((struct sockaddr_in6 *)sin)->sin6_port) < 1024;
> +#endif
> + default:
> + return 0;
> + }
> +}

I'm a bit surprised to see this test implemented in sunrpc - it's the sort
of thing which core networking should implement?


The check is open-coded in each socket type's bind callout, and
includes a capability check which I believe the NFS server doesn't
require.


And should that "1024" be PROT_SOCK?


All I can say is "Doh!"  I'll send Neil a replacement with this fixed.

--
"We who cut mere stones must always be envisioning cathedrals"
  -- Quarry worker's creed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [NFS] [PATCH 2.6.19-rc6] sunrpc: fix race condition

2006-11-27 Thread Chuck Lever


On 11/27/06, Chris Caputo <[EMAIL PROTECTED]> wrote:

From: Chris Caputo <[EMAIL PROTECTED]>
[PATCH 2.6.19-rc6] sunrpc: fix race condition

Patch linux-2.6.10-01-rpc_workqueue.dif introduced a race condition into
net/sunrpc/sched.c in kernels 2.6.11-rc1 through 2.6.19-rc6.  The race
scenario is as follows...

Given: RPC_TASK_QUEUED, RPC_TASK_RUNNING and RPC_TASK_ASYNC are set.

__rpc_execute() (no spinlock)rpc_make_runnable() (queue spinlock held)
--
 do_ret = rpc_test_and_set_running(task);
rpc_clear_running(task);
if (RPC_IS_ASYNC(task)) {
if (RPC_IS_QUEUED(task))
return 0;
 rpc_clear_queued(task);
 if (do_ret)
 return;

Thus both threads return and the task is abandoned forever.

In my test NFS client usage (~200 Mb/s at ~3,000 RPC calls/s) this race
condition has resulted in processes getting permanently stuck in 'D' state
often in less than 15 minutes of uptime.

The following patch fixes the problem by returning to use of a spinlock in
__rpc_execute().

Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
---

diff -up a/net/sunrpc/sched.c b/net/sunrpc/sched.c
--- a/net/sunrpc/sched.c2006-11-27 08:41:07.0 +
+++ b/net/sunrpc/sched.c2006-11-27 11:14:21.0 +
@@ -587,6 +587,7 @@ EXPORT_SYMBOL(rpc_exit_task);
 static int __rpc_execute(struct rpc_task *task)
 {
int status = 0;
+   struct rpc_wait_queue *queue;

dprintk("RPC: %4d rpc_execute flgs %x\n",
task->tk_pid, task->tk_flags);
@@ -631,22 +632,27 @@ static int __rpc_execute(struct rpc_task
lock_kernel();
task->tk_action(task);
unlock_kernel();
+   /* micro-optimization to avoid spinlock */
+   if (!RPC_IS_QUEUED(task))
+   continue;
}

/*
-* Lockless check for whether task is sleeping or not.
+* Check whether task is sleeping.
 */
-   if (!RPC_IS_QUEUED(task))
-   continue;
-   rpc_clear_running(task);
+   queue = task->u.tk_wait.rpc_waitq;
+   spin_lock_bh(&queue->lock);
if (RPC_IS_ASYNC(task)) {
-   /* Careful! we may have raced... */
-   if (RPC_IS_QUEUED(task))
-   return 0;
-   if (rpc_test_and_set_running(task))
+   if (RPC_IS_QUEUED(task)) {
+   rpc_clear_running(task);
+   spin_unlock_bh(&queue->lock);
return 0;
+   }
+   spin_unlock_bh(&queue->lock);
continue;
}
+   rpc_clear_running(task);
+   spin_unlock_bh(&queue->lock);

/* sync task: sleep here */
dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid);


The reason the spin lock was removed from the scheduler is because
once the BKL is removed from the RPC and NFS clients, the locks in the
RPC scheduler become contented.  Each RPC request passes through this
part of the scheduler an average of 12 times (probably more if a bind
or credential refresh is required), so the locking overhead becomes
critical.

As you are working this fix, can you make sure to test with a heavy
RPC workload against a fast server and make sure that lock contention
remains reasonable?

--
"We who cut mere stones must always be envisioning cathedrals"
  -- Quarry worker's creed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-20 Thread Chuck Lever


Al Viro wrote:

On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:

... or, alternatively, add a subfield to the first field (which would
entail escaping whatever separator we choose):

/dev/md6 /export ext3 rw,data=ordered 0 0
/dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
/dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0


Hell, no.  The first field is in principle impossible to parse unless
you know the fs type.

How about making a new file with sane format?  From the very
beginning.  E.g. mountpoint + ID + relative path + type + options,
where ID uniquely identifies superblock (e.g. numeric st_dev)
and backing device (if any) is sitting among the options...


To support NFS client performance statistics, I recently added 
/proc/self/mountstats.  That might be a place to add details about 
--move and --bind mounts without changing the format of /proc/mounts.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard

Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-20 Thread Chuck Lever


H. Peter Anvin wrote:

Chuck Lever wrote:

To support NFS client performance statistics, I recently added
/proc/self/mountstats.  That might be a place to add details about
--move and --bind mounts without changing the format of /proc/mounts.


I just looked at /proc/self/mountstats; it seems to have no more
information than /proc/self/mounts, but in an even more annoying format.
 Either I'm missing something, this file doesn't add anything at all.


The advantage is that it doesn't have strong user space dependencies on 
its format like /proc/mounts does.


If you have NFS mount points, you will see that it includes a great deal 
of additional information about each mount.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard

Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-20 Thread Chuck Lever


H. Peter Anvin wrote:

Chuck Lever wrote:

The advantage is that it doesn't have strong user space dependencies on
its format like /proc/mounts does.

If you have NFS mount points, you will see that it includes a great deal
of additional information about each mount.


OK, I see now:
device raidtest:/export mounted on /net/raidtest/export with fstype nfs
statvers=1.0
opts:
rw,vers=3,rsize=131072,wsize=131072,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys
age:5
caps:   caps=0x9,wtmult=4096,dtsize=4096,bsize=0,namelen=255
sec:flavor=1,pseudoflavor=1
events: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bytes:  0 0 0 0 0 0 0 0
RPC iostats version: 1.0  p/v: 13/3 (nfs)
xprt:   tcp 686 0 2 0 5 8 8 0 8 0
per-op statistics
NULL: 0 0 0 0 0 0 0 0
 GETATTR: 2 2 0 264 224 1 0 1
 SETATTR: 0 0 0 0 0 0 0 0
  LOOKUP: 0 0 0 0 0 0 0 0
  ACCESS: 1 1 0 116 120 0 0 0
READLINK: 0 0 0 0 0 0 0 0
READ: 0 0 0 0 0 0 0 0
   WRITE: 0 0 0 0 0 0 0 0
  CREATE: 0 0 0 0 0 0 0 0
   MKDIR: 0 0 0 0 0 0 0 0
 SYMLINK: 0 0 0 0 0 0 0 0
   MKNOD: 0 0 0 0 0 0 0 0
  REMOVE: 0 0 0 0 0 0 0 0
   RMDIR: 0 0 0 0 0 0 0 0
  RENAME: 0 0 0 0 0 0 0 0
LINK: 0 0 0 0 0 0 0 0
 READDIR: 0 0 0 0 0 0 0 0
 READDIRPLUS: 0 0 0 0 0 0 0 0
  FSSTAT: 1 1 0 132 84 0 1 1
  FSINFO: 1 1 0 132 80 0 0 0
PATHCONF: 0 0 0 0 0 0 0 0
  COMMIT: 0 0 0 0 0 0 0 0

This format is just awful for parsing.  It's pretty clearly totally
ad-hoc.  It's not even self-consistent (it uses different separators,
etc, in the same file!)  It's reasonably compact for human consumption,
but it doesn't show what the arrays mean.

Heck, XML would have been better than this mess...


Sigh.  So where where you when I asked for review time and again?

I have a couple of simple Python scripts that can parse this without any 
difficulty.


I resent your tone.  Quite a bit.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard

Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-14 Thread Chuck Lever


Hi Chris-

John Stoffel wrote:

As a user of Netapps, having quotas (if only for reporting purposes)
and some way to migrate non-used files to slower/cheaper storage would
be great.

Ie. being able to setup two pools, one being RAID6, the other being
RAID1, where all currently accessed files are in the RAID1 setup, but
if un-used get migrated to the RAID6 area.  


And of course some way for efficient backups and more importantly
RESTORES of data which is segregated like this.  


I like the way dump and restore was handled in AFS (and now ZFS and 
NetApp).  There is a simple command to flatten a file system and send it 
to another system, which can receive it and re-expand it.  The 
dump/restore process uses snapshots and can easily send incremental 
backups which are significantly smaller than 0-level.  This is somewhat 
better than rsync, because you don't need checksums to discover what 
data has changed -- you already have the new data segregated into 
copied-on-write blocks.


NetApp happens to use the standard NDMP protocol for sending the 
flattened file system.  NetApp uses it for synchronous replication, 
volume migration, and back up to nearline storage and tape.  AFS used 
"vol dump" and "vol restore" for migration, replication, and back-up. 
ZFS has the "zfs send" and "zfs receive" commands that do basically the 
same (Eric Kustarz recently published a blog entry that described how 
these work).  And of course, all file system objects are able to be sent 
this way:  streams, xattrs, ACLs, and so on are all supported.


Note also that NFSv4 supports the idea of migrated or replicated file 
objects.  All that is needed to support it is a mechanism on the servers 
to actually move the data.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-15 Thread Chuck Lever


Chris Mason wrote:

On Thu, Jun 14, 2007 at 02:20:26PM -0400, Chuck Lever wrote:
NetApp happens to use the standard NDMP protocol for sending the 
flattened file system.  NetApp uses it for synchronous replication, 
volume migration, and back up to nearline storage and tape.  AFS used 
"vol dump" and "vol restore" for migration, replication, and back-up. 
ZFS has the "zfs send" and "zfs receive" commands that do basically the 
same (Eric Kustarz recently published a blog entry that described how 
these work).  And of course, all file system objects are able to be sent 
this way:  streams, xattrs, ACLs, and so on are all supported.


Note also that NFSv4 supports the idea of migrated or replicated file 
objects.  All that is needed to support it is a mechanism on the servers 
to actually move the data.


Stringing the replication together with the underlying FS would be neat.
Is there a way to deal with a master/slave setup, where the slave may be
out of date?


Among the implementations I'm aware of, there is a varying degree of 
integration into the physical file system.  In general, it depends on 
how far out of date the slave is, and how closely the slave is supposed 
to be synchronized to the master.


A hot backup file system, for example, should be data-consistent within 
a few seconds of the master.  A snapshot is used to initialize a slave, 
followed by a live stream of updates to the master being sent to slaves. 
 Such a mechanism already exists on NetApp filers because they gather 
changes in NVRAM before committing them to the local file system. 
Simply put, these changes can also be bundled and sent to a local hot 
backup filer that is attached via Infiniband, or over the network to a 
remote hot backup filer.


For AFS, replication is done by maintaining a rw and ro copy of a volume 
on the designated master server.  Changes are made to the rw copy over 
time.  When admins want to push out a new version to replicas on another 
server, the ro copy on the master is replaced with a new snapshot, then 
this is pushed to the slaves.  The replicas are always ro and are used 
mostly for load balancing; clients contact the closest or fastest server 
containing a replica of the volume they want to access.  They always 
have a complete copy of the volume (ie no COW on the slaves).


I think you have designed into btrfs a lot of opportunity to implement 
this kind of data virtualization and management... I'm excited to see 
what can be done.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard

Re: [bisect] NFS regression breaks X

2007-05-10 Thread Chuck Lever


Trond Myklebust wrote:

On Wed, 2007-05-09 at 15:52 -0700, Andrew Morton wrote:

It's a bit rough that Jeff spent a large amount of time hunting down an
already-known bug.  That's normally my job :(


The bug was reported by Florin Iucha (on lkml!) on Saturday. It has only
just been debugged, and I was in fact in the middle of marshalling the
fixes.


This five-week-old diff only ever appeared in 2.6.21-mm1, which was
released four days ago.  It was then whizzed into mainline.  We thus lost
five weeks public testing which would probably have saved Jeff his pain.

What went wrong?


Probably my fault. I've had a couple of weeks of heavy travel due to
various circumstances that were beyond my control, and so I had little
time in which to test the stuff and push it out.

Another factor that is affecting us is the slow but gradual collapse of
the OSDL NFSv4 regression testing effort.


I had expected that many of these issues would be caught by the OSDL 
test harness.  I learned only yesterday that it was no longer available, 
so I am making an effort to broaden my personal test regime.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard

Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)

2007-04-13 Thread Chuck Lever

Mike Snitzer wrote:

On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote:

Roland Dreier wrote:
>  > They are mostly from Chuck Level and make preparating for IPv6 
support

>  > in the NFS server.
>  > They are *not* for 2.6.20, but should be ok for .21.
>
> Out of curiousity, does this patch series reduce the delta between the
> NFS/RDMA tree and mainline Linux?  In other words does this bring
> NFS/RDMA closer to merging?

Hi Roland-

The client side support for an RPC/RDMA module is almost completely
integrated into mainline.  There is still a minimal set of patches
required to support alternate transports in loadable kernel modules
which Trond has indicated he will integrate when the RPC/RDMA transport
is ready to be integrated.

Hi Chuck,

I must be missing something because I don't see _any_ trace of the
core RPC over RDMA support (xprtrdma et al), your RPC Transport
Switch, or any of the other supporting changes in mainline.  Could
you, or others, please clarify the plan for merging RPC/RDMA?

The RPC transport switch patches are almost fully integrated into 
mainline.  The xprtrdma piece is what is not there yet.

At this time I'm not aware of a plan to integrate server-side support
for NFS/RDMA.  Perhaps the NetApp RDMA developers could respond.

Do you mean the NFS/RDMA server code from Tom Tucker of Open Grid
Computing?  Why wouldn't this code be included?

Tom's code hasn't been reviewed by the community.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)

2007-04-13 Thread Chuck Lever

Mike Snitzer wrote:

On 4/13/07, Chuck Lever <[EMAIL PROTECTED]> wrote:

Mike Snitzer wrote:
> On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
>> Roland Dreier wrote:
>> >  > They are mostly from Chuck Level and make preparating for IPv6
>> support
>> >  > in the NFS server.
>> >  > They are *not* for 2.6.20, but should be ok for .21.
>> >
>> > Out of curiousity, does this patch series reduce the delta 
between the

>> > NFS/RDMA tree and mainline Linux?  In other words does this bring
>> > NFS/RDMA closer to merging?
>>
>> Hi Roland-
>>
>> The client side support for an RPC/RDMA module is almost completely
>> integrated into mainline.  There is still a minimal set of patches
>> required to support alternate transports in loadable kernel modules
>> which Trond has indicated he will integrate when the RPC/RDMA 
transport

>> is ready to be integrated.
>
> Hi Chuck,
>
> I must be missing something because I don't see _any_ trace of the
> core RPC over RDMA support (xprtrdma et al), your RPC Transport
> Switch, or any of the other supporting changes in mainline.  Could
> you, or others, please clarify the plan for merging RPC/RDMA?

The RPC transport switch patches are almost fully integrated into
mainline.  The xprtrdma piece is what is not there yet.

OK, has the xprtrdma piece been reviewed by the community?  What, if
anything, is preventing the code from being included in -mm for wider
testing?

Some of the main NFS client developers (Trond, the CITI-UM developers, 
myself) have reviewed previous versions of this code, and have 
recommended a number of changes.  I haven't heard anything recently from 
the NetApp engineers working on this.  We are waiting for them to 
publish again, but I think they are waiting for the last of the 
transport switch patches to trickle into mainline (which should be ready 
with the release of 2.6.21).  I know they have been working with a 
handful of test sites and have achieved good performance results.

Has xprtrdma inclusion/progress been discussed on some other mailing
list?

Not yet, but the place that is likely to occur is [EMAIL PROTECTED]
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: Merge plans for RPC/RDMA? (Was: Re: [NFS] [PATCH 000 of 14] knfsd: Preparation for IPv6 support in NFS server.)

2007-04-13 Thread Chuck Lever

Mike Snitzer wrote:

On 4/13/07, Chuck Lever <[EMAIL PROTECTED]> wrote:

Mike Snitzer wrote:
> On 2/2/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
>> Roland Dreier wrote:
>> >  > They are mostly from Chuck Level and make preparating for IPv6
>> support
>> >  > in the NFS server.
>> >  > They are *not* for 2.6.20, but should be ok for .21.
>> >
>> > Out of curiousity, does this patch series reduce the delta 
between the

>> > NFS/RDMA tree and mainline Linux?  In other words does this bring
>> > NFS/RDMA closer to merging?
>>
>> Hi Roland-
>>
>> The client side support for an RPC/RDMA module is almost completely
>> integrated into mainline.  There is still a minimal set of patches
>> required to support alternate transports in loadable kernel modules
>> which Trond has indicated he will integrate when the RPC/RDMA 
transport

>> is ready to be integrated.
>
> Hi Chuck,
>
> I must be missing something because I don't see _any_ trace of the
> core RPC over RDMA support (xprtrdma et al), your RPC Transport
> Switch, or any of the other supporting changes in mainline.  Could
> you, or others, please clarify the plan for merging RPC/RDMA?

The RPC transport switch patches are almost fully integrated into
mainline.  The xprtrdma piece is what is not there yet.

OK, has the xprtrdma piece been reviewed by the community?  What, if
anything, is preventing the code from being included in -mm for wider
testing?

OK, there is something holding up the client-side piece of NFS/RDMA. 
There is some hackery required for the NFS client to recognize that the 
RDMA transport should be used instead of the standard socket transport.

We did discuss this a bit at Connectathon '07 six weeks ago, but my 
impression is that more discussion is required.

I'm working on some changes to NFS mounting that would move NFS mount 
option parsing into the kernel.  This would make it very simple to add 
RDMA transport support, but it's rather a while in coming.
begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
email;internet:chuck dot lever at nospam oracle dot com
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
version:2.1
end:vcard

Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Chuck Lever


Trond Myklebust wrote:

On Wed, 2007-04-18 at 20:52 -0500, Florin Iucha wrote:

On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote:

Do you have a copy of wireshark or ethereal on hand? If so, could you
take a look at whether or not any NFS traffic is going between the
client and server once the hang happens?
I used the following command 


   tcpdump -w nfs-traffic -i eth0 -vv -tt dst port nfs

to capture

   http://iucha.net/nfs/21-rc7-nfs4/nfs-traffic.bz2

I started the capture before starting the copy and left it to run for
a few minutes after the traffic slowed to a crawl.

The iostat and vmstat are at:

   http://iucha.net/nfs/21-rc7-nfs4/iostat
   http://iucha.net/nfs/21-rc7-nfs4/vmstat
   
It seems that my original problem report had a big mistake!  There is

no hang, but at some point the write slows down to a trickle (from
40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log.


Yeah. You only captured the outgoing traffic to the server, but already
it looks as if there were 'interesting' things going on. In frames 29346
to 29350, the traffic stops altogether for 5 seconds (I only see
keepalives) then it starts up again. Ditto for frames 40477-40482
(another 5 seconds). ...
Then at around frame 92072, the client starts to send a bunch of RSTs.
Aha I'll bet that reverting the appended patch fixes the problem.

The assumption Chuck makes is that if _no_ request bytes have been sent,
yet the request is on the 'receive list' then it must be a resend is
patently false in the case where the send queue just happens to be full.


There are other places in the RPC client where "zero bytes sent" implies 
that the request has been sent.  The real problem here is that zeroing 
the "bytes sent" field is overloaded.


Perhaps instead of looking at the number of bytes sent, the logic in the 
last hunk of this patch should check which queue the request is sitting on.




---
commit 43d78ef2ba5bec26d0315859e8324bfc0be23766
Author: Chuck Lever <[EMAIL PROTECTED]>
Date:   Tue Feb 6 18:26:11 2007 -0500

NFS: disconnect before retrying NFSv4 requests over TCP

RFC3530 section 3.1.1 states an NFSv4 client MUST NOT send a request

twice on the same connection unless it is the NULL procedure.  Section
3.1.1 suggests that the client should disconnect and reconnect if it
wants to retry a request.

Implement this by adding an rpc_clnt flag that an ULP can use to

specify that the underlying transport should be disconnected on a
major timeout.  The NFSv4 client asserts this new flag, and requests
no retries after a minor retransmit timeout.

Note that disconnecting on a retransmit is in general not safe to do

if the RPC client does not reuse the TCP port number when reconnecting.

See http://bugzilla.linux-nfs.org/show_bug.cgi?id=6

Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>

Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index a3191f0..c46e94f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -394,7 +394,8 @@ static void nfs_init_timeout_values(struct rpc_timeout *to, 
int proto,
 static int nfs_create_rpc_client(struct nfs_client *clp, int proto,
unsigned int timeo,
unsigned int retrans,
-   rpc_authflavor_t flavor)
+   rpc_authflavor_t flavor,
+   int flags)
 {
struct rpc_timeout  timeparms;
struct rpc_clnt *clnt = NULL;
@@ -407,6 +408,7 @@ static int nfs_create_rpc_client(struct nfs_client *clp, 
int proto,
.program= &nfs_program,
.version= clp->rpc_ops->version,
.authflavor = flavor,
+   .flags  = flags,
};
 
 	if (!IS_ERR(clp->cl_rpcclient))

@@ -548,7 +550,7 @@ static int nfs_init_client(struct nfs_client *clp, const 
struct nfs_mount_data *
 * - RFC 2623, sec 2.3.2
 */
error = nfs_create_rpc_client(clp, proto, data->timeo, data->retrans,
-   RPC_AUTH_UNIX);
+   RPC_AUTH_UNIX, 0);
if (error < 0)
goto error;
nfs_mark_client_ready(clp, NFS_CS_READY);
@@ -868,7 +870,8 @@ static int nfs4_init_client(struct nfs_client *clp,
/* Check NFS protocol revision and initialize RPC op vector */
clp->rpc_ops = &nfs_v4_clientops;
 
-	error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour);

+   error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour,
+   RPC_CLNT_CREATE_DISCRTRY);
i

Re: [PATCH] fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id

2018-07-22 Thread Chuck Lever




> On Jul 22, 2018, at 4:50 AM, nixiaoming  wrote:
> 
> dummy = be32_to_cpup(p++);
> dummy = be32_to_cpup(p++);
> Assigning value to "dummy" here, but that stored value
> is overwritten before it can be used.
> 
> delete invalid assignment statements in nfsd4_decode_exchange_id
> 
> Signed-off-by: n00202754 
> ---
> fs/nfsd/nfs4xdr.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index a96843c..8e78541 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1392,8 +1392,8 @@ nfsd4_decode_exchange_id(struct nfsd4_compoundargs 
> *argp,
> 
>   /* ssp_window and ssp_num_gss_handles */
>   READ_BUF(8);
> - dummy = be32_to_cpup(p++);
> - dummy = be32_to_cpup(p++);
> + be32_to_cpup(p++);
> + be32_to_cpup(p++);

If these values are not used, what's the point of byte swapping them?
Surely "p += 2;" should be enough.


>   break;
>   default:
>   goto xdr_error;

--
Chuck Lever

Re: [PATCH] fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id

2018-07-22 Thread Chuck Lever




> On Jul 22, 2018, at 2:33 PM, Trond Myklebust  wrote:
> 
> On Sun, 2018-07-22 at 14:12 -0400, Chuck Lever wrote:
>>> On Jul 22, 2018, at 4:50 AM, nixiaoming 
>>> wrote:
>>> 
>>> dummy = be32_to_cpup(p++);
>>> dummy = be32_to_cpup(p++);
>>> Assigning value to "dummy" here, but that stored value
>>> is overwritten before it can be used.
>>> 
>>> delete invalid assignment statements in nfsd4_decode_exchange_id
>>> 
>>> Signed-off-by: n00202754 
>>> ---
>>> fs/nfsd/nfs4xdr.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>> index a96843c..8e78541 100644
>>> --- a/fs/nfsd/nfs4xdr.c
>>> +++ b/fs/nfsd/nfs4xdr.c
>>> @@ -1392,8 +1392,8 @@ nfsd4_decode_exchange_id(struct
>>> nfsd4_compoundargs *argp,
>>> 
>>> /* ssp_window and ssp_num_gss_handles */
>>> READ_BUF(8);
>>> -   dummy = be32_to_cpup(p++);
>>> -   dummy = be32_to_cpup(p++);
>>> +   be32_to_cpup(p++);
>>> +   be32_to_cpup(p++);
>> 
>> If these values are not used, what's the point of byte swapping them?
>> Surely "p += 2;" should be enough.
>> 
>> 
>>> break;
>>> default:
>>> goto xdr_error;
>> 
> 
> Given that the value of 'p' isn't used either, why not just delete
> those two lines altogether?

Sounds OK, READ_BUF is tracking progress through the buffer,
and it already updates "p" as a side-effect.

Might there be some nearby instances of open-coded "p" updates
that could also be removed, for similar reasons?


--
Chuck Lever

Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage

2018-07-25 Thread Chuck Lever




> On Jul 25, 2018, at 9:27 AM, Krzysztof Kozlowski  wrote:
> 
> On 18 June 2018 at 18:20, Chuck Lever  wrote:
>> 
>> The extra serialization appears to have a reproducible performance
>> impact on RDMA, which no longer takes the reserve_lock when allocating
>> a slot.
>> 
>> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would
>> only work for socket-based transports. Would it be OK if RDMA had its
>> own XID allocation mechanism?
> 
> Hi,
> 
> On recent next the issue appeared again. My boards with NFSv4 root
> timeout on 80% of boots. This time my NFS server is faster - Pi3 B+
> :).
> 
> Is this know? Should I start long bisect or maybe you can point me to
> possible causes?

Hi Krzysztof, I don't know of any recent changes. Bisecting would be
a good place to start.


--
Chuck Lever

Re: [PATCH] NFSD: hide unused svcxdr_dupstr()

2018-01-19 Thread Chuck Lever



> On Jan 19, 2018, at 9:54 AM, Arnd Bergmann  wrote:
> 
> There is now only one caller left for svcxdr_dupstr() and this is inside
> of an #ifdef, so we can get a warning when the option is disabled:
> 
> fs/nfsd/nfs4xdr.c:241:1: error: 'svcxdr_dupstr' defined but not used 
> [-Werror=unused-function]
> 
> This adds another #ifdef around the definition.
> 
> Fixes: e40d99e6183e ("NFSD: Clean up symlink argument XDR decoders")
> Signed-off-by: Arnd Bergmann 

Seems OK to me, and sorry for the noise.

Reviewed-by: Chuck Lever 


> ---
> fs/nfsd/nfs4xdr.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index f51c9cccaf78..374a62af6034 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -237,6 +237,7 @@ svcxdr_tmpalloc(struct nfsd4_compoundargs *argp, u32 len)
>  * Note null-terminating in place usually isn't safe since the
>  * buffer might end on a page boundary.
>  */
> +#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
> static char *
> svcxdr_dupstr(struct nfsd4_compoundargs *argp, void *buf, u32 len)
> {
> @@ -248,6 +249,7 @@ svcxdr_dupstr(struct nfsd4_compoundargs *argp, void *buf, 
> u32 len)
>   p[len] = '\0';
>   return p;
> }
> +#endif
> 
> /**
>  * savemem - duplicate a chunk of memory for later processing
> -- 
> 2.9.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever

Re: [PATCH] sunrpc: fix a missing check of xdr_inline_decode

2018-12-26 Thread Chuck Lever




> On Dec 25, 2018, at 10:18 PM, Kangjie Lu  wrote:
> 
> xdr_inline_decode() could fail.

NAK: xdr_inline_decode cannot fail if its second argument is zero.


> When it fails, the return value is NULL
> and should not be dereferenced.
> The fix checks if xdr_inline_decode fails, and if so, returns.
> 
> Signed-off-by: Kangjie Lu 
> ---
> net/sunrpc/xprtrdma/backchannel.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/net/sunrpc/xprtrdma/backchannel.c 
> b/net/sunrpc/xprtrdma/backchannel.c
> index e5b367a3e517..bd9be5272ef4 100644
> --- a/net/sunrpc/xprtrdma/backchannel.c
> +++ b/net/sunrpc/xprtrdma/backchannel.c
> @@ -285,6 +285,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
>   __be32 *p;
> 
>   p = xdr_inline_decode(&rep->rr_stream, 0);
> + if (unlikely(!p))
> + goto out_overflow;
>   size = xdr_stream_remaining(&rep->rr_stream);
> 
> #ifdef RPCRDMA_BACKCHANNEL_DEBUG
> -- 
> 2.17.2 (Apple Git-113)
> 

--
Chuck Lever

Re: [PATCH] sunrpc: remove redundant code

2018-12-26 Thread Chuck Lever




> On Dec 25, 2018, at 10:24 PM, Kangjie Lu  wrote:
> 
> If no bytes to decode, just use "xdr->p" instead of calling
> xdr_inline_decode to get it. The fix cleans up the code.
> 
> Signed-off-by: Kangjie Lu 
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 9f53e0240035..2ef86be49bd8 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -1123,7 +1123,6 @@ rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct 
> rpcrdma_rep *rep,
> {
>   struct xdr_stream *xdr = &rep->rr_stream;
>   u32 writelist, replychunk, rpclen;
> - char *base;
> 
>   /* Decode the chunk lists */
>   if (decode_read_list(xdr))
> @@ -1138,10 +1137,9 @@ rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct 
> rpcrdma_rep *rep,
>   return -EIO;
> 
>   /* Build the RPC reply's Payload stream in rqst->rq_rcv_buf */
> - base = (char *)xdr_inline_decode(xdr, 0);
>   rpclen = xdr_stream_remaining(xdr);
>   r_xprt->rx_stats.fixup_copy_count +=
> - rpcrdma_inline_fixup(rqst, base, rpclen, writelist & 3);
> + rpcrdma_inline_fixup(rqst, xdr->p, rpclen, writelist & 3);

I used xdr_inline_decode(xdr, 0) here because I didn't want to
embed specific knowledge about struct xdr_stream into
rpcrdma_decode_msg. IOW, adhere to the API contract.

Given the need for adding a type cast to this function call, as
reported by the kbuild robot, I don't see that this patch improves
things significantly.


>   r_xprt->rx_stats.total_rdma_reply += writelist;
>   return rpclen + xdr_align_size(writelist);
> -- 
> 2.17.2 (Apple Git-113)
> 

--
Chuck Lever

Re: [PATCH net-next] svcrdma: Fix an uninitialized variable false warning

2018-12-28 Thread Chuck Lever



> On Dec 20, 2018, at 4:49 AM, YueHaibing  wrote:
> 
> smatch warning this:
> net/sunrpc/xprtrdma/svc_rdma_rw.c:351 svc_rdma_post_chunk_ctxt() error: 
> uninitialized symbol 'bad_wr'
> net/sunrpc/xprtrdma/verbs.c:1569 rpcrdma_post_recvs() error: uninitialized 
> symbol 'bad_wr'
> 
> 'bad_wr' is initialized in ib_post_send. But smatch
> doesn't know that and warns this.
> 
> Signed-off-by: YueHaibing 
> ---
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 2 +-
> net/sunrpc/xprtrdma/verbs.c   | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c 
> b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> index dc19517..0954b25 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -308,7 +308,7 @@ static int svc_rdma_post_chunk_ctxt(struct 
> svc_rdma_chunk_ctxt *cc)
>   struct svcxprt_rdma *rdma = cc->cc_rdma;
>   struct svc_xprt *xprt = &rdma->sc_xprt;
>   struct ib_send_wr *first_wr;
> - const struct ib_send_wr *bad_wr;
> + const struct ib_send_wr *bad_wr = NULL;
>   struct list_head *tmp;
>   struct ib_cqe *cqe;
>   int ret;
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 3ddba94..37be70f 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1518,7 +1518,7 @@ void
> rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> {
>   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> - struct ib_recv_wr *wr, *bad_wr;
> + struct ib_recv_wr *wr, *bad_wr = NULL;
>   int needed, count, rc;
> 
>   rc = 0;
> -- 
> 2.7.0

Does this need

Fixes: d34ac5cd3a73 ("RDMA, core and ULPs: Declare ib_post_send() and 
ib_post_recv() arguments const")  ???

Bart, any comments?


--
Chuck Lever

Re: [PATCH 22/23] SUNRPC: simplify auth_unix.

2018-11-07 Thread Chuck Lever

>cred->fsgid;
> - for (i = 0; i < groups; i++)
> - cred->uc_gids[i] = acred->cred->group_info->gid[i];
> - if (i < UNX_NGROUPS)
> - cred->uc_gids[i] = INVALID_GID;
> -
> - return &cred->uc_base;
> -}
> -
> -static void
> -unx_free_cred(struct unx_cred *unx_cred)
> -{
> - dprintk("RPC:   unx_free_cred %p\n", unx_cred);
> - put_cred(unx_cred->uc_base.cr_cred);
> - kfree(unx_cred);
> + rpcauth_init_cred(ret, acred, auth, &unix_credops);
> + ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
> + return ret;
> }
> 
> static void
> unx_free_cred_callback(struct rcu_head *head)
> {
> - struct unx_cred *unx_cred = container_of(head, struct unx_cred, 
> uc_base.cr_rcu);
> - unx_free_cred(unx_cred);
> + struct rpc_cred *rpc_cred = container_of(head, struct rpc_cred, cr_rcu);
> + dprintk("RPC:   unx_free_cred %p\n", rpc_cred);
> + put_cred(rpc_cred->cr_cred);
> + mempool_free(rpc_cred, unix_pool);
> }
> 
> static void
> @@ -115,30 +73,32 @@ unx_destroy_cred(struct rpc_cred *cred)
> }
> 
> /*
> - * Match credentials against current process creds.
> - * The root_override argument takes care of cases where the caller may
> - * request root creds (e.g. for NFS swapping).
> + * Match credentials against current the auth_cred.
>  */
> static int
> -unx_match(struct auth_cred *acred, struct rpc_cred *rcred, int flags)
> +unx_match(struct auth_cred *acred, struct rpc_cred *cred, int flags)
> {
> - struct unx_cred *cred = container_of(rcred, struct unx_cred, uc_base);
>   unsigned int groups = 0;
>   unsigned int i;
> 
> + if (cred->cr_cred == acred->cred)
> + return 1;
> 
> - if (!uid_eq(cred->uc_uid, acred->cred->fsuid) || !gid_eq(cred->uc_gid, 
> acred->cred->fsgid))
> + if (!uid_eq(cred->cr_cred->fsuid, acred->cred->fsuid) || 
> !gid_eq(cred->cr_cred->fsgid, acred->cred->fsgid))
>   return 0;
> 
>   if (acred->cred && acred->cred->group_info != NULL)
>   groups = acred->cred->group_info->ngroups;
>   if (groups > UNX_NGROUPS)
>   groups = UNX_NGROUPS;
> + if (cred->cr_cred->group_info == NULL)
> + return groups == 0;
> + if (groups != cred->cr_cred->group_info->ngroups)
> + return 0;
> +
>   for (i = 0; i < groups ; i++)
> - if (!gid_eq(cred->uc_gids[i], acred->cred->group_info->gid[i]))
> + if (!gid_eq(cred->cr_cred->group_info->gid[i], 
> acred->cred->group_info->gid[i]))
>   return 0;
> - if (groups < UNX_NGROUPS && gid_valid(cred->uc_gids[groups]))
> - return 0;
>   return 1;
> }
> 
> @@ -150,9 +110,10 @@ static __be32 *
> unx_marshal(struct rpc_task *task, __be32 *p)
> {
>   struct rpc_clnt *clnt = task->tk_client;
> - struct unx_cred *cred = container_of(task->tk_rqstp->rq_cred, struct 
> unx_cred, uc_base);
> + struct rpc_cred *cred = task->tk_rqstp->rq_cred;
>   __be32  *base, *hold;
>   int i;
> + struct group_info *gi = cred->cr_cred->group_info;
> 
>   *p++ = htonl(RPC_AUTH_UNIX);
>   base = p++;
> @@ -163,11 +124,12 @@ unx_marshal(struct rpc_task *task, __be32 *p)
>*/
>   p = xdr_encode_array(p, clnt->cl_nodename, clnt->cl_nodelen);
> 
> - *p++ = htonl((u32) from_kuid(&init_user_ns, cred->uc_uid));
> - *p++ = htonl((u32) from_kgid(&init_user_ns, cred->uc_gid));
> + *p++ = htonl((u32) from_kuid(&init_user_ns, cred->cr_cred->fsuid));
> + *p++ = htonl((u32) from_kgid(&init_user_ns, cred->cr_cred->fsgid));
>   hold = p++;
> - for (i = 0; i < UNX_NGROUPS && gid_valid(cred->uc_gids[i]); i++)
> - *p++ = htonl((u32) from_kgid(&init_user_ns, cred->uc_gids[i]));
> + if (gi)
> + for (i = 0; i < UNX_NGROUPS && i < gi->ngroups; i++)
> + *p++ = htonl((u32) from_kgid(&init_user_ns, 
> gi->gid[i]));
>   *hold = htonl(p - hold - 1);/* gid array length */
>   *base = htonl((p - base - 1) << 2); /* cred length */
> 
> @@ -214,12 +176,13 @@ unx_validate(struct rpc_task *task, __be32 *p)
> 
> int __init rpc_init_authunix(void)
> {
> - return rpcauth_init_credcache(&unix_auth);
> + unix_pool = mempool_create_kmalloc_pool(16, sizeof(struct rpc_cred));
> + return unix_pool ? 0 : -ENOMEM;
> }
> 
> void rpc_destroy_authunix(void)
> {
> - rpcauth_destroy_credcache(&unix_auth);
> + mempool_destroy(unix_pool);
> }
> 
> const struct rpc_authops authunix_ops = {
> @@ -228,9 +191,7 @@ const struct rpc_authops authunix_ops = {
>   .au_name= "UNIX",
>   .create = unx_create,
>   .destroy= unx_destroy,
> - .hash_cred  = unx_hash_cred,
>   .lookup_cred= unx_lookup_cred,
> - .crcreate   = unx_create_cred,
> };
> 
> static
> 
> 

--
Chuck Lever

Re: [PATCH 22/23] SUNRPC: simplify auth_unix.

2018-11-08 Thread Chuck Lever




> On Nov 7, 2018, at 8:41 PM, NeilBrown  wrote:
> 
> On Wed, Nov 07 2018, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>> 
>>> On Nov 6, 2018, at 11:12 PM, NeilBrown  wrote:
>>> 
>>> 1/ discard 'struct unx_cred'.  We don't need any data that
>>>  is not already in 'struct rpc_cred'.
>>> 2/ Don't keep these creds in a hash table.  When a credential
>>>  is needed, simply allocate it.  When not needed, discard it.
>>>  This can easily be faster than performing a lookup on
>>>  a shared hash table.
> 
> Thanks for the review Chuck!
> 
>> 
>> What's the basis for this claim? A memory allocation disables and
>> enables IRQs. That definitely hits a resource that is globally
>> shared.
> 
> My basis is not rock solid, but I was convinced :-)
> 
> kmem_cache_alloc() does disable local irqs when slab.c is used.
> slub.c doesn't disable them in the fast path which I *think* should be
> reasonably common.
> slob always takes a spinlock as well as disabling interrupts.
> 
> I think slob is only recommended for tiny machines, and slub is
> generally preferred, so I think that when performance matters, it will
> still be delivered.
> 
> It isn't clear to me why you consider a local irq to be "globally
> shared" - assuming that is what you mean.

Globally-shared in this case can be construed somewhat narrowly.
If we assume slub, kmem_cache_alloc() touches resources that are
used by all tasks running on that CPU, at least in the slow path.


> Disabling local interrupts is not without cost, but I don't think the
> cost increases with the number of CPUs, while the cost of accessing
> shared memory (even without a spinlock) does.

Point taken, but having a single mempool for all RPC transports
and users is also going to be a shared resource that can
bottleneck.

I guess that's an argument in favor of building creds on demand
rather than keeping them in a hash table, isn't it. :-)


>> In addition, the comment near unx_marshal suggests we should
>> cache the marshaled on-the-wire version of the credential instead
>> of building it in the RPC Call buffer every time. That would
>> require keeping the creds around.
> 
> That comment has been there since 2.1.32 and has not be acted on.  There
> seems little reason to expect that to change.  Caching doesn't seem to
> have been found to be necessary in practice.
> 
>> 
>> Have you measured a significant difference in throughput with
>> this patch? Have you considered improving the lookup speed of
>> the hash table by making the buckets into rb-trees, for example?
> 
> No, I haven't measured.
> I might have briefly considered changing to an rb-tree, but as the
> current hashtable doesn't actually contain anything of value, I would
> have quickly discarded the idea.
> 
> If I wanted to further improve performance, I would look at ways to
> bypass the "lookup_cred" step completely.
> unx_marshal only needs the generic "struct cred", so there should be no
> need to have a 'struct rpc_cred' at all.
> 
> If this series is accepted, I'll (hopefully) look into seeing how
> practical that is.
> 
> Thanks again,
> NeilBrown
> 
> 
>> 
>> 
>>>  As the lookup can happen during write-out, use a mempool
>>>  to ensure forward progress.
>>>  This means that we cannot compare two credentials for
>>>  equality by comparing the pointers, but we never do that anyway.
>>> 
>>> Signed-off-by: NeilBrown 
>>> ---
>>> net/sunrpc/auth.c  |1 
>>> net/sunrpc/auth_unix.c |  101 
>>> +++-
>>> 2 files changed, 32 insertions(+), 70 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
>>> index 867ea9834bde..a07a7c59d3a4 100644
>>> --- a/net/sunrpc/auth.c
>>> +++ b/net/sunrpc/auth.c
>>> @@ -651,6 +651,7 @@ rpcauth_init_cred(struct rpc_cred *cred, const struct 
>>> auth_cred *acred,
>>> INIT_LIST_HEAD(&cred->cr_lru);
>>> refcount_set(&cred->cr_count, 1);
>>> cred->cr_auth = auth;
>>> +   cred->cr_flags = 0;
>>> cred->cr_ops = ops;
>>> cred->cr_expire = jiffies;
>>> cred->cr_cred = get_cred(acred->cred);
>>> diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
>>> index bff113a411e0..387f6b3ffbea 100644
>>> --- a/net/sunrpc/auth_unix.c
>>> +++ b/net/sunrpc/auth_unix.c
>>> @@ -11,16 +11,11

Re: [PATCH 03/20] filelock: the results of the coccinelle conversion

2024-01-17 Thread Chuck Lever

On Tue, Jan 16, 2024 at 02:45:59PM -0500, Jeff Layton wrote:
> This is the direct result of the changes generated by coccinelle. There
> is still about 1/4 of the callsites that need to be touched by hand
> here.
> 
> Signed-off-by: Jeff Layton 

For the changes in include/linux/lockd/lockd.h, fs/lockd/, and
fs/nfsd/:

Acked-by: Chuck Lever 


> ---
>  fs/9p/vfs_file.c|  38 ++---
>  fs/afs/flock.c  |  55 +++---
>  fs/ceph/locks.c |  66 +++
>  fs/dlm/plock.c  |  44 ++---
>  fs/fuse/file.c  |  14 +-
>  fs/gfs2/file.c  |  16 +-
>  fs/lockd/clnt4xdr.c |   6 +-
>  fs/lockd/clntlock.c |   2 +-
>  fs/lockd/clntproc.c |  60 ---
>  fs/lockd/clntxdr.c  |   6 +-
>  fs/lockd/svclock.c  |  10 +-
>  fs/lockd/svcsubs.c  |  20 +--
>  fs/lockd/xdr.c  |   6 +-
>  fs/lockd/xdr4.c |   6 +-
>  fs/locks.c  | 406 
> +++-
>  fs/nfs/delegation.c |   2 +-
>  fs/nfs/file.c   |  22 +--
>  fs/nfs/nfs3proc.c   |   2 +-
>  fs/nfs/nfs4proc.c   |  35 ++--
>  fs/nfs/nfs4state.c  |   4 +-
>  fs/nfs/nfs4xdr.c|   8 +-
>  fs/nfs/write.c  |   4 +-
>  fs/nfsd/filecache.c |   4 +-
>  fs/nfsd/nfs4layouts.c   |  15 +-
>  fs/nfsd/nfs4state.c |  73 
>  fs/ocfs2/locks.c|  12 +-
>  fs/ocfs2/stack_user.c   |   2 +-
>  fs/smb/client/cifssmb.c |   8 +-
>  fs/smb/client/file.c|  72 
>  fs/smb/client/smb2file.c|   2 +-
>  fs/smb/server/smb2pdu.c |  44 ++---
>  fs/smb/server/vfs.c |  12 +-
>  include/linux/lockd/lockd.h |   8 +-
>  33 files changed, 554 insertions(+), 530 deletions(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 11cd8d23f6f2..f35ac7cb782e 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -107,7 +107,7 @@ static int v9fs_file_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>  
>   p9_debug(P9_DEBUG_VFS, "filp: %p lock: %p\n", filp, fl);
>  
> - if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
> + if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_core.fl_type != 
> F_UNLCK) {
>   filemap_write_and_wait(inode->i_mapping);
>   invalidate_mapping_pages(&inode->i_data, 0, -1);
>   }
> @@ -127,7 +127,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   fid = filp->private_data;
>   BUG_ON(fid == NULL);
>  
> - BUG_ON((fl->fl_flags & FL_POSIX) != FL_POSIX);
> + BUG_ON((fl->fl_core.fl_flags & FL_POSIX) != FL_POSIX);
>  
>   res = locks_lock_file_wait(filp, fl);
>   if (res < 0)
> @@ -136,7 +136,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   /* convert posix lock to p9 tlock args */
>   memset(&flock, 0, sizeof(flock));
>   /* map the lock type */
> - switch (fl->fl_type) {
> + switch (fl->fl_core.fl_type) {
>   case F_RDLCK:
>   flock.type = P9_LOCK_TYPE_RDLCK;
>   break;
> @@ -152,7 +152,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   flock.length = 0;
>   else
>   flock.length = fl->fl_end - fl->fl_start + 1;
> - flock.proc_id = fl->fl_pid;
> + flock.proc_id = fl->fl_core.fl_pid;
>   flock.client_id = fid->clnt->name;
>   if (IS_SETLKW(cmd))
>   flock.flags = P9_LOCK_FLAGS_BLOCK;
> @@ -207,12 +207,12 @@ static int v9fs_file_do_lock(struct file *filp, int 
> cmd, struct file_lock *fl)
>* incase server returned error for lock request, revert
>* it locally
>*/
> - if (res < 0 && fl->fl_type != F_UNLCK) {
> - fl_type = fl->fl_type;
> - fl->fl_type = F_UNLCK;
> + if (res < 0 && fl->fl_core.fl_type != F_UNLCK) {
> + fl_type = fl->fl_core.fl_type;
> + fl->fl_core.fl_type = F_UNLCK;
>   /* Even if this fails we want to return the remote error */
>   locks_lock_file_wait(filp, fl);
> - fl->fl_type = fl_type;
> + fl->fl_core.fl_type = fl_type;
>   }
>   if (flock.client_id != fid->clnt->name)
>   kfree(flock.client_id);
> @@ -234,7 +234,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>* if we have a conflicting lock l

Re: [PATCH 00/20] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-17 Thread Chuck Lever

>  fs/nfs/nfs4file.c   |   2 +-
>  fs/nfs/nfs4proc.c   |  39 +-
>  fs/nfs/nfs4state.c  |   6 +-
>  fs/nfs/nfs4trace.h  |   4 +-
>  fs/nfs/nfs4xdr.c|   8 +-
>  fs/nfs/write.c  |   8 +-
>  fs/nfsd/filecache.c |   4 +-
>  fs/nfsd/nfs4callback.c  |   2 +-
>  fs/nfsd/nfs4layouts.c   |  34 +-
>  fs/nfsd/nfs4state.c |  98 ++---
>  fs/ocfs2/locks.c|  12 +-
>  fs/ocfs2/stack_user.c   |   2 +-
>  fs/smb/client/cifsfs.c  |   2 +-
>  fs/smb/client/cifssmb.c |   8 +-
>  fs/smb/client/file.c|  74 ++--
>  fs/smb/client/smb2file.c|   2 +-
>  fs/smb/server/smb2pdu.c |  44 +--
>  fs/smb/server/vfs.c |  14 +-
>  include/linux/filelock.h|  58 ++-
>  include/linux/fs.h      |   5 +-
>  include/linux/lockd/lockd.h |   8 +-
>  include/trace/events/afs.h  |   4 +-
>  include/trace/events/filelock.h |  54 +--
>  48 files changed, 1119 insertions(+), 825 deletions(-)
> ---
> base-commit: 052d534373b7ed33712a63d5e17b2b6cdbce84fd
> change-id: 20240116-flsplit-bdb46824db68
> 
> Best regards,
> -- 
> Jeff Layton 
> 

-- 
Chuck Lever

Re: [PATCH 04/20] filelock: fixups after the coccinelle changes

2024-01-17 Thread Chuck Lever

On Tue, Jan 16, 2024 at 02:46:00PM -0500, Jeff Layton wrote:
> The coccinelle script doesn't catch quite everythng (particularly with
> embedded structs). These are some by-hand fixups after the split of
> common fields into struct file_lock_core.
> 
> Signed-off-by: Jeff Layton 

For the changes in fs/lockd/ and fs/nfsd/:

Acked-by: Chuck Lever 


> ---
>  fs/ceph/locks.c |  8 ++---
>  fs/lockd/clnt4xdr.c |  8 ++---
>  fs/lockd/clntproc.c |  6 ++--
>  fs/lockd/clntxdr.c  |  8 ++---
>  fs/lockd/svc4proc.c | 10 +++---
>  fs/lockd/svclock.c  | 54 +
>  fs/lockd/svcproc.c  | 10 +++---
>  fs/lockd/svcsubs.c  |  4 +--
>  fs/lockd/xdr.c  |  8 ++---
>  fs/lockd/xdr4.c |  8 ++---
>  fs/locks.c  | 67 
> +
>  fs/nfs/delegation.c |  2 +-
>  fs/nfs/nfs4state.c  |  2 +-
>  fs/nfs/nfs4trace.h  |  4 +--
>  fs/nfs/write.c  |  4 +--
>  fs/nfsd/nfs4callback.c  |  2 +-
>  fs/nfsd/nfs4state.c |  4 +--
>  fs/smb/client/file.c|  2 +-
>  fs/smb/server/vfs.c |  2 +-
>  include/trace/events/afs.h  |  4 +--
>  include/trace/events/filelock.h | 32 ++--
>  21 files changed, 126 insertions(+), 123 deletions(-)
> 
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index ee12f9864980..55be5d231e38 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -386,9 +386,9 @@ void ceph_count_locks(struct inode *inode, int 
> *fcntl_count, int *flock_count)
>   ctx = locks_inode_context(inode);
>   if (ctx) {
>   spin_lock(&ctx->flc_lock);
> - list_for_each_entry(lock, &ctx->flc_posix, fl_list)
> + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list)
>   ++(*fcntl_count);
> - list_for_each_entry(lock, &ctx->flc_flock, fl_list)
> + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list)
>   ++(*flock_count);
>   spin_unlock(&ctx->flc_lock);
>   }
> @@ -455,7 +455,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
>   return 0;
>  
>   spin_lock(&ctx->flc_lock);
> - list_for_each_entry(lock, &ctx->flc_posix, fl_list) {
> + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list) {
>   ++seen_fcntl;
>   if (seen_fcntl > num_fcntl_locks) {
>   err = -ENOSPC;
> @@ -466,7 +466,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
>   goto fail;
>   ++l;
>   }
> - list_for_each_entry(lock, &ctx->flc_flock, fl_list) {
> + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list) {
>   ++seen_flock;
>   if (seen_flock > num_flock_locks) {
>   err = -ENOSPC;
> diff --git a/fs/lockd/clnt4xdr.c b/fs/lockd/clnt4xdr.c
> index ed00bd2869a7..083a3b1bf288 100644
> --- a/fs/lockd/clnt4xdr.c
> +++ b/fs/lockd/clnt4xdr.c
> @@ -243,7 +243,7 @@ static void encode_nlm4_holder(struct xdr_stream *xdr,
>   u64 l_offset, l_len;
>   __be32 *p;
>  
> - encode_bool(xdr, lock->fl.fl_type == F_RDLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_RDLCK);
>   encode_int32(xdr, lock->svid);
>   encode_netobj(xdr, lock->oh.data, lock->oh.len);
>  
> @@ -357,7 +357,7 @@ static void nlm4_xdr_enc_testargs(struct rpc_rqst *req,
>   const struct nlm_lock *lock = &args->lock;
>  
>   encode_cookie(xdr, &args->cookie);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK);
>   encode_nlm4_lock(xdr, lock);
>  }
>  
> @@ -380,7 +380,7 @@ static void nlm4_xdr_enc_lockargs(struct rpc_rqst *req,
>  
>   encode_cookie(xdr, &args->cookie);
>   encode_bool(xdr, args->block);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK);
>   encode_nlm4_lock(xdr, lock);
>   encode_bool(xdr, args->reclaim);
>   encode_int32(xdr, args->state);
> @@ -403,7 +403,7 @@ static void nlm4_xdr_enc_cancargs(struct rpc_rqst *req,
>  
>   encode_cookie(xdr, &args->cookie);
>   encode_bool(xdr, args->block);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_W

Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-25 Thread Chuck Lever

On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote:
> Long ago, file locks used to hang off of a singly-linked list in struct
> inode. Because of this, when leases were added, they were added to the
> same list and so they had to be tracked using the same sort of
> structure.
> 
> Several years ago, we added struct file_lock_context, which allowed us
> to use separate lists to track different types of file locks. Given
> that, leases no longer need to be tracked using struct file_lock.
> 
> That said, a lot of the underlying infrastructure _is_ the same between
> file leases and locks, so we can't completely separate everything.
> 
> This patchset first splits a group of fields used by both file locks and
> leases into a new struct file_lock_core, that is then embedded in struct
> file_lock. Coccinelle was then used to convert a lot of the callers to
> deal with the move, with the remaining 25% or so converted by hand.
> 
> It then converts several internal functions in fs/locks.c to work
> with struct file_lock_core. Lastly, struct file_lock is split into
> struct file_lock and file_lease, and the lease-related APIs converted to
> take struct file_lease.
> 
> After the first few patches (which I left split up for easier review),
> the set should be bisectable. I'll plan to squash the first few
> together to make sure the resulting set is bisectable before merge.
> 
> Finally, I left the coccinelle scripts I used in tree. I had heard it
> was preferable to merge those along with the patches that they
> generate, but I wasn't sure where they go. I can either move those to a
> more appropriate location or we can just drop that commit if it's not
> needed.
> 
> Signed-off-by: Jeff Layton 

v2 looks nicer.

I would add a few list handling primitives, as I see enough
instances of list_for_each_entry, list_for_each_entry_safe,
list_first_entry, and list_first_entry_or_null on fl_core.flc_list
to make it worth having those.

Also, there doesn't seem to be benefit for API consumers to have to
understand the internal structure of struct file_lock/lease to reach
into fl_core. Having accessor functions for common fields like
fl_type and fl_flags could be cleaner.

For the series:

Reviewed-by: Chuck Lever 

For the nfsd and lockd parts:

Acked-by: Chuck Lever 


> ---
> Changes in v2:
> - renamed file_lock_core fields to have "flc_" prefix
> - used macros to more easily do the change piecemeal
> - broke up patches into per-subsystem ones
> - Link to v1: 
> https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org
> 
> ---
> Jeff Layton (41):
>   filelock: rename some fields in tracepoints
>   filelock: rename fl_pid variable in lock_get_status
>   dlm: rename fl_flags variable in dlm_posix_unlock
>   nfs: rename fl_flags variable in nfs4_proc_unlck
>   nfsd: rename fl_type and fl_flags variables in nfsd4_lock
>   lockd: rename fl_flags and fl_type variables in nlmclnt_lock
>   9p: rename fl_type variable in v9fs_file_do_lock
>   afs: rename fl_type variable in afs_next_locker
>   filelock: drop the IS_* macros
>   filelock: split common fields into struct file_lock_core
>   filelock: add coccinelle scripts to move fields to struct file_lock_core
>   filelock: have fs/locks.c deal with file_lock_core directly
>   filelock: convert some internal functions to use file_lock_core instead
>   filelock: convert more internal functions to use file_lock_core
>   filelock: make posix_same_owner take file_lock_core pointers
>   filelock: convert posix_owner_key to take file_lock_core arg
>   filelock: make locks_{insert,delete}_global_locks take file_lock_core 
> arg
>   filelock: convert locks_{insert,delete}_global_blocked
>   filelock: make __locks_delete_block and __locks_wake_up_blocks take 
> file_lock_core
>   filelock: convert __locks_insert_block, conflict and deadlock checks to 
> use file_lock_core
>   filelock: convert fl_blocker to file_lock_core
>   filelock: clean up locks_delete_block internals
>   filelock: reorganize locks_delete_block and __locks_insert_block
>   filelock: make assign_type helper take a file_lock_core pointer
>   filelock: convert locks_wake_up_blocks to take a file_lock_core pointer
>   filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx
>   filelock: convert locks_translate_pid to take file_lock_core
>   filelock: convert seqfile handling to use file_lock_core
>   9p: adapt to breakup of struct file_lock
>   afs: adapt to breakup of struct file_lock
>   ceph: adapt to breakup of struct file_lock
>   dlm: adapt to breakup of struct file_lock
>   gfs2: adapt

Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro

2024-02-22 Thread Chuck Lever

On Thu, Feb 22, 2024 at 12:28:28PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> I'm working on restructuring the __string* macros so that it doesn't need
> to recalculate the string twice. That is, it will save it off when
> processing __string() and the __assign_str() will not need to do the work
> again as it currently does.
> 
> Currently __string_len(item, src, len) doesn't actually use "src", but my
> changes will require src to be correct as that is where the __assign_str()
> will get its value from.
> 
> The event class nfsd_clid_class has:
> 
>   __string_len(name, name, clp->cl_name.len)
> 
> But the second "name" does not exist and causes my changes to fail to
> build. That second parameter should be: clp->cl_name.data.
> 
> Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for 
> nfsd_clid_class")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/nfsd/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index d1e8cf079b0f..2cd57033791f 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class,
>   __array(unsigned char, addr, sizeof(struct sockaddr_in6))
>   __field(unsigned long, flavor)
>   __array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
> - __string_len(name, name, clp->cl_name.len)
> + __string_len(name, clp->cl_name.data, clp->cl_name.len)
>   ),
>   TP_fast_assign(
>   __entry->cl_boot = clp->cl_clientid.cl_boot;
> -- 
> 2.43.0
> 

Do you want me to take this through the nfsd tree, or would you like
an Ack from me so you can handle it as part of your clean up? Just
in case:

Acked-by: Chuck Lever 

-- 
Chuck Lever

Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Chuck Lever

On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The rpcgss_context trace event acceptor field is a dynamically sized
> string that records the "data" parameter. But this parameter is also
> dependent on the "len" field to determine the size of the data.
> 
> It needs to use __string_len() helper macro where the length can be passed
> in. It also incorrectly uses strncpy() to save it instead of
> __assign_str(). As these macros can change, it is not wise to open code
> them in trace events.
> 
> As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
> __assign_str() can be used for both __string() and __string_len() fields.
> Before that commit, __assign_str_len() is required to be used. This needs
> to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
> Rework __assign_str() and __string() to not duplicate getting the string")
> is the commit that makes __string_str_len() obsolete).
> 
> Cc: sta...@vger.kernel.org
> Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
> Signed-off-by: Steven Rostedt (Google) 

Acked-by: Chuck Lever 


> ---
>  include/trace/events/rpcgss.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h
> index ba2d96a1bc2f..f50fcafc69de 100644
> --- a/include/trace/events/rpcgss.h
> +++ b/include/trace/events/rpcgss.h
> @@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context,
>   __field(unsigned int, timeout)
>   __field(u32, window_size)
>   __field(int, len)
> - __string(acceptor, data)
> + __string_len(acceptor, data, len)
>   ),
>  
>   TP_fast_assign(
> @@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context,
>   __entry->timeout = timeout;
>   __entry->window_size = window_size;
>   __entry->len = len;
> - strncpy(__get_str(acceptor), data, len);
> + __assign_str(acceptor, data);
>   ),
>  
>   TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s",
> -- 
> 2.43.0
> 

-- 
Chuck Lever

Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Chuck Lever

On Wed, Apr 10, 2024 at 01:07:41PM -0400, Steven Rostedt wrote:
> On Wed, 10 Apr 2024 12:38:53 -0400
> Chuck Lever  wrote:
> 
> > On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote:
> > > From: "Steven Rostedt (Google)" 
> > > 
> > > The rpcgss_context trace event acceptor field is a dynamically sized
> > > string that records the "data" parameter. But this parameter is also
> > > dependent on the "len" field to determine the size of the data.
> > > 
> > > It needs to use __string_len() helper macro where the length can be passed
> > > in. It also incorrectly uses strncpy() to save it instead of
> > > __assign_str(). As these macros can change, it is not wise to open code
> > > them in trace events.
> > > 
> > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
> > > __assign_str() can be used for both __string() and __string_len() fields.
> > > Before that commit, __assign_str_len() is required to be used. This needs
> > > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
> > > Rework __assign_str() and __string() to not duplicate getting the string")
> > > is the commit that makes __string_str_len() obsolete).
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
> > > Signed-off-by: Steven Rostedt (Google)   
> > 
> > Acked-by: Chuck Lever 
> > 
> 
> Thanks, but feel free to take it if you want. Unless you rather have me
> take it through my tree?

Well I guess I could test it for you. I'll take it for nfsd v6.9-rc.


-- 
Chuck Lever

1 2 3 >

1 - 100 of 251 matches

Mail list logo