Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function

2013-07-24 Thread Dominique Martinet
Hi,

David Miller wrote on Wed, Jul 24, 2013 :
> From: Paul Bolle 
> Date: Thu, 25 Jul 2013 01:09:47 +0200
> > After this patch one might as well revert the rest of commit
> > 80b45261a0b2 ("9P: Add cancelled() to the transport functions.") too. It
> > seems the entire cancelled callback stuff is now pointless.
> > 
> > As I already asked in https://lkml.org/lkml/2013/7/15/87 : did that
> > commit "forget to actually hook up rdma_cancelled() into
> > p9_rdma_trans()?". It does look so to me.
> 
> If nobody responds to this in the next day or so, feel free to send me
> a patch which rips it all out.
> 
> No response means they don't care, and neither, therefore, should we.

Well, I do care - but I couldn't find where the trans->cancelled member
function was supposed to be called anyway...
So adding it to the struct and fixing the warning is well and fine, but
if it's still never called in the end I don't see much point and there's
nothing to test.

It's on my lng todo list of things to look at, but as no other reply
came and the patch was never picked up it kind of dropped in priority.
I'll have another look next week

I guess the worst that could happen would be submitting the
rdma_cancelled function again when this is sorted out :)

Regards,
-- 
Dominique Martinet
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function

2013-07-24 Thread Dominique Martinet
Hard morning, sorry for the double mail.

Dominique Martinet wrote on Thu, Jul 25, 2013 :
> Well, I do care - but I couldn't find where the trans->cancelled member
> function was supposed to be called anyway...
> So adding it to the struct and fixing the warning is well and fine, but
> if it's still never called in the end I don't see much point and there's
> nothing to test.

To be more precise, there's a single call to c->trans_mode->cancelled in
net/9p/client.c, in p9_client_flush, which is called on subfunction returning
-ERESTARTSYS... which never happens as far as I could see.

This will be useful once/if we start working on client recovery, though
- so the function in itself definitely does interest me, and I guess
that thinking about I would have preferred to have the hook added rather
than the function removed.
But there definitely is no hurry to add this cancelled function till
then.


Regards,
-- 
Dominique Martinet
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function

2013-07-25 Thread Dominique Martinet
Eric Van Hensbergen wrote on Thu, Jul 25, 2013 :
> So, the cancel function should be used to flush any pending requests that
> haven't actually been sent yet.  Looking at the 9p RDMA code, it looks like
> the thought was that this wasn't going to be possible.  Regardless of
> removing unsent requests, the flush will still be sent and if the server
> processes it before the original request and sends a flush response back
> then we need to clear the posted buffer.  This is what rdma_cancelled is
> supposed to be doing.  So, the fix is to hook it into the structure -- but
> looking at the code it seems like we probably need to do something more to
> reclaim the buffer rather than just incrementing a counter.
> 
> To be clear this has less to do with recovery and more to do with the
> proper implementation of 9p flush semantics.  By and large, those semantics
> won't impact static file system users -- but if anyone is using the
> transport to access synthetic filesystems or files then they'll definitely
> want to have a properly implemented flush setup.  The way to test this is
> to get a blocking read on a remote named pipe or fifo and then ^C it.

Ok, I knew about the concept of flush but didn't think a ^C would cause
a -ESYSRESTART, so didn't think of that.
That said, reading from, say, a fifo is an entierly local operation: the
client does a walk, getattr, doesn't do anything 9p-wise, and clunks
when it's done with it.



As for the function needing a bit more work, there's a race, but on
"normal" requests I think it is about right - the answer lays in a
comment in rdma_request:

  /* When an error occurs between posting the recv and the send,
   * there will be a receive context posted without a pending request.
   * Since there is no way to "un-post" it, we remember it and skip
   * post_recv() for the next request.
   * So here,
   * see if we are this `next request' and need to absorb an excess rc.
   * If yes, then drop and free our own, and do not recv_post().
   **/

Basically, receive buffers are sent in a queue, and we can't "retrieve"
it back, so we just don't sent next one.

There is one problem though - if the server handles the original request
before getting the flush, the receive buffer will be consumed and we
won't send a new one, so we'll starve the reception queue.
I'm afraid I don't have any bright idea there...


While we are on reception buffer issues, there is another problem with
the queue of receive buffers, even without flush, in the following
scenario:
 - post a buffer for tag 0, on a hanging request
 - post a buffer for tag 1
 - reply for tag 1 will come on buffer from tag 0
 - post another request with tag 1.. the buffer already is in the queue,
and we don't know we can post the buffer associated with tag 0 back.

I haven't found how to reproduce this perfectly yet, but a dd with
blocksize 1MB and one with blocksize 10B in parallel brought the
mountpoint down (and the whole server was completely unavailable for the
duration of the dd - TCP sessions timed out, I even got IO errors on the
local disk :D)


Regards,
-- 
Dominique Martinet
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function

2013-07-26 Thread Dominique Martinet
I think I need to stop sending mails before triple-checking things!
So sorry for the multiple mails again.

Dominique Martinet wrote on Thu, Jul 25, 2013 :
> [rdma_cancelled]
> There is one problem though - if the server handles the original request
> before getting the flush, the receive buffer will be consumed and we
> won't send a new one, so we'll starve the reception queue.
> I'm afraid I don't have any bright idea there...

This still looks correct to me.


> While we are on reception buffer issues, there is another problem with
> the queue of receive buffers, even without flush, in the following
> scenario:
>  - post a buffer for tag 0, on a hanging request
>  - post a buffer for tag 1
>  - reply for tag 1 will come on buffer from tag 0
>  - post another request with tag 1.. the buffer already is in the queue,
> and we don't know we can post the buffer associated with tag 0 back.

It actually looks like the reply buffers are swapped properly - taken
out of the req struct into the context on send, then given back to the
appropriate req on reception, so on normal operation there's no problem
with what I described - sorry for crying wolf.
 
> I haven't found how to reproduce this perfectly yet, but a dd with
> blocksize 1MB and one with blocksize 10B in parallel brought the
> mountpoint down (and the whole server was completely unavailable for the
> duration of the dd - TCP sessions timed out, I even got IO errors on the
> local disk :D)

I need to run more tests to explain what happens with the two dds, but
it's easily reproductible with debugs on, I guess that helps with a race
somewhere.

Regards,
-- 
Dominique Martinet
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function

2013-07-26 Thread Dominique Martinet
Eric Van Hensbergen wrote on Fri, Jul 26, 2013 :
> It depends on the flags you use with 9p.  If you mount with nodevmap it'll
> treat them like normal files.  Alternatively you can use synthetics from
> fuse or even 9p file systems from Plan 9 port.

Ok, thank you for the explanation. Now I need to see if my server
supports this :)

> I need to go back and refresh more of my RDMA knowledge, I thought you
> could provide a better more direct coupling between the request and
> response buffers (similar to what we do in virtio).  In such a case, you'd
> simply need to reclaim the "skipped" buffer not shuffle everything around.
> Ultimately that would seem to work against the point of having "zero copy"
> RDMA.

What I said next was wrong, but the queue really isn't something you can
control this finely. Once something is posted, you can't take it back,
and you can't know which buffer will come back for which reply (but it's
easy enough to read the tag from the reply and associate it back)

For flush, I'm honestly not sure on how to deal with the possibility of
the servers processing the original query and sending a reply after
we've tagged our own buffer as free again, but I think there might
actually be a problem even on sucessful flush if the next post is from a
different tag (that has its own rc buffer) -- as far as I understand it,
this one's reply will be considered duplicate reply in handle_recv.
But I really need to do some more precise testing, I'll try playing with
nodevmap, if I get the flush functions to be called I'll get this
fixed.


> > I haven't found how to reproduce this perfectly yet, but a dd with
> > blocksize 1MB and one with blocksize 10B in parallel brought the
> > mountpoint down (and the whole server was completely unavailable for the
> > duration of the dd - TCP sessions timed out, I even got IO errors on the
> > local disk :D)
> >
> 
> Sounds like you ran out of memory and things didn't degrade gracefully.

I think it's actually "just" in the debug, there are two p9_debug printk
in p9_client_cb which are called from an rdma irq handler, and it gets
messy once the kernel buffer is full and it wants to flush to the
console.
I've got a short patch that adds an argument to p9_client_cb with an
indication that it shouldn't output messages, I'll send it to the v9fs
mailing list for comments if you think the idea is acceptable.

>  Which server are you using?

I know of two servers out there which handle 9P/RDMA - diod (
http://code.google.com/p/diod/ ) and nfs-ganesha (
https://github.com/nfs-ganesha/nfs-ganesha )
I wrote the RDMA implementation for ganesha and am mostly running tests
on that, but both should somewhat work with their quirks :)

Regards,
-- 
Dominique Martine
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V9fs-developer] [PATCH] 9p: trans_fd, initialize recv fcall properly if not set

2015-09-05 Thread Dominique Martinet
Eric Van Hensbergen wrote on Sat, Sep 05, 2015:
> On Thu, Sep 3, 2015 at 4:38 AM, Dominique Martinet
>  wrote:
> > To be honest, I think it might be better to just bail out if we get in
> > this switch (m->req->rc == NULL after p9_tag_lookup) and not try to
> > allocate more, because if we get there it's likely a race condition and
> > silently re-allocating will end up in more troubles than trying to
> > recover is worth.
> > Thoughts ?
> >
> 
> Hmmm...trying to rattle my brain and remember why I put it in there
> back in 2008.
> It might have just been over-defensive programming -- or more likely it just
> pre-dated all the zero copy infrastructure which pretty much guaranteed we had
> an rc allocated and what is there is vestigial.  I'm happy to accept a
> patch which
> makes this an assert, or perhaps just resets the connection because something
> has gone horribly wrong (similar to the ENOMEM path that is there now).

Yeah, it looks like the safety comes from the zero-copy stuff that came
much later.
Let's go with resetting the connection then. Hmm. EIO is a bit too
generic so would be good to avoid that if possible, but can't think of
anything better...


Speaking of zero-copy, I believe it should be fairly straight-forward to
implement for trans_fd now I've actually looked at it, since we do the
payload read after a p9_tag_lookup, would just need m->req to point to a
zc buffer. Write is similar, if there's a zc buffer just send it after
the header.
The cost is a couple more pointers in req and an extra if in both
workers, that seems pretty reasonable.

Well, I'm not using trans_fd much here (and unfortunately zero-copy
isn't possible at all given the transport protocol for RDMA, at least
for recv), but if anyone cares it probably could be done without too
much hassle for the fd workers.

-- 
Dominique
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] 9p: trans_fd, bail out if recv fcall if missing

2015-09-07 Thread Dominique Martinet
req->rc is pre-allocated early on with p9_tag_alloc and shouldn't be missing

Signed-off-by: Dominique Martinet 
---
 net/9p/trans_fd.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

Feel free to adapt error code/message if you can think of something better.

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index a270dcc..a6d89c0 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -356,13 +356,12 @@ static void p9_read_work(struct work_struct *work)
}
 
if (m->req->rc == NULL) {
-   m->req->rc = kmalloc(sizeof(struct p9_fcall) +
-   m->client->msize, GFP_NOFS);
-   if (!m->req->rc) {
-   m->req = NULL;
-   err = -ENOMEM;
-   goto error;
-   }
+   p9_debug(P9_DEBUG_ERROR,
+"No recv fcall for tag %d (req %p), 
disconnecting!\n",
+m->rc.tag, m->req);
+   m->req = NULL;
+   err = -EIO;
+   goto error;
}
m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall);
memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] 9p: trans_fd, read rework to use p9_parse_header

2015-09-03 Thread Dominique Martinet
Most of the changes here are no-op and just renaming to use a
fcall struct, needed for p9_parse_header

It fixes the unaligned memory access to read the tag and defers to
common functions for part of the protocol knowledge (although header
length is still hard-coded...)

Reported-By: Rob Landley 
Signed-Off-By: Dominique Martinet 
---
 net/9p/trans_fd.c | 75 +--
 1 file changed, 40 insertions(+), 35 deletions(-)

It ended up alot bigger than I thought it'd be, submitting it anyway
but happy with either version - letting Eric decide what's better :)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index bced8c0..a270dcc 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -108,9 +108,7 @@ struct p9_poll_wait {
  * @unsent_req_list: accounting for requests that haven't been sent
  * @req: current request being processed (if any)
  * @tmp_buf: temporary buffer to read in header
- * @rsize: amount to read for current frame
- * @rpos: read position in current frame
- * @rbuf: current read buffer
+ * @rc: temporary fcall for reading current frame
  * @wpos: write position for current frame
  * @wsize: amount of data to write for current frame
  * @wbuf: current write buffer
@@ -131,9 +129,7 @@ struct p9_conn {
struct list_head unsent_req_list;
struct p9_req_t *req;
char tmp_buf[7];
-   int rsize;
-   int rpos;
-   char *rbuf;
+   struct p9_fcall rc;
int wpos;
int wsize;
char *wbuf;
@@ -305,49 +301,56 @@ static void p9_read_work(struct work_struct *work)
if (m->err < 0)
return;
 
-   p9_debug(P9_DEBUG_TRANS, "start mux %p pos %d\n", m, m->rpos);
+   p9_debug(P9_DEBUG_TRANS, "start mux %p pos %zd\n", m, m->rc.offset);
 
-   if (!m->rbuf) {
-   m->rbuf = m->tmp_buf;
-   m->rpos = 0;
-   m->rsize = 7; /* start by reading header */
+   if (!m->rc.sdata) {
+   m->rc.sdata = m->tmp_buf;
+   m->rc.offset = 0;
+   m->rc.capacity = 7; /* start by reading header */
}
 
clear_bit(Rpending, &m->wsched);
-   p9_debug(P9_DEBUG_TRANS, "read mux %p pos %d size: %d = %d\n",
-m, m->rpos, m->rsize, m->rsize-m->rpos);
-   err = p9_fd_read(m->client, m->rbuf + m->rpos,
-   m->rsize - m->rpos);
+   p9_debug(P9_DEBUG_TRANS, "read mux %p pos %zd size: %zd = %zd\n",
+m, m->rc.offset, m->rc.capacity,
+m->rc.capacity - m->rc.offset);
+   err = p9_fd_read(m->client, m->rc.sdata + m->rc.offset,
+m->rc.capacity - m->rc.offset);
p9_debug(P9_DEBUG_TRANS, "mux %p got %d bytes\n", m, err);
-   if (err == -EAGAIN) {
+   if (err == -EAGAIN)
goto end_clear;
-   }
 
if (err <= 0)
goto error;
 
-   m->rpos += err;
+   m->rc.offset += err;
 
-   if ((!m->req) && (m->rpos == m->rsize)) { /* header read in */
-   u16 tag;
+   /* header read in */
+   if ((!m->req) && (m->rc.offset == m->rc.capacity)) {
p9_debug(P9_DEBUG_TRANS, "got new header\n");
 
-   n = le32_to_cpu(*(__le32 *) m->rbuf); /* read packet size */
-   if (n >= m->client->msize) {
+   err = p9_parse_header(&m->rc, NULL, NULL, NULL, 0);
+   if (err) {
p9_debug(P9_DEBUG_ERROR,
-"requested packet size too big: %d\n", n);
+"error parsing header: %d\n", err);
+   goto error;
+   }
+
+   if (m->rc.size >= m->client->msize) {
+   p9_debug(P9_DEBUG_ERROR,
+"requested packet size too big: %d\n",
+m->rc.size);
err = -EIO;
goto error;
}
 
-   tag = le16_to_cpu(*(__le16 *) (m->rbuf+5)); /* read tag */
p9_debug(P9_DEBUG_TRANS,
-"mux %p pkt: size: %d bytes tag: %d\n", m, n, tag);
+"mux %p pkt: size: %d bytes tag: %d\n",
+m, m->rc.size, m->rc.tag);
 
-   m->req = p9_tag_lookup(m->client, tag);
+   m->req = p9_tag_lookup(m->client, m->rc.tag);
if (!m->req || (m->req->status != REQ_STATUS_SENT)) {
p9_debug(P9_DEBUG_ERROR, "Unexpected packet tag %d\n",
-tag);
+   

[PATCH] 9p: trans_fd, initialize recv fcall properly if not set

2015-09-03 Thread Dominique Martinet
That code really should never be called (rc is allocated in
tag_alloc), but if it had been it couldn't have worked...

Signed-off-by: Dominique Martinet 
---
 net/9p/trans_fd.c | 3 +++
 1 file changed, 3 insertions(+)

To be honest, I think it might be better to just bail out if we get in
this switch (m->req->rc == NULL after p9_tag_lookup) and not try to
allocate more, because if we get there it's likely a race condition and
silently re-allocating will end up in more troubles than trying to
recover is worth.
Thoughts ?

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index a270dcc..0d9831a 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -363,6 +363,9 @@ static void p9_read_work(struct work_struct *work)
err = -ENOMEM;
goto error;
}
+   m->req->rc.capacity = m->client->msize;
+   m->req->rc.sdata = (char*)m->req->rc +
+   sizeof(struct p9_fcall);
}
m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall);
memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-08-02 Thread Dominique Martinet
Greg Kurz wrote on Thu, Aug 02, 2018:
> > @@ -443,9 +444,9 @@ static int rdma_request(struct p9_client *client, 
> > struct p9_req_t *req)
> >  **/
> > if (unlikely(atomic_read(&rdma->excess_rc) > 0)) {
> > if ((atomic_sub_return(1, &rdma->excess_rc) >= 0)) {
> > -   /* Got one ! */
> > -   kfree(req->rc);
> > -   req->rc = NULL;
> > +   /* Got one! */
> > +   kfree(req->rc.sdata);
> 
> Shouldn't this be p9_fcall_fini(&req->rc) ?

Right, I failed at bookkeeping, I changed that in the next patch but it
should have been done now.
Will add p9_fcall_fini to headers/export in this patch instead of the
next

> The rest looks good, so, with that fixed, you can add:
> 
> Reviewed-by: Greg Kurz 

Thanks!

-- 
Dominique


Re: [V9fs-developer] INFO: task hung in grab_super

2018-08-02 Thread Dominique Martinet
Dmitry Vyukov via V9fs-developer wrote on Wed, Jul 18, 2018:
> >> Btw, I see that p9_client_rpc uses wait_event_killable, why wasn't it
> >> killed along with the whole process?
> >>
> >
> > wait_event_killable() would return -ERESTARTSYS if got SIGKILL.
> > But if (c->status == Connected) && (type == P9_TFLUSH) is also true,
> > it ignores SIGKILL by retrying the loop...
> >
> >   again:
> > err = wait_event_killable(*req->wq, req->status >= REQ_STATUS_RCVD);
> > if ((err == -ERESTARTSYS) && (c->status == Connected) && (type == 
> > P9_TFLUSH)) {
> >   sigpending = 1;
> >   clear_thread_flag(TIF_SIGPENDING);
> >   goto again;
> > }
> >
> > I wish they don't ignore SIGKILL (by e.g. offloading operations to a kernel 
> > thread).
> 
> 
> I guess that's the problem, right? SIGKILL-ed task must not ignore
> SIGKILL and hang in infinite loop. This would explain a bunch of hangs
> in 9p.

Tricky with the current way we handle this, as the normal action if
wait_event_killable is interrupted is to send a tflush message (which is
what you could also notice, if you just send one sigkill it'll just send
a flush message and wait for that instead)

There's work in progress to add refcounting to requests which would make
us one step closer to being able to not wait for the flush reply (or
rather, it'll give us the ability to wait for it asynchronously) ; we
should be able to get rid of that loop after that.

-- 
Dominique


Re: [V9fs-developer] [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag

2018-08-02 Thread Dominique Martinet
piaojun wrote on Fri, Aug 03, 2018:
> We'd better reach an agreement about the patch fix.

The way I read Greg's comment was that he agreed to the proposed changes
and is waiting for a new version.


I'm writing a longer reply than I should because I don't like people
saying strncmp is safe just because it's strncmp, feel free to skim
through the rest as it's just ranting.


> In my opinion, replacing strlen(chan->tag) with a local variable
> sounds reasonable,

So we agree here

> and changing strncmp to strcmp may be little beneficial, as strcmp is more
> dangerours such as buffer-flow.

strcmp is more dangerous for buffer-overflow if you're comparing
"unsafe" non-null terminated strings.
This isn't the case here as you've constructed chan->tag yourself and
you rely on it being null-terminated by calling strlen() on it yourself.

strncmp(x, y, strlen(y)+1) is at best awkward, but it's a false sense of
security if you think this is any better than strcmp here. It implies that:
 - y is null-terminated (for strlen() to work)
 - x is either null-terminated or longer than y

Here, x is the devname argument to p9_virtio_create, which comes
straight from the mount syscall with "copy_mount_string", using
strndup_user, which returns a null-terminated string or an error.

(the code is currently not safe if it returns an error, I'm sending
another mail about it right after this one as we already have a partial
fix)


strcmp(x, y) on the other hand assumes that x and y are null-terminated
in this case, which is the same assumptions you have, so is strictly as
"safe" as strncmp used that way.
(it could also assume that one is null terminated and the other one is
at least as long as that, e.g. when comparing a "char buf[42]" to the
constant "foo" then we don't care if buf is null-terminated)


Thanks,
-- 
Dominique



Re: [V9fs-developer] [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag

2018-08-02 Thread Dominique Martinet
Dominique Martinet wrote on Fri, Aug 03, 2018:
> (the code is currently not safe if it returns an error, I'm sending
> another mail about it right after this one as we already have a partial
> fix)

I take that one back, ksys_mount() does check for error properly so just
the null checks we have in Tomas' patch[1] are enough ; sorry for the
double-mail

[1] http://lkml.kernel.org/r/20180727110558.5479-1-tomasbort...@gmail.com

-- 
Dominique


Re: [PATCH v3] net/9p/trans_virtio.c: add null terminal for mount tag

2018-08-03 Thread Dominique Martinet
piaojun wrote on Fri, Aug 03, 2018:
> chan->tag is Non-null terminated which will result in printing messy code
> when debugging code. So we should add '\0' for tag to make the code more
> convenient and robust. In addition, I drop char->tag_len to simplify the
> code.

LGTM, I'll pick this up.

Thanks for the rework

> Signed-off-by: Jun Piao 
> ---
>  net/9p/trans_virtio.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index d422bfc..46a5ab2 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -89,10 +89,8 @@ struct virtio_chan {
>   unsigned long p9_max_pages;
>   /* Scatterlist: can be too big for stack. */
>   struct scatterlist sg[VIRTQUEUE_NUM];
> -
> - int tag_len;
>   /*
> -  * tag name to identify a mount Non-null terminated
> +  * tag name to identify a mount null terminated
>*/
>   char *tag;
> 
> @@ -525,14 +523,15 @@ static ssize_t p9_mount_tag_show(struct device *dev,
>  {
>   struct virtio_chan *chan;
>   struct virtio_device *vdev;
> + int tag_len;
> 
>   vdev = dev_to_virtio(dev);
>   chan = vdev->priv;
> + tag_len = strlen(chan->tag);
> 
> - memcpy(buf, chan->tag, chan->tag_len);
> - buf[chan->tag_len] = 0;
> + memcpy(buf, chan->tag, tag_len + 1);
> 
> - return chan->tag_len + 1;
> + return tag_len + 1;
>  }
> 
>  static DEVICE_ATTR(mount_tag, 0444, p9_mount_tag_show, NULL);
> @@ -585,7 +584,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
>   err = -EINVAL;
>   goto out_free_vq;
>   }
> - tag = kmalloc(tag_len, GFP_KERNEL);
> + tag = kzalloc(tag_len + 1, GFP_KERNEL);
>   if (!tag) {
>   err = -ENOMEM;
>   goto out_free_vq;
> @@ -594,7 +593,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
>   virtio_cread_bytes(vdev, offsetof(struct virtio_9p_config, tag),
>  tag, tag_len);
>   chan->tag = tag;
> - chan->tag_len = tag_len;
>   err = sysfs_create_file(&(vdev->dev.kobj), &dev_attr_mount_tag.attr);
>   if (err) {
>   goto out_free_tag;
> @@ -654,8 +652,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> 
>   mutex_lock(&virtio_9p_lock);
>   list_for_each_entry(chan, &virtio_chan_list, chan_list) {
> - if (!strncmp(devname, chan->tag, chan->tag_len) &&
> - strlen(devname) == chan->tag_len) {
> + if (!strcmp(devname, chan->tag)) {
>   if (!chan->inuse) {
>   chan->inuse = true;
>   found = 1;
> -- 


Re: KASAN: invalid-free in p9stat_free

2018-08-26 Thread Dominique Martinet
syzbot wrote on Sun, Aug 26, 2018:
> HEAD commit:e27bc174c9c6 Add linux-next specific files for 20180824
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15dc19a640
> kernel config:  https://syzkaller.appspot.com/x/.config?x=28446088176757ea
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4252148d198410b864f
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f8efba40
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1178256a40
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d4252148d198410b8...@syzkaller.appspotmail.com
> 
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> ==
> BUG: KASAN: double-free or invalid-free in p9stat_free+0x35/0x100
> net/9p/protocol.c:48

That looks straight-forward enough, p9pdu_vreadf does p9stat_free on
error then v9fs_dir_readdir does the same ; there is nothing else that
could return an error without going through the first free so we could
just remove the later one...

There are a couple other users of the 'S' pdu read (that reads the stat
struct and frees it on error), so it's probably best to keep the current
behaviour as far as this is concerned, what we could do though is make
the free function idempotent (write NULLs in the freed fields), but I do
not see this being done often, do you know what the policy is about
this kind of pattern nowadays?

The struct is cleanly zeroed before being read so there is no risk of
double-frees between iterations so zeroing pointers is not strictly
required, but it does make things safer in general.


-- 
Dominique Martinet


KASAN error in [PATCH v3 7/8] proc/kcore: optimize multiple page reads

2018-08-28 Thread Dominique Martinet
 in the first case below:
if (&m->list == &kclist_head) {
meaning no address matched in the list, so you cannot check m->addr and
m->size in this case -- I'm afraid you will have to run through the list
just in case if that happens even if there likely won't be any match for
the next address either.


> + list_for_each_entry(m, &kclist_head, list) {
> + if (start >= m->addr &&
> +     start < m->addr + m->size)
> + break;
> + }
>   }
>  
>   if (&m->list == &kclist_head) {

-- 
Dominique Martinet


[PATCH] proc/kcore: fix invalid memory access in multi-page read optimization

2018-08-28 Thread Dominique Martinet
The 'm' kcore_list item can point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every address
if we did not find any entry in the previous iteration

Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet 
---

I guess now I'm looking at bf991c2231117 again that it would be slightly
more efficient to remove the !m check and initialize m to point to
kclist_head like this:
 m = list_entry(&kclist_head, struct kcore_list, list);
but it feels a bit forced to me; deferring the choice to others.

 fs/proc/kcore.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ad72261ee3fe..50036f6e1f52 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -451,7 +451,8 @@ read_kcore(struct file *file, char __user *buffer, size_t 
buflen, loff_t *fpos)
 * If this is the first iteration or the address is not within
 * the previous entry, search for a matching entry.
 */
-   if (!m || start < m->addr || start >= m->addr + m->size) {
+   if (!m || &m->list == &kclist_head || start < m->addr ||
+   start >= m->addr + m->size) {
list_for_each_entry(m, &kclist_head, list) {
if (start >= m->addr &&
start < m->addr + m->size)
-- 
2.17.1



Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it

2018-08-08 Thread Dominique Martinet
piaojun wrote on Wed, Aug 08, 2018:
> I found that 9pnet_virtio.ko could not be removed by rmmod command, and I
> could still found it by lsmod. The reason is that we forgot decrease the
> refcount of 9p virtio device by kobject_put. So we should put refcount in
> p9_virtio_remove.


Hmm, I cannot seem to reproduce this. Can you give more details on how
to get into stuck state?
I tried mounting something, accessing the sysfs files, etc to no avail.

lsmod gives a counter to how many references there are to the module,
you can use that to debug a bit.
For example here I get this line just after loading the module:
  9pnet_virtio   32768  0

Then after mounting something there is one reference:
  9pnet_virtio   32768  1

Then unmounting puts that back to 0 and 'modprobe -r' (or rmmod) works.




I dislike saying the next part but I think form also is important,
please bear with me:

 - shorter subject line, please. For example, you can lose 20 characters
by reodering words so there is no need for pronouns
"net/9p/virtio: decrease 9p virtio device refcount on removal"

 - I personally dislike commit messages that are "novelized" (that is,
put yourself as an actor and describe what you were doing)
That seems to be somewhat accepted as looking at the kernel's git log I
see some (few) commits using "I ..." that are not pull request messages
but if possible please avoid this style and try to describe facts, how
things are wrong, what got fixed and if required how.
To give an example again, this says the same thing:
"The 9pnet_virtio module could not be unloaded because we forgot to
decrease the refcount of the 9p virtio device with kobject_put.

Put the reference in 9p_virtio_remove"


Thanks,
-- 
Dominique


Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it

2018-08-08 Thread Dominique Martinet
piaojun wrote on Wed, Aug 08, 2018:
> I try to remove 9pnet_virtio.ko by 'rmmod 9pnet_virtio' as I want to
> replace it without rebooting system.

I do that all the time when testing, it works for me.
What exact kernel commit are you running?

> Here I have not mount 9pfs yet, so the refcount is still 0.
> 
> Before rmmod:
> # lsmod | grep 9p
> 9pnet_virtio   20480  0
> 9pnet 106496  1 9pnet_virtio
> virtio_ring28672  5 
> virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net
> virtio 16384  5 
> virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net
> 
> After rmmod:
> # lsmod | grep 9p
> 9pnet_virtio   20480  0
> 9pnet 106496  1 9pnet_virtio
> virtio_ring28672  5 
> virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net
> virtio 16384  5 
> virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net
> 
> Normally 9pnet_virtio should be invisible after rmmod like this:
> # lsmod | grep 9p
> 9pnet 106496  0

Right, that obviously didn't work...

But on the other hand, if I apply your commit and load/unload
9pnet_virtio 5-10 times (I ran it in a loop) I get KASAN errors because
we put too many of these refs ; that doesn't happen without your patch
so it's apparently wrong.
I'm curious how that could make modprobe work better for you as well, it
shouldn't depend on that...

Maybe `modprobe -r` might give a better error, or something in dmesg?

-- 
Dominique


Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it

2018-08-08 Thread Dominique Martinet
piaojun wrote on Thu, Aug 09, 2018:
> > What exact kernel commit are you running?
>
> My kernel commit id 6edf1d4cb0acde, and I replace the 9p code with
> 9p-next. And I wonder if this will work well?

That is somewhere on top of 4.18-rc1 and got merged in 4.18-rc4, which
are close enough so while I can question the practice I don't see why
not.

I've just tried the following:
$ git checkout 6edf1d4cb0acde
$ git checkout martinetd/9p-next net/9p fs/9p include/net/9p
(martinetd/9p-next is 9f961802a7 as of this mail)

$ uname -r
4.18.0-rc1+
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
9pnet_virtio   32768  0
$ sudo modprobe -r 9pnet_virtio
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
not loaded
$ sudo modprobe 9pnet_virtio
$ sudo mount -t 9p -o debug=1,trans=virtio shm /mnt
$ ls /mnt

$ cat /sys/module/9pnet_virtio/drivers/virtio\:9pnet_virtio/*/mount_tag
tmpshm (these could use a new line...)
$ sudo umount /mnt
$ sudo modprobe -r 9pnet_virtio
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
not loaded

The /sys/devices/pci*/*/virtio*/mount_tag files are also removed
properly; I don't see any problem.


Not being able to reproduce is fine in general, but I also get problems
when applying the patch and unloading the module multiple times so I
can't help but question this patch and think your problem lies somewhere
else.

-- 
Dominique


[PATCH v3 2/2] net/9p: add a per-client fcall kmem_cache

2018-08-09 Thread Dominique Martinet
From: Dominique Martinet 

Having a specific cache for the fcall allocations helps speed up
allocations a bit, especially in case of non-"round" msizes.

The caches will automatically be merged if there are multiple caches
of items with the same size so we do not need to try to share a cache
between different clients of the same size.

Since the msize is negotiated with the server, only allocate the cache
after that negotiation has happened - previous allocations or
allocations of different sizes (e.g. zero-copy fcall) are made with
kmalloc directly.

Signed-off-by: Dominique Martinet 
Cc: Matthew Wilcox 
Cc: Greg Kurz 
Cc: Jun Piao 
---
v2:
 - Add a pointer to the cache in p9_fcall to make sure a buffer
allocated with kmalloc gets freed with kfree and vice-versa
This could have been smaller with a bool but this spares having
to look at the client so looked a bit cleaner, I'm expecting this
patch will need a v3 one way or another so I went for the bolder
approach - please say if you think a smaller item is better ; I *think*
nothing relies on this being ordered the same way as the data on the
wire (struct isn't packed anyway) so we can move id after tag and add
another u8 to not have any overhead

 - added likely() to cache existence check in allocation, but nothing
for msize check or free because of zc request being of different size

v3:
 - defined the userdata region in the cache, as read or write calls can
access the buffer directly (lead to warnings with HARDENED_USERCOPY=y)



I didn't get any comment on v2 for this patch, but I'm still not fully
happy with this in all honesty.

Part of me believes we might be better off always allocating from the
cache even for zero-copy headers, it's a waste of space but the buffers
are there and being reused constantly for non-zc calls, and mixing
kmallocs in could lead to these buffers being really freed and
reallocated all the time instead of reusing the slab buffers
continuously.

That would let me move the likely() for the fast path, with the only
exception being the TVERSION initial call on mount, for which I still
don't have any nice idea on how to handle, using a different size in
p9_client_rpc or prepare_req if the type is TVERSION and I'm really not
happy with that...


 include/net/9p/9p.h |  4 
 include/net/9p/client.h |  1 +
 net/9p/client.c | 40 +++-
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index e23896116d9a..645266b74652 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -336,6 +336,9 @@ enum p9_qid_t {
 #define P9_NOFID   (u32)(~0)
 #define P9_MAXWELEM16
 
+/* Minimal header size: len + id + tag */
+#define P9_HDRSZ   7
+
 /* ample room for Twrite/Rread header */
 #define P9_IOHDRSZ 24
 
@@ -558,6 +561,7 @@ struct p9_fcall {
size_t offset;
size_t capacity;
 
+   struct kmem_cache *cache;
u8 *sdata;
 };
 
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index c2671d40bb6b..735f3979d559 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -123,6 +123,7 @@ struct p9_client {
struct p9_trans_module *trans_mod;
enum p9_trans_status status;
void *trans;
+   struct kmem_cache *fcall_cache;
 
union {
struct {
diff --git a/net/9p/client.c b/net/9p/client.c
index ed78751aee7c..6c57ab1294d7 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -192,6 +192,9 @@ static int parse_opts(char *opts, struct p9_client *clnt)
goto free_and_return;
}
 
+   if (clnt->trans_mod) {
+   pr_warn("Had trans %s\n", 
clnt->trans_mod->name);
+   }
v9fs_put_trans(clnt->trans_mod);
clnt->trans_mod = v9fs_get_trans_by_name(s);
if (clnt->trans_mod == NULL) {
@@ -231,9 +234,16 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize)
+static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
+int alloc_msize)
 {
-   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+   fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
+   fc->cache = c->fcall_cache;
+   } else {
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   fc->cache = NULL;
+   }
if (!fc->sdata)
return -ENOMEM;
fc->capacity = alloc_msize;
@@ -242,7 +252,16 @@ static int p9_fcall_init(struct p9_fcall *fc, int 
alloc_msize)
 
 void p9_fcall_fini(struct p9_fcall *fc)
 {
-   kfree(fc->sd

[PATCH v3 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-08-09 Thread Dominique Martinet
From: Dominique Martinet 

'msize' is often a power of two, or at least page-aligned, so avoiding
an overhead of two dozen bytes for each allocation will help the
allocator do its work and reduce memory fragmentation.

Suggested-by: Matthew Wilcox 
Signed-off-by: Dominique Martinet 
Reviewed-by: Greg Kurz 
Cc: Matthew Wilcox 
Cc: Jun Piao 
---
v2:
 - Add extra label to not free uninitialized memory on alloc failure
 - Rename p9_fcall_alloc to 9p_fcall_init
 - Add a p9_fcall_fini function to echo to init

v3:
 - use p9_call_fini() in rdma_request() instead of kfree, the code
was in the following patch in previous version

 include/net/9p/client.h |   5 +-
 net/9p/client.c | 167 +---
 net/9p/trans_fd.c   |  12 +--
 net/9p/trans_rdma.c |  29 +++
 net/9p/trans_virtio.c   |  18 ++---
 net/9p/trans_xen.c  |  12 +--
 6 files changed, 125 insertions(+), 118 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index a4dc42c53d18..c2671d40bb6b 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -95,8 +95,8 @@ struct p9_req_t {
int status;
int t_err;
wait_queue_head_t wq;
-   struct p9_fcall *tc;
-   struct p9_fcall *rc;
+   struct p9_fcall tc;
+   struct p9_fcall rc;
void *aux;
struct list_head req_list;
 };
@@ -230,6 +230,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char 
*name, int mode,
kgid_t gid, struct p9_qid *);
 int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 
*status);
 int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl);
+void p9_fcall_fini(struct p9_fcall *fc);
 struct p9_req_t *p9_tag_lookup(struct p9_client *, u16);
 void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status);
 
diff --git a/net/9p/client.c b/net/9p/client.c
index 88db45966740..ed78751aee7c 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -231,16 +231,20 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
+static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize)
 {
-   struct p9_fcall *fc;
-   fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS);
-   if (!fc)
-   return NULL;
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (!fc->sdata)
+   return -ENOMEM;
fc->capacity = alloc_msize;
-   fc->sdata = (char *) fc + sizeof(struct p9_fcall);
-   return fc;
+   return 0;
+}
+
+void p9_fcall_fini(struct p9_fcall *fc)
+{
+   kfree(fc->sdata);
 }
+EXPORT_SYMBOL(p9_fcall_fini);
 
 static struct kmem_cache *p9_req_cache;
 
@@ -263,13 +267,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned 
int max_size)
if (!req)
return NULL;
 
-   req->tc = p9_fcall_alloc(alloc_msize);
-   req->rc = p9_fcall_alloc(alloc_msize);
-   if (!req->tc || !req->rc)
+   if (p9_fcall_init(&req->tc, alloc_msize))
+   goto free_req;
+   if (p9_fcall_init(&req->rc, alloc_msize))
goto free;
 
-   p9pdu_reset(req->tc);
-   p9pdu_reset(req->rc);
+   p9pdu_reset(&req->tc);
+   p9pdu_reset(&req->rc);
req->status = REQ_STATUS_ALLOC;
init_waitqueue_head(&req->wq);
INIT_LIST_HEAD(&req->req_list);
@@ -281,7 +285,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
GFP_NOWAIT);
else
tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT);
-   req->tc->tag = tag;
+   req->tc.tag = tag;
spin_unlock_irq(&c->lock);
idr_preload_end();
if (tag < 0)
@@ -290,8 +294,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
return req;
 
 free:
-   kfree(req->tc);
-   kfree(req->rc);
+   p9_fcall_fini(&req->tc);
+   p9_fcall_fini(&req->rc);
+free_req:
kmem_cache_free(p9_req_cache, req);
return ERR_PTR(-ENOMEM);
 }
@@ -329,14 +334,14 @@ EXPORT_SYMBOL(p9_tag_lookup);
 static void p9_free_req(struct p9_client *c, struct p9_req_t *r)
 {
unsigned long flags;
-   u16 tag = r->tc->tag;
+   u16 tag = r->tc.tag;
 
p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag);
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
-   kfree(r->tc);
-   kfree(r->rc);
+   p9_fcall_fini(&r->tc);
+   p9_fcall_fini(&r->rc);
kmem_cache_free(p9_req_cache, r);
 }
 
@@ -368,7 +373,7 @@ static void p9_tag_cleanup(struct p9_client *c)
  */
 void p9_client_cb(struct p9_client *c, struct

Re: [PATCH v3 2/2] net/9p: add a per-client fcall kmem_cache

2018-08-09 Thread Dominique Martinet
piaojun wrote on Fri, Aug 10, 2018:
> Could you help paste the test result of before-after-applied this patch in
> comment? And please see my comments below.

Thanks the the review, do you mean the commit message?

I'll add the summary I wrote in reply to your question a few mails
before.


> > diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
> > index e23896116d9a..645266b74652 100644
> > --- a/include/net/9p/9p.h
> > +++ b/include/net/9p/9p.h
> > @@ -336,6 +336,9 @@ enum p9_qid_t {
> >  #define P9_NOFID   (u32)(~0)
> >  #define P9_MAXWELEM16
> >  
> > +/* Minimal header size: len + id + tag */
> 
> Here should be 'size + id + tag'.

hm I didn't want to repeat size, but I guess people do refer to that
field as size.
I'll actually rewrite it as:
 Minimal header size: size[4] type[1] tag[2]

> > +   kmem_cache_destroy(clnt->fcall_cache);
> 
> I'm afraid that we should check NULL for clnt->fcall_cache.

kmem_cache_destroy() in mm/slab_common.c does the null check for us:
--
void kmem_cache_destroy(struct kmem_cache *s)
{
int err;

if (unlikely(!s))
return;
--

-- 
Dominique


[PATCH] 9p: Add Dominique Martinet to MAINTAINERS

2018-08-09 Thread Dominique Martinet
From: Dominique Martinet 

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Ron Minnich 
Cc: Andrew Morton 
---

I've had an off-list Ack from Lucho and Eric about adding myself, and
got reminded I should probably do it sooner than later by Andrew.

Could (a subset of) you three please confirm on the list?

(As usual, I was a bit hesitant on what email to use, but I really would
not be reactive on my work email so it probably is better that way)


FWIW, I did not get any answer from Ron when I tried reaching out to
him, but I finally found why: he left sandia 7 years ago so that mail
is likely a black hole.
I've sent him a mail on another address now asking him if he is ok to
be removed, and will probably submit another patch doing that within
the next few days.


 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7cebd5bba8a8..15116e6dff3a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -201,10 +201,12 @@ F:drivers/net/ethernet/8390/
 M: Eric Van Hensbergen 
 M: Ron Minnich 
 M: Latchesar Ionkov 
+M: Dominique Martinet 
 L: v9fs-develo...@lists.sourceforge.net
 W: http://swik.net/v9fs
 Q: http://patchwork.kernel.org/project/v9fs-devel/list/
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs.git
+T: git git://github.com/martinetd/linux.git
 S: Maintained
 F: Documentation/filesystems/9p.txt
 F: fs/9p/
-- 
2.17.1



[PATCH] 9p: Remove Ron Minnich from MAINTAINERS

2018-08-11 Thread Dominique Martinet
From: Dominique Martinet 

Ron Minnich has left Sandia in 2011, and has not been involved in any
9p commit in the recent years.

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Ron Minnich 
Cc: Ronald G. Minnich 
Cc: Latchesar Ionkov 
---
Thanks for agreeing Ron.

 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 15116e6dff3a..09c2a6c0fb0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -199,7 +199,6 @@ F:  drivers/net/ethernet/8390/
 
 9P FILE SYSTEM
 M: Eric Van Hensbergen 
-M: Ron Minnich 
 M: Latchesar Ionkov 
 M: Dominique Martinet 
 L: v9fs-develo...@lists.sourceforge.net
-- 
2.17.1



Re: [PATCH] compiler-gcc: get back Clang build

2018-08-21 Thread Dominique Martinet
Nick Desaulniers Aug. 21, 2018, 8:09 a.m. UTC:
> Thanks for noticing, and sending this patch.  I'm happy to see others
> testing with Clang.  I noticed this too near the end of the day
> https://github.com/ClangBuiltLinux/linux/issues/27.

FWIW libbcc so many BPF users also use clang, so this has more impact
than just testing to build linux with clang (not that this would be any
reason to delay fixing either way)

I would tend to agree havin a compiler-common + make clang/intel not
include compiler-gcc would probably be best in the long run but we might
want a quick fix for 4.19 meanwhile..

-- 
Dominique Martinet


Re: [PATCH] compiler-gcc: get back Clang build

2018-08-21 Thread Dominique Martinet
Nick Desaulniers wrote on Tue, Aug 21, 2018:
> On Tue, Aug 21, 2018 at 9:45 AM Joe Perches  wrote:
> > > Tested with gcc-7 and clang-8.
> >
> > clang-8? Isn't the latest officlal clang 6.0.1 ?
> [...]
> > So if something other than 6.0.x is required,
> > then some additional check should probably be
> > added to compiler-clang.h as well.
> 
> Sure, but that doesn't need to go in Mashiro's patch today.  That can
> wait for a proper separation between compiler headers where we can
> then implement improved version checks.

I can confirm that the patch here works for me with clang 6.0.1 as far
as bpf programs are concerned (and gcc 8.1.1 for the kernel build itself
on x86 while I was at it); so at least there is nothing on the
compiler*.h headers that put off clang 6.0.1


(also replying to other subthread)
>From Joe Perches @ 2018-08-21 17:22 UTC:
> The question remains, if clang can't compile
> v4.17, why does it immediately matter for v4.19?

I haven't had any problem with clang and 4.17 as far as building bpf
programs is concerned, because the CC_HAVE_ASM_GOTO check is done in
makefiles that aren't used in the bpf case and not in compiler-gcc.h
or another header.

So I guess the "immediately matters for v4.19" depends on how much you
would care about bcc-compiled BPF programs.


> Why wouldn't overriding the clang __GNUC_
> #defines in compiler-gcc.h work acceptably with
> adding whatever is necessary to compiler-clang.h?

I think that could work, but at the point making a separate
compiler-common.h and not including compiler-gcc.h for clang sounds
better to me... More importantly here, either solution sound complex
enough to require more than a few days and proper testing for all archs
etc when compared to the partial revert we have here.

-- 
Dominique Martinet


Re: [PATCH] compiler-gcc: get back Clang build

2018-08-21 Thread Dominique Martinet
Joe Perches wrote on Tue, Aug 21, 2018:
> On Wed, 2018-08-22 at 06:16 +0200, Dominique Martinet wrote:
> > I think that could work, but at the point making a separate
> > compiler-common.h and not including compiler-gcc.h for clang sounds
> > better to me... More importantly here, either solution sound complex
> > enough to require more than a few days and proper testing for all archs
> > etc when compared to the partial revert we have here.
> 
> The immediate need for a partial revert seems unnecessary as
> clang hasn't really worked for a couple releases now.

Sorry for repeating myself, clang is used by bcc for compiling BPF
programs (e.g. bpf_module_create_c_from_string() or any similar function
will use the clang libs to compile the bpf program with linux headers),
and this has always been working because it's not using our makefiles.

This broke today in master and I only joined this thread after looking
at why the build started failing today and noticing this patch, it
definitely hasn't been broken for two releases, or I wouldn't be here
with this timing :)


> The separate compiler file changes are much more sensible,
> even if it takes a few days.

A few days are fine, but I think some form of fix in 4.19-rc1 would be
good.

I'll stop taking your time now, but I think you are/were underestimating
how many people use clang with the linux kernel headers indirectly.
BPF is a well-used tool :)


Thanks,
-- 
Dominique Martinet



Re: [PATCH] compiler-gcc: get back Clang build

2018-08-22 Thread Dominique Martinet
Nick Desaulniers wrote on Wed, Aug 22, 2018:
> I'm currently testing a fix in
> https://github.com/ClangBuiltLinux/linux/commit/1f89ae7622c26b8131f42f3a362d6ef41b88a595,
> can you please share with me your steps to test/verify that the patch
> fixes the issue for eBPF?  I'll go talk to a co-worker who know more
> about eBPF, but I've not yet done anything with it.

Thanks for the link.

The simplest way to test with bcc for me would probably be to fetch the
example from their repo itself[1], whether you install bcc through your
distro or compile it yourself.

There are dozens of example programs in the tools/ directory of which
you can just run any, if will try to compile the program embedded wihtin
the example at runtime with clang.

[1] https://github.com/iovisor/bcc

(I just noticed fedora provides a bcc-tools package which provides these
tools directly, so if you're lucky you can just install that and run
examples in /usr/share/bcc/tools)



In particular I get a couple of errors with your patch, I think it boils
down to this one:
-
In file included from :3:
In file included from /virtual/include/bcc/helpers.h:23:
In file included from /lib/modules/4.18.0+/build/include/linux/log2.h:16:
In file included from /lib/modules/4.18.0+/build/include/linux/bitops.h:18:
In file included from 
/lib/modules/4.18.0+/build/arch/x86/include/asm/bitops.h:16:
/lib/modules/4.18.0+/build/include/linux/compiler.h:217:3: error:
expected expression
barrier();
^
/lib/modules/4.18.0+/build/include/linux/compiler-clang.h:43:20: note:
expanded from macro 'barrier'
#define barrier() (__asm__ __volatile__("" : : : "memory"))
   ^
-

And removing the parenthesis around the expression seems to work, they
weren't here in the compiler-gcc.h file in the previous version?

There might be other defines the simple examples I ran do not use but
from a quick glance it doesn't look like it, thank you for the split
work!

-- 
Dominique Martinet


Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive

2018-08-22 Thread Dominique Martinet
Nick Desaulniers wrote on Wed, Aug 22, 2018:
> Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
> recently exposed a brittle part of the build for supporting non-gcc
> compilers.
> 
> Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
> __GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
> added compiler specific checks for __clang__ or __INTEL_COMPILER.
> 
> This is brittle, as they happened to get compatibility by posing as a
> certain version of GCC.  This broke when upgrading the minimal version
> of GCC required to build the kernel, to a version above what ICC and
> Clang claim to be.
> 
> Rather than always including compiler-gcc.h then undefining or
> redefining macros in compiler-intel.h or compiler-clang.h, let's
> separate out the compiler specific macro definitions into mutually
> exclusive headers, do more proper compiler detection, and keep shared
> definitions in compiler_types.h.
> 
> Reported-by: Masahiro Yamada 
> Suggested-by: Eli Friedman 
> Suggested-by: Joe Perches 
> Signed-off-by: Nick Desaulniers 

Overall looks good to me, just pointing at the same error I wrote in my
other mail here -- I saw that by the time I was done writing this this
patch got taken but that alone will probably warrant a follow-up :/


I've also added a few style nitpicks/questions but feel free to ignore
these.

On a side note, I noticed tools/include/linux/compiler.h includes
linux/compiler-gcc.h but maybe it should include linux/compiler_types.h?
(I'm not sure at who uses that header, so it really is an open question
here)

> ---
>  arch/arm/kernel/asm-offsets.c |  13 +-
>  drivers/gpu/drm/i915/i915_utils.h |   2 +-
>  drivers/watchdog/kempld_wdt.c |   5 -
>  include/linux/compiler-clang.h|  20 ++-
>  include/linux/compiler-gcc.h  |  88 ---
>  include/linux/compiler-intel.h|  13 +-
>  include/linux/compiler_types.h| 238 ++
>  mm/ksm.c  |   4 +-
>  mm/migrate.c  |   3 +-
>  9 files changed, 133 insertions(+), 253 deletions(-)
> 
> [...]
> diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> index 7087446c24c8..5f0754692ce6 100644
> --- a/include/linux/compiler-clang.h
> +++ b/include/linux/compiler-clang.h
> [...]
> @@ -40,9 +30,17 @@
>   * checks. Unfortunately, we don't know which version of gcc clang
>   * pretends to be, so the macro may or may not be defined.
>   */
> -#undef COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW
>  #if __has_builtin(__builtin_mul_overflow) && \
>  __has_builtin(__builtin_add_overflow) && \
>  __has_builtin(__builtin_sub_overflow)
>  #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
>  #endif
> +
> +/* The following are for compatibility with GCC, from compiler-gcc.h,
> + * and may be redefined here because they should not be shared with other
> + * compilers, like ICC.
> + */
> +#define barrier() (__asm__ __volatile__("" : : : "memory"))

barrier here has gained external () compared to the definition and
compiler-gcc.h:
#define barrier() __asm__ __volatile__("": : :"memory")

And this fails with bpf:
In file included from :3:
In file included from /virtual/include/bcc/helpers.h:23:
In file included from /lib/modules/4.18.0+/build/include/linux/log2.h:16:
In file included from /lib/modules/4.18.0+/build/include/linux/bitops.h:18:
/lib/modules/4.18.0+/build/arch/x86/include/asm/bitops.h:170:2: error: expected 
expression
barrier();
^
/lib/modules/4.18.0+/build/include/linux/compiler-clang.h:43:20: note: expanded 
from macro 'barrier'
#define barrier() (__asm__ __volatile__("" : : : "memory"))
   ^


> +#define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0]))
> +#define __assume_aligned(a, ...) \
> + __attribute__((__assume_aligned__(a, ## __VA_ARGS__)))
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index 250b9b7cfd60..763bbad1e258 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> [..]
> -#define __cold   __attribute__((__cold__))
> -
>  #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), 
> __COUNTER__)
>  
>  #define __optimize(level)__attribute__((__optimize__(level)))
> -#define __nostackprotector   __optimize("no-stack-protector")

I do not see this being added back, it's probably fine though as it
doesn't look used?
(I didn't check all removed lines, maybe about half)
  
>  #define __compiletime_object_size(obj) __builtin_object_size(obj, 0)
>
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index fbf337933fd8..90479a0f3986 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> [...]
>  /* Is this type a native word size -- useful for atomic operations */
> -#ifndef __native_word
> -# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == 
> sizeof(short) || sizeof(t) == siz

Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive

2018-08-22 Thread Dominique Martinet
Linus Torvalds wrote on Wed, Aug 22, 2018:
> I've fixed that manually, but when I tried to test it I just hit the
> 
>   arch/x86/Makefile:179: *** Compiler lacks asm-goto support..  Stop.
> 
> error.
> 
> Do you have some experimental clang build with asm goto support? What
> version? Or is it just that you're building ARM, not x86?

I'm not building linux directly, but BPF programs that indirectly uses
clang with bcc

On fedora you can install bcc-tools and run
e.g. /usr/share/bcc/tools/execsnoop that will compile something at
runtime.
If the package isn't available for your distro you can just install bcc
and run the script from their repo[1]

[1] https://raw.githubusercontent.com/iovisor/bcc/master/tools/execsnoop.py


Thanks,
-- 
Dominique


Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive

2018-08-23 Thread Dominique Martinet
Nick Desaulniers wrote on Thu, Aug 23, 2018:
> > On a side note, I noticed tools/include/linux/compiler.h includes
> > linux/compiler-gcc.h but maybe it should include linux/compiler_types.h?
> > (I'm not sure at who uses that header, so it really is an open question
> > here)
> 
> Without looking into the code, this sounds like compiler.h is wrong.
> Looking at the source, there's references to ancient Android NDK's
> (what?! Let me show this to the NDK maintainers).  This whole thing
> needs to be cleaned up, too, IMO.  Oh, there's two compiler-gcc.h in
> the tree:
> 
> - tools/include/linux/compiler-gcc.h (that's what's being included in
> the case you point out)
> - include/linux/compiler-gcc.h
> 
> Maybe they can be combined?

Ah, I didn't notice there is another compiler-gcc...
Looking closer there seem to be other "reused" headers (almost all of
the headers in tools/include/linux have a counterpart in include/linux),
and some e.g. rbtree.h went from being a copy of the main include's to a
single '#include "../../../../include/linux/rbtree.h"' line to a
slightly simplified copy again to avoid bringing in rcu as dependency...

I think compiler.h could be done with a similar trick with a bit of
massaging, but as things stand linux/compiler.h includes
uapi/linux/types.h, and this complains about using kernel headers from
user space... So this isn't as trivial as just making tools use the
kernel's include at least.



> > If you tried to align these, __always_unused and __alias(symbol) look
> > off
> 
> I have `set tabstop=8` in vim, and it looks correct there, but both
> `git diff` and github's code viewer show it as off.  Maybe I have my
> settings wrong in vim and need to go back to first grade, but
> (personal opinion ahead) you don't have this kind of nonsense
> (ambiguity around how many spaces a tab should be displayed as in
> various code viewers) if you just always use spaces everywhere, for
> everything.  Other large codebases use automatic code formatters (like
> clang-format) and that prevents holy wars and other distractions.
> There's even a cool utility called `git-clang-format` that can check
> just the code you changed, which is useful for not reformatting the
> entire codebase messing up git blame.

Right, this is my mistake - the diff view adds a space so these "inner
tabs" alignments can be messed up.
FWIW I think checkpatch only complains about leading space-based
indentation, the inner filling can be whatever you want, but this is
fine and I was overzealous.


> On Wed, Aug 22, 2018 at 7:43 PM Masahiro Yamada
>  wrote:
> > It was previous included by all compilers,
> > but now it is only by _true_ GCC.
> 
> Good catch, and thanks for the report and testing.  I missed that
> because I was testing x86_64 and arm64 (which I guess don't hit that
> in the configs I tested) and not arm32.  Should be a simple fix to
> move it to shared the definition.  Send me a patch, or I'll get to it.
> 
> > Even if I move the __naked definition
> > to ,
> > I see a different error.
> 
> Did that regress due to my change?  If so, sorry and I'll look into it
> more soon.

I've looked at that one quickly and that warning looks legitimate to me,
I'm not sure why it would appear only now but clang does document[1] not
allowing the parameters to be used even in extended asm, and gcc
documentation[2] says "While using extended asm or a mixture of basic
asm and C code may appear to work, they cannot be depended upon to work
reliably and are not supported."

[1] 
http://infocenter.arm.com/help/topic/com.arm.doc.dui0774g/jhg1476893564298.html
[2] https://gcc.gnu.org/onlinedocs/gcc/ARM-Function-Attributes.html

In this case it looks like the arguments are only used for sanity with
__asmeq but in the first place the original commit for trusted
foundations talks about it not respecting the API so naked makes sense
but they're not making the API function naked (and the one which takes
an argument does use its argument) so this all feels a bit weird to me.

It might be worth asking the original authors on this one...

-- 
Dominique Martinet


Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests

2018-07-30 Thread Dominique Martinet
Dominique Martinet wrote on Mon, Jul 23, 2018:
> I'll try to get figures for various approaches before the merge window
> for 4.19 starts, it's getting closer though...

Here's some numbers; with v4.18-rc7 + current test tree (my 9p-next) as
a base.


For the context, I'm running on VMs that bind their cores to CPUs on the
host (32 cores), and have a Connect-IB mellanox card through SRIOV.

The server is nfs-ganesha, serving a tmpfs filesystem on a second VM
(different host)

Mounting with msize=$((1024*1024))

My main problem with this test is that the client has way too much
memory and it's mostly pristine with a boot not long before, so any kind
of memory pressure won't be seen here.
If someone knows how to fragment memory quickly I'll take that and rerun
the tests :)


I've changed my mind from mdtest to a simple ior, as I'm testing on
trans=rdma there's no difference and I'm more familiar with ior options.

I ran two workloads:
 - 32 processes, file per process, 512k at a time writing a total of
32GB (1GB per file), repeated 10 times
 - 32 processes, file per process, 32 bytes at a time writing a total of
16MB (512k per file), repeated 10 times.

The first test gives a proper impression of the throughput the systems
can sustain and the results are pretty much around what I was expecting
for the setup; the second test is purely a latency test (how long does
it take to send 512k RPCs)


I ran almost all of these tests with KASAN enabled in the VMs a first
time, so leaving the results with KASAN at the end for reference...


Overall I'm rather happy with the result, without KASAN the overhead of
the patch isn't negligible (~6%) but I'd say it's acceptable for
correctness and with an extra two patchs with the suggesteed changes
(rounding down the alloc size to not include the struct overhead and
separate kmem cache) it's getting down to 0.5% which is quite good, I
think.

I'll send the two patchs to the list shortly. The first one is rather
huge even if it's a trivial change logically, so part of me wants to get
it merged quickly to not have to deal with rebases... ;)


With KASAN, well, it certainly does more things but I hope
performance-critical systems don't have it enabled in the first place.



Raw results:

 * Base = 4.18-rc7 + queued patches, without request cache rework
- "Big" I/Os:
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API 
RefNum
write5842.405751.585793.53  23.935.65606 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0
read 6098.926018.636064.30  20.005.40348 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0

- "Small" I/Os:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API 
RefNum
write   2.10   1.91   2.00   0.058.01074 0 32 32 10 1 0 
1 0 0 1 524288 32 16777216 POSIX 0
read1.27   1.07   1.15   0.06   13.93901 0 32 32 10 1 0 
1 0 0 1 524288 32 16777216 POSIX 0
 -> 512k / 8.01074 = 65.4k req/s


 * Base + patch as submitted
- "Big" I/Os:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API 
RefNum
write5844.845665.325787.15  48.945.66261 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0
read 6082.246039.626057.14  12.505.40983 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0
 
- "Small" I/Os:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API 
RefNum
write   1.95   1.82   1.88   0.048.50453 0 32 32 10 1 0 
1 0 0 1 524288 32 16777216 POSIX 0
read1.18   1.07   1.14   0.03   14.04634 0 32 32 10 1 0 
1 0 0 1 524288 32 16777216 POSIX 0
 -> 512k / 8.50453 = 61.6k req/s


 * Base + patch as submitted + moving the header into req so the
allocation is "round" as suggested by Matthew
- "Big" I/Os:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API 
RefNum
write5861.795680.995795.71  48.845.65424 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0
read 6098.546037.556067.80  19.395.40036 0 32 32 10 1 0 
1 0 0 1 1073741824 524288 34359738368 POSIX 0

- "Small" I/Os:
Operation   Max(MiB)   Min(MiB)  Mean(MiB) StdDevMean(s) Test# #Tasks 
tPN reps fPP reord reordof

[PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-07-30 Thread Dominique Martinet
From: Dominique Martinet 

'msize' is often a power of two, or at least page-aligned, so avoiding
an overhead of two dozen bytes for each allocation will help the
allocator do its work and reduce memory fragmentation.

Suggested-by: Matthew Wilcox 
Signed-off-by: Dominique Martinet 
---
 include/net/9p/client.h |   4 +-
 net/9p/client.c | 160 
 net/9p/trans_fd.c   |  12 +--
 net/9p/trans_rdma.c |  29 
 net/9p/trans_virtio.c   |  18 ++---
 net/9p/trans_xen.c  |  12 +--
 6 files changed, 117 insertions(+), 118 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index a4dc42c53d18..4b4ac1362ad5 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -95,8 +95,8 @@ struct p9_req_t {
int status;
int t_err;
wait_queue_head_t wq;
-   struct p9_fcall *tc;
-   struct p9_fcall *rc;
+   struct p9_fcall tc;
+   struct p9_fcall rc;
void *aux;
struct list_head req_list;
 };
diff --git a/net/9p/client.c b/net/9p/client.c
index 88db45966740..ba99a94a12c9 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -231,15 +231,13 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
+static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize)
 {
-   struct p9_fcall *fc;
-   fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS);
-   if (!fc)
-   return NULL;
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (!fc->sdata)
+   return -ENOMEM;
fc->capacity = alloc_msize;
-   fc->sdata = (char *) fc + sizeof(struct p9_fcall);
-   return fc;
+   return 0;
 }
 
 static struct kmem_cache *p9_req_cache;
@@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned 
int max_size)
if (!req)
return NULL;
 
-   req->tc = p9_fcall_alloc(alloc_msize);
-   req->rc = p9_fcall_alloc(alloc_msize);
-   if (!req->tc || !req->rc)
+   if (p9_fcall_alloc(&req->tc, alloc_msize))
+   goto free;
+   if (p9_fcall_alloc(&req->rc, alloc_msize))
goto free;
 
-   p9pdu_reset(req->tc);
-   p9pdu_reset(req->rc);
+   p9pdu_reset(&req->tc);
+   p9pdu_reset(&req->rc);
req->status = REQ_STATUS_ALLOC;
init_waitqueue_head(&req->wq);
INIT_LIST_HEAD(&req->req_list);
@@ -281,7 +279,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
GFP_NOWAIT);
else
tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT);
-   req->tc->tag = tag;
+   req->tc.tag = tag;
spin_unlock_irq(&c->lock);
idr_preload_end();
if (tag < 0)
@@ -290,8 +288,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
return req;
 
 free:
-   kfree(req->tc);
-   kfree(req->rc);
+   kfree(req->tc.sdata);
+   kfree(req->rc.sdata);
kmem_cache_free(p9_req_cache, req);
return ERR_PTR(-ENOMEM);
 }
@@ -329,14 +327,14 @@ EXPORT_SYMBOL(p9_tag_lookup);
 static void p9_free_req(struct p9_client *c, struct p9_req_t *r)
 {
unsigned long flags;
-   u16 tag = r->tc->tag;
+   u16 tag = r->tc.tag;
 
p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag);
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
-   kfree(r->tc);
-   kfree(r->rc);
+   kfree(r->tc.sdata);
+   kfree(r->rc.sdata);
kmem_cache_free(p9_req_cache, r);
 }
 
@@ -368,7 +366,7 @@ static void p9_tag_cleanup(struct p9_client *c)
  */
 void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
 {
-   p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag);
+   p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc.tag);
 
/*
 * This barrier is needed to make sure any change made to req before
@@ -378,7 +376,7 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t 
*req, int status)
req->status = status;
 
wake_up(&req->wq);
-   p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag);
+   p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc.tag);
 }
 EXPORT_SYMBOL(p9_client_cb);
 
@@ -449,18 +447,18 @@ static int p9_check_errors(struct p9_client *c, struct 
p9_req_t *req)
int err;
int ecode;
 
-   err = p9_parse_header(req->rc, NULL, &type, NULL, 0);
-   if (req->rc->size >= c->msize) {
+   err = p9_parse_header(&req->rc, NULL, &type, NULL, 0);
+   if (req->rc.size >= c-

[PATCH 2/2] net/9p: add a per-client fcall kmem_cache

2018-07-30 Thread Dominique Martinet
From: Dominique Martinet 

Having a specific cache for the fcall allocations helps speed up
allocations a bit, especially in case of non-"round" msizes.

The caches will automatically be merged if there are multiple caches
of items with the same size so we do not need to try to share a cache
between different clients of the same size.

Since the msize is negotiated with the server, only allocate the cache
after that negotiation has happened - previous allocations or
allocations of different sizes (e.g. zero-copy fcall) are made with
kmalloc directly.

Signed-off-by: Dominique Martinet 
---
 include/net/9p/client.h |  2 ++
 net/9p/client.c | 40 
 net/9p/trans_rdma.c |  2 +-
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 4b4ac1362ad5..8d9bc7402a42 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -123,6 +123,7 @@ struct p9_client {
struct p9_trans_module *trans_mod;
enum p9_trans_status status;
void *trans;
+   struct kmem_cache *fcall_cache;
 
union {
struct {
@@ -230,6 +231,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char 
*name, int mode,
kgid_t gid, struct p9_qid *);
 int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 
*status);
 int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl);
+void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc);
 struct p9_req_t *p9_tag_lookup(struct p9_client *, u16);
 void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status);
 
diff --git a/net/9p/client.c b/net/9p/client.c
index ba99a94a12c9..215e3b1ed7b4 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -231,15 +231,34 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize)
+static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc,
+ int alloc_msize)
 {
-   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (c->fcall_cache && alloc_msize == c->msize)
+   fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
+   else
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
if (!fc->sdata)
return -ENOMEM;
fc->capacity = alloc_msize;
return 0;
 }
 
+void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc)
+{
+   /* sdata can be NULL for interrupted requests in trans_rdma,
+* and kmem_cache_free does not do NULL-check for us
+*/
+   if (unlikely(!fc->sdata))
+   return;
+
+   if (c->fcall_cache && fc->capacity == c->msize)
+   kmem_cache_free(c->fcall_cache, fc->sdata);
+   else
+   kfree(fc->sdata);
+}
+EXPORT_SYMBOL(p9_fcall_free);
+
 static struct kmem_cache *p9_req_cache;
 
 /**
@@ -261,9 +280,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
if (!req)
return NULL;
 
-   if (p9_fcall_alloc(&req->tc, alloc_msize))
+   if (p9_fcall_alloc(c, &req->tc, alloc_msize))
goto free;
-   if (p9_fcall_alloc(&req->rc, alloc_msize))
+   if (p9_fcall_alloc(c, &req->rc, alloc_msize))
goto free;
 
p9pdu_reset(&req->tc);
@@ -288,8 +307,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
return req;
 
 free:
-   kfree(req->tc.sdata);
-   kfree(req->rc.sdata);
+   p9_fcall_free(c, &req->tc);
+   p9_fcall_free(c, &req->rc);
kmem_cache_free(p9_req_cache, req);
return ERR_PTR(-ENOMEM);
 }
@@ -333,8 +352,8 @@ static void p9_free_req(struct p9_client *c, struct 
p9_req_t *r)
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
-   kfree(r->tc.sdata);
-   kfree(r->rc.sdata);
+   p9_fcall_free(c, &r->tc);
+   p9_fcall_free(c, &r->rc);
kmem_cache_free(p9_req_cache, r);
 }
 
@@ -944,6 +963,7 @@ struct p9_client *p9_client_create(const char *dev_name, 
char *options)
 
clnt->trans_mod = NULL;
clnt->trans = NULL;
+   clnt->fcall_cache = NULL;
 
client_id = utsname()->nodename;
memcpy(clnt->name, client_id, strlen(client_id) + 1);
@@ -980,6 +1000,9 @@ struct p9_client *p9_client_create(const char *dev_name, 
char *options)
if (err)
goto close_trans;
 
+   clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize,
+ 0, 0, NULL);
+
return clnt;
 
 close_trans:
@@ -1011,6 +1034,7 @@ void p9_client_destroy(struct 

Re: [PATCH] net/9p/trans_virtio.c: replace mutex_lock with spin_lock to protect 'virtio_chan_list'

2018-07-18 Thread Dominique Martinet
piaojun wrote on Thu, Jul 19, 2018:
> > piaojun wrote on Wed, Jul 18, 2018:
> > That's not a fast path operation, I don't mind changing things but I'd
> > like to understand why - these functions are only ever called at unmount
> > time or when something happens on the virtio bus (probe will happen on
> > probing on the pci bus and I'm not too sure on remove but probably pci
> > removal i.e. basically never?)
> > 
> > I don't see why this wouldn't work, but I won't take this without a
> > (good?) reason.
> > 
> virtio_9p_lock is responsable for protecting virtio_chan_list which has 3
> operation:
> 
> 1. Add a virtio chan to virtio_chan_list. This will happen when we insmod
> 9pnet_virtio.ko:
> p9_virtio_probe
> --list_add_tail(&chan->chan_list, &virtio_chan_list);
> 
> 2. Remove a virtio chan. This will happen when remnod 9pnet_virtio.ko:
> p9_virtio_remove
> --list_del(&chan->chan_list);
> 
> 3. Find a unused virtio chan when mount 9p:
> mount
> --p9_virtio_create
> --list_for_each_entry(chan, &virtio_chan_list, chan_list)
> 
> Multi mount process will compete for virtio_9p_lock when finding unused
> virtio chan, in which case mutex lock will cause process sleep and wake
> up. I think this a waste of CPU time. So we could use spin lock to avoid
> this.

Well, sure, that's theory; but how is that in practice?
I actually took the time to run some tests, setting up 20 virtio mount
points in qemu, and running this command with and without your patch:
# time sh -c 'for i in {1..20}; do
  sh -c "for j in {1..100}; do
mount -t 9p d$i d.$i;
umount d.$i;
  done" &
  done;
  wait'

This is quick & dirty but basically, mounts and unmounts 100 times in a
loop all 20 mount points in parallel to stress that lock.
I get these times 5 times (one run per column),
without patch:
real0m19.357s   0m19.626s   0m19.904s   0m19.926s   
0m21.321s
user0m6.795s0m6.874s0m6.807s0m6.768s0m6.892s
sys 0m29.936s   0m31.196s   0m31.702s   0m31.914s   
0m30.791s

With patch:
real0m19.439s   0m19.849s   0m19.683s   0m19.600s   
0m20.689s
user0m6.948s0m6.582s0m6.706s0m6.598s0m6.876s
sys 0m29.364s   0m30.898s   0m30.695s   0m31.311s   
0m33.391s

I honestly can't say I'm convinced with a difference either way, the
variations look more like noise than anything to me.


More to the point, while these tests ran my dmesg buffer was filled with
errors like:
FS-Cache: Duplicate cookie detected
FS-Cache: O-cookie c=00368cdb [p=548b03c2 fl=222 nc=0 na=1]
FS-Cache: O-cookie d=4cebd15f n=029a0b83
FS-Cache: O-key=[10] '34323935303838343536'
FS-Cache: N-cookie c=d4089478 [p=548b03c2 fl=2 nc=0 na=1]
FS-Cache: N-cookie d=4cebd15f n=959d4d37
FS-Cache: N-key=[10] '34323935303838343536'

or
(output mangled a bit)

==
BUG: KASAN: use-after-free in p9_client_cb+0x14d/0x160 [9pnet]
Read of size 8 at addr 88003522a088 by task systemd-udevd/492

CPU: 1 PID: 492 Comm: systemd-udevd Tainted: G   O  4.18.0-rc5+ #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 0>
Call Trace:
 
 dump_stack+0x7b/0xad
 print_address_description+0x6a/0x209
 ? p9_client_cb+0x14d/0x160 [9pnet]
 kasan_report.cold.7+0x242/0x2fe
 __asan_report_load8_noabort+0x19/0x20
 p9_client_cb+0x14d/0x160 [9pnet]
 req_done+0x22f/0x280 [9pnet_virtio]
 ? p9_mount_tag_show+0x120/0x120 [9pnet_virtio]
 vring_interrupt+0x108/0x1b0 [virtio_ring]
 ? vring_map_single.constprop.23+0x350/0x350 [virtio_ring]
 __handle_irq_event_percpu+0xec/0x460
 handle_irq_event_percpu+0x71/0x140
 ? __handle_irq_event_percpu+0x460/0x460
 ? apic_ack_irq+0xa3/0xe0
 handle_irq_event+0xb9/0x14a
 handle_edge_irq+0x1ea/0x7a0
 ? kasan_check_read+0x11/0x20
 handle_irq+0x48/0x60
 do_IRQ+0x67/0x140
 common_interrupt+0xf/0xf
 
RIP: 0010:finish_task_switch+0x10e/0x630
Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 6d 04 00 00 41 c7 45 38 00 00 00 
00 4c 89 e7 ff 14 25 28 f5 66 8e fb 66 0f >
RSP: 0018:8800633e7a60 EFLAGS: 0246 ORIG_RAX: ffd4
RAX: 0001 RBX: 880036632000 RCX: 
RDX:  RSI:  RDI: 88006caaac00
RBP: 8800633e7aa0 R08: ed000cea15cd R09: ed000cea15cc
R10: ed000cea15cc R11: 88006750ae63 R12: 88006caaac00
R13: 88006558b000 R14:  R15: 880036632000
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 __schedule+0x733/0x1c10
 ? __bpf_prog_run64+0xd0/0xd0
 ? firmware_map_remove+0x174/0x174
 schedule+0x7a/0x1a0
 schedule_hrtimeout_range_clock+0x306/0x3b0
 ? kasan_check_write+0x14/0x20
 ? hrtimer_nanosleep_restart+0x290/0x290
 ? ep_busy_loop_end+0x110/0x110
 schedule_hrtimeout_range+0x13/0x20
 ep_poll+0x7a7/0xb50
 ? __ia3

Re: [V9fs-developer] KASAN: use-after-free Read in generic_perform_write

2018-07-19 Thread Dominique Martinet
Andrew Morton wrote on Thu, Jul 19, 2018:
> On Thu, 19 Jul 2018 11:01:01 -0700 syzbot 
>  wrote:
> > Hello,
> > 
> > syzbot found the following crash on:
> > 
> > HEAD commit:1c34981993da Add linux-next specific files for 20180719
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=16e6ac4440
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=7002497517b09aec
> > dashboard link: https://syzkaller.appspot.com/bug?extid=b173e77096a8ba815511
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > 
> > Unfortunately, I don't have any reproducer for this crash yet.
> 
> Thanks.  I cc'ed v9fs-developer, optimistically.  That list manager is
> weird :(

I agree that list is weird, does anyone know the reason v9fs-developer
is not a vger.k.o list? Or a reason not to change? It's still not too
late...

> I'm suspecting v9fs.  Does that fs attempt to write to the fs from a
> kmalloced buffer?

Difficult to say without any idea of what syzkaller tried doing, but it
looks like it hook'd up a fd opened to a local ext4 file into a trans_fd
mount; so sending a packet to the "server" would trigger a local write
instead.
The reason it's freed too early probably is that the reply came from a
read before the write happened; this is going to be tricky to fix as
that write is 100% asynchronous without any feedback right now (the
design assumes that the write has to have finished by the time reply
came), but if we want to protect ourselves from rogue servers we'll have
to think about something.

I'll write it down to not forget, thanks for the cc.

-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-07-30 Thread Dominique Martinet
piaojun wrote on Tue, Jul 31, 2018:
> This is really a *big* patch, but the modification seems no harm. And I
> suggest running testcases to cover this. Please see my comments below.

I'm always running tests, but more never hurt - please help ;)

For reference I'm running a subset of cthon04[1], ltp[2] and some custom
tests like these[3][4]

[1] https://fedorapeople.org/cgit/steved/public_git/cthon04.git/
[2] https://github.com/linux-test-project/ltp
[3] https://github.com/phdeniel/sigmund/blob/master/modules/allfs.inc#L208
[4] https://github.com/phdeniel/sigmund/blob/master/modules/allfs.inc#L251

> > [...]
> > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, 
> > unsigned int max_size)
> > if (!req)
> > return NULL;
> >  
> > -   req->tc = p9_fcall_alloc(alloc_msize);
> > -   req->rc = p9_fcall_alloc(alloc_msize);
> > -   if (!req->tc || !req->rc)
> > +   if (p9_fcall_alloc(&req->tc, alloc_msize))
> > +   goto free;
> > +   if (p9_fcall_alloc(&req->rc, alloc_msize))
> > goto free;
> >  
> > -   p9pdu_reset(req->tc);
> > -   p9pdu_reset(req->rc);
> > +   p9pdu_reset(&req->tc);
> > +   p9pdu_reset(&req->rc);
> > req->status = REQ_STATUS_ALLOC;
> > init_waitqueue_head(&req->wq);
> > INIT_LIST_HEAD(&req->req_list);
> > @@ -281,7 +279,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned 
> > int max_size)
> > GFP_NOWAIT);
> > else
> > tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT);
> > -   req->tc->tag = tag;
> > +   req->tc.tag = tag;
> > spin_unlock_irq(&c->lock);
> > idr_preload_end();
> > if (tag < 0)
> > @@ -290,8 +288,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned 
> > int max_size)
> > return req;
> >  
> >  free:
> > -   kfree(req->tc);
> > -   kfree(req->rc);
> > +   kfree(req->tc.sdata);
> > +   kfree(req->rc.sdata);
> 
> I wonder if we will free a wild pointer as 'sdata' has not been initialized 
> NULL.

Good point, it's possible to jump here if the first fcall_alloc failed
since this declustered the two allocations.

Please consider this added to the previous patch (I'll send a v2 after
this has had more time for review, you can find the amended commit in my
9p-test tree meanwhile):
-8<-
diff --git a/net/9p/client.c b/net/9p/client.c
index ba99a94a12c9..fe030ef1c076 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -262,7 +262,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
return NULL;
 
if (p9_fcall_alloc(&req->tc, alloc_msize))
-   goto free;
+   goto free_req;
if (p9_fcall_alloc(&req->rc, alloc_msize))
goto free;
 
@@ -290,6 +290,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
 free:
kfree(req->tc.sdata);
kfree(req->rc.sdata);
+free_req:
kmem_cache_free(p9_req_cache, req);
return ERR_PTR(-ENOMEM);
 }
-8<-

The second goto doesn't need changing because rc.sdata will be set to
NULL if the allocation failed

-- 
Dominique


Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache

2018-07-30 Thread Dominique Martinet
piaojun wrote on Tue, Jul 31, 2018:
> Could you help paste some test result before-and-after the patch applied?

The only performance tests I did were sent to the list a couple of mails
earlier, you can find it here:
http://lkml.kernel.org/r/20180730093101.GA7894@nautica

In particular, the results for benchmark on small writes just before and
after this patch, without KASAN (these are the same numbers as in the
link, hardware/setup is described there):
 - no alloc (4.18-rc7 request cache): 65.4k req/s
 - non-power of two alloc, no patch: 61.6k req/s
 - power of two alloc, no patch: 62.2k req/s
 - non-power of two alloc, with patch: 64.7k req/s
 - power of two alloc, with patch: 65.1k req/s

I'm rather happy with the result, I didn't expect using a dedicated
cache would bring this much back but it's certainly worth it.

> > @@ -1011,6 +1034,7 @@ void p9_client_destroy(struct p9_client *clnt)
> >  
> > p9_tag_cleanup(clnt);
> >  
> > +   kmem_cache_destroy(clnt->fcall_cache);
> 
> We could set NULL for fcall_cache in case of use-after-free.
> 
> > kfree(clnt);

Hmm, I understand where this comes from, but I'm not sure I agree.
If someone tries to access the client while/after it is freed things are
going to break anyway, I'd rather let things break as obviously as
possible than try to cover it up.

-- 
Dominique


Re: [PATCH 2/2] net/9p: add a per-client fcall kmem_cache

2018-07-30 Thread Dominique Martinet
Matthew Wilcox wrote on Mon, Jul 30, 2018:
> On Mon, Jul 30, 2018 at 11:34:23AM +0200, Dominique Martinet wrote:
> > -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize)
> > +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc,
> > + int alloc_msize)
> >  {
> > -   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > +   if (c->fcall_cache && alloc_msize == c->msize)
> > +   fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > +   else
> > +   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> 
> Could you simplify this by initialising c->msize to 0 and then this
> can simply be:
> 
> > +   if (alloc_msize == c->msize)
> ...

Hmm, this is rather tricky with the current flow of things;
p9_client_version() has multiple uses for that msize field.

Basically what happens is:
 - init client struct, set clip msize to mount option/transport-specific
max
 - p9_client_version() uses current c->msize to send a suggested value
to the server
 - p9_client_rpc() uses current c->msize to allocate that first rpc,
this is pretty much hard-coded and will be quite intrusive to make an
exception for
 - p9_client_version() looks at the msize the server suggested and clips
c->msize if the reply's is smaller than c->msize


I kind of agree it'd be nice to remove that check being done all the
time for just startup, but I don't see how to do this easily with the
current code.

Making p9_client_version take an extra argument would be easy but we'd
need to actually hardcode in p9_client_rpc that "if the message type is
TVERSION then use [page size or whatever] for allocation" and that kinds
of kills the point... The alternative being having p9_client_rpc takes
the actual size as argument itself but this once again is pretty
intrusive even if it could be done mechanically...

I'll think about this some more

> > +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc)
> > +{
> > +   /* sdata can be NULL for interrupted requests in trans_rdma,
> > +* and kmem_cache_free does not do NULL-check for us
> > +*/
> > +   if (unlikely(!fc->sdata))
> > +   return;
> > +
> > +   if (c->fcall_cache && fc->capacity == c->msize)
> > +   kmem_cache_free(c->fcall_cache, fc->sdata);
> > +   else
> > +   kfree(fc->sdata);
> > +}
> 
> Is it possible for fcall_cache to be allocated before fcall_free is
> called?  I'm concerned we might do this:
> 
> allocate message A
> allocate message B
> receive response A
> allocate fcall_cache
> receive response B
> 
> and then we'd call kmem_cache_free() for something allocated by kmalloc(),
> which works with slab and slub, but doesn't work with slob (alas).

Bleh, I checked this would work for slab and didn't really check
others..

This cannot happen right now because we only return the client struct
from p9_client_create after the first message is done (and, right now,
freed) but when we start adding refcounting to requests it'd be possible
to free the very first response after fcall_cache is allocated with a
"bad" server like syzcaller does sending the version reply before the
request came in.

I can't see any work-around around this other than storing how the fcall
was allocated in the struct itself though...
I guess I might as well do that now, unless you have a better idea.


> > @@ -980,6 +1000,9 @@ struct p9_client *p9_client_create(const char 
> > *dev_name, char *options)
> > if (err)
> > goto close_trans;
> >  
> > +   clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize,
> > + 0, 0, NULL);
> > +
> 
> If we have slab merging turned off, or we have two mounts from servers
> with different msizes, we'll end up with two slabs called 9p-fcall-cache.
> I'm OK with that, but are you?

Yeah, the reason I didn't make it global like p9_req_cache is precisely
to get two separate caches if the msizes are different.

I actually considered adding msize to the string with snprintf or
something but someone looking at it through slabinfo or similar will
have the sizes anyway so I don't think this would bring anything, do you
know if/think that tools will choke on multiple caches with the same
name?


I'm not sure about slab merging being disabled though, from the little I
understand I do not see why anyone would do that except for debugging,
and I'm fine with that.
Please let me know if I'm missing something though!


Thanks for the review,
-- 
Dominique Martinet


Re: [PATCH] net/9p/trans_virtio.c: add a terminal char for mount tag

2018-08-01 Thread Dominique Martinet
piaojun wrote on Wed, Aug 01, 2018:
> chan->tag has no terminal char at last which will result in printing messy
> code when debugging code. So we should add '\0' for tag.

9p is full of non null-terminated string so I'm not sure how I feel
about it, is there anything wrong with how this is used or was this just
when you tried to printf it?

If it's just for debugging I'd suggest using the printf format "%.*s"
with "chan->tag_len, chan->tag" arguments,


That said it's not like this is costly, so I'll take it if someone else
thinks this is helpful

> 
> Signed-off-by: Jun Piao 
> ---
>  net/9p/trans_virtio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index d422bfc..49d71d6 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -585,7 +585,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
>   err = -EINVAL;
>   goto out_free_vq;
>   }
> - tag = kmalloc(tag_len, GFP_KERNEL);
> + tag = kzalloc(tag_len + 1, GFP_KERNEL);
>   if (!tag) {
>   err = -ENOMEM;
>   goto out_free_vq;
> -- 
-- 
Dominique 


Re: [PATCH] net/9p/trans_virtio.c: add a terminal char for mount tag

2018-08-01 Thread Dominique Martinet
Greg Kurz wrote on Wed, Aug 01, 2018:
> So this patch basically turns chan->tag into a nul terminated string,
> which is indeed more convenient and robust. Maybe you can update the
> rest of the code accordingly and drop chan->tag_len then ?

If we can use that to simplify some other part of the code, that's
certainly more appealing to me :)


> FWIW, 9P strings received from the client are also converted to
> nul terminated strings:

Oh, good to know; I guess that makes sense when these strings come and
go from/to other components of the kernel that likely expect that.

-- 
Dominique


Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-08-01 Thread Dominique Martinet
Greg Kurz wrote on Wed, Aug 01, 2018:
> > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, 
> > unsigned int max_size)
> > if (!req)
> > return NULL;
> >  
> > -   req->tc = p9_fcall_alloc(alloc_msize);
> > -   req->rc = p9_fcall_alloc(alloc_msize);
> > -   if (!req->tc || !req->rc)
> > +   if (p9_fcall_alloc(&req->tc, alloc_msize))
> > +   goto free;
> > +   if (p9_fcall_alloc(&req->rc, alloc_msize))
> > goto free;
> 
> Hmm... if the first allocation fails, we will kfree() req->rc.sdata.
> 
> Are we sure we won't have a stale pointer or uninitialized data in
> there ?

Yeah, Jun pointed that out and I have a v2 that only frees as needed
with an extra goto (I sent an incremental diff in my reply to his
comment here[1])

[1] https://lkml.kernel.org/r/20180731011256.GA30388@nautica

> And even if we don't with the current code base, this is fragile and
> could be easily broken.
> 
> I think you should drop this hunk and rather rename p9_fcall_alloc() to
> p9_fcall_alloc_sdata() instead, since this is what the function is
> actually doing with this patch applied.

Hmm. I agree the naming isn't accurate, but even if we rename it we'll
need to pass a pointer to fcall as argument as it inits its capacity.
p9_fcall_init(fc, msize) might be simpler?

(I'm not sure I follow what you mean by 'drop this hunk', to be honest,
did you want a single function call to init both maybe?)

-- 
Dominique


Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache

2018-08-01 Thread Dominique Martinet
Greg Kurz wrote on Wed, Aug 01, 2018:
> > diff --git a/net/9p/client.c b/net/9p/client.c
> > index ba99a94a12c9..215e3b1ed7b4 100644
> > --- a/net/9p/client.c
> > +++ b/net/9p/client.c
> > @@ -231,15 +231,34 @@ static int parse_opts(char *opts, struct p9_client 
> > *clnt)
> > return ret;
> >  }
> >  
> > -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize)
> > +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc,
> > + int alloc_msize)
> >  {
> > -   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > +   if (c->fcall_cache && alloc_msize == c->msize)
> 
> This is a presumably hot path for any request but the initial TVERSION,
> you probably want likely() here...

c->fcall_cache is indeed very likely, but alloc_size == c->msize not so
much, as zc requests will be quite common for virtio and will be 4k in
size.
Although with that cache I'm starting to wonder if we should always use
it... Speed-wise if there is no memory pressure the cache is likely
going to be faster.
If there is pressure and the items are reclaimed though that'll bring
some heavier slow-down as it'll need to find bigger memory regions.

I'm not sure which path we should favor, tbh. I'll keep these separate
for now.


For the first part of the check, Matthew suggested trying to trick msize
into a different value to fail this check for the initial TVERSION call,
but even after thinking it a bit I don't really see how to do that
cleanly. I can at least make -that- likely()...

> 
> > +   fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > +   else
> > +   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > if (!fc->sdata)
> > return -ENOMEM;
> > fc->capacity = alloc_msize;
> > return 0;
> >  }
> >  
> > +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc)
> > +{
> > +   /* sdata can be NULL for interrupted requests in trans_rdma,
> > +* and kmem_cache_free does not do NULL-check for us
> > +*/
> > +   if (unlikely(!fc->sdata))
> > +   return;
> > +
> > +   if (c->fcall_cache && fc->capacity == c->msize)
> 
> ... and here as well.

For this one I'll unfortunately need to store in the fc how it has been
allocated as slob doesn't allow to kmem_cache_free() a buffer allocated
by kmalloc and in prevision of refs being refcounted in a hostile world
the initial TVERSION req could be freed after fcall_cache is created :/

That's a bit of a burden but at least will reduce the checks to one here

> > +   kmem_cache_free(c->fcall_cache, fc->sdata);
> > +   else
> > +   kfree(fc->sdata);
> > +}
> > +EXPORT_SYMBOL(p9_fcall_free);
> > +
> >  static struct kmem_cache *p9_req_cache;
> >  
> >  /**


Anyway I've had as many comments as I could hope for, thanks everyone
for the quick review.
I'll send a v2 of both patches tomorrow

-- 
Dominique


Re: [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag

2018-08-01 Thread Dominique Martinet
piaojun wrote on Thu, Aug 02, 2018:
> chan->tag is Non-null terminated which will result in printing messy code
> when debugging code. So we should add '\0' for tag to make the code more
> convenient and robust. In addition, I drop char->tag_len to simplify the
> code.

Some new lines in commit message would be appreciated.

That aside, I have a couple of nitpicks, but it looks good to me - thanks

> 
> Signed-off-by: Jun Piao 
> ---
>  net/9p/trans_virtio.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index d422bfc..0fe9c37 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -89,10 +89,8 @@ struct virtio_chan {
>   unsigned long p9_max_pages;
>   /* Scatterlist: can be too big for stack. */
>   struct scatterlist sg[VIRTQUEUE_NUM];
> -
> - int tag_len;
>   /*
> -  * tag name to identify a mount Non-null terminated
> +  * tag name to identify a mount null terminated
>*/
>   char *tag;
> 
> @@ -529,10 +527,9 @@ static ssize_t p9_mount_tag_show(struct device *dev,
>   vdev = dev_to_virtio(dev);
>   chan = vdev->priv;
> 
> - memcpy(buf, chan->tag, chan->tag_len);
> - buf[chan->tag_len] = 0;
> + memcpy(buf, chan->tag, strlen(chan->tag) + 1);
> 
> - return chan->tag_len + 1;
> + return strlen(chan->tag) + 1;

Use a local variable for strlen(chan->tag)?

>  }
> 
>  static DEVICE_ATTR(mount_tag, 0444, p9_mount_tag_show, NULL);
> @@ -585,7 +582,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
>   err = -EINVAL;
>   goto out_free_vq;
>   }
> - tag = kmalloc(tag_len, GFP_KERNEL);
> + tag = kzalloc(tag_len + 1, GFP_KERNEL);
>   if (!tag) {
>   err = -ENOMEM;
>   goto out_free_vq;
> @@ -594,7 +591,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
>   virtio_cread_bytes(vdev, offsetof(struct virtio_9p_config, tag),
>  tag, tag_len);
>   chan->tag = tag;
> - chan->tag_len = tag_len;
>   err = sysfs_create_file(&(vdev->dev.kobj), &dev_attr_mount_tag.attr);
>   if (err) {
>   goto out_free_tag;
> @@ -654,8 +650,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> 
>   mutex_lock(&virtio_9p_lock);
>   list_for_each_entry(chan, &virtio_chan_list, chan_list) {
> - if (!strncmp(devname, chan->tag, chan->tag_len) &&
> - strlen(devname) == chan->tag_len) {
> + if (!strncmp(devname, chan->tag, strlen(chan->tag) + 1)) {

strncmp(x, y, strlen(y)+1) is precisely what strcmp does so let's use
the simpler version

-- 
Dominique


[PATCH v2 2/2] net/9p: add a per-client fcall kmem_cache

2018-08-01 Thread Dominique Martinet
From: Dominique Martinet 

Having a specific cache for the fcall allocations helps speed up
allocations a bit, especially in case of non-"round" msizes.

The caches will automatically be merged if there are multiple caches
of items with the same size so we do not need to try to share a cache
between different clients of the same size.

Since the msize is negotiated with the server, only allocate the cache
after that negotiation has happened - previous allocations or
allocations of different sizes (e.g. zero-copy fcall) are made with
kmalloc directly.

Signed-off-by: Dominique Martinet 
Cc: Matthew Wilcox 
Cc: Greg Kurz 
Cc: Jun Piao 
---
v2: 
 - Add a pointer to the cache in p9_fcall to make sure a buffer
allocated with kmalloc gets freed with kfree and vice-versa
This could have been smaller with a bool but this spares having
to look at the client so looked a bit cleaner, I'm expecting this
patch will need a v3 one way or another so I went for the bolder
approach - please say if you think a smaller item is better ; I *think*
nothing relies on this being ordered the same way as the data on the
wire (struct isn't packed anyway) so we can move id after tag and add
another u8 to not have any overhead

 - added likely() to cache existence check in allocation, but nothing
for msize check or free because of zc request being of different size

 include/net/9p/9p.h |  1 +
 include/net/9p/client.h |  2 ++
 net/9p/client.c | 34 --
 net/9p/trans_rdma.c |  2 +-
 4 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index e23896116d9a..f1d2ed3cee61 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -558,6 +558,7 @@ struct p9_fcall {
size_t offset;
size_t capacity;
 
+   struct kmem_cache *cache;
u8 *sdata;
 };
 
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 4b4ac1362ad5..735f3979d559 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -123,6 +123,7 @@ struct p9_client {
struct p9_trans_module *trans_mod;
enum p9_trans_status status;
void *trans;
+   struct kmem_cache *fcall_cache;
 
union {
struct {
@@ -230,6 +231,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char 
*name, int mode,
kgid_t gid, struct p9_qid *);
 int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 
*status);
 int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl);
+void p9_fcall_fini(struct p9_fcall *fc);
 struct p9_req_t *p9_tag_lookup(struct p9_client *, u16);
 void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status);
 
diff --git a/net/9p/client.c b/net/9p/client.c
index bc40bb11b832..0e0f8bb3fd3c 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -231,19 +231,36 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize)
+static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
+int alloc_msize)
 {
-   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+   fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
+   fc->cache = c->fcall_cache;
+   } else {
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   fc->cache = NULL;
+   }
if (!fc->sdata)
return -ENOMEM;
fc->capacity = alloc_msize;
return 0;
 }
 
-static void p9_fcall_fini(struct p9_fcall *fc)
+void p9_fcall_fini(struct p9_fcall *fc)
 {
-   kfree(fc->sdata);
+   /* sdata can be NULL for interrupted requests in trans_rdma,
+* and kmem_cache_free does not do NULL-check for us
+*/
+   if (unlikely(!fc->sdata))
+   return;
+
+   if (fc->cache)
+   kmem_cache_free(fc->cache, fc->sdata);
+   else
+   kfree(fc->sdata);
 }
+EXPORT_SYMBOL(p9_fcall_fini);
 
 static struct kmem_cache *p9_req_cache;
 
@@ -266,9 +283,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
if (!req)
return NULL;
 
-   if (p9_fcall_init(&req->tc, alloc_msize))
+   if (p9_fcall_init(c, &req->tc, alloc_msize))
goto free_req;
-   if (p9_fcall_init(&req->rc, alloc_msize))
+   if (p9_fcall_init(c, &req->rc, alloc_msize))
goto free;
 
p9pdu_reset(&req->tc);
@@ -950,6 +967,7 @@ struct p9_client *p9_client_create(const char *dev_name, 
char *options)
 
clnt->trans_mod = NULL;
clnt->trans = NULL;
+   clnt->fcall_cache = NULL;
 
client_id = utsname()->nodename;
memcpy(clnt->name, client_id, strl

[PATCH v2 1/2] net/9p: embed fcall in req to round down buffer allocs

2018-08-01 Thread Dominique Martinet
From: Dominique Martinet 

'msize' is often a power of two, or at least page-aligned, so avoiding
an overhead of two dozen bytes for each allocation will help the
allocator do its work and reduce memory fragmentation.

Suggested-by: Matthew Wilcox 
Signed-off-by: Dominique Martinet 
Cc: Matthew Wilcox 
Cc: Greg Kurz 
Cc: Jun Piao 
---

v2:
 - Add extra label to not free uninitialized memory on alloc failure
 - Rename p9_fcall_alloc to 9p_fcall_init
 - Add a p9_fcall_fini function to echo to init

 include/net/9p/client.h |   4 +-
 net/9p/client.c | 166 
 net/9p/trans_fd.c   |  12 +--
 net/9p/trans_rdma.c |  29 +++
 net/9p/trans_virtio.c   |  18 ++---
 net/9p/trans_xen.c  |  12 +--
 6 files changed, 123 insertions(+), 118 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index a4dc42c53d18..4b4ac1362ad5 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -95,8 +95,8 @@ struct p9_req_t {
int status;
int t_err;
wait_queue_head_t wq;
-   struct p9_fcall *tc;
-   struct p9_fcall *rc;
+   struct p9_fcall tc;
+   struct p9_fcall rc;
void *aux;
struct list_head req_list;
 };
diff --git a/net/9p/client.c b/net/9p/client.c
index 88db45966740..bc40bb11b832 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -231,15 +231,18 @@ static int parse_opts(char *opts, struct p9_client *clnt)
return ret;
 }
 
-static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
+static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize)
 {
-   struct p9_fcall *fc;
-   fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS);
-   if (!fc)
-   return NULL;
+   fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+   if (!fc->sdata)
+   return -ENOMEM;
fc->capacity = alloc_msize;
-   fc->sdata = (char *) fc + sizeof(struct p9_fcall);
-   return fc;
+   return 0;
+}
+
+static void p9_fcall_fini(struct p9_fcall *fc)
+{
+   kfree(fc->sdata);
 }
 
 static struct kmem_cache *p9_req_cache;
@@ -263,13 +266,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned 
int max_size)
if (!req)
return NULL;
 
-   req->tc = p9_fcall_alloc(alloc_msize);
-   req->rc = p9_fcall_alloc(alloc_msize);
-   if (!req->tc || !req->rc)
+   if (p9_fcall_init(&req->tc, alloc_msize))
+   goto free_req;
+   if (p9_fcall_init(&req->rc, alloc_msize))
goto free;
 
-   p9pdu_reset(req->tc);
-   p9pdu_reset(req->rc);
+   p9pdu_reset(&req->tc);
+   p9pdu_reset(&req->rc);
req->status = REQ_STATUS_ALLOC;
init_waitqueue_head(&req->wq);
INIT_LIST_HEAD(&req->req_list);
@@ -281,7 +284,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
GFP_NOWAIT);
else
tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT);
-   req->tc->tag = tag;
+   req->tc.tag = tag;
spin_unlock_irq(&c->lock);
idr_preload_end();
if (tag < 0)
@@ -290,8 +293,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int 
max_size)
return req;
 
 free:
-   kfree(req->tc);
-   kfree(req->rc);
+   p9_fcall_fini(&req->tc);
+   p9_fcall_fini(&req->rc);
+free_req:
kmem_cache_free(p9_req_cache, req);
return ERR_PTR(-ENOMEM);
 }
@@ -329,14 +333,14 @@ EXPORT_SYMBOL(p9_tag_lookup);
 static void p9_free_req(struct p9_client *c, struct p9_req_t *r)
 {
unsigned long flags;
-   u16 tag = r->tc->tag;
+   u16 tag = r->tc.tag;
 
p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag);
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
-   kfree(r->tc);
-   kfree(r->rc);
+   p9_fcall_fini(&r->tc);
+   p9_fcall_fini(&r->rc);
kmem_cache_free(p9_req_cache, r);
 }
 
@@ -368,7 +372,7 @@ static void p9_tag_cleanup(struct p9_client *c)
  */
 void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
 {
-   p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag);
+   p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc.tag);
 
/*
 * This barrier is needed to make sure any change made to req before
@@ -378,7 +382,7 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t 
*req, int status)
req->status = status;
 
wake_up(&req->wq);
-   p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag);
+   p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc.tag);
 }
 EXPORT_SYMBOL(p9_client_cb);
 
@@ -449,18 +453,18 @@ s

Re: [V9fs-developer] [PATCH v2 2/2] net/9p: add a per-client fcall kmem_cache

2018-08-01 Thread Dominique Martinet
Dominique Martinet wrote on Thu, Aug 02, 2018:
> [...]
> + clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize,
> +   0, 0, NULL);


Well, my gut feeling that I'd need a v3 was right, after a bit more time
testing (slightly different setup), I got this warning:
Bad or missing usercopy whitelist? Kernel memory overwrite attempt
detected to SLUB object '9p-fcall-cache' (offset 23, size 55078)!

So this kmem_cache_create needs to change to kmem_cache_create_usercopy,
but I'm not sure how to best specify the range -- this warning was about
writing data to the buffer so a TWRITE:
size[4] Twrite tag[2] fid[4] offset[8] count[4] data[count]
(the data[count] part comes from user - this matches 4+1+2+4+8+4 = 23)

Similarily RREAD has data that can be copied to userspace, which has a
similar check:
size[4] Rread tag[2] count[4] data[count]

So I could hardcode offset = 4+1+2+4=11, usercopy size = msize - 11...


We have some P9_*HDR*SZ macros but none are really appropriate:
---
/* ample room for Twrite/Rread header */
#define P9_IOHDRSZ  24

/* Room for readdir header */
#define P9_READDIRHDRSZ 24

/* size of header for zero copy read/write */
#define P9_ZC_HDR_SZ 4096
---

It's actually been bugging me for a while that I see hardcoded '7's for
9p main header size (len + msg type + tag) everywhere, I could add a
first P9_HDRSZ of 7 and use that + 4?

-- 
Dominique


Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests

2018-07-23 Thread Dominique Martinet
Greg Kurz wrote on Mon, Jul 23, 2018:
> The patch is quite big and I'm not sure I can find time to review it
> carefully, but I'll try to help anyway.

No worry, thanks for this already.

> > Sorry for coming back to this patch now, I just noticed something that's
> > actually probably a fairly big hit on performance...
> > 
> > While the slab is just as good as the array for the request itself, this
> > makes every single request allocate "fcalls" everytime instead of
> > reusing a cached allocation.
> > The default msize is 8k and these allocs probably are fairly efficient,
> > but some transports like RDMA allow to increase this to up to 1MB... And
> 
> It can be even bigger with virtio:
> 
> #define VIRTQUEUE_NUM 128
> 
>   .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
> 
> On a typical ppc64 server class setup with 64KB pages, this is nearly 8MB.

I don't think I'll be able to test 64KB pages, and it's "just" 500k with
4K pages so I'll go with IB.
I just finished reinstalling my IB-enabled VMs, now to get some iops
test running (dbench maybe) and I'll get some figures to be able to play
with different models and evaluate the impact of these.

> > One thing is that the buffers are all going to be the same size for a
> > given client ( except virtio zc buffers, I wonder what I'm missing
> > or why that didn't blow up before?)
> 
> ZC allocates a 4KB buffer, which is more than enough to hold the 7-byte 9P
> header and the "dqd" part of all messages that may use ZC, ie, 16 bytes.
> So I'm not sure to catch what could blow up.

ZC requests won't blow up, but from what I can see with the current
(old) request cache array, if a ZC request has a not-yet used tag it'll
allocate a new 4k buffer, then if a normal request uses that tag it'll
get the 4k buffer instead of an msize sized one.

On the client size the request would be posted with req->rc->capacity
which would correctly be 4k, but I'm not sure what would happen if qemu
tries to write more than the given size to that request?

> > It's a shame because I really like that patch, I'll try to find time to
> > run some light benchmark with varying msizes eventually but I'm not sure
> > when I'll find time for that... Hopefully before the 4.19 merge window!
> > 
> 
> Yeah, the open-coded cache we have now really obfuscates things.
> 
> Maybe have a per-client kmem_cache object for non-ZC requests with
> size msize [*], and a global kmem_cache object for ZC requests with
> fixed size P9_ZC_HDR_SZ.
> 
> [*] the server can require a smaller msize during version negotiation,
> so maybe we should change the kmem_cache object in this case.

Yeah, if we're going to want to accomodate non-power of two buffers, I
think we'll need a separate kmem_cache for them.
The ZC requests could be made into exactly 4k and these could come with
regular kmalloc just fine, it looks like trying to create a cache of
that size would just return the same cache used by kmalloc anyway so
it's probably easier to fall back to kmalloc if requested alloc size
doesn't match what we were hoping for.

I'll try to get figures for various approaches before the merge window
for 4.19 starts, it's getting closer though...

-- 
Dominique


Re: [PATCH] fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed

2018-07-24 Thread Dominique Martinet
piaojun wrote on Wed, Jul 25, 2018:
> In my testing, v9fs_fid_xattr_set will return successfully even if the
> backend ext4 filesystem has no space to store xattr key-value. That will
> cause inconsistent behavior between front end and back end. The reason is
> that lsetxattr will be triggered by p9_client_clunk, and unfortunately we
> did not catch the error. This patch will catch the error to notify upper
> caller.
> 
> p9_client_clunk (in 9p)
>   p9_client_rpc(clnt, P9_TCLUNK, "d", fid->fid);
> v9fs_clunk (in qemu)
>   put_fid
> free_fid
>   v9fs_xattr_fid_clunk
> v9fs_co_lsetxattr
>   s->ops->lsetxattr
> ext4_xattr_user_set (in host ext4 filesystem)
> 
> Signed-off-by: Jun Piao 

Good catch!

LGTM on its own but see comment below for further error checking.
Please let me know if you want to do this in a v2 or submit a separate
patch altogether, I can pick this one up as it is.

> ---
>  fs/9p/xattr.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/9p/xattr.c b/fs/9p/xattr.c
> index f329eee..352abc3 100644
> --- a/fs/9p/xattr.c
> +++ b/fs/9p/xattr.c
> @@ -105,7 +105,7 @@ int v9fs_fid_xattr_set(struct p9_fid *fid, const char 
> *name,
>  {
>   struct kvec kvec = {.iov_base = (void *)value, .iov_len = value_len};
>   struct iov_iter from;
> - int retval;
> + int retval, err;
> 
>   iov_iter_kvec(&from, WRITE | ITER_KVEC, &kvec, 1, value_len);
> 
> @@ -126,7 +126,9 @@ int v9fs_fid_xattr_set(struct p9_fid *fid, const char 
> *name,
>retval);
>   else
>   p9_client_write(fid, 0, &from, &retval);

I'm not sure what to do about this but it's also possible for
p9_client_write to not write the full length without setting and error.

We should probably compare the return value of p9_client_write and
value_len to detect short writes and return an error in this case.

> - p9_client_clunk(fid);
> + err = p9_client_clunk(fid);
> + if (!retval && err)
> + retval = err;
>   return retval;
>  }
> 
> -- 


Re: [PATCH] fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed

2018-07-24 Thread Dominique Martinet
piaojun wrote on Wed, Jul 25, 2018:
> On 2018/7/25 11:32, Dominique Martinet wrote:
> >>p9_client_write(fid, 0, &from, &retval);
> > 
> > I'm not sure what to do about this but it's also possible for
> > p9_client_write to not write the full length without setting and error.
> > 
> > We should probably compare the return value of p9_client_write and
> > value_len to detect short writes and return an error in this case.
> 
> It looks like we could identify short writes error from the *err. If no
> error case breaks the while loop, the total equal to the whole data len.

Ah, I looked too fast and missed the while loop in p9_client_write...
It's rather unusual for such a low level call to handle retries, but I
guess that's 9p.

Right, will take that patch as is then.

Thanks,
-- 
Dominique


Re: [PATCH] 9p: validate PDU length

2018-07-24 Thread Dominique Martinet
Tomas Bortoli wrote on Mon, Jul 23, 2018:
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 18c5271910dc..92240ccf476b 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -524,6 +525,12 @@ static int p9_check_errors(struct p9_client *c, struct 
> p9_req_t *req)
>   int ecode;
>  
>   err = p9_parse_header(req->rc, NULL, &type, NULL, 0);
> + if (req->rc->size >= c->msize) {

I was looking at this again, I think it's more appropriate to use
req->rc->capacity at this point like you did in the first version of the
patch.

I had suggested msize in the common p9_parse_header function because
that'd let us accept zc requests where the size in the pdu could be
bigger than capacity, but this isn't the case in p9_check_errors.

If you're ok with this I'll edit your commit directly, this is less work
for me than having to check a new patch.

Thanks,
-- 
Dominique


Re: [V9fs-developer] [PATCH 5/6] 9p: Use a slab for allocating requests

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Wed, Jul 11, 2018:
> On Wed, Jul 11, 2018 at 03:33:13PM +0200, Dominique Martinet wrote:
> > Well this appears to work but P9_NOTAG being '(u16)(~0)' I'm not too
> > confident with P9_NOTAG + 1. . . it doesn't look like it's overflowing
> > before the cast on my laptop but is that guaranteed?
> 
> By my understanding of n1256.pdf ... this falls under 6.3.1.8 ("Usual
> arithmetic conversions").  We have a u16 and an int.  Therefore this
> rule applies:
> 
> Otherwise, if the type of the operand with signed integer type can
> represent all of the values of the type of the operand with unsigned
> integer type, then the operand with unsigned integer type is converted
> to the type of the operand with signed integer type.

Thanks for checking, that'll work then.

> > I do not see any call to idr_destroy, is that OK?
> 
> Yes, that's fine.  It used to be (back in 2013) that one had to call
> idr_destroy() in order to free the preallocated idr data structures.
> Now it's a no-op if called on an empty IDR, and I would expect that both
> IDRs are empty at the time that it comes to unloading the module (and if
> they aren't, we probably have bigger problems than a small memory leak).
> Some users like to assert that the IDR is empty; most do not go to that
> extent of defensive programming.

Ok, I agree we're not there yet.

Just comments nitpicks, then :)

-- 
Dominique Martinet


Re: [PATCH v2 0/6] 9p: Use IDRs more effectively

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Wed, Jul 11, 2018:
> The 9p code doesn't take advantage of the IDR's ability to store
> a pointer.  We can actually get rid of the p9_idpool abstraction
> and the multi-dimensional array of requests.
> 
> v2: Address feedback from Dominique.

Thanks, I've picked them up for 4.19
 ( git://github.com/martinetd/linux 9p-next as per mail on
 v9fs-developer list, "Current 9P patches - test branch")

This shouldn't stop anyone else from doing more reviews/test, I'm doing
this 9p-patch-gathering on no autority and I've only checked very basic
things for now.

-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH] net/9p/client.c: fix misuse of spin_lock_irqsave for p9_client lock

2018-07-12 Thread Dominique Martinet
piaojun wrote on Thu, Jul 12, 2018:
> In p9_read_work(), we use spin_lock for client->lock, but misuse
> spin_lock_irqsave for it in p9_fid_create(). As p9_client lock won't be
> locked in irq context, so spin_lock is enough. And that will improve the
> performance.

Agreed on principle, see remark below

> Signed-off-by: Jun Piao 
> ---
>  net/9p/client.c   | 17 +++--
>  net/9p/trans_fd.c |  7 +++
>  2 files changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 8bc8b3e..b05cbfc 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -260,7 +260,6 @@ static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
>  static struct p9_req_t *
>  p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
>  {
> - unsigned long flags;
>   int row, col;
>   struct p9_req_t *req;
>   int alloc_msize = min(c->msize, max_size);
> @@ -270,7 +269,7 @@ static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
>   tag++;
> 
>   if (tag >= c->max_tag) {
> - spin_lock_irqsave(&c->lock, flags);
> + spin_lock(&c->lock);

This code doesn't exist anymore with Matthew's idr rework, could you
submit that patch based on top of my 9p-next branch?
(unless you really want Andrew to take this for the next 4.18-rc, but
I'm not convinced this qualifies)

Please see my "Current 9P patches - test branch" for details:
https://sourceforge.net/p/v9fs/mailman/message/36365359/

-- 
Dominique Martinet


Re: [PATCH v2 3/6] 9p: Replace the fidlist with an IDR

2018-07-12 Thread Dominique Martinet
Matthew Wilcox wrote on Wed, Jul 11, 2018:
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 389a2904b7b3..b89c7298267c 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -908,30 +908,29 @@ static struct p9_fid *p9_fid_create(struct p9_client 
> *clnt)
>  {
>   int ret;
>   struct p9_fid *fid;
> - unsigned long flags;
>  
>   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
>   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
>   if (!fid)
>   return NULL;
>  
> - ret = p9_idpool_get(clnt->fidpool);
> - if (ret < 0)
> - goto error;
> - fid->fid = ret;
> -
>   memset(&fid->qid, 0, sizeof(struct p9_qid));

Ah, I had missed that you didn't update this memset as you said in reply
to comment on v1.

Could you resend just this patch and either initialize fid->fid or use
kzalloc for the fid allocation?

>   fid->mode = -1;
>   fid->uid = current_fsuid();
>   fid->clnt = clnt;
>   fid->rdir = NULL;
> - spin_lock_irqsave(&clnt->lock, flags);
> - list_add(&fid->flist, &clnt->fidlist);
> - spin_unlock_irqrestore(&clnt->lock, flags);
> + fid->fid = 0;
>  
> - return fid;
> + idr_preload(GFP_KERNEL);
> + spin_lock_irq(&clnt->lock);
> + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1,
> + GFP_NOWAIT);

If you do resend, alignment here was wrong.


Thanks,
-- 
Dominique Martinet


Re: [PATCH v2 3/6] 9p: Replace the fidlist with an IDR

2018-07-12 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jul 12, 2018:
> > Ah, I had missed that you didn't update this memset as you said in reply
> > to comment on v1.
> 
> Rather than update the memset, I ... 
> 
> > Could you resend just this patch and either initialize fid->fid or use
> > kzalloc for the fid allocation?
> > 
> > > + fid->fid = 0;
> 
> Did that instead ;-)
> 
> If I were going to initialise the entire structure to 0, I'd replace
> the kmalloc with kzalloc and drop the memset altogether.

Oh, I'm blind, sorry! :)

> > If you do resend, alignment here was wrong.
> 
> I think this warning from checkpatch is complete bullshit.  It's
> really none of checkpatch's business how I choose to align function
> arguments.
> 
> That said, if you want it to be aligned some particular way, feel free
> to adjust the whitespace.  I don't care.

I would tend to agree that sometimes checkpatch is overzealous, but
having worked on projects where code style is all over the place it
really feels much more comfortable to have a consistent style
everywhere.

Thanks for the permission, I'll adjust it (assuming this does end up
getting pulled from my branch, but nobody yelled so far)

-- 
Dominique Martinet


Re: [Intel-gfx] [PATCH] i915/intel_tv_get_modes: fix strncpy truncation warning

2018-07-12 Thread Dominique Martinet
Ville Syrjälä wrote on Thu, Jul 12, 2018:
> On Wed, Jul 11, 2018 at 09:46:15AM +0200, Dominique Martinet wrote:
> > This is effectively no-op as the next line writes a nul at the final
> 
> What is "This". Please write self contained commit messages.

This could either be 'this commit' as a whole or if you look only at the
commit message 'this strncpy fix' from the title (which is arguably the
same), and both interpretations sound fairly understandable in the
context of the title line without seeing the patch to me... Although
I'll admit this is difficult to judge of that as the author.

Thanksfully, the v2 of the patch didn't use this wording but while I
agree the message could be better I do not think it was horrible.


> > drivers/gpu/drm/i915/intel_tv.c: In function ‘intel_tv_get_modes’:
> > drivers/gpu/drm/i915/intel_tv.c:1358:3: error: ‘strncpy’ specified bound 32 
> > equals destination size [-Werror=stringop-truncation]
> >strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN);
> >^~
> > cc1: all warnings being treated as errors
> 
> That warning should be in the actual commit message.

Yes and no, I gave it for referrence but when you update to gcc 8 you
will literally see it all over the place.
The words "strncpy truncation warning" is really precise once you've
seen them a few times and there are litteraly hundred of these warnings
in the kernel, some have already been fixed taking a glance at the git
log, some with and without the warning message.
I don't think it's worth polluting the git log with this many
warnings... Which leads to...

> This same pattern is used all over drm. Can you go and fix them all up?
> One might even consider writing a cocci patch for it ;)

Now this is something I can agree with.
This patch really was just a stop-gap measure because I could not build
the kernel at all without it, but yes I did consider having a look at
others.

Unfortunately coccinelle does not run on fedora 28 (and doesn't look
like it will fix itself any time soon, there is a bug report[1] open
since February that didn't get much love lately - I was just looking at
it a few days ago)

I think in this case it might actually be faster to look at gcc warnings
and s/strncpy/strlcpy/, but I am curious about Coccinelle so this is a
good excuse to look at it, I'll report back in a bit after poking at
that bug report and figuring out how coccinelle works.

I do not guarantee speed however, if anyone sees this and feels put off
from donig it themselves, please go ahead and just drop me a word.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1544204

Thanks, and sorry for the mail longer than I originally intended,
-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH v2 4/6] 9p: Embed wait_queue_head into p9_req_t

2018-07-12 Thread Dominique Martinet
Greg Kurz wrote on Thu, Jul 12, 2018:
> This is true when all tags have been used at least once. But the current code
> lazily allocates the wait_queue_head_t, ie, only when a tag is used for the
> first time. Your patch causes a full row of wait_quest_head_t to be 
> pre-allocated.
> 
> ie, P9_ROW_MAXTAG * 24 = 255 * 24 = 6120
> 
> instead of (P9_ROW_MAXTAG * 8) + 24 = 255 * 8 + 24 = 2064
> 
> This is nearly a page of allocated memory that might be never used.
> 
> Not sure if this is a problem though...

I thought the exact same, but the next patch in the serie removes that
array and allocates even more lazily with a slab, so this was a
no-brainer :)

-- 
Dominique Martinet


[PATCH 12/18] test_power: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 drivers/power/supply/test_power.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/power/supply/test_power.c 
b/drivers/power/supply/test_power.c
index 57246cdbd042..64adf630f64f 100644
--- a/drivers/power/supply/test_power.c
+++ b/drivers/power/supply/test_power.c
@@ -297,8 +297,7 @@ static int map_get_value(struct battery_property_map *map, 
const char *key,
char buf[MAX_KEYLENGTH];
int cr;
 
-   strncpy(buf, key, MAX_KEYLENGTH);
-   buf[MAX_KEYLENGTH-1] = '\0';
+   strlcpy(buf, key, MAX_KEYLENGTH);
 
cr = strnlen(buf, MAX_KEYLENGTH) - 1;
if (cr < 0)
-- 
2.17.1



[PATCH 14/18] kdb_support: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 kernel/debug/kdb/kdb_support.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/debug/kdb/kdb_support.c b/kernel/debug/kdb/kdb_support.c
index 990b3cc526c8..1f6a4b6bde0b 100644
--- a/kernel/debug/kdb/kdb_support.c
+++ b/kernel/debug/kdb/kdb_support.c
@@ -119,8 +119,7 @@ int kdbnearsym(unsigned long addr, kdb_symtab_t *symtab)
 * What was Rusty smoking when he wrote that code?
 */
if (symtab->sym_name != knt1) {
-   strncpy(knt1, symtab->sym_name, knt1_size);
-   knt1[knt1_size-1] = '\0';
+   strlcpy(knt1, symtab->sym_name, knt1_size);
}
for (i = 0; i < ARRAY_SIZE(kdb_name_table); ++i) {
if (kdb_name_table[i] &&
-- 
2.17.1



[PATCH 05/18] iio: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 drivers/iio/common/st_sensors/st_sensors_core.c | 3 +--
 drivers/iio/pressure/st_pressure_i2c.c  | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/iio/common/st_sensors/st_sensors_core.c 
b/drivers/iio/common/st_sensors/st_sensors_core.c
index 57db19182e95..26fbd1bd9413 100644
--- a/drivers/iio/common/st_sensors/st_sensors_core.c
+++ b/drivers/iio/common/st_sensors/st_sensors_core.c
@@ -380,8 +380,7 @@ void st_sensors_of_name_probe(struct device *dev,
return;
 
/* The name from the OF match takes precedence if present */
-   strncpy(name, of_id->data, len);
-   name[len - 1] = '\0';
+   strlcpy(name, of_id->data, len);
 }
 EXPORT_SYMBOL(st_sensors_of_name_probe);
 #else
diff --git a/drivers/iio/pressure/st_pressure_i2c.c 
b/drivers/iio/pressure/st_pressure_i2c.c
index fbb59059e942..2026a1012012 100644
--- a/drivers/iio/pressure/st_pressure_i2c.c
+++ b/drivers/iio/pressure/st_pressure_i2c.c
@@ -94,9 +94,8 @@ static int st_press_i2c_probe(struct i2c_client *client,
if ((ret < 0) || (ret >= ST_PRESS_MAX))
return -ENODEV;
 
-   strncpy(client->name, st_press_id_table[ret].name,
+   strlcpy(client->name, st_press_id_table[ret].name,
sizeof(client->name));
-   client->name[sizeof(client->name) - 1] = '\0';
} else if (!id)
return -ENODEV;
 
-- 
2.17.1



[PATCH 08/18] myricom: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c 
b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index b2d2ec8c11e2..f7178cdb6bd8 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -553,8 +553,7 @@ myri10ge_validate_firmware(struct myri10ge_priv *mgp,
}
 
/* save firmware version for ethtool */
-   strncpy(mgp->fw_version, hdr->version, sizeof(mgp->fw_version));
-   mgp->fw_version[sizeof(mgp->fw_version) - 1] = '\0';
+   strlcpy(mgp->fw_version, hdr->version, sizeof(mgp->fw_version));
 
sscanf(mgp->fw_version, "%d.%d.%d", &mgp->fw_ver_major,
   &mgp->fw_ver_minor, &mgp->fw_ver_tiny);
-- 
2.17.1



[PATCH 17/18] perf: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 tools/perf/util/bpf-loader.h | 3 +--
 tools/perf/util/util.c   | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h
index 5d3aefd6fae7..8d08a1fc97a0 100644
--- a/tools/perf/util/bpf-loader.h
+++ b/tools/perf/util/bpf-loader.h
@@ -143,10 +143,9 @@ __bpf_strerror(char *buf, size_t size)
 {
if (!size)
return 0;
-   strncpy(buf,
+   strlcpy(buf,
"ERROR: eBPF object loading is disabled during compiling.\n",
size);
-   buf[size - 1] = '\0';
return 0;
 }
 
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index eac5b858a371..8b9e3aa7aad3 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -459,8 +459,7 @@ fetch_kernel_version(unsigned int *puint, char *str,
return -1;
 
if (str && str_size) {
-   strncpy(str, utsname.release, str_size);
-   str[str_size - 1] = '\0';
+   strlcpy(str, utsname.release, str_size);
}
 
if (!puint || int_ver_ready)
-- 
2.17.1



[PATCH 13/18] ibmvscsi: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 drivers/scsi/ibmvscsi/ibmvscsi.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 17df76f0be3c..79eb8af03a19 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -1274,14 +1274,12 @@ static void send_mad_capabilities(struct 
ibmvscsi_host_data *hostdata)
if (hostdata->client_migrated)
hostdata->caps.flags |= cpu_to_be32(CLIENT_MIGRATED);
 
-   strncpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
+   strlcpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
sizeof(hostdata->caps.name));
-   hostdata->caps.name[sizeof(hostdata->caps.name) - 1] = '\0';
 
location = of_get_property(of_node, "ibm,loc-code", NULL);
location = location ? location : dev_name(hostdata->dev);
-   strncpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
-   hostdata->caps.loc[sizeof(hostdata->caps.loc) - 1] = '\0';
+   strlcpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
 
req->common.type = cpu_to_be32(VIOSRP_CAPABILITIES_TYPE);
req->buffer = cpu_to_be64(hostdata->caps_addr);
-- 
2.17.1



[PATCH 15/18] blktrace: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Using strlcpy fixes this new gcc warning:
kernel/trace/blktrace.c: In function ‘do_blk_trace_setup’:
kernel/trace/blktrace.c:497:2: warning: ‘strncpy’ specified bound 32 equals 
destination size [-Wstringop-truncation]
  strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
  ^

Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 kernel/trace/blktrace.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9ae283..2478d9838eab 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -494,8 +494,7 @@ static int do_blk_trace_setup(struct request_queue *q, char 
*name, dev_t dev,
if (!buts->buf_size || !buts->buf_nr)
return -EINVAL;
 
-   strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
-   buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
+   strlcpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 
/*
 * some device names have larger paths - convert the slashes
-- 
2.17.1



[PATCH 16/18] tools/accounting: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 tools/accounting/getdelays.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 9f420d98b5fb..66817a7a4fce 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -314,8 +314,7 @@ int main(int argc, char *argv[])
err(1, "Invalid rcv buf size\n");
break;
case 'm':
-   strncpy(cpumask, optarg, sizeof(cpumask));
-   cpumask[sizeof(cpumask) - 1] = '\0';
+   strlcpy(cpumask, optarg, sizeof(cpumask));
maskset = 1;
printf("cpumask %s maskset %d\n", cpumask, maskset);
break;
-- 
2.17.1



[PATCH 18/18] cpupower: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the serie) for the motivation behind this patch

 tools/power/cpupower/bench/parse.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/power/cpupower/bench/parse.c 
b/tools/power/cpupower/bench/parse.c
index 9ba8a44ad2a7..1566b89989b2 100644
--- a/tools/power/cpupower/bench/parse.c
+++ b/tools/power/cpupower/bench/parse.c
@@ -221,9 +221,8 @@ int prepare_config(const char *path, struct config *config)
sscanf(val, "%u", &config->cpu);
 
else if (strcmp("governor", opt) == 0) {
-   strncpy(config->governor, val,
+   strlcpy(config->governor, val,
sizeof(config->governor));
-   config->governor[sizeof(config->governor) - 1] = '\0';
}
 
else if (strcmp("priority", opt) == 0) {
-- 
2.17.1



Re: [Cocci] [PATCH 01/18] coccinelle: change strncpy+truncation to strlcpy

2018-07-13 Thread Dominique Martinet
Himanshu Jha wrote on Fri, Jul 13, 2018:
> > I expect each maintainer will pick their share of the patchs if they
> > agree with it and the rest will just be dropped?
> 
> Masahiro Yamada  takes coccinelle patches,
> so please cc him or your patch would be lost.

Thanks, will do.

> > +virtual patch
> > +virtual context
> 
> You might consider adding context rule or remove this line perhaps ?

Victim of copypasta, I'll remove this.

> > +-strncpy@p(
> > ++strlcpy(
> > +  dest, src, sz);
> > +-dest[sz - 1] = '\0';
> 
> The above rule produces an output that I think is not correct:
> --
> diff = 
> diff -u -p a//ti/wl1251/acx.c b//ti/wl1251/acx.c
> --- a//ti/wl1251/acx.c
> +++ b//ti/wl1251/acx.c
> @@ -150,14 +150,7 @@ int wl1251_acx_fw_version(struct wl1251
>   }
>  
>   /* be careful with the buffer sizes */
> - strncpy(buf, rev->fw_version, min(len, sizeof(rev->fw_version)));
> -
> - /*
> -  * if the firmware version string is exactly
> -  * sizeof(rev->fw_version) long or fw_len is less than
> -  * sizeof(rev->fw_version) it won't be null terminated
> -  */
> - buf[min(len, sizeof(rev->fw_version)) - 1] = '\0';
> + strlcpy(buf, rev->fw_version, min(len, sizeof(rev->fw_version)));
> 
> -
> 
> I think the comment is useful and should not be removed.

I agree this comment is useful now that I'm taking a closer look, I
glanced at this too fast.
I'm not sure how to make coccinelle not remove comments between lines
though?

> Also, consider changing Confidence level appropriately.

I am (was?) pretty confident on the change itself, the only exceptions
would be if someone relied on strncpy to fill the end of the buffer with
zero to not leak data somewhere but that is not easy to judge by itself
(although I hope rare enough)

I'm honestly not sure what would be appropriate in this case.

-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH] net/9p: Fix a deadlock case in the virtio transport

2018-07-14 Thread Dominique Martinet
jiangyiwen wrote on Sat, Jul 14, 2018:
> When client has multiple threads that issue io requests all the
> time, and the server has a very good performance, it may cause
> cpu is running in the irq context for a long time because it can
> check virtqueue has buf in the *while* loop.
> 
> So we should keep chan->lock in the whole loop.

Hmm, this is generally bad practice to hold a spin lock for long.
In general, spin locks are meant to protect data, not code.

I'd want some numbers to decide on this one, even if I think this
particular case is safe (e.g. this cannot dead-lock)

> Signed-off-by: Yiwen Jiang 
> ---
>  net/9p/trans_virtio.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 05006cb..9b0f5f2 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -148,20 +148,18 @@ static void req_done(struct virtqueue *vq)
> 
>   p9_debug(P9_DEBUG_TRANS, ": request done\n");
> 
> + spin_lock_irqsave(&chan->lock, flags);
>   while (1) {
> - spin_lock_irqsave(&chan->lock, flags);
>   req = virtqueue_get_buf(chan->vq, &len);
> - if (req == NULL) {
> - spin_unlock_irqrestore(&chan->lock, flags);
> + if (req == NULL)
>   break;
> - }
>   chan->ring_bufs_avail = 1;
> - spin_unlock_irqrestore(&chan->lock, flags);
>   /* Wakeup if anyone waiting for VirtIO ring space. */
>   wake_up(chan->vc_wq);

In particular, the wake up here echoes to wait events that will
immediately try to grab the lock, and will needlessly spin on it until
this thread is done.
If we do go this way I'd want setting chan->ring_bufs_avail to be done
just before unlocking and the wakeup to be done just after unlocking out
of the loop iff we processed at least one iteration here.

That should also save you precious cpu cycles while under lock :)

-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH] net/9p: Fix a deadlock case in the virtio transport

2018-07-14 Thread Dominique Martinet
jiangyiwen wrote on Sat, Jul 14, 2018:
> On 2018/7/14 17:05, Dominique Martinet wrote:
> > jiangyiwen wrote on Sat, Jul 14, 2018:
> >> When client has multiple threads that issue io requests all the
> >> time, and the server has a very good performance, it may cause
> >> cpu is running in the irq context for a long time because it can
> >> check virtqueue has buf in the *while* loop.
> >>
> >> So we should keep chan->lock in the whole loop.
> > 
> > Hmm, this is generally bad practice to hold a spin lock for long.
> > In general, spin locks are meant to protect data, not code.
> > 
> > I'd want some numbers to decide on this one, even if I think this
> > particular case is safe (e.g. this cannot dead-lock)
> > 
> 
> Actually, the loop will not hold a spin lock for long, because other
> threads will not issue new requests in this case. In addition,
> virtio-blk or virtio-scsi also use this solution, I guess it may also
> encounter this problem before.

Fair enough. If you do have some numbers to give though (throughput
and/or iops before/after) I'd still be really curious.

> >>chan->ring_bufs_avail = 1;
> >> -  spin_unlock_irqrestore(&chan->lock, flags);
> >>/* Wakeup if anyone waiting for VirtIO ring space. */
> >>wake_up(chan->vc_wq);
> > 
> > In particular, the wake up here echoes to wait events that will
> > immediately try to grab the lock, and will needlessly spin on it until
> > this thread is done.
> > If we do go this way I'd want setting chan->ring_bufs_avail to be done
> > just before unlocking and the wakeup to be done just after unlocking out
> > of the loop iff we processed at least one iteration here.
> 
> I can move the wakeup operation after the unlocking. Like what I said
> above, I think this loop will not execute for long.

Please do, you listed virtio_blk as doing this and they have the same
kind of pattern with a req_done bool and only restarting stopped queues
if they processed something

-- 
Dominique


Re: [PATCH] net/9p: include trans_common.h to fix missing prototype warning.

2018-11-13 Thread Dominique Martinet
Adeodato Simó wrote on Tue, Nov 13, 2018:
> This silences -Wmissing-prototypes when defining p9_release_pages.
> 
> Signed-off-by: Adeodato Simó 
> ---
>  net/9p/trans_common.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> It was suggested in the kernel-janitors mailing list that silencing
> -Wmissing-prototypes would make for a worthy cause.
> 
> https://www.spinics.net/lists/linux-kernel-janitors/msg43981.html
> 
> Resending w/ Cc: linux-kernel as requested by Dominique.

Great, pushed to my for-next branch, it'll probably get in there today
or tomorrow.

Thanks!
-- 
Dominique


Re: [V9fs-developer] [PATCH] p9_check_errors() validate PDU length

2018-07-10 Thread Dominique Martinet
Tomas Bortoli wrote on Tue, Jul 10, 2018:
> > This however gets more complicated once you start factoring in that
> > change I suggested about p9_parse_header not setting size (and checking
> > size) because trans_fd relies on it; so I'm not sure how we should
> > proceed.
>
> Mmh, me neither. I don't see where the *actual* PDU length is stored.

That's precisely the problem, none of the transports actually write in
the pdu how much has been read.
Instead, we blindly trust how much the client says it has written and
p9_parse_header will write pdu->size which is then used as the 'actual'
PDU length.


With the three lines I added in each file all transports will store the
PDU length in pdu->size so my following suggestion would be to just
check that in p9_parse_header the length read from the packet match that
of the received data and flag any packet where this is wrong as invalid.

The only problem with that is that TCP will use p9_parse_header with a
partial packet to know how much it needs to read, so that modification
needs actual testing (hence my question below) as it is more intrusive
than a simple boundary check.

I'd suggest implementing some logic like already here with the if
(pdu->size == 0) -- only check if pdu->size was previously set, and set
it if it wasn't... But feel free to try something else.

With that done I don't expect more problems but if I do it there's
little point in making you do it as well ;)


> > Do you have a working 9p tcp server to test changes are valid, or are
> > you only working off the syzbot reproducer?
>
> No, I was just using the reproducer to test.
>
> > In the first place, are you willing to take the time to do that bigger
> > fix?
>
> Yes.

Ok, let's get you a working setup for TCP then.
You need to install a 9p server, the two I'm aware of are diod and
nfs-ganesha (I'm using ganesha); once you have a working config you can
mount with something like this:
 mount -t 9p -o aname=path,trans=fd  

Once you have that you should be able to fiddle with things until it
works as expected.
(for virtio I use qemu, that's probably easy to test as well; for RDMA
you can use rxe to setup a virtual interface (with rxe_cfg) and then use
either diod or ganesha again but the setup is more complicated so it's
OK to leave that aside for now)

> For me both ways are good. Signed-off-by will be good.
> You know for sure more than me about 9p as I started delving it for the
> first time yesterday. I can also make and test a patch but I'll need to
> understand more about it. Let me know if you find out more.

There's a saying about giving a fish or teaching how to fish, so let's
have you try if you're motivated :)

I don't think much is left if you can sew the pieces together, the most
difficult part will probably to set the test environment up if you want
to do this properly.


Please ask if you run into any trouble,
-- 
Dominique Martinet | Asmadeus


Re: [V9fs-developer] [PATCH] net/9p/client.c: put refcount of trans_mod in error case in parse_opts()

2018-07-10 Thread Dominique Martinet
Andrew,

there seem to be some renew of interest in 9P lately, so if you'd like I
can take care of rounding these up and prepare a pull request for 4.19
(as we're already well into 4.18 release cycle, I believe most of the
patches can wait)

This patch however I consider important enough to take for 4.18 so could
you please grab it for now?

I've gathered the Review tags and added my own, feel free to change my
Reviewed-and-tested-by tag to Signed-off-by if it seems more appropriate
as I'm actively pushing for this patch.

piaojun wrote on Fri, Jul 06, 2018:
> >From my test, the second mount will fail after umounting successfully.
> The reason is that we put refcount of trans_mod in the correct case rather
> than the error case in parse_opts() at last. That will cause the refcount
> decrease to -1, and when we try to get trans_mod again in
> try_module_get(), we could only increase refcount to 0 which will cause
> failure as follows:
> parse_opts
>   v9fs_get_trans_by_name
> try_module_get : return NULL to caller which cause error
> 
> So we should put refcount of trans_mod in error case.
> 
> Fixes: 9421c3e64137ec ("net/9p/client.c: fix potential refcnt problem of 
> trans module")
> 
> Signed-off-by: Jun Piao 
Reviewed-by: Yiwen Jiang 
Reviewed-by: Greg Kurz 
Reviewed-and-tested-by: Dominique Martinet 

> ---
>  net/9p/client.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 18c5271..5c13431 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -225,7 +225,8 @@ static int parse_opts(char *opts, struct p9_client *clnt)
>   }
> 
>  free_and_return:
> - v9fs_put_trans(clnt->trans_mod);
> +     if (ret)
> + v9fs_put_trans(clnt->trans_mod);
>   kfree(tmp_options);
>   return ret;
>  }

Thanks,
-- 
Dominique Martinet


[PATCH] i915/intel_tv_get_modes: fix strncpy truncation warning

2018-07-11 Thread Dominique Martinet
This is effectively no-op as the next line writes a nul at the final
byte of the buffer, so copying one letter less does not change the
behaviour.

Signed-off-by: Dominique Martinet 
---

gcc 8 gives the following warning, which I am not sure why is treated
as error for this file, thus making me fix it:
drivers/gpu/drm/i915/intel_tv.c: In function ‘intel_tv_get_modes’:
drivers/gpu/drm/i915/intel_tv.c:1358:3: error: ‘strncpy’ specified bound 32 
equals destination size [-Werror=stringop-truncation]
   strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN);
   ^~
cc1: all warnings being treated as errors


Also, as a side note, while checking if this was already sent I stumbled
uppon this: 
https://lists.freedesktop.org/archives/intel-gfx/2018-July/170638.html
([Intel-gfx] [PATCH i-g-t 6/7] Fix truncate string in the strncpy)
which replaces strncpy(dest, src, sizeof(dest)) by strncpy(dest, src,
strlen(dest))... This isn't for linux but this looks like a pretty bad
idea to me... (I'm not on the list to reply there, sorry)

 drivers/gpu/drm/i915/intel_tv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_tv.c b/drivers/gpu/drm/i915/intel_tv.c
index b55b5c157e38..f5614d07b10d 100644
--- a/drivers/gpu/drm/i915/intel_tv.c
+++ b/drivers/gpu/drm/i915/intel_tv.c
@@ -1355,7 +1355,7 @@ intel_tv_get_modes(struct drm_connector *connector)
mode_ptr = drm_mode_create(connector->dev);
if (!mode_ptr)
continue;
-   strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN);
+   strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN - 1);
mode_ptr->name[DRM_DISPLAY_MODE_LEN - 1] = '\0';
 
mode_ptr->hdisplay = hactive_s;
-- 
2.17.1



Re: [V9fs-developer] [PATCH 2/6] 9p: Replace the fidlist with an IDR

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jun 28, 2018:
> diff --git a/net/9p/client.c b/net/9p/client.c
> index f8d58b0852fe..bbab82f22c20 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -908,30 +908,27 @@ static struct p9_fid *p9_fid_create(struct p9_client 
> *clnt)
>  {
>   int ret;
>   struct p9_fid *fid;
> - unsigned long flags;
>  
>   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
>   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
>   if (!fid)
>   return NULL;
>  
> - ret = p9_idpool_get(clnt->fidpool);
> - if (ret < 0)
> - goto error;
> - fid->fid = ret;
> -
>   memset(&fid->qid, 0, sizeof(struct p9_qid));
>   fid->mode = -1;
>   fid->uid = current_fsuid();
>   fid->clnt = clnt;
>   fid->rdir = NULL;
> - spin_lock_irqsave(&clnt->lock, flags);
> - list_add(&fid->flist, &clnt->fidlist);
> - spin_unlock_irqrestore(&clnt->lock, flags);
>  
> - return fid;
> + idr_preload(GFP_KERNEL);
> + spin_lock_irq(&clnt->lock);
> + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, UINT_MAX, GFP_NOWAIT);

There's also a P9_NOFID that we shouldn't use, so the max here should be
P9_NOFID - 1

That aside this introduces a change of behaviour that fid used to be
alloc'd linearily from 0 which no longer holds true, that breaks one
serveur (nfs-ganesha just returns ERANGE) but others seem to handle this
just fine so they'll just need to fix that server.
max aside this looks good.

-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH 2/6] 9p: Replace the fidlist with an IDR

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Wed, Jul 11, 2018:
> On Wed, Jul 11, 2018 at 02:40:38PM +0200, Dominique Martinet wrote:
> > > + idr_preload(GFP_KERNEL);
> > > + spin_lock_irq(&clnt->lock);
> > > + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, UINT_MAX, GFP_NOWAIT);
> > 
> > There's also a P9_NOFID that we shouldn't use, so the max here should be
> > P9_NOFID - 1
> 
> Happy to fix that.  It shouldn't actually happen, of course.  I can't
> imagine us having 4 billion FIDs in use at once.

Oh, I said that assuming that it was random.. But I guess we might as
well fix it.

> > That aside this introduces a change of behaviour that fid used to be
> > alloc'd linearily from 0 which no longer holds true, that breaks one
> > serveur (nfs-ganesha just returns ERANGE) but others seem to handle this
> > just fine so they'll just need to fix that server.
> > max aside this looks good.
> 
> I don't understand your assertion that this is a change of behaviour.
> The implementation of p9_idpool_get() uses idr_alloc(), not
> idr_alloc_cyclic(), so I don't believe I've changed which FID would
> be allocated.

Hmm, I just tried mounting something with ganesha and that broke because
it received fid=1730858632 in a TATTACH request, so something in the
patch series made some big fid happens.

If you say this isn't intented though this needs debugging, I'm glad I
brought this up :P

(Note that it really should be fine if it is random within the pool, I
have notified the current developpers of the problem)
-- 
Dominique


Re: [V9fs-developer] [PATCH 5/6] 9p: Use a slab for allocating requests

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jun 28, 2018:
> diff --git a/include/net/9p/client.h b/include/net/9p/client.h
> index 0fa0fbab33b0..fd326811ebd4 100644
> --- a/include/net/9p/client.h
> +++ b/include/net/9p/client.h
> @@ -92,24 +85,12 @@ enum p9_req_status_t {
>   * struct p9_req_t - request slots
>   * @status: status of this request slot
>   * @t_err: transport error
> - * @flush_tag: tag of request being flushed (for flush requests)
>   * @wq: wait_queue for the client to block on for this request
>   * @tc: the request fcall structure
>   * @rc: the response fcall structure
>   * @aux: transport specific data (provided for trans_fd migration)
>   * @req_list: link for higher level objects to chain requests
> - *
> - * Transport use an array to track outstanding requests
> - * instead of a list.  While this may incurr overhead during initial
> - * allocation or expansion, it makes request lookup much easier as the
> - * tag id is a index into an array.  (We use tag+1 so that we can accommodate
> - * the -1 tag for the T_VERSION request).
> - * This also has the nice effect of only having to allocate wait_queues
> - * once, instead of constantly allocating and freeing them.  Its possible
> - * other resources could benefit from this scheme as well.
> - *
>   */
> -
>  struct p9_req_t {
>   int status;
>   int t_err;
> @@ -117,40 +98,26 @@ struct p9_req_t {
>   struct p9_fcall *tc;
>   struct p9_fcall *rc;
>   void *aux;
> -
>   struct list_head req_list;
>  };
>  
>  /**
>   * struct p9_client - per client instance state
> - * @lock: protect @fidlist
> + * @lock: protect @fids and @reqs
>   * @msize: maximum data size negotiated by protocol
> - * @dotu: extension flags negotiated by protocol
>   * @proto_version: 9P protocol version to use
>   * @trans_mod: module API instantiated with this client
> + * @status: XXX

Let's give this a proper comment; something like 'connection state
machine' ? (this contains Connected/BeginDisconnect/Disconnected/Hung)


> diff --git a/net/9p/client.c b/net/9p/client.c
> index 2dce8d8307cc..fd5446377445 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -241,132 +241,104 @@ static struct p9_fcall *p9_fcall_alloc(int 
> alloc_msize)
>   return fc;
>  }
>  
> +static struct kmem_cache *p9_req_cache;
> +
>  /**
> - * p9_tag_alloc - lookup/allocate a request by tag
> - * @c: client session to lookup tag within
> - * @tag: numeric id for transaction
> - *
> - * this is a simple array lookup, but will grow the
> - * request_slots as necessary to accommodate transaction
> - * ids which did not previously have a slot.
> - *
> - * this code relies on the client spinlock to manage locks, its
> - * possible we should switch to something else, but I'd rather
> - * stick with something low-overhead for the common case.
> + * p9_req_alloc - Allocate a new request.
> + * @c: Client session.
> + * @type: Transaction type.
> + * @max_size: Maximum packet size for this request.
>   *
> + * Context: Process context.  What mutex might we be holding that
> + * requires GFP_NOFS?

Good question, but p9_tag_alloc happens on every single client request
so the fs/9p functions might be trying to do something and the alloc
request here comes in and could try to destroy the inode that is
currently used in the request -- I'm not sure how likely this is, but
I'd rather not tempt fate :p

> + * Return: Pointer to new request.
>   */
> -
>  static struct p9_req_t *
> -p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
> +p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
>  {
> - unsigned long flags;
> - int row, col;
> - struct p9_req_t *req;
> + struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
>   int alloc_msize = min(c->msize, max_size);
> + int tag;
>  
> - /* This looks up the original request by tag so we know which
> -  * buffer to read the data into */
> - tag++;
> -
> - if (tag >= c->max_tag) {
> - spin_lock_irqsave(&c->lock, flags);
> - /* check again since original check was outside of lock */
> - while (tag >= c->max_tag) {
> - row = (tag / P9_ROW_MAXTAG);
> - c->reqs[row] = kcalloc(P9_ROW_MAXTAG,
> - sizeof(struct p9_req_t), GFP_ATOMIC);
> -
> - if (!c->reqs[row]) {
> - pr_err("Couldn't grow tag array\n");
> - spin_unlock_irqrestore(&c->lock, flags);
> - return ERR_PTR(-ENOMEM);
> - }
> - for (col = 0; col < P9_ROW_MAXTAG; col++) {
> - req = &c->reqs[row][col];
> - req->status = REQ_STATUS_IDLE;
> - init_waitqueue_head(&req->wq);
> - }
> - c->max_tag += P9_ROW_MAXTAG;
> - }
> -

Re: [V9fs-developer] [PATCH 0/6] 9p: Use IDRs more effectively

2018-07-11 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jun 28, 2018:
> The 9p code doesn't take advantage of the IDR's ability to store
> a pointer.  We can actually get rid of the p9_idpool abstraction
> and the multi-dimensional array of requests.
> 
> I haven't tested these patches, so caveat maintainer.

Overall pretty good for something that hadn't even been tested!

I'm done with the comments here, the three other patches I didn't
comment look good to me.

For [PATCH 4/6] 9p: Remove an unnecessary memory barrier, your suggested
wording in the comment is good; I've offered to do it if you aren't
going to but since you're submitting a second version of the other
patchs I suggest you do this one as well.


Thanks for the well-needed cleanup,
-- 
Dominique Martinet


Re: [V9fs-developer] [PATCH 4/6] 9p: Remove an unnecessary memory barrier

2018-06-28 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jun 28, 2018:
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -436,13 +436,9 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t 
> *req, int status)
>  {
>   p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag);
>  
> - /*
> -  * This barrier is needed to make sure any change made to req before
> -  * the other thread wakes up will indeed be seen by the waiting side.
> -  */
> - smp_wmb();
>   req->status = status;
>  
> + /* wake_up is an implicit write memory barrier */

Nope.
Please note the wmb is _before_ setting status, basically it protects
from cpu optimizations where status could be set before other fields,
then other core opportunistically checking and finding status is good so
other thread continuing.

I could only reproduce this bug with infiniband network, but it is very
definitely needed. Here is the commit message of when I added that barrier:
-
9P: Add memory barriers to protect request fields over cb/rpc threads handoff

We need barriers to guarantee this pattern works as intended:
[w] req->rc, 1  [r] req->status, 1
wmb rmb
[w] req->status, 1  [r] req->rc

Where the wmb ensures that rc gets written before status,
and the rmb ensures that if you observe status == 1, rc is the new value.
-

It might need an update to the comment though, if you thought about
removing it...
-- 
Dominique Martinet | Asmadeus


Re: [V9fs-developer] [PATCH 4/6] 9p: Remove an unnecessary memory barrier

2018-06-28 Thread Dominique Martinet
Matthew Wilcox wrote on Thu, Jun 28, 2018:
> How about this?
> 
>   /*
>* This barrier is needed to make sure any change made to req before
> -  * the other thread wakes up will indeed be seen by the waiting side.
> +  * the status change is visible to another thread
>*/

Yes, that sounds better.

This code is fairly old and I was wondering if the new WRITE_ONCE and
READ_ONCE macros would help but it looks like compile-time barriers only
so I do not think they would be enough...
Documentation/memory-barriers.txt still suggests something similar in
the "SLEEP AND WAKE-UP FUNCTIONS" section so I guess this is fine.


Thanks,
-- 
Dominique Martinet | Asmadeus


Re: [PATCH 18/18] cpupower: change strncpy+truncation to strlcpy

2018-08-14 Thread Dominique Martinet
Daniel Díaz wrote on Tue, Aug 14, 2018:
> I can't get cpupower to compile anymore now that it made its way to 
> linux-next:
>   [/linux/tools/power/cpupower]$ make
> CC   lib/cpufreq.o
>   [...]
>   make[1]: Entering directory '/linux/tools/power/cpupower/bench'
> CC   main.o
> CC   parse.o
>   parse.c: In function ‘prepare_config’:
>   parse.c:224:4: warning: implicit declaration of function ‘strlcpy’
> [-Wimplicit-function-declaration]
>   strlcpy(config->governor, val,
>   ^
> CC   system.o
> CC   benchmark.o
> CC   cpufreq-bench
>   .//parse.o: In function `prepare_config':
>   /linux/tools/power/cpupower/bench/parse.c:224: undefined reference
> to `strlcpy'
>   collect2: error: ld returned 1 exit status
>   Makefile:25: recipe for target 'cpufreq-bench' failed
>   make[1]: *** [cpufreq-bench] Error 1
>   make[1]: Leaving directory '/linux/tools/power/cpupower/bench'
>   Makefile:258: recipe for target 'compile-bench' failed
>   make: *** [compile-bench] Error 2
> 
> Does it need anything special to make?

Ugh, no, I am really ashamed about this patch series for insufficient
testing in general. It is currently "under rework" for an indefinite
time frame as I have had other priorities but I'll add cpupower to the
list...
More precisely, the function is defined in the linux kernel but for
userspace strlcpy is only available through libbsd, and I don't believe
we should pull that in just for this.

I'll send a second patch using snprintf and warning if a truncation
occurs (which is the proper fix that the gcc folks intended people to do
anyway) when I get around to it, but I would recommend to just revert
the patch for now.


Shuah, could you take the patch off please if you haven't pushed it to
linus yet?


Sorry for the time you might have spent on this,
-- 
Dominique Martinet


[PATCH v2] 9p: Remove Ron Minnich from MAINTAINERS

2018-08-16 Thread Dominique Martinet
From: Dominique Martinet 

Ron Minnich has left Sandia in 2011, and has not been involved in any
9p commit in the recent years.

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Ron Minnich 
Cc: Ronald G. Minnich 
Cc: Latchesar Ionkov 
To: Andrew Morton 
---
v2: added CREDITS entry

 CREDITS | 5 +
 MAINTAINERS | 1 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/CREDITS b/CREDITS
index 989cda91c427..5befd2d714d0 100644
--- a/CREDITS
+++ b/CREDITS
@@ -2571,6 +2571,11 @@ S: Helstorfer Str. 7
 S: D-30625 Hannover
 S: Germany
 
+N: Ron Minnich
+E: rminn...@sandia.gov
+E: rminn...@gmail.com
+D: 9p filesystem development
+
 N: Corey Minyard
 E: miny...@wf-rch.cirr.com
 E: miny...@mvista.com
diff --git a/MAINTAINERS b/MAINTAINERS
index 15116e6dff3a..09c2a6c0fb0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -199,7 +199,6 @@ F:  drivers/net/ethernet/8390/
 
 9P FILE SYSTEM
 M: Eric Van Hensbergen 
-M: Ron Minnich 
 M: Latchesar Ionkov 
 M: Dominique Martinet 
 L: v9fs-develo...@lists.sourceforge.net
-- 
2.17.1



Re: [PATCH 1/3] 9p/net: implement asynchronous rpc

2018-12-17 Thread Dominique Martinet
Tomas Bortoli wrote on Mon, Dec 17, 2018:
> sorry for the delay, I've been quite busy these days.

No problem.

> The patches looks good to me and should indeed speed up the code a bit.
> I quickly tested them against Syzkaller tuned for the 9p subsystem and
> everything seems fine.

Thanks, can I add your Reviewed-by on all three?

> And by the way, which refcount races?

There's a problem with trans_fd read_work and cancelled callback; I'm
not so sure about refcount but we can definitely get double list_del
as we're not checking the status. I think when we incorrectly remove
from the list we also mismanage the refcount, but honestly need to
test..

-- 
Dominique


[PATCH 3/3] 9p/net: make flush asynchronous

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

Make the client flush asynchronous so we don't need to ignore signals in
rpc functions:
 - on signal, send a flush request as we previously did but use the new
asynchronous helper and return immediately
 - when the reply has been received or connection is destroyed, free
both tags in the handler

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---
 include/net/9p/client.h |   2 +
 net/9p/client.c | 172 ++--
 2 files changed, 78 insertions(+), 96 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 75d7f83e5b94..dcd40e7ef202 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -91,6 +91,7 @@ enum p9_req_status_t {
  * @aux: transport specific data (provided for trans_fd migration)
  * @req_list: link used by trans_fd
  * @async_list: link used to check on async requests
+ * @flushed_req: for flush, points to matching flushed req
  * @clunked_fid: for clunk, points to fid
  */
 struct p9_req_t {
@@ -104,6 +105,7 @@ struct p9_req_t {
struct list_head req_list;
struct list_head async_list;
union {
+   struct p9_req_t *flushed_req;
struct p9_fid *clunked_fid;
};
 };
diff --git a/net/9p/client.c b/net/9p/client.c
index a47b5a54573d..666a722088e9 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -694,50 +694,6 @@ static void p9_fid_destroy(struct p9_fid *fid)
kfree(fid);
 }
 
-static struct p9_req_t *
-p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
-
-/**
- * p9_client_flush - flush (cancel) a request
- * @c: client state
- * @oldreq: request to cancel
- *
- * This sents a flush for a particular request and links
- * the flush request to the original request.  The current
- * code only supports a single flush request although the protocol
- * allows for multiple flush requests to be sent for a single request.
- *
- */
-
-static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
-{
-   struct p9_req_t *req;
-   int16_t oldtag;
-   int err;
-
-   err = p9_parse_header(&oldreq->tc, NULL, NULL, &oldtag, 1);
-   if (err)
-   return err;
-
-   p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag);
-
-   req = p9_client_rpc(c, P9_TFLUSH, "w", oldtag);
-   if (IS_ERR(req))
-   return PTR_ERR(req);
-
-   /*
-* if we haven't received a response for oldreq,
-* remove it from the list
-*/
-   if (oldreq->status == REQ_STATUS_SENT) {
-   if (c->trans_mod->cancelled)
-   c->trans_mod->cancelled(c, oldreq);
-   }
-
-   p9_tag_remove(c, req);
-   return 0;
-}
-
 static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
  int8_t type, int req_size,
  const char *fmt, va_list ap)
@@ -800,6 +756,39 @@ p9_client_async_rpc(struct p9_client *c, int8_t type, 
const char *fmt, ...)
return req;
 }
 
+/**
+ * p9_client_flush - flush (cancel) a request
+ * @c: client state
+ * @oldreq: request to cancel
+ *
+ * This sents a flush for a particular request and links
+ * the flush request to the original request.  The current
+ * code only supports a single flush request although the protocol
+ * allows for multiple flush requests to be sent for a single request.
+ *
+ */
+
+static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
+{
+   struct p9_req_t *req;
+   int16_t oldtag = oldreq->tc.tag;
+
+   p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag);
+   req = p9_client_async_rpc(c, P9_TFLUSH, "w", oldtag);
+   if (IS_ERR(req)) {
+   return PTR_ERR(req);
+   }
+
+   p9_debug(P9_DEBUG_MUX, "sent flush for oldreq %d type %d with flush tag 
%d\n",
+oldtag, oldreq->tc.id, req->tc.tag);
+   req->flushed_req = oldreq;
+   spin_lock_irq(&c->lock);
+   list_add(&req->async_list, &c->async_req_list);
+   spin_unlock_irq(&c->lock);
+
+   return 0;
+}
+
 static void p9_client_handle_async(struct p9_client *c, bool free_all)
 {
struct p9_req_t *req, *nreq;
@@ -823,6 +812,24 @@ static void p9_client_handle_async(struct p9_client *c, 
bool free_all)
}
if (free_all || req->status >= REQ_STATUS_RCVD) {
/* Put old refs whatever reqs actually returned */
+   if (req->tc.id == P9_TFLUSH) {
+   p9_debug(P9_DEBUG_MUX,
+"flushing oldreq tag %d status %d, 
flush_req tag %d\n",
+  

[PATCH 1/3] 9p/net: implement asynchronous rpc

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

Add p9_client_async_rpc that will let us send an rpc without waiting
for the reply, as well as an async handler hook.

The work done in the hook could mostly be done in the client callback,
but I prefered not to for a couple of reasons:
 - the callback can be called in an IRQ for some transports, and the
handler needs to call p9_tag_remove which takes the client lock that has
recently been reworked to not be irq-compatible (as it was never taken
in IRQs until now)
 - the handled replies can be handled anytime, there is nothing the
userspace would care about and want to be notified of quickly
 - the async request list and this function would be needed anyway for
when we disconnect the client, in order to not leak async requests that
haven't been replied to

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---

I've been sitting on these patches for almost a month now because I
wanted to fix the cancel race, but I think it's what the most recent
syzbot email is about so if it could find it without this it probably
won't hurt to at least get some reviews for these three patches first.

I won't submit these for next cycle, so will only put them into -next
after 4.20 is released; hopefully I'll have time to look at the two
pending refcount races around that time.
Until then, comments please!


 include/net/9p/client.h |  9 +-
 net/9p/client.c | 64 +
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 947a570307a6..a4ded7666c73 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -89,7 +89,8 @@ enum p9_req_status_t {
  * @tc: the request fcall structure
  * @rc: the response fcall structure
  * @aux: transport specific data (provided for trans_fd migration)
- * @req_list: link for higher level objects to chain requests
+ * @req_list: link used by trans_fd
+ * @async_list: link used to check on async requests
  */
 struct p9_req_t {
int status;
@@ -100,6 +101,7 @@ struct p9_req_t {
struct p9_fcall rc;
void *aux;
struct list_head req_list;
+   struct list_head async_list;
 };
 
 /**
@@ -110,6 +112,9 @@ struct p9_req_t {
  * @trans_mod: module API instantiated with this client
  * @status: connection state
  * @trans: tranport instance state and API
+ * @fcall_cache: kmem cache for client request data (msize-specific)
+ * @async_req_lock: protects @async_req_list
+ * @async_req_list: list of requests waiting a reply
  * @fids: All active FID handles
  * @reqs: All active requests.
  * @name: node name used as client id
@@ -125,6 +130,8 @@ struct p9_client {
enum p9_trans_status status;
void *trans;
struct kmem_cache *fcall_cache;
+   spinlock_t async_req_lock;
+   struct list_head async_req_list;
 
union {
struct {
diff --git a/net/9p/client.c b/net/9p/client.c
index 357214a51f13..0a67c3ccd4fd 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -729,6 +729,62 @@ static struct p9_req_t *p9_client_prepare_req(struct 
p9_client *c,
return ERR_PTR(err);
 }
 
+static struct p9_req_t *
+p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
+{
+   va_list ap;
+   int err;
+   struct p9_req_t *req;
+
+   va_start(ap, fmt);
+   req = p9_client_prepare_req(c, type, c->msize, fmt, ap);
+   va_end(ap);
+   if (IS_ERR(req))
+   return req;
+
+   err = c->trans_mod->request(c, req);
+   if (err < 0) {
+   /* bail out without flushing... */
+   p9_req_put(req);
+   p9_tag_remove(c, req);
+   if (err != -ERESTARTSYS && err != -EFAULT)
+   c->status = Disconnected;
+   return ERR_PTR(safe_errno(err));
+   }
+
+   return req;
+}
+
+static void p9_client_handle_async(struct p9_client *c, bool free_all)
+{
+   struct p9_req_t *req, *nreq;
+
+   /* it's ok to miss some async replies here, do a quick check without
+* lock first unless called with free_all
+*/
+   if (!free_all && list_empty(&c->async_req_list))
+   return;
+
+   spin_lock_irq(&c->async_req_lock);
+   list_for_each_entry_safe(req, nreq, &c->async_req_list, async_list) {
+   if (free_all && req->status < REQ_STATUS_RCVD) {
+   p9_debug(P9_DEBUG_ERROR,
+"final async handler found req tag %d type %d 
status %d\n",
+req->tc.tag, req->tc.id, req->status);
+   if (c->trans_mod->cancelled)
+   c->trans_mod->cancelled(c, req);
+   /* won't recei

[PATCH 2/3] 9p/net: make clunk asynchronous

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

clunk is defined as making the fid invalid whatever the server returns,
and we should ignore errors so it is a good candidate for async call.

The change should make 9p slightly faster (many vfs systeme calls won't
need to wait for that clunk), but more importantly the flush rework
means we won't clear pending signals and the current implementation of
"retry twice then leak the fid" will stop working, so this needed
improving.

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---
 include/net/9p/client.h |   4 ++
 net/9p/client.c | 124 +++-
 2 files changed, 62 insertions(+), 66 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index a4ded7666c73..75d7f83e5b94 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -91,6 +91,7 @@ enum p9_req_status_t {
  * @aux: transport specific data (provided for trans_fd migration)
  * @req_list: link used by trans_fd
  * @async_list: link used to check on async requests
+ * @clunked_fid: for clunk, points to fid
  */
 struct p9_req_t {
int status;
@@ -102,6 +103,9 @@ struct p9_req_t {
void *aux;
struct list_head req_list;
struct list_head async_list;
+   union {
+   struct p9_fid *clunked_fid;
+   };
 };
 
 /**
diff --git a/net/9p/client.c b/net/9p/client.c
index 0a67c3ccd4fd..a47b5a54573d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -649,6 +649,51 @@ static int p9_check_zc_errors(struct p9_client *c, struct 
p9_req_t *req,
return err;
 }
 
+static struct p9_fid *p9_fid_create(struct p9_client *clnt)
+{
+   int ret;
+   struct p9_fid *fid;
+
+   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
+   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
+   if (!fid)
+   return NULL;
+
+   memset(&fid->qid, 0, sizeof(struct p9_qid));
+   fid->mode = -1;
+   fid->uid = current_fsuid();
+   fid->clnt = clnt;
+   fid->rdir = NULL;
+   fid->fid = 0;
+
+   idr_preload(GFP_KERNEL);
+   spin_lock_irq(&clnt->lock);
+   ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1,
+   GFP_NOWAIT);
+   spin_unlock_irq(&clnt->lock);
+   idr_preload_end();
+
+   if (!ret)
+   return fid;
+
+   kfree(fid);
+   return NULL;
+}
+
+static void p9_fid_destroy(struct p9_fid *fid)
+{
+   struct p9_client *clnt;
+   unsigned long flags;
+
+   p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid);
+   clnt = fid->clnt;
+   spin_lock_irqsave(&clnt->lock, flags);
+   idr_remove(&clnt->fids, fid->fid);
+   spin_unlock_irqrestore(&clnt->lock, flags);
+   kfree(fid->rdir);
+   kfree(fid);
+}
+
 static struct p9_req_t *
 p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
 
@@ -778,6 +823,9 @@ static void p9_client_handle_async(struct p9_client *c, 
bool free_all)
}
if (free_all || req->status >= REQ_STATUS_RCVD) {
/* Put old refs whatever reqs actually returned */
+   if (req->tc.id == P9_TCLUNK) {
+   p9_fid_destroy(req->clunked_fid);
+   }
list_del(&req->async_list);
p9_tag_remove(c, req);
}
@@ -959,51 +1007,6 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client 
*c, int8_t type,
return ERR_PTR(safe_errno(err));
 }
 
-static struct p9_fid *p9_fid_create(struct p9_client *clnt)
-{
-   int ret;
-   struct p9_fid *fid;
-
-   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
-   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
-   if (!fid)
-   return NULL;
-
-   memset(&fid->qid, 0, sizeof(struct p9_qid));
-   fid->mode = -1;
-   fid->uid = current_fsuid();
-   fid->clnt = clnt;
-   fid->rdir = NULL;
-   fid->fid = 0;
-
-   idr_preload(GFP_KERNEL);
-   spin_lock_irq(&clnt->lock);
-   ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1,
-   GFP_NOWAIT);
-   spin_unlock_irq(&clnt->lock);
-   idr_preload_end();
-
-   if (!ret)
-   return fid;
-
-   kfree(fid);
-   return NULL;
-}
-
-static void p9_fid_destroy(struct p9_fid *fid)
-{
-   struct p9_client *clnt;
-   unsigned long flags;
-
-   p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid);
-   clnt = fid->clnt;
-   spin_lock_irqsave(&clnt->lock, flags);
-   idr_remove(&clnt->fids, fid->fid);
-   spin_unlock_irqrestore(&clnt->lock, flags);
-   kfree(fid-

Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

2019-01-06 Thread Dominique Martinet
Linus Torvalds wrote on Sat, Jan 05, 2019:
> But I think my patch to just rip out all that page lookup, and just
> base it on the page table state has the fundamental advantage that it
> gets rid of code. Maybe I should jst commit it, and see if anything
> breaks? We do have options in case things break, and then we'd at
> least know who cares (and perhaps a lot more information of _why_ they
> care).

There actually are many tools like fincore which depend on mincore to
try to tell whether a file is "loaded in cache" or not (I personally use
vmtouch[1], but I know of at least nocache[2] uses it as well to only
try to evict used pages)

[1] https://hoytech.com/vmtouch/
[2] https://github.com/Feh/nocache


I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or
prefetch/lock whole files so my "production" use-cases don't actually
rely on the mincore part of them; but when playing with these actions
it's actually fairly useful to be able to visualize which part of a file
ended in cache or monitor how a file's content evolve in cache...

There are various non-obvious behaviours where being able to poke around
is enlightening (e.g. fadvise dontneed is actually a hint, so even if
nothing uses the file linux sometimes keep the data around if it thinks
that would be useful and nocache has a mode to call fadvise multiple
times and things like that...)


Anyway, I agree the use of mincore for this is rather ugly, and
frankly some "cache management API" might be better in the long run if
only for performance reason (don't try these tools on a hundred TB
sparse file...), but until that pipe dream comes true I think mincore as
it was is useful for system admins.



Linus Torvalds wrote on Sun, Jan 06, 2019:
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

FWIW I personally don't care much about "only for owner" or depending on
mmap options; I don't understand much of the security implications
honestly so I'm not sure how these limitations actually help.
On the other hand, a simple CAP_SYS_ADMIN check making the call take
either behaviour should be safe and would cover what I described above.


(by the way, while we are discussing permissions, a regular user can use
fadvise dontneed on files it doesn't own as well as long as it can open
them for reading; I'm not sure if that would need restricting as well in
the context of the security issue. Frankly even with mincore someone
could likely tell the difference through timing, if they just do it a
few times. Do magic, probe, flush out, repeat until satisfied.)


Thanks,
-- 
Dominique Martinet | Asmadeus


Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

2019-01-07 Thread Dominique Martinet
Vlastimil Babka wrote on Mon, Jan 07, 2019:
> On 1/7/19 5:32 AM, Dominique Martinet wrote:
> > Linus Torvalds wrote on Sat, Jan 05, 2019:
> >> But I think my patch to just rip out all that page lookup, and just
> >> base it on the page table state has the fundamental advantage that it
> >> gets rid of code. Maybe I should jst commit it, and see if anything
> >> breaks? We do have options in case things break, and then we'd at
> >> least know who cares (and perhaps a lot more information of _why_ they
> >> care).
> > 
> > There actually are many tools like fincore which depend on mincore to
> > try to tell whether a file is "loaded in cache" or not (I personally use
> > vmtouch[1], but I know of at least nocache[2] uses it as well to only
> > try to evict used pages)
> 
> nocache could probably do fine without mincore. IIUC the point is to not
> evict anything that was already resident prior to running some command
> wrapped in nocache. Without the mincore checks,
> posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that
> others have mapped. That means without mincore() it will drop data
> that's in cache but not currently in use by anybody, which shouldn't
> cause large performance regressions?

Yes, I sent them a patch to allow this behaviour (which got merged ages
ago); but the default nocache usage will try to preserve pages that were
mapped before even if the user mapped it themselves so it's a little bit
more robust that just trusting linux to do the right thing. I did
various tests with fadvise dontneed and honestly the way it works is far
from obvious when operating as a single user at least; I didn't test
multi-users as that didn't really concern me.

With the current mincore change, it will think everything was "in core"
and not flush anything unless my option to just fadvise dontneed
everything is passed though ; so even if we can make it work it is a
change of behaviour that is breaking an existing application, and it has
no way of telling it didn't work.


Honestly though, as I said, mincore() is much more useful for debugging
for me ; the application can be changed if required. I just pointed it
out as it'll need changing, and it has no obvious way of testing at
runtime if the syscall works (except dumb kernel version check, but that
won't work with stable backports); so it's not that obvious.

> > FWIW I personally don't care much about "only for owner" or depending on
> > mmap options; I don't understand much of the security implications
> > honestly so I'm not sure how these limitations actually help.
> > On the other hand, a simple CAP_SYS_ADMIN check making the call take
> > either behaviour should be safe and would cover what I described above.
> 
> So without CAP_SYS_ADMIN, mincore() would return mapping status, and
> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned
> earlier in the thread, and the cache residency flag would require
> CAP_SYS_ADMIN or something similar.

I agree, that's rather clumsy... Or rather might lead to some unexpected
behaviours. I'm open to other ideas.
I'm not sure how the BSD flags help though?

> > (by the way, while we are discussing permissions, a regular user can use
> > fadvise dontneed on files it doesn't own as well as long as it can open
> > them for reading; I'm not sure if that would need restricting as well in
> > the context of the security issue.
> 
> Probably not, as I've mentioned it won't evict what's mapped by somebody
> else. And eviction is also possible via controlling LRU, which is what
> the paper [1] does anyway (and also mentions that DONTNEED doesn't
> work). Being able to evict somebody's page is AFAIU not sufficient for
> attack, the side channel is about knowing that somebody brought that
> page back to RAM by touching it.

Thanks for the link to the paper, I hadn't taken the time to extract it
from the news article but it's much more interesting indeed.
Some of what's been said here makes more sense to me now (checking
about ownership and similar might indeed be good enough).

For evicting page I agree most targets would be different users so I
guess that makes sense as well; there seem to be other ways of
efficiently evicting a page from cache anyway.

(BTW, this was brought up earlier, the paper also mentions
/proc/self/pagemap as not being useful.)


> > Frankly even with mincore someone
> > could likely tell the difference through timing, if they just do it a
> > few times. Do magic, probe, flush out, repeat until satisfied.)

(I 

Re: [PATCH v15 23/26] sched: early boot clock

2019-01-07 Thread Dominique Martinet
Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I could not reproduce the problem. Did you suspend to memory between
> wake ups? Does this time jump happen every time, even if your laptop
> sleeps for a minute?

I'm not sure I understand "suspend to memory between the wake ups".
The full sequence is:
 - start a VM (just in case, I let it boot till the end)
 - suspend to memory (aka systemctl suspend) the host
 - after resuming the host, soft reboot the VM (login through
serial/ssh/whatever and reboot or in the qemu console 'system_reset')

I've just slept exactly one minute and reproduced again with the fedora
stock kernel now (4.19.13-300.fc29.x86_64) in the VM.

Interestingly I'm not getting the same offset between multiple reboots
now despite not suspending again; but if I don't suspend I cannot seem
to get it to give an offset at all (only tried for a few minutes; this
might not be true) ; OTOH I pushed my luck further and even with a five
seconds sleep I'm getting a noticeable offset on first VM reboot after
resume:

[0.00] Hypervisor detected: KVM
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[  179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock
[  179.362163] clocksource: kvm-clock: mask: 0x max_cycles: 
0x1cd42e4dffb, max_idle_ns: 881590591483 ns

Honestly not sure what more information I could give, I'll try on some
other hardware than my laptop (if I can get a server to resume after
suspend through ipmi or wake on lan); but I don't have anything I could
install ubuntu on to try their qemu's version... although I really don't
want to believe that's the difference...

Thanks,
-- 
Dominique


Re: [PATCH v15 23/26] sched: early boot clock

2019-01-07 Thread Dominique Martinet
Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I did exactly the same sequence on Kaby Lake CPU and could not
> reproduce it. What is your host CPU?

skylake consumer laptop CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz

I don't have any kaby lake around; I have access to older servers though...
-- 
Dominique


[GIT PULL] 9p updates for 4.21

2019-01-01 Thread Dominique Martinet
Small pull request this time around with only two commits; some missing
prototype warning fix and a syzkaller fix when a 9p server advertises a
too small msize.

The commit date is close-ish because I reworded a commit message to
add a Cc to stable for the msize fix recently, but the patchs themselves
have been in linux-next since Nov 20~ish


Thanks,

The following changes since commit ccda4af0f4b92f7b4c308d3acc262f4a7e3affad:

  Linux 4.20-rc2 (2018-11-11 17:12:31 -0600)

are available in the Git repository at:

  git://github.com/martinetd/linux tags/9p-for-4.21

for you to fetch changes up to 574d356b7a02c7e1b01a1d9cba8a26b3c2888f45:

  9p/net: put a lower bound on msize (2018-12-25 17:07:49 +0900)


Pull request for inclusion in 4.21

Missing prototype warning fix and a syzkaller fix when a 9p server
advertises a too small msize


Adeodato Simó (1):
  net/9p: include trans_common.h to fix missing prototype warning.

Dominique Martinet (1):
  9p/net: put a lower bound on msize

 net/9p/client.c   | 21 +
 net/9p/trans_common.c |  1 +

-- 
Dominique Martinet


Re: [PATCH v15 23/26] sched: early boot clock

2019-01-03 Thread Dominique Martinet
Pavel Tatashin wrote on Thu, Jan 03, 2019:
> Could you please send the config file and qemu arguments that were
> used to reproduce this problem.

Running qemu by hand, nothing fancy e.g. this works:

# qemu-system-x86_64 -m 1G -smp 4 -drive 
file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm 
-cpu Haswell -device virtio-rng-pci -nographic

(used a specific cpu just in case but normally runnning with cpu host on
a skylake machine; can probably go older)


qemu is fedora 29 blend as is:
$ qemu-system-x86_64 --version
QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers


compressed .config attached to the mail, this can likely be trimmed down
some as well but that takes more time for me..
I didn't rebuild the kernel so not 100% sure (comes from
/proc/config.gz) but it should work on a 4.20-rc2 kernel as written in
the first few lines; 857baa87b64 I referred to in another mail was
merged in 4.19-rc1 so anything past that is probably OK to reproduce...


Re-checked today with these exact options (fresh VM start; then suspend
laptop for a bit, then reboot VM):
[0.00] Hypervisor detected: KVM
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock
[ 2477.907448] clocksource: kvm-clock: mask: 0x max_cycles: 
0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 2477.907450] tsc: Detected 2592.000 MHz processor


As offered previously, happy to help in any way.

Thanks,
-- 
Dominique


config.xz
Description: Binary data


Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Dominique Martinet
Theodore Y. Ts'o wrote on Fri, Dec 28, 2018:
> > The problem is that there is no 32-bit API in some cases
> > (unless I have misunderstood the kernel code) -- not all
> > host architectures implement compat syscalls or allow them
> > to be called from 64-bit processes or implement all the older
> > syscall variants that had smaller offets. If there was a guaranteed
> > "this syscall always exists and always gives me 32-bit offsets"
> > we could use it.
> 
> Are there going to be cases where a process or a thread will sometimes
> want the 64-bit interface, and sometimes want the 32-bit interface?
> Or is it always going to be one or the other?  I wonder if we could
> simply add a new flag to the process personality(2) flags.

That would likely work for qemu user, but the qemu system+9p case is
going to be more painful..
More precisely, the 9p protocol does not plan for anything other than
64bit offset so if the vfs needs to hand out a 32bit offset we'll need
to make a correspondance table between the 32bit offsets we hand off and
the 64bit ones to use; unless some flag can be passed at lopen to tell
the server to always hand out 32bit offsets for this directory... And if
we do that then 9p servers will need a way to use both APIs in parallel
for both types of directories.

(Although I'd rather not have to do either in the first place, keeping
track of all used offsets just in case seems like a waste even if we
only do it for processes in 32bit mode, and a new flag would be a
protocol change with 9p not being designed to catter for subtle protocol
changes so would be rather painful to roll out)


No bright idea here, sorry.
-- 
Dominique Martinet | Asmadeus


Re: [V9fs-developer] [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Dominique Martinet
Theodore Y. Ts'o wrote on Fri, Dec 28, 2018:
> On Sat, Dec 29, 2018 at 03:37:21AM +0100, Dominique Martinet wrote:
> > > Are there going to be cases where a process or a thread will sometimes
> > > want the 64-bit interface, and sometimes want the 32-bit interface?
> > > Or is it always going to be one or the other?  I wonder if we could
> > > simply add a new flag to the process personality(2) flags.
> > 
> > That would likely work for qemu user, but the qemu system+9p case is
> > going to be more painful..
> > More precisely, the 9p protocol does not plan for anything other than
> > 64bit offset so if the vfs needs to hand out a 32bit offset we'll need
> > to make a correspondance table between the 32bit offsets we hand off and
> > the 64bit ones to use; unless some flag can be passed at lopen to tell
> > the server to always hand out 32bit offsets for this directory... And if
> > we do that then 9p servers will need a way to use both APIs in parallel
> > for both types of directories.
> 
> How about if we add a fcntl(2) mediated flag, which is tied to a
> struct file?  Would that be more or less painful for 9p and qemu
> system+9p?

Hmm. 9P2000.L doesn't have anything akin to fcntl either, the only two
obvious places where we could pass a flag is lopen (which already
handles a bunch of linux-specific flags, e.g. passing O_LARGEFILE
O_NOATIME etc will just forward these through for qemu/diod at least),
or adding a new parameter to the 9p readdir.

The former would let us get away without modifying the protocol as
servers will just ignore flags they don't handle on implementations I
checked, so it'd definitely be the least effort choice from what I can
tell.


On the other hand a fcntl would solve the server-side problem, it'd
allow the server to request appropriately-sized offsets per fd, so it's
a good start; we "just" need to figure how to translate that on the wire.

-- 
Dominique Martinet | Asmadeus


Re: BUG: corrupted list in p9_read_work

2018-10-10 Thread Dominique Martinet
Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> How can they be faked?
> If we could create a private rdma/virtio stub instance per test
> process, then we could I think easily use that instance for 9p. But is
> it possible?

"RDMA" itself can be faked pretty easily nowadays, there's a "rxe"
driver that is soft RDMA over ethernet and can run over anything.

The problem is that you can't just give the client a file like trans fd;
you'd need to open an ""rdma socket"" (simplifying wording a bit), and
afaik there is no standard tool for it ; or rather, the problem is that
RDMA is packet based so even if there were you can't just write stuff
in a fd and hope it'll work, so you need a server.

If you're interested, 9p is trivial enough that I could provide you with
a trivial server that works like your file (just need to reimplement
something that parses header to packetize it properly; so you could
write to its stdin for example) ; that'd require some setup in the VM
(configure rxe and install that tool), but it would definitely be
possible.
What do you think ?


For virtio, I'm not as familiar with the environment so I do not know if
there are ways to fake it as easily unfortunately.

> Testing on real hardware is mostly outside of our priorities at the
> moment. I mean syzkaller itself can be run on anything, and one could
> extend descriptions to use a known rdma interface and run on a real
> hardware. But we can't afford this at the moment.

Sure, I understand that.

> As far as I understand RDMA maintainers run syzkaller on real
> hardware, but I don't know if they are up to including 9p into
> testing. +Leon

I'd be interested in knowing what kind of tests runs there :)

-- 
Dominique Martinet


Re: BUG: corrupted list in p9_read_work

2018-10-10 Thread Dominique Martinet
Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> > Back to the current patch, since as I said I am not confident this is a
> > good enough fix for the current bug, will I get notified if the bug
> > happens again once the patch hits linux-next with the Reported-by tag ?
> > (I don't have the setup necessary to run a syz repro as there is no C
> > repro, and won't have much time to do that setup sorry)
> 
> Yes, the bug will be reported again if it still happens after the
> patch is merged (not just into linux-next, but into all tested trees,
> but it does not matter much). So marking bugs as fixed tentatively is
> fine if that's our best guess.

Ok, thanks for confirming...

> But note that syzbot can test fixes itself on request. It boils down
> to just giving it the patch and the base tree:
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

.. and for clarifying that bit, let's try that! :)

#syz test: git://github.com/martinetd/linux 
e4ca13f7d075e551dc158df6af18fb412a1dba0a

-- 
Dominique


Re: BUG: corrupted list in p9_read_work

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > That's still the tricky part, I'm afraid... Making a separate server
> > would have been easy because I could have reused some of my junk for the
> > actual connection handling (some rdma helper library I wrote ages
> > ago[1]), but if you're going to just embed C code you'll probably want
> > something lower level? I've never seen syzkaller use any library call
> > but I'm not even sure I would know how to create a qp without
> > libibverbs, would standard stuff be OK ?
> 
> Raw syscalls preferably.
> What does 'rxe_cfg start ens3' do on syscall level? Some netlink?

modprobe rdma_rxe (and a bunch of other rdma modules before that) then
writes the interface name in /sys/module/rdma_rxe/parameters/add
apparently; then checks it worked.
this part could be done in C directly without too much trouble, but as
long as the proper kernel configuration/modules are available

> Any libraries and utilities are hell pain in linux world. Will it work
> in Android userspace? gVisor? Who will explain all syzkaller users
> where they get this for their who-knows-what distro, which is 10 years
> old because of corp policies, and debug how their version of the
> library has a slightly incompatible version?
> For example, after figuring out that rxe_cfg actually comes from
> rdma-core (which is a separate delight on linux), my debian
> destribution failed to install it because of some conflicts around
> /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
> such package. And we've just started :)

The rdma ecosystem is a pain, I'll easily agree with that...

> Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
> ok, that's it. If it works, great, we can use it.

I'll have to look into it a bit more; libibverbs abstracts a lot of
stuff into per-nic userspace drivers (the files I cited in a previous
mail) and basically with the mellanox cards I'm familiar with the whole
user session looks like this:
 * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and
/dev/infiniband/uverbs0 (plus a bunch of files to figure out abi
version, what user driver to load etc) 
 * it and the userspace driver issue "commands" over these two files' fd
to setup the connection ; some commands are standard but some are
specific to the interface and defined in the driver.
There are many facets to a connection in RDMA: a protection domain used
to register memory with the nic, a queue pair that is the actual tx/rx
connection, optionally a completion channel that will be another fd to
listen on for events that tell you something happened and finally some
memory regions to directly communicate with the nic from userspace
depending on the specific driver.
 * then there's the actual usage, more commands through the uverbs0 char
device to register the memory you'll use, and once that's done it's
entierly up to the driver - for example the mellanox lib can do
everything in userspace playing with the memory regions it registered,
but I'd wager the rxe driver does more calls through the uverbs0 fd...

Honestly I'm not keen on reimplementing all of this; the interface
itself pretty much depends on your version of the kernel (there is a
common ABI defined, but as far as specific nics are concerned if your
kernel module doesn't match the user library version you can get some
nasty surprises), and it's far from the black or white of a good ol'
ENOSUPP error.


I'll look if I can figure out if there is a common subset of verbs
commands that are standard and sufficient to setup a listening
connection and exchange data that should be supported for all devices
and would let us reimplement just that, but while I hear your point
about android and ten years in the future I think it's more likely than
ten years in the future the verb abi will have changed but libibverbs
will just have the new version implemented and hide the change :P

-- 
Dominique


9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> But again we don't need to support all of the available hardware.

I agree with that, I just have no idea what the "librxe-rdmav16.so" lib
could be doing and described something I am slightly more familiar with
(e.g. libmlx5)
I talked about a common subset of the verb abi because I didn't want to
look into what it's doing, but if it's not enough there's always that
possibility.


> For example, we are testing net stack from external side using tun.
> tun is a very simple, virtual abstraction of a network card. It allows
> us to test all of generic net stack starting from L2 without messing
> with any real drivers and their differences entirely. I had impression
> that we are talking about something similar here too. Or not?

That sounds about right, rxe is a software implementation that should
work on most network interfaces ; at least from what I tried it worked
on a VM's virtio net down to my laptop's wifi interface so it's a good
start... I'm not saying all because I just tried a dummy interface and
that returned EINVAL.
The only point I disagree is the 'very simple', even getting that to
work will be a far cry from a socket() call... :)


> Also I am a bit missing context about rdma<->9p interface. Do we need
> to setup all these ring buffers to satisfy the parts that 9p needs? Is
> it that 9p actually reads data directly from these ring buffers? Or
> there is some higher-level rdma interface that 9p uses?

It needs an "RDMA_PS_TCP" connection established, that requires
everything I described unfortunately...
Once that's established we need to register some memory to the driver
and post some recv buffers (even if we won't read it, the client would
get errors if we aren't ready to receive anything - at least it does
with real hardware), and also use some registered memory to send data.

Thinking back though I think that my server implementation isn't very
far from the raw level in what I'm doing, I recall libibverbs fallback
implementation (e.g. if the driver lib doesn't implement it otherwise)
of the functions I looked at like ibv_post_send to mostly be just
serializing the arguments, slapping the command from an enum in front of
it and sending it to the kernel, so it might be enough to just
reimplement that shim in or figure a way to generate the binary commands
once and then use these values; now I'm comparing two runs of strace of
my test server I definitely see a pattern.

I'll give it a try but don't expect something fast, and it's probably
not going to be very pretty either...

To give a concrete example, here are all the read/write/fcntl calls
looking just at /dev/infiniband in a hello world program that just
establishes connection (server side), receive and send two messages and
quits:


This part apparently sets up the listening connection of the server:

1430  1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", 
O_RDWR|O_CLOEXEC) = 3
1430  1539262699.126155 write(3, 
"\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 
32) = 32
1430  1539262699.126192 write(3, 
"\24\0\0\0\210\0\0\0\0\0\0\\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
 144) = 144
1430  1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 
\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 
\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430  1539262699.126274 write(3, 
"\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126303 close(3)= 0
1430  1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", 
O_RDWR|O_CLOEXEC) = 3
1430  1539262699.126429 write(3, 
"\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0",
 32) = 32
1430  1539262699.126472 write(3, 
"\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
 144) = 144
1430  1539262699.126501 write(3, 
"\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126534 write(3, 
"\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430  1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16
1430  1539262699.127149 write(3, 
"\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1430  1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE 


Then the client connects (had some epoll on read on fd 3, but no read?!)

1446  1539262706.

9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > Now we are talking!
> > We generally assume that all modules are simply compiled into kernel.
> > At least that's we have on syzbot. If somebody can't compile them in,
> > we can suggest to add modprobe into init.
> > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.
> 
> This fails for me:
> 
> root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add
> [20992.905406] rdma_rxe: interface syz1 not found
> bash: echo: write error: Invalid argument

Works here, I just did:

[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add 

dmesg says:
[   35.595534] rdma_rxe: set rxe0 active
[   35.595541] rdma_rxe: added rxe0 to ens3


Actually for a dummy interface if I try the echo directly the echo
works, and a verb device is created, and I just confirmed I can use
it... so not sure why rxe_cfg said EINVAL earlier...

[root@f2 ~]# ip link add dummy0 type dummy
[root@f2 ~]# ip link set dummy0 up
[root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0
[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add 


(then using my test client:
[root@f2 src]# ./rcat -s
INFO:  trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST
INFO:  trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
INFO:  trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT...

[root@f2 src]# ./rcat -c 10.1.1.1
INFO:  trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
^C
)


I assume your syz1 interface is a tap device as you were saying earlier?
Got anything in dmesg?

-- 
Dominique


  1   2   3   >