Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function
Hi, David Miller wrote on Wed, Jul 24, 2013 : > From: Paul Bolle > Date: Thu, 25 Jul 2013 01:09:47 +0200 > > After this patch one might as well revert the rest of commit > > 80b45261a0b2 ("9P: Add cancelled() to the transport functions.") too. It > > seems the entire cancelled callback stuff is now pointless. > > > > As I already asked in https://lkml.org/lkml/2013/7/15/87 : did that > > commit "forget to actually hook up rdma_cancelled() into > > p9_rdma_trans()?". It does look so to me. > > If nobody responds to this in the next day or so, feel free to send me > a patch which rips it all out. > > No response means they don't care, and neither, therefore, should we. Well, I do care - but I couldn't find where the trans->cancelled member function was supposed to be called anyway... So adding it to the struct and fixing the warning is well and fine, but if it's still never called in the end I don't see much point and there's nothing to test. It's on my lng todo list of things to look at, but as no other reply came and the patch was never picked up it kind of dropped in priority. I'll have another look next week I guess the worst that could happen would be submitting the rdma_cancelled function again when this is sorted out :) Regards, -- Dominique Martinet -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function
Hard morning, sorry for the double mail. Dominique Martinet wrote on Thu, Jul 25, 2013 : > Well, I do care - but I couldn't find where the trans->cancelled member > function was supposed to be called anyway... > So adding it to the struct and fixing the warning is well and fine, but > if it's still never called in the end I don't see much point and there's > nothing to test. To be more precise, there's a single call to c->trans_mode->cancelled in net/9p/client.c, in p9_client_flush, which is called on subfunction returning -ERESTARTSYS... which never happens as far as I could see. This will be useful once/if we start working on client recovery, though - so the function in itself definitely does interest me, and I guess that thinking about I would have preferred to have the hook added rather than the function removed. But there definitely is no hurry to add this cancelled function till then. Regards, -- Dominique Martinet -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function
Eric Van Hensbergen wrote on Thu, Jul 25, 2013 : > So, the cancel function should be used to flush any pending requests that > haven't actually been sent yet. Looking at the 9p RDMA code, it looks like > the thought was that this wasn't going to be possible. Regardless of > removing unsent requests, the flush will still be sent and if the server > processes it before the original request and sends a flush response back > then we need to clear the posted buffer. This is what rdma_cancelled is > supposed to be doing. So, the fix is to hook it into the structure -- but > looking at the code it seems like we probably need to do something more to > reclaim the buffer rather than just incrementing a counter. > > To be clear this has less to do with recovery and more to do with the > proper implementation of 9p flush semantics. By and large, those semantics > won't impact static file system users -- but if anyone is using the > transport to access synthetic filesystems or files then they'll definitely > want to have a properly implemented flush setup. The way to test this is > to get a blocking read on a remote named pipe or fifo and then ^C it. Ok, I knew about the concept of flush but didn't think a ^C would cause a -ESYSRESTART, so didn't think of that. That said, reading from, say, a fifo is an entierly local operation: the client does a walk, getattr, doesn't do anything 9p-wise, and clunks when it's done with it. As for the function needing a bit more work, there's a race, but on "normal" requests I think it is about right - the answer lays in a comment in rdma_request: /* When an error occurs between posting the recv and the send, * there will be a receive context posted without a pending request. * Since there is no way to "un-post" it, we remember it and skip * post_recv() for the next request. * So here, * see if we are this `next request' and need to absorb an excess rc. * If yes, then drop and free our own, and do not recv_post(). **/ Basically, receive buffers are sent in a queue, and we can't "retrieve" it back, so we just don't sent next one. There is one problem though - if the server handles the original request before getting the flush, the receive buffer will be consumed and we won't send a new one, so we'll starve the reception queue. I'm afraid I don't have any bright idea there... While we are on reception buffer issues, there is another problem with the queue of receive buffers, even without flush, in the following scenario: - post a buffer for tag 0, on a hanging request - post a buffer for tag 1 - reply for tag 1 will come on buffer from tag 0 - post another request with tag 1.. the buffer already is in the queue, and we don't know we can post the buffer associated with tag 0 back. I haven't found how to reproduce this perfectly yet, but a dd with blocksize 1MB and one with blocksize 10B in parallel brought the mountpoint down (and the whole server was completely unavailable for the duration of the dd - TCP sessions timed out, I even got IO errors on the local disk :D) Regards, -- Dominique Martinet -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function
I think I need to stop sending mails before triple-checking things! So sorry for the multiple mails again. Dominique Martinet wrote on Thu, Jul 25, 2013 : > [rdma_cancelled] > There is one problem though - if the server handles the original request > before getting the flush, the receive buffer will be consumed and we > won't send a new one, so we'll starve the reception queue. > I'm afraid I don't have any bright idea there... This still looks correct to me. > While we are on reception buffer issues, there is another problem with > the queue of receive buffers, even without flush, in the following > scenario: > - post a buffer for tag 0, on a hanging request > - post a buffer for tag 1 > - reply for tag 1 will come on buffer from tag 0 > - post another request with tag 1.. the buffer already is in the queue, > and we don't know we can post the buffer associated with tag 0 back. It actually looks like the reply buffers are swapped properly - taken out of the req struct into the context on send, then given back to the appropriate req on reception, so on normal operation there's no problem with what I described - sorry for crying wolf. > I haven't found how to reproduce this perfectly yet, but a dd with > blocksize 1MB and one with blocksize 10B in parallel brought the > mountpoint down (and the whole server was completely unavailable for the > duration of the dd - TCP sessions timed out, I even got IO errors on the > local disk :D) I need to run more tests to explain what happens with the two dds, but it's easily reproductible with debugs on, I guess that helps with a race somewhere. Regards, -- Dominique Martinet -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V9fs-developer] [PATCH] net: trans_rdma: remove unused function
Eric Van Hensbergen wrote on Fri, Jul 26, 2013 : > It depends on the flags you use with 9p. If you mount with nodevmap it'll > treat them like normal files. Alternatively you can use synthetics from > fuse or even 9p file systems from Plan 9 port. Ok, thank you for the explanation. Now I need to see if my server supports this :) > I need to go back and refresh more of my RDMA knowledge, I thought you > could provide a better more direct coupling between the request and > response buffers (similar to what we do in virtio). In such a case, you'd > simply need to reclaim the "skipped" buffer not shuffle everything around. > Ultimately that would seem to work against the point of having "zero copy" > RDMA. What I said next was wrong, but the queue really isn't something you can control this finely. Once something is posted, you can't take it back, and you can't know which buffer will come back for which reply (but it's easy enough to read the tag from the reply and associate it back) For flush, I'm honestly not sure on how to deal with the possibility of the servers processing the original query and sending a reply after we've tagged our own buffer as free again, but I think there might actually be a problem even on sucessful flush if the next post is from a different tag (that has its own rc buffer) -- as far as I understand it, this one's reply will be considered duplicate reply in handle_recv. But I really need to do some more precise testing, I'll try playing with nodevmap, if I get the flush functions to be called I'll get this fixed. > > I haven't found how to reproduce this perfectly yet, but a dd with > > blocksize 1MB and one with blocksize 10B in parallel brought the > > mountpoint down (and the whole server was completely unavailable for the > > duration of the dd - TCP sessions timed out, I even got IO errors on the > > local disk :D) > > > > Sounds like you ran out of memory and things didn't degrade gracefully. I think it's actually "just" in the debug, there are two p9_debug printk in p9_client_cb which are called from an rdma irq handler, and it gets messy once the kernel buffer is full and it wants to flush to the console. I've got a short patch that adds an argument to p9_client_cb with an indication that it shouldn't output messages, I'll send it to the v9fs mailing list for comments if you think the idea is acceptable. > Which server are you using? I know of two servers out there which handle 9P/RDMA - diod ( http://code.google.com/p/diod/ ) and nfs-ganesha ( https://github.com/nfs-ganesha/nfs-ganesha ) I wrote the RDMA implementation for ganesha and am mostly running tests on that, but both should somewhat work with their quirks :) Regards, -- Dominique Martine -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V9fs-developer] [PATCH] 9p: trans_fd, initialize recv fcall properly if not set
Eric Van Hensbergen wrote on Sat, Sep 05, 2015: > On Thu, Sep 3, 2015 at 4:38 AM, Dominique Martinet > wrote: > > To be honest, I think it might be better to just bail out if we get in > > this switch (m->req->rc == NULL after p9_tag_lookup) and not try to > > allocate more, because if we get there it's likely a race condition and > > silently re-allocating will end up in more troubles than trying to > > recover is worth. > > Thoughts ? > > > > Hmmm...trying to rattle my brain and remember why I put it in there > back in 2008. > It might have just been over-defensive programming -- or more likely it just > pre-dated all the zero copy infrastructure which pretty much guaranteed we had > an rc allocated and what is there is vestigial. I'm happy to accept a > patch which > makes this an assert, or perhaps just resets the connection because something > has gone horribly wrong (similar to the ENOMEM path that is there now). Yeah, it looks like the safety comes from the zero-copy stuff that came much later. Let's go with resetting the connection then. Hmm. EIO is a bit too generic so would be good to avoid that if possible, but can't think of anything better... Speaking of zero-copy, I believe it should be fairly straight-forward to implement for trans_fd now I've actually looked at it, since we do the payload read after a p9_tag_lookup, would just need m->req to point to a zc buffer. Write is similar, if there's a zc buffer just send it after the header. The cost is a couple more pointers in req and an extra if in both workers, that seems pretty reasonable. Well, I'm not using trans_fd much here (and unfortunately zero-copy isn't possible at all given the transport protocol for RDMA, at least for recv), but if anyone cares it probably could be done without too much hassle for the fd workers. -- Dominique -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] 9p: trans_fd, bail out if recv fcall if missing
req->rc is pre-allocated early on with p9_tag_alloc and shouldn't be missing Signed-off-by: Dominique Martinet --- net/9p/trans_fd.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) Feel free to adapt error code/message if you can think of something better. diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index a270dcc..a6d89c0 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -356,13 +356,12 @@ static void p9_read_work(struct work_struct *work) } if (m->req->rc == NULL) { - m->req->rc = kmalloc(sizeof(struct p9_fcall) + - m->client->msize, GFP_NOFS); - if (!m->req->rc) { - m->req = NULL; - err = -ENOMEM; - goto error; - } + p9_debug(P9_DEBUG_ERROR, +"No recv fcall for tag %d (req %p), disconnecting!\n", +m->rc.tag, m->req); + m->req = NULL; + err = -EIO; + goto error; } m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall); memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] 9p: trans_fd, read rework to use p9_parse_header
Most of the changes here are no-op and just renaming to use a fcall struct, needed for p9_parse_header It fixes the unaligned memory access to read the tag and defers to common functions for part of the protocol knowledge (although header length is still hard-coded...) Reported-By: Rob Landley Signed-Off-By: Dominique Martinet --- net/9p/trans_fd.c | 75 +-- 1 file changed, 40 insertions(+), 35 deletions(-) It ended up alot bigger than I thought it'd be, submitting it anyway but happy with either version - letting Eric decide what's better :) diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index bced8c0..a270dcc 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -108,9 +108,7 @@ struct p9_poll_wait { * @unsent_req_list: accounting for requests that haven't been sent * @req: current request being processed (if any) * @tmp_buf: temporary buffer to read in header - * @rsize: amount to read for current frame - * @rpos: read position in current frame - * @rbuf: current read buffer + * @rc: temporary fcall for reading current frame * @wpos: write position for current frame * @wsize: amount of data to write for current frame * @wbuf: current write buffer @@ -131,9 +129,7 @@ struct p9_conn { struct list_head unsent_req_list; struct p9_req_t *req; char tmp_buf[7]; - int rsize; - int rpos; - char *rbuf; + struct p9_fcall rc; int wpos; int wsize; char *wbuf; @@ -305,49 +301,56 @@ static void p9_read_work(struct work_struct *work) if (m->err < 0) return; - p9_debug(P9_DEBUG_TRANS, "start mux %p pos %d\n", m, m->rpos); + p9_debug(P9_DEBUG_TRANS, "start mux %p pos %zd\n", m, m->rc.offset); - if (!m->rbuf) { - m->rbuf = m->tmp_buf; - m->rpos = 0; - m->rsize = 7; /* start by reading header */ + if (!m->rc.sdata) { + m->rc.sdata = m->tmp_buf; + m->rc.offset = 0; + m->rc.capacity = 7; /* start by reading header */ } clear_bit(Rpending, &m->wsched); - p9_debug(P9_DEBUG_TRANS, "read mux %p pos %d size: %d = %d\n", -m, m->rpos, m->rsize, m->rsize-m->rpos); - err = p9_fd_read(m->client, m->rbuf + m->rpos, - m->rsize - m->rpos); + p9_debug(P9_DEBUG_TRANS, "read mux %p pos %zd size: %zd = %zd\n", +m, m->rc.offset, m->rc.capacity, +m->rc.capacity - m->rc.offset); + err = p9_fd_read(m->client, m->rc.sdata + m->rc.offset, +m->rc.capacity - m->rc.offset); p9_debug(P9_DEBUG_TRANS, "mux %p got %d bytes\n", m, err); - if (err == -EAGAIN) { + if (err == -EAGAIN) goto end_clear; - } if (err <= 0) goto error; - m->rpos += err; + m->rc.offset += err; - if ((!m->req) && (m->rpos == m->rsize)) { /* header read in */ - u16 tag; + /* header read in */ + if ((!m->req) && (m->rc.offset == m->rc.capacity)) { p9_debug(P9_DEBUG_TRANS, "got new header\n"); - n = le32_to_cpu(*(__le32 *) m->rbuf); /* read packet size */ - if (n >= m->client->msize) { + err = p9_parse_header(&m->rc, NULL, NULL, NULL, 0); + if (err) { p9_debug(P9_DEBUG_ERROR, -"requested packet size too big: %d\n", n); +"error parsing header: %d\n", err); + goto error; + } + + if (m->rc.size >= m->client->msize) { + p9_debug(P9_DEBUG_ERROR, +"requested packet size too big: %d\n", +m->rc.size); err = -EIO; goto error; } - tag = le16_to_cpu(*(__le16 *) (m->rbuf+5)); /* read tag */ p9_debug(P9_DEBUG_TRANS, -"mux %p pkt: size: %d bytes tag: %d\n", m, n, tag); +"mux %p pkt: size: %d bytes tag: %d\n", +m, m->rc.size, m->rc.tag); - m->req = p9_tag_lookup(m->client, tag); + m->req = p9_tag_lookup(m->client, m->rc.tag); if (!m->req || (m->req->status != REQ_STATUS_SENT)) { p9_debug(P9_DEBUG_ERROR, "Unexpected packet tag %d\n", -tag); +
[PATCH] 9p: trans_fd, initialize recv fcall properly if not set
That code really should never be called (rc is allocated in tag_alloc), but if it had been it couldn't have worked... Signed-off-by: Dominique Martinet --- net/9p/trans_fd.c | 3 +++ 1 file changed, 3 insertions(+) To be honest, I think it might be better to just bail out if we get in this switch (m->req->rc == NULL after p9_tag_lookup) and not try to allocate more, because if we get there it's likely a race condition and silently re-allocating will end up in more troubles than trying to recover is worth. Thoughts ? diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index a270dcc..0d9831a 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -363,6 +363,9 @@ static void p9_read_work(struct work_struct *work) err = -ENOMEM; goto error; } + m->req->rc.capacity = m->client->msize; + m->req->rc.sdata = (char*)m->req->rc + + sizeof(struct p9_fcall); } m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall); memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/2] net/9p: embed fcall in req to round down buffer allocs
Greg Kurz wrote on Thu, Aug 02, 2018: > > @@ -443,9 +444,9 @@ static int rdma_request(struct p9_client *client, > > struct p9_req_t *req) > > **/ > > if (unlikely(atomic_read(&rdma->excess_rc) > 0)) { > > if ((atomic_sub_return(1, &rdma->excess_rc) >= 0)) { > > - /* Got one ! */ > > - kfree(req->rc); > > - req->rc = NULL; > > + /* Got one! */ > > + kfree(req->rc.sdata); > > Shouldn't this be p9_fcall_fini(&req->rc) ? Right, I failed at bookkeeping, I changed that in the next patch but it should have been done now. Will add p9_fcall_fini to headers/export in this patch instead of the next > The rest looks good, so, with that fixed, you can add: > > Reviewed-by: Greg Kurz Thanks! -- Dominique
Re: [V9fs-developer] INFO: task hung in grab_super
Dmitry Vyukov via V9fs-developer wrote on Wed, Jul 18, 2018: > >> Btw, I see that p9_client_rpc uses wait_event_killable, why wasn't it > >> killed along with the whole process? > >> > > > > wait_event_killable() would return -ERESTARTSYS if got SIGKILL. > > But if (c->status == Connected) && (type == P9_TFLUSH) is also true, > > it ignores SIGKILL by retrying the loop... > > > > again: > > err = wait_event_killable(*req->wq, req->status >= REQ_STATUS_RCVD); > > if ((err == -ERESTARTSYS) && (c->status == Connected) && (type == > > P9_TFLUSH)) { > > sigpending = 1; > > clear_thread_flag(TIF_SIGPENDING); > > goto again; > > } > > > > I wish they don't ignore SIGKILL (by e.g. offloading operations to a kernel > > thread). > > > I guess that's the problem, right? SIGKILL-ed task must not ignore > SIGKILL and hang in infinite loop. This would explain a bunch of hangs > in 9p. Tricky with the current way we handle this, as the normal action if wait_event_killable is interrupted is to send a tflush message (which is what you could also notice, if you just send one sigkill it'll just send a flush message and wait for that instead) There's work in progress to add refcounting to requests which would make us one step closer to being able to not wait for the flush reply (or rather, it'll give us the ability to wait for it asynchronously) ; we should be able to get rid of that loop after that. -- Dominique
Re: [V9fs-developer] [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag
piaojun wrote on Fri, Aug 03, 2018: > We'd better reach an agreement about the patch fix. The way I read Greg's comment was that he agreed to the proposed changes and is waiting for a new version. I'm writing a longer reply than I should because I don't like people saying strncmp is safe just because it's strncmp, feel free to skim through the rest as it's just ranting. > In my opinion, replacing strlen(chan->tag) with a local variable > sounds reasonable, So we agree here > and changing strncmp to strcmp may be little beneficial, as strcmp is more > dangerours such as buffer-flow. strcmp is more dangerous for buffer-overflow if you're comparing "unsafe" non-null terminated strings. This isn't the case here as you've constructed chan->tag yourself and you rely on it being null-terminated by calling strlen() on it yourself. strncmp(x, y, strlen(y)+1) is at best awkward, but it's a false sense of security if you think this is any better than strcmp here. It implies that: - y is null-terminated (for strlen() to work) - x is either null-terminated or longer than y Here, x is the devname argument to p9_virtio_create, which comes straight from the mount syscall with "copy_mount_string", using strndup_user, which returns a null-terminated string or an error. (the code is currently not safe if it returns an error, I'm sending another mail about it right after this one as we already have a partial fix) strcmp(x, y) on the other hand assumes that x and y are null-terminated in this case, which is the same assumptions you have, so is strictly as "safe" as strncmp used that way. (it could also assume that one is null terminated and the other one is at least as long as that, e.g. when comparing a "char buf[42]" to the constant "foo" then we don't care if buf is null-terminated) Thanks, -- Dominique
Re: [V9fs-developer] [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag
Dominique Martinet wrote on Fri, Aug 03, 2018: > (the code is currently not safe if it returns an error, I'm sending > another mail about it right after this one as we already have a partial > fix) I take that one back, ksys_mount() does check for error properly so just the null checks we have in Tomas' patch[1] are enough ; sorry for the double-mail [1] http://lkml.kernel.org/r/20180727110558.5479-1-tomasbort...@gmail.com -- Dominique
Re: [PATCH v3] net/9p/trans_virtio.c: add null terminal for mount tag
piaojun wrote on Fri, Aug 03, 2018: > chan->tag is Non-null terminated which will result in printing messy code > when debugging code. So we should add '\0' for tag to make the code more > convenient and robust. In addition, I drop char->tag_len to simplify the > code. LGTM, I'll pick this up. Thanks for the rework > Signed-off-by: Jun Piao > --- > net/9p/trans_virtio.c | 17 +++-- > 1 file changed, 7 insertions(+), 10 deletions(-) > > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c > index d422bfc..46a5ab2 100644 > --- a/net/9p/trans_virtio.c > +++ b/net/9p/trans_virtio.c > @@ -89,10 +89,8 @@ struct virtio_chan { > unsigned long p9_max_pages; > /* Scatterlist: can be too big for stack. */ > struct scatterlist sg[VIRTQUEUE_NUM]; > - > - int tag_len; > /* > - * tag name to identify a mount Non-null terminated > + * tag name to identify a mount null terminated >*/ > char *tag; > > @@ -525,14 +523,15 @@ static ssize_t p9_mount_tag_show(struct device *dev, > { > struct virtio_chan *chan; > struct virtio_device *vdev; > + int tag_len; > > vdev = dev_to_virtio(dev); > chan = vdev->priv; > + tag_len = strlen(chan->tag); > > - memcpy(buf, chan->tag, chan->tag_len); > - buf[chan->tag_len] = 0; > + memcpy(buf, chan->tag, tag_len + 1); > > - return chan->tag_len + 1; > + return tag_len + 1; > } > > static DEVICE_ATTR(mount_tag, 0444, p9_mount_tag_show, NULL); > @@ -585,7 +584,7 @@ static int p9_virtio_probe(struct virtio_device *vdev) > err = -EINVAL; > goto out_free_vq; > } > - tag = kmalloc(tag_len, GFP_KERNEL); > + tag = kzalloc(tag_len + 1, GFP_KERNEL); > if (!tag) { > err = -ENOMEM; > goto out_free_vq; > @@ -594,7 +593,6 @@ static int p9_virtio_probe(struct virtio_device *vdev) > virtio_cread_bytes(vdev, offsetof(struct virtio_9p_config, tag), > tag, tag_len); > chan->tag = tag; > - chan->tag_len = tag_len; > err = sysfs_create_file(&(vdev->dev.kobj), &dev_attr_mount_tag.attr); > if (err) { > goto out_free_tag; > @@ -654,8 +652,7 @@ static int p9_virtio_probe(struct virtio_device *vdev) > > mutex_lock(&virtio_9p_lock); > list_for_each_entry(chan, &virtio_chan_list, chan_list) { > - if (!strncmp(devname, chan->tag, chan->tag_len) && > - strlen(devname) == chan->tag_len) { > + if (!strcmp(devname, chan->tag)) { > if (!chan->inuse) { > chan->inuse = true; > found = 1; > --
Re: KASAN: invalid-free in p9stat_free
syzbot wrote on Sun, Aug 26, 2018: > HEAD commit:e27bc174c9c6 Add linux-next specific files for 20180824 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=15dc19a640 > kernel config: https://syzkaller.appspot.com/x/.config?x=28446088176757ea > dashboard link: https://syzkaller.appspot.com/bug?extid=d4252148d198410b864f > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f8efba40 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1178256a40 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+d4252148d198410b8...@syzkaller.appspotmail.com > > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > == > BUG: KASAN: double-free or invalid-free in p9stat_free+0x35/0x100 > net/9p/protocol.c:48 That looks straight-forward enough, p9pdu_vreadf does p9stat_free on error then v9fs_dir_readdir does the same ; there is nothing else that could return an error without going through the first free so we could just remove the later one... There are a couple other users of the 'S' pdu read (that reads the stat struct and frees it on error), so it's probably best to keep the current behaviour as far as this is concerned, what we could do though is make the free function idempotent (write NULLs in the freed fields), but I do not see this being done often, do you know what the policy is about this kind of pattern nowadays? The struct is cleanly zeroed before being read so there is no risk of double-frees between iterations so zeroing pointers is not strictly required, but it does make things safer in general. -- Dominique Martinet
KASAN error in [PATCH v3 7/8] proc/kcore: optimize multiple page reads
in the first case below: if (&m->list == &kclist_head) { meaning no address matched in the list, so you cannot check m->addr and m->size in this case -- I'm afraid you will have to run through the list just in case if that happens even if there likely won't be any match for the next address either. > + list_for_each_entry(m, &kclist_head, list) { > + if (start >= m->addr && > + start < m->addr + m->size) > + break; > + } > } > > if (&m->list == &kclist_head) { -- Dominique Martinet
[PATCH] proc/kcore: fix invalid memory access in multi-page read optimization
The 'm' kcore_list item can point to kclist_head, and it is incorrect to look at m->addr / m->size in this case. There is no choice but to run through the list of entries for every address if we did not find any entry in the previous iteration Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads") Signed-off-by: Dominique Martinet --- I guess now I'm looking at bf991c2231117 again that it would be slightly more efficient to remove the !m check and initialize m to point to kclist_head like this: m = list_entry(&kclist_head, struct kcore_list, list); but it feels a bit forced to me; deferring the choice to others. fs/proc/kcore.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index ad72261ee3fe..50036f6e1f52 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -451,7 +451,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) * If this is the first iteration or the address is not within * the previous entry, search for a matching entry. */ - if (!m || start < m->addr || start >= m->addr + m->size) { + if (!m || &m->list == &kclist_head || start < m->addr || + start >= m->addr + m->size) { list_for_each_entry(m, &kclist_head, list) { if (start >= m->addr && start < m->addr + m->size) -- 2.17.1
Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it
piaojun wrote on Wed, Aug 08, 2018: > I found that 9pnet_virtio.ko could not be removed by rmmod command, and I > could still found it by lsmod. The reason is that we forgot decrease the > refcount of 9p virtio device by kobject_put. So we should put refcount in > p9_virtio_remove. Hmm, I cannot seem to reproduce this. Can you give more details on how to get into stuck state? I tried mounting something, accessing the sysfs files, etc to no avail. lsmod gives a counter to how many references there are to the module, you can use that to debug a bit. For example here I get this line just after loading the module: 9pnet_virtio 32768 0 Then after mounting something there is one reference: 9pnet_virtio 32768 1 Then unmounting puts that back to 0 and 'modprobe -r' (or rmmod) works. I dislike saying the next part but I think form also is important, please bear with me: - shorter subject line, please. For example, you can lose 20 characters by reodering words so there is no need for pronouns "net/9p/virtio: decrease 9p virtio device refcount on removal" - I personally dislike commit messages that are "novelized" (that is, put yourself as an actor and describe what you were doing) That seems to be somewhat accepted as looking at the kernel's git log I see some (few) commits using "I ..." that are not pull request messages but if possible please avoid this style and try to describe facts, how things are wrong, what got fixed and if required how. To give an example again, this says the same thing: "The 9pnet_virtio module could not be unloaded because we forgot to decrease the refcount of the 9p virtio device with kobject_put. Put the reference in 9p_virtio_remove" Thanks, -- Dominique
Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it
piaojun wrote on Wed, Aug 08, 2018: > I try to remove 9pnet_virtio.ko by 'rmmod 9pnet_virtio' as I want to > replace it without rebooting system. I do that all the time when testing, it works for me. What exact kernel commit are you running? > Here I have not mount 9pfs yet, so the refcount is still 0. > > Before rmmod: > # lsmod | grep 9p > 9pnet_virtio 20480 0 > 9pnet 106496 1 9pnet_virtio > virtio_ring28672 5 > virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net > virtio 16384 5 > virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net > > After rmmod: > # lsmod | grep 9p > 9pnet_virtio 20480 0 > 9pnet 106496 1 9pnet_virtio > virtio_ring28672 5 > virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net > virtio 16384 5 > virtio_scsi,9pnet_virtio,virtio_pci,virtio_blk,virtio_net > > Normally 9pnet_virtio should be invisible after rmmod like this: > # lsmod | grep 9p > 9pnet 106496 0 Right, that obviously didn't work... But on the other hand, if I apply your commit and load/unload 9pnet_virtio 5-10 times (I ran it in a loop) I get KASAN errors because we put too many of these refs ; that doesn't happen without your patch so it's apparently wrong. I'm curious how that could make modprobe work better for you as well, it shouldn't depend on that... Maybe `modprobe -r` might give a better error, or something in dmesg? -- Dominique
Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it
piaojun wrote on Thu, Aug 09, 2018: > > What exact kernel commit are you running? > > My kernel commit id 6edf1d4cb0acde, and I replace the 9p code with > 9p-next. And I wonder if this will work well? That is somewhere on top of 4.18-rc1 and got merged in 4.18-rc4, which are close enough so while I can question the practice I don't see why not. I've just tried the following: $ git checkout 6edf1d4cb0acde $ git checkout martinetd/9p-next net/9p fs/9p include/net/9p (martinetd/9p-next is 9f961802a7 as of this mail) $ uname -r 4.18.0-rc1+ $ lsmod | grep -E '^9pnet_virtio' || echo "not loaded" 9pnet_virtio 32768 0 $ sudo modprobe -r 9pnet_virtio $ lsmod | grep -E '^9pnet_virtio' || echo "not loaded" not loaded $ sudo modprobe 9pnet_virtio $ sudo mount -t 9p -o debug=1,trans=virtio shm /mnt $ ls /mnt $ cat /sys/module/9pnet_virtio/drivers/virtio\:9pnet_virtio/*/mount_tag tmpshm (these could use a new line...) $ sudo umount /mnt $ sudo modprobe -r 9pnet_virtio $ lsmod | grep -E '^9pnet_virtio' || echo "not loaded" not loaded The /sys/devices/pci*/*/virtio*/mount_tag files are also removed properly; I don't see any problem. Not being able to reproduce is fine in general, but I also get problems when applying the patch and unloading the module multiple times so I can't help but question this patch and think your problem lies somewhere else. -- Dominique
[PATCH v3 2/2] net/9p: add a per-client fcall kmem_cache
From: Dominique Martinet Having a specific cache for the fcall allocations helps speed up allocations a bit, especially in case of non-"round" msizes. The caches will automatically be merged if there are multiple caches of items with the same size so we do not need to try to share a cache between different clients of the same size. Since the msize is negotiated with the server, only allocate the cache after that negotiation has happened - previous allocations or allocations of different sizes (e.g. zero-copy fcall) are made with kmalloc directly. Signed-off-by: Dominique Martinet Cc: Matthew Wilcox Cc: Greg Kurz Cc: Jun Piao --- v2: - Add a pointer to the cache in p9_fcall to make sure a buffer allocated with kmalloc gets freed with kfree and vice-versa This could have been smaller with a bool but this spares having to look at the client so looked a bit cleaner, I'm expecting this patch will need a v3 one way or another so I went for the bolder approach - please say if you think a smaller item is better ; I *think* nothing relies on this being ordered the same way as the data on the wire (struct isn't packed anyway) so we can move id after tag and add another u8 to not have any overhead - added likely() to cache existence check in allocation, but nothing for msize check or free because of zc request being of different size v3: - defined the userdata region in the cache, as read or write calls can access the buffer directly (lead to warnings with HARDENED_USERCOPY=y) I didn't get any comment on v2 for this patch, but I'm still not fully happy with this in all honesty. Part of me believes we might be better off always allocating from the cache even for zero-copy headers, it's a waste of space but the buffers are there and being reused constantly for non-zc calls, and mixing kmallocs in could lead to these buffers being really freed and reallocated all the time instead of reusing the slab buffers continuously. That would let me move the likely() for the fast path, with the only exception being the TVERSION initial call on mount, for which I still don't have any nice idea on how to handle, using a different size in p9_client_rpc or prepare_req if the type is TVERSION and I'm really not happy with that... include/net/9p/9p.h | 4 include/net/9p/client.h | 1 + net/9p/client.c | 40 +++- 3 files changed, 40 insertions(+), 5 deletions(-) diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h index e23896116d9a..645266b74652 100644 --- a/include/net/9p/9p.h +++ b/include/net/9p/9p.h @@ -336,6 +336,9 @@ enum p9_qid_t { #define P9_NOFID (u32)(~0) #define P9_MAXWELEM16 +/* Minimal header size: len + id + tag */ +#define P9_HDRSZ 7 + /* ample room for Twrite/Rread header */ #define P9_IOHDRSZ 24 @@ -558,6 +561,7 @@ struct p9_fcall { size_t offset; size_t capacity; + struct kmem_cache *cache; u8 *sdata; }; diff --git a/include/net/9p/client.h b/include/net/9p/client.h index c2671d40bb6b..735f3979d559 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -123,6 +123,7 @@ struct p9_client { struct p9_trans_module *trans_mod; enum p9_trans_status status; void *trans; + struct kmem_cache *fcall_cache; union { struct { diff --git a/net/9p/client.c b/net/9p/client.c index ed78751aee7c..6c57ab1294d7 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -192,6 +192,9 @@ static int parse_opts(char *opts, struct p9_client *clnt) goto free_and_return; } + if (clnt->trans_mod) { + pr_warn("Had trans %s\n", clnt->trans_mod->name); + } v9fs_put_trans(clnt->trans_mod); clnt->trans_mod = v9fs_get_trans_by_name(s); if (clnt->trans_mod == NULL) { @@ -231,9 +234,16 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize) +static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc, +int alloc_msize) { - fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (likely(c->fcall_cache) && alloc_msize == c->msize) { + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS); + fc->cache = c->fcall_cache; + } else { + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + fc->cache = NULL; + } if (!fc->sdata) return -ENOMEM; fc->capacity = alloc_msize; @@ -242,7 +252,16 @@ static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize) void p9_fcall_fini(struct p9_fcall *fc) { - kfree(fc->sd
[PATCH v3 1/2] net/9p: embed fcall in req to round down buffer allocs
From: Dominique Martinet 'msize' is often a power of two, or at least page-aligned, so avoiding an overhead of two dozen bytes for each allocation will help the allocator do its work and reduce memory fragmentation. Suggested-by: Matthew Wilcox Signed-off-by: Dominique Martinet Reviewed-by: Greg Kurz Cc: Matthew Wilcox Cc: Jun Piao --- v2: - Add extra label to not free uninitialized memory on alloc failure - Rename p9_fcall_alloc to 9p_fcall_init - Add a p9_fcall_fini function to echo to init v3: - use p9_call_fini() in rdma_request() instead of kfree, the code was in the following patch in previous version include/net/9p/client.h | 5 +- net/9p/client.c | 167 +--- net/9p/trans_fd.c | 12 +-- net/9p/trans_rdma.c | 29 +++ net/9p/trans_virtio.c | 18 ++--- net/9p/trans_xen.c | 12 +-- 6 files changed, 125 insertions(+), 118 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index a4dc42c53d18..c2671d40bb6b 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -95,8 +95,8 @@ struct p9_req_t { int status; int t_err; wait_queue_head_t wq; - struct p9_fcall *tc; - struct p9_fcall *rc; + struct p9_fcall tc; + struct p9_fcall rc; void *aux; struct list_head req_list; }; @@ -230,6 +230,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char *name, int mode, kgid_t gid, struct p9_qid *); int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 *status); int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl); +void p9_fcall_fini(struct p9_fcall *fc); struct p9_req_t *p9_tag_lookup(struct p9_client *, u16); void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status); diff --git a/net/9p/client.c b/net/9p/client.c index 88db45966740..ed78751aee7c 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -231,16 +231,20 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static struct p9_fcall *p9_fcall_alloc(int alloc_msize) +static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize) { - struct p9_fcall *fc; - fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS); - if (!fc) - return NULL; + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (!fc->sdata) + return -ENOMEM; fc->capacity = alloc_msize; - fc->sdata = (char *) fc + sizeof(struct p9_fcall); - return fc; + return 0; +} + +void p9_fcall_fini(struct p9_fcall *fc) +{ + kfree(fc->sdata); } +EXPORT_SYMBOL(p9_fcall_fini); static struct kmem_cache *p9_req_cache; @@ -263,13 +267,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) if (!req) return NULL; - req->tc = p9_fcall_alloc(alloc_msize); - req->rc = p9_fcall_alloc(alloc_msize); - if (!req->tc || !req->rc) + if (p9_fcall_init(&req->tc, alloc_msize)) + goto free_req; + if (p9_fcall_init(&req->rc, alloc_msize)) goto free; - p9pdu_reset(req->tc); - p9pdu_reset(req->rc); + p9pdu_reset(&req->tc); + p9pdu_reset(&req->rc); req->status = REQ_STATUS_ALLOC; init_waitqueue_head(&req->wq); INIT_LIST_HEAD(&req->req_list); @@ -281,7 +285,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) GFP_NOWAIT); else tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT); - req->tc->tag = tag; + req->tc.tag = tag; spin_unlock_irq(&c->lock); idr_preload_end(); if (tag < 0) @@ -290,8 +294,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) return req; free: - kfree(req->tc); - kfree(req->rc); + p9_fcall_fini(&req->tc); + p9_fcall_fini(&req->rc); +free_req: kmem_cache_free(p9_req_cache, req); return ERR_PTR(-ENOMEM); } @@ -329,14 +334,14 @@ EXPORT_SYMBOL(p9_tag_lookup); static void p9_free_req(struct p9_client *c, struct p9_req_t *r) { unsigned long flags; - u16 tag = r->tc->tag; + u16 tag = r->tc.tag; p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag); spin_lock_irqsave(&c->lock, flags); idr_remove(&c->reqs, tag); spin_unlock_irqrestore(&c->lock, flags); - kfree(r->tc); - kfree(r->rc); + p9_fcall_fini(&r->tc); + p9_fcall_fini(&r->rc); kmem_cache_free(p9_req_cache, r); } @@ -368,7 +373,7 @@ static void p9_tag_cleanup(struct p9_client *c) */ void p9_client_cb(struct p9_client *c, struct
Re: [PATCH v3 2/2] net/9p: add a per-client fcall kmem_cache
piaojun wrote on Fri, Aug 10, 2018: > Could you help paste the test result of before-after-applied this patch in > comment? And please see my comments below. Thanks the the review, do you mean the commit message? I'll add the summary I wrote in reply to your question a few mails before. > > diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h > > index e23896116d9a..645266b74652 100644 > > --- a/include/net/9p/9p.h > > +++ b/include/net/9p/9p.h > > @@ -336,6 +336,9 @@ enum p9_qid_t { > > #define P9_NOFID (u32)(~0) > > #define P9_MAXWELEM16 > > > > +/* Minimal header size: len + id + tag */ > > Here should be 'size + id + tag'. hm I didn't want to repeat size, but I guess people do refer to that field as size. I'll actually rewrite it as: Minimal header size: size[4] type[1] tag[2] > > + kmem_cache_destroy(clnt->fcall_cache); > > I'm afraid that we should check NULL for clnt->fcall_cache. kmem_cache_destroy() in mm/slab_common.c does the null check for us: -- void kmem_cache_destroy(struct kmem_cache *s) { int err; if (unlikely(!s)) return; -- -- Dominique
[PATCH] 9p: Add Dominique Martinet to MAINTAINERS
From: Dominique Martinet Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Ron Minnich Cc: Andrew Morton --- I've had an off-list Ack from Lucho and Eric about adding myself, and got reminded I should probably do it sooner than later by Andrew. Could (a subset of) you three please confirm on the list? (As usual, I was a bit hesitant on what email to use, but I really would not be reactive on my work email so it probably is better that way) FWIW, I did not get any answer from Ron when I tried reaching out to him, but I finally found why: he left sandia 7 years ago so that mail is likely a black hole. I've sent him a mail on another address now asking him if he is ok to be removed, and will probably submit another patch doing that within the next few days. MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 7cebd5bba8a8..15116e6dff3a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -201,10 +201,12 @@ F:drivers/net/ethernet/8390/ M: Eric Van Hensbergen M: Ron Minnich M: Latchesar Ionkov +M: Dominique Martinet L: v9fs-develo...@lists.sourceforge.net W: http://swik.net/v9fs Q: http://patchwork.kernel.org/project/v9fs-devel/list/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs.git +T: git git://github.com/martinetd/linux.git S: Maintained F: Documentation/filesystems/9p.txt F: fs/9p/ -- 2.17.1
[PATCH] 9p: Remove Ron Minnich from MAINTAINERS
From: Dominique Martinet Ron Minnich has left Sandia in 2011, and has not been involved in any 9p commit in the recent years. Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Ron Minnich Cc: Ronald G. Minnich Cc: Latchesar Ionkov --- Thanks for agreeing Ron. MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 15116e6dff3a..09c2a6c0fb0c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -199,7 +199,6 @@ F: drivers/net/ethernet/8390/ 9P FILE SYSTEM M: Eric Van Hensbergen -M: Ron Minnich M: Latchesar Ionkov M: Dominique Martinet L: v9fs-develo...@lists.sourceforge.net -- 2.17.1
Re: [PATCH] compiler-gcc: get back Clang build
Nick Desaulniers Aug. 21, 2018, 8:09 a.m. UTC: > Thanks for noticing, and sending this patch. I'm happy to see others > testing with Clang. I noticed this too near the end of the day > https://github.com/ClangBuiltLinux/linux/issues/27. FWIW libbcc so many BPF users also use clang, so this has more impact than just testing to build linux with clang (not that this would be any reason to delay fixing either way) I would tend to agree havin a compiler-common + make clang/intel not include compiler-gcc would probably be best in the long run but we might want a quick fix for 4.19 meanwhile.. -- Dominique Martinet
Re: [PATCH] compiler-gcc: get back Clang build
Nick Desaulniers wrote on Tue, Aug 21, 2018: > On Tue, Aug 21, 2018 at 9:45 AM Joe Perches wrote: > > > Tested with gcc-7 and clang-8. > > > > clang-8? Isn't the latest officlal clang 6.0.1 ? > [...] > > So if something other than 6.0.x is required, > > then some additional check should probably be > > added to compiler-clang.h as well. > > Sure, but that doesn't need to go in Mashiro's patch today. That can > wait for a proper separation between compiler headers where we can > then implement improved version checks. I can confirm that the patch here works for me with clang 6.0.1 as far as bpf programs are concerned (and gcc 8.1.1 for the kernel build itself on x86 while I was at it); so at least there is nothing on the compiler*.h headers that put off clang 6.0.1 (also replying to other subthread) >From Joe Perches @ 2018-08-21 17:22 UTC: > The question remains, if clang can't compile > v4.17, why does it immediately matter for v4.19? I haven't had any problem with clang and 4.17 as far as building bpf programs is concerned, because the CC_HAVE_ASM_GOTO check is done in makefiles that aren't used in the bpf case and not in compiler-gcc.h or another header. So I guess the "immediately matters for v4.19" depends on how much you would care about bcc-compiled BPF programs. > Why wouldn't overriding the clang __GNUC_ > #defines in compiler-gcc.h work acceptably with > adding whatever is necessary to compiler-clang.h? I think that could work, but at the point making a separate compiler-common.h and not including compiler-gcc.h for clang sounds better to me... More importantly here, either solution sound complex enough to require more than a few days and proper testing for all archs etc when compared to the partial revert we have here. -- Dominique Martinet
Re: [PATCH] compiler-gcc: get back Clang build
Joe Perches wrote on Tue, Aug 21, 2018: > On Wed, 2018-08-22 at 06:16 +0200, Dominique Martinet wrote: > > I think that could work, but at the point making a separate > > compiler-common.h and not including compiler-gcc.h for clang sounds > > better to me... More importantly here, either solution sound complex > > enough to require more than a few days and proper testing for all archs > > etc when compared to the partial revert we have here. > > The immediate need for a partial revert seems unnecessary as > clang hasn't really worked for a couple releases now. Sorry for repeating myself, clang is used by bcc for compiling BPF programs (e.g. bpf_module_create_c_from_string() or any similar function will use the clang libs to compile the bpf program with linux headers), and this has always been working because it's not using our makefiles. This broke today in master and I only joined this thread after looking at why the build started failing today and noticing this patch, it definitely hasn't been broken for two releases, or I wouldn't be here with this timing :) > The separate compiler file changes are much more sensible, > even if it takes a few days. A few days are fine, but I think some form of fix in 4.19-rc1 would be good. I'll stop taking your time now, but I think you are/were underestimating how many people use clang with the linux kernel headers indirectly. BPF is a well-used tool :) Thanks, -- Dominique Martinet
Re: [PATCH] compiler-gcc: get back Clang build
Nick Desaulniers wrote on Wed, Aug 22, 2018: > I'm currently testing a fix in > https://github.com/ClangBuiltLinux/linux/commit/1f89ae7622c26b8131f42f3a362d6ef41b88a595, > can you please share with me your steps to test/verify that the patch > fixes the issue for eBPF? I'll go talk to a co-worker who know more > about eBPF, but I've not yet done anything with it. Thanks for the link. The simplest way to test with bcc for me would probably be to fetch the example from their repo itself[1], whether you install bcc through your distro or compile it yourself. There are dozens of example programs in the tools/ directory of which you can just run any, if will try to compile the program embedded wihtin the example at runtime with clang. [1] https://github.com/iovisor/bcc (I just noticed fedora provides a bcc-tools package which provides these tools directly, so if you're lucky you can just install that and run examples in /usr/share/bcc/tools) In particular I get a couple of errors with your patch, I think it boils down to this one: - In file included from :3: In file included from /virtual/include/bcc/helpers.h:23: In file included from /lib/modules/4.18.0+/build/include/linux/log2.h:16: In file included from /lib/modules/4.18.0+/build/include/linux/bitops.h:18: In file included from /lib/modules/4.18.0+/build/arch/x86/include/asm/bitops.h:16: /lib/modules/4.18.0+/build/include/linux/compiler.h:217:3: error: expected expression barrier(); ^ /lib/modules/4.18.0+/build/include/linux/compiler-clang.h:43:20: note: expanded from macro 'barrier' #define barrier() (__asm__ __volatile__("" : : : "memory")) ^ - And removing the parenthesis around the expression seems to work, they weren't here in the compiler-gcc.h file in the previous version? There might be other defines the simple examples I ran do not use but from a quick glance it doesn't look like it, thank you for the split work! -- Dominique Martinet
Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive
Nick Desaulniers wrote on Wed, Aug 22, 2018: > Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6") > recently exposed a brittle part of the build for supporting non-gcc > compilers. > > Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and > __GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't > added compiler specific checks for __clang__ or __INTEL_COMPILER. > > This is brittle, as they happened to get compatibility by posing as a > certain version of GCC. This broke when upgrading the minimal version > of GCC required to build the kernel, to a version above what ICC and > Clang claim to be. > > Rather than always including compiler-gcc.h then undefining or > redefining macros in compiler-intel.h or compiler-clang.h, let's > separate out the compiler specific macro definitions into mutually > exclusive headers, do more proper compiler detection, and keep shared > definitions in compiler_types.h. > > Reported-by: Masahiro Yamada > Suggested-by: Eli Friedman > Suggested-by: Joe Perches > Signed-off-by: Nick Desaulniers Overall looks good to me, just pointing at the same error I wrote in my other mail here -- I saw that by the time I was done writing this this patch got taken but that alone will probably warrant a follow-up :/ I've also added a few style nitpicks/questions but feel free to ignore these. On a side note, I noticed tools/include/linux/compiler.h includes linux/compiler-gcc.h but maybe it should include linux/compiler_types.h? (I'm not sure at who uses that header, so it really is an open question here) > --- > arch/arm/kernel/asm-offsets.c | 13 +- > drivers/gpu/drm/i915/i915_utils.h | 2 +- > drivers/watchdog/kempld_wdt.c | 5 - > include/linux/compiler-clang.h| 20 ++- > include/linux/compiler-gcc.h | 88 --- > include/linux/compiler-intel.h| 13 +- > include/linux/compiler_types.h| 238 ++ > mm/ksm.c | 4 +- > mm/migrate.c | 3 +- > 9 files changed, 133 insertions(+), 253 deletions(-) > > [...] > diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h > index 7087446c24c8..5f0754692ce6 100644 > --- a/include/linux/compiler-clang.h > +++ b/include/linux/compiler-clang.h > [...] > @@ -40,9 +30,17 @@ > * checks. Unfortunately, we don't know which version of gcc clang > * pretends to be, so the macro may or may not be defined. > */ > -#undef COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW > #if __has_builtin(__builtin_mul_overflow) && \ > __has_builtin(__builtin_add_overflow) && \ > __has_builtin(__builtin_sub_overflow) > #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 > #endif > + > +/* The following are for compatibility with GCC, from compiler-gcc.h, > + * and may be redefined here because they should not be shared with other > + * compilers, like ICC. > + */ > +#define barrier() (__asm__ __volatile__("" : : : "memory")) barrier here has gained external () compared to the definition and compiler-gcc.h: #define barrier() __asm__ __volatile__("": : :"memory") And this fails with bpf: In file included from :3: In file included from /virtual/include/bcc/helpers.h:23: In file included from /lib/modules/4.18.0+/build/include/linux/log2.h:16: In file included from /lib/modules/4.18.0+/build/include/linux/bitops.h:18: /lib/modules/4.18.0+/build/arch/x86/include/asm/bitops.h:170:2: error: expected expression barrier(); ^ /lib/modules/4.18.0+/build/include/linux/compiler-clang.h:43:20: note: expanded from macro 'barrier' #define barrier() (__asm__ __volatile__("" : : : "memory")) ^ > +#define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0])) > +#define __assume_aligned(a, ...) \ > + __attribute__((__assume_aligned__(a, ## __VA_ARGS__))) > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > index 250b9b7cfd60..763bbad1e258 100644 > --- a/include/linux/compiler-gcc.h > +++ b/include/linux/compiler-gcc.h > [..] > -#define __cold __attribute__((__cold__)) > - > #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), > __COUNTER__) > > #define __optimize(level)__attribute__((__optimize__(level))) > -#define __nostackprotector __optimize("no-stack-protector") I do not see this being added back, it's probably fine though as it doesn't look used? (I didn't check all removed lines, maybe about half) > #define __compiletime_object_size(obj) __builtin_object_size(obj, 0) > > diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h > index fbf337933fd8..90479a0f3986 100644 > --- a/include/linux/compiler_types.h > +++ b/include/linux/compiler_types.h > [...] > /* Is this type a native word size -- useful for atomic operations */ > -#ifndef __native_word > -# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == > sizeof(short) || sizeof(t) == siz
Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive
Linus Torvalds wrote on Wed, Aug 22, 2018: > I've fixed that manually, but when I tried to test it I just hit the > > arch/x86/Makefile:179: *** Compiler lacks asm-goto support.. Stop. > > error. > > Do you have some experimental clang build with asm goto support? What > version? Or is it just that you're building ARM, not x86? I'm not building linux directly, but BPF programs that indirectly uses clang with bcc On fedora you can install bcc-tools and run e.g. /usr/share/bcc/tools/execsnoop that will compile something at runtime. If the package isn't available for your distro you can just install bcc and run the script from their repo[1] [1] https://raw.githubusercontent.com/iovisor/bcc/master/tools/execsnoop.py Thanks, -- Dominique
Re: [PATCH] include/linux/compiler*.h: make compiler-*.h mutually exclusive
Nick Desaulniers wrote on Thu, Aug 23, 2018: > > On a side note, I noticed tools/include/linux/compiler.h includes > > linux/compiler-gcc.h but maybe it should include linux/compiler_types.h? > > (I'm not sure at who uses that header, so it really is an open question > > here) > > Without looking into the code, this sounds like compiler.h is wrong. > Looking at the source, there's references to ancient Android NDK's > (what?! Let me show this to the NDK maintainers). This whole thing > needs to be cleaned up, too, IMO. Oh, there's two compiler-gcc.h in > the tree: > > - tools/include/linux/compiler-gcc.h (that's what's being included in > the case you point out) > - include/linux/compiler-gcc.h > > Maybe they can be combined? Ah, I didn't notice there is another compiler-gcc... Looking closer there seem to be other "reused" headers (almost all of the headers in tools/include/linux have a counterpart in include/linux), and some e.g. rbtree.h went from being a copy of the main include's to a single '#include "../../../../include/linux/rbtree.h"' line to a slightly simplified copy again to avoid bringing in rcu as dependency... I think compiler.h could be done with a similar trick with a bit of massaging, but as things stand linux/compiler.h includes uapi/linux/types.h, and this complains about using kernel headers from user space... So this isn't as trivial as just making tools use the kernel's include at least. > > If you tried to align these, __always_unused and __alias(symbol) look > > off > > I have `set tabstop=8` in vim, and it looks correct there, but both > `git diff` and github's code viewer show it as off. Maybe I have my > settings wrong in vim and need to go back to first grade, but > (personal opinion ahead) you don't have this kind of nonsense > (ambiguity around how many spaces a tab should be displayed as in > various code viewers) if you just always use spaces everywhere, for > everything. Other large codebases use automatic code formatters (like > clang-format) and that prevents holy wars and other distractions. > There's even a cool utility called `git-clang-format` that can check > just the code you changed, which is useful for not reformatting the > entire codebase messing up git blame. Right, this is my mistake - the diff view adds a space so these "inner tabs" alignments can be messed up. FWIW I think checkpatch only complains about leading space-based indentation, the inner filling can be whatever you want, but this is fine and I was overzealous. > On Wed, Aug 22, 2018 at 7:43 PM Masahiro Yamada > wrote: > > It was previous included by all compilers, > > but now it is only by _true_ GCC. > > Good catch, and thanks for the report and testing. I missed that > because I was testing x86_64 and arm64 (which I guess don't hit that > in the configs I tested) and not arm32. Should be a simple fix to > move it to shared the definition. Send me a patch, or I'll get to it. > > > Even if I move the __naked definition > > to , > > I see a different error. > > Did that regress due to my change? If so, sorry and I'll look into it > more soon. I've looked at that one quickly and that warning looks legitimate to me, I'm not sure why it would appear only now but clang does document[1] not allowing the parameters to be used even in extended asm, and gcc documentation[2] says "While using extended asm or a mixture of basic asm and C code may appear to work, they cannot be depended upon to work reliably and are not supported." [1] http://infocenter.arm.com/help/topic/com.arm.doc.dui0774g/jhg1476893564298.html [2] https://gcc.gnu.org/onlinedocs/gcc/ARM-Function-Attributes.html In this case it looks like the arguments are only used for sanity with __asmeq but in the first place the original commit for trusted foundations talks about it not respecting the API so naked makes sense but they're not making the API function naked (and the one which takes an argument does use its argument) so this all feels a bit weird to me. It might be worth asking the original authors on this one... -- Dominique Martinet
Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests
Dominique Martinet wrote on Mon, Jul 23, 2018: > I'll try to get figures for various approaches before the merge window > for 4.19 starts, it's getting closer though... Here's some numbers; with v4.18-rc7 + current test tree (my 9p-next) as a base. For the context, I'm running on VMs that bind their cores to CPUs on the host (32 cores), and have a Connect-IB mellanox card through SRIOV. The server is nfs-ganesha, serving a tmpfs filesystem on a second VM (different host) Mounting with msize=$((1024*1024)) My main problem with this test is that the client has way too much memory and it's mostly pristine with a boot not long before, so any kind of memory pressure won't be seen here. If someone knows how to fragment memory quickly I'll take that and rerun the tests :) I've changed my mind from mdtest to a simple ior, as I'm testing on trans=rdma there's no difference and I'm more familiar with ior options. I ran two workloads: - 32 processes, file per process, 512k at a time writing a total of 32GB (1GB per file), repeated 10 times - 32 processes, file per process, 32 bytes at a time writing a total of 16MB (512k per file), repeated 10 times. The first test gives a proper impression of the throughput the systems can sustain and the results are pretty much around what I was expecting for the setup; the second test is purely a latency test (how long does it take to send 512k RPCs) I ran almost all of these tests with KASAN enabled in the VMs a first time, so leaving the results with KASAN at the end for reference... Overall I'm rather happy with the result, without KASAN the overhead of the patch isn't negligible (~6%) but I'd say it's acceptable for correctness and with an extra two patchs with the suggesteed changes (rounding down the alloc size to not include the struct overhead and separate kmem cache) it's getting down to 0.5% which is quite good, I think. I'll send the two patchs to the list shortly. The first one is rather huge even if it's a trivial change logically, so part of me wants to get it merged quickly to not have to deal with rebases... ;) With KASAN, well, it certainly does more things but I hope performance-critical systems don't have it enabled in the first place. Raw results: * Base = 4.18-rc7 + queued patches, without request cache rework - "Big" I/Os: Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum write5842.405751.585793.53 23.935.65606 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 read 6098.926018.636064.30 20.005.40348 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 - "Small" I/Os: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum write 2.10 1.91 2.00 0.058.01074 0 32 32 10 1 0 1 0 0 1 524288 32 16777216 POSIX 0 read1.27 1.07 1.15 0.06 13.93901 0 32 32 10 1 0 1 0 0 1 524288 32 16777216 POSIX 0 -> 512k / 8.01074 = 65.4k req/s * Base + patch as submitted - "Big" I/Os: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum write5844.845665.325787.15 48.945.66261 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 read 6082.246039.626057.14 12.505.40983 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 - "Small" I/Os: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum write 1.95 1.82 1.88 0.048.50453 0 32 32 10 1 0 1 0 0 1 524288 32 16777216 POSIX 0 read1.18 1.07 1.14 0.03 14.04634 0 32 32 10 1 0 1 0 0 1 524288 32 16777216 POSIX 0 -> 512k / 8.50453 = 61.6k req/s * Base + patch as submitted + moving the header into req so the allocation is "round" as suggested by Matthew - "Big" I/Os: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum write5861.795680.995795.71 48.845.65424 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 read 6098.546037.556067.80 19.395.40036 0 32 32 10 1 0 1 0 0 1 1073741824 524288 34359738368 POSIX 0 - "Small" I/Os: Operation Max(MiB) Min(MiB) Mean(MiB) StdDevMean(s) Test# #Tasks tPN reps fPP reord reordof
[PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs
From: Dominique Martinet 'msize' is often a power of two, or at least page-aligned, so avoiding an overhead of two dozen bytes for each allocation will help the allocator do its work and reduce memory fragmentation. Suggested-by: Matthew Wilcox Signed-off-by: Dominique Martinet --- include/net/9p/client.h | 4 +- net/9p/client.c | 160 net/9p/trans_fd.c | 12 +-- net/9p/trans_rdma.c | 29 net/9p/trans_virtio.c | 18 ++--- net/9p/trans_xen.c | 12 +-- 6 files changed, 117 insertions(+), 118 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index a4dc42c53d18..4b4ac1362ad5 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -95,8 +95,8 @@ struct p9_req_t { int status; int t_err; wait_queue_head_t wq; - struct p9_fcall *tc; - struct p9_fcall *rc; + struct p9_fcall tc; + struct p9_fcall rc; void *aux; struct list_head req_list; }; diff --git a/net/9p/client.c b/net/9p/client.c index 88db45966740..ba99a94a12c9 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -231,15 +231,13 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static struct p9_fcall *p9_fcall_alloc(int alloc_msize) +static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize) { - struct p9_fcall *fc; - fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS); - if (!fc) - return NULL; + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (!fc->sdata) + return -ENOMEM; fc->capacity = alloc_msize; - fc->sdata = (char *) fc + sizeof(struct p9_fcall); - return fc; + return 0; } static struct kmem_cache *p9_req_cache; @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) if (!req) return NULL; - req->tc = p9_fcall_alloc(alloc_msize); - req->rc = p9_fcall_alloc(alloc_msize); - if (!req->tc || !req->rc) + if (p9_fcall_alloc(&req->tc, alloc_msize)) + goto free; + if (p9_fcall_alloc(&req->rc, alloc_msize)) goto free; - p9pdu_reset(req->tc); - p9pdu_reset(req->rc); + p9pdu_reset(&req->tc); + p9pdu_reset(&req->rc); req->status = REQ_STATUS_ALLOC; init_waitqueue_head(&req->wq); INIT_LIST_HEAD(&req->req_list); @@ -281,7 +279,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) GFP_NOWAIT); else tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT); - req->tc->tag = tag; + req->tc.tag = tag; spin_unlock_irq(&c->lock); idr_preload_end(); if (tag < 0) @@ -290,8 +288,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) return req; free: - kfree(req->tc); - kfree(req->rc); + kfree(req->tc.sdata); + kfree(req->rc.sdata); kmem_cache_free(p9_req_cache, req); return ERR_PTR(-ENOMEM); } @@ -329,14 +327,14 @@ EXPORT_SYMBOL(p9_tag_lookup); static void p9_free_req(struct p9_client *c, struct p9_req_t *r) { unsigned long flags; - u16 tag = r->tc->tag; + u16 tag = r->tc.tag; p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag); spin_lock_irqsave(&c->lock, flags); idr_remove(&c->reqs, tag); spin_unlock_irqrestore(&c->lock, flags); - kfree(r->tc); - kfree(r->rc); + kfree(r->tc.sdata); + kfree(r->rc.sdata); kmem_cache_free(p9_req_cache, r); } @@ -368,7 +366,7 @@ static void p9_tag_cleanup(struct p9_client *c) */ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status) { - p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag); + p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc.tag); /* * This barrier is needed to make sure any change made to req before @@ -378,7 +376,7 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status) req->status = status; wake_up(&req->wq); - p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag); + p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc.tag); } EXPORT_SYMBOL(p9_client_cb); @@ -449,18 +447,18 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req) int err; int ecode; - err = p9_parse_header(req->rc, NULL, &type, NULL, 0); - if (req->rc->size >= c->msize) { + err = p9_parse_header(&req->rc, NULL, &type, NULL, 0); + if (req->rc.size >= c-
[PATCH 2/2] net/9p: add a per-client fcall kmem_cache
From: Dominique Martinet Having a specific cache for the fcall allocations helps speed up allocations a bit, especially in case of non-"round" msizes. The caches will automatically be merged if there are multiple caches of items with the same size so we do not need to try to share a cache between different clients of the same size. Since the msize is negotiated with the server, only allocate the cache after that negotiation has happened - previous allocations or allocations of different sizes (e.g. zero-copy fcall) are made with kmalloc directly. Signed-off-by: Dominique Martinet --- include/net/9p/client.h | 2 ++ net/9p/client.c | 40 net/9p/trans_rdma.c | 2 +- 3 files changed, 35 insertions(+), 9 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 4b4ac1362ad5..8d9bc7402a42 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -123,6 +123,7 @@ struct p9_client { struct p9_trans_module *trans_mod; enum p9_trans_status status; void *trans; + struct kmem_cache *fcall_cache; union { struct { @@ -230,6 +231,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char *name, int mode, kgid_t gid, struct p9_qid *); int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 *status); int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl); +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc); struct p9_req_t *p9_tag_lookup(struct p9_client *, u16); void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status); diff --git a/net/9p/client.c b/net/9p/client.c index ba99a94a12c9..215e3b1ed7b4 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -231,15 +231,34 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize) +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc, + int alloc_msize) { - fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (c->fcall_cache && alloc_msize == c->msize) + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS); + else + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); if (!fc->sdata) return -ENOMEM; fc->capacity = alloc_msize; return 0; } +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc) +{ + /* sdata can be NULL for interrupted requests in trans_rdma, +* and kmem_cache_free does not do NULL-check for us +*/ + if (unlikely(!fc->sdata)) + return; + + if (c->fcall_cache && fc->capacity == c->msize) + kmem_cache_free(c->fcall_cache, fc->sdata); + else + kfree(fc->sdata); +} +EXPORT_SYMBOL(p9_fcall_free); + static struct kmem_cache *p9_req_cache; /** @@ -261,9 +280,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) if (!req) return NULL; - if (p9_fcall_alloc(&req->tc, alloc_msize)) + if (p9_fcall_alloc(c, &req->tc, alloc_msize)) goto free; - if (p9_fcall_alloc(&req->rc, alloc_msize)) + if (p9_fcall_alloc(c, &req->rc, alloc_msize)) goto free; p9pdu_reset(&req->tc); @@ -288,8 +307,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) return req; free: - kfree(req->tc.sdata); - kfree(req->rc.sdata); + p9_fcall_free(c, &req->tc); + p9_fcall_free(c, &req->rc); kmem_cache_free(p9_req_cache, req); return ERR_PTR(-ENOMEM); } @@ -333,8 +352,8 @@ static void p9_free_req(struct p9_client *c, struct p9_req_t *r) spin_lock_irqsave(&c->lock, flags); idr_remove(&c->reqs, tag); spin_unlock_irqrestore(&c->lock, flags); - kfree(r->tc.sdata); - kfree(r->rc.sdata); + p9_fcall_free(c, &r->tc); + p9_fcall_free(c, &r->rc); kmem_cache_free(p9_req_cache, r); } @@ -944,6 +963,7 @@ struct p9_client *p9_client_create(const char *dev_name, char *options) clnt->trans_mod = NULL; clnt->trans = NULL; + clnt->fcall_cache = NULL; client_id = utsname()->nodename; memcpy(clnt->name, client_id, strlen(client_id) + 1); @@ -980,6 +1000,9 @@ struct p9_client *p9_client_create(const char *dev_name, char *options) if (err) goto close_trans; + clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize, + 0, 0, NULL); + return clnt; close_trans: @@ -1011,6 +1034,7 @@ void p9_client_destroy(struct
Re: [PATCH] net/9p/trans_virtio.c: replace mutex_lock with spin_lock to protect 'virtio_chan_list'
piaojun wrote on Thu, Jul 19, 2018: > > piaojun wrote on Wed, Jul 18, 2018: > > That's not a fast path operation, I don't mind changing things but I'd > > like to understand why - these functions are only ever called at unmount > > time or when something happens on the virtio bus (probe will happen on > > probing on the pci bus and I'm not too sure on remove but probably pci > > removal i.e. basically never?) > > > > I don't see why this wouldn't work, but I won't take this without a > > (good?) reason. > > > virtio_9p_lock is responsable for protecting virtio_chan_list which has 3 > operation: > > 1. Add a virtio chan to virtio_chan_list. This will happen when we insmod > 9pnet_virtio.ko: > p9_virtio_probe > --list_add_tail(&chan->chan_list, &virtio_chan_list); > > 2. Remove a virtio chan. This will happen when remnod 9pnet_virtio.ko: > p9_virtio_remove > --list_del(&chan->chan_list); > > 3. Find a unused virtio chan when mount 9p: > mount > --p9_virtio_create > --list_for_each_entry(chan, &virtio_chan_list, chan_list) > > Multi mount process will compete for virtio_9p_lock when finding unused > virtio chan, in which case mutex lock will cause process sleep and wake > up. I think this a waste of CPU time. So we could use spin lock to avoid > this. Well, sure, that's theory; but how is that in practice? I actually took the time to run some tests, setting up 20 virtio mount points in qemu, and running this command with and without your patch: # time sh -c 'for i in {1..20}; do sh -c "for j in {1..100}; do mount -t 9p d$i d.$i; umount d.$i; done" & done; wait' This is quick & dirty but basically, mounts and unmounts 100 times in a loop all 20 mount points in parallel to stress that lock. I get these times 5 times (one run per column), without patch: real0m19.357s 0m19.626s 0m19.904s 0m19.926s 0m21.321s user0m6.795s0m6.874s0m6.807s0m6.768s0m6.892s sys 0m29.936s 0m31.196s 0m31.702s 0m31.914s 0m30.791s With patch: real0m19.439s 0m19.849s 0m19.683s 0m19.600s 0m20.689s user0m6.948s0m6.582s0m6.706s0m6.598s0m6.876s sys 0m29.364s 0m30.898s 0m30.695s 0m31.311s 0m33.391s I honestly can't say I'm convinced with a difference either way, the variations look more like noise than anything to me. More to the point, while these tests ran my dmesg buffer was filled with errors like: FS-Cache: Duplicate cookie detected FS-Cache: O-cookie c=00368cdb [p=548b03c2 fl=222 nc=0 na=1] FS-Cache: O-cookie d=4cebd15f n=029a0b83 FS-Cache: O-key=[10] '34323935303838343536' FS-Cache: N-cookie c=d4089478 [p=548b03c2 fl=2 nc=0 na=1] FS-Cache: N-cookie d=4cebd15f n=959d4d37 FS-Cache: N-key=[10] '34323935303838343536' or (output mangled a bit) == BUG: KASAN: use-after-free in p9_client_cb+0x14d/0x160 [9pnet] Read of size 8 at addr 88003522a088 by task systemd-udevd/492 CPU: 1 PID: 492 Comm: systemd-udevd Tainted: G O 4.18.0-rc5+ #9 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 0> Call Trace: dump_stack+0x7b/0xad print_address_description+0x6a/0x209 ? p9_client_cb+0x14d/0x160 [9pnet] kasan_report.cold.7+0x242/0x2fe __asan_report_load8_noabort+0x19/0x20 p9_client_cb+0x14d/0x160 [9pnet] req_done+0x22f/0x280 [9pnet_virtio] ? p9_mount_tag_show+0x120/0x120 [9pnet_virtio] vring_interrupt+0x108/0x1b0 [virtio_ring] ? vring_map_single.constprop.23+0x350/0x350 [virtio_ring] __handle_irq_event_percpu+0xec/0x460 handle_irq_event_percpu+0x71/0x140 ? __handle_irq_event_percpu+0x460/0x460 ? apic_ack_irq+0xa3/0xe0 handle_irq_event+0xb9/0x14a handle_edge_irq+0x1ea/0x7a0 ? kasan_check_read+0x11/0x20 handle_irq+0x48/0x60 do_IRQ+0x67/0x140 common_interrupt+0xf/0xf RIP: 0010:finish_task_switch+0x10e/0x630 Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 6d 04 00 00 41 c7 45 38 00 00 00 00 4c 89 e7 ff 14 25 28 f5 66 8e fb 66 0f > RSP: 0018:8800633e7a60 EFLAGS: 0246 ORIG_RAX: ffd4 RAX: 0001 RBX: 880036632000 RCX: RDX: RSI: RDI: 88006caaac00 RBP: 8800633e7aa0 R08: ed000cea15cd R09: ed000cea15cc R10: ed000cea15cc R11: 88006750ae63 R12: 88006caaac00 R13: 88006558b000 R14: R15: 880036632000 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 __schedule+0x733/0x1c10 ? __bpf_prog_run64+0xd0/0xd0 ? firmware_map_remove+0x174/0x174 schedule+0x7a/0x1a0 schedule_hrtimeout_range_clock+0x306/0x3b0 ? kasan_check_write+0x14/0x20 ? hrtimer_nanosleep_restart+0x290/0x290 ? ep_busy_loop_end+0x110/0x110 schedule_hrtimeout_range+0x13/0x20 ep_poll+0x7a7/0xb50 ? __ia3
Re: [V9fs-developer] KASAN: use-after-free Read in generic_perform_write
Andrew Morton wrote on Thu, Jul 19, 2018: > On Thu, 19 Jul 2018 11:01:01 -0700 syzbot > wrote: > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:1c34981993da Add linux-next specific files for 20180719 > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=16e6ac4440 > > kernel config: https://syzkaller.appspot.com/x/.config?x=7002497517b09aec > > dashboard link: https://syzkaller.appspot.com/bug?extid=b173e77096a8ba815511 > > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > > > Unfortunately, I don't have any reproducer for this crash yet. > > Thanks. I cc'ed v9fs-developer, optimistically. That list manager is > weird :( I agree that list is weird, does anyone know the reason v9fs-developer is not a vger.k.o list? Or a reason not to change? It's still not too late... > I'm suspecting v9fs. Does that fs attempt to write to the fs from a > kmalloced buffer? Difficult to say without any idea of what syzkaller tried doing, but it looks like it hook'd up a fd opened to a local ext4 file into a trans_fd mount; so sending a packet to the "server" would trigger a local write instead. The reason it's freed too early probably is that the reply came from a read before the write happened; this is going to be tricky to fix as that write is 100% asynchronous without any feedback right now (the design assumes that the write has to have finished by the time reply came), but if we want to protect ourselves from rogue servers we'll have to think about something. I'll write it down to not forget, thanks for the cc. -- Dominique Martinet
Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs
piaojun wrote on Tue, Jul 31, 2018: > This is really a *big* patch, but the modification seems no harm. And I > suggest running testcases to cover this. Please see my comments below. I'm always running tests, but more never hurt - please help ;) For reference I'm running a subset of cthon04[1], ltp[2] and some custom tests like these[3][4] [1] https://fedorapeople.org/cgit/steved/public_git/cthon04.git/ [2] https://github.com/linux-test-project/ltp [3] https://github.com/phdeniel/sigmund/blob/master/modules/allfs.inc#L208 [4] https://github.com/phdeniel/sigmund/blob/master/modules/allfs.inc#L251 > > [...] > > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, > > unsigned int max_size) > > if (!req) > > return NULL; > > > > - req->tc = p9_fcall_alloc(alloc_msize); > > - req->rc = p9_fcall_alloc(alloc_msize); > > - if (!req->tc || !req->rc) > > + if (p9_fcall_alloc(&req->tc, alloc_msize)) > > + goto free; > > + if (p9_fcall_alloc(&req->rc, alloc_msize)) > > goto free; > > > > - p9pdu_reset(req->tc); > > - p9pdu_reset(req->rc); > > + p9pdu_reset(&req->tc); > > + p9pdu_reset(&req->rc); > > req->status = REQ_STATUS_ALLOC; > > init_waitqueue_head(&req->wq); > > INIT_LIST_HEAD(&req->req_list); > > @@ -281,7 +279,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned > > int max_size) > > GFP_NOWAIT); > > else > > tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT); > > - req->tc->tag = tag; > > + req->tc.tag = tag; > > spin_unlock_irq(&c->lock); > > idr_preload_end(); > > if (tag < 0) > > @@ -290,8 +288,8 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned > > int max_size) > > return req; > > > > free: > > - kfree(req->tc); > > - kfree(req->rc); > > + kfree(req->tc.sdata); > > + kfree(req->rc.sdata); > > I wonder if we will free a wild pointer as 'sdata' has not been initialized > NULL. Good point, it's possible to jump here if the first fcall_alloc failed since this declustered the two allocations. Please consider this added to the previous patch (I'll send a v2 after this has had more time for review, you can find the amended commit in my 9p-test tree meanwhile): -8<- diff --git a/net/9p/client.c b/net/9p/client.c index ba99a94a12c9..fe030ef1c076 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -262,7 +262,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) return NULL; if (p9_fcall_alloc(&req->tc, alloc_msize)) - goto free; + goto free_req; if (p9_fcall_alloc(&req->rc, alloc_msize)) goto free; @@ -290,6 +290,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) free: kfree(req->tc.sdata); kfree(req->rc.sdata); +free_req: kmem_cache_free(p9_req_cache, req); return ERR_PTR(-ENOMEM); } -8<- The second goto doesn't need changing because rc.sdata will be set to NULL if the allocation failed -- Dominique
Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
piaojun wrote on Tue, Jul 31, 2018: > Could you help paste some test result before-and-after the patch applied? The only performance tests I did were sent to the list a couple of mails earlier, you can find it here: http://lkml.kernel.org/r/20180730093101.GA7894@nautica In particular, the results for benchmark on small writes just before and after this patch, without KASAN (these are the same numbers as in the link, hardware/setup is described there): - no alloc (4.18-rc7 request cache): 65.4k req/s - non-power of two alloc, no patch: 61.6k req/s - power of two alloc, no patch: 62.2k req/s - non-power of two alloc, with patch: 64.7k req/s - power of two alloc, with patch: 65.1k req/s I'm rather happy with the result, I didn't expect using a dedicated cache would bring this much back but it's certainly worth it. > > @@ -1011,6 +1034,7 @@ void p9_client_destroy(struct p9_client *clnt) > > > > p9_tag_cleanup(clnt); > > > > + kmem_cache_destroy(clnt->fcall_cache); > > We could set NULL for fcall_cache in case of use-after-free. > > > kfree(clnt); Hmm, I understand where this comes from, but I'm not sure I agree. If someone tries to access the client while/after it is freed things are going to break anyway, I'd rather let things break as obviously as possible than try to cover it up. -- Dominique
Re: [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
Matthew Wilcox wrote on Mon, Jul 30, 2018: > On Mon, Jul 30, 2018 at 11:34:23AM +0200, Dominique Martinet wrote: > > -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize) > > +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc, > > + int alloc_msize) > > { > > - fc->sdata = kmalloc(alloc_msize, GFP_NOFS); > > + if (c->fcall_cache && alloc_msize == c->msize) > > + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS); > > + else > > + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); > > Could you simplify this by initialising c->msize to 0 and then this > can simply be: > > > + if (alloc_msize == c->msize) > ... Hmm, this is rather tricky with the current flow of things; p9_client_version() has multiple uses for that msize field. Basically what happens is: - init client struct, set clip msize to mount option/transport-specific max - p9_client_version() uses current c->msize to send a suggested value to the server - p9_client_rpc() uses current c->msize to allocate that first rpc, this is pretty much hard-coded and will be quite intrusive to make an exception for - p9_client_version() looks at the msize the server suggested and clips c->msize if the reply's is smaller than c->msize I kind of agree it'd be nice to remove that check being done all the time for just startup, but I don't see how to do this easily with the current code. Making p9_client_version take an extra argument would be easy but we'd need to actually hardcode in p9_client_rpc that "if the message type is TVERSION then use [page size or whatever] for allocation" and that kinds of kills the point... The alternative being having p9_client_rpc takes the actual size as argument itself but this once again is pretty intrusive even if it could be done mechanically... I'll think about this some more > > +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc) > > +{ > > + /* sdata can be NULL for interrupted requests in trans_rdma, > > +* and kmem_cache_free does not do NULL-check for us > > +*/ > > + if (unlikely(!fc->sdata)) > > + return; > > + > > + if (c->fcall_cache && fc->capacity == c->msize) > > + kmem_cache_free(c->fcall_cache, fc->sdata); > > + else > > + kfree(fc->sdata); > > +} > > Is it possible for fcall_cache to be allocated before fcall_free is > called? I'm concerned we might do this: > > allocate message A > allocate message B > receive response A > allocate fcall_cache > receive response B > > and then we'd call kmem_cache_free() for something allocated by kmalloc(), > which works with slab and slub, but doesn't work with slob (alas). Bleh, I checked this would work for slab and didn't really check others.. This cannot happen right now because we only return the client struct from p9_client_create after the first message is done (and, right now, freed) but when we start adding refcounting to requests it'd be possible to free the very first response after fcall_cache is allocated with a "bad" server like syzcaller does sending the version reply before the request came in. I can't see any work-around around this other than storing how the fcall was allocated in the struct itself though... I guess I might as well do that now, unless you have a better idea. > > @@ -980,6 +1000,9 @@ struct p9_client *p9_client_create(const char > > *dev_name, char *options) > > if (err) > > goto close_trans; > > > > + clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize, > > + 0, 0, NULL); > > + > > If we have slab merging turned off, or we have two mounts from servers > with different msizes, we'll end up with two slabs called 9p-fcall-cache. > I'm OK with that, but are you? Yeah, the reason I didn't make it global like p9_req_cache is precisely to get two separate caches if the msizes are different. I actually considered adding msize to the string with snprintf or something but someone looking at it through slabinfo or similar will have the sizes anyway so I don't think this would bring anything, do you know if/think that tools will choke on multiple caches with the same name? I'm not sure about slab merging being disabled though, from the little I understand I do not see why anyone would do that except for debugging, and I'm fine with that. Please let me know if I'm missing something though! Thanks for the review, -- Dominique Martinet
Re: [PATCH] net/9p/trans_virtio.c: add a terminal char for mount tag
piaojun wrote on Wed, Aug 01, 2018: > chan->tag has no terminal char at last which will result in printing messy > code when debugging code. So we should add '\0' for tag. 9p is full of non null-terminated string so I'm not sure how I feel about it, is there anything wrong with how this is used or was this just when you tried to printf it? If it's just for debugging I'd suggest using the printf format "%.*s" with "chan->tag_len, chan->tag" arguments, That said it's not like this is costly, so I'll take it if someone else thinks this is helpful > > Signed-off-by: Jun Piao > --- > net/9p/trans_virtio.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c > index d422bfc..49d71d6 100644 > --- a/net/9p/trans_virtio.c > +++ b/net/9p/trans_virtio.c > @@ -585,7 +585,7 @@ static int p9_virtio_probe(struct virtio_device *vdev) > err = -EINVAL; > goto out_free_vq; > } > - tag = kmalloc(tag_len, GFP_KERNEL); > + tag = kzalloc(tag_len + 1, GFP_KERNEL); > if (!tag) { > err = -ENOMEM; > goto out_free_vq; > -- -- Dominique
Re: [PATCH] net/9p/trans_virtio.c: add a terminal char for mount tag
Greg Kurz wrote on Wed, Aug 01, 2018: > So this patch basically turns chan->tag into a nul terminated string, > which is indeed more convenient and robust. Maybe you can update the > rest of the code accordingly and drop chan->tag_len then ? If we can use that to simplify some other part of the code, that's certainly more appealing to me :) > FWIW, 9P strings received from the client are also converted to > nul terminated strings: Oh, good to know; I guess that makes sense when these strings come and go from/to other components of the kernel that likely expect that. -- Dominique
Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs
Greg Kurz wrote on Wed, Aug 01, 2018: > > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, > > unsigned int max_size) > > if (!req) > > return NULL; > > > > - req->tc = p9_fcall_alloc(alloc_msize); > > - req->rc = p9_fcall_alloc(alloc_msize); > > - if (!req->tc || !req->rc) > > + if (p9_fcall_alloc(&req->tc, alloc_msize)) > > + goto free; > > + if (p9_fcall_alloc(&req->rc, alloc_msize)) > > goto free; > > Hmm... if the first allocation fails, we will kfree() req->rc.sdata. > > Are we sure we won't have a stale pointer or uninitialized data in > there ? Yeah, Jun pointed that out and I have a v2 that only frees as needed with an extra goto (I sent an incremental diff in my reply to his comment here[1]) [1] https://lkml.kernel.org/r/20180731011256.GA30388@nautica > And even if we don't with the current code base, this is fragile and > could be easily broken. > > I think you should drop this hunk and rather rename p9_fcall_alloc() to > p9_fcall_alloc_sdata() instead, since this is what the function is > actually doing with this patch applied. Hmm. I agree the naming isn't accurate, but even if we rename it we'll need to pass a pointer to fcall as argument as it inits its capacity. p9_fcall_init(fc, msize) might be simpler? (I'm not sure I follow what you mean by 'drop this hunk', to be honest, did you want a single function call to init both maybe?) -- Dominique
Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
Greg Kurz wrote on Wed, Aug 01, 2018: > > diff --git a/net/9p/client.c b/net/9p/client.c > > index ba99a94a12c9..215e3b1ed7b4 100644 > > --- a/net/9p/client.c > > +++ b/net/9p/client.c > > @@ -231,15 +231,34 @@ static int parse_opts(char *opts, struct p9_client > > *clnt) > > return ret; > > } > > > > -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize) > > +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc, > > + int alloc_msize) > > { > > - fc->sdata = kmalloc(alloc_msize, GFP_NOFS); > > + if (c->fcall_cache && alloc_msize == c->msize) > > This is a presumably hot path for any request but the initial TVERSION, > you probably want likely() here... c->fcall_cache is indeed very likely, but alloc_size == c->msize not so much, as zc requests will be quite common for virtio and will be 4k in size. Although with that cache I'm starting to wonder if we should always use it... Speed-wise if there is no memory pressure the cache is likely going to be faster. If there is pressure and the items are reclaimed though that'll bring some heavier slow-down as it'll need to find bigger memory regions. I'm not sure which path we should favor, tbh. I'll keep these separate for now. For the first part of the check, Matthew suggested trying to trick msize into a different value to fail this check for the initial TVERSION call, but even after thinking it a bit I don't really see how to do that cleanly. I can at least make -that- likely()... > > > + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS); > > + else > > + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); > > if (!fc->sdata) > > return -ENOMEM; > > fc->capacity = alloc_msize; > > return 0; > > } > > > > +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc) > > +{ > > + /* sdata can be NULL for interrupted requests in trans_rdma, > > +* and kmem_cache_free does not do NULL-check for us > > +*/ > > + if (unlikely(!fc->sdata)) > > + return; > > + > > + if (c->fcall_cache && fc->capacity == c->msize) > > ... and here as well. For this one I'll unfortunately need to store in the fc how it has been allocated as slob doesn't allow to kmem_cache_free() a buffer allocated by kmalloc and in prevision of refs being refcounted in a hostile world the initial TVERSION req could be freed after fcall_cache is created :/ That's a bit of a burden but at least will reduce the checks to one here > > + kmem_cache_free(c->fcall_cache, fc->sdata); > > + else > > + kfree(fc->sdata); > > +} > > +EXPORT_SYMBOL(p9_fcall_free); > > + > > static struct kmem_cache *p9_req_cache; > > > > /** Anyway I've had as many comments as I could hope for, thanks everyone for the quick review. I'll send a v2 of both patches tomorrow -- Dominique
Re: [PATCH v2] net/9p/trans_virtio.c: add null terminal for mount tag
piaojun wrote on Thu, Aug 02, 2018: > chan->tag is Non-null terminated which will result in printing messy code > when debugging code. So we should add '\0' for tag to make the code more > convenient and robust. In addition, I drop char->tag_len to simplify the > code. Some new lines in commit message would be appreciated. That aside, I have a couple of nitpicks, but it looks good to me - thanks > > Signed-off-by: Jun Piao > --- > net/9p/trans_virtio.c | 15 +-- > 1 file changed, 5 insertions(+), 10 deletions(-) > > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c > index d422bfc..0fe9c37 100644 > --- a/net/9p/trans_virtio.c > +++ b/net/9p/trans_virtio.c > @@ -89,10 +89,8 @@ struct virtio_chan { > unsigned long p9_max_pages; > /* Scatterlist: can be too big for stack. */ > struct scatterlist sg[VIRTQUEUE_NUM]; > - > - int tag_len; > /* > - * tag name to identify a mount Non-null terminated > + * tag name to identify a mount null terminated >*/ > char *tag; > > @@ -529,10 +527,9 @@ static ssize_t p9_mount_tag_show(struct device *dev, > vdev = dev_to_virtio(dev); > chan = vdev->priv; > > - memcpy(buf, chan->tag, chan->tag_len); > - buf[chan->tag_len] = 0; > + memcpy(buf, chan->tag, strlen(chan->tag) + 1); > > - return chan->tag_len + 1; > + return strlen(chan->tag) + 1; Use a local variable for strlen(chan->tag)? > } > > static DEVICE_ATTR(mount_tag, 0444, p9_mount_tag_show, NULL); > @@ -585,7 +582,7 @@ static int p9_virtio_probe(struct virtio_device *vdev) > err = -EINVAL; > goto out_free_vq; > } > - tag = kmalloc(tag_len, GFP_KERNEL); > + tag = kzalloc(tag_len + 1, GFP_KERNEL); > if (!tag) { > err = -ENOMEM; > goto out_free_vq; > @@ -594,7 +591,6 @@ static int p9_virtio_probe(struct virtio_device *vdev) > virtio_cread_bytes(vdev, offsetof(struct virtio_9p_config, tag), > tag, tag_len); > chan->tag = tag; > - chan->tag_len = tag_len; > err = sysfs_create_file(&(vdev->dev.kobj), &dev_attr_mount_tag.attr); > if (err) { > goto out_free_tag; > @@ -654,8 +650,7 @@ static int p9_virtio_probe(struct virtio_device *vdev) > > mutex_lock(&virtio_9p_lock); > list_for_each_entry(chan, &virtio_chan_list, chan_list) { > - if (!strncmp(devname, chan->tag, chan->tag_len) && > - strlen(devname) == chan->tag_len) { > + if (!strncmp(devname, chan->tag, strlen(chan->tag) + 1)) { strncmp(x, y, strlen(y)+1) is precisely what strcmp does so let's use the simpler version -- Dominique
[PATCH v2 2/2] net/9p: add a per-client fcall kmem_cache
From: Dominique Martinet Having a specific cache for the fcall allocations helps speed up allocations a bit, especially in case of non-"round" msizes. The caches will automatically be merged if there are multiple caches of items with the same size so we do not need to try to share a cache between different clients of the same size. Since the msize is negotiated with the server, only allocate the cache after that negotiation has happened - previous allocations or allocations of different sizes (e.g. zero-copy fcall) are made with kmalloc directly. Signed-off-by: Dominique Martinet Cc: Matthew Wilcox Cc: Greg Kurz Cc: Jun Piao --- v2: - Add a pointer to the cache in p9_fcall to make sure a buffer allocated with kmalloc gets freed with kfree and vice-versa This could have been smaller with a bool but this spares having to look at the client so looked a bit cleaner, I'm expecting this patch will need a v3 one way or another so I went for the bolder approach - please say if you think a smaller item is better ; I *think* nothing relies on this being ordered the same way as the data on the wire (struct isn't packed anyway) so we can move id after tag and add another u8 to not have any overhead - added likely() to cache existence check in allocation, but nothing for msize check or free because of zc request being of different size include/net/9p/9p.h | 1 + include/net/9p/client.h | 2 ++ net/9p/client.c | 34 -- net/9p/trans_rdma.c | 2 +- 4 files changed, 32 insertions(+), 7 deletions(-) diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h index e23896116d9a..f1d2ed3cee61 100644 --- a/include/net/9p/9p.h +++ b/include/net/9p/9p.h @@ -558,6 +558,7 @@ struct p9_fcall { size_t offset; size_t capacity; + struct kmem_cache *cache; u8 *sdata; }; diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 4b4ac1362ad5..735f3979d559 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -123,6 +123,7 @@ struct p9_client { struct p9_trans_module *trans_mod; enum p9_trans_status status; void *trans; + struct kmem_cache *fcall_cache; union { struct { @@ -230,6 +231,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, const char *name, int mode, kgid_t gid, struct p9_qid *); int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 *status); int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl); +void p9_fcall_fini(struct p9_fcall *fc); struct p9_req_t *p9_tag_lookup(struct p9_client *, u16); void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status); diff --git a/net/9p/client.c b/net/9p/client.c index bc40bb11b832..0e0f8bb3fd3c 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -231,19 +231,36 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize) +static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc, +int alloc_msize) { - fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (likely(c->fcall_cache) && alloc_msize == c->msize) { + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS); + fc->cache = c->fcall_cache; + } else { + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + fc->cache = NULL; + } if (!fc->sdata) return -ENOMEM; fc->capacity = alloc_msize; return 0; } -static void p9_fcall_fini(struct p9_fcall *fc) +void p9_fcall_fini(struct p9_fcall *fc) { - kfree(fc->sdata); + /* sdata can be NULL for interrupted requests in trans_rdma, +* and kmem_cache_free does not do NULL-check for us +*/ + if (unlikely(!fc->sdata)) + return; + + if (fc->cache) + kmem_cache_free(fc->cache, fc->sdata); + else + kfree(fc->sdata); } +EXPORT_SYMBOL(p9_fcall_fini); static struct kmem_cache *p9_req_cache; @@ -266,9 +283,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) if (!req) return NULL; - if (p9_fcall_init(&req->tc, alloc_msize)) + if (p9_fcall_init(c, &req->tc, alloc_msize)) goto free_req; - if (p9_fcall_init(&req->rc, alloc_msize)) + if (p9_fcall_init(c, &req->rc, alloc_msize)) goto free; p9pdu_reset(&req->tc); @@ -950,6 +967,7 @@ struct p9_client *p9_client_create(const char *dev_name, char *options) clnt->trans_mod = NULL; clnt->trans = NULL; + clnt->fcall_cache = NULL; client_id = utsname()->nodename; memcpy(clnt->name, client_id, strl
[PATCH v2 1/2] net/9p: embed fcall in req to round down buffer allocs
From: Dominique Martinet 'msize' is often a power of two, or at least page-aligned, so avoiding an overhead of two dozen bytes for each allocation will help the allocator do its work and reduce memory fragmentation. Suggested-by: Matthew Wilcox Signed-off-by: Dominique Martinet Cc: Matthew Wilcox Cc: Greg Kurz Cc: Jun Piao --- v2: - Add extra label to not free uninitialized memory on alloc failure - Rename p9_fcall_alloc to 9p_fcall_init - Add a p9_fcall_fini function to echo to init include/net/9p/client.h | 4 +- net/9p/client.c | 166 net/9p/trans_fd.c | 12 +-- net/9p/trans_rdma.c | 29 +++ net/9p/trans_virtio.c | 18 ++--- net/9p/trans_xen.c | 12 +-- 6 files changed, 123 insertions(+), 118 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index a4dc42c53d18..4b4ac1362ad5 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -95,8 +95,8 @@ struct p9_req_t { int status; int t_err; wait_queue_head_t wq; - struct p9_fcall *tc; - struct p9_fcall *rc; + struct p9_fcall tc; + struct p9_fcall rc; void *aux; struct list_head req_list; }; diff --git a/net/9p/client.c b/net/9p/client.c index 88db45966740..bc40bb11b832 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -231,15 +231,18 @@ static int parse_opts(char *opts, struct p9_client *clnt) return ret; } -static struct p9_fcall *p9_fcall_alloc(int alloc_msize) +static int p9_fcall_init(struct p9_fcall *fc, int alloc_msize) { - struct p9_fcall *fc; - fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS); - if (!fc) - return NULL; + fc->sdata = kmalloc(alloc_msize, GFP_NOFS); + if (!fc->sdata) + return -ENOMEM; fc->capacity = alloc_msize; - fc->sdata = (char *) fc + sizeof(struct p9_fcall); - return fc; + return 0; +} + +static void p9_fcall_fini(struct p9_fcall *fc) +{ + kfree(fc->sdata); } static struct kmem_cache *p9_req_cache; @@ -263,13 +266,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) if (!req) return NULL; - req->tc = p9_fcall_alloc(alloc_msize); - req->rc = p9_fcall_alloc(alloc_msize); - if (!req->tc || !req->rc) + if (p9_fcall_init(&req->tc, alloc_msize)) + goto free_req; + if (p9_fcall_init(&req->rc, alloc_msize)) goto free; - p9pdu_reset(req->tc); - p9pdu_reset(req->rc); + p9pdu_reset(&req->tc); + p9pdu_reset(&req->rc); req->status = REQ_STATUS_ALLOC; init_waitqueue_head(&req->wq); INIT_LIST_HEAD(&req->req_list); @@ -281,7 +284,7 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) GFP_NOWAIT); else tag = idr_alloc(&c->reqs, req, 0, P9_NOTAG, GFP_NOWAIT); - req->tc->tag = tag; + req->tc.tag = tag; spin_unlock_irq(&c->lock); idr_preload_end(); if (tag < 0) @@ -290,8 +293,9 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) return req; free: - kfree(req->tc); - kfree(req->rc); + p9_fcall_fini(&req->tc); + p9_fcall_fini(&req->rc); +free_req: kmem_cache_free(p9_req_cache, req); return ERR_PTR(-ENOMEM); } @@ -329,14 +333,14 @@ EXPORT_SYMBOL(p9_tag_lookup); static void p9_free_req(struct p9_client *c, struct p9_req_t *r) { unsigned long flags; - u16 tag = r->tc->tag; + u16 tag = r->tc.tag; p9_debug(P9_DEBUG_MUX, "clnt %p req %p tag: %d\n", c, r, tag); spin_lock_irqsave(&c->lock, flags); idr_remove(&c->reqs, tag); spin_unlock_irqrestore(&c->lock, flags); - kfree(r->tc); - kfree(r->rc); + p9_fcall_fini(&r->tc); + p9_fcall_fini(&r->rc); kmem_cache_free(p9_req_cache, r); } @@ -368,7 +372,7 @@ static void p9_tag_cleanup(struct p9_client *c) */ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status) { - p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag); + p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc.tag); /* * This barrier is needed to make sure any change made to req before @@ -378,7 +382,7 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status) req->status = status; wake_up(&req->wq); - p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag); + p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc.tag); } EXPORT_SYMBOL(p9_client_cb); @@ -449,18 +453,18 @@ s
Re: [V9fs-developer] [PATCH v2 2/2] net/9p: add a per-client fcall kmem_cache
Dominique Martinet wrote on Thu, Aug 02, 2018: > [...] > + clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize, > + 0, 0, NULL); Well, my gut feeling that I'd need a v3 was right, after a bit more time testing (slightly different setup), I got this warning: Bad or missing usercopy whitelist? Kernel memory overwrite attempt detected to SLUB object '9p-fcall-cache' (offset 23, size 55078)! So this kmem_cache_create needs to change to kmem_cache_create_usercopy, but I'm not sure how to best specify the range -- this warning was about writing data to the buffer so a TWRITE: size[4] Twrite tag[2] fid[4] offset[8] count[4] data[count] (the data[count] part comes from user - this matches 4+1+2+4+8+4 = 23) Similarily RREAD has data that can be copied to userspace, which has a similar check: size[4] Rread tag[2] count[4] data[count] So I could hardcode offset = 4+1+2+4=11, usercopy size = msize - 11... We have some P9_*HDR*SZ macros but none are really appropriate: --- /* ample room for Twrite/Rread header */ #define P9_IOHDRSZ 24 /* Room for readdir header */ #define P9_READDIRHDRSZ 24 /* size of header for zero copy read/write */ #define P9_ZC_HDR_SZ 4096 --- It's actually been bugging me for a while that I see hardcoded '7's for 9p main header size (len + msg type + tag) everywhere, I could add a first P9_HDRSZ of 7 and use that + 4? -- Dominique
Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests
Greg Kurz wrote on Mon, Jul 23, 2018: > The patch is quite big and I'm not sure I can find time to review it > carefully, but I'll try to help anyway. No worry, thanks for this already. > > Sorry for coming back to this patch now, I just noticed something that's > > actually probably a fairly big hit on performance... > > > > While the slab is just as good as the array for the request itself, this > > makes every single request allocate "fcalls" everytime instead of > > reusing a cached allocation. > > The default msize is 8k and these allocs probably are fairly efficient, > > but some transports like RDMA allow to increase this to up to 1MB... And > > It can be even bigger with virtio: > > #define VIRTQUEUE_NUM 128 > > .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3), > > On a typical ppc64 server class setup with 64KB pages, this is nearly 8MB. I don't think I'll be able to test 64KB pages, and it's "just" 500k with 4K pages so I'll go with IB. I just finished reinstalling my IB-enabled VMs, now to get some iops test running (dbench maybe) and I'll get some figures to be able to play with different models and evaluate the impact of these. > > One thing is that the buffers are all going to be the same size for a > > given client ( except virtio zc buffers, I wonder what I'm missing > > or why that didn't blow up before?) > > ZC allocates a 4KB buffer, which is more than enough to hold the 7-byte 9P > header and the "dqd" part of all messages that may use ZC, ie, 16 bytes. > So I'm not sure to catch what could blow up. ZC requests won't blow up, but from what I can see with the current (old) request cache array, if a ZC request has a not-yet used tag it'll allocate a new 4k buffer, then if a normal request uses that tag it'll get the 4k buffer instead of an msize sized one. On the client size the request would be posted with req->rc->capacity which would correctly be 4k, but I'm not sure what would happen if qemu tries to write more than the given size to that request? > > It's a shame because I really like that patch, I'll try to find time to > > run some light benchmark with varying msizes eventually but I'm not sure > > when I'll find time for that... Hopefully before the 4.19 merge window! > > > > Yeah, the open-coded cache we have now really obfuscates things. > > Maybe have a per-client kmem_cache object for non-ZC requests with > size msize [*], and a global kmem_cache object for ZC requests with > fixed size P9_ZC_HDR_SZ. > > [*] the server can require a smaller msize during version negotiation, > so maybe we should change the kmem_cache object in this case. Yeah, if we're going to want to accomodate non-power of two buffers, I think we'll need a separate kmem_cache for them. The ZC requests could be made into exactly 4k and these could come with regular kmalloc just fine, it looks like trying to create a cache of that size would just return the same cache used by kmalloc anyway so it's probably easier to fall back to kmalloc if requested alloc size doesn't match what we were hoping for. I'll try to get figures for various approaches before the merge window for 4.19 starts, it's getting closer though... -- Dominique
Re: [PATCH] fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
piaojun wrote on Wed, Jul 25, 2018: > In my testing, v9fs_fid_xattr_set will return successfully even if the > backend ext4 filesystem has no space to store xattr key-value. That will > cause inconsistent behavior between front end and back end. The reason is > that lsetxattr will be triggered by p9_client_clunk, and unfortunately we > did not catch the error. This patch will catch the error to notify upper > caller. > > p9_client_clunk (in 9p) > p9_client_rpc(clnt, P9_TCLUNK, "d", fid->fid); > v9fs_clunk (in qemu) > put_fid > free_fid > v9fs_xattr_fid_clunk > v9fs_co_lsetxattr > s->ops->lsetxattr > ext4_xattr_user_set (in host ext4 filesystem) > > Signed-off-by: Jun Piao Good catch! LGTM on its own but see comment below for further error checking. Please let me know if you want to do this in a v2 or submit a separate patch altogether, I can pick this one up as it is. > --- > fs/9p/xattr.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/9p/xattr.c b/fs/9p/xattr.c > index f329eee..352abc3 100644 > --- a/fs/9p/xattr.c > +++ b/fs/9p/xattr.c > @@ -105,7 +105,7 @@ int v9fs_fid_xattr_set(struct p9_fid *fid, const char > *name, > { > struct kvec kvec = {.iov_base = (void *)value, .iov_len = value_len}; > struct iov_iter from; > - int retval; > + int retval, err; > > iov_iter_kvec(&from, WRITE | ITER_KVEC, &kvec, 1, value_len); > > @@ -126,7 +126,9 @@ int v9fs_fid_xattr_set(struct p9_fid *fid, const char > *name, >retval); > else > p9_client_write(fid, 0, &from, &retval); I'm not sure what to do about this but it's also possible for p9_client_write to not write the full length without setting and error. We should probably compare the return value of p9_client_write and value_len to detect short writes and return an error in this case. > - p9_client_clunk(fid); > + err = p9_client_clunk(fid); > + if (!retval && err) > + retval = err; > return retval; > } > > --
Re: [PATCH] fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
piaojun wrote on Wed, Jul 25, 2018: > On 2018/7/25 11:32, Dominique Martinet wrote: > >>p9_client_write(fid, 0, &from, &retval); > > > > I'm not sure what to do about this but it's also possible for > > p9_client_write to not write the full length without setting and error. > > > > We should probably compare the return value of p9_client_write and > > value_len to detect short writes and return an error in this case. > > It looks like we could identify short writes error from the *err. If no > error case breaks the while loop, the total equal to the whole data len. Ah, I looked too fast and missed the while loop in p9_client_write... It's rather unusual for such a low level call to handle retries, but I guess that's 9p. Right, will take that patch as is then. Thanks, -- Dominique
Re: [PATCH] 9p: validate PDU length
Tomas Bortoli wrote on Mon, Jul 23, 2018: > diff --git a/net/9p/client.c b/net/9p/client.c > index 18c5271910dc..92240ccf476b 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -524,6 +525,12 @@ static int p9_check_errors(struct p9_client *c, struct > p9_req_t *req) > int ecode; > > err = p9_parse_header(req->rc, NULL, &type, NULL, 0); > + if (req->rc->size >= c->msize) { I was looking at this again, I think it's more appropriate to use req->rc->capacity at this point like you did in the first version of the patch. I had suggested msize in the common p9_parse_header function because that'd let us accept zc requests where the size in the pdu could be bigger than capacity, but this isn't the case in p9_check_errors. If you're ok with this I'll edit your commit directly, this is less work for me than having to check a new patch. Thanks, -- Dominique
Re: [V9fs-developer] [PATCH 5/6] 9p: Use a slab for allocating requests
Matthew Wilcox wrote on Wed, Jul 11, 2018: > On Wed, Jul 11, 2018 at 03:33:13PM +0200, Dominique Martinet wrote: > > Well this appears to work but P9_NOTAG being '(u16)(~0)' I'm not too > > confident with P9_NOTAG + 1. . . it doesn't look like it's overflowing > > before the cast on my laptop but is that guaranteed? > > By my understanding of n1256.pdf ... this falls under 6.3.1.8 ("Usual > arithmetic conversions"). We have a u16 and an int. Therefore this > rule applies: > > Otherwise, if the type of the operand with signed integer type can > represent all of the values of the type of the operand with unsigned > integer type, then the operand with unsigned integer type is converted > to the type of the operand with signed integer type. Thanks for checking, that'll work then. > > I do not see any call to idr_destroy, is that OK? > > Yes, that's fine. It used to be (back in 2013) that one had to call > idr_destroy() in order to free the preallocated idr data structures. > Now it's a no-op if called on an empty IDR, and I would expect that both > IDRs are empty at the time that it comes to unloading the module (and if > they aren't, we probably have bigger problems than a small memory leak). > Some users like to assert that the IDR is empty; most do not go to that > extent of defensive programming. Ok, I agree we're not there yet. Just comments nitpicks, then :) -- Dominique Martinet
Re: [PATCH v2 0/6] 9p: Use IDRs more effectively
Matthew Wilcox wrote on Wed, Jul 11, 2018: > The 9p code doesn't take advantage of the IDR's ability to store > a pointer. We can actually get rid of the p9_idpool abstraction > and the multi-dimensional array of requests. > > v2: Address feedback from Dominique. Thanks, I've picked them up for 4.19 ( git://github.com/martinetd/linux 9p-next as per mail on v9fs-developer list, "Current 9P patches - test branch") This shouldn't stop anyone else from doing more reviews/test, I'm doing this 9p-patch-gathering on no autority and I've only checked very basic things for now. -- Dominique Martinet
Re: [V9fs-developer] [PATCH] net/9p/client.c: fix misuse of spin_lock_irqsave for p9_client lock
piaojun wrote on Thu, Jul 12, 2018: > In p9_read_work(), we use spin_lock for client->lock, but misuse > spin_lock_irqsave for it in p9_fid_create(). As p9_client lock won't be > locked in irq context, so spin_lock is enough. And that will improve the > performance. Agreed on principle, see remark below > Signed-off-by: Jun Piao > --- > net/9p/client.c | 17 +++-- > net/9p/trans_fd.c | 7 +++ > 2 files changed, 10 insertions(+), 14 deletions(-) > > diff --git a/net/9p/client.c b/net/9p/client.c > index 8bc8b3e..b05cbfc 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -260,7 +260,6 @@ static struct p9_fcall *p9_fcall_alloc(int alloc_msize) > static struct p9_req_t * > p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size) > { > - unsigned long flags; > int row, col; > struct p9_req_t *req; > int alloc_msize = min(c->msize, max_size); > @@ -270,7 +269,7 @@ static struct p9_fcall *p9_fcall_alloc(int alloc_msize) > tag++; > > if (tag >= c->max_tag) { > - spin_lock_irqsave(&c->lock, flags); > + spin_lock(&c->lock); This code doesn't exist anymore with Matthew's idr rework, could you submit that patch based on top of my 9p-next branch? (unless you really want Andrew to take this for the next 4.18-rc, but I'm not convinced this qualifies) Please see my "Current 9P patches - test branch" for details: https://sourceforge.net/p/v9fs/mailman/message/36365359/ -- Dominique Martinet
Re: [PATCH v2 3/6] 9p: Replace the fidlist with an IDR
Matthew Wilcox wrote on Wed, Jul 11, 2018: > diff --git a/net/9p/client.c b/net/9p/client.c > index 389a2904b7b3..b89c7298267c 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -908,30 +908,29 @@ static struct p9_fid *p9_fid_create(struct p9_client > *clnt) > { > int ret; > struct p9_fid *fid; > - unsigned long flags; > > p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); > fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); > if (!fid) > return NULL; > > - ret = p9_idpool_get(clnt->fidpool); > - if (ret < 0) > - goto error; > - fid->fid = ret; > - > memset(&fid->qid, 0, sizeof(struct p9_qid)); Ah, I had missed that you didn't update this memset as you said in reply to comment on v1. Could you resend just this patch and either initialize fid->fid or use kzalloc for the fid allocation? > fid->mode = -1; > fid->uid = current_fsuid(); > fid->clnt = clnt; > fid->rdir = NULL; > - spin_lock_irqsave(&clnt->lock, flags); > - list_add(&fid->flist, &clnt->fidlist); > - spin_unlock_irqrestore(&clnt->lock, flags); > + fid->fid = 0; > > - return fid; > + idr_preload(GFP_KERNEL); > + spin_lock_irq(&clnt->lock); > + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1, > + GFP_NOWAIT); If you do resend, alignment here was wrong. Thanks, -- Dominique Martinet
Re: [PATCH v2 3/6] 9p: Replace the fidlist with an IDR
Matthew Wilcox wrote on Thu, Jul 12, 2018: > > Ah, I had missed that you didn't update this memset as you said in reply > > to comment on v1. > > Rather than update the memset, I ... > > > Could you resend just this patch and either initialize fid->fid or use > > kzalloc for the fid allocation? > > > > > + fid->fid = 0; > > Did that instead ;-) > > If I were going to initialise the entire structure to 0, I'd replace > the kmalloc with kzalloc and drop the memset altogether. Oh, I'm blind, sorry! :) > > If you do resend, alignment here was wrong. > > I think this warning from checkpatch is complete bullshit. It's > really none of checkpatch's business how I choose to align function > arguments. > > That said, if you want it to be aligned some particular way, feel free > to adjust the whitespace. I don't care. I would tend to agree that sometimes checkpatch is overzealous, but having worked on projects where code style is all over the place it really feels much more comfortable to have a consistent style everywhere. Thanks for the permission, I'll adjust it (assuming this does end up getting pulled from my branch, but nobody yelled so far) -- Dominique Martinet
Re: [Intel-gfx] [PATCH] i915/intel_tv_get_modes: fix strncpy truncation warning
Ville Syrjälä wrote on Thu, Jul 12, 2018: > On Wed, Jul 11, 2018 at 09:46:15AM +0200, Dominique Martinet wrote: > > This is effectively no-op as the next line writes a nul at the final > > What is "This". Please write self contained commit messages. This could either be 'this commit' as a whole or if you look only at the commit message 'this strncpy fix' from the title (which is arguably the same), and both interpretations sound fairly understandable in the context of the title line without seeing the patch to me... Although I'll admit this is difficult to judge of that as the author. Thanksfully, the v2 of the patch didn't use this wording but while I agree the message could be better I do not think it was horrible. > > drivers/gpu/drm/i915/intel_tv.c: In function ‘intel_tv_get_modes’: > > drivers/gpu/drm/i915/intel_tv.c:1358:3: error: ‘strncpy’ specified bound 32 > > equals destination size [-Werror=stringop-truncation] > >strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN); > >^~ > > cc1: all warnings being treated as errors > > That warning should be in the actual commit message. Yes and no, I gave it for referrence but when you update to gcc 8 you will literally see it all over the place. The words "strncpy truncation warning" is really precise once you've seen them a few times and there are litteraly hundred of these warnings in the kernel, some have already been fixed taking a glance at the git log, some with and without the warning message. I don't think it's worth polluting the git log with this many warnings... Which leads to... > This same pattern is used all over drm. Can you go and fix them all up? > One might even consider writing a cocci patch for it ;) Now this is something I can agree with. This patch really was just a stop-gap measure because I could not build the kernel at all without it, but yes I did consider having a look at others. Unfortunately coccinelle does not run on fedora 28 (and doesn't look like it will fix itself any time soon, there is a bug report[1] open since February that didn't get much love lately - I was just looking at it a few days ago) I think in this case it might actually be faster to look at gcc warnings and s/strncpy/strlcpy/, but I am curious about Coccinelle so this is a good excuse to look at it, I'll report back in a bit after poking at that bug report and figuring out how coccinelle works. I do not guarantee speed however, if anyone sees this and feels put off from donig it themselves, please go ahead and just drop me a word. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1544204 Thanks, and sorry for the mail longer than I originally intended, -- Dominique Martinet
Re: [V9fs-developer] [PATCH v2 4/6] 9p: Embed wait_queue_head into p9_req_t
Greg Kurz wrote on Thu, Jul 12, 2018: > This is true when all tags have been used at least once. But the current code > lazily allocates the wait_queue_head_t, ie, only when a tag is used for the > first time. Your patch causes a full row of wait_quest_head_t to be > pre-allocated. > > ie, P9_ROW_MAXTAG * 24 = 255 * 24 = 6120 > > instead of (P9_ROW_MAXTAG * 8) + 24 = 255 * 8 + 24 = 2064 > > This is nearly a page of allocated memory that might be never used. > > Not sure if this is a problem though... I thought the exact same, but the next patch in the serie removes that array and allocates even more lazily with a slab, so this was a no-brainer :) -- Dominique Martinet
[PATCH 12/18] test_power: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch drivers/power/supply/test_power.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/power/supply/test_power.c b/drivers/power/supply/test_power.c index 57246cdbd042..64adf630f64f 100644 --- a/drivers/power/supply/test_power.c +++ b/drivers/power/supply/test_power.c @@ -297,8 +297,7 @@ static int map_get_value(struct battery_property_map *map, const char *key, char buf[MAX_KEYLENGTH]; int cr; - strncpy(buf, key, MAX_KEYLENGTH); - buf[MAX_KEYLENGTH-1] = '\0'; + strlcpy(buf, key, MAX_KEYLENGTH); cr = strnlen(buf, MAX_KEYLENGTH) - 1; if (cr < 0) -- 2.17.1
[PATCH 14/18] kdb_support: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch kernel/debug/kdb/kdb_support.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/debug/kdb/kdb_support.c b/kernel/debug/kdb/kdb_support.c index 990b3cc526c8..1f6a4b6bde0b 100644 --- a/kernel/debug/kdb/kdb_support.c +++ b/kernel/debug/kdb/kdb_support.c @@ -119,8 +119,7 @@ int kdbnearsym(unsigned long addr, kdb_symtab_t *symtab) * What was Rusty smoking when he wrote that code? */ if (symtab->sym_name != knt1) { - strncpy(knt1, symtab->sym_name, knt1_size); - knt1[knt1_size-1] = '\0'; + strlcpy(knt1, symtab->sym_name, knt1_size); } for (i = 0; i < ARRAY_SIZE(kdb_name_table); ++i) { if (kdb_name_table[i] && -- 2.17.1
[PATCH 05/18] iio: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch drivers/iio/common/st_sensors/st_sensors_core.c | 3 +-- drivers/iio/pressure/st_pressure_i2c.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/iio/common/st_sensors/st_sensors_core.c b/drivers/iio/common/st_sensors/st_sensors_core.c index 57db19182e95..26fbd1bd9413 100644 --- a/drivers/iio/common/st_sensors/st_sensors_core.c +++ b/drivers/iio/common/st_sensors/st_sensors_core.c @@ -380,8 +380,7 @@ void st_sensors_of_name_probe(struct device *dev, return; /* The name from the OF match takes precedence if present */ - strncpy(name, of_id->data, len); - name[len - 1] = '\0'; + strlcpy(name, of_id->data, len); } EXPORT_SYMBOL(st_sensors_of_name_probe); #else diff --git a/drivers/iio/pressure/st_pressure_i2c.c b/drivers/iio/pressure/st_pressure_i2c.c index fbb59059e942..2026a1012012 100644 --- a/drivers/iio/pressure/st_pressure_i2c.c +++ b/drivers/iio/pressure/st_pressure_i2c.c @@ -94,9 +94,8 @@ static int st_press_i2c_probe(struct i2c_client *client, if ((ret < 0) || (ret >= ST_PRESS_MAX)) return -ENODEV; - strncpy(client->name, st_press_id_table[ret].name, + strlcpy(client->name, st_press_id_table[ret].name, sizeof(client->name)); - client->name[sizeof(client->name) - 1] = '\0'; } else if (!id) return -ENODEV; -- 2.17.1
[PATCH 08/18] myricom: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c index b2d2ec8c11e2..f7178cdb6bd8 100644 --- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c +++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c @@ -553,8 +553,7 @@ myri10ge_validate_firmware(struct myri10ge_priv *mgp, } /* save firmware version for ethtool */ - strncpy(mgp->fw_version, hdr->version, sizeof(mgp->fw_version)); - mgp->fw_version[sizeof(mgp->fw_version) - 1] = '\0'; + strlcpy(mgp->fw_version, hdr->version, sizeof(mgp->fw_version)); sscanf(mgp->fw_version, "%d.%d.%d", &mgp->fw_ver_major, &mgp->fw_ver_minor, &mgp->fw_ver_tiny); -- 2.17.1
[PATCH 17/18] perf: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch tools/perf/util/bpf-loader.h | 3 +-- tools/perf/util/util.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h index 5d3aefd6fae7..8d08a1fc97a0 100644 --- a/tools/perf/util/bpf-loader.h +++ b/tools/perf/util/bpf-loader.h @@ -143,10 +143,9 @@ __bpf_strerror(char *buf, size_t size) { if (!size) return 0; - strncpy(buf, + strlcpy(buf, "ERROR: eBPF object loading is disabled during compiling.\n", size); - buf[size - 1] = '\0'; return 0; } diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c index eac5b858a371..8b9e3aa7aad3 100644 --- a/tools/perf/util/util.c +++ b/tools/perf/util/util.c @@ -459,8 +459,7 @@ fetch_kernel_version(unsigned int *puint, char *str, return -1; if (str && str_size) { - strncpy(str, utsname.release, str_size); - str[str_size - 1] = '\0'; + strlcpy(str, utsname.release, str_size); } if (!puint || int_ver_ready) -- 2.17.1
[PATCH 13/18] ibmvscsi: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch drivers/scsi/ibmvscsi/ibmvscsi.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c index 17df76f0be3c..79eb8af03a19 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c @@ -1274,14 +1274,12 @@ static void send_mad_capabilities(struct ibmvscsi_host_data *hostdata) if (hostdata->client_migrated) hostdata->caps.flags |= cpu_to_be32(CLIENT_MIGRATED); - strncpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev), + strlcpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev), sizeof(hostdata->caps.name)); - hostdata->caps.name[sizeof(hostdata->caps.name) - 1] = '\0'; location = of_get_property(of_node, "ibm,loc-code", NULL); location = location ? location : dev_name(hostdata->dev); - strncpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc)); - hostdata->caps.loc[sizeof(hostdata->caps.loc) - 1] = '\0'; + strlcpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc)); req->common.type = cpu_to_be32(VIOSRP_CAPABILITIES_TYPE); req->buffer = cpu_to_be64(hostdata->caps_addr); -- 2.17.1
[PATCH 15/18] blktrace: change strncpy+truncation to strlcpy
Using strlcpy fixes this new gcc warning: kernel/trace/blktrace.c: In function ‘do_blk_trace_setup’: kernel/trace/blktrace.c:497:2: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation] strncpy(buts->name, name, BLKTRACE_BDEV_SIZE); ^ Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch kernel/trace/blktrace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 987d9a9ae283..2478d9838eab 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -494,8 +494,7 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, if (!buts->buf_size || !buts->buf_nr) return -EINVAL; - strncpy(buts->name, name, BLKTRACE_BDEV_SIZE); - buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0'; + strlcpy(buts->name, name, BLKTRACE_BDEV_SIZE); /* * some device names have larger paths - convert the slashes -- 2.17.1
[PATCH 16/18] tools/accounting: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch tools/accounting/getdelays.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c index 9f420d98b5fb..66817a7a4fce 100644 --- a/tools/accounting/getdelays.c +++ b/tools/accounting/getdelays.c @@ -314,8 +314,7 @@ int main(int argc, char *argv[]) err(1, "Invalid rcv buf size\n"); break; case 'm': - strncpy(cpumask, optarg, sizeof(cpumask)); - cpumask[sizeof(cpumask) - 1] = '\0'; + strlcpy(cpumask, optarg, sizeof(cpumask)); maskset = 1; printf("cpumask %s maskset %d\n", cpumask, maskset); break; -- 2.17.1
[PATCH 18/18] cpupower: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci Signed-off-by: Dominique Martinet --- Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the serie) for the motivation behind this patch tools/power/cpupower/bench/parse.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/tools/power/cpupower/bench/parse.c b/tools/power/cpupower/bench/parse.c index 9ba8a44ad2a7..1566b89989b2 100644 --- a/tools/power/cpupower/bench/parse.c +++ b/tools/power/cpupower/bench/parse.c @@ -221,9 +221,8 @@ int prepare_config(const char *path, struct config *config) sscanf(val, "%u", &config->cpu); else if (strcmp("governor", opt) == 0) { - strncpy(config->governor, val, + strlcpy(config->governor, val, sizeof(config->governor)); - config->governor[sizeof(config->governor) - 1] = '\0'; } else if (strcmp("priority", opt) == 0) { -- 2.17.1
Re: [Cocci] [PATCH 01/18] coccinelle: change strncpy+truncation to strlcpy
Himanshu Jha wrote on Fri, Jul 13, 2018: > > I expect each maintainer will pick their share of the patchs if they > > agree with it and the rest will just be dropped? > > Masahiro Yamada takes coccinelle patches, > so please cc him or your patch would be lost. Thanks, will do. > > +virtual patch > > +virtual context > > You might consider adding context rule or remove this line perhaps ? Victim of copypasta, I'll remove this. > > +-strncpy@p( > > ++strlcpy( > > + dest, src, sz); > > +-dest[sz - 1] = '\0'; > > The above rule produces an output that I think is not correct: > -- > diff = > diff -u -p a//ti/wl1251/acx.c b//ti/wl1251/acx.c > --- a//ti/wl1251/acx.c > +++ b//ti/wl1251/acx.c > @@ -150,14 +150,7 @@ int wl1251_acx_fw_version(struct wl1251 > } > > /* be careful with the buffer sizes */ > - strncpy(buf, rev->fw_version, min(len, sizeof(rev->fw_version))); > - > - /* > - * if the firmware version string is exactly > - * sizeof(rev->fw_version) long or fw_len is less than > - * sizeof(rev->fw_version) it won't be null terminated > - */ > - buf[min(len, sizeof(rev->fw_version)) - 1] = '\0'; > + strlcpy(buf, rev->fw_version, min(len, sizeof(rev->fw_version))); > > - > > I think the comment is useful and should not be removed. I agree this comment is useful now that I'm taking a closer look, I glanced at this too fast. I'm not sure how to make coccinelle not remove comments between lines though? > Also, consider changing Confidence level appropriately. I am (was?) pretty confident on the change itself, the only exceptions would be if someone relied on strncpy to fill the end of the buffer with zero to not leak data somewhere but that is not easy to judge by itself (although I hope rare enough) I'm honestly not sure what would be appropriate in this case. -- Dominique Martinet
Re: [V9fs-developer] [PATCH] net/9p: Fix a deadlock case in the virtio transport
jiangyiwen wrote on Sat, Jul 14, 2018: > When client has multiple threads that issue io requests all the > time, and the server has a very good performance, it may cause > cpu is running in the irq context for a long time because it can > check virtqueue has buf in the *while* loop. > > So we should keep chan->lock in the whole loop. Hmm, this is generally bad practice to hold a spin lock for long. In general, spin locks are meant to protect data, not code. I'd want some numbers to decide on this one, even if I think this particular case is safe (e.g. this cannot dead-lock) > Signed-off-by: Yiwen Jiang > --- > net/9p/trans_virtio.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c > index 05006cb..9b0f5f2 100644 > --- a/net/9p/trans_virtio.c > +++ b/net/9p/trans_virtio.c > @@ -148,20 +148,18 @@ static void req_done(struct virtqueue *vq) > > p9_debug(P9_DEBUG_TRANS, ": request done\n"); > > + spin_lock_irqsave(&chan->lock, flags); > while (1) { > - spin_lock_irqsave(&chan->lock, flags); > req = virtqueue_get_buf(chan->vq, &len); > - if (req == NULL) { > - spin_unlock_irqrestore(&chan->lock, flags); > + if (req == NULL) > break; > - } > chan->ring_bufs_avail = 1; > - spin_unlock_irqrestore(&chan->lock, flags); > /* Wakeup if anyone waiting for VirtIO ring space. */ > wake_up(chan->vc_wq); In particular, the wake up here echoes to wait events that will immediately try to grab the lock, and will needlessly spin on it until this thread is done. If we do go this way I'd want setting chan->ring_bufs_avail to be done just before unlocking and the wakeup to be done just after unlocking out of the loop iff we processed at least one iteration here. That should also save you precious cpu cycles while under lock :) -- Dominique Martinet
Re: [V9fs-developer] [PATCH] net/9p: Fix a deadlock case in the virtio transport
jiangyiwen wrote on Sat, Jul 14, 2018: > On 2018/7/14 17:05, Dominique Martinet wrote: > > jiangyiwen wrote on Sat, Jul 14, 2018: > >> When client has multiple threads that issue io requests all the > >> time, and the server has a very good performance, it may cause > >> cpu is running in the irq context for a long time because it can > >> check virtqueue has buf in the *while* loop. > >> > >> So we should keep chan->lock in the whole loop. > > > > Hmm, this is generally bad practice to hold a spin lock for long. > > In general, spin locks are meant to protect data, not code. > > > > I'd want some numbers to decide on this one, even if I think this > > particular case is safe (e.g. this cannot dead-lock) > > > > Actually, the loop will not hold a spin lock for long, because other > threads will not issue new requests in this case. In addition, > virtio-blk or virtio-scsi also use this solution, I guess it may also > encounter this problem before. Fair enough. If you do have some numbers to give though (throughput and/or iops before/after) I'd still be really curious. > >>chan->ring_bufs_avail = 1; > >> - spin_unlock_irqrestore(&chan->lock, flags); > >>/* Wakeup if anyone waiting for VirtIO ring space. */ > >>wake_up(chan->vc_wq); > > > > In particular, the wake up here echoes to wait events that will > > immediately try to grab the lock, and will needlessly spin on it until > > this thread is done. > > If we do go this way I'd want setting chan->ring_bufs_avail to be done > > just before unlocking and the wakeup to be done just after unlocking out > > of the loop iff we processed at least one iteration here. > > I can move the wakeup operation after the unlocking. Like what I said > above, I think this loop will not execute for long. Please do, you listed virtio_blk as doing this and they have the same kind of pattern with a req_done bool and only restarting stopped queues if they processed something -- Dominique
Re: [PATCH] net/9p: include trans_common.h to fix missing prototype warning.
Adeodato Simó wrote on Tue, Nov 13, 2018: > This silences -Wmissing-prototypes when defining p9_release_pages. > > Signed-off-by: Adeodato Simó > --- > net/9p/trans_common.c | 1 + > 1 file changed, 1 insertion(+) > > It was suggested in the kernel-janitors mailing list that silencing > -Wmissing-prototypes would make for a worthy cause. > > https://www.spinics.net/lists/linux-kernel-janitors/msg43981.html > > Resending w/ Cc: linux-kernel as requested by Dominique. Great, pushed to my for-next branch, it'll probably get in there today or tomorrow. Thanks! -- Dominique
Re: [V9fs-developer] [PATCH] p9_check_errors() validate PDU length
Tomas Bortoli wrote on Tue, Jul 10, 2018: > > This however gets more complicated once you start factoring in that > > change I suggested about p9_parse_header not setting size (and checking > > size) because trans_fd relies on it; so I'm not sure how we should > > proceed. > > Mmh, me neither. I don't see where the *actual* PDU length is stored. That's precisely the problem, none of the transports actually write in the pdu how much has been read. Instead, we blindly trust how much the client says it has written and p9_parse_header will write pdu->size which is then used as the 'actual' PDU length. With the three lines I added in each file all transports will store the PDU length in pdu->size so my following suggestion would be to just check that in p9_parse_header the length read from the packet match that of the received data and flag any packet where this is wrong as invalid. The only problem with that is that TCP will use p9_parse_header with a partial packet to know how much it needs to read, so that modification needs actual testing (hence my question below) as it is more intrusive than a simple boundary check. I'd suggest implementing some logic like already here with the if (pdu->size == 0) -- only check if pdu->size was previously set, and set it if it wasn't... But feel free to try something else. With that done I don't expect more problems but if I do it there's little point in making you do it as well ;) > > Do you have a working 9p tcp server to test changes are valid, or are > > you only working off the syzbot reproducer? > > No, I was just using the reproducer to test. > > > In the first place, are you willing to take the time to do that bigger > > fix? > > Yes. Ok, let's get you a working setup for TCP then. You need to install a 9p server, the two I'm aware of are diod and nfs-ganesha (I'm using ganesha); once you have a working config you can mount with something like this: mount -t 9p -o aname=path,trans=fd Once you have that you should be able to fiddle with things until it works as expected. (for virtio I use qemu, that's probably easy to test as well; for RDMA you can use rxe to setup a virtual interface (with rxe_cfg) and then use either diod or ganesha again but the setup is more complicated so it's OK to leave that aside for now) > For me both ways are good. Signed-off-by will be good. > You know for sure more than me about 9p as I started delving it for the > first time yesterday. I can also make and test a patch but I'll need to > understand more about it. Let me know if you find out more. There's a saying about giving a fish or teaching how to fish, so let's have you try if you're motivated :) I don't think much is left if you can sew the pieces together, the most difficult part will probably to set the test environment up if you want to do this properly. Please ask if you run into any trouble, -- Dominique Martinet | Asmadeus
Re: [V9fs-developer] [PATCH] net/9p/client.c: put refcount of trans_mod in error case in parse_opts()
Andrew, there seem to be some renew of interest in 9P lately, so if you'd like I can take care of rounding these up and prepare a pull request for 4.19 (as we're already well into 4.18 release cycle, I believe most of the patches can wait) This patch however I consider important enough to take for 4.18 so could you please grab it for now? I've gathered the Review tags and added my own, feel free to change my Reviewed-and-tested-by tag to Signed-off-by if it seems more appropriate as I'm actively pushing for this patch. piaojun wrote on Fri, Jul 06, 2018: > >From my test, the second mount will fail after umounting successfully. > The reason is that we put refcount of trans_mod in the correct case rather > than the error case in parse_opts() at last. That will cause the refcount > decrease to -1, and when we try to get trans_mod again in > try_module_get(), we could only increase refcount to 0 which will cause > failure as follows: > parse_opts > v9fs_get_trans_by_name > try_module_get : return NULL to caller which cause error > > So we should put refcount of trans_mod in error case. > > Fixes: 9421c3e64137ec ("net/9p/client.c: fix potential refcnt problem of > trans module") > > Signed-off-by: Jun Piao Reviewed-by: Yiwen Jiang Reviewed-by: Greg Kurz Reviewed-and-tested-by: Dominique Martinet > --- > net/9p/client.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/9p/client.c b/net/9p/client.c > index 18c5271..5c13431 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -225,7 +225,8 @@ static int parse_opts(char *opts, struct p9_client *clnt) > } > > free_and_return: > - v9fs_put_trans(clnt->trans_mod); > + if (ret) > + v9fs_put_trans(clnt->trans_mod); > kfree(tmp_options); > return ret; > } Thanks, -- Dominique Martinet
[PATCH] i915/intel_tv_get_modes: fix strncpy truncation warning
This is effectively no-op as the next line writes a nul at the final byte of the buffer, so copying one letter less does not change the behaviour. Signed-off-by: Dominique Martinet --- gcc 8 gives the following warning, which I am not sure why is treated as error for this file, thus making me fix it: drivers/gpu/drm/i915/intel_tv.c: In function ‘intel_tv_get_modes’: drivers/gpu/drm/i915/intel_tv.c:1358:3: error: ‘strncpy’ specified bound 32 equals destination size [-Werror=stringop-truncation] strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN); ^~ cc1: all warnings being treated as errors Also, as a side note, while checking if this was already sent I stumbled uppon this: https://lists.freedesktop.org/archives/intel-gfx/2018-July/170638.html ([Intel-gfx] [PATCH i-g-t 6/7] Fix truncate string in the strncpy) which replaces strncpy(dest, src, sizeof(dest)) by strncpy(dest, src, strlen(dest))... This isn't for linux but this looks like a pretty bad idea to me... (I'm not on the list to reply there, sorry) drivers/gpu/drm/i915/intel_tv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/intel_tv.c b/drivers/gpu/drm/i915/intel_tv.c index b55b5c157e38..f5614d07b10d 100644 --- a/drivers/gpu/drm/i915/intel_tv.c +++ b/drivers/gpu/drm/i915/intel_tv.c @@ -1355,7 +1355,7 @@ intel_tv_get_modes(struct drm_connector *connector) mode_ptr = drm_mode_create(connector->dev); if (!mode_ptr) continue; - strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN); + strncpy(mode_ptr->name, input->name, DRM_DISPLAY_MODE_LEN - 1); mode_ptr->name[DRM_DISPLAY_MODE_LEN - 1] = '\0'; mode_ptr->hdisplay = hactive_s; -- 2.17.1
Re: [V9fs-developer] [PATCH 2/6] 9p: Replace the fidlist with an IDR
Matthew Wilcox wrote on Thu, Jun 28, 2018: > diff --git a/net/9p/client.c b/net/9p/client.c > index f8d58b0852fe..bbab82f22c20 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -908,30 +908,27 @@ static struct p9_fid *p9_fid_create(struct p9_client > *clnt) > { > int ret; > struct p9_fid *fid; > - unsigned long flags; > > p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); > fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); > if (!fid) > return NULL; > > - ret = p9_idpool_get(clnt->fidpool); > - if (ret < 0) > - goto error; > - fid->fid = ret; > - > memset(&fid->qid, 0, sizeof(struct p9_qid)); > fid->mode = -1; > fid->uid = current_fsuid(); > fid->clnt = clnt; > fid->rdir = NULL; > - spin_lock_irqsave(&clnt->lock, flags); > - list_add(&fid->flist, &clnt->fidlist); > - spin_unlock_irqrestore(&clnt->lock, flags); > > - return fid; > + idr_preload(GFP_KERNEL); > + spin_lock_irq(&clnt->lock); > + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, UINT_MAX, GFP_NOWAIT); There's also a P9_NOFID that we shouldn't use, so the max here should be P9_NOFID - 1 That aside this introduces a change of behaviour that fid used to be alloc'd linearily from 0 which no longer holds true, that breaks one serveur (nfs-ganesha just returns ERANGE) but others seem to handle this just fine so they'll just need to fix that server. max aside this looks good. -- Dominique Martinet
Re: [V9fs-developer] [PATCH 2/6] 9p: Replace the fidlist with an IDR
Matthew Wilcox wrote on Wed, Jul 11, 2018: > On Wed, Jul 11, 2018 at 02:40:38PM +0200, Dominique Martinet wrote: > > > + idr_preload(GFP_KERNEL); > > > + spin_lock_irq(&clnt->lock); > > > + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, UINT_MAX, GFP_NOWAIT); > > > > There's also a P9_NOFID that we shouldn't use, so the max here should be > > P9_NOFID - 1 > > Happy to fix that. It shouldn't actually happen, of course. I can't > imagine us having 4 billion FIDs in use at once. Oh, I said that assuming that it was random.. But I guess we might as well fix it. > > That aside this introduces a change of behaviour that fid used to be > > alloc'd linearily from 0 which no longer holds true, that breaks one > > serveur (nfs-ganesha just returns ERANGE) but others seem to handle this > > just fine so they'll just need to fix that server. > > max aside this looks good. > > I don't understand your assertion that this is a change of behaviour. > The implementation of p9_idpool_get() uses idr_alloc(), not > idr_alloc_cyclic(), so I don't believe I've changed which FID would > be allocated. Hmm, I just tried mounting something with ganesha and that broke because it received fid=1730858632 in a TATTACH request, so something in the patch series made some big fid happens. If you say this isn't intented though this needs debugging, I'm glad I brought this up :P (Note that it really should be fine if it is random within the pool, I have notified the current developpers of the problem) -- Dominique
Re: [V9fs-developer] [PATCH 5/6] 9p: Use a slab for allocating requests
Matthew Wilcox wrote on Thu, Jun 28, 2018: > diff --git a/include/net/9p/client.h b/include/net/9p/client.h > index 0fa0fbab33b0..fd326811ebd4 100644 > --- a/include/net/9p/client.h > +++ b/include/net/9p/client.h > @@ -92,24 +85,12 @@ enum p9_req_status_t { > * struct p9_req_t - request slots > * @status: status of this request slot > * @t_err: transport error > - * @flush_tag: tag of request being flushed (for flush requests) > * @wq: wait_queue for the client to block on for this request > * @tc: the request fcall structure > * @rc: the response fcall structure > * @aux: transport specific data (provided for trans_fd migration) > * @req_list: link for higher level objects to chain requests > - * > - * Transport use an array to track outstanding requests > - * instead of a list. While this may incurr overhead during initial > - * allocation or expansion, it makes request lookup much easier as the > - * tag id is a index into an array. (We use tag+1 so that we can accommodate > - * the -1 tag for the T_VERSION request). > - * This also has the nice effect of only having to allocate wait_queues > - * once, instead of constantly allocating and freeing them. Its possible > - * other resources could benefit from this scheme as well. > - * > */ > - > struct p9_req_t { > int status; > int t_err; > @@ -117,40 +98,26 @@ struct p9_req_t { > struct p9_fcall *tc; > struct p9_fcall *rc; > void *aux; > - > struct list_head req_list; > }; > > /** > * struct p9_client - per client instance state > - * @lock: protect @fidlist > + * @lock: protect @fids and @reqs > * @msize: maximum data size negotiated by protocol > - * @dotu: extension flags negotiated by protocol > * @proto_version: 9P protocol version to use > * @trans_mod: module API instantiated with this client > + * @status: XXX Let's give this a proper comment; something like 'connection state machine' ? (this contains Connected/BeginDisconnect/Disconnected/Hung) > diff --git a/net/9p/client.c b/net/9p/client.c > index 2dce8d8307cc..fd5446377445 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -241,132 +241,104 @@ static struct p9_fcall *p9_fcall_alloc(int > alloc_msize) > return fc; > } > > +static struct kmem_cache *p9_req_cache; > + > /** > - * p9_tag_alloc - lookup/allocate a request by tag > - * @c: client session to lookup tag within > - * @tag: numeric id for transaction > - * > - * this is a simple array lookup, but will grow the > - * request_slots as necessary to accommodate transaction > - * ids which did not previously have a slot. > - * > - * this code relies on the client spinlock to manage locks, its > - * possible we should switch to something else, but I'd rather > - * stick with something low-overhead for the common case. > + * p9_req_alloc - Allocate a new request. > + * @c: Client session. > + * @type: Transaction type. > + * @max_size: Maximum packet size for this request. > * > + * Context: Process context. What mutex might we be holding that > + * requires GFP_NOFS? Good question, but p9_tag_alloc happens on every single client request so the fs/9p functions might be trying to do something and the alloc request here comes in and could try to destroy the inode that is currently used in the request -- I'm not sure how likely this is, but I'd rather not tempt fate :p > + * Return: Pointer to new request. > */ > - > static struct p9_req_t * > -p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size) > +p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size) > { > - unsigned long flags; > - int row, col; > - struct p9_req_t *req; > + struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS); > int alloc_msize = min(c->msize, max_size); > + int tag; > > - /* This looks up the original request by tag so we know which > - * buffer to read the data into */ > - tag++; > - > - if (tag >= c->max_tag) { > - spin_lock_irqsave(&c->lock, flags); > - /* check again since original check was outside of lock */ > - while (tag >= c->max_tag) { > - row = (tag / P9_ROW_MAXTAG); > - c->reqs[row] = kcalloc(P9_ROW_MAXTAG, > - sizeof(struct p9_req_t), GFP_ATOMIC); > - > - if (!c->reqs[row]) { > - pr_err("Couldn't grow tag array\n"); > - spin_unlock_irqrestore(&c->lock, flags); > - return ERR_PTR(-ENOMEM); > - } > - for (col = 0; col < P9_ROW_MAXTAG; col++) { > - req = &c->reqs[row][col]; > - req->status = REQ_STATUS_IDLE; > - init_waitqueue_head(&req->wq); > - } > - c->max_tag += P9_ROW_MAXTAG; > - } > -
Re: [V9fs-developer] [PATCH 0/6] 9p: Use IDRs more effectively
Matthew Wilcox wrote on Thu, Jun 28, 2018: > The 9p code doesn't take advantage of the IDR's ability to store > a pointer. We can actually get rid of the p9_idpool abstraction > and the multi-dimensional array of requests. > > I haven't tested these patches, so caveat maintainer. Overall pretty good for something that hadn't even been tested! I'm done with the comments here, the three other patches I didn't comment look good to me. For [PATCH 4/6] 9p: Remove an unnecessary memory barrier, your suggested wording in the comment is good; I've offered to do it if you aren't going to but since you're submitting a second version of the other patchs I suggest you do this one as well. Thanks for the well-needed cleanup, -- Dominique Martinet
Re: [V9fs-developer] [PATCH 4/6] 9p: Remove an unnecessary memory barrier
Matthew Wilcox wrote on Thu, Jun 28, 2018: > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -436,13 +436,9 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t > *req, int status) > { > p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag); > > - /* > - * This barrier is needed to make sure any change made to req before > - * the other thread wakes up will indeed be seen by the waiting side. > - */ > - smp_wmb(); > req->status = status; > > + /* wake_up is an implicit write memory barrier */ Nope. Please note the wmb is _before_ setting status, basically it protects from cpu optimizations where status could be set before other fields, then other core opportunistically checking and finding status is good so other thread continuing. I could only reproduce this bug with infiniband network, but it is very definitely needed. Here is the commit message of when I added that barrier: - 9P: Add memory barriers to protect request fields over cb/rpc threads handoff We need barriers to guarantee this pattern works as intended: [w] req->rc, 1 [r] req->status, 1 wmb rmb [w] req->status, 1 [r] req->rc Where the wmb ensures that rc gets written before status, and the rmb ensures that if you observe status == 1, rc is the new value. - It might need an update to the comment though, if you thought about removing it... -- Dominique Martinet | Asmadeus
Re: [V9fs-developer] [PATCH 4/6] 9p: Remove an unnecessary memory barrier
Matthew Wilcox wrote on Thu, Jun 28, 2018: > How about this? > > /* >* This barrier is needed to make sure any change made to req before > - * the other thread wakes up will indeed be seen by the waiting side. > + * the status change is visible to another thread >*/ Yes, that sounds better. This code is fairly old and I was wondering if the new WRITE_ONCE and READ_ONCE macros would help but it looks like compile-time barriers only so I do not think they would be enough... Documentation/memory-barriers.txt still suggests something similar in the "SLEEP AND WAKE-UP FUNCTIONS" section so I guess this is fine. Thanks, -- Dominique Martinet | Asmadeus
Re: [PATCH 18/18] cpupower: change strncpy+truncation to strlcpy
Daniel Díaz wrote on Tue, Aug 14, 2018: > I can't get cpupower to compile anymore now that it made its way to > linux-next: > [/linux/tools/power/cpupower]$ make > CC lib/cpufreq.o > [...] > make[1]: Entering directory '/linux/tools/power/cpupower/bench' > CC main.o > CC parse.o > parse.c: In function ‘prepare_config’: > parse.c:224:4: warning: implicit declaration of function ‘strlcpy’ > [-Wimplicit-function-declaration] > strlcpy(config->governor, val, > ^ > CC system.o > CC benchmark.o > CC cpufreq-bench > .//parse.o: In function `prepare_config': > /linux/tools/power/cpupower/bench/parse.c:224: undefined reference > to `strlcpy' > collect2: error: ld returned 1 exit status > Makefile:25: recipe for target 'cpufreq-bench' failed > make[1]: *** [cpufreq-bench] Error 1 > make[1]: Leaving directory '/linux/tools/power/cpupower/bench' > Makefile:258: recipe for target 'compile-bench' failed > make: *** [compile-bench] Error 2 > > Does it need anything special to make? Ugh, no, I am really ashamed about this patch series for insufficient testing in general. It is currently "under rework" for an indefinite time frame as I have had other priorities but I'll add cpupower to the list... More precisely, the function is defined in the linux kernel but for userspace strlcpy is only available through libbsd, and I don't believe we should pull that in just for this. I'll send a second patch using snprintf and warning if a truncation occurs (which is the proper fix that the gcc folks intended people to do anyway) when I get around to it, but I would recommend to just revert the patch for now. Shuah, could you take the patch off please if you haven't pushed it to linus yet? Sorry for the time you might have spent on this, -- Dominique Martinet
[PATCH v2] 9p: Remove Ron Minnich from MAINTAINERS
From: Dominique Martinet Ron Minnich has left Sandia in 2011, and has not been involved in any 9p commit in the recent years. Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Ron Minnich Cc: Ronald G. Minnich Cc: Latchesar Ionkov To: Andrew Morton --- v2: added CREDITS entry CREDITS | 5 + MAINTAINERS | 1 - 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/CREDITS b/CREDITS index 989cda91c427..5befd2d714d0 100644 --- a/CREDITS +++ b/CREDITS @@ -2571,6 +2571,11 @@ S: Helstorfer Str. 7 S: D-30625 Hannover S: Germany +N: Ron Minnich +E: rminn...@sandia.gov +E: rminn...@gmail.com +D: 9p filesystem development + N: Corey Minyard E: miny...@wf-rch.cirr.com E: miny...@mvista.com diff --git a/MAINTAINERS b/MAINTAINERS index 15116e6dff3a..09c2a6c0fb0c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -199,7 +199,6 @@ F: drivers/net/ethernet/8390/ 9P FILE SYSTEM M: Eric Van Hensbergen -M: Ron Minnich M: Latchesar Ionkov M: Dominique Martinet L: v9fs-develo...@lists.sourceforge.net -- 2.17.1
Re: [PATCH 1/3] 9p/net: implement asynchronous rpc
Tomas Bortoli wrote on Mon, Dec 17, 2018: > sorry for the delay, I've been quite busy these days. No problem. > The patches looks good to me and should indeed speed up the code a bit. > I quickly tested them against Syzkaller tuned for the 9p subsystem and > everything seems fine. Thanks, can I add your Reviewed-by on all three? > And by the way, which refcount races? There's a problem with trans_fd read_work and cancelled callback; I'm not so sure about refcount but we can definitely get double list_del as we're not checking the status. I think when we incorrectly remove from the list we also mismanage the refcount, but honestly need to test.. -- Dominique
[PATCH 3/3] 9p/net: make flush asynchronous
From: Dominique Martinet Make the client flush asynchronous so we don't need to ignore signals in rpc functions: - on signal, send a flush request as we previously did but use the new asynchronous helper and return immediately - when the reply has been received or connection is destroyed, free both tags in the handler Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- include/net/9p/client.h | 2 + net/9p/client.c | 172 ++-- 2 files changed, 78 insertions(+), 96 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 75d7f83e5b94..dcd40e7ef202 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -91,6 +91,7 @@ enum p9_req_status_t { * @aux: transport specific data (provided for trans_fd migration) * @req_list: link used by trans_fd * @async_list: link used to check on async requests + * @flushed_req: for flush, points to matching flushed req * @clunked_fid: for clunk, points to fid */ struct p9_req_t { @@ -104,6 +105,7 @@ struct p9_req_t { struct list_head req_list; struct list_head async_list; union { + struct p9_req_t *flushed_req; struct p9_fid *clunked_fid; }; }; diff --git a/net/9p/client.c b/net/9p/client.c index a47b5a54573d..666a722088e9 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -694,50 +694,6 @@ static void p9_fid_destroy(struct p9_fid *fid) kfree(fid); } -static struct p9_req_t * -p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...); - -/** - * p9_client_flush - flush (cancel) a request - * @c: client state - * @oldreq: request to cancel - * - * This sents a flush for a particular request and links - * the flush request to the original request. The current - * code only supports a single flush request although the protocol - * allows for multiple flush requests to be sent for a single request. - * - */ - -static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq) -{ - struct p9_req_t *req; - int16_t oldtag; - int err; - - err = p9_parse_header(&oldreq->tc, NULL, NULL, &oldtag, 1); - if (err) - return err; - - p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag); - - req = p9_client_rpc(c, P9_TFLUSH, "w", oldtag); - if (IS_ERR(req)) - return PTR_ERR(req); - - /* -* if we haven't received a response for oldreq, -* remove it from the list -*/ - if (oldreq->status == REQ_STATUS_SENT) { - if (c->trans_mod->cancelled) - c->trans_mod->cancelled(c, oldreq); - } - - p9_tag_remove(c, req); - return 0; -} - static struct p9_req_t *p9_client_prepare_req(struct p9_client *c, int8_t type, int req_size, const char *fmt, va_list ap) @@ -800,6 +756,39 @@ p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...) return req; } +/** + * p9_client_flush - flush (cancel) a request + * @c: client state + * @oldreq: request to cancel + * + * This sents a flush for a particular request and links + * the flush request to the original request. The current + * code only supports a single flush request although the protocol + * allows for multiple flush requests to be sent for a single request. + * + */ + +static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq) +{ + struct p9_req_t *req; + int16_t oldtag = oldreq->tc.tag; + + p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag); + req = p9_client_async_rpc(c, P9_TFLUSH, "w", oldtag); + if (IS_ERR(req)) { + return PTR_ERR(req); + } + + p9_debug(P9_DEBUG_MUX, "sent flush for oldreq %d type %d with flush tag %d\n", +oldtag, oldreq->tc.id, req->tc.tag); + req->flushed_req = oldreq; + spin_lock_irq(&c->lock); + list_add(&req->async_list, &c->async_req_list); + spin_unlock_irq(&c->lock); + + return 0; +} + static void p9_client_handle_async(struct p9_client *c, bool free_all) { struct p9_req_t *req, *nreq; @@ -823,6 +812,24 @@ static void p9_client_handle_async(struct p9_client *c, bool free_all) } if (free_all || req->status >= REQ_STATUS_RCVD) { /* Put old refs whatever reqs actually returned */ + if (req->tc.id == P9_TFLUSH) { + p9_debug(P9_DEBUG_MUX, +"flushing oldreq tag %d status %d, flush_req tag %d\n", +
[PATCH 1/3] 9p/net: implement asynchronous rpc
From: Dominique Martinet Add p9_client_async_rpc that will let us send an rpc without waiting for the reply, as well as an async handler hook. The work done in the hook could mostly be done in the client callback, but I prefered not to for a couple of reasons: - the callback can be called in an IRQ for some transports, and the handler needs to call p9_tag_remove which takes the client lock that has recently been reworked to not be irq-compatible (as it was never taken in IRQs until now) - the handled replies can be handled anytime, there is nothing the userspace would care about and want to be notified of quickly - the async request list and this function would be needed anyway for when we disconnect the client, in order to not leak async requests that haven't been replied to Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- I've been sitting on these patches for almost a month now because I wanted to fix the cancel race, but I think it's what the most recent syzbot email is about so if it could find it without this it probably won't hurt to at least get some reviews for these three patches first. I won't submit these for next cycle, so will only put them into -next after 4.20 is released; hopefully I'll have time to look at the two pending refcount races around that time. Until then, comments please! include/net/9p/client.h | 9 +- net/9p/client.c | 64 + 2 files changed, 72 insertions(+), 1 deletion(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 947a570307a6..a4ded7666c73 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -89,7 +89,8 @@ enum p9_req_status_t { * @tc: the request fcall structure * @rc: the response fcall structure * @aux: transport specific data (provided for trans_fd migration) - * @req_list: link for higher level objects to chain requests + * @req_list: link used by trans_fd + * @async_list: link used to check on async requests */ struct p9_req_t { int status; @@ -100,6 +101,7 @@ struct p9_req_t { struct p9_fcall rc; void *aux; struct list_head req_list; + struct list_head async_list; }; /** @@ -110,6 +112,9 @@ struct p9_req_t { * @trans_mod: module API instantiated with this client * @status: connection state * @trans: tranport instance state and API + * @fcall_cache: kmem cache for client request data (msize-specific) + * @async_req_lock: protects @async_req_list + * @async_req_list: list of requests waiting a reply * @fids: All active FID handles * @reqs: All active requests. * @name: node name used as client id @@ -125,6 +130,8 @@ struct p9_client { enum p9_trans_status status; void *trans; struct kmem_cache *fcall_cache; + spinlock_t async_req_lock; + struct list_head async_req_list; union { struct { diff --git a/net/9p/client.c b/net/9p/client.c index 357214a51f13..0a67c3ccd4fd 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -729,6 +729,62 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c, return ERR_PTR(err); } +static struct p9_req_t * +p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...) +{ + va_list ap; + int err; + struct p9_req_t *req; + + va_start(ap, fmt); + req = p9_client_prepare_req(c, type, c->msize, fmt, ap); + va_end(ap); + if (IS_ERR(req)) + return req; + + err = c->trans_mod->request(c, req); + if (err < 0) { + /* bail out without flushing... */ + p9_req_put(req); + p9_tag_remove(c, req); + if (err != -ERESTARTSYS && err != -EFAULT) + c->status = Disconnected; + return ERR_PTR(safe_errno(err)); + } + + return req; +} + +static void p9_client_handle_async(struct p9_client *c, bool free_all) +{ + struct p9_req_t *req, *nreq; + + /* it's ok to miss some async replies here, do a quick check without +* lock first unless called with free_all +*/ + if (!free_all && list_empty(&c->async_req_list)) + return; + + spin_lock_irq(&c->async_req_lock); + list_for_each_entry_safe(req, nreq, &c->async_req_list, async_list) { + if (free_all && req->status < REQ_STATUS_RCVD) { + p9_debug(P9_DEBUG_ERROR, +"final async handler found req tag %d type %d status %d\n", +req->tc.tag, req->tc.id, req->status); + if (c->trans_mod->cancelled) + c->trans_mod->cancelled(c, req); + /* won't recei
[PATCH 2/3] 9p/net: make clunk asynchronous
From: Dominique Martinet clunk is defined as making the fid invalid whatever the server returns, and we should ignore errors so it is a good candidate for async call. The change should make 9p slightly faster (many vfs systeme calls won't need to wait for that clunk), but more importantly the flush rework means we won't clear pending signals and the current implementation of "retry twice then leak the fid" will stop working, so this needed improving. Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- include/net/9p/client.h | 4 ++ net/9p/client.c | 124 +++- 2 files changed, 62 insertions(+), 66 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index a4ded7666c73..75d7f83e5b94 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -91,6 +91,7 @@ enum p9_req_status_t { * @aux: transport specific data (provided for trans_fd migration) * @req_list: link used by trans_fd * @async_list: link used to check on async requests + * @clunked_fid: for clunk, points to fid */ struct p9_req_t { int status; @@ -102,6 +103,9 @@ struct p9_req_t { void *aux; struct list_head req_list; struct list_head async_list; + union { + struct p9_fid *clunked_fid; + }; }; /** diff --git a/net/9p/client.c b/net/9p/client.c index 0a67c3ccd4fd..a47b5a54573d 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -649,6 +649,51 @@ static int p9_check_zc_errors(struct p9_client *c, struct p9_req_t *req, return err; } +static struct p9_fid *p9_fid_create(struct p9_client *clnt) +{ + int ret; + struct p9_fid *fid; + + p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); + fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); + if (!fid) + return NULL; + + memset(&fid->qid, 0, sizeof(struct p9_qid)); + fid->mode = -1; + fid->uid = current_fsuid(); + fid->clnt = clnt; + fid->rdir = NULL; + fid->fid = 0; + + idr_preload(GFP_KERNEL); + spin_lock_irq(&clnt->lock); + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1, + GFP_NOWAIT); + spin_unlock_irq(&clnt->lock); + idr_preload_end(); + + if (!ret) + return fid; + + kfree(fid); + return NULL; +} + +static void p9_fid_destroy(struct p9_fid *fid) +{ + struct p9_client *clnt; + unsigned long flags; + + p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid); + clnt = fid->clnt; + spin_lock_irqsave(&clnt->lock, flags); + idr_remove(&clnt->fids, fid->fid); + spin_unlock_irqrestore(&clnt->lock, flags); + kfree(fid->rdir); + kfree(fid); +} + static struct p9_req_t * p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...); @@ -778,6 +823,9 @@ static void p9_client_handle_async(struct p9_client *c, bool free_all) } if (free_all || req->status >= REQ_STATUS_RCVD) { /* Put old refs whatever reqs actually returned */ + if (req->tc.id == P9_TCLUNK) { + p9_fid_destroy(req->clunked_fid); + } list_del(&req->async_list); p9_tag_remove(c, req); } @@ -959,51 +1007,6 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type, return ERR_PTR(safe_errno(err)); } -static struct p9_fid *p9_fid_create(struct p9_client *clnt) -{ - int ret; - struct p9_fid *fid; - - p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); - fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); - if (!fid) - return NULL; - - memset(&fid->qid, 0, sizeof(struct p9_qid)); - fid->mode = -1; - fid->uid = current_fsuid(); - fid->clnt = clnt; - fid->rdir = NULL; - fid->fid = 0; - - idr_preload(GFP_KERNEL); - spin_lock_irq(&clnt->lock); - ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1, - GFP_NOWAIT); - spin_unlock_irq(&clnt->lock); - idr_preload_end(); - - if (!ret) - return fid; - - kfree(fid); - return NULL; -} - -static void p9_fid_destroy(struct p9_fid *fid) -{ - struct p9_client *clnt; - unsigned long flags; - - p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid); - clnt = fid->clnt; - spin_lock_irqsave(&clnt->lock, flags); - idr_remove(&clnt->fids, fid->fid); - spin_unlock_irqrestore(&clnt->lock, flags); - kfree(fid-
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Linus Torvalds wrote on Sat, Jan 05, 2019: > But I think my patch to just rip out all that page lookup, and just > base it on the page table state has the fundamental advantage that it > gets rid of code. Maybe I should jst commit it, and see if anything > breaks? We do have options in case things break, and then we'd at > least know who cares (and perhaps a lot more information of _why_ they > care). There actually are many tools like fincore which depend on mincore to try to tell whether a file is "loaded in cache" or not (I personally use vmtouch[1], but I know of at least nocache[2] uses it as well to only try to evict used pages) [1] https://hoytech.com/vmtouch/ [2] https://github.com/Feh/nocache I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or prefetch/lock whole files so my "production" use-cases don't actually rely on the mincore part of them; but when playing with these actions it's actually fairly useful to be able to visualize which part of a file ended in cache or monitor how a file's content evolve in cache... There are various non-obvious behaviours where being able to poke around is enlightening (e.g. fadvise dontneed is actually a hint, so even if nothing uses the file linux sometimes keep the data around if it thinks that would be useful and nocache has a mode to call fadvise multiple times and things like that...) Anyway, I agree the use of mincore for this is rather ugly, and frankly some "cache management API" might be better in the long run if only for performance reason (don't try these tools on a hundred TB sparse file...), but until that pipe dream comes true I think mincore as it was is useful for system admins. Linus Torvalds wrote on Sun, Jan 06, 2019: > I decided to just apply that patch. It is *not* marked for stable, > very intentionally, because I expect that we will need to wait and see > if there are issues with it, and whether we might have to do something > entirely different (more like the traditional behavior with some extra > "only for owner" logic). FWIW I personally don't care much about "only for owner" or depending on mmap options; I don't understand much of the security implications honestly so I'm not sure how these limitations actually help. On the other hand, a simple CAP_SYS_ADMIN check making the call take either behaviour should be safe and would cover what I described above. (by the way, while we are discussing permissions, a regular user can use fadvise dontneed on files it doesn't own as well as long as it can open them for reading; I'm not sure if that would need restricting as well in the context of the security issue. Frankly even with mincore someone could likely tell the difference through timing, if they just do it a few times. Do magic, probe, flush out, repeat until satisfied.) Thanks, -- Dominique Martinet | Asmadeus
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Vlastimil Babka wrote on Mon, Jan 07, 2019: > On 1/7/19 5:32 AM, Dominique Martinet wrote: > > Linus Torvalds wrote on Sat, Jan 05, 2019: > >> But I think my patch to just rip out all that page lookup, and just > >> base it on the page table state has the fundamental advantage that it > >> gets rid of code. Maybe I should jst commit it, and see if anything > >> breaks? We do have options in case things break, and then we'd at > >> least know who cares (and perhaps a lot more information of _why_ they > >> care). > > > > There actually are many tools like fincore which depend on mincore to > > try to tell whether a file is "loaded in cache" or not (I personally use > > vmtouch[1], but I know of at least nocache[2] uses it as well to only > > try to evict used pages) > > nocache could probably do fine without mincore. IIUC the point is to not > evict anything that was already resident prior to running some command > wrapped in nocache. Without the mincore checks, > posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that > others have mapped. That means without mincore() it will drop data > that's in cache but not currently in use by anybody, which shouldn't > cause large performance regressions? Yes, I sent them a patch to allow this behaviour (which got merged ages ago); but the default nocache usage will try to preserve pages that were mapped before even if the user mapped it themselves so it's a little bit more robust that just trusting linux to do the right thing. I did various tests with fadvise dontneed and honestly the way it works is far from obvious when operating as a single user at least; I didn't test multi-users as that didn't really concern me. With the current mincore change, it will think everything was "in core" and not flush anything unless my option to just fadvise dontneed everything is passed though ; so even if we can make it work it is a change of behaviour that is breaking an existing application, and it has no way of telling it didn't work. Honestly though, as I said, mincore() is much more useful for debugging for me ; the application can be changed if required. I just pointed it out as it'll need changing, and it has no obvious way of testing at runtime if the syscall works (except dumb kernel version check, but that won't work with stable backports); so it's not that obvious. > > FWIW I personally don't care much about "only for owner" or depending on > > mmap options; I don't understand much of the security implications > > honestly so I'm not sure how these limitations actually help. > > On the other hand, a simple CAP_SYS_ADMIN check making the call take > > either behaviour should be safe and would cover what I described above. > > So without CAP_SYS_ADMIN, mincore() would return mapping status, and > with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy > :( Maybe if we introduced mincore2() with flags similar to BSD mentioned > earlier in the thread, and the cache residency flag would require > CAP_SYS_ADMIN or something similar. I agree, that's rather clumsy... Or rather might lead to some unexpected behaviours. I'm open to other ideas. I'm not sure how the BSD flags help though? > > (by the way, while we are discussing permissions, a regular user can use > > fadvise dontneed on files it doesn't own as well as long as it can open > > them for reading; I'm not sure if that would need restricting as well in > > the context of the security issue. > > Probably not, as I've mentioned it won't evict what's mapped by somebody > else. And eviction is also possible via controlling LRU, which is what > the paper [1] does anyway (and also mentions that DONTNEED doesn't > work). Being able to evict somebody's page is AFAIU not sufficient for > attack, the side channel is about knowing that somebody brought that > page back to RAM by touching it. Thanks for the link to the paper, I hadn't taken the time to extract it from the news article but it's much more interesting indeed. Some of what's been said here makes more sense to me now (checking about ownership and similar might indeed be good enough). For evicting page I agree most targets would be different users so I guess that makes sense as well; there seem to be other ways of efficiently evicting a page from cache anyway. (BTW, this was brought up earlier, the paper also mentions /proc/self/pagemap as not being useful.) > > Frankly even with mincore someone > > could likely tell the difference through timing, if they just do it a > > few times. Do magic, probe, flush out, repeat until satisfied.) (I
Re: [PATCH v15 23/26] sched: early boot clock
Pavel Tatashin wrote on Mon, Jan 07, 2019: > I could not reproduce the problem. Did you suspend to memory between > wake ups? Does this time jump happen every time, even if your laptop > sleeps for a minute? I'm not sure I understand "suspend to memory between the wake ups". The full sequence is: - start a VM (just in case, I let it boot till the end) - suspend to memory (aka systemctl suspend) the host - after resuming the host, soft reboot the VM (login through serial/ssh/whatever and reboot or in the qemu console 'system_reset') I've just slept exactly one minute and reproduced again with the fedora stock kernel now (4.19.13-300.fc29.x86_64) in the VM. Interestingly I'm not getting the same offset between multiple reboots now despite not suspending again; but if I don't suspend I cannot seem to get it to give an offset at all (only tried for a few minutes; this might not be true) ; OTOH I pushed my luck further and even with a five seconds sleep I'm getting a noticeable offset on first VM reboot after resume: [0.00] Hypervisor detected: KVM [0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00 [ 179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock [ 179.362163] clocksource: kvm-clock: mask: 0x max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns Honestly not sure what more information I could give, I'll try on some other hardware than my laptop (if I can get a server to resume after suspend through ipmi or wake on lan); but I don't have anything I could install ubuntu on to try their qemu's version... although I really don't want to believe that's the difference... Thanks, -- Dominique
Re: [PATCH v15 23/26] sched: early boot clock
Pavel Tatashin wrote on Mon, Jan 07, 2019: > I did exactly the same sequence on Kaby Lake CPU and could not > reproduce it. What is your host CPU? skylake consumer laptop CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz I don't have any kaby lake around; I have access to older servers though... -- Dominique
[GIT PULL] 9p updates for 4.21
Small pull request this time around with only two commits; some missing prototype warning fix and a syzkaller fix when a 9p server advertises a too small msize. The commit date is close-ish because I reworded a commit message to add a Cc to stable for the msize fix recently, but the patchs themselves have been in linux-next since Nov 20~ish Thanks, The following changes since commit ccda4af0f4b92f7b4c308d3acc262f4a7e3affad: Linux 4.20-rc2 (2018-11-11 17:12:31 -0600) are available in the Git repository at: git://github.com/martinetd/linux tags/9p-for-4.21 for you to fetch changes up to 574d356b7a02c7e1b01a1d9cba8a26b3c2888f45: 9p/net: put a lower bound on msize (2018-12-25 17:07:49 +0900) Pull request for inclusion in 4.21 Missing prototype warning fix and a syzkaller fix when a 9p server advertises a too small msize Adeodato Simó (1): net/9p: include trans_common.h to fix missing prototype warning. Dominique Martinet (1): 9p/net: put a lower bound on msize net/9p/client.c | 21 + net/9p/trans_common.c | 1 + -- Dominique Martinet
Re: [PATCH v15 23/26] sched: early boot clock
Pavel Tatashin wrote on Thu, Jan 03, 2019: > Could you please send the config file and qemu arguments that were > used to reproduce this problem. Running qemu by hand, nothing fancy e.g. this works: # qemu-system-x86_64 -m 1G -smp 4 -drive file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm -cpu Haswell -device virtio-rng-pci -nographic (used a specific cpu just in case but normally runnning with cpu host on a skylake machine; can probably go older) qemu is fedora 29 blend as is: $ qemu-system-x86_64 --version QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29) Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers compressed .config attached to the mail, this can likely be trimmed down some as well but that takes more time for me.. I didn't rebuild the kernel so not 100% sure (comes from /proc/config.gz) but it should work on a 4.20-rc2 kernel as written in the first few lines; 857baa87b64 I referred to in another mail was merged in 4.19-rc1 so anything past that is probably OK to reproduce... Re-checked today with these exact options (fresh VM start; then suspend laptop for a bit, then reboot VM): [0.00] Hypervisor detected: KVM [0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00 [ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock [ 2477.907448] clocksource: kvm-clock: mask: 0x max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns [ 2477.907450] tsc: Detected 2592.000 MHz processor As offered previously, happy to help in any way. Thanks, -- Dominique config.xz Description: Binary data
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
Theodore Y. Ts'o wrote on Fri, Dec 28, 2018: > > The problem is that there is no 32-bit API in some cases > > (unless I have misunderstood the kernel code) -- not all > > host architectures implement compat syscalls or allow them > > to be called from 64-bit processes or implement all the older > > syscall variants that had smaller offets. If there was a guaranteed > > "this syscall always exists and always gives me 32-bit offsets" > > we could use it. > > Are there going to be cases where a process or a thread will sometimes > want the 64-bit interface, and sometimes want the 32-bit interface? > Or is it always going to be one or the other? I wonder if we could > simply add a new flag to the process personality(2) flags. That would likely work for qemu user, but the qemu system+9p case is going to be more painful.. More precisely, the 9p protocol does not plan for anything other than 64bit offset so if the vfs needs to hand out a 32bit offset we'll need to make a correspondance table between the 32bit offsets we hand off and the 64bit ones to use; unless some flag can be passed at lopen to tell the server to always hand out 32bit offsets for this directory... And if we do that then 9p servers will need a way to use both APIs in parallel for both types of directories. (Although I'd rather not have to do either in the first place, keeping track of all used offsets just in case seems like a waste even if we only do it for processes in 32bit mode, and a new flag would be a protocol change with 9p not being designed to catter for subtle protocol changes so would be rather painful to roll out) No bright idea here, sorry. -- Dominique Martinet | Asmadeus
Re: [V9fs-developer] [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
Theodore Y. Ts'o wrote on Fri, Dec 28, 2018: > On Sat, Dec 29, 2018 at 03:37:21AM +0100, Dominique Martinet wrote: > > > Are there going to be cases where a process or a thread will sometimes > > > want the 64-bit interface, and sometimes want the 32-bit interface? > > > Or is it always going to be one or the other? I wonder if we could > > > simply add a new flag to the process personality(2) flags. > > > > That would likely work for qemu user, but the qemu system+9p case is > > going to be more painful.. > > More precisely, the 9p protocol does not plan for anything other than > > 64bit offset so if the vfs needs to hand out a 32bit offset we'll need > > to make a correspondance table between the 32bit offsets we hand off and > > the 64bit ones to use; unless some flag can be passed at lopen to tell > > the server to always hand out 32bit offsets for this directory... And if > > we do that then 9p servers will need a way to use both APIs in parallel > > for both types of directories. > > How about if we add a fcntl(2) mediated flag, which is tied to a > struct file? Would that be more or less painful for 9p and qemu > system+9p? Hmm. 9P2000.L doesn't have anything akin to fcntl either, the only two obvious places where we could pass a flag is lopen (which already handles a bunch of linux-specific flags, e.g. passing O_LARGEFILE O_NOATIME etc will just forward these through for qemu/diod at least), or adding a new parameter to the 9p readdir. The former would let us get away without modifying the protocol as servers will just ignore flags they don't handle on implementations I checked, so it'd definitely be the least effort choice from what I can tell. On the other hand a fcntl would solve the server-side problem, it'd allow the server to request appropriately-sized offsets per fd, so it's a good start; we "just" need to figure how to translate that on the wire. -- Dominique Martinet | Asmadeus
Re: BUG: corrupted list in p9_read_work
Dmitry Vyukov wrote on Wed, Oct 10, 2018: > How can they be faked? > If we could create a private rdma/virtio stub instance per test > process, then we could I think easily use that instance for 9p. But is > it possible? "RDMA" itself can be faked pretty easily nowadays, there's a "rxe" driver that is soft RDMA over ethernet and can run over anything. The problem is that you can't just give the client a file like trans fd; you'd need to open an ""rdma socket"" (simplifying wording a bit), and afaik there is no standard tool for it ; or rather, the problem is that RDMA is packet based so even if there were you can't just write stuff in a fd and hope it'll work, so you need a server. If you're interested, 9p is trivial enough that I could provide you with a trivial server that works like your file (just need to reimplement something that parses header to packetize it properly; so you could write to its stdin for example) ; that'd require some setup in the VM (configure rxe and install that tool), but it would definitely be possible. What do you think ? For virtio, I'm not as familiar with the environment so I do not know if there are ways to fake it as easily unfortunately. > Testing on real hardware is mostly outside of our priorities at the > moment. I mean syzkaller itself can be run on anything, and one could > extend descriptions to use a known rdma interface and run on a real > hardware. But we can't afford this at the moment. Sure, I understand that. > As far as I understand RDMA maintainers run syzkaller on real > hardware, but I don't know if they are up to including 9p into > testing. +Leon I'd be interested in knowing what kind of tests runs there :) -- Dominique Martinet
Re: BUG: corrupted list in p9_read_work
Dmitry Vyukov wrote on Wed, Oct 10, 2018: > > Back to the current patch, since as I said I am not confident this is a > > good enough fix for the current bug, will I get notified if the bug > > happens again once the patch hits linux-next with the Reported-by tag ? > > (I don't have the setup necessary to run a syz repro as there is no C > > repro, and won't have much time to do that setup sorry) > > Yes, the bug will be reported again if it still happens after the > patch is merged (not just into linux-next, but into all tested trees, > but it does not matter much). So marking bugs as fixed tentatively is > fine if that's our best guess. Ok, thanks for confirming... > But note that syzbot can test fixes itself on request. It boils down > to just giving it the patch and the base tree: > https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches .. and for clarifying that bit, let's try that! :) #syz test: git://github.com/martinetd/linux e4ca13f7d075e551dc158df6af18fb412a1dba0a -- Dominique
Re: BUG: corrupted list in p9_read_work
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > > That's still the tricky part, I'm afraid... Making a separate server > > would have been easy because I could have reused some of my junk for the > > actual connection handling (some rdma helper library I wrote ages > > ago[1]), but if you're going to just embed C code you'll probably want > > something lower level? I've never seen syzkaller use any library call > > but I'm not even sure I would know how to create a qp without > > libibverbs, would standard stuff be OK ? > > Raw syscalls preferably. > What does 'rxe_cfg start ens3' do on syscall level? Some netlink? modprobe rdma_rxe (and a bunch of other rdma modules before that) then writes the interface name in /sys/module/rdma_rxe/parameters/add apparently; then checks it worked. this part could be done in C directly without too much trouble, but as long as the proper kernel configuration/modules are available > Any libraries and utilities are hell pain in linux world. Will it work > in Android userspace? gVisor? Who will explain all syzkaller users > where they get this for their who-knows-what distro, which is 10 years > old because of corp policies, and debug how their version of the > library has a slightly incompatible version? > For example, after figuring out that rxe_cfg actually comes from > rdma-core (which is a separate delight on linux), my debian > destribution failed to install it because of some conflicts around > /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about > such package. And we've just started :) The rdma ecosystem is a pain, I'll easily agree with that... > Syscalls tend to be simpler and more reliable. If it gives ENOSUPP, > ok, that's it. If it works, great, we can use it. I'll have to look into it a bit more; libibverbs abstracts a lot of stuff into per-nic userspace drivers (the files I cited in a previous mail) and basically with the mellanox cards I'm familiar with the whole user session looks like this: * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and /dev/infiniband/uverbs0 (plus a bunch of files to figure out abi version, what user driver to load etc) * it and the userspace driver issue "commands" over these two files' fd to setup the connection ; some commands are standard but some are specific to the interface and defined in the driver. There are many facets to a connection in RDMA: a protection domain used to register memory with the nic, a queue pair that is the actual tx/rx connection, optionally a completion channel that will be another fd to listen on for events that tell you something happened and finally some memory regions to directly communicate with the nic from userspace depending on the specific driver. * then there's the actual usage, more commands through the uverbs0 char device to register the memory you'll use, and once that's done it's entierly up to the driver - for example the mellanox lib can do everything in userspace playing with the memory regions it registered, but I'd wager the rxe driver does more calls through the uverbs0 fd... Honestly I'm not keen on reimplementing all of this; the interface itself pretty much depends on your version of the kernel (there is a common ABI defined, but as far as specific nics are concerned if your kernel module doesn't match the user library version you can get some nasty surprises), and it's far from the black or white of a good ol' ENOSUPP error. I'll look if I can figure out if there is a common subset of verbs commands that are standard and sufficient to setup a listening connection and exchange data that should be supported for all devices and would let us reimplement just that, but while I hear your point about android and ten years in the future I think it's more likely than ten years in the future the verb abi will have changed but libibverbs will just have the new version implemented and hide the change :P -- Dominique
9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > But again we don't need to support all of the available hardware. I agree with that, I just have no idea what the "librxe-rdmav16.so" lib could be doing and described something I am slightly more familiar with (e.g. libmlx5) I talked about a common subset of the verb abi because I didn't want to look into what it's doing, but if it's not enough there's always that possibility. > For example, we are testing net stack from external side using tun. > tun is a very simple, virtual abstraction of a network card. It allows > us to test all of generic net stack starting from L2 without messing > with any real drivers and their differences entirely. I had impression > that we are talking about something similar here too. Or not? That sounds about right, rxe is a software implementation that should work on most network interfaces ; at least from what I tried it worked on a VM's virtio net down to my laptop's wifi interface so it's a good start... I'm not saying all because I just tried a dummy interface and that returned EINVAL. The only point I disagree is the 'very simple', even getting that to work will be a far cry from a socket() call... :) > Also I am a bit missing context about rdma<->9p interface. Do we need > to setup all these ring buffers to satisfy the parts that 9p needs? Is > it that 9p actually reads data directly from these ring buffers? Or > there is some higher-level rdma interface that 9p uses? It needs an "RDMA_PS_TCP" connection established, that requires everything I described unfortunately... Once that's established we need to register some memory to the driver and post some recv buffers (even if we won't read it, the client would get errors if we aren't ready to receive anything - at least it does with real hardware), and also use some registered memory to send data. Thinking back though I think that my server implementation isn't very far from the raw level in what I'm doing, I recall libibverbs fallback implementation (e.g. if the driver lib doesn't implement it otherwise) of the functions I looked at like ibv_post_send to mostly be just serializing the arguments, slapping the command from an enum in front of it and sending it to the kernel, so it might be enough to just reimplement that shim in or figure a way to generate the binary commands once and then use these values; now I'm comparing two runs of strace of my test server I definitely see a pattern. I'll give it a try but don't expect something fast, and it's probably not going to be very pretty either... To give a concrete example, here are all the read/write/fcntl calls looking just at /dev/infiniband in a hello world program that just establishes connection (server side), receive and send two messages and quits: This part apparently sets up the listening connection of the server: 1430 1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3 1430 1539262699.126155 write(3, "\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 32) = 32 1430 1539262699.126192 write(3, "\24\0\0\0\210\0\0\0\0\0\0\\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144 1430 1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24 1430 1539262699.126274 write(3, "\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126303 close(3)= 0 1430 1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3 1430 1539262699.126429 write(3, "\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0", 32) = 32 1430 1539262699.126472 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144 1430 1539262699.126501 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126534 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24 1430 1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16 1430 1539262699.127149 write(3, "\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) 1430 1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE Then the client connects (had some epoll on read on fd 3, but no read?!) 1446 1539262706.
9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > > Now we are talking! > > We generally assume that all modules are simply compiled into kernel. > > At least that's we have on syzbot. If somebody can't compile them in, > > we can suggest to add modprobe into init. > > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add. > > This fails for me: > > root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add > [20992.905406] rdma_rxe: interface syz1 not found > bash: echo: write error: Invalid argument Works here, I just did: [root@f2 ~]# modprobe rdma_rxe [root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add dmesg says: [ 35.595534] rdma_rxe: set rxe0 active [ 35.595541] rdma_rxe: added rxe0 to ens3 Actually for a dummy interface if I try the echo directly the echo works, and a verb device is created, and I just confirmed I can use it... so not sure why rxe_cfg said EINVAL earlier... [root@f2 ~]# ip link add dummy0 type dummy [root@f2 ~]# ip link set dummy0 up [root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0 [root@f2 ~]# modprobe rdma_rxe [root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add (then using my test client: [root@f2 src]# ./rcat -s INFO: trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED INFO: trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT... [root@f2 src]# ./rcat -c 10.1.1.1 INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED ^C ) I assume your syz1 interface is a tap device as you were saying earlier? Got anything in dmesg? -- Dominique