The concern is that a client is blocked while a request is being processed. The current nbdkit server design requires a thread per request being processed, regardless of the number of connections or clients. We want to run thousands of requests in parallel without needing a thread at the nbdkit layer for each request in flight.
Our plugin layer is built around Boost.Asio, and a few threads in a worker pool running an io_service can be processing thousands of requests in parallel. (Our plugin is a gateway of sorts and requests are sent back out over the network. While our plugin waits for the read or write data we don't block in a thread; we handle other requests that are ready.)

The current nbdkit server design requires a thread per request in progress because it is built around a synchronous callback into the plugin layer, and the main recv_request_send_reply loop holds the only copy of the request handle that is needed to make the reply. A more flexible design would split recv_request_send_reply into a recv_request loop and a send_reply function. The recv_request loop forwards the request handle to the handle_request call. The existing nbdkit_plugin struct would function identically: send_reply is called after plugin.pread + FUA has finished. An alternative plugin struct, "nbdkit_plugin_lowlevel", could define a different interface where an opaque pointer to the handle + connection + flags is passed in, and the plugin is required to call nbdkit_reply (opaque *ptr, ...) to send the reply to the NBD client, rather than nbdkit automatically sending the reply after the plugin function returns.

Some pseudo-code example changes for what I had in mind:

struct operation {
  struct connection *conn;
  uint64_t handle;
  uint32_t cmd;
  uint32_t flags;
  char *buf;
  uint32_t count;
};

// unchanged
struct nbdkit_plugin {
  ...
  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset);
  int (*flush) (void *handle);
  int (*trim) (void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
};

// new lowlevel api
struct nbdkit_plugin_lowlevel {
  ...
  int (*pread) (void *op, void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *op, void *handle, const void *buf, uint32_t count, uint64_t offset);
  int (*flush) (void *op, void *handle);
  int (*trim) (void *op, void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *op, void *handle, uint32_t count, uint64_t offset, int may_trim);
};

// Called by the lowlevel api to send a reply to the client.
void
nbdkit_reply (void *vop)
{
  int r;
  bool flush_after_command;
  struct operation *op = (struct operation *) vop;

  flush_after_command = (op->flags & NBD_CMD_FLAG_FUA) != 0;
  if (!op->conn->can_flush || op->conn->readonly)
    flush_after_command = false;

  if (flush_after_command) {
    op->flags = 0;  // clear flags
    r = plugin_flush_lowlevel (op, op->conn->handle);
    if (r == -1) {
      // do error stuff
    }
  }
  else if (op->cmd == NBD_CMD_READ)
    send_reply (op, op->count, op->buf, 0);
  else
    send_reply (op, 0, NULL, 0);
}

int
my_plugin_pwrite_lowlevel (void *op, void *handle, const void *buf,
                           uint32_t count, uint64_t offset)
{
  if (is_readonly (handle)) {
    nbdkit_reply_error (op, EROFS);
    return 0;
  }
  if (/* some critically bad issue */)
    return -1;

  // This returns right away, before the write has completed; when the
  // write does complete it calls the handler lambda.
  my_storage.async_write (buf, count, offset,
    [op, count] (const boost::system::error_code &ec) {
      if (ec)
        nbdkit_reply_error (op, ec.value ());
      else
        nbdkit_reply (op);
    });
  return 0;
}

// connections.c
static int
_send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;
  struct reply reply;

  reply.magic = htobe32 (NBD_REPLY_MAGIC);
  reply.handle = op->handle;
  reply.error = htobe32 (nbd_errno (error));

  if (error != 0) {
    /* Since we're about to send only the limited NBD_E* errno to the
     * client, don't lose the information about what really happened
     * on the server side.  Make sure there is a way for the operator
     * to retrieve the real error.
     */
    debug ("sending error reply: %s", strerror (error));
  }

  r = xwrite (op->conn->sockout, &reply, sizeof reply);
  if (r == -1) {
    nbdkit_error ("write reply: %m");
    return -1;
  }

  if (op->cmd == NBD_CMD_READ) {
    /* Send the read data buffer. */
    r = xwrite (op->conn->sockout, buf, count);
    if (r == -1) {
      nbdkit_error ("write data: %m");
      return -1;
    }
  }
  return 0;
}

// New mutex on writes due to the parallel nature of responding to the socket.
int
send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;

  plugin_lock_reply (op->conn);
  r = _send_reply (op, count, buf, error);
  plugin_unlock_reply (op->conn);
  free (op);
  return r;
}
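Roughly what I had in mind for the receive side, to go with the send_reply pseudo code above. This is only a sketch and is not written against the real connections.c: recv_request_loop, recv_request, validate_request and handle_request_lowlevel are placeholder names for the existing code paths, and wire-format decoding and cleanup are elided.

static int
recv_request_loop (struct connection *conn)
{
  struct operation *op;
  uint64_t offset;

  for (;;) {
    /* The operation must outlive this loop iteration because the reply
     * is sent later; send_reply () frees it on the completion path.
     */
    op = calloc (1, sizeof *op);
    if (op == NULL)
      return -1;
    op->conn = conn;

    /* Read and validate the request header (and, for NBD_CMD_WRITE, the
     * data payload) from conn->sockin, filling in op->handle, op->cmd,
     * op->flags, op->count, op->buf and the offset.
     */
    if (recv_request (conn, op, &offset) == -1 ||
        validate_request (conn, op, offset) == -1) {
      free (op);
      return -1;
    }

    /* Hand the request to the lowlevel plugin.  The plugin starts its
     * async I/O and returns immediately; when that I/O completes it
     * calls nbdkit_reply () or nbdkit_reply_error (), which send the
     * reply under the per-connection sockout lock.
     */
    if (handle_request_lowlevel (op, offset) == -1)
      return -1;

    /* No send_reply () here; go straight back to reading the socket
     * for the next request.
     */
  }
}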
On Mon, Feb 20, 2017 at 4:03 AM, Richard W.M. Jones <rjo...@redhat.com> wrote:
>
> ----- Forwarded message -----
> >
> > Date: Sat, 18 Feb 2017 22:21:19 -0500
> > Subject: nbdkit async
> >
> > Hello,
> >
> > Hope this is the right person to contact regarding nbdkit design.
> >
> > I have a high latency massively parallel device that I am currently implementing as an nbdkit plugin in C++ and have run into some design limitations due to the synchronous callback interface nbdkit requires.
>
> Is the concern that each client requires a single thread, consuming memory (eg for stack space), but because of the high latency plugin these threads will be hanging around not doing very much?  And/or is it that the client is blocked while servicing each request?
>
> > Nbdkit is currently designed to call the plugin pread/pwrite/trim/flush/zero ops as synchronous calls and expects when the plugin functions return that it can then send the nbd reply to the socket.
> >
> > Its parallel thread model is also not implemented as of yet
>
> I think this bit can be fixed fairly easily.  One change which is especially easy to make is to send back the NBD_FLAG_CAN_MULTI_CONN flag (under control of the plugin).
>
> Anyway this doesn't solve your problem ...
>
> > but the current design still mandates a worker thread per parallel op in progress due to the synchronous design of the plugin calls.
>
> And the synchronous / 1 thread per client design of the server.
>
> > I would like to modify this to allow for an alternative operating mode where nbdkit calls the plugin functions and expects the plugin to call back to nbdkit when a request has completed, rather than responding right after the plugin call returns to nbdkit.
> >
> > If you are familiar with the fuse low level api design, something very similar to that.
> >
> > An example flow for a read request would be as follows:
> >
> > 1) nbdkit reads and validates the request from the socket
> > 2) nbdkit calls handle_request but now also passing in the nbd request handle value
> > 3) nbdkit bundles the nbd request handle value, bool flush_on_update, and read size into an opaque ptr to struct
> > 4) nbdkit calls my_plugin.pread passing in the usual args + the opaque ptr
>
> We can't change the existing API, so this would have to be exposed through new plugin entry point(s).
>
> > 5) my_plugin.pread makes an asynchronous read call with a handler set on completion to call nbdkit_reply_read(conn, opaque ptr, buf) or on error nbdkit_reply_error(conn, opaque_ptr, error)
> > 6) my_plugin.pread returns back to nbdkit without error after it has started the async op but before it has completed
> > 7) nbdkit doesn't send a response to the conn->sockout because when the async op has completed my_plugin will call back to nbdkit for it to send the response
> > 8) nbdkit loop continues right away on the next request and it reads and validates the next request from conn->sockin without waiting for the previous request to complete
> > *) Now requires an additional mutex on the conn->sockout for writing responses
> >
> > The benefit of this approach is that 2 threads (1 thread for reading requests from the socket and kicking off requests to the plugin, and 1 thread (or more) in a worker pool executing the async handler callbacks) can service 100s of slow nbd requests in parallel, overcoming high latency.
> >
> > With the current design of synchronous callbacks we can provide async in our plugin layer for pwrites and implement our flush to enforce it, but we can't get around a single slow high latency read op blocking the entire pipe.
> >
> > I'm willing to work on this in a branch and push this up as opensource, but first wanted to check if this functionality extension is in fact something redhat would want for nbdkit, and if so if there were suggestions to the implementation.
>
> It depends on how much it complicates the internals of nbdkit (which are currently quite simple).  Would need to see the patches ...
>
> You can help by splitting into simple changes which are generally applicable (eg. supporting NBD_FLAG_CAN_MULTI_CONN), and other changes which are more difficult to integrate.
>
> > Initial implementation approach was going to be similar to the fuse_low_level approach and create an entirely separate header file for the asynchronous plugin api because the plugin calls now need an additional parameter (opaque ptr to handle for nbdkit_reply_). This header file nbdkit_plugin_async.h defines the plugin struct with slightly different function ptr prototypes that accepts the opaque ptr to nbd request handle and some additional callback functions nbdkit_reply_error, nbdkit_reply, and nbdkit_reply_read. The user of this plugin interface is required to call either nbdkit_reply_error or nbdkit_reply[_read] in each of the pread/pwrite/flush/trim/zero ops.
> >
> > If you got this far thank you for the long read and please let me know if there is any interest.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat  http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-top is 'top' for virtual machines.  Tiny program with many
> powerful monitoring features, net stats, disk stats, logging, etc.
> http://people.redhat.com/~rjones/virt-top
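One more sketch to round out the pseudo code earlier in this mail: the nbdkit_reply_error callback used in my_plugin_pwrite_lowlevel above could be implemented on the server side along these lines. This is illustrative only (it is not an existing nbdkit function) and it reuses struct operation and send_reply from the example above.

// Error-path counterpart to nbdkit_reply above: report only the errno,
// with no FUA/flush handling and no data payload.
void
nbdkit_reply_error (void *vop, int error)
{
  struct operation *op = (struct operation *) vop;

  /* A zero count means _send_reply writes no read-data payload even
   * when op->cmd == NBD_CMD_READ.
   */
  send_reply (op, 0, NULL, (uint32_t) error);
}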