On Mon, Jul 11, 2016 at 02:56:52PM +0000, Dexuan Cui wrote:
> Hyper-V Sockets (hv_sock) provides a byte-stream-based communication
> mechanism between the host and the guest. It's somewhat like TCP over
> VMBus, but the transport layer (VMBus) is much simpler than IP.
> 
> With Hyper-V Sockets, applications on the host and in the guest can talk
> to each other directly through the traditional BSD-style socket APIs.
> 
> Hyper-V Sockets is only available on newer Windows hosts, such as Windows
> Server 2016. More info is in the article "Make your own integration services":
> https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service
> 
> The patch implements the necessary support in the guest side by introducing
> a new socket address family AF_HYPERV.
> 
> Signed-off-by: Dexuan Cui <de...@microsoft.com>
> Cc: "K. Y. Srinivasan" <k...@microsoft.com>
> Cc: Haiyang Zhang <haiya...@microsoft.com>
> Cc: Vitaly Kuznetsov <vkuzn...@redhat.com>
> Cc: Cathy Avery <cav...@redhat.com>
> Cc: Olaf Hering <o...@aepfle.de>
> ---
> 
> You can also get the patch here (commit 5dde7975):
> https://github.com/dcui/linux/tree/decui/hv_sock/net-next/20160711_v16
> 
> 
> For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31
> 
> In v12, the changes are mainly the following:
> 
> 1) remove the module params as David suggested.
> 
> 2) use exactly 5 pages for the VMBus send/recv rings, respectively.
> The host-side design of the feature requires exactly 5 pages for the
> recv/send rings -- this is suboptimal in terms of memory consumption, but
> unfortunately we have to live with it until the host comes up with a new
> design in the future. :-(
> 
> 3) remove the per-connection static send/recv buffers.
> Instead, we allocate and free the buffers dynamically, only when we actually
> recv/send data. This means that when a connection is idle, no memory is
> consumed for recv/send buffers at all.
> 
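For reference, points 2) and 3) presumably boil down to something like the
sketch below. The ring-size constants match what the changelog describes;
hvsock_send_data() and its kmalloc() bounce buffer are a simplified
illustration, not the patch's actual send path.

/* Presumably: exactly 5 fixed-size 4 KiB pages per ring, as required
 * by the host-side design.
 */
#define PAGE_SIZE_4K			4096
#define RINGBUFFER_HVSOCK_RCV_SIZE	(5 * PAGE_SIZE_4K)
#define RINGBUFFER_HVSOCK_SND_SIZE	(5 * PAGE_SIZE_4K)

/*
 * Simplified send path: the bounce buffer lives only for the duration of
 * one call, so an idle connection pins no send/recv buffer memory at all.
 */
static int hvsock_send_data(struct vmbus_channel *chan,
			    const void __user *ubuf, size_t len)
{
	void *kbuf;
	int ret;

	kbuf = kmalloc(len, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	if (copy_from_user(kbuf, ubuf, len)) {
		kfree(kbuf);
		return -EFAULT;
	}

	ret = vmbus_sendpacket(chan, kbuf, len, 0, VM_PKT_DATA_INBAND, 0);

	kfree(kbuf);
	return ret;
}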
> In v13:
> I return ENOMEM on buffer allocation failure.
> 
>    Actually "man read/write" says "Other errors may occur, depending on the
> object connected to fd". "man send/recv" indeed lists ENOMEM.
>    Considering AF_HYPERV is a new socket type, ENOMEM seems OK here.
>    In the long run, I think we should add a new API to the VMBus driver
> that copies data from the VMBus ring buffer directly into a user-mode
> buffer. That way, we could eliminate this temporary buffer entirely.
> 
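Such a direct-copy API does not exist in the VMBus driver today; a purely
hypothetical prototype (name and signature invented here for illustration)
might look like:

/*
 * Hypothetical helper: copy up to 'len' bytes of payload from the
 * channel's inbound ring buffer straight into a user buffer, so the
 * socket layer would not need an intermediate kernel bounce buffer.
 * Returns 0 on success and stores the number of bytes copied in *copied.
 */
int hv_ringbuffer_read_user(struct vmbus_channel *channel,
			    void __user *buf, size_t len, size_t *copied);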
> In v14:
> fix some coding style issues pointed out by David.
> 
> In v15:
> Just some stylistic changes addressing comments from Joe Perches and
> Olaf Hering -- thank you!
> - add a GPL blurb.
> - define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE.
> - change sk_to_hvsock()/hvsock_to_sk() from macros to inline functions.
> - remove a not-very-useful pr_err().
> - fix some typos in comments and some coding style issues.
> 
> In v16:
> Made stylistic changes addressing comments from Vitaly Kuznetsov.
> Thank you very much for the detailed comments, Vitaly!
> 
> Looking forward to your comments!
> 
>  MAINTAINERS                 |    2 +
>  include/linux/hyperv.h      |   13 +
>  include/linux/socket.h      |    4 +-
>  include/net/af_hvsock.h     |   78 +++
>  include/uapi/linux/hyperv.h |   23 +
>  net/Kconfig                 |    1 +
>  net/Makefile                |    1 +
>  net/hv_sock/Kconfig         |   10 +
>  net/hv_sock/Makefile        |    3 +
>  net/hv_sock/af_hvsock.c     | 1509 +++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 1643 insertions(+), 1 deletion(-)

...

> +static LIST_HEAD(hvsock_bound_list);
> +static LIST_HEAD(hvsock_connected_list);
> +static DEFINE_MUTEX(hvsock_mutex);
> +
> +static struct sock *hvsock_find_bound_socket(const struct sockaddr_hv *addr)
> +{
> +     struct hvsock_sock *hvsk;
> +
> +     list_for_each_entry(hvsk, &hvsock_bound_list, bound_list) {
> +             if (!uuid_le_cmp(addr->shv_service_guid,
> +                              hvsk->local_addr.shv_service_guid))
> +                     return hvsock_to_sk(hvsk);
> +     }
> +     return NULL;
> +}
> +
> +static struct sock *hvsock_find_connected_socket_by_channel(
> +     const struct vmbus_channel *channel)
> +{
> +     struct hvsock_sock *hvsk;
> +
> +     list_for_each_entry(hvsk, &hvsock_connected_list, connected_list) {
> +             if (hvsk->channel == channel)
> +                     return hvsock_to_sk(hvsk);
> +     }
> +     return NULL;
> +}

How does this work from a performance point of view if there are many
connected sockets and/or a high frequency of new connections? AFAICS most
other families use a hash table for socket lookup.
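For comparison, a minimal sketch of a hash-based lookup, assuming a new
bound_node hlist member in struct hvsock_sock and a jhash of the service
GUID as the key (both invented here for illustration):

#include <linux/hashtable.h>
#include <linux/jhash.h>

static DEFINE_HASHTABLE(hvsock_bound_ht, 8);	/* 256 buckets, arbitrary */

static u32 hvsock_addr_hash(const uuid_le *service_guid)
{
	return jhash(service_guid, sizeof(*service_guid), 0);
}

static void hvsock_insert_bound(struct hvsock_sock *hvsk)
{
	hash_add(hvsock_bound_ht, &hvsk->bound_node,
		 hvsock_addr_hash(&hvsk->local_addr.shv_service_guid));
}

static struct sock *hvsock_find_bound_socket(const struct sockaddr_hv *addr)
{
	struct hvsock_sock *hvsk;
	u32 key = hvsock_addr_hash(&addr->shv_service_guid);

	hash_for_each_possible(hvsock_bound_ht, hvsk, bound_node, key) {
		if (!uuid_le_cmp(addr->shv_service_guid,
				 hvsk->local_addr.shv_service_guid))
			return hvsock_to_sk(hvsk);
	}
	return NULL;
}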

> +static void get_ringbuffer_rw_status(struct vmbus_channel *channel,
> +                                  bool *can_read, bool *can_write)
> +{
> +     u32 avl_read_bytes, avl_write_bytes, dummy;
> +
> +     if (can_read) {
> +             hv_get_ringbuffer_availbytes(&channel->inbound,
> +                                          &avl_read_bytes,
> +                                          &dummy);
> +             /* 0-size payload means FIN */
> +             *can_read = avl_read_bytes >= HVSOCK_PKT_LEN(0);
> +     }
> +
> +     if (can_write) {
> +             hv_get_ringbuffer_availbytes(&channel->outbound,
> +                                          &dummy,
> +                                          &avl_write_bytes);
> +
> +             /* We only write if there is enough space */
> +             *can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE);

I'm not sure where this comes from, but is this really supposed to be
PAGE_SIZE (and not the fixed 4 KB PAGE_SIZE_4K)?
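If the host-side protocol assumes fixed 4 KiB frames regardless of the
guest's page size (e.g. on architectures configured with 64 KiB pages),
the check would presumably have to read:

	/* fixed 4 KiB frame size dictated by the host, independent of
	 * the guest's PAGE_SIZE
	 */
	*can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE_4K);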

> +static int hvsock_open_connection(struct vmbus_channel *channel)
> +{
> +     struct hvsock_sock *hvsk, *new_hvsk;
> +     uuid_le *instance, *service_id;
> +     unsigned char conn_from_host;
> +     struct sockaddr_hv hv_addr;
> +     struct sock *sk, *new_sk;
> +     int ret;
> +
> +     instance = &channel->offermsg.offer.if_instance;
> +     service_id = &channel->offermsg.offer.if_type;
> +
> +     /* The first byte != 0 means the host initiated the connection. */
> +     conn_from_host = channel->offermsg.offer.u.pipe.user_def[0];
> +
> +     mutex_lock(&hvsock_mutex);
> +
> +     hvsock_addr_init(&hv_addr, conn_from_host ? *service_id : *instance);
> +     sk = hvsock_find_bound_socket(&hv_addr);
> +
> +     if (!sk || (conn_from_host && sk->sk_state != SS_LISTEN) ||
> +         (!conn_from_host && sk->sk_state != SS_CONNECTING)) {
> +             ret = -ENXIO;
> +             goto out;
> +     }
> +
> +     if (conn_from_host) {
> +             if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
> +                     ret = -ECONNREFUSED;
> +                     goto out;
> +             }
> +
> +             new_sk = hvsock_create(sock_net(sk), NULL, GFP_KERNEL,
> +                                    sk->sk_type);
> +             if (!new_sk) {
> +                     ret = -ENOMEM;
> +                     goto out;
> +             }
> +
> +             new_sk->sk_state = SS_CONNECTING;
> +             new_hvsk = sk_to_hvsock(new_sk);
> +             new_hvsk->channel = channel;
> +             hvsock_addr_init(&new_hvsk->local_addr, *service_id);
> +             hvsock_addr_init(&new_hvsk->remote_addr, *instance);
> +     } else {
> +             hvsk = sk_to_hvsock(sk);
> +             hvsk->channel = channel;
> +     }
> +
> +     set_channel_read_state(channel, false);
> +     ret = vmbus_open(channel, RINGBUFFER_HVSOCK_SND_SIZE,
> +                      RINGBUFFER_HVSOCK_RCV_SIZE, NULL, 0,
> +                      hvsock_on_channel_cb, conn_from_host ? new_sk : sk);
> +     if (ret != 0) {
> +             if (conn_from_host) {
> +                     new_hvsk->channel = NULL;
> +                     sock_put(new_sk);
> +             } else {
> +                     hvsk->channel = NULL;
> +             }
> +             goto out;
> +     }
> +
> +     vmbus_set_chn_rescind_callback(channel, hvsock_close_connection);
> +
> +     /* see get_ringbuffer_rw_status() */
> +     set_channel_pending_send_size(channel, HVSOCK_PKT_LEN(PAGE_SIZE) + 1);

Same question.

> +
> +     if (conn_from_host) {
> +             new_sk->sk_state = SS_CONNECTED;
> +
> +             sock_hold(&new_hvsk->sk);
> +             list_add(&new_hvsk->connected_list, &hvsock_connected_list);
> +
> +             hvsock_enqueue_accept(sk, new_sk);
> +     } else {
> +             sk->sk_state = SS_CONNECTED;
> +             sk->sk_socket->state = SS_CONNECTED;
> +
> +             sock_hold(&hvsk->sk);
> +             list_add(&hvsk->connected_list, &hvsock_connected_list);
> +     }
> +
> +     sk->sk_state_change(sk);
> +out:
> +     mutex_unlock(&hvsock_mutex);
> +     return ret;
> +}

...

> +static int hvsock_create_sock(struct net *net, struct socket *sock,
> +                           int protocol, int kern)
> +{
> +     struct sock *sk;
> +
> +     if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
> +             return -EPERM;

Looks like any application wanting to use Hyper-V sockets will need
rather high privileges. It would make sense if these sockets were
reserved for privileged tasks like VM management. But according to the
commit message, hv_sock is supposed to be used for regular
application-to-application communication. Requiring CAP_{SYS,NET}_ADMIN
looks like overkill to me.
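One possible relaxation, purely illustrative: allow unprivileged socket
creation and require a capability only when binding to a service GUID
reserved for management tasks, analogous to privileged TCP ports. The
hvsock_service_is_reserved() predicate below is made up, and the bind
handler is just a stand-in for the patch's real one:

static int hvsock_bind_sketch(struct socket *sock, struct sockaddr *addr,
			      int addr_len)
{
	struct sockaddr_hv *hv_addr = (struct sockaddr_hv *)addr;

	/* privilege is needed only for "well-known" management services */
	if (hvsock_service_is_reserved(&hv_addr->shv_service_guid) &&
	    !ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN))
		return -EACCES;

	/* ... the normal bind path follows ... */
	return 0;
}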

> +
> +     if (protocol != 0 && protocol != SHV_PROTO_RAW)
> +             return -EPROTONOSUPPORT;
> +
> +     switch (sock->type) {
> +     case SOCK_STREAM:
> +             sock->ops = &hvsock_ops;
> +             break;
> +     default:
> +             return -ESOCKTNOSUPPORT;
> +     }
> +
> +     sock->state = SS_UNCONNECTED;
> +
> +     sk = hvsock_create(net, sock, GFP_KERNEL, 0);
> +     return sk ? 0 : -ENOMEM;
> +}

Michal Kubecek
