On 3/23/21 6:57 PM, Adrian Moreno wrote:
> 
> 
> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>> system:
>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>>
>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>> solved with a pidfile.
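A minimal sketch of the stale-file problem and the unlink-before-bind workaround discussed above, for illustration only. It is not QEMU's or rte_vhost's code; the helper names and the "vhu.sock" file name are hypothetical.

```python
import errno
import os
import socket
import tempfile


def bind_listen(path, unlink_first=False):
    """Bind a listening AF_UNIX socket at path, optionally unlinking first."""
    if unlink_first:
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # no stale file to remove
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(path)
    except OSError:
        srv.close()
        raise
    srv.listen(1)
    return srv


def demo():
    """Return the errno seen when binding over a stale socket file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "vhu.sock")  # hypothetical socket name
        old = bind_listen(path)
        old.close()  # simulates a server that exited without cleaning up
        assert os.path.exists(path)  # the socket file is left behind
        try:
            bind_listen(path).close()
            err = None
        except OSError as e:
            err = e.errno
        # Unlinking first lets the new server start despite the stale file.
        bind_listen(path, unlink_first=True).close()
        return err


if __name__ == "__main__":
    print(demo() == errno.EADDRINUSE)
```

Note that the unlink also succeeds when another live process is still listening on that path, which is the hijacking drawback mentioned above.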
>>>>>
>>>>> How exactly could this be solved with a pidfile?
>>>>
>>>> A pidfile prevents two instances of the same service from running at the
>>>> same time.
>>>>
>>>> The same effect can be achieved by the container orchestrator, systemd,
>>>> etc too because it refuses to run the same service twice.
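One common way to implement such a pidfile is an advisory lock on the file, sketched below. This is an assumption about how a service might do it, not a description of what QEMU or OVS do; acquire_pidfile and the "svc.pid" name are hypothetical. It only helps when both instances agree on the same pidfile path.

```python
import fcntl
import os
import tempfile


def acquire_pidfile(path):
    """Return an fd holding an exclusive lock on path, or None if it is taken."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock; fails if another holder exists.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # keep it open for the lifetime of the service


def demo():
    """Return (first acquired, second refused, acquired again after release)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "svc.pid")  # hypothetical pidfile name
        first = acquire_pidfile(path)
        second = acquire_pidfile(path)  # a second "instance"
        os.close(first)  # the lock is dropped when the holder closes or exits
        third = acquire_pidfile(path)
        os.close(third)
        return first is not None, second is None, third is not None
```

Because the kernel drops the lock when the holder exits, a crashed service does not leave a stale lock behind, unlike a stale socket file.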
>>>
>>> Sure. I understand that.  My point was that these could be 2 different
>>> applications and they might not know which process to look for.
>>>
>>>>
>>>>> And what if this is
>>>>> a different application that tries to create a socket on the same path?
>>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
>>>>> to an existing socket file and will fail.  Subsequently port creation
>>>>> in OVS will fail.   We can't allow OVS to unlink files because this
>>>>> way OVS users will have the ability to unlink random sockets that OVS has
>>>>> access to and we also have no idea if it's QEMU that created the file
>>>>> or it was a virtio-user application or someone else.
>>>>
>>>> If rte_vhost unlinks the socket then the user will find that networking
>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>>> or restart QEMU, depending on whether they need to keep the guest
>>>> running or not. This is a misconfiguration that is recoverable.
>>>
>>> True, it's recoverable, but with a high cost.  Restart of a VM is rarely
>>> desirable.  And the application inside the guest might not behave
>>> well after hot re-plug of a device that it actively used.  I'd expect
>>> a DPDK application that runs inside a guest on some virtio-net device
>>> to crash after this kind of manipulation.  Especially, if it uses some
>>> older versions of DPDK.
>>
>> This unlink issue is probably something we think differently about.
>> There are many ways for users to misconfigure things when working with
>> system tools. If it's possible to catch misconfigurations, that is
>> preferable. In this case it's just the way pathname AF_UNIX domain
>> sockets work and IMO it's better not to have problems starting the
>> service due to stale files than to insist on preventing
>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>> successful, so ¯\_(ツ)_/¯.
>>
>>>>
>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>> creates a security issue. I don't know the security model of OVS.
>>>
>>> In general, privileges of the ovs-vswitchd daemon might be completely
>>> different from privileges required to invoke control utilities or
>>> to access the configuration database.  So, yes, we should not allow
>>> that.
>>
>> That can be locked down by restricting the socket path to a file beneath
>> /var/run/ovs/vhost-user/.
>>
>>>>
>>>>> There are, probably, ways to detect if there is any alive process that
>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>> also I'm not sure if it's possible if actual user is in a different
>>>>> container.
>>>>> So I don't see a good reliable way to detect these conditions.  This
>>>>> falls on shoulders of a higher level management software or a user to
>>>>> clean these socket files up before adding ports.
>>>>
>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>> disappears when there is no process listening anymore.
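A small Linux-only sketch of that behaviour, assuming Python's convention of a leading NUL byte to select the abstract namespace. The address name is hypothetical.

```python
import os
import socket


def bind_abstract(name):
    """Bind and listen on an abstract AF_UNIX address (Linux only)."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind("\0" + name)  # leading NUL byte: no filesystem entry is created
    except OSError:
        srv.close()
        raise
    srv.listen(1)
    return srv


def demo(name=None):
    """Return True if the name clashed while in use and was free afterwards."""
    if name is None:
        name = f"example-vhost-user-{os.getpid()}"  # hypothetical address name
    first = bind_abstract(name)
    try:
        bind_abstract(name).close()
        clash = False
    except OSError:
        clash = True  # address is busy while a listener exists
    first.close()
    # No unlink step: the address vanished with the last listener.
    bind_abstract(name).close()
    return clash
```

Abstract addresses are scoped to a network namespace, which is why the question about OVS and the application sharing one matters.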
>>>
>>> OVS is usually started right on the host in a main network namespace.
>>> In case it's started in a pod, it will run in a separate container but
>>> configured with a host network.  Applications almost exclusively run
>>> in separate pods.
>>
>> Okay.
>>
>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>>
>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>> closely managed by one piece of software:
>>>>>>
>>>>>> 1. Unlink the socket path.
>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>>    connection will proceed (there is no need for sleeping).
>>>>>>
>>>>>> This approach works across containers without a broker.
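The four steps above can be sketched as follows. This is a single-process illustration under stated assumptions: the "backend" and "VMM" roles are simulated with plain sockets, socket.fromfd() stands in for handing the listen fd to another process, and the socket name is hypothetical.

```python
import os
import socket
import tempfile


def demo():
    """Show that connect(2) completes before the backend calls accept(2)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "vhost-user.sock")  # hypothetical socket path

        # 1. Unlink the socket path (nothing to remove on first run).
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass

        # 2. Create, bind, and listen on the socket path.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(1)

        # 3. Hand the listen fd to the backend (simulated by duplicating it).
        backend = socket.fromfd(listener.fileno(), socket.AF_UNIX,
                                socket.SOCK_STREAM)
        listener.close()

        # 4. The client connects right away; the kernel queues the connection
        #    in the listen backlog, so no sleep or retry is needed.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"hello")

        # The backend accepts later and finds the queued connection and data.
        conn, _ = backend.accept()
        data = conn.recv(5)

        conn.close()
        client.close()
        backend.close()
        return data
```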
>>>>>
>>>>> Not sure if I fully understood a question here, but anyway.
>>>>>
>>>>> This approach works fine if you know what application to run.
>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>> with virtio-user ports running inside a container and want to
>>>>> have a network connection.  Also, this application needs to run
>>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>>> require restart of the application.  So, you basically need to
>>>>> rely on a third-party application to create a socket with a right
>>>>> name and in a correct location that is shared with a host, so
>>>>> OVS can find it and connect.
>>>>>
>>>>> In a VM world everything is much simpler, since you have
>>>>> libvirt and QEMU that will take care of all of this stuff
>>>>> and which are also under full control of management software
>>>>> and a system administrator.
>>>>> In case of a container with a "random" DPDK application inside
>>>>> there is no such entity that can help.  Of course, some solution
>>>>> might be implemented in docker/podman daemon to create and manage
>>>>> outside-looking sockets for an application inside the container,
>>>>> but that is not available today AFAIK and I'm not sure if it
>>>>> ever will.
>>>>
>>>> Wait, when you say there is no entity like management software or a
>>>> system administrator, then how does OVS know to instantiate the new
>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
>>>
>>> I didn't mean that there is no application that configures
>>> everything.  Of course, there is.  I mean that there is no such
>>> entity that abstracts all that socket machinery from the user's
>>> application that runs inside the container.  QEMU hides all the
>>> details of the connection to vhost backend and presents the device
>>> as a PCI device with a network interface wrapping from the guest
>>> kernel.  So, the application inside VM shouldn't care that there
>>> actually is a socket connected to OVS that implements the backend and
>>> forwards traffic somewhere.  For the application it's just a usual
>>> network interface.
>>> But in case of a container world, application should handle all
>>> that by creating a virtio-user device that will connect to some
>>> socket, that has an OVS on the other side.
>>>
>>>>
>>>> Can you describe the steps used today (without the broker) for
>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>> Although my interest is in the vhost-user protocol I think it's
>>>> necessary to understand the OVS requirements here and I know little
>>>> about them.
>>>>
>>> I might describe some things wrong since I worked with k8s and CNI
>>> plugins last time ~1.5 years ago, but the basic schema will look
>>> something like this:
>>>
>>> 1. user decides to start a new pod and requests k8s to do that
>>>    via cmdline tools or some API calls.
>>>
>>> 2. k8s scheduler looks for available resources asking resource
>>>    manager plugins, finds an appropriate physical host and asks
>>>    the kubelet daemon local to that node to launch a new pod there.
>>>
> 
> When the CNI is called, the pod has already been created, i.e: a PodID exists
> and so does an associated network namespace. Therefore, everything that has to
> do with the runtime spec such as mountpoints or devices cannot be modified by
> this time.
> 
> That's why the Device Plugin API is used to modify the Pod's spec before the
> CNI chain is called.
> 
>>> 3. kubelet asks local CNI plugin to allocate network resources
>>>    and annotate the pod with required mount points, devices that
>>>    need to be passed in and environment variables.
>>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
>>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
>>>    usually deployed as a system DaemonSet, so it runs in a
>>>    separate pod.)
>>>
>>> 4. Assuming that a vhost-user connection is requested in server mode,
>>>    the CNI plugin will:
>>>    4.1 create a directory for a vhost-user socket.
>>>    4.2 add this directory to pod annotations as a mount point.
> 
> I believe this is not possible, it would have to inspect the pod's spec or
> otherwise determine an existing mount point where the socket should be
> created.

Uff.  Yes, you're right.  Thanks for your clarification.
I mixed up CNI and Device Plugin here.

CNI itself is not able to annotate new resources to the pod, i.e.
create new mounts or something like this.   And I don't recall any
vhost-user device plugins.  Are there any?  There is an SR-IOV device
plugin, but its purpose is to allocate and pass PCI devices, not create
mounts for vhost-user.

So, IIUC, right now the user must create the directory and specify
a mount point in a pod spec file or pass the whole /var/run/openvswitch
or something like this, right?

Looking at userspace-cni-network-plugin, it actually just parses
annotations to find the shared directory and fails if there is
none:
 
https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122

And the examples suggest specifying a directory to mount:
 
https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41

Looks like this is done by hand by the user.

> 
> +Billy might give more insights on this
> 
>>>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
>>>        by connecting to ovsdb-server by JSONRPC directly.
>>>        It will set port type as dpdkvhostuserclient and specify
>>>        socket-path as a path inside the directory it created.
>>>        (OVS will create a port and rte_vhost will enter the
>>>         re-connection loop since socket does not exist yet.)
>>>    4.4 Set up socket file location as environment variable in
>>>        pod annotations.
>>>    4.5 report success to kubelet.
>>>
> 
> Since the CNI cannot modify the pod's mounts it has to rely on a Device Plugin
> or other external entity that can inject the mount point before the pod is
> created.
> 
> However, there is another use case that might be relevant: dynamic attachment
> of network interfaces. In this case the CNI cannot work in collaboration with a
> Device Plugin or "mount-point injector" and an existing mount point has to be
> used.
> Also, some form of notification mechanism has to exist to tell the workload a
> new socket is ready.
> 
>>> 5. kubelet will finish all other preparations and resource
>>>    allocations and will ask docker/podman to start a container
>>>    with all mount points, devices and environment variables from
>>>    the pod annotation.
>>>
>>> 6. docker/podman starts a container.
>>>    Need to mention here that in many cases initial process of
>>>    a container is not the actual application that will use a
>>>    vhost-user connection, but likely a shell that will invoke
>>>    the actual application.
>>>
>>> 7. Application starts inside the container, checks the environment
>>>    variables (actually, checking of environment variables usually
>>>    happens in a shell script that invokes the application with
>>>    correct arguments) and creates a net_virtio_user port in server
>>>    mode.  At this point socket file will be created.
>>>    (since we're running third-party application inside the container
>>>     we can only assume that it will do what is written here, it's
>>>     a responsibility of an application developer to do the right
>>>     thing.)
>>>
>>> 8. OVS successfully re-connects to the newly created socket in a
>>>    shared directory and vhost-user protocol establishes the network
>>>    connectivity.
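The re-connection loop from steps 4.3 and 8 can be illustrated with a generic client-side retry, sketched below. This is not rte_vhost's implementation; connect_with_retry, its interval and timeout values, and the "vhu0.sock" name are assumptions made for the example.

```python
import os
import socket
import tempfile
import threading
import time


def connect_with_retry(path, interval=0.05, timeout=5.0):
    """Retry connect() until a server appears at path; return (sock, attempts)."""
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return s, attempts
        except (FileNotFoundError, ConnectionRefusedError):
            # Socket file missing, or present but nobody is listening yet.
            s.close()
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def demo():
    """Start the 'application' late and count the client's connect attempts."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "vhu0.sock")  # hypothetical shared socket path

        def late_server():
            time.sleep(0.2)  # the application starts after the port exists
            srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            srv.bind(path)
            srv.listen(1)
            conn, _ = srv.accept()
            conn.close()
            srv.close()

        t = threading.Thread(target=late_server)
        t.start()
        sock, attempts = connect_with_retry(path)
        sock.close()
        t.join()
        return attempts
```

The retry interval bounds how long the connecting side waits after the socket appears, which is the start-up latency the thread refers to as the sleep loop.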
>>>
>>> As you can see, there are way too many entities and communication
>>> methods involved.  So, passing a pre-opened file descriptor from
>>> CNI all the way down to application is not as easy as it is in
>>> case of QEMU+LibVirt.
>>
>> File descriptor passing isn't necessary if OVS owns the listen socket
>> and the application container is the one who connects. That's why I
>> asked why dpdkvhostuser was deprecated in another email. The benefit of
>> doing this would be that the application container can instantly connect
>> to OVS without a sleep loop.
>>
>> I still don't get the attraction of the broker idea. The pros:
>> + Overcomes the issue with stale UNIX domain socket files
>> + Eliminates the re-connect sleep loop
>>
>> Neutral:
>> * vhost-user UNIX domain socket directory container volume is replaced
>>   by broker UNIX domain socket bind mount
>> * UNIX domain socket naming conflicts become broker key naming conflicts
>>
>> The cons:
>> - Requires running a new service on the host with potential security
>>   issues
>> - Requires support in third-party applications, QEMU, and DPDK/OVS
>> - The old code must be kept for compatibility with non-broker
>>   configurations, especially since third-party applications may not
>>   support the broker. Developers and users will have to learn about both
>>   options and decide which one to use.
>>
>> This seems like a modest improvement for the complexity and effort
>> involved. The same pros can be achieved by:
>> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>>   it doesn't catch a misconfiguration where the user launches two
>>   processes with the same socket path.
>> * Reversing the direction of the client/server relationship to
>>   eliminate the re-connect sleep loop at startup. I'm unsure whether
>>   this is possible.
>>
>> That said, the broker idea doesn't affect the vhost-user protocol itself
>> and is more of an OVS/DPDK topic. I may just not be familiar enough with
>> OVS/DPDK to understand the benefits of the approach.
>>
>> Stefan
>>
> 
