On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>> Hi Xie and Yanping,
>>>>
>>>> May I ask you some questions?
>>>> It seems we are also developing an almost identical one.
>>> Good to know that we are tackling the same problem and have a similar
>>> idea.
>>> What is your status now? We have the POC running, and it is compliant
>>> with dpdkvhost.
>>> Interrupt-like notification isn't supported.
>> We implemented the vhost PMD first, so we have just started implementing
>> this one.
>>
>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>> Added dev at dpdk.org
>>>>>
>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>> Yanping:
>>>>>> I read your mail; it seems what we did is quite similar. Here I wrote
>>>>>> a quick mail to describe our design. Let me know if it is the same
>>>>>> thing.
>>>>>>
>>>>>> Problem statement:
>>>>>> We don't have a high-performance networking interface in containers
>>>>>> for NFV. The current veth-pair-based interface cannot be easily
>>>>>> accelerated.
>>>>>>
>>>>>> The key components involved:
>>>>>> 1. DPDK-based virtio PMD driver in the container.
>>>>>> 2. Device simulation framework in the container.
>>>>>> 3. DPDK (or kernel) vhost running on the host.
>>>>>>
>>>>>> How is virtio created?
>>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>>> 1) The host maintains pools of memory and shares memory with the
>>>>>>    container. This can be accomplished by the host sharing a huge
>>>>>>    page file with the container.
>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>> 4) The container sends the memory and vring information to vhost
>>>>>>    through vhost messages. This can be done either through an ioctl
>>>>>>    call or a vhost-user message.
>>>>>>
>>>>>> How is the vhost message sent?
>>>>>> A: There are two alternative ways to do this.
>>>>>> 1) The customized virtio PMD is responsible for all the vring
>>>>>>    creation and vhost message sending.
>>>> Above is our approach so far.
>>>> It seems Yanping also takes this kind of approach.
>>>> We are using the vhost-user functionality instead of the vhost-net
>>>> kernel module.
>>>> Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to
>>> "user space" vhost through the cuse interface. It could also be done
>>> through the vhost-user socket. This isn't the key point.
>>> Here vhost-user is kind of confusing; maybe "user space vhost" is more
>>> accurate, either cuse or unix domain socket. :)
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module,
>>> but they are also trying to connect to "user space" vhost. Correct me
>>> if I am wrong.
>>> Yes, there is some difference between these two. The vhost-net kernel
>>> module can directly access another process's memory, while with
>>> vhost-user (cuse/user) we need to do the memory mapping.
>>>> BTW, we are going to submit a vhost PMD for DPDK-2.2.
>>>> This PMD is implemented on librte_vhost.
>>>> It allows a DPDK application to handle a vhost-user (cuse) backend as
>>>> a normal NIC port.
>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>
>>>>>> 2) We could do this through a lightweight device simulation
>>>>>>    framework. The device simulation creates a simple PCI bus. On the
>>>>>>    PCI bus, virtio-net PCI devices are created. The device
>>>>>>    simulation provides an IOAPI for MMIO/IO access.
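
(Inline note on the shared-memory part above: just to check that we mean the
same thing by "creates virtio rings on the shared memory", here is a minimal
sketch of what the container side could look like, assuming the host shares
a hugepage file such as /mnt/huge/container1/rtemap_0. The path, size, and
layout are only examples I made up for this mail, not working code from
either implementation.)

/* Minimal sketch (example only): map the hugepage file shared by the host
 * and lay out a vring on it. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_HUGEPAGE "/mnt/huge/container1/rtemap_0"   /* example path */
#define HUGEPAGE_SIZE   (2UL * 1024 * 1024)               /* 2 MB example */

int main(void)
{
    int fd = open(SHARED_HUGEPAGE, O_RDWR);
    if (fd < 0) {
        perror("open shared hugepage");
        return 1;
    }

    /* Both the host vhost and the container PMD map the same file, so a
     * vring placed here is visible to the host after address translation. */
    void *shm = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* The vring descriptor table would start at some offset in the file;
     * that offset is what later goes into the vhost message. */
    uint64_t vring_offset = 0;
    printf("vring would be placed at offset %" PRIu64 " of %s\n",
           vring_offset, SHARED_HUGEPAGE);

    munmap(shm, HUGEPAGE_SIZE);
    close(fd);
    return 0;
}
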
>>>> Does this mean you implemented a kernel module?
>>>> If so, do you still need the vhost-cuse functionality to handle vhost
>>>> messages in userspace?
>>> The device simulation is a library running in user space in the
>>> container. It is linked with the DPDK app. It creates pseudo buses and
>>> virtio-net PCI devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through the IOAPI provided by the device simulation rather than through
>>> IO instructions as in KVM.
>>> Why do we use device simulation?
>>> We can create other virtio devices in the container and provide a
>>> common way to talk to the vhost-xx module.
>> Thanks for the explanation.
>> At first reading, I thought the difference between approach 1 and
>> approach 2 was whether we need to implement a new kernel module or not.
>> But now I understand how you implemented it.
>>
>> Please let me explain our design in more detail.
>> We might use a somewhat similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (Anyway, we haven't finished implementing it yet, so this overview might
>> have some technical problems.)
>>
>> Step 1. Separate the virtio-net and vhost-user socket related code from
>> QEMU, then implement it as a separate program.
>> The program also has the features below:
>> - Create a directory that contains almost the same files as
>>   /sys/bus/pci/device/<pci address>/*
>>   (To scan these files, located outside sysfs, we need to fix the EAL.)
>> - This dummy device is driven by dummy-virtio-net-driver. The name is
>>   specified by the '<pci addr>/driver' file.
>> - Create a shared file that represents the PCI configuration space, then
>>   mmap it, and also specify its path in '<pci addr>/resource_path'.
>>
>> The program will be GPL, but it will be like a bridge on the shared
>> memory between the virtio-net PMD and the DPDK vhost backend.
>> Actually, it will work under the virtio-net PMD, but we don't need to
>> link against it. So I guess we don't have a GPL license issue.
>>
>> Step 2. Fix the PCI scan code of the EAL to scan dummy devices.
>> - To scan the above files, extend pci_scan() of the EAL.
>>
>> Step 3. Add a new kdrv type to the EAL.
>> - To handle 'dummy-virtio-net-driver', add a new kdrv type to the EAL.
>>
>> Step 4. Implement pci_dummy_virtio_net_map/unmap().
>> - It will have almost the same functionality as pci_uio_map(), but for
>>   the dummy virtio-net device.
>> - The dummy device will be mmapped using the path specified in
>>   '<pci addr>/resource_path'.
>>
>> Step 5. Add a new compile option for the virtio-net device to replace
>> the IO functions.
>> - The IO functions of the virtio-net PMD will be replaced by read() and
>>   write() accesses to the shared memory.
>> - Add a notification mechanism to the IO functions. This will be used
>>   when a write() to the shared memory is done.
>>   (Not sure exactly, but probably we need it.)
>>
>> Does it make sense?
>> I guess Steps 1 and 2 are different from your approach, but the rest
>> might be similar.
>>
>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>> so far I don't have a good way to register them from user space without
>> loading a kernel module.
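
(Inline note: to make Step 4 above more concrete, here is roughly what I
have in mind. The function name pci_dummy_virtio_net_map() comes from the
step above, but the body below is only an untested sketch with made-up file
handling, not actual EAL code.)

/* Sketch of Step 4 (untested, not real EAL code): read the path stored in
 * '<pci addr>/resource_path' and mmap the shared file that represents the
 * dummy device's configuration/IO space. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void *
pci_dummy_virtio_net_map(const char *devdir, size_t len)
{
    char path[256], resource[256];
    FILE *f;
    int fd;
    void *addr;

    /* '<pci addr>/resource_path' holds the path of the shared file
     * created by the pseudo virtio-net device process. */
    snprintf(path, sizeof(path), "%s/resource_path", devdir);
    f = fopen(path, "r");
    if (f == NULL)
        return NULL;
    if (fgets(resource, sizeof(resource), f) == NULL) {
        fclose(f);
        return NULL;
    }
    fclose(f);
    resource[strcspn(resource, "\n")] = '\0';

    fd = open(resource, O_RDWR);
    if (fd < 0)
        return NULL;

    /* Like pci_uio_map(), but backed by the shared file, so the replaced
     * read()/write() style accessors (Step 5) can touch this region. */
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    return (addr == MAP_FAILED) ? NULL : addr;
}
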
> Tetsuya:
> I don't quite get the details. Who will create those sysfs entries? A
> kernel module, right?

Hi Xie,

I don't create sysfs entries. I just create a directory that contains files
that look like sysfs entries, and initialize the EAL with not only sysfs but
also the above directory.
In the last quoted sentence, I wanted to say that we just need files that
look like sysfs entries. But I don't know a good way to create files under
sysfs without loading a kernel module. That is why I try to create the
additional directory.

> The virtio-net is configured through read/write to shared memory (between
> host and guest), right?

Yes, I agree.

> Where are the shared vring and shared memory created, on a shared huge
> page between host and guest?

The virtqueues (vrings) are on the guest hugepage. Let me explain.
The guest container should have read/write access to a part of the hugepage
directory on the host.
(For example, /mnt/huge/container1/ is shared between host and guest.)
Also, host and guest need to communicate through a unix domain socket.
(For example, host and guest can communicate using "/tmp/container1/sock".)
If we can do the above, a virtio-net PMD on the guest can create
virtqueues (vrings) on its hugepage and write this information to a pseudo
virtio-net device, which is a process created in the guest container.
Then the pseudo virtio-net device sends it to the vhost-user backend (the
host DPDK application) through a unix domain socket.
So with my plan there are 3 processes: the DPDK applications on host and
guest, plus a process that works like a virtio-net device.

> Who will talk to dpdkvhost?

If we need to talk to a cuse device or the vhost-net kernel module, the
above pseudo virtio-net device could do so.
(But, so far, my target is only vhost-user.)

>> This is because I need to change pci_scan() also.
>>
>> It seems you have implemented a virtio-net pseudo device under the BSD
>> license.
>> If so, it would be nice for this kind of PMD to use it.
> Currently it is based on the native linux kvm tool.

Great, I hadn't noticed this option.

>> In the case that it takes much time to implement some missing
>> functionality, like interrupt mode, using the QEMU code might be one of
>> the options.
> For interrupt mode, I plan to use eventfd for sleep/wake; I have not
> tried it yet.
>> Anyway, we just need a good virtual NIC between containers and the host,
>> so we are not tied to our particular approach and implementation.
> Do you have comments on my implementation?
> We could publish the version without the device framework first for
> reference.

I don't have any comments. Could you please share it? I am looking forward
to seeing it.

Tetsuya
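
P.S. In case it helps to make the 3-process picture above concrete, below is
a rough sketch of how the pseudo virtio-net device process could hand the
shared hugepage fd to the vhost-user backend over the unix domain socket
(using SCM_RIGHTS). The socket path is only the example from above, and the
single-word payload is made up; it is not the real vhost-user message layout.

/* Sketch only: connect to the vhost-user backend and pass the hugepage fd
 * with SCM_RIGHTS. The payload here is just the region size; a real
 * vhost-user message also carries addresses and offsets. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

int
send_mem_fd(const char *sock_path, int hugepage_fd, uint64_t mem_size)
{
    struct sockaddr_un addr;
    struct iovec iov = { .iov_base = &mem_size, .iov_len = sizeof(mem_size) };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int s, ret;

    s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }

    memset(&control, 0, sizeof(control));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;              /* pass the fd itself */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &hugepage_fd, sizeof(int));

    ret = (sendmsg(s, &msg, 0) < 0) ? -1 : 0;
    close(s);
    return ret;
}

/* Example call from the pseudo device process (paths are examples only):
 *     send_mem_fd("/tmp/container1/sock", hugepage_fd, 2UL * 1024 * 1024);
 */
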