Re: How to implement message forwarding from one CID to another in vhost driver
Hi Dorjoy,

On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote:
>Hi,
>
>Hope you are doing well. I am working on adding AWS Nitro Enclave[1]
>emulation support in QEMU. Alexander Graf is mentoring me on this work. A v1
>patch series has already been posted to the qemu-devel mailing list[2].
>
>AWS Nitro Enclaves is an Amazon EC2[3] feature that allows creating isolated
>execution environments, called enclaves, from Amazon EC2 instances, which are
>used for processing highly sensitive data. Enclaves have no persistent storage
>and no external networking. The enclave VMs are based on the Firecracker
>microvm and have a vhost-vsock device for communication with the parent EC2
>instance that spawned them and a Nitro Secure Module (NSM) device for
>cryptographic attestation. The parent instance VM always has CID 3 while the
>enclave VM gets a dynamic CID. The enclave VMs can communicate with the parent
>instance over various ports to CID 3; for example, the init process inside an
>enclave sends a heartbeat to port 9000 upon boot, expecting a heartbeat reply,
>letting the parent instance know that the enclave VM has successfully booted.
>
>The plan is to eventually make the nitro enclave emulation in QEMU standalone,
>i.e., without needing to run another VM with CID 3 with proper vsock

If you don't have to launch another VM, maybe we can avoid vhost-vsock
and emulate virtio-vsock in user-space, having complete control over the
behavior.

So we could use this opportunity to implement virtio-vsock in QEMU [4]
or use vhost-user-vsock [5] and customize it somehow.
(Note: vhost-user-vsock already supports sibling communication, so maybe
with a few modifications it fits your case perfectly)

[4] https://gitlab.com/qemu-project/qemu/-/issues/2095
[5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock

>communication support. For this to work, one approach could be to teach the
>vhost driver in kernel to forward CID 3 messages to another CID N

So in this case both CID 3 and N would be assigned to the same QEMU
process?

Do you have to allocate 2 separate virtio-vsock devices, one for the
parent and one for the enclave?

>(set to CID 2 for host) i.e., it patches CID from 3 to N on incoming messages
>and from N to 3 on responses. This will enable users of the

Will these messages have the VMADDR_FLAG_TO_HOST flag set?

We don't support this in vhost-vsock yet, if supporting it helps, we
might, but we need to better understand how to avoid security issues, so
maybe each device needs to explicitly enable the feature and specify
from which CIDs it accepts packets.

>nitro-enclave machine type in QEMU to run the necessary vsock server/clients
>in the host machine (some defaults can be implemented in QEMU as well, for
>example, sending a reply to the heartbeat) which will rid them of the
>cumbersome way of running another whole VM with CID 3. This way, users of the
>nitro-enclave machine in QEMU could potentially also run multiple enclaves
>with their messages for CID 3 forwarded to different CIDs which, on the QEMU
>side, could then be specified using a new machine type option (parent-cid) if
>implemented. I guess on the QEMU side, this will be an ioctl call (or some
>other way) to indicate to the host kernel that the CID 3 messages need to be
>forwarded. Does this approach of

What if there is already a VM with CID = 3 in the system?

>forwarding CID 3 messages to another CID sound good?

It seems too specific a case, if we can generalize it maybe we could
make this change, but we would like to avoid complicating vhost-vsock
and keep it as simple as possible to avoid then having to implement
firewalls, etc.

So first I would see if vhost-user-vsock or the QEMU built-in device is
right for this use-case.

Thanks,
Stefano

>If this approach sounds good, I need some guidance on where the code should be
>written in order to achieve this. I would greatly appreciate any suggestions.
>Thanks.
>
>Regards,
>Dorjoy
>
>[1] https://docs.aws.amazon.com/enclaves/latest/user/nitro-enclave.html
>[2] https://mail.gnu.org/archive/html/qemu-devel/2024-05/msg03524.html
>[3] https://aws.amazon.com/ec2/
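[Editor's note: for readers following along, below is a minimal sketch of the
kind of host-side vsock service the "standalone" mode would let users (or a
QEMU default) run, using the heartbeat as the example. It assumes the enclave's
CID 3 traffic is actually delivered to the host (CID 2) — which is exactly the
forwarding question discussed above — and, since the thread does not specify
the heartbeat payload, it simply echoes the first byte it receives.]

/* Host-side AF_VSOCK listener on the Nitro init heartbeat port (9000).
 * Illustrative sketch only; not part of the patch series being discussed.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = VMADDR_CID_ANY,   /* accept from any guest CID */
        .svm_port   = 9000,             /* enclave init heartbeat port */
    };

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        perror("vsock listen");
        return 1;
    }

    for (;;) {
        int conn = accept(fd, NULL, NULL);
        unsigned char b;

        if (conn < 0)
            continue;
        if (read(conn, &b, 1) == 1)
            write(conn, &b, 1);         /* echo the heartbeat back */
        close(conn);
    }
}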
Re: How to implement message forwarding from one CID to another in vhost driver
Hey Stefano, Thanks for the reply. On Mon, May 20, 2024, 2:55 PM Stefano Garzarella wrote: > > Hi Dorjoy, > > On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote: > >Hi, > > > >Hope you are doing well. I am working on adding AWS Nitro Enclave[1] > >emulation support in QEMU. Alexander Graf is mentoring me on this work. A v1 > >patch series has already been posted to the qemu-devel mailing list[2]. > > > >AWS nitro enclaves is an Amazon EC2[3] feature that allows creating isolated > >execution environments, called enclaves, from Amazon EC2 instances, which are > >used for processing highly sensitive data. Enclaves have no persistent > >storage > >and no external networking. The enclave VMs are based on Firecracker microvm > >and have a vhost-vsock device for communication with the parent EC2 instance > >that spawned it and a Nitro Secure Module (NSM) device for cryptographic > >attestation. The parent instance VM always has CID 3 while the enclave VM > >gets > >a dynamic CID. The enclave VMs can communicate with the parent instance over > >various ports to CID 3, for example, the init process inside an enclave > >sends a > >heartbeat to port 9000 upon boot, expecting a heartbeat reply, letting the > >parent instance know that the enclave VM has successfully booted. > > > >The plan is to eventually make the nitro enclave emulation in QEMU standalone > >i.e., without needing to run another VM with CID 3 with proper vsock > > If you don't have to launch another VM, maybe we can avoid vhost-vsock > and emulate virtio-vsock in user-space, having complete control over the > behavior. > > So we could use this opportunity to implement virtio-vsock in QEMU [4] > or use vhost-user-vsock [5] and customize it somehow. > (Note: vhost-user-vsock already supports sibling communication, so maybe > with a few modifications it fits your case perfectly) > > [4] https://gitlab.com/qemu-project/qemu/-/issues/2095 > [5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock Thanks for letting me know. Right now I don't have a complete picture but I will look into them. Thank you. > > > > >communication support. For this to work, one approach could be to teach the > >vhost driver in kernel to forward CID 3 messages to another CID N > > So in this case both CID 3 and N would be assigned to the same QEMU > process? CID N is assigned to the enclave VM. CID 3 was supposed to be the parent VM that spawns the enclave VM (this is how it is in AWS, where an EC2 instance VM spawns the enclave VM from inside it and that parent EC2 instance always has CID 3). But in the QEMU case as we don't want a parent VM (we want to run enclave VMs standalone) we would need to forward the CID 3 messages to host CID. I don't know if it means CID 3 and CID N is assigned to the same QEMU process. Sorry. > > Do you have to allocate 2 separate virtio-vsock devices, one for the > parent and one for the enclave? If there is a parent VM, then I guess both parent and enclave VMs need virtio-vsock devices. > > >(set to CID 2 for host) i.e., it patches CID from 3 to N on incoming messages > >and from N to 3 on responses. This will enable users of the > > Will these messages have the VMADDR_FLAG_TO_HOST flag set? > > We don't support this in vhost-vsock yet, if supporting it helps, we > might, but we need to better understand how to avoid security issues, so > maybe each device needs to explicitly enable the feature and specify > from which CIDs it accepts packets. I don't know about the flag. 
So I don't know if it will be set. Sorry. > > >nitro-enclave machine > >type in QEMU to run the necessary vsock server/clients in the host machine > >(some defaults can be implemented in QEMU as well, for example, sending a > >reply > >to the heartbeat) which will rid them of the cumbersome way of running > >another > >whole VM with CID 3. This way, users of nitro-enclave machine in QEMU, could > >potentially also run multiple enclaves with their messages for CID 3 > >forwarded > >to different CIDs which, in QEMU side, could then be specified using a new > >machine type option (parent-cid) if implemented. I guess in the QEMU side, > >this > >will be an ioctl call (or some other way) to indicate to the host kernel that > >the CID 3 messages need to be forwarded. Does this approach of > > What if there is already a VM with CID = 3 in the system? Good question! I don't know what should happen in this case. > > >forwarding CID 3 messages to another CID sound good? > > It seems too specific a case, if we can generalize it maybe we could > make this change, but we would like to avoid complicating vhost-vsock > and keep it as simple as possible to avoid then having to implement > firewalls, etc. > > So first I would see if vhost-user-vsock or the QEMU built-in device is > right for this use-case. Thanks you! I will check everything out and reach out if I need further guidance about what needs to be done. And sorry as I was
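[Editor's note: since the VMADDR_FLAG_TO_HOST question comes up again below, a
small guest-side sketch may make it concrete. This only illustrates how the
enclave's connection to the well-known parent CID looks from user space;
whether the flag ends up set is normally decided by the guest kernel rather
than the application, and the explicit svm_flags assignment is guarded because
it assumes headers new enough to have the field.]

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int connect_to_parent(unsigned int port)
{
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm addr;

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid    = 3;                    /* well-known parent CID */
    addr.svm_port   = port;
#ifdef VMADDR_FLAG_TO_HOST
    addr.svm_flags  = VMADDR_FLAG_TO_HOST;  /* "route via the host" hint */
#endif
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("vsock connect");
        return -1;
    }
    return fd;
}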
Re: [patch net-next] virtio_net: add support for Byte Queue Limits
Fri, May 10, 2024 at 09:11:16AM CEST, hen...@linux.alibaba.com wrote: >On Thu, 9 May 2024 13:46:15 +0200, Jiri Pirko wrote: >> From: Jiri Pirko >> >> Add support for Byte Queue Limits (BQL). > >Historically both Jason and Michael have attempted to support BQL >for virtio-net, for example: > >https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/ > >These discussions focus primarily on: > >1. BQL is based on napi tx. Therefore, the transfer of statistical information >needs to rely on the judgment of use_napi. When the napi mode is switched to >orphan, some statistical information will be lost, resulting in temporary >inaccuracy in BQL. > >2. If tx dim is supported, orphan mode may be removed and tx irq will be more >reasonable. This provides good support for BQL. But when the device does not support dim, the orphan mode is still needed, isn't it? > >Thanks. > >> >> Signed-off-by: Jiri Pirko >> --- >> drivers/net/virtio_net.c | 33 - >> 1 file changed, 20 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index 218a446c4c27..c53d6dc6d332 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -84,7 +84,9 @@ struct virtnet_stat_desc { >> >> struct virtnet_sq_free_stats { >> u64 packets; >> +u64 xdp_packets; >> u64 bytes; >> +u64 xdp_bytes; >> }; >> >> struct virtnet_sq_stats { >> @@ -512,19 +514,19 @@ static void __free_old_xmit(struct send_queue *sq, >> bool in_napi, >> void *ptr; >> >> while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { >> -++stats->packets; >> - >> if (!is_xdp_frame(ptr)) { >> struct sk_buff *skb = ptr; >> >> pr_debug("Sent skb %p\n", skb); >> >> +stats->packets++; >> stats->bytes += skb->len; >> napi_consume_skb(skb, in_napi); >> } else { >> struct xdp_frame *frame = ptr_to_xdp(ptr); >> >> -stats->bytes += xdp_get_frame_len(frame); >> +stats->xdp_packets++; >> +stats->xdp_bytes += xdp_get_frame_len(frame); >> xdp_return_frame(frame); >> } >> } >> @@ -965,7 +967,8 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue >> *vq, void *buf) >> virtnet_rq_free_buf(vi, rq, buf); >> } >> >> -static void free_old_xmit(struct send_queue *sq, bool in_napi) >> +static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq, >> + bool in_napi) >> { >> struct virtnet_sq_free_stats stats = {0}; >> >> @@ -974,9 +977,11 @@ static void free_old_xmit(struct send_queue *sq, bool >> in_napi) >> /* Avoid overhead when no packets have been processed >> * happens when called speculatively from start_xmit. >> */ >> -if (!stats.packets) >> +if (!stats.packets && !stats.xdp_packets) >> return; >> >> +netdev_tx_completed_queue(txq, stats.packets, stats.bytes); >> + >> u64_stats_update_begin(&sq->stats.syncp); >> u64_stats_add(&sq->stats.bytes, stats.bytes); >> u64_stats_add(&sq->stats.packets, stats.packets); >> @@ -1013,13 +1018,15 @@ static void check_sq_full_and_disable(struct >> virtnet_info *vi, >> * early means 16 slots are typically wasted. >> */ >> if (sq->vq->num_free < 2+MAX_SKB_FRAGS) { >> -netif_stop_subqueue(dev, qnum); >> +struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum); >> + >> +netif_tx_stop_queue(txq); >> if (use_napi) { >> if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) >> virtqueue_napi_schedule(&sq->napi, sq->vq); >> } else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) { >> /* More just got used, free them then recheck. 
*/ >> -free_old_xmit(sq, false); >> +free_old_xmit(sq, txq, false); >> if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) { >> netif_start_subqueue(dev, qnum); >> virtqueue_disable_cb(sq->vq); >> @@ -2319,7 +2326,7 @@ static void virtnet_poll_cleantx(struct receive_queue >> *rq) >> >> do { >> virtqueue_disable_cb(sq->vq); >> -free_old_xmit(sq, true); >> +free_old_xmit(sq, txq, true); >> } while (unlikely(!virtqueue_enable_cb_delayed(sq->vq))); >> >> if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) >> @@ -2471,7 +2478,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, >> int budget) >> txq = netdev_get_tx_queue(vi->dev, index); >> __netif_tx_lock(txq, raw_smp_processor_id()); >> virtqueue_disable_cb(sq->vq); >> -free_old_xm
Re: How to implement message forwarding from one CID to another in vhost driver
Howdy,

On 20.05.24 14:44, Dorjoy Chowdhury wrote:
> Hey Stefano,
> Thanks for the reply.
>
> On Mon, May 20, 2024, 2:55 PM Stefano Garzarella wrote:
>> Hi Dorjoy,
>>
>> On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote:
>>> Hi,
>>>
>>> Hope you are doing well. I am working on adding AWS Nitro Enclave[1]
>>> emulation support in QEMU. Alexander Graf is mentoring me on this work. A
>>> v1 patch series has already been posted to the qemu-devel mailing list[2].
>>>
>>> AWS Nitro Enclaves is an Amazon EC2[3] feature that allows creating
>>> isolated execution environments, called enclaves, from Amazon EC2
>>> instances, which are used for processing highly sensitive data. Enclaves
>>> have no persistent storage and no external networking. The enclave VMs are
>>> based on the Firecracker microvm and have a vhost-vsock device for
>>> communication with the parent EC2 instance that spawned them and a Nitro
>>> Secure Module (NSM) device for cryptographic attestation. The parent
>>> instance VM always has CID 3 while the enclave VM gets a dynamic CID. The
>>> enclave VMs can communicate with the parent instance over various ports to
>>> CID 3; for example, the init process inside an enclave sends a heartbeat
>>> to port 9000 upon boot, expecting a heartbeat reply, letting the parent
>>> instance know that the enclave VM has successfully booted.
>>>
>>> The plan is to eventually make the nitro enclave emulation in QEMU
>>> standalone, i.e., without needing to run another VM with CID 3 with proper
>>> vsock
>>
>> If you don't have to launch another VM, maybe we can avoid vhost-vsock
>> and emulate virtio-vsock in user-space, having complete control over the
>> behavior.
>>
>> So we could use this opportunity to implement virtio-vsock in QEMU [4]
>> or use vhost-user-vsock [5] and customize it somehow.
>> (Note: vhost-user-vsock already supports sibling communication, so maybe
>> with a few modifications it fits your case perfectly)
>>
>> [4] https://gitlab.com/qemu-project/qemu/-/issues/2095
>> [5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock
>
> Thanks for letting me know. Right now I don't have a complete picture but I
> will look into them. Thank you.
>
>>> communication support. For this to work, one approach could be to teach
>>> the vhost driver in kernel to forward CID 3 messages to another CID N
>>
>> So in this case both CID 3 and N would be assigned to the same QEMU
>> process?
>
> CID N is assigned to the enclave VM. CID 3 was supposed to be the parent VM
> that spawns the enclave VM (this is how it is in AWS, where an EC2 instance
> VM spawns the enclave VM from inside it and that parent EC2 instance always
> has CID 3). But in the QEMU case, as we don't want a parent VM (we want to
> run enclave VMs standalone), we would need to forward the CID 3 messages to
> the host CID. I don't know if that means CID 3 and CID N are assigned to the
> same QEMU process. Sorry.

There are 2 use cases here:

1) Enclave wants to treat host as parent (default).

   In this scenario, the "parent instance" that shows up as CID 3 in the
   Enclave doesn't really exist. Instead, when the Enclave attempts to talk
   to CID 3, it should really land on CID 0 (hypervisor). When the hypervisor
   tries to connect to the Enclave on port X, it should look as if it
   originates from CID 3, not CID 0.

2) Multiple parent VMs.

   Think of an actual cloud hosting scenario. Here, we have multiple "parent
   instances". Each of them thinks it's CID 3. Each can spawn an Enclave that
   talks to CID 3 and reaches the parent. For this case, I think implementing
   all of virtio-vsock in user space is the best path forward. But in theory,
   you could also swizzle CIDs to make random "real" CIDs appear as CID 3.

>> Do you have to allocate 2 separate virtio-vsock devices, one for the
>> parent and one for the enclave?
>
> If there is a parent VM, then I guess both parent and enclave VMs need
> virtio-vsock devices.
>
>>> (set to CID 2 for host) i.e., it patches CID from 3 to N on incoming
>>> messages and from N to 3 on responses. This will enable users of the
>>
>> Will these messages have the VMADDR_FLAG_TO_HOST flag set?
>>
>> We don't support this in vhost-vsock yet, if supporting it helps, we
>> might, but we need to better understand how to avoid security issues, so
>> maybe each device needs to explicitly enable the feature and specify
>> from which CIDs it accepts packets.
>
> I don't know about the flag. So I don't know if it will be set. Sorry.

From the guest's point of view, the parent (CID 3) is just another VM. Since
Linux as of
https://patchwork.ozlabs.org/project/netdev/patch/20201204170235.84387-4-andra...@amazon.com/#2594117
always sets VMADDR_FLAG_TO_HOST when local_CID > 0 && remote_CID > 0, I would
say the message has the flag set.

How would you envision the host to implement the flag? Would the host allow
user space to listen on any CID and hence receive the respective target
connections? And wouldn't listening on CID 0 then mean you're effectively
listening to "any" other CID?

Thinking about that a bit more, that may be just what we need, yes :)

>>> nitro-enclave machine type in QEM
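[Editor's note: to make the CID "swizzling" idea above a bit more concrete,
here is a sketch of the header rewrite a user-space device model (for example
a modified vhost-user-vsock backend) could apply to each packet. The function
names, the PARENT_CID constant and the exact hook points in the packet path
are assumptions made for illustration; only the src/dst CID rewrite itself is
the point.]

#include <stdint.h>
#include <endian.h>
#include <linux/virtio_vsock.h>

#define PARENT_CID 3    /* what the enclave guest believes it talks to */

/* Guest -> host direction: traffic the enclave aims at CID 3 is redirected
 * to the real peer (the host, or a per-enclave "parent" CID). */
static void swizzle_tx(struct virtio_vsock_hdr *hdr, uint64_t real_parent_cid)
{
    if (le64toh(hdr->dst_cid) == PARENT_CID)
        hdr->dst_cid = htole64(real_parent_cid);
}

/* Host -> guest direction: replies from the real peer are made to look as if
 * they come from CID 3, so the enclave's view of its parent stays intact. */
static void swizzle_rx(struct virtio_vsock_hdr *hdr, uint64_t real_parent_cid)
{
    if (le64toh(hdr->src_cid) == real_parent_cid)
        hdr->src_cid = htole64(PARENT_CID);
}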
Re: [PATCH net-next] virtio-net: synchronize operstate with admin state on up/down
On Mon, 20 May 2024 09:03:02 +0800, Jason Wang wrote: > This patch synchronize operstate with admin state per RFC2863. > > This is done by trying to toggle the carrier upon open/close and > synchronize with the config change work. This allows propagate status > correctly to stacked devices like: > > ip link add link enp0s3 macvlan0 type macvlan > ip link set link enp0s3 down > ip link show > > Before this patch: > > 3: enp0s3: mtu 1500 qdisc pfifo_fast state DOWN mode > DEFAULT group default qlen 1000 > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff > .. > 5: macvlan0@enp0s3: mtu 1500 qdisc > noqueue state UP mode DEFAULT group default qlen 1000 > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff > > After this patch: > > 3: enp0s3: mtu 1500 qdisc pfifo_fast state DOWN mode > DEFAULT group default qlen 1000 > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff > ... > 5: macvlan0@enp0s3: mtu 1500 qdisc > noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000 > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff > > Cc: Venkat Venkatsubra > Cc: Gia-Khanh Nguyen > Signed-off-by: Jason Wang Reviewed-by: Xuan Zhuo Thanks. > --- > drivers/net/virtio_net.c | 94 +++- > 1 file changed, 63 insertions(+), 31 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 4e1a0fc0d555..24d880a5023d 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -433,6 +433,12 @@ struct virtnet_info { > /* The lock to synchronize the access to refill_enabled */ > spinlock_t refill_lock; > > + /* Is config change enabled? */ > + bool config_change_enabled; > + > + /* The lock to synchronize the access to config_change_enabled */ > + spinlock_t config_change_lock; > + > /* Work struct for config space updates */ > struct work_struct config_work; > > @@ -623,6 +629,20 @@ static void disable_delayed_refill(struct virtnet_info > *vi) > spin_unlock_bh(&vi->refill_lock); > } > > +static void enable_config_change(struct virtnet_info *vi) > +{ > + spin_lock_irq(&vi->config_change_lock); > + vi->config_change_enabled = true; > + spin_unlock_irq(&vi->config_change_lock); > +} > + > +static void disable_config_change(struct virtnet_info *vi) > +{ > + spin_lock_irq(&vi->config_change_lock); > + vi->config_change_enabled = false; > + spin_unlock_irq(&vi->config_change_lock); > +} > + > static void enable_rx_mode_work(struct virtnet_info *vi) > { > rtnl_lock(); > @@ -2421,6 +2441,25 @@ static int virtnet_enable_queue_pair(struct > virtnet_info *vi, int qp_index) > return err; > } > > +static void virtnet_update_settings(struct virtnet_info *vi) > +{ > + u32 speed; > + u8 duplex; > + > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) > + return; > + > + virtio_cread_le(vi->vdev, struct virtio_net_config, speed, &speed); > + > + if (ethtool_validate_speed(speed)) > + vi->speed = speed; > + > + virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, &duplex); > + > + if (ethtool_validate_duplex(duplex)) > + vi->duplex = duplex; > +} > + > static int virtnet_open(struct net_device *dev) > { > struct virtnet_info *vi = netdev_priv(dev); > @@ -2439,6 +2478,18 @@ static int virtnet_open(struct net_device *dev) > goto err_enable_qp; > } > > + /* Assume link up if device can't report link status, > +otherwise get link status from config. 
*/ > + netif_carrier_off(dev); > + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) { > + enable_config_change(vi); > + schedule_work(&vi->config_work); > + } else { > + vi->status = VIRTIO_NET_S_LINK_UP; > + virtnet_update_settings(vi); > + netif_carrier_on(dev); > + } > + > return 0; > > err_enable_qp: > @@ -2875,12 +2926,19 @@ static int virtnet_close(struct net_device *dev) > disable_delayed_refill(vi); > /* Make sure refill_work doesn't re-enable napi! */ > cancel_delayed_work_sync(&vi->refill); > + /* Make sure config notification doesn't schedule config work */ > + disable_config_change(vi); > + /* Make sure status updating is cancelled */ > + cancel_work_sync(&vi->config_work); > > for (i = 0; i < vi->max_queue_pairs; i++) { > virtnet_disable_queue_pair(vi, i); > cancel_work_sync(&vi->rq[i].dim.work); > } > > + vi->status &= ~VIRTIO_NET_S_LINK_UP; > + netif_carrier_off(dev); > + > return 0; > } > > @@ -4583,25 +4641,6 @@ static void virtnet_init_settings(struct net_device > *dev) > vi->duplex = DUPLEX_UNKNOWN; > } > > -static void virtnet_update_settings(struct virtnet_info *vi) > -{ > - u32 speed; > - u8 duplex; > - > - if (!virtio_has_feature(vi->vdev, VIRTIO_NET_
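[Editor's note: a condensed restatement of the mechanism in the commit message
above, heavily simplified — the config_change_enabled gating, locking,
speed/duplex handling and queue teardown from the real patch are omitted. It
reuses the vi->vdev and vi->config_work fields of virtnet_info shown in the
quoted diff; the function names here are placeholders.]

static int example_open(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	netif_carrier_off(dev);
	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
		/* Let the config-change worker read the STATUS field and
		 * toggle the carrier accordingly. */
		schedule_work(&vi->config_work);
	} else {
		/* No STATUS feature: assume the link is always up. */
		netif_carrier_on(dev);
	}
	return 0;
}

static int example_close(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	/* Stop status updates racing with close, then drop the carrier so
	 * operstate follows the admin state and stacked devices (like the
	 * macvlan in the commit message) see LOWERLAYERDOWN. */
	cancel_work_sync(&vi->config_work);
	netif_carrier_off(dev);
	return 0;
}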