Re: vmd(8): fix vmctl wait state corruption

Mischa Mon, 26 Apr 2021 00:59:24 -0700


> On 24 Apr 2021, at 20:56, Dave Voutila <[email protected]> wrote:
> 
> 
> Dave Voutila writes:
> 
>> Dave Voutila writes:
>> 
>>> vmd(8) users of tech@,
>>> 
>>> NOTE: I have no intention to try to commit this prior to 6.9's release
>>> due to its complexity, but I didn't want to "wait" to solicit testers or
>>> potential feedback.
>> 
>> Freeze is over, so bumping this thread with an updated diff below.
>> 
> 
> Now that there's been some testing and snaps are building once again,
> anyone willing to review & OK?


Wanted to confirm here is as well. The patch works well.
Ran this patch against -current with ~30 VMs owned by a user account.

Issued vmctl stop -aw, ctrl-c-ed every 3-4 VM, and every time the last VM 
stopped shutdown properly.
Even in rapid succession of vmctl stop -aw + ctrl-c, resulting in multiple VMs 
in "stopping” stage, all worked well.
All VMs also started properly without any fsck needed.

Mischa

>>> I noticed recently that I could not have two vmctl(8) clients "wait" for
>>> the same vm to shutdown as one would cancel the other. Worse yet, if you
>>> cancel a wait (^C) you can effectively corrupt the state being used for
>>> tracking the waiting client preventing future clients from waiting on
>>> the vm.
>>> 
>>> It turns out the socket fd of the vmctl(8) client is being sent by the
>>> control process as the peerid in the imsg. This fd is being stored on
>>> the vmd_vm structure in the vm_peerid member, but this fd only has
>>> meaning in the scope of the control process. Consequently:
>>> 
>>> - only 1 value can be stored at a time, meaning only 1 waiting client
>>>  can exist at a time
>>> - since vm_peerid is used for storing if another vmd(8) process is
>>>  waiting on the vm, vm_peerid can be corrupted by vmctl(8)
>>> - the control process cannot update waiting state on vmctl disconnects
>>>  and since fd's are reused it's possible the message could be sent to a
>>>  vmctl(8) client performing an operation other than a "wait"
>>> 
>>> The below diff:
>>> 
>>> 1. enables support for multiple vmctl(8) clients to wait on the same vm
>>>   to terminate
>>> 2. keeps the wait state in the control process and out of the parent's
>>>   global vm state, tracking waiting parties in a TAILQ
>>> 3. removes the waiting client state on client disconnect/cancellation
>>> 4. simplifies vmd(8) by removing IMSG_VMDOP_WAIT_VM_REQUEST handling
>>>   from the vmm process, which isn't needed (and was partially
>>>   responsible for the corruption)
>>> 
>> 
>> Above design still stands, but I've fixed some messaging issues related
>> to the fact the parent process was forwarding
>> IMSG_VMDOP_TERMINATE_VM_RESPONSE messages directly to the control
>> process resulting in duplicate messages. This broke doing a `vmctl stop`
>> for all vms (-a) and waiting (-w). It now only forwards errors.
>> 
>>> There are some subsequent tweaks that may follow this diff, specifically
>>> one related to the fact I've switched the logic to send
>>> IMSG_VMDOP_TERMINATE_VM_EVENT messages to the control process (which
>>> makes sense to me) but control relays a IMSG_VMDOP_TERMINATE_VM_RESPONSE
>>> message to the waiting vmctl(8) client. I'd need to update vmctl(8) to
>>> look for the other event and don't want to complicate the diff further.
>>> 
>>> If any testers out there can try to break this for me it would be much
>>> appreciated. :-)
>>> 
>> 
>> Testers? I'd like to give people a few days to kick the tires before
>> asking for OK to commit.
>> 
>> -dv
>> 
>> 
>> Index: control.c
>> ===================================================================
>> RCS file: /cvs/src/usr.sbin/vmd/control.c,v
>> retrieving revision 1.34
>> diff -u -p -r1.34 control.c
>> --- control.c        20 Apr 2021 21:11:56 -0000      1.34
>> +++ control.c        21 Apr 2021 17:17:04 -0000
>> @@ -41,6 +41,13 @@
>> 
>> struct ctl_connlist ctl_conns = TAILQ_HEAD_INITIALIZER(ctl_conns);
>> 
>> +struct ctl_notify {
>> +    int                     ctl_fd;
>> +    uint32_t                ctl_vmid;
>> +    TAILQ_ENTRY(ctl_notify) entry;
>> +};
>> +TAILQ_HEAD(ctl_notify_q, ctl_notify) ctl_notify_q =
>> +    TAILQ_HEAD_INITIALIZER(ctl_notify_q);
>> void
>>       control_accept(int, short, void *);
>> struct ctl_conn
>> @@ -78,7 +85,10 @@ int
>> control_dispatch_vmd(int fd, struct privsep_proc *p, struct imsg *imsg)
>> {
>>      struct ctl_conn         *c;
>> +    struct ctl_notify       *notify = NULL, *notify_next;
>>      struct privsep          *ps = p->p_ps;
>> +    struct vmop_result       vmr;
>> +    int                      waiting = 0;
>> 
>>      switch (imsg->hdr.type) {
>>      case IMSG_VMDOP_START_VM_RESPONSE:
>> @@ -86,11 +96,11 @@ control_dispatch_vmd(int fd, struct priv
>>      case IMSG_VMDOP_SEND_VM_RESPONSE:
>>      case IMSG_VMDOP_RECEIVE_VM_RESPONSE:
>>      case IMSG_VMDOP_UNPAUSE_VM_RESPONSE:
>> -    case IMSG_VMDOP_TERMINATE_VM_RESPONSE:
>>      case IMSG_VMDOP_GET_INFO_VM_DATA:
>>      case IMSG_VMDOP_GET_INFO_VM_END_DATA:
>>      case IMSG_CTL_FAIL:
>>      case IMSG_CTL_OK:
>> +            /* Provide basic response back to a specific control client */
>>              if ((c = control_connbyfd(imsg->hdr.peerid)) == NULL) {
>>                      log_warnx("%s: lost control connection: fd %d",
>>                          __func__, imsg->hdr.peerid);
>> @@ -99,6 +109,61 @@ control_dispatch_vmd(int fd, struct priv
>>              imsg_compose_event(&c->iev, imsg->hdr.type,
>>                  0, 0, imsg->fd, imsg->data, IMSG_DATA_SIZE(imsg));
>>              break;
>> +    case IMSG_VMDOP_TERMINATE_VM_RESPONSE:
>> +            IMSG_SIZE_CHECK(imsg, &vmr);
>> +            memcpy(&vmr, imsg->data, sizeof(vmr));
>> +
>> +            if ((c = control_connbyfd(imsg->hdr.peerid)) == NULL) {
>> +                    log_warnx("%s: lost control connection: fd %d",
>> +                        __func__, imsg->hdr.peerid);
>> +                    return (0);
>> +            }
>> +
>> +            TAILQ_FOREACH(notify, &ctl_notify_q, entry) {
>> +                    if (notify->ctl_fd == (int) imsg->hdr.peerid) {
>> +                            /*
>> +                             * Update if waiting by vm name. This is only
>> +                             * supported when stopping a single vm. If
>> +                             * stopping all vms, vmctl(8) sends the request
>> +                             * using the vmid.
>> +                             */
>> +                            if (notify->ctl_vmid < 1)
>> +                                    notify->ctl_vmid = vmr.vmr_id;
>> +                            waiting = 1;
>> +                            break;
>> +                    }
>> +            }
>> +
>> +            /* An error needs to be relayed to the client immediately */
>> +            if (!waiting || vmr.vmr_result) {
>> +                    imsg_compose_event(&c->iev, imsg->hdr.type,
>> +                        0, 0, imsg->fd, imsg->data, IMSG_DATA_SIZE(imsg));
>> +
>> +                    if (notify) {
>> +                            TAILQ_REMOVE(&ctl_notify_q, notify, entry);
>> +                            free(notify);
>> +                    }
>> +            }
>> +            break;
>> +    case IMSG_VMDOP_TERMINATE_VM_EVENT:
>> +            /* Notify any waiting clients that a VM terminated */
>> +            IMSG_SIZE_CHECK(imsg, &vmr);
>> +            memcpy(&vmr, imsg->data, sizeof(vmr));
>> +
>> +            TAILQ_FOREACH_SAFE(notify, &ctl_notify_q, entry, notify_next) {
>> +                    if (notify->ctl_vmid != vmr.vmr_id)
>> +                            continue;
>> +                    if ((c = control_connbyfd(notify->ctl_fd)) != NULL) {
>> +                            /* XXX vmctl expects *_RESPONSE, not *_EVENT */
>> +                            imsg_compose_event(&c->iev,
>> +                                IMSG_VMDOP_TERMINATE_VM_RESPONSE,
>> +                                0, 0, imsg->fd, imsg->data,
>> +                                IMSG_DATA_SIZE(imsg));
>> +                            TAILQ_REMOVE(&ctl_notify_q, notify, entry);
>> +                            free(notify);
>> +                    }
>> +            }
>> +            break;
>>      case IMSG_VMDOP_CONFIG:
>>              config_getconfig(ps->ps_env, imsg);
>>              proc_compose(ps, PROC_PARENT, IMSG_VMDOP_DONE, NULL, 0);
>> @@ -276,7 +341,8 @@ control_connbyfd(int fd)
>> void
>> control_close(int fd, struct control_sock *cs)
>> {
>> -    struct ctl_conn *c;
>> +    struct ctl_conn         *c;
>> +    struct ctl_notify       *notify, *notify_next;
>> 
>>      if ((c = control_connbyfd(fd)) == NULL) {
>>              log_warn("%s: fd %d: not found", __func__, fd);
>> @@ -286,6 +352,14 @@ control_close(int fd, struct control_soc
>>      msgbuf_clear(&c->iev.ibuf.w);
>>      TAILQ_REMOVE(&ctl_conns, c, entry);
>> 
>> +    TAILQ_FOREACH_SAFE(notify, &ctl_notify_q, entry, notify_next) {
>> +            if (notify->ctl_fd == fd) {
>> +                    TAILQ_REMOVE(&ctl_notify_q, notify, entry);
>> +                    free(notify);
>> +                    break;
>> +            }
>> +    }
>> +
>>      event_del(&c->iev.ev);
>>      close(c->iev.ibuf.fd);
>> 
>> @@ -308,7 +382,8 @@ control_dispatch_imsg(int fd, short even
>>      struct imsg                      imsg;
>>      struct vmop_create_params        vmc;
>>      struct vmop_id                   vid;
>> -    int                              n, v, ret = 0;
>> +    struct ctl_notify               *notify;
>> +    int                              n, v, wait = 0, ret = 0;
>> 
>>      if ((c = control_connbyfd(fd)) == NULL) {
>>              log_warn("%s: fd %d: not found", __func__, fd);
>> @@ -388,11 +463,25 @@ control_dispatch_imsg(int fd, short even
>>                      }
>>                      break;
>>              case IMSG_VMDOP_WAIT_VM_REQUEST:
>> +                    wait = 1;
>> +                    /* FALLTHROUGH */
>>              case IMSG_VMDOP_TERMINATE_VM_REQUEST:
>>                      if (IMSG_DATA_SIZE(&imsg) < sizeof(vid))
>>                              goto fail;
>>                      memcpy(&vid, imsg.data, sizeof(vid));
>>                      vid.vid_uid = c->peercred.uid;
>> +
>> +                    if (wait || vid.vid_flags & VMOP_WAIT) {
>> +                            vid.vid_flags |= VMOP_WAIT;
>> +                            notify = calloc(1, sizeof(struct ctl_notify));
>> +                            if (notify == NULL)
>> +                                    fatal("%s: calloc", __func__);
>> +                            notify->ctl_vmid = vid.vid_id;
>> +                            notify->ctl_fd = fd;
>> +                            TAILQ_INSERT_TAIL(&ctl_notify_q, notify, entry);
>> +                            log_debug("%s: registered wait for peer %d",
>> +                                __func__, fd);
>> +                    }
>> 
>>                      if (proc_compose_imsg(ps, PROC_PARENT, -1,
>>                          imsg.hdr.type, fd, -1, &vid, sizeof(vid)) == -1) {
>> Index: vmd.c
>> ===================================================================
>> RCS file: /cvs/src/usr.sbin/vmd/vmd.c,v
>> retrieving revision 1.122
>> diff -u -p -r1.122 vmd.c
>> --- vmd.c    5 Apr 2021 11:35:26 -0000       1.122
>> +++ vmd.c    21 Apr 2021 17:17:04 -0000
>> @@ -128,42 +128,41 @@ vmd_dispatch_control(int fd, struct priv
>>              IMSG_SIZE_CHECK(imsg, &vid);
>>              memcpy(&vid, imsg->data, sizeof(vid));
>>              flags = vid.vid_flags;
>> +            cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>> 
>>              if ((id = vid.vid_id) == 0) {
>>                      /* Lookup vm (id) by name */
>>                      if ((vm = vm_getbyname(vid.vid_name)) == NULL) {
>>                              res = ENOENT;
>> -                            cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>>                              break;
>>                      } else if ((vm->vm_state & VM_STATE_SHUTDOWN) &&
>>                          (flags & VMOP_FORCE) == 0) {
>>                              res = EALREADY;
>> -                            cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>>                              break;
>>                      } else if (!(vm->vm_state & VM_STATE_RUNNING)) {
>>                              res = EINVAL;
>> -                            cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>>                              break;
>>                      }
>>                      id = vm->vm_vmid;
>>              } else if ((vm = vm_getbyvmid(id)) == NULL) {
>>                      res = ENOENT;
>> -                    cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>>                      break;
>>              }
>> -            if (vm_checkperm(vm, &vm->vm_params.vmc_owner,
>> -                vid.vid_uid) != 0) {
>> +            if (vm_checkperm(vm, &vm->vm_params.vmc_owner, vid.vid_uid)) {
>>                      res = EPERM;
>> -                    cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>>                      break;
>>              }
>> 
>> -            memset(&vid, 0, sizeof(vid));
>> -            vid.vid_id = id;
>> -            vid.vid_flags = flags;
>> -            if (proc_compose_imsg(ps, PROC_VMM, -1, imsg->hdr.type,
>> -                imsg->hdr.peerid, -1, &vid, sizeof(vid)) == -1)
>> -                    return (-1);
>> +            /* Only relay TERMINATION requests, not WAIT requests */
>> +            if (imsg->hdr.type == IMSG_VMDOP_TERMINATE_VM_REQUEST) {
>> +                    memset(&vid, 0, sizeof(vid));
>> +                    vid.vid_id = id;
>> +                    vid.vid_flags = flags;
>> +
>> +                    if (proc_compose_imsg(ps, PROC_VMM, -1, imsg->hdr.type,
>> +                            imsg->hdr.peerid, -1, &vid, sizeof(vid)) == -1)
>> +                            return (-1);
>> +            }
>>              break;
>>      case IMSG_VMDOP_GET_INFO_VM_REQUEST:
>>              proc_forward_imsg(ps, imsg, PROC_VMM, -1);
>> @@ -420,12 +419,14 @@ vmd_dispatch_vmm(int fd, struct privsep_
>>      case IMSG_VMDOP_TERMINATE_VM_RESPONSE:
>>              IMSG_SIZE_CHECK(imsg, &vmr);
>>              memcpy(&vmr, imsg->data, sizeof(vmr));
>> -            DPRINTF("%s: forwarding TERMINATE VM for vm id %d",
>> -                __func__, vmr.vmr_id);
>> -            proc_forward_imsg(ps, imsg, PROC_CONTROL, -1);
>> -            if ((vm = vm_getbyvmid(vmr.vmr_id)) == NULL)
>> -                    break;
>> -            if (vmr.vmr_result == 0) {
>> +
>> +            if (vmr.vmr_result) {
>> +                    DPRINTF("%s: forwarding TERMINATE VM for vm id %d",
>> +                        __func__, vmr.vmr_id);
>> +                    proc_forward_imsg(ps, imsg, PROC_CONTROL, -1);
>> +            } else {
>> +                    if ((vm = vm_getbyvmid(vmr.vmr_id)) == NULL)
>> +                            break;
>>                      /* Mark VM as shutting down */
>>                      vm->vm_state |= VM_STATE_SHUTDOWN;
>>              }
>> @@ -478,16 +479,13 @@ vmd_dispatch_vmm(int fd, struct privsep_
>>                      config_setvm(ps, vm, (uint32_t)-1, vm->vm_uid);
>>              }
>> 
>> -            /* Send a response if a control client is waiting for it */
>> -            if (imsg->hdr.peerid != (uint32_t)-1) {
>> -                    /* the error is meaningless for deferred responses */
>> -                    vmr.vmr_result = 0;
>> +            /* The error is meaningless for deferred responses */
>> +            vmr.vmr_result = 0;
>> 
>> -                    if (proc_compose_imsg(ps, PROC_CONTROL, -1,
>> -                        IMSG_VMDOP_TERMINATE_VM_RESPONSE,
>> -                        imsg->hdr.peerid, -1, &vmr, sizeof(vmr)) == -1)
>> -                            return (-1);
>> -            }
>> +            if (proc_compose_imsg(ps, PROC_CONTROL, -1,
>> +                    IMSG_VMDOP_TERMINATE_VM_EVENT,
>> +                    imsg->hdr.peerid, -1, &vmr, sizeof(vmr)) == -1)
>> +                    return (-1);
>>              break;
>>      case IMSG_VMDOP_GET_INFO_VM_DATA:
>>              IMSG_SIZE_CHECK(imsg, &vir);
>> Index: vmm.c
>> ===================================================================
>> RCS file: /cvs/src/usr.sbin/vmd/vmm.c,v
>> retrieving revision 1.100
>> diff -u -p -r1.100 vmm.c
>> --- vmm.c    11 Apr 2021 21:02:40 -0000      1.100
>> +++ vmm.c    21 Apr 2021 17:17:04 -0000
>> @@ -150,30 +150,6 @@ vmm_dispatch_parent(int fd, struct privs
>>                      res = ENOENT;
>>              cmd = IMSG_VMDOP_START_VM_RESPONSE;
>>              break;
>> -    case IMSG_VMDOP_WAIT_VM_REQUEST:
>> -            IMSG_SIZE_CHECK(imsg, &vid);
>> -            memcpy(&vid, imsg->data, sizeof(vid));
>> -            id = vid.vid_id;
>> -
>> -            DPRINTF("%s: recv'ed WAIT_VM for %d", __func__, id);
>> -
>> -            cmd = IMSG_VMDOP_TERMINATE_VM_RESPONSE;
>> -            if (id == 0) {
>> -                    res = ENOENT;
>> -            } else if ((vm = vm_getbyvmid(id)) != NULL) {
>> -                    if (vm->vm_peerid != (uint32_t)-1) {
>> -                            peerid = vm->vm_peerid;
>> -                            res = EINTR;
>> -                    } else
>> -                            cmd = 0;
>> -                    vm->vm_peerid = imsg->hdr.peerid;
>> -            } else {
>> -                    /* vm doesn't exist, cannot stop vm */
>> -                    log_debug("%s: cannot stop vm that is not running",
>> -                        __func__);
>> -                    res = VMD_VM_STOP_INVALID;
>> -            }
>> -            break;
>>      case IMSG_VMDOP_TERMINATE_VM_REQUEST:
>>              IMSG_SIZE_CHECK(imsg, &vid);
>>              memcpy(&vid, imsg->data, sizeof(vid));
>> @@ -221,15 +197,6 @@ vmm_dispatch_parent(int fd, struct privs
>>                                          __func__);
>>                                      res = VMD_VM_STOP_INVALID;
>>                              }
>> -                    }
>> -                    if ((flags & VMOP_WAIT) &&
>> -                        res == 0 && (vm->vm_state & VM_STATE_SHUTDOWN)) {
>> -                            if (vm->vm_peerid != (uint32_t)-1) {
>> -                                    peerid = vm->vm_peerid;
>> -                                    res = EINTR;
>> -                            } else
>> -                                    cmd = 0;
>> -                            vm->vm_peerid = imsg->hdr.peerid;
>>                      }
>>              } else {
>>                      /* VM doesn't exist, cannot stop vm */

Re: vmd(8): fix vmctl wait state corruption

Reply via email to