On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly <anatoly.bura...@intel.com> wrote:
>
> On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote:
> > On 24-Aug-18 10:35 AM, Tiwei Bie wrote:
> >> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote:
> >>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote:
> >>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote:
> >>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote:
> >>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly
> >>>>>> <anatoly.bura...@intel.com> wrote:
> >>>>>>>
> >>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote:
> >>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel
> >>>>>>>> based virtio-user device is in use. Besides, it's possible
> >>>>>>>> to have much more than 64 non-contiguous hugepage backed
> >>>>>>>> memory regions due to the memory hotplug, which may cause
> >>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better
> >>>>>>>> solution is to have the virtio-user pass all the VA ranges
> >>>>>>>> reserved by DPDK to vhost-kernel.
> >>>>>>>>
> >>>>>>>> Bugzilla ID: 81
> >>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug")
> >>>>>>>> Cc: sta...@dpdk.org
> >>>>>>>>
> >>>>>>>> Reported-by: Seán Harte <sea...@gmail.com>
> >>>>>>>> Signed-off-by: Tiwei Bie <tiwei....@intel.com>
> >>>>>>>> ---
> >>>>>>>>  drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++-----------
> >>>>>>>>  1 file changed, 27 insertions(+), 37 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c
> >>>>>>>> index b2444096c..49bd1b821 100644
> >>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c
> >>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c
> >>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = {
> >>>>>>>>  	[VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>> -struct walk_arg {
> >>>>>>>> -	struct vhost_memory_kernel *vm;
> >>>>>>>> -	uint32_t region_nr;
> >>>>>>>> -};
> >>>>>>>> -static int
> >>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused,
> >>>>>>>> -	const struct rte_memseg *ms, size_t len, void *arg)
> >>>>>>>> -{
> >>>>>>>> -	struct walk_arg *wa = arg;
> >>>>>>>> -	struct vhost_memory_region *mr;
> >>>>>>>> -	void *start_addr;
> >>>>>>>> -
> >>>>>>>> -	if (wa->region_nr >= max_regions)
> >>>>>>>> -		return -1;
> >>>>>>>> -
> >>>>>>>> -	mr = &wa->vm->regions[wa->region_nr++];
> >>>>>>>> -	start_addr = ms->addr;
> >>>>>>>> -
> >>>>>>>> -	mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr;
> >>>>>>>> -	mr->userspace_addr = (uint64_t)(uintptr_t)start_addr;
> >>>>>>>> -	mr->memory_size = len;
> >>>>>>>> -	mr->mmap_offset = 0;
> >>>>>>>> -
> >>>>>>>> -	return 0;
> >>>>>>>> -}
> >>>>>>>> -
> >>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows
> >>>>>>>> - * 256 segments.
As a relief, below function merges those virtually
> >>>>>>>> - * adjacent memsegs into one region.
> >>>>>>>> - */
> >>>>>>>> static struct vhost_memory_kernel *
> >>>>>>>> prepare_vhost_memory_kernel(void)
> >>>>>>>> {
> >>>>>>>> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> >>>>>>>> 	struct vhost_memory_kernel *vm;
> >>>>>>>> -	struct walk_arg wa;
> >>>>>>>> +	uint32_t region_nr = 0, i;
> >>>>>>>>
> >>>>>>>> 	vm = malloc(sizeof(struct vhost_memory_kernel) +
> >>>>>>>> 		    max_regions *
> >>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void)
> >>>>>>>> 	if (!vm)
> >>>>>>>> 		return NULL;
> >>>>>>>>
> >>>>>>>> -	wa.region_nr = 0;
> >>>>>>>> -	wa.vm = vm;
> >>>>>>>> +	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
> >>>>>>>> +		struct rte_memseg_list *msl = &mcfg->memsegs[i];
> >>>>>>>> +		struct vhost_memory_region *mr;
> >>>>>>>> +		void *start_addr;
> >>>>>>>> +		uint64_t len;
> >>>>>>>
> >>>>>>> There is a rte_memseg_list_walk() - please do not iterate over
> >>>>>>> memseg lists manually.
> >>>>>>>
> >>>>>>
> >>>>>> rte_memseg_list_walk() can't be used here because
> >>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory
> >>>>>> callback. It will then hang trying to get a read lock on
> >>>>>> memory_hotplug_lock.
> >>>>>
> >>>>> OK, so use rte_memseg_list_walk_thread_unsafe().
> >>>>>
> >>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is
> >>>>>> appropriate because prepare_vhost_memory_kernel() may not always be
> >>>>>> called from a memory callback.
> >>>>>
> >>>>> And how is this different? What you're doing here is identical to
> >>>>> calling rte_memseg_list_walk_thread_unsafe() (that's precisely what
> >>>>> it does internally - check the code!), except that you're doing it
> >>>>> manually and not using DPDK API, which makes your code dependent on
> >>>>> internals of DPDK's memory implementation.
> >>>>>
> >>>>> So, this function may or may not be called from a callback, but
> >>>>> you're using the thread-unsafe walk anyway. I think you should call
> >>>>> either the thread-safe or the thread-unsafe version depending on
> >>>>> whether you're being called from a callback or not.
> >>>>
> >>>> Hmm, the real case is a bit more tricky. Even if this function isn't
> >>>> called from memory event callbacks, the "thread-safe" version
> >>>> list_walk() still can't be used, because deadlock may happen.
> >>>>
> >>>> In virtio-user device start, it needs to do SET_MEM_TABLE for the
> >>>> vhost-backend. And to make sure that preparing and setting the
> >>>> memory table is atomic (and also to protect the device state), it
> >>>> needs a lock. So if it calls the "thread-safe" version list_walk(),
> >>>> two locks will be taken in the below order:
> >>>>
> >>>> - the virtio-user device lock (taken by virtio_user_start_device());
> >>>> - the memory hotplug lock (taken by rte_memseg_list_walk());
> >>>>
> >>>> And the above locks will be released in the below order:
> >>>>
> >>>> - the memory hotplug lock (released by rte_memseg_list_walk());
> >>>> - the virtio-user device lock (released by virtio_user_start_device());
> >>>>
> >>>> And in virtio-user's memory event callback, it also needs to take
> >>>> the virtio-user device lock to make sure preparing and setting the
> >>>> memory table is atomic (and also to protect the device state), so
> >>>> the same device lock is needed here. And before virtio-user's memory
> >>>> event callback is called, the memory hotplug lock has already been
> >>>> taken by the memory subsystem.
So the locks are taken in the below order:
> >>>>
> >>>> - the memory hotplug lock (it has already been taken by the memory
> >>>>   subsystem before virtio-user's memory event callback is called);
> >>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb());
> >>>>
> >>>> So, if the virtio-user device start and memory callback events
> >>>> happen at the same time, deadlock may happen.
> >>>>
> >>>> And we can't always use rte_memseg_list_walk_thread_unsafe(),
> >>>> because by its definition, it's only expected to be called in
> >>>> memory callbacks:
> >>>>
> >>>> /**
> >>>>  * ......
> >>>>  * @note This function does not perform any locking, and is only safe to call
> >>>>  * from within memory-related callback functions.
> >>>>  * ......
> >>>>  */
> >>>> int __rte_experimental
> >>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg);
> >>>>
> >>>> So neither rte_memseg_list_walk_thread_unsafe() nor
> >>>> rte_memseg_list_walk() is really suitable for this case. If we
> >>>> really want to use these helpers, we need to allow
> >>>> rte_memseg_*_walk_thread_unsafe() to be called when the callers have
> >>>> taken mcfg->memory_hotplug_lock, or add some extra memory APIs to
> >>>> allow callers to take the lock for
> >>>> rte_memseg_*_walk_thread_unsafe(). And we can take this lock in
> >>>> virtio_user_start_device() before taking the virtio-user device
> >>>> lock (so locks can be taken in the correct order). Thoughts?
> >>>
> >>> You can know if this function is called from a memory callback. You
> >>> can probably also know if you're in the process of starting the
> >>> device. The solution seems clear to me - check both? :)
> >>
> >> Hmm.. You didn't get my point. :(
> >>
> >> I mean a lock within the virtio driver is needed by virtio-user to
> >> avoid the race condition between the virtio-user device start and
> >> the virtio-user memory event callback (e.g. about the device state
> >> change).
And to iterate the memseg lists, a lock within the memory
> >> subsystem will be taken. So in both the virtio device start and the
> >> memory event handling, two locks will be taken -- and we need to
> >> take these locks in the correct order to avoid deadlock, which
> >> requires a way for callers to take the lock for
> >> rte_memseg_*_thread_unsafe().
> >
> > I'm afraid I'm still not getting your point :(
> >
> > You know that you can either get called from a memory callback, or
> > not from a memory callback. Both of these times, the virtio device is
> > locked. So, where does the race come from? You take out your device
> > lock, and by the time you need to iterate through memsegs, you know
> > whether or not you were called from a memory callback, which means
> > you can pick either the thread-safe or the thread-unsafe version.
> >
> > Can you please draw up a step-by-step example where a race can happen
> > that cannot be solved the way I suggested above?
> >
> > I.e.
> >
> > thread 1                      thread 2
> > do this
> >                               do that
>
> For the benefit of public discussion, the following is the result of
> our internal discussion on this topic: the deadlock may happen because
> we take the virtio lock and the hotplug lock in different orders.
>
> So, thread 1 may do device start, which will take out the virtio
> device lock, then attempt to iterate memory regions, thereby taking
> the memory hotplug lock. At the same time, thread 2 might trigger an
> allocation, and virtio will receive a callback, which will lock
> hotplug and then attempt to lock the virtio device.
>
> In other words:
>
> Thread 1                        Thread 2
> lock virtio device
>                                 lock hotplug
> (waits for hotplug unlock)      (waits for virtio device unlock)
>
> The solution to this is not trivial, and we haven't come up with
> anything to fix this that doesn't involve pretty serious changes to
> the memory subsystem. Ideas welcome :)
>
As part of device start, can the virtio-user driver take the hotplug
lock before the virtio lock (in Thread 1 in your example)? It can
access the lock through rte_eal_get_configuration()->mem_config, as
the Mellanox driver currently does. If that would work, an API to lock
and unlock it should probably be provided by the memory subsystem.
Although it seems a bit error-prone, and a future change could easily
break things.

> --
> Thanks,
> Anatoly