On 24-Aug-18 4:51 PM, Sean Harte wrote:
On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly
<anatoly.bura...@intel.com> wrote:

On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote:
On 24-Aug-18 10:35 AM, Tiwei Bie wrote:
On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote:
On 24-Aug-18 5:49 AM, Tiwei Bie wrote:
On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote:
On 23-Aug-18 12:19 PM, Sean Harte wrote:
On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly
<anatoly.bura...@intel.com> wrote:

On 23-Aug-18 3:57 AM, Tiwei Bie wrote:
Deadlock can occur when allocating memory if a vhost-kernel
based virtio-user device is in use. Besides, it's possible
to have much more than 64 non-contiguous hugepage backed
memory regions due to the memory hotplug, which may cause
problems when handling VHOST_SET_MEM_TABLE request. A better
solution is to have the virtio-user pass all the VA ranges
reserved by DPDK to vhost-kernel.

Bugzilla ID: 81
Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug")
Cc: sta...@dpdk.org

Reported-by: Seán Harte <sea...@gmail.com>
Signed-off-by: Tiwei Bie <tiwei....@intel.com>
---
      drivers/net/virtio/virtio_user/vhost_kernel.c | 64
++++++++-----------
      1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c
b/drivers/net/virtio/virtio_user/vhost_kernel.c
index b2444096c..49bd1b821 100644
--- a/drivers/net/virtio/virtio_user/vhost_kernel.c
+++ b/drivers/net/virtio/virtio_user/vhost_kernel.c
@@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = {
          [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
      };

-struct walk_arg {
-     struct vhost_memory_kernel *vm;
-     uint32_t region_nr;
-};
-static int
-add_memory_region(const struct rte_memseg_list *msl __rte_unused,
-             const struct rte_memseg *ms, size_t len, void *arg)
-{
-     struct walk_arg *wa = arg;
-     struct vhost_memory_region *mr;
-     void *start_addr;
-
-     if (wa->region_nr >= max_regions)
-             return -1;
-
-     mr = &wa->vm->regions[wa->region_nr++];
-     start_addr = ms->addr;
-
-     mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr;
-     mr->userspace_addr = (uint64_t)(uintptr_t)start_addr;
-     mr->memory_size = len;
-     mr->mmap_offset = 0;
-
-     return 0;
-}
-
-/* By default, vhost kernel module allows 64 regions, but DPDK
allows
- * 256 segments. As a relief, below function merges those
virtually
- * adjacent memsegs into one region.
- */
      static struct vhost_memory_kernel *
      prepare_vhost_memory_kernel(void)
      {
+     struct rte_mem_config *mcfg =
rte_eal_get_configuration()->mem_config;
          struct vhost_memory_kernel *vm;
-     struct walk_arg wa;
+     uint32_t region_nr = 0, i;

          vm = malloc(sizeof(struct vhost_memory_kernel) +
                          max_regions *
@@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void)
          if (!vm)
                  return NULL;

-     wa.region_nr = 0;
-     wa.vm = vm;
+     for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+             struct rte_memseg_list *msl = &mcfg->memsegs[i];
+             struct vhost_memory_region *mr;
+             void *start_addr;
+             uint64_t len;

There is a rte_memseg_list_walk() - please do not iterate over
memseg
lists manually.


rte_memseg_list_walk() can't be used here because
prepare_vhost_memory_kernel() is sometimes called from a memory
callback. It will then hang trying to get a read lock on
memory_hotplug_lock.

OK, so use rte_memseg_list_walk_thread_unsafe().

I don't think the rte_memseg_list_walk_thread_unsafe() function is
appropriate because prepare_vhost_memory_kernel() may not always be
called from a memory callback.

And how is this different? What you're doing here is identical to
calling
rte_memseg_list_walk_thread_unsafe() (that's precisely what it does
internally - check the code!), except that you're doing it manually
and not
using DPDK API, which makes your code dependent on internals of DPDK's
memory implementation.

So, this function may or may not be called from a callback, but
you're using
thread-unsafe walk anyway. I think you should call either
thread-safe or
thread-unsafe version depending on whether you're being called from a
callback or not.

Hmm, the real case is a bit more tricky. Even if this
function isn't called from memory event callbacks, the
"thread-safe" version list_walk() still can't be used.
Because deadlock may happen.

In virtio-user device start, it needs to do SET_MEM_TABLE
for the vhost-backend. And to make sure that preparing and
setting the memory table is atomic (and also to protect the
device state), it needs a lock. So if it calls "thread-safe"
version list_walk(), there will be two locks taken in
below order:

- the virtio-user device lock (taken by virtio_user_start_device());
- the memory hotplug lock (taken by rte_memseg_list_walk());

And above locks will be released in below order:

- the memory hotplug lock (released by rte_memseg_list_walk());
- the virtio-user device lock (released by virtio_user_start_device());

And in virtio-user's memory event callback, it also needs
to take the virtio-user device lock to make sure preparing
and setting the memory table is atomic (and also to protect
the device state), so the same device lock is needed here.
And before virtio-user's memory event callback is called,
the memory hotplug lock has already been taken by memory
subsystem. So the locks are taken in below order:

- the memory hotplug lock (It has been taken by memory subsystem
      before virtio-user's memory event callback being called);
- the virtio-user device lock (taken by virtio_user_mem_event_cb());

So, if the virtio-user device start and memory callback
events happen at the same time, deadlock may happen.

And we can't always use rte_memseg_list_walk_thread_unsafe(),
because by its definition, it's only expected to be called
in memory callbacks:

/**
    * ......
    * @note This function does not perform any locking, and is only
safe to call
    *       from within memory-related callback functions.
    * ......
    */
int __rte_experimental
rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void
*arg);

So both of rte_memseg_list_walk_thread_unsafe() and
rte_memseg_list_walk() are not really suitable for this
case. If we really want to use these helpers, we need
to allow rte_memseg_*_walk_thread_unsafe() to be called
when the callers have taken mcfg->memory_hotplug_lock,
or add some extra memory APIs to allow callers to take
the lock for rte_memseg_*_walk_thread_unsafe(). And we
can take this lock in virtio_user_start_device() before
taking the virtio-user device lock (so locks can be taken
in the correct order). Thoughts?

You can know if this function is called from memory callback. You can
probably also know if you're in the process of starting the device. The
solution seems clear to me - check both? :)

Hmm.. You didn't get my point. :(

I mean a lock within virtio driver is needed by virtio-user
to avoid the race condition between the virtio-user device
start and the virtio-user memory event callback (e.g. about
the device state change). And to iterate the memseg lists,
a lock within memory subsystem will be taken. So in both of
the virtio device start and memory event handling, there are
two locks will be taken -- And we need to take these locks
in correct order to avoid deadlock, and it requires us to
have a way to take the lock for rte_memseg_*_thread_unsafe()
in callers.

I'm afraid i'm still not getting your point :(

You know that you can either get called from memory callback, or not
from memory callback. Both of these times, the virtio device is locked.
So, where does the race come from? You take out your device lock, and by
the time you need to iterate through memsegs, you know that you either
were or weren't called from memory callback, which means you can pick
either thread-safe or thread-unsafe version.

Can you please draw up a step-by-step example where race can happen that
cannot be solved the way i suggested above?

I.e.

thread 1    thread 2
do this
          do that

For the benefit of public discussion, the following is result of our
internal discussion on this topic: the deadlock may happen because we
take virtio lock and hotplug lock in different order.

So, thread 1 may do device start, which will take out virtio device
lock, then attempt to iterate memory regions thereby taking memory
hotplug lock. At the same time, thread 2 might trigger an allocation,
and virtio will receive a callback, which will lock hotplug, and then
attempt to lock virtio device.

In other words,

Thread 1                Thread 2
lock virtio device
                         lock hotplug
(waits for hotplug unlock)
                         (waits for virtio device unlock)

The solution to this is not trivial, and we haven't come up with
anything to fix this that doesn't involve pretty serious changes to the
memory subsystem. Ideas welcome :)


As part of device start, can the virtio-user driver take the hotplug
lock before the virtio lock (in Thread 1 in your example). It can
access the lock through rte_eal_get_configuration()->mem_config, as
the Mellanox driver currently does. If that would work, an API to lock
and unlock do it should probably be provided by the memory subsystem.
Although, it seems a bit error-prone and a future change could easily
break things.

Hi Sean,

I wasn't aware of the MLX driver accessing the lock directly, and i would strongly discourage everyone from doing so :)

The main problem with either accessing the lock directly or providing the API to lock/unlock this lock is, if something in current thtread has accidentally triggered an allocation while holding that lock (which may happen silently in the background), it may lead to a deadlock as allocation may try to take out a write lock.

One possible solution could be changing the rwlock to a recursive lock, but even then it presents several problems because 1) our recursive spinlocks are not rwlocks and thus they can't allow parallel read access, and 2) there is currently no way to upgrade/downgrade read lock to a write lock (not to mention the allocator doesn't know if it's supposed to release a lock, or downgrade it to a read lock), which would be needed for such a thing to work without stopping the world, so to speak.

Another possible solution to this could be to stop page allocations from happening in the first place, while an external lock is held. This might just solve our problem (and possibly MLX's as well), but i haven't yet thought through the implications of such a change.

Thanks,
Anatoly


--
Thanks,
Anatoly



--
Thanks,
Anatoly

Reply via email to