Hey,

On 21-03-14 14:04, Thomas Hellstrom wrote:
> On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 21-03-14 09:27, Thomas Hellstrom wrote:
>>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse at gmail.com>
>>>> wrote:
>>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>>> On 18-10-12 13:55, Thomas Hellstrom wrote:
>>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>>> On 18-10-12 13:02, Thomas Hellstrom wrote:
>>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> On 18-10-12 09:59, Thomas Hellstrom wrote:
>>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>>> To me it seems insane to be forced to read the fence
>>>>>>>>>>>>> pointer under a
>>>>>>>>>>>>> reserved lock, simply because when you take the reserve
>>>>>>>>>>>>> lock, another
>>>>>>>>>>>>> process may have it and there is a substantial chance that
>>>>>>>>>>>>> that process
>>>>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>>>>> I think it makes perfect sense; the only time you want to
>>>>>>>>>>> read the fence
>>>>>>>>>>> is when you want to change the members protected by the
>>>>>>>>>>> reservation.
>>>>>>>>>> No, that's not true. A typical case (or the only case)
>>>>>>>>>> is where you want to do a map with no_wait semantics. You will
>>>>>>>>>> want
>>>>>>>>>> to be able to access a buffer's results even if the eviction code
>>>>>>>>>> is about to schedule an unbind from the GPU, and have the buffer
>>>>>>>>>> reserved?
>>>>>>>>> Well either block on reserve or return -EBUSY if reserved,
>>>>>>>>> presumably the
>>>>>>>>> former..
>>>>>>>>>
>>>>>>>>> ttm_bo_vm_fault does the latter already, anyway
>>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem,
>>>>>>>> it's really
>>>>>>>> a waiting reserve but for different reasons. Typically a
>>>>>>>> user-space app will keep
>>>>>>>> asynchronous maps to TTM during a buffer lifetime, and the
>>>>>>>> buffer lifetime may
>>>>>>>> be long as user-space apps keep caches.
>>>>>>>> That's not the same as accessing a buffer after the GPU is done
>>>>>>>> with it.
>>>>>>> Ah indeed.
>>>>>>>>> You don't need to hold the reservation while performing the
>>>>>>>>> wait itself though,
>>>>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns
>>>>>>>>> -EBUSY, and if so
>>>>>>>>> take a ref to the sync_obj member and then wait after
>>>>>>>>> unreserving. You won't
>>>>>>>>> reset sync_obj member to NULL in that case, but that should be
>>>>>>>>> harmless.
>>>>>>>>> This will allow you to keep the reservations fast and short.
>>>>>>>>> Maybe a helper
>>>>>>>>> would be appropriate for this since radeon and nouveau both
>>>>>>>>> seem to do this.
>>>>>>>>>
>>>>>>>> The problem is that as long as other users are waiting for idle
>>>>>>>> with reservation
>>>>>>>> held, for example to switch GPU engine or to delete a GPU bind,
>>>>>>>> waiting
>>>>>>>> for reserve will in many cases mean waiting for the GPU.
>>>>>>> This example sounds inefficient; I know nouveau can do this, but
>>>>>>> this essentially
>>>>>>> just stalls the gpu entirely. I guess you mean functions
>>>>>>> like nouveau_gem_object_close?
>>>>>>> It wouldn't surprise me if performance in nouveau could be
>>>>>>> improved simply by
>>>>>>> fixing those cases up instead, since it also stalls the application
>>>>>>> completely for other uses.
>>>>>>>
>>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>>
>>>>>> While there are a number of cases that can be fixed up, also in
>>>>>> Radeon, there's no way around that reservation
>>>>>> is a heavyweight lock that, particularly on simpler hardware without
>>>>>> support for fence ordering
>>>>>> with barriers and / or "semaphores" and accelerated eviction will be
>>>>>> held while waiting for idle.
>>>>>>
>>>>>> As such, it is unsuitable to protect read access to the fence
>>>>>> pointer. If the intention is to keep a single fence pointer
>>>>>> I think the best solution is to allow reading the fence pointer
>>>>>> outside reservation, but make sure this can be done
>>>>>> atomically. If the intention is to protect an array or list of fence
>>>>>> pointers, I think a spinlock is the better solution.
>>>>>>
>>>>>> /Thomas
>>>>> Just wanted to point out that, like Thomas, I am concerned about having
>>>>> to have the object reserved when waiting on its associated fence. I fear
>>>>> it will hurt us somehow.
>>>>>
>>>>> I will try to spend a couple of days stress testing your branch on radeon
>>>>> to see if it hurts performance in any way with current use cases.
>>>>>
>>>> I've been trying to figure out what to do with Maarten's patches going
>>>> forward, since they are essential for all kinds of SoC people.
>>>>
>>>> However, I'm still not 100% sure the problem is anything more than a
>>>> micro-optimisation, and possibly a premature one at that. This needs
>>>> someone to start suggesting benchmarks we can run and a driver set to
>>>> run them on; otherwise I'm starting to tend towards this being an
>>>> optimisation we can fix later.
>>>>
>>>> The other option is to somehow merge this stuff behind a command line
>>>> option that allows us to test it, but I don't think that is sane either.
>>>>
>>>> So Maarten, Jerome, Thomas, please start discussing actual real-world
>>>> tests you think merging this code will affect, and then I can make a
>>>> better assessment.
>>>>
>>>> Dave.
>>> Dave,
>>>
>>> This is IMHO a fundamental design discussion, not a micro-optimization.
>>>
>>> I'm pretty sure all sorts of horrendous things could be done to the DRM
>>> design without affecting real world application performance.
>>>
>>> In TTM data protection is primarily done with spinlocks: This serves two
>>> purposes.
>>>
>>> a) You can't unnecessarily hold a data protection lock while waiting for
>>> the GPU, which is typically a very stupid thing to do (perhaps not so in
>>> this particular case), but when the sleep-while-atomic or lock-inversion
>>> kernel warning turns up, that should at least
>>> ring a bell. Trying to implement dma-buf fencing using the TTM fencing
>>> callbacks will AFAICT cause a locking inversion.
>>>
>>> b) Spinlocks essentially go away on UP systems. The whole reservation
>>> path was essentially lock-free on UP systems pre dma-buf integration,
>>> and with very few atomic locking operations even on SMP systems. It was
>>> all prompted by a test some years ago (by Eric Anholt IIRC) showing that
>>> locking operations turned up quite high on Intel driver profiling.
>>>
>>> If we start protecting data with reservations, while we also export the
>>> reservation locks so that people find them a convenient "serialize work
>>> on this buffer" lock, all kinds of interesting things might happen, and
>>> we might end up in a situation
>>> similar to protecting everything with struct_mutex.
>>>
>>> This is why I dislike this change. In particular when there is a very
>>> simple remedy.
>>>
>>> But if I can get an ACK to convert the reservation object fence pointers
>>> left, I'd be prepared to accept this patch series in it's current state.
>> RCU could be a useful way to deal with this. But in my case I've shown
>> there are very few places where it's needed; core ttm does not need it
>> at all.
>> This is why I wanted to leave it to individual drivers to implement it.
>>
>> I think using kfree_rcu for freeing in the driver itself should be enough,
>> and obtaining a reference in the driver would require something like this,
>> similar to what's in rcuref.txt:
>>
>> rcu_read_lock();
>> f = rcu_dereference(bo->fence);
>> if (f && !kref_get_unless_zero(&f->kref)) f = NULL;
>> rcu_read_unlock();
>>
>> if (f) {
>> // do stuff here
>> ...
>>
>> // free fence
>> kref_put(&f->kref, fence_put_with_kfree_rcu);
>> }
>>
>> Am I wrong, or is something like this enough to make sure the fence is
>> still alive?
> No, you're correct.
>
> And a bo check for idle would amount to this (since we don't care if the
> fence refcount is zero):
>
> rcu_read_lock()
> f = rcu_dereference(bo->fence);
> signaled = !f || f->signaled;
> rcu_read_unlock().
>
> /Thomas
>
This is very easy to implement when there is only one fence slot, bo->fence
being equal to reservation_object->fence_excl.
It appears to break down slightly when I implement it on top of shared
fences.

My diff is against git://git.linaro.org/people/sumit.semwal/linux-3.x.git
for-next-fences.
shared_max_fence is kept in reservation_object to avoid reallocating in
reservation_object_reserve_shared_fence every time the same
reservation_object gets > 4 shared fences; if that happens once, it's likely
to happen again on the same object.

Anyhow, does the below look sane to you? It has only been tested by my
compiler; I haven't checked whether it boots.
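
For context, this is roughly how I'd expect a driver to use the new helpers
on the writer side. Untested sketch; driver_add_read_fence is just a made-up
name, and the ww_acquire_ctx handling is left out for brevity:

static int driver_add_read_fence(struct reservation_object *resv,
                                 struct fence *fence)
{
        int ret;

        ww_mutex_lock(&resv->lock, NULL);

        /* make sure a slot exists before publishing the fence */
        ret = reservation_object_reserve_shared_fence(resv);
        if (!ret)
                reservation_object_add_shared_fence(resv, fence);

        ww_mutex_unlock(&resv->lock);
        return ret;
}

Because shared_max_fence only doubles when the current array is full, the
kmalloc + kfree_rcu in reservation_object_reserve_shared_fence should stay
rare on objects that repeatedly collect many shared fences.
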
---
commit c73b87d33fd08f7c1b0a381b08b3626128f8f3b8
Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
Date:   Tue Mar 25 15:10:21 2014 +0100

     add rcu to fence HACK WIP DO NOT USE KILLS YOUR POT PLANTS PETS AND FAMILY

diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 96338bf7f457..0a88c10b3db9 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,7 +134,10 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
  {
        struct dma_buf *dmabuf;
        struct reservation_object *resv;
+       struct reservation_object_shared *shared;
+       struct fence *fence_excl;
        unsigned long events;
+       unsigned shared_count;

        dmabuf = file->private_data;
        if (!dmabuf || !dmabuf->resv)
@@ -148,14 +151,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
        if (!events)
                return 0;

-       ww_mutex_lock(&resv->lock, NULL);
+       rcu_read_lock();

-       if (resv->fence_excl && (!(events & POLLOUT) ||
-                                resv->fence_shared_count == 0)) {
+       shared = rcu_dereference(resv->shared);
+       fence_excl = rcu_dereference(resv->fence_excl);
+       shared_count = shared ? ACCESS_ONCE(shared->count) : 0;
+
+       if (fence_excl && (!(events & POLLOUT) ||
+                                (!shared || shared_count == 0))) {
                struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_excl;
                unsigned long pevents = POLLIN;

-               if (resv->fence_shared_count == 0)
+               if (!shared || shared_count == 0)
                        pevents |= POLLOUT;

                spin_lock_irq(&dmabuf->poll.lock);
@@ -167,19 +174,26 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
                spin_unlock_irq(&dmabuf->poll.lock);

                if (events & pevents) {
-                       if (!fence_add_callback(resv->fence_excl,
-                                               &dcb->cb, dma_buf_poll_cb))
+                       if (!kref_get_unless_zero(&fence_excl->refcount)) {
+                               /* force a recheck */
+                               events &= ~pevents;
+                               dma_buf_poll_cb(NULL, &dcb->cb);
+                       } else if (!fence_add_callback(fence_excl, &dcb->cb,
+                                                      dma_buf_poll_cb)) {
                                events &= ~pevents;
-                       else
+                               fence_put(fence_excl);
+                       } else {
                                /*
                                 * No callback queued, wake up any additional
                                 * waiters.
                                 */
+                               fence_put(fence_excl);
                                dma_buf_poll_cb(NULL, &dcb->cb);
+                       }
                }
        }

-       if ((events & POLLOUT) && resv->fence_shared_count > 0) {
+       if ((events & POLLOUT) && shared && shared_count > 0) {
                struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_shared;
                int i;

@@ -194,20 +208,34 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
                if (!(events & POLLOUT))
                        goto out;

-               for (i = 0; i < resv->fence_shared_count; ++i)
-                       if (!fence_add_callback(resv->fence_shared[i],
-                                               &dcb->cb, dma_buf_poll_cb)) {
+               for (i = 0; i < shared_count; ++i) {
+                       struct fence *fence = ACCESS_ONCE(shared->fence[i]);
+                       if (!kref_get_unless_zero(&fence->refcount)) {
+                               /*
+                                * fence refcount dropped to zero, this means
+                                * that the shared object has been freed from under us.
+                                * call dma_buf_poll_cb and force a recheck!
+                                */
                                events &= ~POLLOUT;
+                               dma_buf_poll_cb(NULL, &dcb->cb);
                                break;
                        }
+                       if (!fence_add_callback(fence, &dcb->cb,
+                                               dma_buf_poll_cb)) {
+                               fence_put(fence);
+                               events &= ~POLLOUT;
+                               break;
+                       }
+                       fence_put(fence);
+               }

                /* No callback queued, wake up any additional waiters. */
-               if (i == resv->fence_shared_count)
+               if (i == shared_count)
                        dma_buf_poll_cb(NULL, &dcb->cb);
        }

  out:
-       ww_mutex_unlock(&resv->lock);
+       rcu_read_unlock();
        return events;
  }

diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 8fff13fb86cf..be03a9e8ad8b 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -170,7 +170,7 @@ void release_fence(struct kref *kref)
        if (fence->ops->release)
                fence->ops->release(fence);
        else
-               kfree(fence);
+               kfree_rcu(fence, rcu);
  }
  EXPORT_SYMBOL(release_fence);

diff --git a/include/linux/fence.h b/include/linux/fence.h
index 65f2a01ee7e4..d19620405c34 100644
--- a/include/linux/fence.h
+++ b/include/linux/fence.h
@@ -40,6 +40,7 @@ struct fence_cb;
   * struct fence - software synchronization primitive
   * @refcount: refcount for this fence
   * @ops: fence_ops associated with this fence
+ * @rcu: used for releasing fence with kfree_rcu
   * @cb_list: list of all callbacks to call
   * @lock: spin_lock_irqsave used for locking
   * @context: execution context this fence belongs to, returned by
@@ -73,6 +74,7 @@ struct fence_cb;
  struct fence {
        struct kref refcount;
        const struct fence_ops *ops;
+       struct rcu_head rcu;
        struct list_head cb_list;
        spinlock_t *lock;
        unsigned context, seqno;
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f3f57460a205..91576fabafdb 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -42,6 +42,7 @@
  #include <linux/ww_mutex.h>
  #include <linux/fence.h>
  #include <linux/slab.h>
+#include <linux/rcupdate.h>

  extern struct ww_class reservation_ww_class;

@@ -49,8 +50,12 @@ struct reservation_object {
        struct ww_mutex lock;

        struct fence *fence_excl;
-       struct fence **fence_shared;
-       u32 fence_shared_count, fence_shared_max;
+       u32 shared_max_fence;
+       struct reservation_object_shared {
+               struct rcu_head rcu;
+               u32 count;
+               struct fence *fence[];
+       } *shared;
  };

  static inline void
@@ -58,8 +63,8 @@ reservation_object_init(struct reservation_object *obj)
  {
        ww_mutex_init(&obj->lock, &reservation_ww_class);

-       obj->fence_shared_count = obj->fence_shared_max = 0;
-       obj->fence_shared = NULL;
+       obj->shared = NULL;
+       obj->shared_max_fence = 4;
        obj->fence_excl = NULL;
  }

@@ -70,11 +75,97 @@ reservation_object_fini(struct reservation_object *obj)

        if (obj->fence_excl)
                fence_put(obj->fence_excl);
-       for (i = 0; i < obj->fence_shared_count; ++i)
-               fence_put(obj->fence_shared[i]);
-       kfree(obj->fence_shared);
+       if (obj->shared) {
+               for (i = 0; i < obj->shared->count; ++i)
+                       fence_put(obj->shared->fence[i]);
+
+               /*
+                * This object should be dead and all references must have
+                * been released to it, so no need to free with rcu.
+                */
+               kfree(obj->shared);
+       }

        ww_mutex_destroy(&obj->lock);
  }

+/*
+ * Reserve space to add a shared fence to a reservation_object,
+ * must be called with obj->lock held.
+ */
+static inline int
+reservation_object_reserve_shared_fence(struct reservation_object *obj)
+{
+       struct reservation_object_shared *shared, *old;
+       u32 max = obj->shared_max_fence;
+
+       if (obj->shared) {
+               if (obj->shared->count < max)
+                       return 0;
+               max *= 2;
+       }
+
+       shared = kmalloc(offsetof(typeof(*shared), fence[max]), GFP_KERNEL);
+       if (!shared)
+               return -ENOMEM;
+       old = obj->shared;
+
+       if (old) {
+               shared->count = old->count;
+               memcpy(shared->fence, old->fence, old->count * sizeof(*old->fence));
+       } else {
+               shared->count = 0;
+       }
+       rcu_assign_pointer(obj->shared, shared);
+       obj->shared_max_fence = max;
+       if (old) kfree_rcu(old, rcu);
+       return 0;
+}
+
+/*
+ * Add a fence to a shared slot, obj->lock must be held, and
+ * reservation_object_reserve_shared_fence has been called.
+ */
+
+static inline void
+reservation_object_add_shared_fence(struct reservation_object *obj,
+                                   struct fence *fence)
+{
+       unsigned i;
+
+       BUG_ON(obj->shared->count == obj->shared_max_fence);
+       fence_get(fence);
+
+       for (i = 0; i < obj->shared->count; ++i)
+               if (obj->shared->fence[i]->context == fence->context) {
+                       struct fence *old = obj->shared->fence[i];
+                       rcu_assign_pointer(obj->shared->fence[i], fence);
+                       fence_put(old);
+                       return;
+               }
+
+       obj->shared->fence[obj->shared->count] = fence;
+       smp_wmb();
+       obj->shared->count++;
+}
+
+/*
+ * May be called after adding an exclusive to wipe all shared fences.
+ */
+
+static inline void
+reservation_object_clear_shared(struct reservation_object *obj)
+{
+       struct reservation_object_shared *old = obj->shared;
+       unsigned i;
+
+       if (!old)
+               return;
+
+       rcu_assign_pointer(obj->shared, NULL);
+       for (i = 0; i < old->count; ++i)
+               fence_put(old->fence[i]);
+       kfree_rcu(old, rcu);
+}
+
  #endif /* _LINUX_RESERVATION_H */
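
And for completeness, the lock-free read side I have in mind for drivers
would then look something like the untested sketch below. driver_bo_is_idle
is a made-up name, fence_is_signaled() stands for whatever signaled test the
driver actually uses, and a fence whose refcount already hit zero is treated
as no longer blocking us, which is an assumption on my part:

static bool driver_bo_is_idle(struct reservation_object *resv)
{
        struct reservation_object_shared *shared;
        struct fence *excl;
        bool idle = true;
        unsigned i, count;

        rcu_read_lock();

        excl = rcu_dereference(resv->fence_excl);
        if (excl && kref_get_unless_zero(&excl->refcount)) {
                idle = fence_is_signaled(excl);
                fence_put(excl);
        }

        shared = rcu_dereference(resv->shared);
        count = shared ? ACCESS_ONCE(shared->count) : 0;

        for (i = 0; idle && i < count; ++i) {
                struct fence *f = ACCESS_ONCE(shared->fence[i]);

                /* refcount already zero: the fence is being torn down,
                 * assume it no longer blocks us */
                if (f && kref_get_unless_zero(&f->refcount)) {
                        idle = fence_is_signaled(f);
                        fence_put(f);
                }
        }

        rcu_read_unlock();
        return idle;
}

Taking the reference with kref_get_unless_zero is the conservative choice
here; for a plain signaled-flag test the single-fence example earlier in the
thread suggests the refcount bump could even be skipped, since kfree_rcu
keeps the memory valid for the duration of the read side.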
