On 26/10/16 02:53, Daniel Vetter wrote:
On Wed, Oct 19, 2016 at 03:34:12PM -0700, Ian Romanick wrote:
On 10/19/2016 12:58 PM, Matt Turner wrote:
On Wed, Oct 19, 2016 at 12:06 PM, Jan Ziak <0xe2.0x9a.0...@gmail.com> wrote:
This patch removes locks around reference counts in favor of atomic inc/dec
operations on the refcounts. This makes the refcount adjustments much faster.
I had the same idea last year and sent a patch series ([PATCH 00/13]
mesa: Locking improvements and optimizations [1]).

The part that corresponds to your patch are patches 5-9. 1-4, 10, and
11 were committed.

I'm having a difficult time remembering why I ultimately dropped the
series, but Ian's comment on 05/13 indicates that the locking is
broken for bufferobj.c, so it's not appropriate to use atomics there.
I remember that texture locking is totally broken, but I cannot
remember the details.
So... I was trying to trick you into fixing the locking problems.  I see
that I have failed. :)
Hm, locking. Figured I'd jump in with some patterns from the kernel world
;-)

I looked at this quite a bit more closely today.  Here's one timeline
that worries me:

     Thread 1                     Thread 2
     Call glDeleteBuffers
                                  Call glNamedBufferData
                                  Call _mesa_lookup_bufferobj_err
     Remove buffer ID from hash
     Decrement refcount
     Free BO backing store
     Delete buffer object
                                  Reference freed memory

The entire body of glDeleteBuffers holds a different lock (the hashtable
lock), so all of that happens atomically.  _mesa_lookup_bufferobj_err
does not update the refcount, so bad things can happen.

I think the right way to fix this is to have every _mesa_lookup_*
function increment the refcount of the object before returning it.  At
that point, every modification to the refcount should happen while the
hashtable mutex is held.  I think that means that neither atomics nor an
additional mutex are needed for reference counting.
I've been doing some work on this, but I think there is still a need for atomics when we remove a reference, right? We can only drop both the atomic and the mutex when incrementing the reference.

By re-using the hash table lock we always know that we won't delete the buffer out from under a thread using it, but we still have a race condition between two threads trying to delete it and could end up with a double free?
Does that seem right?
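To make the idea concrete, here is a minimal sketch of that arrangement, assuming C11 atomics and a toy hash table standing in for Mesa's real one; the helper names are invented for illustration:

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct gl_buffer_object {
   atomic_int RefCount;
   /* ... backing store, flags, etc. ... */
};

/* Toy stand-ins for the ID hash table and its lock. */
static pthread_mutex_t hash_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct gl_buffer_object *hash_table[256];

static struct gl_buffer_object *
lookup_bufferobj_ref(unsigned id)
{
   pthread_mutex_lock(&hash_mutex);
   struct gl_buffer_object *bo = hash_table[id % 256];
   if (bo)
      atomic_fetch_add(&bo->RefCount, 1);  /* take the ref before dropping the lock */
   pthread_mutex_unlock(&hash_mutex);
   return bo;
}

static void
unref_bufferobj(struct gl_buffer_object *bo)
{
   /* The decrement still has to be atomic: two threads may drop their
    * references concurrently, and only the one that hits zero frees. */
   if (atomic_fetch_sub(&bo->RefCount, 1) == 1)
      free(bo);
}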


Yup, that's the pattern we use in the kernel. One additional complication
is if you have lookup tables which only have a pointer, but do not hold a
full reference on the object (i.e. weak references for caches). In that
case you must check (under the lookup table lock) whether the refcount
really is greater than 0, and only increment in that case. Atomically of
course. That's implemented by the kref_get_unless_zero helper in the
kernel.
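
In user space, that check-then-increment could look roughly like the following (a sketch with C11 atomics, not the kernel's kref code):

#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the refcount has not already hit zero.
 * Meant to be called under the lookup table's lock for a weak (cache)
 * entry; mirrors the idea behind kref_get_unless_zero. */
static bool
ref_get_unless_zero(atomic_int *refcount)
{
   int old = atomic_load(refcount);
   while (old != 0) {
      /* CAS loop: succeeds only if nobody changed the count (in
       * particular, dropped it to zero) in the meantime. */
      if (atomic_compare_exchange_weak(refcount, &old, old + 1))
         return true;
   }
   return false;   /* object is already on its way out; treat as a miss */
}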

Upshot of this: You can fully untangle locking for lookup tables and
caches from refcounting, and you can add a given object to as many places
with weak references as you want. The other usual pattern is to only
destroy an object while holding the lookup lock, but that's unnecessarily
serializing, and as soon as you have 2+ lookup tables it gets seriously
complicated (since you need to hold all the locks).

The other timeline that concerns me is:

     Thread 1                     Thread 2
     Call glNamedBufferData
     Call _mesa_lookup_bufferobj_err
     Call ctx->Driver.BufferData
         (buffer_data_fallback as an example)
     Free old buffer
                                  Call glNamedBufferData
                                  Call _mesa_lookup_bufferobj_err
                                  Call ctx->Driver.BufferData
                                  Free old buffer
                                      (possibly double-free)
                                  Allocate new buffer
     Allocate new buffer
         (possibly leak the other new buffer)

To fix this, I think we do want a per-object mutex that is only used in
ctx->Driver.BufferData.  If there is only one context in the share group
(a very common case!), we could avoid even that locking.
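Something along those lines might look like the sketch below; the single-context test and the callback are placeholders, not Mesa's actual interfaces:

#include <pthread.h>
#include <stdbool.h>

struct gl_buffer_object {
   pthread_mutex_t Mutex;   /* hypothetical per-object lock */
   /* ... storage, size, usage, ... */
};

/* Serialize only the BufferData path, and skip even that when the
 * share group is known to contain a single context. */
static void
buffer_data_serialized(struct gl_buffer_object *bo, bool single_context,
                       void (*do_buffer_data)(struct gl_buffer_object *bo))
{
   if (!single_context)
      pthread_mutex_lock(&bo->Mutex);

   /* Frees the old backing store and allocates the new one, e.g. what
    * buffer_data_fallback does for ctx->Driver.BufferData. */
   do_buffer_data(bo);

   if (!single_context)
      pthread_mutex_unlock(&bo->Mutex);
}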
One trick we're using in the kernel for this stuff is RCU, i.e. never
modify the data structure in place: if you update it, create a new one, put
it into the lookup struct and then unreference the old one. Of course this
is still racy for concurrent updates (but in most APIs winner-takes-all is
perfectly fine behaviour), but it does prevent the double free or anything
else nasty from happening, pretty cheaply.
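
A rough illustration of that replace-and-swap idea, using a plain atomic pointer exchange rather than real RCU (the struct and field names are made up for the example):

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct buffer_storage {
   atomic_int refcount;
   size_t size;
   char data[];           /* flexible array member for the contents */
};

struct gl_buffer_object {
   _Atomic(struct buffer_storage *) Storage;
};

static void
storage_unref(struct buffer_storage *s)
{
   if (s && atomic_fetch_sub(&s->refcount, 1) == 1)
      free(s);
}

static void
buffer_data_replace(struct gl_buffer_object *bo, const void *src, size_t size)
{
   struct buffer_storage *new_s = malloc(sizeof(*new_s) + size);
   atomic_init(&new_s->refcount, 1);
   new_s->size = size;
   if (src)
      memcpy(new_s->data, src, size);

   /* Publish the new storage; whichever thread loses a concurrent race
    * simply has its storage replaced and dropped ("winner takes all"),
    * never double-freed. */
   struct buffer_storage *old_s = atomic_exchange(&bo->Storage, new_s);
   storage_unref(old_s);
}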

Textures and renderbuffers have similar issues.  Textures have
additional problems with derived state.  Calling glTexParameter can
cause a lot of state inside the texture to get updated.  Multiple calls
to glTexParameter from multiple threads can result in inconsistent
state.  It also means that rendering from one thread while another
thread is updating state can lead to problems.

Other drivers solve this problem by having N+1 copies of each object's
state.  Each context gets its own copy, and the share group gets a copy.
Writes to state go to both the local copy and the shared copy.  Reads
come only from the local copy.  glBindTexture updates the local copy
from the shared copy.  The local copies are basically a write-through
cache.  I'm not sure how this works with bindless.  DSA also causes
problems because every DSA command implicitly has a bind.
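A rough sketch of that write-through-cache arrangement (the structs and fields below are invented for illustration, not any driver's real layout):

#include <pthread.h>

/* One local copy per context plus one shared copy for the share group. */
struct tex_state {
   int min_filter;
   int mag_filter;
   /* ... the rest of the glTexParameter-style state ... */
};

struct shared_texture {
   pthread_mutex_t mutex;
   struct tex_state state;        /* share-group copy */
};

struct local_texture {
   struct shared_texture *shared;
   struct tex_state state;        /* per-context write-through cache */
};

static void
tex_parameter_min_filter(struct local_texture *tex, int min_filter)
{
   tex->state.min_filter = min_filter;            /* reads only ever hit this copy */
   pthread_mutex_lock(&tex->shared->mutex);
   tex->shared->state.min_filter = min_filter;    /* write-through to the shared copy */
   pthread_mutex_unlock(&tex->shared->mutex);
}

static void
bind_texture(struct local_texture *tex)
{
   /* A bind refreshes the local cache from the shared copy. */
   pthread_mutex_lock(&tex->shared->mutex);
   tex->state = tex->shared->state;
   pthread_mutex_unlock(&tex->shared->mutex);
}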
This sounds similar to RCU, but coded more explicitly. I definitely don't
know enough about GL to say what might work (or work better), or how
consistent you need to be about concurrent writers.

Cheers, Daniel

The other thing (in 12/13 [2]) is that vertex array objects,
framebuffer objects, and pipeline objects are "container objects" and
are not shared between contexts, so they require no locking. It would
be better to solve that and avoid locking/atomics altogether. But
then FBOs (and renderbuffers?) from EXT_framebuffer_object and VAOs
from APPLE_vertex_array_object are actually shared between contexts,
so those are out as well.
If my theories above are correct, we don't need atomics or mutexes for
any of it.

That seems to leave sampler objects, shader objects, and program
objects. I suspect those are relatively small compared with textures
or buffers.

I'm not really sure what I'd suggest, to be honest. It seems like a
lot of aggravation for a minimal reward if done right.

[1] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090972.html
[2] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090984.html