On 15.04.2016 17:12, Francisco Jerez wrote:
For a test doing almost the same thing but not relying on unspecified
invocation ordering, see
"tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It
would be interesting to see whether you can get it to reproduce the GCN
coherency bug using different framebuffer size and modulus parameters.
I tried that, but couldn't reproduce. Whether I just wasn't thorough
enough/"unlucky" or whether the in-order nature of the hardware and L1
cache behavior just makes it impossible to fail the shader-mem-barrier
test, I'm not sure.
Now I'm curious about the exact nature of the bug ;), some sort of
missing L1 cache-flushing which could potentially affect dependent
invocations?
I'm not sure I remember everything, to be honest.
One issue that I do remember is that load/store by default go through
L1, but atomics _never_ go through L1, no matter how you compile them.
This means that if you're working on two different images, one with
atomics and the other without, then the atomic one will always behave
coherently but the other one won't unless you explicitly tell it to.
Now that I think about this again, there should probably be a
shader-mem-barrier-style way to test for that particular issue in a way
that doesn't depend on the specifics of the parallelization. Something
like, in a loop:
Thread 1: increasing imageStore into image 1 at location 1, imageLoad
from image 1 location 2
Thread 2: same, but exchange locations 1 and 2
Both threads: imageAtomicAdd on the same location in image 2
Then each thread can check that _if_ the imageAtomicAdd detects the
buddy thread operating in parallel, _then_ they must also observe
incrementing values in the location that the buddy thread stores to.
Does that sound reasonable?
Yeah, that sounds reasonable, but keep in mind that even if both image
variables are marked coherent you cannot make assumptions about the
ordering of the image stores performed on image 1 relative to the
atomics performed on image 2 unless there is an explicit barrier in
between, which means that some level of L1 caching is legitimate even in
that scenario (and might have some performance benefit over skipping L1
caching of coherent images altogether) -- That's in fact the way that
the i965 driver implements coherent image stores: We just write to L1
and flush later on to the globally coherent L3 on the next
memoryBarrier().
Okay, adding the barrier makes sense.
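
To make the idea concrete, here is a rough sketch of the check with the
barriers added, written as a single-threaded toy model in C rather than as
an actual piglit test. Everything in it is an assumption made for
illustration: the names (run_model, image_store, memory_barrier, ...) are
hypothetical, the "L1" is a trivial per-thread cache in which a barrier
writes back dirty lines and invalidates everything, atomics bypass that
cache entirely, and the two threads are interleaved round-robin. It is not
meant to describe how GCN or i965 really behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { NUM_THREADS = 2 };

/* Hypothetical per-thread L1: one line per location of image 1. */
struct l1_cache {
    uint32_t data[NUM_THREADS];
    bool valid[NUM_THREADS];
    bool dirty[NUM_THREADS];
};

struct model {
    uint32_t image1[NUM_THREADS];    /* globally visible image 1 */
    uint32_t image2_counter;         /* image 2, only touched atomically */
    struct l1_cache l1[NUM_THREADS];
};

/* Stores land in the issuing thread's L1 only (assumed write-back). */
static void image_store(struct model *m, int t, int loc, uint32_t v)
{
    assert(loc >= 0 && loc < NUM_THREADS);
    m->l1[t].data[loc] = v;
    m->l1[t].valid[loc] = true;
    m->l1[t].dirty[loc] = true;
}

/* Loads hit in L1 when possible, otherwise fetch from global memory. */
static uint32_t image_load(struct model *m, int t, int loc)
{
    assert(loc >= 0 && loc < NUM_THREADS);
    if (!m->l1[t].valid[loc]) {
        m->l1[t].data[loc] = m->image1[loc];
        m->l1[t].valid[loc] = true;
    }
    return m->l1[t].data[loc];
}

/* Atomics never go through L1 in this model; returns the old value. */
static uint32_t image_atomic_add(struct model *m, uint32_t v)
{
    uint32_t old = m->image2_counter;
    m->image2_counter += v;
    return old;
}

/* Assumed barrier semantics: write back dirty lines, invalidate all. */
static void memory_barrier(struct model *m, int t)
{
    for (int loc = 0; loc < NUM_THREADS; loc++) {
        if (m->l1[t].dirty[loc])
            m->image1[loc] = m->l1[t].data[loc];
        m->l1[t].valid[loc] = false;
        m->l1[t].dirty[loc] = false;
    }
}

/* Runs both threads round-robin and counts failed checks. */
unsigned run_model(unsigned iterations, bool use_barriers)
{
    struct model m = {0};
    unsigned violations = 0;

    for (uint32_t i = 1; i <= iterations; i++) {
        for (int t = 0; t < NUM_THREADS; t++) {
            int buddy = 1 - t;

            image_store(&m, t, t, i);        /* increasing store */
            if (use_barriers)
                memory_barrier(&m, t);

            /* This thread has done i - 1 adds before this one, so the
             * rest of the old counter value came from the buddy. */
            uint32_t buddy_adds = image_atomic_add(&m, 1) - (i - 1);
            if (use_barriers)
                memory_barrier(&m, t);

            /* The buddy stored k before its k-th add, so we must see
             * at least buddy_adds in its location. */
            if (image_load(&m, t, buddy) < buddy_adds)
                violations++;
        }
    }
    return violations;
}
```

In this model run_model(8, true) returns 0, while run_model(8, false)
returns a non-zero count, because the stores to image 1 stay in the
writer's L1 while the atomics on image 2 are immediately visible. A real
test would of course run the two threads as shader invocations and could
only check the condition when the atomic actually shows the buddy making
progress.
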
What about a test along the lines of the current coherency test? Any
idea what's the reason you couldn't get it to reproduce the issue? Is
it because threads with dependent inputs are guaranteed to be spawned in
the same L1 cache domain as the threads that generated their inputs or
something like that?
From what I understand (though admittedly the documentation I have on
this is not the clearest...), the hardware flushes the L1 cache
automatically at the end of each shader invocation, so that dependent
invocations are guaranteed to see the data written by the invocations
they depend on.
Cheers,
Nicolai