On 06.12.2017 08:07, James Jones wrote:
[snip]
So let's say you have a setup where both display and GPU support
FOO/tiled, but only the GPU supports compressed (FOO/CC) and cached
(FOO/cached). And the GPU supports the following transitions:
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
Then the sets for each device (in order of preference):
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached);
constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
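To make the merging step above concrete, here is a rough C sketch of
one way it could work. Everything in it (struct cap_set, merge_entry,
gpu_can_strip, the string-based capability names) is made up for
illustration; it is not the allocator API. The idea: capabilities both
devices share are kept, a capability only the producer has is kept only
if the producer can transition it away, and constraints merge toward
the stricter value.

/* Rough sketch only: all types and names below are invented for
 * illustration; this is not the allocator API. */
#include <stdint.h>
#include <string.h>

#define MAX_CAPS 8

struct cap_set {
    const char *caps[MAX_CAPS];   /* e.g. "FOO/tiled", "FOO/CC" */
    unsigned    num_caps;
    uint32_t    alignment;        /* constraint: required alignment */
};

static int set_has(const struct cap_set *s, const char *cap)
{
    for (unsigned i = 0; i < s->num_caps; i++)
        if (!strcmp(s->caps[i], cap))
            return 1;
    return 0;
}

/* In the example above, the GPU advertises trans_a (FOO/CC -> null)
 * and trans_b (FOO/cached -> null), i.e. it can strip those two. */
static int gpu_can_strip(const char *cap)
{
    return !strcmp(cap, "FOO/CC") || !strcmp(cap, "FOO/cached");
}

/*
 * Merge one GPU entry with one display entry.  A GPU capability the
 * display lacks is kept only if the GPU has a transition that strips
 * it; that transition then becomes required on the GPU -> display
 * hand-off.  Constraints merge toward the stricter value (max
 * alignment here).  Returns -1 if this pair of entries is
 * incompatible.
 */
static int merge_entry(const struct cap_set *gpu,
                       const struct cap_set *display,
                       struct cap_set *merged,
                       const char **gpu_to_display, unsigned *num_trans)
{
    memset(merged, 0, sizeof(*merged));
    *num_trans = 0;
    merged->alignment = gpu->alignment > display->alignment ?
                        gpu->alignment : display->alignment;

    for (unsigned i = 0; i < gpu->num_caps; i++) {
        const char *cap = gpu->caps[i];
        if (set_has(display, cap)) {
            merged->caps[merged->num_caps++] = cap;
        } else if (gpu_can_strip(cap)) {
            merged->caps[merged->num_caps++] = cap;
            gpu_to_display[(*num_trans)++] = cap;
        } else {
            return -1;
        }
    }
    return 0;
}

Running that over each (GPU entry, display entry) pair in preference
order gives the three merged entries listed above, with trans_a and
trans_b recorded as required GPU -> display transitions wherever
FOO/CC and FOO/cached survive.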
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to
program our hardware so that the backend bypasses all caches, but
(a) nobody validates that and (b) it's basically suicide in terms of
performance. Let's build fewer footguns :)
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
with:
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display.
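Using the same made-up types from the sketch earlier in the thread (so
not self-contained, and still purely hypothetical), this second case is
just different input data:

/* The two GPU entries and the display entry for this example, using
 * the invented struct cap_set from the sketch above: */
static const struct cap_set gpu_sets[] = {
    { { "FOO/tiled", "FOO/CC", "FOO/cached" }, 3, 32 * 1024 },
    { { "FOO/tiled", "FOO/cached" },           2, 32 * 1024 },
};
static const struct cap_set display_set =
    { { "FOO/tiled" },                          1, 64 * 1024 };

/* merge_entry(&gpu_sets[0], &display_set, ...) ->
 *   caps(tiled, CC, cached), alignment=64k,
 *   required GPU->display transitions: trans_a, trans_b
 * merge_entry(&gpu_sets[1], &display_set, ...) ->
 *   caps(tiled, cached), alignment=64k,
 *   required GPU->display transition: trans_b */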
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an
equivalent mechanism to a cache) should be capabilities at all in one
of the open issues on my XDC 2017 slides, because of this very problem
of over-pruning that it causes. It's on slide 15, as "No device-local
capabilities". You'll have to listen to my coverage of it in the
recorded presentation for that slide to make any sense, but it's the
same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities:
The GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines. That does mean the GPU driver
currently needs to be the one to realize the allocation from the
capability set to get optimal behavior. We could fix that by reworking
our driver though. At this point, not including device-local properties
like on-device caching in capabilities seems like the right solution to
me. I'm curious whether this applies universally though, or if other
hardware doesn't fit the "compression and stuff all behaves like a
cache" idiom.
Compression is part of the memory layout for us: framebuffer
compression uses an additional "meta surface". At the most basic level,
an allocation with lossless compression support is by necessity bigger
than an allocation without it.
We can allocate this meta surface separately, but then we're forced to
decompress when passing the surface around (e.g. to a compositor).
Consider also the example I gave elsewhere, where a cross-vendor tiling
layout is combined with vendor-specific compression:
Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)
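To spell that out with made-up data (only BASE/foo-tiling,
VND1/compression and VND2/compression come from the example above):

/* Illustration only.  A plain intersection-style merge would keep just
 * the cross-vendor tiling cap and drop both vendor-specific
 * compression caps, forcing uncompressed surfaces, unless per-device
 * transitions (e.g. "VND1 decompresses on hand-off") let one of them
 * survive in the merged set. */
static const char *device1_caps[] = { "BASE/foo-tiling", "VND1/compression" };
static const char *device2_caps[] = { "BASE/foo-tiling", "VND2/compression" };
static const char *naive_merge[]  = { "BASE/foo-tiling" };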
Some more thoughts on caching or "device-local" properties below.
[snip]
I think I like the idea of having transitions be part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
with flows like:
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up
front when initially querying caps in the model I assumed. The app then
specifies some subset (up to the full set) of the specified usages as a
src and dst when querying transition metadata.
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it
would be required that the driver support transitioning to any single
usage given the capabilities returned. However, transitioning to
multiple usages (e.g., simultaneously rendering and scanning out)
could fail to produce a valid transition, in which case the app would
have to fall back to a copy, or avoid that simultaneous usage
combination in some other way.
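A rough sketch of that flow in code, with every type and entry point
invented for illustration (declarations only, just to show the call
order; this is not the real libdevicealloc interface):

#include <stddef.h>

typedef unsigned usage_t;
#define USAGE_RENDER  (1u << 0)
#define USAGE_SCANOUT (1u << 1)

struct cap_set;      /* opaque for this sketch */
struct allocation;
struct transition;   /* whatever data the drivers need to transition */

/* hypothetical entry points */
struct cap_set    *query_caps(usage_t all_usages);
struct allocation *alloc_from_caps(const struct cap_set *caps, size_t sz);
int                query_transition(const struct allocation *a,
                                    usage_t src, usage_t dst,
                                    struct transition **out);

static struct allocation *setup_shared_buffer(void)
{
    /* 1. Specify *all* usages the app will ever transition between
     *    up front, and allocate from the merged capability set. */
    struct cap_set *caps = query_caps(USAGE_RENDER | USAGE_SCANOUT);
    struct allocation *buf = alloc_from_caps(caps, 4 * 1024 * 1024);

    /* 2. Later, ask for the metadata to move from one subset of those
     *    usages (RENDER) to another (SCANOUT). */
    struct transition *t = NULL;
    if (query_transition(buf, USAGE_RENDER, USAGE_SCANOUT, &t) != 0) {
        /* 3. No valid transition for this usage combination and
         *    layout: fall back to a copy/blit, or avoid the
         *    combination some other way. */
    }
    /* Otherwise, execute 't' when handing the buffer from the GPU to
     * the display engine. */
    return buf;
}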
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree
Rob's example outcomes above are ideal, but it's not clear to me how to
code up such an algorithm. This also all seems unnecessary if "device
local" capabilities aren't needed, as posited above.
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator currently only knows the requested union of all usages.
The number of possible transitions grows combinatorially with every
usage requested, I believe. I expect there will be cases where ~10
usages are specified, so generating all possible transitions all the
time may be excessive, when the app will probably only care about 2 or
3 states, and in practice there will probably only be 2 or 3 distinct
underlying combinations of operations.
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
I'm looking for something that would also eliminate another part of the
design that makes me uncomfortable: the metadata for transitions. It
makes me uncomfortable for a number of reasons. Who computes the
metadata? What does its representation look like? With cross-device
usages (which are the whole point of the exercise), this quickly becomes
infeasible.
So instead as a thought experiment, let's just use what we already have:
capabilities and constraints (or properties/attributes).
I kind of already outlined this with the long example in my email here
https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
Let me try to summarize the transition algorithm. Its inputs are:
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the
surface was allocated
Steps of the algorithm:
1. Compute the merged capability set for the new usages (the destination
capability set).
2. Compute the transition capability set, which is the merger of the
source and destination sets.
3. Determine whether a "release" transition is required on the source
device(s):
3a. For global properties, a transition is required if the source
capability set is a proper superset of the transition set.
3b. For device-local properties, a transition is required if there is
some destination device for which the device-local properties are a
subset of the source set.
4. Determine whether an "acquire" transition is required on the
destination device(s) in a similar way.
Finally, execute the transitions using corresponding APIs, where the
APIs simply receive the computed capability sets.
For example, release transitions would receive the source capability set
(and perhaps the source usages), the transition capability set, and the
set difference of device-local capabilities, and nothing else.
The point is that all steps of the algorithm can be implemented in a
device-agnostic way in libdevicealloc, without calling into any
device/driver callbacks.
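As a sketch of those steps as plain set operations (every helper below
is invented; the structure is the point, not the names):

/* Sketch only: cap_set and its helpers are invented; the point is that
 * each step works on capability sets alone, no driver callback. */

struct cap_set;

/* assumed set-style helpers over capability sets */
struct cap_set *cap_set_merge(const struct cap_set *a,
                              const struct cap_set *b);
int             cap_set_is_proper_superset(const struct cap_set *a,
                                           const struct cap_set *b);
struct cap_set *cap_set_difference(const struct cap_set *a,
                                   const struct cap_set *b);

struct transition_plan {
    const struct cap_set *src;      /* current (source) set          */
    const struct cap_set *dst;      /* merged set for the new usages */
    const struct cap_set *trans;    /* merge(src, dst)               */
    const struct cap_set *release;  /* what the source must drop     */
    const struct cap_set *acquire;  /* what the destination gains    */
    int need_release, need_acquire;
};

static void plan_transition(const struct cap_set *src_set,
                            const struct cap_set *dst_usage_set,
                            struct transition_plan *p)
{
    /* 1. The destination set is the merged set for the new usages, as
     *    queried when the surface was allocated; passed in here. */
    p->src = src_set;
    p->dst = dst_usage_set;

    /* 2. Transition set = merge of source and destination sets. */
    p->trans = cap_set_merge(src_set, dst_usage_set);

    /* 3. A release is needed if the source carries more than the
     *    transition set (global properties; device-local ones would
     *    additionally get the per-destination-device check from 3b). */
    p->need_release = cap_set_is_proper_superset(src_set, p->trans);
    p->release      = cap_set_difference(src_set, p->trans);

    /* 4. Symmetrically, an acquire may be needed on the destination. */
    p->need_acquire = cap_set_is_proper_superset(dst_usage_set, p->trans);
    p->acquire      = cap_set_difference(dst_usage_set, p->trans);
}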
I'm pretty sure this or something like it can be made to work. We need
to think through a lot of example cases, but at least we'll have thought
them through, which is better than relying on some opaque metadata thing
and then finding out later that there are some new cross-device cases
where things don't work out because the piece of (presumably
device-specific driver) code that computes the metadata isn't aware of them.
[snip]
One final note: When I initially wrote up the capability merging logic,
I treated "layout" as a sort of "special" capability, basically like
Nicolai originally outlined above. Miguel suggested I add the
"required" bit instead to generalize things, and it ended up working
out much cleaner. Besides the layout, there is at least one other
candidate for a "required" capability that became obvious as soon as I
started coding up the prototype driver: memory location. It might seem
like memory location is a simple device-agnostic constraint rather than
a capability, but it's actually too complicated for that (we need more
memory locations than just "device" and "host"). It has to be
vendor-specific, and hence fits in better as a capability.
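For illustration only, a made-up sketch of what that could look like
(this is not the prototype's actual header): a common capability header
with a vendor namespace and a "required" flag, plus a hypothetical
vendor-specific memory-location capability that merging is not allowed
to silently drop.

#include <stdint.h>

/* Invented layout, for illustration only. */
#define CAP_FLAG_REQUIRED (1u << 0)  /* merging may not drop this cap */

struct cap_header {
    uint32_t vendor;   /* vendor namespace, or a common/base id       */
    uint32_t type;     /* vendor-defined capability type              */
    uint32_t size;     /* size of the full capability, header incl.   */
    uint32_t flags;    /* CAP_FLAG_REQUIRED, ...                      */
};

/* A hypothetical vendor-specific "memory location" capability: more
 * locations than a plain device/host constraint could express, so it
 * lives in the vendor's namespace and is marked required so it can't
 * be pruned away during merging. */
#define VND1_CAP_MEM_LOCATION        0x0001
#define VND1_MEM_LOCATION_VRAM       0
#define VND1_MEM_LOCATION_GTT_CACHED 1
#define VND1_MEM_LOCATION_GTT_WC     2

struct vnd1_cap_mem_location {
    struct cap_header hdr;   /* .flags = CAP_FLAG_REQUIRED */
    uint32_t location;       /* one of VND1_MEM_LOCATION_* */
};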
Could you give more concrete examples of what you'd like to see, and why
having this as constraints is insufficient?
I think if possible, we should try to keep the design generalized to as
few types of objects and special cases as possible. The more we can
generalize the solutions to our existing problem set, the better the
mechanism should hold up as we apply it to new and unknown problems as
they arise.
I'm coming around to the fact that those things should perhaps live in a
single list/array, but I still don't like the term "capability".
I admit it's a bit of bike-shedding, but I'm starting to think it would
be better to go with the generic term "property" or "attribute", and
then add flags/adjectives to that based on how merging should work.
This would include the constraints as well -- it seems arbitrary to me
that those would be singled out into their own list.
Basically, the underlying principle is that a good API would have either
one list that includes all the properties, or one list per
merging-behavior. And I think one single list is easier on the API
consumer and easier to extend.
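As a sketch of that idea (invented encoding, not a concrete proposal):
a single property list where each entry carries flags describing how it
merges, so constraints would not need to be singled out into their own
list.

#include <stdint.h>

/* Invented encoding, purely to illustrate "one list, per-property
 * merge behavior" instead of separate capability/constraint lists. */
#define PROP_MERGE_INTERSECT (1u << 0) /* keep only if every device lists it */
#define PROP_MERGE_MAX       (1u << 1) /* take the strictest value (alignment) */
#define PROP_REQUIRED        (1u << 2) /* may not be dropped while merging */

struct property {
    uint32_t vendor;   /* namespace for the property type       */
    uint32_t type;     /* vendor-defined property id            */
    uint32_t flags;    /* PROP_MERGE_*, PROP_REQUIRED           */
    uint32_t size;     /* size of the value bytes that follow   */
    /* followed by 'size' bytes of property value */
};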
Cheers,
Nicolai
--
Learn what the world is really like,
but never forget how it ought to be.