On 2017-03-01 12:13 PM, Andres Rodriguez wrote:
On 3/1/2017 6:42 AM, Christian König wrote:
Patches #1-#14 are Acked-by: Christian König <christian.koe...@amd.com>.
Patch #15:
Not sure if that is a good idea or not, need to take a closer look
after digging through the rest.
In general the HW IP is just for the IOCTL API and not for internal
use inside the driver.
I'll drop this patch and use ring->funcs->type instead.
Patch #16:
Really nice :) I don't have time to look into it in detail, but you have one misconception I'd like to point out:
The queue manager maintains a per-file descriptor map of user ring ids
to amdgpu_ring pointers. Once a map is created it is permanent (this is
required to maintain FIFO execution guarantees for a ring).
Actually we don't have a FIFO execution guarantee per ring. We only
have that per context.
Agreed. I'm using pretty imprecise terminology here which can be
confusing. I wanted to be more precise than "context", because two
amdgpu_cs_request submissions to the same context but with a different
ring field can execute out of order.
I think s/ring/context's ring/ should be enough to clarify here if you
think so as well.
E.g. commands from different contexts can execute at the same time and out of order.
Making this per file is ok for now, but you should keep in mind that
we might want to change that sooner or later.
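For reference, the per-fd mapping boils down to something like the sketch below. This is illustrative only; the struct, field and helper names (amdgpu_user_ring_map, AMDGPU_MAX_USER_RINGS, pick_ring_for_ip) are made up and are not necessarily what the series uses:

/* Sketch: per-fd table mapping a user-visible ring id to a HW ring.
 * The slot is filled lazily on first use and then never changes, so
 * later submissions to the same user ring id land on the same ring. */
struct amdgpu_user_ring_map {
        struct mutex lock;
        struct amdgpu_ring *rings[AMDGPU_MAX_USER_RINGS]; /* per HW IP type */
};

static struct amdgpu_ring *
user_ring_map_get(struct amdgpu_device *adev,
                  struct amdgpu_user_ring_map *map,
                  int hw_ip, u32 user_ring)
{
        struct amdgpu_ring *ring;

        if (user_ring >= AMDGPU_MAX_USER_RINGS)
                return ERR_PTR(-EINVAL);

        mutex_lock(&map->lock);
        ring = map->rings[user_ring];
        if (!ring) {
                /* first use: pick a HW ring (e.g. via an LRU policy)
                 * and remember the choice for this fd */
                ring = pick_ring_for_ip(adev, hw_ip);
                map->rings[user_ring] = ring;
        }
        mutex_unlock(&map->lock);

        return ring;
}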
Patches #17 & #18: I need to take a closer look when I have more time, but the comments from others sounded valid to me as well.
Patch #19: Raising and lowering the priority of a ring during command
submission doesn't sound like a good idea to me.
I'm not really sure what would be a better time than at command submission.
If it was just SPI priorities we could have static partitioning of
rings, some high priority and some regular, etc. But that approach
reduces the number of rings available. It would also require a callback
at command submission time for CU reservation.
The way you currently have it implemented would also raise the
priority of already running jobs on the same ring. Keep in mind that
everything is pipelined here.
That is actually intentional. If there is work already on the ring with lower priority, we don't want the high priority work to have to wait for it to finish executing at regular priority. Therefore the work that has already been committed to the ring inherits the higher priority level.
I agree this isn't ideal, which is why the LRU ring mapping policy is
there to make sure this doesn't happen often.
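Roughly, the LRU pick looks like the sketch below (illustrative only; it assumes the ring struct gains a hypothetical 'lru_node' list head and that the list is non-empty):

/* Sketch: pick the least-recently-used ring out of a list of
 * equivalent rings, then move it to the tail so the next pick
 * prefers a different ring.  Spreading submissions this way keeps
 * high priority work off rings that already have lower priority
 * work committed, most of the time. */
static struct amdgpu_ring *pick_lru_ring(struct list_head *ring_lru)
{
        struct amdgpu_ring *ring;

        /* head of the list = least recently used */
        ring = list_first_entry(ring_lru, struct amdgpu_ring, lru_node);
        list_move_tail(&ring->lru_node, ring_lru);

        return ring;
}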
In addition to that, you can't have a fence callback in the job structure, because the job structure is freed by the same fence as well. So it can happen that you access freed memory (but only for a very short period of time).
Any strong preference for either 1) refcounting the job structure, or
2) allocating a new piece of memory to store the callback parameters?
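For reference, option 2 would look roughly like the sketch below, assuming the dma_fence callback API (dma_fence_add_callback() with an embedded struct dma_fence_cb). The struct and helper names (prio_cb_data, restore_ring_priority, arm_prio_restore) are made up:

/* Sketch: copy what the callback needs out of the job into a small,
 * separately allocated struct, so the callback never touches the job
 * after the fence that frees it has signaled. */
struct prio_cb_data {
        struct dma_fence_cb cb;
        struct amdgpu_ring *ring;
        int priority;
};

static void prio_fence_cb(struct dma_fence *f, struct dma_fence_cb *cb)
{
        struct prio_cb_data *data = container_of(cb, struct prio_cb_data, cb);

        /* e.g. drop the temporary priority boost on this ring */
        restore_ring_priority(data->ring, data->priority);
        kfree(data);
}

static int arm_prio_restore(struct dma_fence *fence,
                            struct amdgpu_ring *ring, int priority)
{
        struct prio_cb_data *data = kzalloc(sizeof(*data), GFP_KERNEL);

        if (!data)
                return -ENOMEM;

        data->ring = ring;
        data->priority = priority;

        /* if the fence already signaled, run the callback ourselves */
        if (dma_fence_add_callback(fence, &data->cb, prio_fence_cb))
                prio_fence_cb(fence, &data->cb);

        return 0;
}

Option 1 (refcounting the job) would avoid the extra allocation, but it means the job no longer goes away together with its fence.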
Patches #20-#22 are Acked-by: Christian König
<christian.koe...@amd.com>.
Regards,
Christian.
On 28.02.2017 at 23:14, Andres Rodriguez wrote:
This patch series introduces a mechanism that allows users with sufficient privileges to categorize their work as "high priority". A userspace app can create a high priority amdgpu context, where any work submitted to this context will receive preferential treatment over any other work.

High priority contexts will be scheduled ahead of other contexts by the sw gpu scheduler. This functionality is generic for all HW blocks.
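Conceptually the sw scheduler side reduces to serving the highest priority run queue that has work, along the lines of the sketch below (stub types only, not the real gpu_scheduler structures):

/* Sketch: run queues ordered by priority; the first non-empty one
 * is served first, so high priority entities are picked ahead of
 * normal ones whenever they have jobs queued. */
enum stub_priority {
        STUB_PRIORITY_HIGH,
        STUB_PRIORITY_NORMAL,
        STUB_PRIORITY_COUNT,
};

struct stub_rq {
        struct list_head entities;      /* entities with pending jobs */
};

static struct list_head *
select_next_entity(struct stub_rq rq[STUB_PRIORITY_COUNT])
{
        int i;

        for (i = STUB_PRIORITY_HIGH; i < STUB_PRIORITY_COUNT; i++) {
                if (!list_empty(&rq[i].entities))
                        return rq[i].entities.next; /* highest priority ready entity */
        }

        return NULL;
}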
Optionally, a ring can implement a set_priority() function that allows programming HW specific features to elevate a ring's priority.

This patch series implements set_priority() for gfx8 compute rings. It takes advantage of SPI scheduling and CU reservation to provide improved frame latencies for high priority contexts.
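The HW hook itself is just an optional function pointer on the ring; something along these lines (sketch only, struct and field names are illustrative):

/* Sketch: rings that can do HW-level prioritization (e.g. gfx8
 * compute via SPI priority + CU reservation) provide set_priority;
 * rings that can't leave it NULL and rely purely on the sw gpu
 * scheduler ordering. */
struct ring_priority_funcs {
        void (*set_priority)(struct amdgpu_ring *ring, int priority);
};

static void ring_set_priority(struct amdgpu_ring *ring,
                              const struct ring_priority_funcs *funcs,
                              int priority)
{
        if (funcs->set_priority)
                funcs->set_priority(ring, priority);
}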
For compute + compute scenarios we get near perfect scheduling latency. E.g. one high priority ComputeParticles + one low priority ComputeParticles:
- High priority ComputeParticles: 2.0-2.6 ms/frame
- Regular ComputeParticles: 35.2-68.5 ms/frame
For compute + gfx scenarios the high priority compute application does experience some latency variance. However, the variance has smaller bounds and a smaller deviation than without high priority scheduling.
Following is a graph of the frame time experienced by a high priority compute app in 4 different scenarios to exemplify the compute + gfx latency variance:
- ComputeParticles: this scenario involves running the compute particles sample on its own.
- +SSAO: Previous scenario with the addition of running the ssao sample application that clogs the GFX ring with constant work.
- +SPI Priority: Previous scenario with the addition of SPI priority programming for compute rings.
- +CU Reserve: Previous scenario with the addition of dynamic CU reservation for compute rings.
Graph link:
https://plot.ly/~lostgoat/9/
As seen above, high priority contexts for compute allow us to schedule work with enhanced confidence of completion latency under high GPU loads. This property will be important for VR reprojection workloads.
Note: The first part of this series is a resend of "Change queue/pipe split between amdkfd and amdgpu" with the following changes:
- Fixed kfdtest on Kaveri due to shift overflow. Refer to: "drm/amdkfd: allow split HQD on per-queue granularity v3"
- Used Felix's suggestions for a simplified HQD programming sequence
- Added a workaround for a Tonga HW bug during HQD programming
This series is also available at:
https://github.com/lostgoat/linux/tree/wip-high-priority
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx