On 06/30, Alex Deucher wrote:
> On Tue, May 27, 2025 at 4:46 PM Rodrigo Siqueira <sique...@igalia.com> wrote:
> >
> > Hi Alex,
> >
> > Follow some comments and questions.
> >
> > On 05/02, Alex Deucher wrote:
> > > Add an initial documentation page for user mode queues.
> > >
> > > Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> > > ---
> > >  Documentation/gpu/amdgpu/index.rst |   1 +
> > >  Documentation/gpu/amdgpu/userq.rst | 196 +++++++++++++++++++++++++++++
> > >  2 files changed, 197 insertions(+)
> > >  create mode 100644 Documentation/gpu/amdgpu/userq.rst
> > >
> > > diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> > > index bb2894b5edaf2..45523e9860fc5 100644
> > > --- a/Documentation/gpu/amdgpu/index.rst
> > > +++ b/Documentation/gpu/amdgpu/index.rst
> > > @@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
> > >     module-parameters
> > >     gc/index
> > >     display/index
> > > +   userq
> > >     flashing
> > >     xgmi
> > >     ras
> > > diff --git a/Documentation/gpu/amdgpu/userq.rst b/Documentation/gpu/amdgpu/userq.rst
> > > new file mode 100644
> > > index 0000000000000..53e6b053f652f
> > > --- /dev/null
> > > +++ b/Documentation/gpu/amdgpu/userq.rst
> > > @@ -0,0 +1,196 @@
> > > +==================
> > > + User Mode Queues
> > > +==================
> > > +
> > > +Introduction
> > > +============
> > > +
> > > +Similar to the KFD, GPU engine queues move into userspace. The idea is to let
> > > +user processes manage their submissions to the GPU engines directly, bypassing
> > > +IOCTL calls to the driver to submit work. This reduces overhead and also allows
> > > +the GPU to submit work to itself. Applications can set up work graphs of jobs
> > > +across multiple GPU engines without needing trips through the CPU.
> > > +
> > > +UMDs directly interface with firmware via per application shared memory areas.
> > > +The main vehicle for this is queue.
> > > +A queue is a ring buffer with a read
> > > +pointer (rptr) and a write pointer (wptr). The UMD writes IP specific packets
> > > +into the queue and the firmware processes those packets, kicking off work on the
> > > +GPU engines. The CPU in the application (or another queue or device) updates
> > > +the wptr to tell the firmware how far into the ring buffer to process packets
> > > +and the rtpr provides feedback to the UMD on how far the firmware has progressed
> > > +in executing those packets. When the wptr and the rptr are equal, the queue is
> > > +idle.
> > > +
> > > +Theory of Operation
> > > +===================
> > > +
> > > +The various engines on modern AMD GPUs support multiple queues per engine with a
> > > +scheduling firmware which handles dynamically scheduling user queues on the
> > > +available hardware queue slots. When the number of user queues outnumbers the
> > > +available hardware queue slots, the scheduling firmware dynamically maps and
> > > +unmaps queues based on priority and time quanta. The state of each user queue
> > > +is managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a
> > > +buffer in GPU accessible memory that stores the state of a user queue. The
> > > +scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
> > > +Queue Descriptor) when a user queue is mapped. Each user queue requires a
> > > +number of additional buffers which represent the ring buffer and any metadata
> > > +needed by the engine for runtime operation. On most engines this consists of
> > > +the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
> > > +to userspace), a wrptr buffer (where the application will write the wptr for the
> > > +firmware to fetch it), and a doorbell.
> > > +A doorbell is a piece of the device's
> >
> > In this part, you started to explain about the doorbell; consider adding
> > a new paragraph here.
>
> Added some additional info here.
>
> >
> > Another idea could be to create a dedicated page to explain doorbells
> > and move all the general doorbell information from this patch to the new
> > page. I think there is no kernel-doc about amdgpu doorbells.
> >
> > > +MMIO BAR which can be mapped to specific user queues. Writing to the doorbell
> > > +wakes the firmware and causes it to fetch the wptr and start processing the
> > > +packets in the queue. Each 4K page of the doorbell BAR supports specific offset
> > > +ranges for specific engines. The doorbell of a queue most be mapped into the
> >
> > /most/must/
>
> Fixed.
>
> >
> > > +aperture aligned to the IP used by the queue (e.g., GFX, VCN, SDMA, etc.).
> > > +These doorbell apertures are set up via NBIO registers. Doorbells are 32 bit or
> > > +64 bit (depending on the engine) chunks of the doorbell BAR. A 4K doorbell page
> > > +provides 512 64-bit doorbells for up to 512 user queues. A subset of each page
> > > +is reserved for each IP type supported on the device. The user can query the
> > > +doorbell ranges for each IP via the INFO IOCTL.
> >
> > The first time that I read this, I was confused about the IOCTL part;
> > however, at the end of this patch, I noticed that you explained the
> > IOCTL part. Perhaps add a mention in parenthesis so the reader can see
> > more details about this info in the "IOCTL Interfaces" section.
>
> Updated.
>
> >
> > > +
> > > +When an application wants to create a user queue, it allocates the the necessary
> > > +buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
> > > +These can be separate buffers or all part of one larger buffer.
> > > +The application
> > > +would map the buffer(s) into its GPUVM and use the GPU virtual addresses of for
> > > +the areas of memory they want t use for the user queue. They would also
> >
> > /t/to/
>
> Fixed.
>
> >
> > > +allocate a doorbell page for the doorbells used by the user queues. The
> > > +application would then populate the MQD in the USERQ IOCTL structure with the
> > > +GPU virtual addresses and doorbell index they want to use. The user can also
> > > +specify the attributes for the user queue (priority, whether the queue is secure
> > > +for protected content, etc.). The application would then call the USERQ
> > > +create IOCTL to create the queue from using the specified MQD. The
> > > +kernel driver then validates the MQD provided by the application and translates
> > > +the MQD into the engine specific MQD format for the IP. The IP specific MQD
> > > +would be allocated and the queue would be added to the run list maintained by
> > > +the scheduling firmware. Once the queue has been created, the application can
> > > +write packets directly into the queue, update the wptr, and write to the
> > > +doorbell offset to kick off work in the user queue.
> > > +
> > > +When the application is done with the user queue, it would call the USERQ
> > > +FREE IOCTL to destroy it. The kernel driver would preempt the queue and
> > > +remove it from the scheduling firmware's run list. Then the IP specific MQD
> > > +would be freed and the user queue state would be cleaned up.
> >
> > Is it possible to add some pseudo-code that summarizes the programming
> > model described here?
>
> I'm not sure I understand what you are asking for here.

Hi Alex,

You can ignore my question.

>
> >
> > > +
> > > +Some engines may require the aggregated doorbell to if the engine does not
> >
> > /to/too/ or /to//?
>
> Fixed.
>
> >
> > Do you know which engines require the aggregated doorbell? Can this
> > information be retrieved via IOCTL? I think this information can be
> > helpful for userspace implementation.
>
> No IPs which currently support user queues require the aggregated
> doorbell. VCN likely will be the first IP
> that needs it.
>
> >
> > > +support doorbells from unmapped queues. The aggregated doorbell is a special
> > > +page of doorbell space which wakes the scheduler. In cases where the engine may
> > > +be oversubscribed, some queues may not be mapped. If the doorbell is rung when
> > > +the queue is not mapped, the engine firmware may miss the request. Some
> > > +scheduling firmware may work around this my polling wptr shadows when the

/my/by/ ?

> > > +hardware is oversubscribed, other engines may support doorbell updates from
> > > +unmapped queues. In the event that one of these options is not available, the
> > > +kernel driver will map a page of aggregated doorbell space into each GPUVM
> > > +space. The UMD will then update the doorbell and wptr as normal and then write
> > > +to the aggregated doorbell as well.
> > > +
> > > +Special Packets
> > > +---------------
> > > +
> > > +In order to support legacy implicit synchronization, as well as mixed user and
> > > +kernel queues, we need a synchronization mechanism that is secure. Because
> > > +kernel queues or memory management tasks depend on kernel fences, we need a way
> > > +for user queues to update memory that the kernel can use for a fence, that can't
> > > +be messed with by a bad actor. To support this, we've added protected fence
> > > +packet.
> > > +This packet works by writing the a monotonically increasing value to
> > > +a memory location that is only the privileged clients have write access to.
> > > +User queues only have read access. When this packet is executed, the memory
> > > +location is updated and other queues (kernel or user) can see the results.
> >
> > Does the driver handle this packet? I mean, does the driver insert it
> > without the userspace request? What is the packet name? How can I find
> > it in the kernel?
>
> The actual packet format varies from IP to IP (GFX/Compute, SDMA, VCN,
> etc.), but the behavior is the same. The packet submission is handled
> in userspace. The kernel driver just sets up the privileged memory
> used for each user queue when it sets them up when the application
> creates them.

Could you include this additional information in the new version?

Thanks

>
> >
> > > +
> > > +Memory Management
> > > +=================
> > > +
> > > +It is assumed that all buffers mapped into the GPUVM space for the process are
> > > +valid when engines on the GPU are running. The kernel driver will only allow
> > > +user queues to run when all buffers are mapped. If there is a memory event that
> > > +requires buffer migration, the kernel driver will preempt the user queues,
> > > +migrate buffers to where they need to be, update the GPUVM page tables and
> > > +invaldidate the TLB, and then resume the user queues.
> > > +
> > > +Interaction with Kernel Queues
> > > +==============================
> > > +
> > > +Depending on the IP and the scheduling firmware, you can enable kernel queues
> > > +and user queues at the same time, However, you are limited by the HQD slots.
> >
> > /However/however/
>
> Fixed.
>
> >
> > > +Kernel queues are always mapped so any work the goes into kernel queues will
> > > +take priority. This limits the available HQD slots for user queues.
> > > +
> > > +Not all IPs will support user queues on all GPUs. As such, UMDs will need to
> > > +support both user queues and kernel queues depending on the IP. For example, a
> > > +GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
> > > +and VPE. UMDs need to support both. The kernel driver provides a way to
> > > +determine if user queues and kernel queues are supported on a per IP basis.
> > > +UMDs can query this information via the INFO IOCTL and determine whether to use
> > > +kernel queues or user queues for each IP.
> > > +
> > > +Queue Resets
> > > +============
> > > +
> > > +For most engines, queues can be reset individually. GFX, compute, and SDMA
> > > +queues can be reset individually. When a hung queue is detected, it can be
> > > +reset either via the scheduling firmware or MMIO. Since there are no kernel
> > > +fences for most user queues, they will usually only be detected when some other
> > > +event happens; e.g., a memory event which requires migration of buffers. When
> > > +the queues are preempted, if the queue is hung, the preemption will fail.
> > > +Driver will them look up the queues that failed to preempt and reset them and
> > > +record which queues are hung.
> > > +
> > > +
> > > +On the UMD side, we will add an USERQ QUERY_STATUS IOCTL to query the queue
> > > +status. UMD will provide the queue id in the IOCTL and the kernel driver
> > > +will check if it has already recorded the queue as hung (e.g., due to failed
> > > +peemption) and report back the status.
> > > +
> > > +IOCTL Interfaces
> > > +================
> > > +
> > > +GPU virtual addresses used for queues and related data (rptrs, wptrs, context
> > > +save areas, etc.) should be validated by the kernel mode driver to prevent the
> > > +user from specifying invalid GPU virtual addresses.
> > > +If the user provides
> > > +invalid GPU virtual addresses or doorbell indicies, the IOCTL should return an
> > > +error message. These buffers should also be tracked in the kernel driver so
> > > +that if the user attempts to unmap the buffer(s) from the GPUVM, the umap call
> > > +would return an error.
> > > +
> > > +INFO
> > > +----
> > > +There are several new INFO queries related to user queues in order to query the
> > > +size of user queue meta data needed for a user queue (e.g., context save areas
> > > +or shadow buffers), and whether kernel or user queues or both are supported
> > > +for each IP type.
> > > +
> > > +USERQ
> > > +-----
> > > +The USERQ IOCTL is used for creating, freeing, and querying the status of user
> > > +queues. It supports 3 opcodes:
> > > +
> > > +1. CREATE - Create a user queue. The application provides a MQD-like structure
> > > +   that devices the type of queue and associated metadata and flags for that
> > > +   queue type. Returns the queue id.
> > > +2. FREE - Free a user queue.
> > > +3. QUERY_STATRUS - Query that status of a queue. Used to check if the queue is
> >
> > /QUERY_STATRUS/QUERY_STATUS/?
>
> Fixed.
>
> Thanks,
>
> Alex
>
> >
> > Thanks
> > Siqueira
> >
> > > +   healthy or not. E.g., if the queue has been reset. (WIP)
> > > +
> > > +USERQ_SIGNAL
> > > +------------
> > > +The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.
> > > +
> > > +USERQ_WAIT
> > > +----------
> > > +The USERQ_WAIT IOCTL is used to provide a list of sync object to be waited on.
> > > +
> > > +Kernel and User Queues
> > > +======================
> > > +
> > > +In order to properly validate and test performance, we have a driver option to
> > > +select what type of queues are enabled (kernel queues, user queues or both).
> > > +The user_queue driver parameter allows you to enable kernel queues only (0),
> > > +user queues and kernel queues (1), and user queues only (2). Enabling user
> > > +queues only will free up static queue assignments that would otherwise be used
> > > +by kernel queues for use by the scheduling firmware. Some kernel queues are
> > > +required for kernel driver operation and they will always be created. When the
> > > +kernel queues are not enabled, they are not registered with the drm scheduler
> > > +and the CS IOCTL will reject any incoming command submissions which target those
> > > +queue types. Kernel queues only mirrors the behavior on all existing GPUs.
> > > +Enabling both queues allows for backwards compatibility with old userspace while
> > > +still supporting user queues.
> > > --
> > > 2.49.0
> > >
> >
> > --
> > Rodrigo Siqueira

--
Rodrigo Siqueira
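
For readers following the thread, the queue model described in the quoted patch (ring buffer, wptr/rptr, doorbell) can be sketched as a small host-only C model. This is only an illustration under stated assumptions: none of the names below (userq_model, userq_model_submit, doorbells_per_page, and so on) come from amdgpu or its UAPI, the "firmware" is simulated by an ordinary function, the doorbell is a plain flag rather than an MMIO write, and real packets are IP specific rather than single dwords.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; none are amdgpu or UAPI symbols. */
#define RING_DWORDS         8u     /* toy ring size, must be a power of two */
#define DOORBELL_PAGE_BYTES 4096u  /* one 4K doorbell page */

struct userq_model {
	uint32_t ring[RING_DWORDS]; /* ring buffer the UMD writes packets into */
	uint64_t wptr;              /* written by the application */
	uint64_t rptr;              /* advanced by the simulated firmware */
	bool doorbell_rung;         /* stands in for the MMIO doorbell write */
	uint64_t work_done;         /* sum of consumed dwords, for checking */
};

static void userq_model_init(struct userq_model *q)
{
	assert((RING_DWORDS & (RING_DWORDS - 1u)) == 0u);
	memset(q, 0, sizeof(*q));
}

/* The queue is idle when wptr and rptr are equal. */
static bool userq_model_idle(const struct userq_model *q)
{
	return q->wptr == q->rptr;
}

/* Write packets, update wptr, then ring the doorbell. Fails if the ring is full. */
static bool userq_model_submit(struct userq_model *q, const uint32_t *pkt, size_t n)
{
	size_t i;

	if (n > RING_DWORDS - (size_t)(q->wptr - q->rptr))
		return false;
	for (i = 0; i < n; i++)
		q->ring[(q->wptr + i) & (RING_DWORDS - 1u)] = pkt[i];
	q->wptr += n;
	q->doorbell_rung = true;
	return true;
}

/* Simulated firmware: woken by the doorbell, fetches wptr, processes up to it. */
static size_t userq_model_firmware_run(struct userq_model *q)
{
	size_t processed = 0;

	if (!q->doorbell_rung)
		return 0;
	q->doorbell_rung = false;
	while (q->rptr != q->wptr) {
		q->work_done += q->ring[q->rptr & (RING_DWORDS - 1u)];
		q->rptr++;
		processed++;
	}
	return processed;
}

/* Number of doorbells of a given width that fit in one 4K doorbell page. */
static unsigned int doorbells_per_page(unsigned int doorbell_bytes)
{
	return DOORBELL_PAGE_BYTES / doorbell_bytes;
}
```

Calling userq_model_init(), userq_model_submit() and userq_model_firmware_run() in that order takes the model from idle to busy and back to idle, and doorbells_per_page(8) reproduces the 512 64-bit doorbells per 4K page mentioned in the patch. The model deliberately ignores oversubscription, the aggregated doorbell and the protected fence packet.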