Hello Faith and everyfrogy! I've been developing a new Vulkan driver for Mesa — Terakan, for AMD TeraScale Evergreen and Northern Islands GPUs — since May of 2023. You can find it in amd/terascale/vulkan on the Terakan branch of my fork at Triang3l/mesa. While it still lacks many graphics features, the architecture of state management, meta, and descriptors has already largely been implemented in its code. I'm relatively new to Mesa overall: in the past I contributed the fragment shader interlock implementation to RADV, which included working with its state management, but I've never written a Gallium driver, or a Vulkan driver in the ANV copy-pasting era, so this may be a somewhat fresh — although quite conservative — take on the subject.
Due to various hardware and kernel driver differences (bindings being individually loaded into fixed slots as part of the command buffer state; the lack of command buffer chaining in the kernel, which means all of the state has to be reapplied whenever the hardware command buffer exceeds the HW/KMD size limits), I've been designing the architecture of my Vulkan driver largely from scratch, without using the existing Mesa drivers as a reference. Unfortunately, it seems like we ended up going in fundamentally opposite directions in our designs, so I'd say I'm much more scared about this approach than I am excited about it. My primary concerns about this architecture can be summarized into two categories:

• The obligation to manage pipeline and dynamic state in the common representation — essentially mostly the same Vulkan function call arguments, but with an additional layer for processing pNext and merging pipeline and dynamic state — restricts the ability of drivers to optimize state management for specific hardware. Most importantly, it hampers precompiling of state in pipeline objects. In state management, this would make Mesa Vulkan implementations closer not even to Gallium, but to the dreaded OpenGL.
• Certain parts of the common code are designed around assumptions that hold for the majority of the hardware; however, some devices have large architectural differences in specific areas, and trying to fit the programming of such hardware subsystems into the common model results in having to write suboptimal algorithms, as well as sometimes artificially restricting the VkPhysicalDeviceLimits the device can report. An example from my driver is the meaning of a pipeline layout on fixed-slot TeraScale. Because it uses flat binding indices throughout all sets (sets don't exist in the hardware at all), it needs base offsets for each set within the stage's bindings — which are precomputed at pipeline layout creation (see the sketch below). This is fundamentally incompatible with MR !27024's direction to remove the concept of a pipeline layout — and if the common vkCmdBindDescriptorSets makes the VK_KHR_maintenance6 layout-object-less path the only available one, it would add a lot of overhead by making it necessary to recompute the offsets at every bind.

I think what we need to consider about pipeline state (in the broader sense, including both state objects and dynamic state) is that it inherently has very different properties from anything the common runtime already covers. What most of the current objects in the common runtime have in common is that they:

• Are largely hardware-independent and can work everywhere the same way.
• Either:
  • Provide a complex solution to a large-scale problem, essentially being a sort of advanced "middleware". Examples are WSI, synchronization, the pipeline cache, secondary command buffer emulation, and render pass emulation.
  • Or solve a trivial task in a way that's non-intrusive towards the algorithms employed by the drivers — such as managing object handles, invoking allocators, reference-counting descriptor set and pipeline layouts, or pooling VkCommandBuffer instances.
• Rarely influence the design of "hot path" functions, such as changes to pipeline state and bindings.

On the other hand, pipeline state:

1. Is entirely hardware-specific.
2. Is modified very frequently — making up the majority of command buffer recording time.
3. Can be precompiled in pipeline objects — and that's highly desirable due to the previous point.
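To make the base offset example above more concrete, here is a minimal sketch, with hypothetical names and structures rather than the actual Terakan code, of how a flat-slot driver can compute per-set base slots once at pipeline layout creation, so that binding a set only needs an addition instead of a walk over the preceding set layouts:

```c
#include <stdint.h>

#define EXAMPLE_MAX_SETS 32 /* illustrative only */

/* Hypothetical pipeline layout for a flat-binding-slot architecture:
 * every set gets a precomputed base slot within the stage's bindings. */
struct example_pipeline_layout {
   uint32_t set_count;
   uint32_t set_base_slot[EXAMPLE_MAX_SETS];
};

static void
example_pipeline_layout_init(struct example_pipeline_layout *layout,
                             uint32_t set_count,
                             const uint32_t *set_binding_counts)
{
   uint32_t base = 0;
   layout->set_count = set_count;
   for (uint32_t i = 0; i < set_count; i++) {
      /* Prefix sum: the base slot of set i is the total binding count
       * of sets 0..i-1. Done once at vkCreatePipelineLayout time. */
      layout->set_base_slot[i] = base;
      base += set_binding_counts[i];
   }
}
```

If the layout-object-less VK_KHR_maintenance6 path becomes the only one the common vkCmdBindDescriptorSets exposes to the driver, this prefix sum has to be recomputed from the descriptor set layouts on every bind instead.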
Because of 1, there's almost nothing in the pipeline state that the common runtime can help share between drivers. Yes, it can potentially be used to automate running some NIR passes for baking static state into shaders, but currently it looks like the runtime is going in a somewhat different direction, and that needs only some helper functions invoked at pipeline creation time. Aside from that, I can't see it being useful for anything other than merging static and dynamic state into a single structure. For drivers whose developers prefer this approach for various reasons (prototyping simplicity, or because staying at the near-original-Vulkan level of abstraction is sufficient for them and their target hardware in this area), this merging and dirty-marking functionality can be provided the "toolbox" way, with usage being optional, via composition rather than inheritance (such as using callbacks for getting the static state structure from the pipeline and the dynamic state structure from the command buffer), plus a layer of vkCmdSet* entry point fallbacks and functions to call from vkCmdBindPipeline (or a default implementation of it).

As for 2 and 3, I don't think merely the amount of code is a solid enough reason for Mesa to start making it more and more uncomfortable for drivers to take advantage of precompilation of static state in graphics pipelines. We should not forget that the whole point of pipeline objects, along with cross-stage optimizations, is to make command buffer recording cheaper — which is also a large part of the idea of Vulkan itself. And if drivers can utilize parts of the API to make applications run faster… we should be encouraging that, not demoralizing driver developers who strive to do that. I don't believe we should be "deprecating" monolithic pipelines in our architectural decisions. While translation layers for some source APIs have constraints related to data availability that make ESO a more optimal approach, native Vulkan games and apps often use monolithic pipelines — I know World War Z uses monolithic pipelines with only the viewport, depth bias and stencil reference being dynamic, early Vulkan games like Doom do too, and probably many more. That has always been the recommended path.

In Terakan specifically, along with performing shader linkage, I strongly want to be able to precompile the following state if it's static — and I've already been implementing all states in this precompiling way since the very beginning (see the sketch after this list for the AND-NOT/OR mask idea):

• All vertex input bindings and attributes (with unused ones skipped if the pre-rasterization part of the pipeline is available) into a pointer to the "fetch shader subroutine" (with instance index divisor ALU code scheduled for the VLIW5/VLIW4 ALU architecture), a bitfield of used bindings, and strides for them.
• Viewports (although static viewports are rare) into scales/offsets, implicit scissors, and registers related to depth range and clamping.
• Rasterization state:
  • Polygon mode, cull mode, front face, depth bias toggle and provoking vertex into a 32-bit AND-NOT mask (keeping dynamic fields) and a 32-bit OR mask for the PA_SU_SC_MODE_CNTL register.
  • Clipping space parameters into AND-NOT/OR masks for the 32-bit PA_CL_CLIP_CNTL.
• Custom MSAA sample locations into packed 4-bit values.
• All depth/stencil state into AND-NOT/OR masks for the DB_DEPTH_CONTROL, DB_STENCILREFMASK and DB_STENCILREFMASK_BF 32-bit registers.
• The entire blending equation for each color attachment into a 32-bit CB_BLEND#_CONTROL.
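As a rough illustration of the AND-NOT/OR mask approach (with made-up bit positions, not the real PA_SU_SC_MODE_CNTL encoding), static state can be baked into a clear mask and a set mask at pipeline creation, leaving holes for the fields that are dynamic, so that bind time is pure mask arithmetic with no Vulkan enum parsing:

```c
#include <stdbool.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Illustrative bit positions only; the real register layout differs. */
#define EXAMPLE_SU_CULL_FRONT        (1u << 0)
#define EXAMPLE_SU_CULL_BACK         (1u << 1)
#define EXAMPLE_SU_FACE_CW           (1u << 2)
#define EXAMPLE_SU_DEPTH_BIAS_ENABLE (1u << 11)

struct example_su_sc_mode_masks {
   uint32_t andnot_mask; /* register bits owned by the pipeline's static state */
   uint32_t or_mask;     /* values to set within the owned bits */
};

/* Pipeline creation time: parse the Vulkan enums exactly once.
 * Dynamic fields are simply left out of andnot_mask, so the
 * corresponding vkCmdSet* implementation can own them. */
static void
example_bake_raster_masks(struct example_su_sc_mode_masks *m,
                          VkCullModeFlags cull_mode,
                          VkFrontFace front_face,
                          bool depth_bias_enable,
                          bool depth_bias_enable_is_dynamic)
{
   m->andnot_mask =
      EXAMPLE_SU_CULL_FRONT | EXAMPLE_SU_CULL_BACK | EXAMPLE_SU_FACE_CW;
   m->or_mask = 0;
   if (cull_mode & VK_CULL_MODE_FRONT_BIT)
      m->or_mask |= EXAMPLE_SU_CULL_FRONT;
   if (cull_mode & VK_CULL_MODE_BACK_BIT)
      m->or_mask |= EXAMPLE_SU_CULL_BACK;
   if (front_face == VK_FRONT_FACE_CLOCKWISE)
      m->or_mask |= EXAMPLE_SU_FACE_CW;
   if (!depth_bias_enable_is_dynamic) {
      m->andnot_mask |= EXAMPLE_SU_DEPTH_BIAS_ENABLE;
      if (depth_bias_enable)
         m->or_mask |= EXAMPLE_SU_DEPTH_BIAS_ENABLE;
   }
}

/* Bind time: just mask arithmetic on the current register value. */
static uint32_t
example_apply_raster_masks(uint32_t reg,
                           const struct example_su_sc_mode_masks *m)
{
   return (reg & ~m->andnot_mask) | m->or_mask;
}
```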
The list above includes most of VkGraphicsPipelineCreateInfo. Almost all of the static state in my driver goes through some preprocessing, and there's zero Vulkan enum parsing triggered by vkCmdBindPipeline in it. With this, not only do I not need the merging logic from the new vk_pipeline, I don't even need to store what's essentially a copy of VkGraphicsPipelineCreateInfo inside the pipeline object. The merging of static and dynamic state is done at a different level of abstraction in my driver — in a highly close-to-registers representation. I already have custom implementations of the vkCmdSet* functions that convert directly to that representation, skipping Mesa's vk_dynamic_graphics_state (this does mean that the vk_dynamic_graphics_state in vk_command_buffer objects is in an undefined state in my driver — and it can be safely removed to save space), and that doesn't require much effort from me, in part because I try to reuse conversion functions between vkCreateGraphicsPipelines and vkCmdSet* wherever possible.

To summarize what I feel about writing state management code:

• Am I okay with copying a bit of code (usually 5–6 lines per entry point) from vkCreateGraphicsPipelines and vkCmdBindPipeline to vkCmdSet*? Totally yes.
• Would I be okay with doing 49 dyn->dirty BITSET_TESTs for every draw command, many of which lead to some vk_to_nv9097? I understand why other people may prefer this approach, but for me personally, that's like asking whether I would happily disfigure my (hypothetical) child with my own hands.

And even when a driver does preprocess its static state, if the common pipeline state logic is forced upon drivers, the parts already handled by preprocessing will still waste execution time going through the common logic only to never actually be used by the driver — we end up summing the cost of both, not subtracting one from the other.

For additional context, here's what my state architecture looks like (a rough sketch of the dirty tracking and slot management comes a bit further down):

• terakan_pipeline_graphics:
  • Either a monolithic pipeline, a library, or a pipeline constructed from libraries.
  • Separated into `struct`s for the GPL parts (vertex input, pre-rasterization, fragment shader, fragment output, plus some shared parts like multisampling) from day 1.
  • Examples of state elements within those structures are hardware registers (full or partial), pre-converted viewports, the vertex fetch subroutine, and vertex binding strides.
  • If a part of a hardware register is dynamic, it's excluded from the 32-bit replacement mask for that register.
  • Bitset of which state elements are static, bitscanned when binding.
• terakan_state_draw ("software" state):
  • Modified by vkCmdBindPipeline or vkCmdSet*.
  • State elements are close to hardware registers, but applying them may include minor postprocessing, such as intersecting viewport-implicit and application-provided scissor rectangles, or reindexing (compacting) hardware color attachments due to the hardware's D3D11-OMSetRenderTargetsAndUnorderedAccessViews-like requirements for storage resource binding.
  • In some cases there may be dependencies between state elements — this is the "intermediate" representation with somewhat relaxed rules for that.
  • Bitset of "pending" state, bitscanned before the application's draws to invoke apply callbacks.
  • Only stores the application-provided state — (custom) meta draws skip this level, but mark the state they touch here as "pending" for it to be restored the next time the application wants to draw something.
• terakan_hw_state_draw ("hardware" state):
  • Modified by the application of terakan_state_draw and by meta draws.
  • State elements are very close to hardware registers, and each of them is entirely atomic.
  • Due to the lack of command buffer chaining in the kernel driver, this is the part that handles switching to a new hardware command buffer:
    • This is the only place where hardware commands for changing the graphics state are emitted, so their result is not lost.
    • When starting a new hardware command buffer, all state that has ever been set is re-emitted with the same callbacks as normal.
  • Bitset of modified state, bitscanned before the application's or meta draws to invoke emit callbacks.
  • Additionally, this is where the resource binding slots are managed, including deduplication of unchanged bindings, and arbitration of the hardware LS and ES+VS binding slot space usage between the Vulkan VS and TES stages.

An additional note regarding the common meta code: I'm unable to use the functionality in it that involves writing to images from compute shaders. According to AMD's AddrLib, on some of my target hardware (Cypress, the earliest Evergreen chip), using a VK_IMAGE_TILING_LINEAR image as a storage image causes a failure/hang, so for copying to images I have to use rasterization at least in some cases. Also, some meta operations can benefit from hardware-specific functionality, such as MSAA resolves that can be done by drawing a rectangle with a special output merger hardware configuration.

Regarding implementing new features in the common code: where that's possible, I think the toolbox approach is enough, and where it's not, making the common runtime more intrusive won't help either way. GPL and ESO will still need a lot of driver-specific code for the (non-)preprocessing reasons above (and in RADV, dynamic state in some cases even results in less optimal register setup, setting registers to more "generic" values compared to the same state specified as static). Minor things like allowing null descriptors in more areas would still require handling of those VK_NULL_HANDLE cases inside each driver. On the other hand, if, like I mentioned earlier, the common runtime starts, for example, normalizing vkCmdBindDescriptorSets to the VK_KHR_maintenance6 layout == VK_NULL_HANDLE version with a descriptor set layout array instead of the pipeline layout even if one was provided by the application, rather than treating that as a special case, that's essentially a big NAK (not the compiler in this context) from Terakan, which would have to calculate prefix sums on every vkCmdBindDescriptorSets call.

Overall, MR !27024 not only did not make me want to adopt any of the new common bases it introduces, but it actually had the reverse effect on me — in particular, it's now a high-priority task for me to _get rid_ of the common vk_pipeline_layout that I'm already using in my driver in favor of a custom implementation. The issue I have with vk_pipeline_layout is that it contains a fixed-size 32-element array for the descriptor set layouts. However, in my driver, descriptor sets are purely a CPU-side abstraction, as the hardware has fixed binding slots, and the cost of vkCmdBindDescriptorSets scales with the number of individual bindings involved, not sets. So for my target hardware, binding a large descriptor set is very suboptimal if you only need to actually change one or two bindings in it.
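To illustrate the last two points, the fixed-slot binding model of terakan_hw_state_draw and why the bind cost scales with individual bindings rather than sets, here is a hypothetical sketch; the names, slot packing and dirty tracking granularity are made up for illustration:

```c
#include <stdint.h>

#define EXAMPLE_SLOT_COUNT 1054 /* total exposed hardware slots, flat */
#define EXAMPLE_SLOT_WORDS ((EXAMPLE_SLOT_COUNT + 63) / 64)

struct example_descriptor_set {
   uint32_t binding_count;
   uint64_t bindings[]; /* packed hardware descriptors, one per slot */
};

struct example_hw_bindings {
   uint64_t slots[EXAMPLE_SLOT_COUNT];
   /* Slots that must be (re)loaded before the next draw. */
   uint64_t dirty[EXAMPLE_SLOT_WORDS];
   /* Slots that have ever been loaded in this command buffer. */
   uint64_t ever_set[EXAMPLE_SLOT_WORDS];
};

/* vkCmdBindDescriptorSets: each binding goes into its own fixed slot,
 * so the CPU cost is proportional to the bindings being bound, not to
 * the number of sets. Unchanged bindings are deduplicated. */
static void
example_bind_set(struct example_hw_bindings *hw, uint32_t base_slot,
                 const struct example_descriptor_set *set)
{
   for (uint32_t b = 0; b < set->binding_count; b++) {
      uint32_t slot = base_slot + b;
      if (hw->slots[slot] == set->bindings[b])
         continue; /* deduplicate unchanged bindings */
      hw->slots[slot] = set->bindings[b];
      hw->dirty[slot / 64] |= UINT64_C(1) << (slot % 64);
      hw->ever_set[slot / 64] |= UINT64_C(1) << (slot % 64);
   }
}

/* Starting a new hardware command buffer (no chaining in the kernel):
 * everything ever loaded has to be re-emitted through the same path. */
static void
example_start_new_hw_cmd_buffer(struct example_hw_bindings *hw)
{
   for (uint32_t w = 0; w < EXAMPLE_SLOT_WORDS; w++)
      hw->dirty[w] |= hw->ever_set[w];
}
```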
And I want to reflect that in the device properties — report maxBoundDescriptorSets = 1054 (the total number of exposed hardware slots of all types), and try to take advantage of this qualitative property in a fork of DXVK and maybe in Xenia. For that, I replaced that static array with dynamic allocation at pipeline layout creation on my branch. However, !27024 makes things more complicated for my driver there by adding more fixed-size arrays of descriptor set layouts (thankfully I don't need any of them… at least until I'm forced to need them). Of course, such a large maxBoundDescriptorSets is not something that existing software will take advantage of; the only precedent for something similar was MoltenVK with its 1 billion sets, which also used fixed slots before argument buffers were added to Metal. But it feels like that's only the beginning, and in the future we're going to see the common runtime and its limit/feature/format support (the latter, for example, potentially affecting something like separate depth and stencil images — something very useful for Scaleform GFx stencil masks, for instance) holding drivers back more and more. Something like maxBoundDescriptorSets = 32 combined with maxPushDescriptors = 32 is definitely NAK-worthy for me, as that would make it simply impossible for software to take advantage of the flat binding model when running on top of my driver. I can also handle a maxPushConstantsSize closer to 64 KB just fine — in the hardware, push constants are just yet another UBO. If that's the future of Mesa, I don't know at all how I'll be able to maintain my driver upstream without progressively slaughtering its functionality and performance to the core.

(My new workaround idea for the vk_pipeline_layout fixed size issue is to derive vk_pipeline_layout, vk_descriptor_set_layout and my custom terakan_pipeline_layout from some common vk_refcounted_object_base, and in vk_cmd_enqueue_CmdBindDescriptorSets, call vk_refcounted_object_base_ref instead of vk_pipeline_layout_ref directly, as sketched at the end of this message. This would make it possible for me to provide a custom pipeline layout implementation while still being able to use Mesa's common secondary command buffer emulation: in my driver, which targets hardware without virtual memory and command buffer chaining, it's cheaper to record the Vulkan commands themselves than to merge hardware commands between command buffers while patching relocations and inherited objects, or to put even the smallest secondary command buffers into separate submissions with full state resets. But again, this goes in the direction opposite to increasing the common runtime's intrusiveness.)

Even knowing that my driver will never be able to properly run VKD3D-Proton or Doom Eternal is nowhere near as demotivating to me as what this may entail. Fighting with the GPU's internals and the specification is fun. The prospect of eternally fighting with merge request comments suggesting adopting NVK's architectural patterns, and losing performance and limit values to them when you can trivially avoid that in the code, is a different thing — and with my driver being something niche rather than a popular RADV, NVK or ANV, I honestly don't have high hopes about the attention that will be paid to the distinct properties of my target hardware by the common runtime in the future.
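Returning to the vk_refcounted_object_base workaround mentioned above, this is roughly the shape I have in mind. It's purely illustrative, since no such object exists in the common runtime today, and a real version would have to integrate with vk_object_base and the device allocators:

```c
#include <stdatomic.h>

/* A minimal common base that vk_pipeline_layout, vk_descriptor_set_layout
 * and a driver-specific layout such as terakan_pipeline_layout could all
 * embed, so that common code like vk_cmd_enqueue_CmdBindDescriptorSets can
 * take and drop references without knowing the concrete type. */
struct vk_refcounted_object_base {
   atomic_uint_fast32_t ref_count;
   void (*destroy)(struct vk_refcounted_object_base *obj);
};

static inline void
vk_refcounted_object_base_init(struct vk_refcounted_object_base *obj,
                               void (*destroy)(struct vk_refcounted_object_base *))
{
   atomic_init(&obj->ref_count, 1);
   obj->destroy = destroy;
}

static inline struct vk_refcounted_object_base *
vk_refcounted_object_base_ref(struct vk_refcounted_object_base *obj)
{
   atomic_fetch_add_explicit(&obj->ref_count, 1, memory_order_relaxed);
   return obj;
}

static inline void
vk_refcounted_object_base_unref(struct vk_refcounted_object_base *obj)
{
   if (atomic_fetch_sub_explicit(&obj->ref_count, 1, memory_order_acq_rel) == 1)
      obj->destroy(obj);
}
```

— Tri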