I'll agree with Jose about Vulkan being a low-level abstraction, and to me the "opt-in" way seems like a much more balanced approach to achieving our goals — not only balanced between the goals themselves (code amount and time to implement aren't our only criteria to optimize), but also across the variety of hardware. If something goes wrong with the watertight abstraction on a certain implementation, not only would it take more time to find a solution, but issues in one driver would risk wasting everyone's time, as it'd often be necessary to make debatable changes to interfaces used by all drivers.
I also need to further clarify my point regarding the design of what we want to encourage drivers to use, specifically about pipeline objects and dynamic state/ESO. Vulkan, as I see it from all the perspectives I regularly interact with it from — as an RHI programmer at a game studio, a translation layer developer (the Xenia Xbox 360 emulator), and now as a driver author — has not grown much thicker than it originally was. What has increased is its surface area — but where it actually matters: letting applications convey their intentions more precisely. I'd say it's even thinner and more transparent from this point of view now. We got nice things like inline uniform blocks, host image copy, push descriptors, descriptor buffers, and of course dynamic state — and they all pretty much directly correspond to hardware concepts that apps can utilize to do what they want with less indirection between their actual architecture and the hardware. Essentially, the application and the driver (and the rest of the chain — the specification, and I'd like to retract my statement about "fighting" it, by the way, and the hardware controlled by that driver) can now work more cooperatively towards their common goal of delivering what the app developer wants to provide to the user with as high quality and speed as realistically possible. They now have more ways of helping each other by communicating their intentions and capabilities more completely and accurately. And it's important for us not to go *backwards*. This is why I think it's just fundamentally wrong to encourage drivers to layer pipeline objects and static state on top of dynamic state.

An application would typically use static state when it:
• Knows the potentially needed state setups in advance (like in a game whose materials are preprocessed ahead of time, or in a non-gaming/non-DCC app).
• Wants to quickly apply a complete state configuration.
• Maybe doesn't care much about the state used by previously done work, like when drawing wildly different kinds of objects in a scene.

At the same time, it'd choose dynamic state if it:
• Doesn't have upfront knowledge of the possible states (like in an OpenGL/D3D9/D3D11 translation layer or a console emulator, or with a highly flexible art pipeline in the game).
• Wants to quickly make small, incremental state changes.
• May want to mix state variables updated at different frequencies.

Their use cases, and the application intentions they convey, are as opposite as the antonyms "static" and "dynamic" they're named after. Treating one as a specialization of the other makes the driver just as blind as back in 2016, when applications had no option but to reduce everything to static state. (Of course, with state spanning so many pipeline stages, applications would usually not just pick one of the two extremes, and may instead want static state for some cases/stages and dynamic for others. This is also where the route Vulkan's development has taken over these 8 years is very wise: instead of forcing Escobar's axiom of choice upon applications, let them specify their intentions on a per-variable basis, and choose the appropriate amount of state grouping among monolithic pipelines, GPL with libraries containing one or multiple parts of a pipeline, and ESO.) The primary rule of game optimization is: if you can avoid doing something every frame, or, even worse, hundreds or thousands of times per frame, do whatever reuse you can to avoid it. If we know that's what the game wants to do — by providing a pipeline object with the state it wants to be static, or a pipeline layout object — we should be aiding it. And conversely, if the game tells us that it can't precompile something, the graphics stack should do the best it can in that situation — it would be wrong to add the overhead of running a time machine to 2016 to its draws either.
After all, the driver's draw call code and the game's draw call code are both just draw call code with one common goal. So whichever solution we end up with, it must not be a "broken telephone" degrading the cooperation between the application and the driver. And we should not forget that the communication between them is two-way, including:
• Interface calls made by the app.
• Limits and features exposed by the driver.

Having accurate information about the other party is important for both to be able to make optimal decisions considering the real strong points and the real constraints of the two. Note that the reason I'm talking about interface calls and limits collectively is that the application's Vulkan usage approaches essentially represent the "limits" of the application as well — like whether it has sufficient information to precompile pipeline state. When the common runtime just gets in the way here, it's basically acting *against* the goal of the two… green sus 🐸

If NVK's developers consider the near-unprocessed representation of the state behind both the static and the dynamic interfaces sufficiently optimal for their target hardware, that's fine. But other drivers should not be punished for essentially doing what the application and the specification are enabling and even expecting them to do. Like baking immutable samplers into shader code. Or taking advantage of static, non-update-after-bind descriptors to assign UBOs to a fast hardware path in a more straightforward way. Or preprocessing static state in a pipeline object. Maybe it wouldn't have been a very big deal performance-wise if, in my driver Terakan, calling vkCmdBindPipeline with static blending equation state had resulted in 6 enum translations and some shifts/ORs instead of just one 32-bit assignment.
Or if, with my target hardware having fixed binding slots, every vkCmdBindDescriptorSets call ran a += loop up to firstSet instead of looking up the base slots in the pipeline layout. The conceptual point, however, is that I'm not trying to make small improvements over some "default" behavior. There is no "default" here. I'm not supposed to implement static on top of dynamic in the first place — as I said, they are not only different concepts, but opposite ones. Of course there are always exceptions on a case-by-case, driver-by-driver basis. For instance, due to the constraints of the memory and binding architectures of my target hardware, the kernel driver, and the microcode, it's more optimal for my driver to record secondary command buffers at the level of Vulkan commands using Mesa's common encoder. But this kind of cherry-picking aligns much more closely with the "opt-in" approach than with the "watertight" one.

On the topic of limits, I also think the best we can do is to be honest about the specific driver and the specific hardware, and to view limits from the perspective of enabling richer communication with the application. Blatantly lying (at least without an environment variable switch), like in the scary old days when, as I've heard, some drivers resorted to becoming LLVMpipe upon meeting a repeating NPOT texture, definitely doesn't contribute to productive communication. But at the same time, if AMD see how apps can take advantage of the explicit cubemap 3D>2D transformation from VK_AMD_gcn_shader, or there are potential scenarios for something like VK_ARM_render_pass_striped… why can't I just tell the app that spreading descriptors across sets more granularly costs my driver nothing, and report maxBoundDescriptorSets = UINT32_MAX, or at least an integer-overflow-safer (maxSamplers + maxUB + maxSB + maxSampled + maxSI) * 6 + maxIA, giving it one more close-to-metal tool for cases where it may be useful?
----

P.S.: So far, the list of architectural concepts I'm not willing to sacrifice in Terakan, whose loss I'd consider a major regression, includes:
• Pipeline objects with pretranslated fixed-function state, as well as everything needed to enable that (like storing the current state in a close-to-hardware representation, which may require custom vkCmdSet* implementations).
• Pipeline layout objects used wherever they're already available, most importantly in vkCmdBindDescriptorSets and vkCmdPushDescriptorSetKHR.
• maxBoundDescriptorSets, maxPushDescriptors and maxPushConstantsSize significantly higher than on hardware with root-signature-based binding.
• Inside a VkCommandPool, separate pooling of entities allocated at different frequencies: hardware command buffer portions, command encoders (containing things like the current state, which is pretty large due to fixed binding slots, and relocation hash maps), and BOs with push constants and dynamic vertex fetch subroutines.

— Triang3l