Giuseppe Bilotta <giuseppe.bilo...@gmail.com> writes:

> On Fri, Jun 5, 2015 at 10:35 PM, Francisco Jerez <curroje...@riseup.net>
> wrote:
>>> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as a
>>> performance hint, since the actual value returned (the
>>> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
>>> device property, so it may be tuned at kernel compilation time,
>>> according to effective work-item SIMD usage.
>>
>> At least the way it's implemented in this series, it's a per-device
>> property, and even though I see your point that it might be useful to
>> have a finer-grained value in some cases, I don't think it's worth doing
>> unless there is any evidence that the unlikely over-alignment of the
>> work-group size will actually hurt performance for some application --
>> and there isn't at this point because ILO doesn't currently support
>> OpenCL AFAIK.
>
> What I was trying to say is that while the cap itself is per-device,
> the OpenCL property that relies on this cap isn't.
> In this sense, I would expect the cap to report the actual _hardware_
> property, and the higher level stack (OpenCL or whatever, if and when
> it will be supported) to massage the value as appropriate (e.g. by
> multiplying by 4x —the overcommit needed to keep the device pipes
> full— and then dividing by the vector width of the kernel).

The problem is that finding the right over-commit factor requires a lot
of hardware-specific knowledge (instruction latencies, issue overhead,
the fact that in some cases the pipeline is twice as wide), as does
deciding whether and to what extent the kernel needs to be scalarized --
and the OpenCL state tracker is hardware-independent.
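For reference, the two sides of that arithmetic look roughly like this.
The query is the standard CL host API an application would use to pick
up the per-kernel value; the second helper is only a sketch of the
frontend-side derivation Giuseppe proposes -- the names, the 4x
overcommit factor and the error handling are assumptions, not actual
Mesa or application code.

/* Illustrative only, not code from this series. */
#include <CL/cl.h>

static size_t
query_preferred_wg_multiple(cl_kernel kernel, cl_device_id device)
{
   size_t multiple = 1;

   /* Per-kernel property: may legitimately differ between kernels on
    * the same device, e.g. depending on how the kernel was vectorized. */
   if (clGetKernelWorkGroupInfo(kernel, device,
                                CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                sizeof(multiple), &multiple, NULL) != CL_SUCCESS)
      multiple = 1;

   return multiple;
}

/* Hypothetical state-tracker-side derivation: take the raw hardware
 * SIMD width from the cap, apply an overcommit factor (4 in Giuseppe's
 * example) and divide by the kernel's effective vector width. */
static unsigned
derive_preferred_multiple(unsigned hw_simd_width, unsigned overcommit,
                          unsigned kernel_vector_width)
{
   unsigned multiple = hw_simd_width * overcommit / kernel_vector_width;

   return multiple > 0 ? multiple : 1;
}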
> Ultimately the question is if the device property (i.e. not the OpenCL
> kernel property) should expose just the actual physical SIMD width, or
> a less raw value that takes into consideration other aspects of the
> device too.
>
> In some sense the situation is similar to the one with the older
> NVIDIA and AMD architectures, where the processing elements were
> clustered in smaller blocks (e.g. 8s or 16s for NVIDIAs, even though
> the warp size was 32), which meant you _could_ efficiently use half-
> or quarter-warps under specific conditions, but in most cases you
> wanted to use multiples of full warps anyway.
>
> However, on that hardware the instruction dispatch was actually at the
> warp level. This has significant implications when implementing
> lockless algorithms, for example: the warp or wavefront size on NVIDIA
> and AMD becomes the largest number of work-items that can exchange
> data without barriers. With the “dynamic SIMD” thing Intel has, would
> we have any guarantee of synchronized forward progress?

Yes, you do.  The fact that the FPUs are 4-wide is completely
transparent to the application (and even to the driver); it's just an
implementation detail: the EUs behave pretty much as if they really had
the logical SIMD width, executing instructions in order (e.g. a 4-wide
chunk of an instruction will never start execution before all 4-wide
chunks of the previous instruction have) and atomically (e.g. an FPU
instruction won't be able to see the effects of some 4-wide chunks of
another instruction but not others -- not even its own effects).

> (Yet, people relying on the PREFERRED_WORK_GROUP_SIZE_MULTIPLE for
> lockless algorithms are abusing a value for something it wasn't
> intended for.)

Yeah, true.

> Ok, I'm convinced that 16 is a good choice for this cap on Intel, at
> least for the current generation.
>
> --
> Giuseppe "Oblomov" Bilotta
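For concreteness, the barrier-free data exchange Giuseppe refers to is
the classic warp-synchronous pattern sketched below in OpenCL C.  This
is illustrative only: WARP_SIZE stands in for whatever warp/wavefront/
SIMD width the application assumes (and, as noted above, deriving it
from PREFERRED_WORK_GROUP_SIZE_MULTIPLE is an abuse of that query), and
correctness relies entirely on all WARP_SIZE work-items executing in
lockstep, which is exactly the guarantee being discussed.

/* Warp-synchronous local reduction, assuming a work-group of exactly
 * WARP_SIZE work-items that execute in lockstep.  No barrier() calls:
 * every step relies on in-order, whole-instruction execution within
 * the warp/wavefront/SIMD group.  WARP_SIZE and the kernel itself are
 * hypothetical, not part of the series under review. */
#define WARP_SIZE 16

__kernel void warp_sum(__global const float *in, __global float *out)
{
   __local volatile float tmp[WARP_SIZE];
   const size_t lid = get_local_id(0);

   tmp[lid] = in[get_global_id(0)];

   for (unsigned offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
      /* Readers (lid < offset) only touch slots that no active lane
       * writes in the same step, and lockstep execution guarantees the
       * previous step is fully visible -- hence no barrier. */
      if (lid < offset)
         tmp[lid] += tmp[lid + offset];
   }

   if (lid == 0)
      out[get_group_id(0)] = tmp[0];
}

Under the execution model described above, an EU thread running this
kernel in SIMD16 mode would provide that lockstep guarantee across its
16 channels, which is what the "Yes, you do" is confirming.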