On Fri, Jun 5, 2015 at 10:35 PM, Francisco Jerez <curroje...@riseup.net> wrote:
>> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as
>> performance hint, since the actual value returned (the
>> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
>> device property, so it may be tuned at kernel compilation time,
>> according to effective work-item SIMD usage.
>
> At least the way it's implemented in this series, it's a per-device
> property, and even though I see your point that it might be useful to
> have a finer-grained value in some cases, I don't think it's worth doing
> unless there is any evidence that the unlikely over-alignment of the
> work-group size will actually hurt performance for some application --
> And there isn't at this point because ILO doesn't currently support
> OpenCL AFAIK.
What I was trying to say is that while the cap itself is per-device, the OpenCL property that relies on this cap isn't. In this sense, I would expect the cap to report the actual _hardware_ property, and the higher-level stack (OpenCL or whatever, if and when it gets supported) to massage the value as appropriate (e.g. by multiplying by 4x, the overcommit needed to keep the device pipes full, and then dividing by the vector width of the kernel).

Ultimately the question is whether the device property (i.e. not the OpenCL kernel property) should expose just the actual physical SIMD width, or a less raw value that takes other aspects of the device into consideration too.

In some sense the situation is similar to the one with the older NVIDIA and AMD architectures, where the processing elements were clustered in smaller blocks (e.g. 8 or 16 on NVIDIA, even though the warp size was 32), which meant you _could_ efficiently use half- or quarter-warps under specific conditions, but in most cases you wanted to use multiples of full warps anyway. However, on that hardware the instruction dispatch was actually at the warp level. This has significant implications when implementing lockless algorithms, for example: the warp or wavefront size on NVIDIA and AMD becomes the largest number of work-items that can exchange data without barriers. With the "dynamic SIMD" thing Intel has, would we have any guarantee of synchronized forward progress? (Then again, people relying on PREFERRED_WORK_GROUP_SIZE_MULTIPLE for lockless algorithms are abusing a value for something it wasn't intended to be.)

Ok, I'm convinced that 16 is a good choice for this cap on Intel, at least for the current generation.

-- 
Giuseppe "Oblomov" Bilotta

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev