On 27.04.2016 at 18:13, Jose Fonseca wrote:
> On 27/04/16 02:46, Roland Scheidegger wrote:
>> On 27.04.2016 at 03:05, Dave Airlie wrote:
>>> On 27 April 2016 at 11:00, Dave Airlie <airl...@gmail.com> wrote:
>>>>>> So far I've set the execmask to 1 active channel; I'm
>>>>>> contemplating changing that though and using fewer machines.
>>>>> Ah yes, I think that would indeed be desirable.
>>>>
>>>> I'll look into it, though it's not that trivial, since you might
>>>> have a 1x20x1 layout, and you also have to make sure each thread
>>>> gets the correct system values.
>> Looks doable though. I'm mostly asking because the whole point of
>> compute shaders is things running in parallel, and while that
>> wouldn't really run in parallel it would at least slightly look
>> like it...
>>
>>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>>> I suppose 1024 is really the minimum work size you have to
>>>>> support? But since things are always run 4-wide (or 8-wide) that
>>>>> would "only" be 256 (or 128) threads. That many threads sounds a
>>>>> bit suboptimal to me (unless you really have a boatload of cpu
>>>>> cores), but why not - I suppose you can always pause some of the
>>>>> threads; not all need to be active at the same time.
>>>>> Though I wonder what the opencl-on-cpu guys do...
>>>>
>>>> pocl appears to spawn a number of threads and split the work out
>>>> amongst them in the X direction.
>>>>
>>>> However, I'm not seeing how they handle barriers, or whether they
>>>> handle them correctly at all.
>>>
>>> Okay, newer versions of pocl seem to have some sort of thread
>>> scheduler that schedules workgroups across up to 8 threads, but I
>>> still can't see how they deal with barriers.
>>
>> Yes, the problem with barriers is what I had in mind too. Otherwise
>> we could just create worker threads which pick up whatever work
>> items are left.
>>
>> Roland
>
> Regarding llvmpipe, the simple solution seems indeed to be to use one
> OS thread for one register worth.
>
> The second, intermediate, solution is to use the same number of
> threads (i.e., equal to the number of CPUs), each using very large
> vectors (i.e., 1024/num-cpus wide), and let LLVM deal with breaking
> those vectors into smaller units.
Are you sure LLVM can actually deal with such massive vectors (not just
in theory but in practice too)? But even if it can, I don't think that
would be all that useful. It's likely going to result in huge shaders
and massive amounts of spilling, not to mention that divergent control
flow is going to be terrible.
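For concreteness, here is a minimal sketch (not llvmpipe code; every
name in it is invented) of the "one OS thread per register worth"
variant discussed above: a 1024-invocation workgroup run as 128 threads
of 8 invocations each, with the shader's barrier() mapped onto
pthread_barrier_wait(). It assumes POSIX barriers are available.

/* Rough sketch only -- not llvmpipe code, all names invented. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define WG_SIZE     1024            /* local workgroup size (1024x1x1) */
#define SIMD_WIDTH  8               /* one register worth (e.g. AVX) */
#define NUM_THREADS (WG_SIZE / SIMD_WIDTH)

static pthread_barrier_t wg_barrier;

static void *run_slice(void *arg)
{
   unsigned first_invocation = (unsigned)(uintptr_t)arg;
   (void)first_invocation;

   /* Shader code before barrier(): the JITed code would run its 8
    * invocations here, deriving gl_LocalInvocationID from
    * first_invocation + lane. */

   pthread_barrier_wait(&wg_barrier);   /* this is barrier() */

   /* Shader code after barrier(). */
   return NULL;
}

int main(void)
{
   pthread_t threads[NUM_THREADS];

   pthread_barrier_init(&wg_barrier, NULL, NUM_THREADS);

   for (unsigned i = 0; i < NUM_THREADS; i++)
      pthread_create(&threads[i], NULL, run_slice,
                     (void *)(uintptr_t)(i * SIMD_WIDTH));
   for (unsigned i = 0; i < NUM_THREADS; i++)
      pthread_join(threads[i], NULL);

   pthread_barrier_destroy(&wg_barrier);
   printf("workgroup done\n");
   return 0;
}

Whether 128 kernel threads per in-flight workgroup is acceptable is
exactly the open question here; the sketch only shows that the barrier
itself becomes trivial once every slice owns an OS thread.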
>
> Emitting LLVM IR in such a way that it's able to stop/resume execution
> in the middle of a thread seems hard (though not impossible, since we
> already deal with execution masks, so it would mostly be a matter of
> spilling all input/temp registers and execution masks to/from
> malloc'ed memory).
Theoretically doable, but only as long as there's no real control flow,
I think. Otherwise it looks pretty much impossible to me.
>
>
> Another solution might be to integrate some third-party library that
> implements so-called green/user-space threads (e.g., via
> setjmp/longjmp, or something else). I don't know any such library
> off-hand, and getting it to work on all OSes might be far from
> trivial. My gut feeling is that this would be the most promising
> option long term: no need to have thousands of OS threads, and no need
> to increase the complexity of LLVM code generation.
That looks like a reasonable solution. I'm not really sure, though,
that the overhead of kernel threads is really all that bad compared to
user-space threads (so 256 ordinary threads or so, which I think is the
most we'd need, might be just fine).

Roland
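For illustration, a rough sketch of the green-thread idea (again, every
name is invented, and ucontext is used instead of setjmp/longjmp only
because it makes fiber stacks explicit): each register worth of
invocations becomes a fiber on a single OS thread, and barrier() becomes
a cooperative switch back to a tiny scheduler.

/* Rough sketch only -- ucontext-based fibers, all names invented. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define NUM_FIBERS 4          /* tiny workgroup to keep the sketch short */
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;             /* the scheduler context */
static ucontext_t fiber_ctx[NUM_FIBERS];
static int current;

/* barrier(): park this fiber and let the scheduler run the next one. */
static void fiber_barrier(void)
{
   swapcontext(&fiber_ctx[current], &sched_ctx);
}

static void fiber_main(void)
{
   int id = current;
   printf("fiber %d before barrier\n", id);
   fiber_barrier();
   printf("fiber %d after barrier\n", id);
}

int main(void)
{
   for (int i = 0; i < NUM_FIBERS; i++) {
      getcontext(&fiber_ctx[i]);
      fiber_ctx[i].uc_stack.ss_sp = malloc(STACK_SIZE);
      fiber_ctx[i].uc_stack.ss_size = STACK_SIZE;
      fiber_ctx[i].uc_link = &sched_ctx;   /* return here when fiber ends */
      makecontext(&fiber_ctx[i], fiber_main, 0);
   }

   /* Round 1: run every fiber up to the barrier. */
   for (current = 0; current < NUM_FIBERS; current++)
      swapcontext(&sched_ctx, &fiber_ctx[current]);

   /* Everyone reached the barrier; round 2: run them to completion. */
   for (current = 0; current < NUM_FIBERS; current++)
      swapcontext(&sched_ctx, &fiber_ctx[current]);

   for (int i = 0; i < NUM_FIBERS; i++)
      free(fiber_ctx[i].uc_stack.ss_sp);
   return 0;
}

A real scheduler would of course have to track which fibers are parked
at a barrier versus finished; the two fixed rounds above only work
because every fiber hits exactly one barrier.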