On 08/01/2018 07:12 AM, Tom de Vries wrote:
>>>> +	  gangs = grids * (blocks / warp_size);
>>>
>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>
>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>> grids alone caused significant slow downs.
>>
>
> Well, what you're saying here is: increasing num_gangs increases
> performance.
>
> You don't explain why you multiply with workers specifically.

I set it that way because I think the occupancy calculator is
determining the occupancy of a single multiprocessor unit, rather than
the entire GPU. Looking at the og8 code again, I had

    num_gangs = 2 * threads_per_sm / warp_size * dev_size

which corresponds to

    2 * grids * blocks / warp_size

Because blocks is generally smaller than threads_per_block, the driver
occupancy calculator ends up launching fewer gangs.

I don't have a firm position on this default behavior. Perhaps we
should just set

    gangs = grids

That's probably an improvement over what's there now.

Cesar
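
For reference, the three candidate defaults discussed above can be compared side by side. This is only a minimal sketch of the arithmetic, not the plugin code: the function names are hypothetical, and `grids` and `blocks` are assumed to be the grid and block sizes suggested by the driver occupancy calculator.

```c
/* Candidate num_gangs defaults from this thread.  The function names
   are hypothetical; 'grids' and 'blocks' are assumed to come from the
   driver occupancy calculator.  */

/* Patch under review: gangs = grids * (blocks / warp_size).  */
int
gangs_patch (int grids, int blocks, int warp_size)
{
  return grids * (blocks / warp_size);
}

/* og8-style default: 2 * grids * blocks / warp_size.  */
int
gangs_og8 (int grids, int blocks, int warp_size)
{
  return 2 * grids * blocks / warp_size;
}

/* Simplest alternative: gangs = grids.  */
int
gangs_simple (int grids)
{
  return grids;
}
```

With the made-up values grids = 40, blocks = 256 and warp_size = 32, these evaluate to 320, 640 and 40 gangs respectively, so the og8 formula launches twice as many gangs as the patch, and the patch launches blocks / warp_size times as many as plain gangs = grids.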