"time srun hostname" reports on the order of 0.2 seconds, so at least single-node requests are handled expediently!
it might be useful to simply collect the scaling curve for that test as the node count grows - is it linear, superlinear, fast up to a point and then blows up, etc?
have you already looked at the difference between 1 task/node and all threads/node?
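a rough sketch of how that curve could be collected, in case it helps. it just times "srun hostname" at a series of node counts and keeps the best of a few repeats. the helper names (time_command, srun_cmd, scaling_curve) and the repeat count are my own inventions for illustration, not anything slurm provides; the only slurm pieces assumed are srun's -N and --ntasks-per-node options.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # Discard the hostname output; only the elapsed time matters here.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def srun_cmd(nodes, tasks_per_node=1):
    """Build the srun command line for a given node count (hypothetical helper)."""
    return ["srun", "-N", str(nodes),
            f"--ntasks-per-node={tasks_per_node}", "hostname"]


def scaling_curve(node_counts, tasks_per_node=1, timer=time_command, repeats=3):
    """Return [(nodes, best_seconds), ...] for each requested node count.

    The best of `repeats` runs is kept to reduce noise from scheduler jitter.
    `timer` is injectable so the sweep logic can be checked without a cluster.
    """
    curve = []
    for n in node_counts:
        best = min(timer(srun_cmd(n, tasks_per_node)) for _ in range(repeats))
        curve.append((n, best))
    return curve
```

calling scaling_curve([1, 2, 4, 8, 16]) and again with tasks_per_node set to the number of hardware threads on a node (site-specific, so not guessed here) would give both curves side by side. dividing each time by its node count makes it easy to see whether the per-node cost stays flat or grows.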
regards, mark hahn.