On Mon, 2010-10-25 at 12:44 -0700, Eric Anholt wrote:
> So, what if the problem is that our URB allocations aren't big enough?
> I would expect that to look kind of like what I'm seeing. One
> experiment would be to go double the preferred size of each stage in
> brw_urb.c one by one -- is one stage's URB allocation a limit? Or, am I
> on the right track at all (go reduce all the preferred sizes to 1/2 and
> see if that hurts)?
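(For anyone reading along: I take "preferred size" here to mean the per-stage limits table in brw_urb.c. From memory it looks roughly like the sketch below -- the names and numbers are illustrative, not the real Mesa values -- and the experiment is to double one stage's preferred_nr_entries at a time and re-benchmark.)

struct urb_limits {
   unsigned min_nr_entries;        /* floor needed to avoid deadlock       */
   unsigned preferred_nr_entries;  /* what we ask for when space allows it */
};

/* Illustrative per-stage table in the style of brw_urb.c (numbers invented). */
static const struct urb_limits limits[] = {
   { 16, 32 },   /* VS   */
   {  4,  8 },   /* GS   */
   {  5, 10 },   /* CLIP */
   {  1,  8 },   /* SF   */
   {  1,  4 },   /* CS   */
};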
I've tinkered with the URB allocations a little, and couldn't notice any discernible performance impact. I think I'll have to re-try with the CPU forced into C0 and at full operating frequency in order to really tax things, so that I can be confident when comparing numbers.

I did notice that the Mesa code appears to enforce a minimum of 4 URB entries for a GS thread, where the PRM suggests you could potentially get stalls due to the CLIP operation unless you have 5 URB entries (Vol 2, GM45, page 56). Obviously we're not seeing GPU hangs all the time, otherwise people would have complained! Still, it might be something worth adjusting if I'm correct in my assessment.

> If this is the problem, I'd think URB allocation should actually look
> something like dividing up the whole space according to some weighting,
> with minimums per unit to prevent deadlock. Right now, we're just using
> a fixed division for preferred if it fits, and a nasty minimum set for
> the fallback case.

That makes sense to me (a toy sketch of how I read the weighting idea is in the P.S. below). It would seem that the different operations need to take about the same amount of time, otherwise they will stall anyway; there is no point processing vertices faster than the WM can absorb them, right? I guess this should be reflected in the thread scheduling in the GPU EUs, though. I would half expect various FF units to spend time "idle", waiting for data to move; the real curiosity is whether the threads being dispatched can keep all the EUs at 100%. Something has to be the bottleneck in a pipeline, and I'm hoping it will eventually be the GPU EUs.

I've also tinkered with the max thread numbers for the GS and CLIP units (which looked to be running at only 1 or 2 threads). That didn't appear to have any (positive) impact either. Browsing the code, it would appear that up to 50 WM and 32 VS threads are dispatched. I've not checked the docs for a maximum yet, but I'm assuming the numbers in Mesa reflect a hardware limit there.

I think what we really need for a better understanding is a per-frame profile of when the different execution units are busy. It sounds like you have something like that in development for Ironlake (unfortunately I'm only on GM45 here).

Thanks for your help in debugging this. I understand performance will necessarily come second to chipset support and stability, so I appreciate you taking the time to think about this a bit.

One thing which might be interesting (albeit hard for me to do) would be to compare some benchmarks against the Win32 drivers for this chip, to sanity-check whether the Linux drivers + Mesa are in the same ballpark or not. If they are, I guess there won't be a silver-bullet fix anywhere.

Best regards,

-- 
Peter Clifton
Electrical Engineering Division,
Engineering Department,
University of Cambridge,
9, JJ Thomson Avenue,
Cambridge CB3 0FA

Tel: +44 (0)7729 980173 - (No signal in the lab!)
Tel: +44 (0)1223 748328 - (Shared lab phone, ask for me)
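P.S. For what it's worth, here is a toy sketch (plain C, invented names and numbers, not actual brw_urb.c code) of how I read the "divide the whole URB by some weighting, with per-unit minimums" suggestion: each unit gets its deadlock-avoidance minimum first, then whatever space remains is shared out in proportion to a per-unit weight.

#include <stdio.h>

#define NUM_UNITS 5   /* VS, GS, CLIP, SF, CS */

static const char *name[NUM_UNITS]      = { "VS", "GS", "CLIP", "SF", "CS" };
static const unsigned min_kb[NUM_UNITS] = {  8,  4,  4,  2,  2 };  /* invented minimums */
static const unsigned weight[NUM_UNITS] = {  4,  1,  1,  2,  1 };  /* invented weights  */

int main(void)
{
   unsigned urb_kb = 384;   /* stand-in total size, not the real GM45 figure */
   unsigned alloc_kb[NUM_UNITS];
   unsigned spare = urb_kb, total_weight = 0;

   for (int i = 0; i < NUM_UNITS; i++) {
      alloc_kb[i] = min_kb[i];          /* guarantee the minimum first */
      spare -= min_kb[i];
      total_weight += weight[i];
   }

   for (int i = 0; i < NUM_UNITS; i++)  /* share the remainder by weight */
      alloc_kb[i] += spare * weight[i] / total_weight;

   for (int i = 0; i < NUM_UNITS; i++)
      printf("%-4s %3u KB\n", name[i], alloc_kb[i]);

   return 0;
}

Any rounding remainder just goes unused in this toy version; a real allocator would presumably hand the slack to whichever unit benefits most.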