Hi,
On 05/29/2015 07:04 PM, Connor Abbott wrote:
> On Fri, May 29, 2015 at 6:23 AM, Eero Tamminen
> <eero.t.tammi...@intel.com> wrote:
>> On 05/28/2015 10:19 PM, Thomas Helland wrote:
>>> One more thing;
>>> Is there a limit where the loop body gets so large that we
>>> want to decide that "gah, this sucks, no point in unrolling this"?
>>> I imagine as the loops get large there will be a case of
>>> diminishing returns from the unrolling?
>>
>> I think only the backend can say something to that.  E.g. you give the
>> backend both the unrolled and non-unrolled versions, and the backend
>> decides which one is better (after applying additional optimizations)...
>
> I don't think that's a very good idea, mainly because it means you'd be
> duplicating a lot of work for the normal vs. unrolled versions, and
> there might be some really large loops where generating the unrolled
> version is going to kill your CPU -- doing any amount of work that's
> proportional to the number of times the loop runs, without any limit,
> seems like a recipe for disaster.

Sure, it should have sanity bounds, but my point was more that it depends
on many factors, and even the backend doesn't necessarily know all of them
up front, because some of them depend on the passes the backend runs.

> In GLSL IR, we've been fairly lax about figuring out when unrolling is
> helpful and unhelpful -- we just have a simple "node count" plus a
> threshold (as well as a few other heuristics).  In NIR, we could
> similarly have an instruction count plus a threshold and port over the
> heuristics to whatever extent possible.  We do have some logic for
> figuring out whether an array access becomes constant after unrolling,
> and it seems like we'd want to keep that around.  The next level of
> sophistication, I guess, is to give the backend a callback that
> estimates the execution cost of certain operations.  For example, on
> i965 a negate/absolute-value instruction would have a cost of 0 unless
> it's used by something that can't handle the source modifier.  I think
> that would get us most of the way to something accurate, without
> needing to do an undue amount of work (in terms of both CPU time and
> man-effort).
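
(To make the callback idea concrete, here's a rough sketch of what such a
backend hook might look like.  The typedef, function name and costs are
invented for illustration; only the NIR types and opcodes are real.)

    #include "nir.h"

    /* Hypothetical per-instruction cost callback a backend could hand to
     * the unrolling heuristic.
     */
    typedef unsigned (*nir_instr_cost_cb)(nir_instr *instr);

    /* An i965-style implementation could report negate/absolute-value ALU
     * ops as free, since they normally fold into a source modifier of the
     * consuming instruction.  (A real version would also check that every
     * use can actually take a source modifier, as mentioned above.)
     */
    static unsigned
    example_brw_instr_cost(nir_instr *instr)
    {
       if (instr->type == nir_instr_type_alu) {
          nir_alu_instr *alu = nir_instr_as_alu(instr);
          if (alu->op == nir_op_fneg || alu->op == nir_op_fabs ||
              alu->op == nir_op_ineg || alu->op == nir_op_iabs)
             return 0;   /* essentially free as a source modifier */
       }
       return 1;         /* default: one "unit" per instruction */
    }
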
Some factors affecting whether or not to unroll:
- which one allows turning pull constants into push constants
- which one allows using a wider SIMD mode
- which one allows better latency hiding / scheduling
  for memory accesses (e.g. texture fetches)
- instruction count
- instruction cache size
- cycle counts (when they differ between instructions)

How much of this information does the frontend have, or can it request
from the backend, without actually having to compile both versions?
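
(As a very rough sketch of what the frontend could already estimate from
just a per-instruction cost callback, without compiling anything twice --
again, the names and the budget are invented:)

    #include <stdbool.h>
    #include "nir.h"

    /* Hypothetical frontend heuristic: sum the backend's per-instruction
     * cost estimates over one loop iteration, then check whether the
     * fully unrolled loop stays under a fixed budget.  A real pass would
     * walk the loop body with NIR's own iterators rather than take a
     * flat array of instructions.
     */
    static bool
    example_should_unroll(nir_instr *const *body, unsigned num_instrs,
                          unsigned trip_count,
                          unsigned (*instr_cost)(nir_instr *instr))
    {
       const unsigned max_unrolled_cost = 400;   /* made-up budget */

       unsigned iteration_cost = 0;
       for (unsigned i = 0; i < num_instrs; i++)
          iteration_cost += instr_cost(body[i]);

       return iteration_cost * trip_count <= max_unrolled_cost;
    }

That of course says nothing about the pull-vs-push or SIMD width questions
above.
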
- Eero
(With an offline compiler, compilation CPU usage would be less of an issue.)