I noticed this morning that the loops are in the wrong order for a column-major array. Reversing them, I get:

testing outer_func
0.294904 seconds
0.296689 seconds
testing outer_func2
0.280391 seconds
0.281223 seconds

Now both versions have the phi instructions, so I guess that wasn't the problem.

And sprinkling a little @simd on the inner loops:

testing outer_func
0.159910 seconds
0.157640 seconds
testing outer_func2
0.151384 seconds
0.152224 seconds

I'm going to write a Fortran code to do a performance comparison, but this is looking pretty good. Do you think I should file a performance issue for the original code?

Jared Crean

On Saturday, October 29, 2016 at 4:13:48 AM UTC-4, Kristoffer Carlsson wrote:
>
> Could it be some alias checking going on?
>
> Anyway, this code is horribly slow on 0.6 (even with #19097) it seems.
>
> to_indexes(::Int64, ::Int64, ::Vararg{Int64,N}) at operators.jl:868
> (repeats 3 times)
> kills performance.
>
>
> On Saturday, October 29, 2016 at 5:56:12 AM UTC+2, Jared Crean wrote:
>>
>> I'm working on a high-dimensional finite difference code, and I got a
>> strange performance result. I have a kernel function that computes the
>> stencil at a given point, and an outer function, outer_func, that loops
>> over the dimensions and calls the kernel function at every grid point.
>> I created a second function, outer_func2, with the same loops as
>> outer_func, but rather than call the kernel function it has the contents
>> of the kernel function copied into it. The source code is here:
>> https://github.com/JaredCrean2/wave6d/blob/master/src/test_inline.jl
>>
>> The performance results (with bounds checking disabled and
>> --math-mode=fast) are:
>>
>> testing outer_func
>> 0.398586 seconds
>> 0.398821 seconds
>> testing outer_func2
>> 2.522230 seconds
>> 2.522479 seconds
>>
>> I ran this on an Intel Ivy Bridge (i7-3820) processor, using Julia 0.4.4.
>>
>> I looked at the LLVM code (attached), and noticed outer_func2 has a bunch
>> of extra statements that look like
>>
>> %lsr.iv570 = phi i8* [ %scevgep571, %L21 ], [ %scevgep569, %L.preheader ]
>>
>> that are not present for outer_func. I don't know LLVM code very well
>> (hardly at all), so I'm not sure what these mean. Any help understanding
>> either the LLVM code or the performance difference would be appreciated.
>>
>> Thanks,
>> Jared Crean
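
The loop-order and @simd changes described at the top of this message can be sketched as follows. This is only an illustration, not the code from test_inline.jl: the function names sum_colmajor and sum_rowmajor and the 2000x2000 array are made up for the example, and it uses a plain sum rather than a stencil. It assumes a reasonably recent Julia version.

```julia
# Illustrative only: sum_colmajor and sum_rowmajor are hypothetical names,
# not functions from the wave6d repository.

# Julia arrays are column-major, so A[i, j] and A[i+1, j] are adjacent in
# memory. Putting the first index in the innermost loop gives unit-stride
# access, which is also what @simd needs to vectorize well.
function sum_colmajor(A::Matrix{Float64})
    total = 0.0
    for j in 1:size(A, 2)               # outer loop over columns
        @simd for i in 1:size(A, 1)     # inner loop over rows (contiguous)
            @inbounds total += A[i, j]
        end
    end
    return total
end

# Same reduction with the loops swapped: the inner loop now jumps
# size(A, 1) elements between consecutive accesses.
function sum_rowmajor(A::Matrix{Float64})
    total = 0.0
    for i in 1:size(A, 1)               # outer loop over rows
        for j in 1:size(A, 2)           # inner loop over columns (strided)
            @inbounds total += A[i, j]
        end
    end
    return total
end

A = ones(Float64, 2000, 2000)

# Call each function once first so compilation time is not measured.
sum_colmajor(A)
sum_rowmajor(A)

@time sum_colmajor(A)
@time sum_rowmajor(A)
```

Both functions return the same value; only the memory access pattern differs. The column-major version would normally be expected to run faster, but how large the gap is depends on the array size and the hardware.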