The timing for the Fortran code (using -Ofast) is outer_func2 time = 0.160010 second.
I checked and it is using vector instructions. I'm impressed Julia is as fast as Fortran in this case. I would have thought alias checking would slow Julia down. The Julia code is slow on release-0.5 as well as 0.6, so I will file an issue.

Jared Crean

On Saturday, October 29, 2016 at 11:05:38 AM UTC-4, Jared Crean wrote:
>
> I noticed this morning that the loops are in the wrong order for a
> column-major array. Reversing them, I get:
>
> testing outer_func
> 0.294904 seconds
> 0.296689 seconds
> testing outer_func2
> 0.280391 seconds
> 0.281223 seconds
>
> Now both versions have the phi instructions, so I guess that wasn't the
> problem.
>
> And sprinkling a little @simd on the inner loops:
>
> testing outer_func
> 0.159910 seconds
> 0.157640 seconds
> testing outer_func2
> 0.151384 seconds
> 0.152224 seconds
>
> I'm going to write a Fortran code to do a performance comparison, but this
> is looking pretty good.
>
> Do you think I should file a performance issue for the original code?
>
> Jared Crean
>
>
> On Saturday, October 29, 2016 at 4:13:48 AM UTC-4, Kristoffer Carlsson
> wrote:
>>
>> Could it be some alias checking going on?
>>
>> Anyway, this code is horribly slow on 0.6 (even with #19097), it seems.
>>
>> to_indexes(::Int64, ::Int64, ::Vararg{Int64,N}) at operators.jl:868
>> (repeats 3 times)
>> kills performance.
>>
>>
>> On Saturday, October 29, 2016 at 5:56:12 AM UTC+2, Jared Crean wrote:
>>>
>>> I'm working on a high-dimensional finite difference code, and I got a
>>> strange performance result. I have a kernel function that
>>> computes the stencil at a given point, and an outer function,
>>> outer_func, that loops over the dimensions and calls the kernel function at
>>> every grid point.
>>> I created a second function, outer_func2, with the same loops as
>>> outer_func, but rather than calling the kernel function it has the contents
>>> of the kernel function copied into it.
>>> The source code is here:
>>> https://github.com/JaredCrean2/wave6d/blob/master/src/test_inline.jl
>>>
>>> The performance results (with bounds checking disabled and
>>> --math-mode=fast) are:
>>>
>>> testing outer_func
>>> 0.398586 seconds
>>> 0.398821 seconds
>>> testing outer_func2
>>> 2.522230 seconds
>>> 2.522479 seconds
>>>
>>> I ran this on an Intel Ivy Bridge (i7-3820) processor, using Julia 0.4.4.
>>>
>>> I looked at the LLVM code (attached), and noticed outer_func2 has a
>>> bunch of extra statements that look like
>>>
>>> %lsr.iv570 = phi i8* [ %scevgep571, %L21 ], [ %scevgep569, %L.preheader ]
>>>
>>> that are not present for outer_func. I don't know LLVM code very well
>>> (hardly at all), so I'm not sure what these mean. Any help
>>> understanding either the LLVM code or the performance difference would
>>> be appreciated.
>>>
>>> Thanks,
>>> Jared Crean
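
For readers following the thread, here is a minimal sketch of the two changes discussed above: putting the loops in column-major order, then adding @simd to the inner loop. It is not the code from test_inline.jl. The 2D five-point stencil and the names stencil_strided! and stencil_contiguous! are hypothetical, chosen only for illustration, and the sketch assumes a recent Julia 1.x rather than the 0.4/0.5 versions mentioned in the thread.

```julia
# Hypothetical 2D five-point stencil; illustrative only, not from test_inline.jl.

# Inner loop over the LAST index: consecutive iterations touch elements that
# are a full column apart in memory (Julia arrays are column-major).
function stencil_strided!(out::Matrix{Float64}, u::Matrix{Float64})
    n, m = size(u)
    for i in 2:n-1
        for j in 2:m-1
            @inbounds out[i, j] = u[i-1, j] + u[i+1, j] +
                                  u[i, j-1] + u[i, j+1] - 4.0 * u[i, j]
        end
    end
    return out
end

# Inner loop over the FIRST index: consecutive iterations touch adjacent
# memory. @simd additionally tells the compiler that the iterations are
# independent and may be reordered, which can help it vectorize.
function stencil_contiguous!(out::Matrix{Float64}, u::Matrix{Float64})
    n, m = size(u)
    for j in 2:m-1
        @simd for i in 2:n-1
            @inbounds out[i, j] = u[i-1, j] + u[i+1, j] +
                                  u[i, j-1] + u[i, j+1] - 4.0 * u[i, j]
        end
    end
    return out
end

u = rand(64, 64)
a = stencil_strided!(zeros(size(u)), u)
b = stencil_contiguous!(zeros(size(u)), u)
```

Both functions compute the same values; only the memory access pattern and the vectorization hint differ. Wrapping each call in @time on a larger array is one way to compare them, though the size of any difference depends on the machine and the Julia version.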