[julia-users] Re: Performance of Kernel Inlining

Jared Crean Sat, 29 Oct 2016 11:53:55 -0700

The timing for the Fortran code (using -Ofast) is

  outer_func2 time = 0.160010 second.


I checked and it is using vector instructions.  I'm impressed Julia is as 
fast as Fortran in this case.  I would have thought alias checking would 
slow Julia down.
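
For anyone who wants to make the same check on the Julia side, here is a 
minimal sketch that searches the generated LLVM IR for vector types.  It 
assumes Julia 1.x syntax (not the 0.4/0.5 syntax used elsewhere in this 
thread), and simd_sum is a hypothetical stand-in, not a function from 
test_inline.jl.  Whether the loop actually vectorizes depends on the CPU and 
the Julia version, so treat the printed result as something to inspect rather 
than a guarantee.

```julia
using InteractiveUtils  # provides code_llvm on Julia 0.7 and later

# Hypothetical reduction; @simd permits reordering the floating-point sum.
function simd_sum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

# Capture the LLVM IR as a string instead of printing it to the terminal.
ir = sprint(code_llvm, simd_sum, (Vector{Float64},))

# Vectorized IR mentions types such as "<4 x double>"; the width varies by CPU.
has_vector_ops = occursin(r"<\d+ x double>", ir)
println("vector types present in IR: ", has_vector_ops)
```

The same approach should work for any function and argument types: pass the 
function and a tuple of argument types to code_llvm.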

The original Julia code is slow on release-0.5 as well as 0.6, so I will 
file an issue.
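
To illustrate the two changes described in the quoted message below (loop 
order for a column-major array, and @simd on the inner loop), here is a 
reduced sketch.  It is a hypothetical 2D example written in Julia 1.x syntax; 
the names kernel, outer_rowwise! and outer_colwise! are made up for 
illustration, and the real code in test_inline.jl is higher dimensional and 
is not reproduced here.

```julia
# Hypothetical kernel: second-order central difference along the first dimension.
@inline function kernel(u::Matrix{Float64}, i::Int, j::Int)
    return u[i-1, j] - 2.0 * u[i, j] + u[i+1, j]
end

# Inner loop over the second index: consecutive iterations are a full column
# apart in memory, which is the slow order for Julia's column-major arrays.
function outer_rowwise!(out::Matrix{Float64}, u::Matrix{Float64})
    m, n = size(u)
    for i in 2:m-1
        for j in 1:n
            @inbounds out[i, j] = kernel(u, i, j)
        end
    end
    return out
end

# Inner loop over the first index: consecutive iterations touch contiguous
# memory, and @simd tells the compiler it may vectorize that loop.
function outer_colwise!(out::Matrix{Float64}, u::Matrix{Float64})
    m, n = size(u)
    for j in 1:n
        @simd for i in 2:m-1
            @inbounds out[i, j] = kernel(u, i, j)
        end
    end
    return out
end

u = rand(512, 512)
a = outer_rowwise!(zeros(512, 512), u)
b = outer_colwise!(zeros(512, 512), u)
@assert a ≈ b  # both traversal orders compute the same stencil values
```

Wrapping each call in @time (after one warm-up call to exclude compilation) 
is one way to compare the two orderings; the timings will vary by machine.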

  Jared Crean



On Saturday, October 29, 2016 at 11:05:38 AM UTC-4, Jared Crean wrote:
>
> I noticed this morning that the loops are in the wrong order for a 
> column-major array.  Reversing them, I get:
>
> testing outer_func
>   0.294904 seconds
>   0.296689 seconds
> testing outer_func2
>   0.280391 seconds
>   0.281223 seconds
>
> Now both versions have the phi instructions, so I guess that wasn't the 
> problem.
>
>
> And sprinkling a little @simd on the inner loops:
>
> testing outer_func
>   0.159910 seconds
>   0.157640 seconds
> testing outer_func2
>   0.151384 seconds
>   0.152224 seconds
>
> I'm going to write a Fortran code to do a performance comparison, but this 
> is looking pretty good.
>
> Do you think I should file a performance issue for the original code?
>
>   Jared Crean
>
>
>
> On Saturday, October 29, 2016 at 4:13:48 AM UTC-4, Kristoffer Carlsson 
> wrote:
>>
>> Could it be some alias checking going on?
>>
>> Anyway, this code is horribly slow on 0.6 (even with #19097) it seems.
>>
>> to_indexes(::Int64, ::Int64, ::Vararg{Int64,N}) at operators.jl:868 
>> (repeats 3 times)
>> kills performance.
>>
>>
>> On Saturday, October 29, 2016 at 5:56:12 AM UTC+2, Jared Crean wrote:
>>>
>>> I'm working on a high-dimensional finite difference code, and I got a 
>>> strange performance result. I have a kernel function that
>>> computes the stencil at a given point, and an outer function, 
>>> outer_func, that loops over the dimensions and calls the kernel function at 
>>> every grid point.
>>> I created a second function, outer_func2, with the same loops as 
>>> outer_func, but rather than call the kernel function it has the contents of
>>> the kernel function copied into it.  The source code is here: 
>>> https://github.com/JaredCrean2/wave6d/blob/master/src/test_inline.jl
>>>
>>> The performance results (with bounds checking disabled and 
>>> --math-mode=fast) are:
>>>
>>> testing outer_func
>>>   0.398586 seconds
>>>   0.398821 seconds
>>> testing outer_func2
>>>   2.522230 seconds
>>>   2.522479 seconds
>>>
>>>
>>>
>>> I ran this on an Intel Ivy Bridge (i7-3820) processor, using Julia 0.4.4.
>>>
>>> I looked at the llvm code (attached), and noticed outer_func2 has a 
>>> bunch of extra statements that look like
>>>
>>>   %lsr.iv570 = phi i8* [ %scevgep571, %L21 ], [ %scevgep569, %L.preheader ]
>>>
>>>
>>>
>>> that are not present for outer_func.  I don't know llvm code very well 
>>> (hardly at all), so I'm not sure what these mean.  Any help
>>> understanding either the llvm code or the performance difference would 
>>> be appreciated.
>>>
>>>
>>>
>>>   Thanks,
>>>      Jared Crean
>>>
>>
