Dear Petr,

> I thought the point of having the array objects and the associated 
> manipulation functions was to hide the loops while delivering decent 
> performance...


This assumption is very true in Matlab. Matlab has spent an enormous 
amount of engineering effort on optimizing code that uses vectorization and 
array objects (and it is still slow). There have been ongoing discussions 
over on GitHub [1] about how to improve the current situation in Julia. One 
of the major selling points of Julia for me is the fact that it is quite 
transparent about which kinds of optimizations it requires. I can write in a 
very dynamic style with a lot of matrix operations, like in Matlab, and 
still get decent performance, or I can go in and identify with tools like 
@profile [2] what the pain points in my program are. 
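As a minimal sketch of that workflow (the function `work` here is just a 
stand-in for your own code, not anything from this thread):

```julia
using Profile   # stdlib module; available without `using` on old Julia versions

# A stand-in for the code you actually want to investigate.
function work(N)
    s = 0.0
    for i in 1:N
        s += sqrt(i)
    end
    return s
end

work(10)               # run once so we profile compiled code, not compilation
@profile work(10^7)    # collect samples while the function runs
Profile.print()        # line-by-line report of where the time went
```

The report points you at the lines that dominate the runtime, which is where 
devectorizing (or otherwise removing allocations) pays off.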

The point of using vectorized operations in Matlab is that they are the one 
reliable way to get good performance in Matlab, because all the underlying 
functions are written in C. In Julia, most underlying functions are written 
in Julia (note that most mathematical operations still call out to C 
libraries; no need to reinvent the wheel). There is a Julia package you can 
use when porting over Matlab code that devectorizes your code [3], but if 
the operation occurs in a hot loop it will still be a good and necessary 
optimization to write the vectorized code out as explicit loops. Then you no 
longer get just decent performance, but excellent performance instead. 
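To make that concrete, here is a sketch in the spirit of the doit1/doit1a 
examples further down the thread (the function names are mine): the 
vectorized loop allocates fresh temporary arrays on every iteration, while 
the devectorized version keeps the scalar in a register and updates Fe in 
place, allocating nothing inside the loop.

```julia
# Vectorized: Ns * (f * Jac) allocates a new array every iteration,
# and so does Fe + ..., which the garbage collector must clean up.
function accumulate_vec(N)
    Fe = zeros(3); Ns = fill(1.0, 3); f = -6.0; Jac = 1.0
    for i in 1:N
        Fe += Ns * (f * Jac)
    end
    return Fe
end

# Devectorized: the scalar f * Jac stays in a register and Fe is
# mutated in place, so the loop body performs no heap allocation.
function accumulate_devec(N)
    Fe = zeros(3); Ns = fill(1.0, 3); f = -6.0; Jac = 1.0
    for i in 1:N
        tmp = f * Jac
        for k in 1:length(Fe)
            Fe[k] += Ns[k] * tmp
        end
    end
    return Fe
end
```

Both compute the same result; running @time on each shows the difference in 
bytes allocated.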

Best, Valentin

PS:
You can also use @which 1 .* [1] and @edit to see where the operation is 
defined and how it is implemented. The .* operation is implemented by 
allocating an output array and then running * element-wise over the array. 
No magic is going on there.
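In rough outline (this is a sketch of the idea, not the actual Base 
implementation), element-wise multiplication of two arrays amounts to:

```julia
# Sketch of what a .* b does for two arrays of the same shape:
# allocate an output array, then apply plain scalar * element-wise.
function elementwise_mul(a, b)
    size(a) == size(b) || error("shape mismatch")
    out = similar(a)             # this is the allocation .* pays for
    for i in 1:length(a)
        out[i] = a[i] * b[i]     # ordinary scalar multiplication
    end
    return out
end
```

That one allocation per call is exactly what hurts when the expression sits 
inside a loop executed millions of times.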

[1] https://github.com/JuliaLang/julia/issues/8450 
[2] http://docs.julialang.org/en/release-0.3/stdlib/profile/
[3] https://github.com/lindahua/Devectorize.jl

On Thursday, 11 December 2014 06:23:29 UTC+1, Petr Krysl wrote:
>
> Thanks.  Now my head is really spinning!
>
> See, before I posted the original question I tried expanding the loop in 
> the actual FE code, and the code was SLOWER and was using MORE memory:
>
> With the expression 
> Fe += Ns[j] * (f * Jac * w[j]); :
> 6.223416655 seconds (1648832052 bytes
>
> With the expanded loop 
> for kx=1:length(Fe) # alternative (devectorization)
>      Fe[kx] += Ns[j][kx] * (f * Jac * w[j]); 
> end 
>  7.340272676 seconds (1776971604 bytes allocated,
>
> In addition, your argument clearly demonstrates how to avoid the temporary 
> array for doit1(), but doit2() adds only one additional 1x1 temporary on 
> top of the 3x1 (it seems to me), yet it is about 14 times slower. Why is that?
>
> Finally, if the only way I can get decent performance with lines like
>
>  Fe += Ns[j] * (f * Jac * w[j]);  # Fe and Ns[j] arrays
>
> is to manually write out all the loops, that would be terrible news 
> indeed.  Not only is that a lot of work when rewriting loads of Matlab code 
> (with several matrices concatenated in many, many expressions), but the 
> legibility and maintainability tank. I thought the point of having the 
> array objects and the associated manipulation functions was to hide the 
> loops while delivering decent performance...
>
> Petr
>
>
> On Wednesday, December 10, 2014 8:28:31 PM UTC-8, Tim Holy wrote:
>>
>> Multiplying two Float64s yields another Float64; most likely, this will be 
>> stored in the CPU's registers. In contrast, [f]*Jac creates an array on 
>> each iteration that has to be stored on the heap. 
>>
>> A faster variant devectorizes: 
>>        function doit1a(N) 
>>            Fe=zeros(3,1);    Ns=zeros(3,1)+1.0;    f=-6.0;    Jac=1.0; 
>>            for i=1:N 
>>                tmp = f*Jac 
>>                for j = 1:length(Fe) 
>>                    Fe[j] += Ns[j]*tmp 
>>                end 
>>            end 
>>            Fe 
>>        end 
>>
>> julia> @time doit1(N); 
>> elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc 
>> time) 
>>
>> julia> @time doit1a(N); 
>> elapsed time: 0.022118726 seconds (320 bytes allocated) 
>>
>> Note the tiny allocations in the second case. 
>>
>> --Tim 
>>
>>
>> On Wednesday, December 10, 2014 07:54:00 PM Petr Krysl wrote: 
>> > The code is really short: 
>> > 
>> > N=2000000 
>> > 
>> > function doit1(N) 
>> >     Fe=zeros(3,1);    Ns=zeros(3,1)+1.0;    f=-6.0;    Jac=1.0; 
>> >     for i=1:N 
>> >         Fe += Ns *  (f * Jac); 
>> >     end 
>> >     Fe 
>> > end 
>> > 
>> > function doit2(N) 
>> >     Fe=zeros(3,1);    Ns=zeros(3,1)+1.0;    f=-6.0;    Jac=1.0; 
>> >     for i=1:N 
>> >         Fe += Ns .*  ([f] * Jac); 
>> >     end 
>> >     Fe 
>> > end 
>> > 
>> > function doit3(N) 
>> >     Fe=zeros(3,1);    Ns=zeros(3,1)+1.0;    f=-6.0;    Jac=1.0; 
>> >     fs=[1.0] 
>> >     for i=1:N 
>> >         fs= ([f] * Jac );   Fe += Ns .*  fs; 
>> >     end 
>> >     Fe 
>> > end 
>> > 
>> > function doit4(N) 
>> >     Fe=zeros(3,1);    Ns=zeros(3,1)+1.0;    f=-6.0;    Jac=1.0; 
>> >     fs=[1.0] 
>> >     for i=1:N                   # 
>> >         fs= [f]; fs *= Jac;   Fe += Ns .*  fs; 
>> >     end 
>> >     Fe 
>> > end 
>> > # 
>> > @time doit1(N) 
>> > @time doit2(N) 
>> > @time doit3(N) 
>> > @time doit4(N) 
>> > 
>> > Here are the measurements on my machine: 
>> > 
>> > elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc time) 
>> > elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc time) 
>> > elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc time) 
>> > elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc time) 
>> > 
>> > I'll be grateful for any pointers as to how to structure the code so that 
>> > the array operations do not incur this horrendous penalty. 
>> > 
>> > Petr 
>> > 
>> > On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote: 
>> > > Can you post short but complete code examples in a gist? 
>> > > https://gist.github.com/ 
>> > > That would make it easier to follow than chasing code examples across 
>> > > long email chains. 
>> > > 
>> > > --Tim 
>> > > 
>> > > On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote: 
>> > > > I don't know if this is correct, but here is a guess: 
>> > > > 
>> > > > Option 3 still requires a temp array (to calculate the result of the 
>> > > > paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that temp. 
>> > > > The cost of the temp over the 2 million loops is ~200MB and 0.6 sec 
>> > > > CPU time. 
>> > > 
>> > > > So WHY is the difference between 1 and 2 so HUUUGE? 
>> > > > 
>> > > > I think this calls for someone who wrote the compiler. Guys? 
>> > > > 
>> > > > Thanks a bunch, 
>> > > > 
>> > > > P 
>> > > > 
>> > > > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote: 
>> > > > > Actually: option (4) was also tested: 
>> > > > > # 16.333766821 seconds (3008899660 bytes 
>> > > > > fs[1]= f; fs *= (Jac * w[j]); 
>> > > > > Fe += Ns[j] .*  fs; 
>> > > > > 
>> > > > > So, allocation of memory was reduced somewhat, runtime not so much. 
>> > > > > 
>> > > > > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote: 
>> > > > >> Well, a temporary array was also on my mind.  However, things are, 
>> > > > >> I believe, a little bit more complicated. 
>> > > > >> 
>> > > > >> Here is the code with three timed options.  As you can see, the 
>> > > > >> first two options are the fast one (multiplication with a scalar) 
>> > > > >> and the slow one (multiplication with a one-by-one matrix).  In the 
>> > > > >> third option I tried to avoid the creation of an ad hoc temporary 
>> > > > >> by allocating a variable outside of the loop.  The effect 
>> > > > >> unfortunately is nil. 
>> > > > >> 
>> > > > >>     fs=[0.0]  # Used only for option (3) 
>> > > > >>     # Now loop over all fes in the block 
>> > > > >>     for i=1:size(conns,1) 
>> > > > >>         ... 
>> > > > >>         for j=1:npts 
>> > > > >>             ... 
>> > > > >>             # Option (1): 7.193767019 seconds (1648850568 bytes 
>> > > > >>             # Fe += Ns[j] *  (f * Jac * w[j]); 
>> > > > >>             # Option (2): 17.301214583 seconds (3244458368 bytes 
>> > > > >>             #    Fe += Ns[j] .*  ([f] * Jac * w[j]); 
>> > > > >>             # Option (3): 16.943314075 seconds (3232879120 
>> > > > >>             fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs; 
>> > > > >>         end 
>> > > > >>         ... 
>> > > > >>     end 
>> > > > >> 
>> > > > >> What do you think?  Why is the code still getting hit with a big 
>> > > > >> performance/memory penalty? 
>> > > > >> 
>> > > > >> Thanks, 
>> > > > >> 
>> > > > >> Petr 
>> > > > >> 
>> > > > >> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote: 
>> > > > >>> I would think that when f is a 1x1 matrix Julia is allocating a 
>> > > > >>> new 1x1 matrix to store the result. If it is a scalar that 
>> > > > >>> allocation can be skipped. When this part of the code is now in a 
>> > > > >>> hot loop it might happen that you allocate millions of very small 
>> > > > >>> short-lived objects and that taxes the GC quite a lot. 
>>
>>
