This is interesting. The parallel speedup on your machine using TBB is in
the same ballpark as on my machine using OpenMP, and they're both
delivering less than a 2:1 speedup.

I informally ran some experiments two nights ago to try to characterize the
behavior. On my machine, with OpenMP #pragmas on the scalar loops, the
ratio of multi-threaded to single-threaded runtime held stubbornly at
about 0.7 regardless of the size of the problem (roughly a 1.4:1 speedup).
I tried integer and float
data, addition and power, with ravels up to 100 million elements. (My
smallest test set was a million elements; I still need to try smaller sets
to see whether I can find a knee where the thread setup overhead dominates
and results in a runtime ratio greater than 1.)
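
For reference, the annotation itself is trivial. Here's a minimal sketch of
the kind of loop I mean (not the actual GNU APL code; compile with -fopenmp):

#include <cstddef>

void add_arrays(const double* a, const double* b, double* z, std::size_t n)
{
    // The #pragma is the entire change from the serial version; the
    // compiler generates the thread setup and partitions the iterations.
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        z[i] = a[i] + b[i];
}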

I'm not sure what this means yet. I'd hoped to see some further
improvement as the ravel size increased, despite the internal
inefficiencies. TBH, I didn't find and annotate the copy loop(s); that
might have a lot to do with my results. (I did find and annotate another
loop in check_value(), though. Maybe parallelizing that will improve your
results.) I'm hoping that the poor showing so far isn't a result of memory
bandwidth limitations.
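
One way to check the bandwidth theory is a standalone probe: time the same
simple loop single- and multi-threaded, with no APL machinery involved. A
rough sketch (not part of any patch; compile with -fopenmp):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const long n = 100 * 1000 * 1000;       // ~100M doubles, as in my tests
    std::vector<double> a(n, 1.0), b(n, 2.0), z(n);

    auto run = [&](bool threaded) {
        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for if (threaded)
        for (long i = 0; i < n; ++i)
            z[i] = a[i] + 3.0 * b[i];       // triad: two reads, one write
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    const double s = run(false), p = run(true);
    std::printf("serial %.3f s, parallel %.3f s, ratio %.2f\n", s, p, p / s);
}

If the ratio stays near 1 here too, the scalar loops are memory-bound and
more cores won't help.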

I hope to spend some more time on this over the weekend.


P.S.: I will note that the nice part about using OpenMP is that there's no
hand-coding necessary. All you do is add #pragmas to your program, as in the
sketch above; the compiler takes care of the rewrites.


---------- Forwarded message ----------

> From: "Elias Mårtenson" <loke...@gmail.com>
> To: "bug-apl@gnu.org" <bug-apl@gnu.org>
> Cc:
> Date: Fri, 14 Mar 2014 22:22:15 +0800
> Subject: [Bug-apl] Performance optimisations: Results
> Hello guys,
>
> I've spent some time experimenting with various performance optimisations
> and I would like to share my latest results with you:
>
> I've run a lot of tests using Callgrind, which is part of the Valgrind
> tool <http://valgrind.org/> (documentation here:
> <http://valgrind.org/docs/manual/cl-manual.html>). In doing so, I've
> concluded that a disproportionate amount of time is spent copying values
> (this can be parallelised; more about that below).
>
> I set out to see how much faster I could make a simple test program that
> applied a monadic scalar function. Here is my test program:
>
> ∇Z←testinv;tmp
> src←10000 4000⍴÷⍳100
> 'Starting'
> tmp←{--------⍵} time src
> Z←1
> ∇
>
>
> This program calls my time operator, which simply shows the amount of time
> it took to execute the operation. This is of course needed for
> benchmarking. For completeness, here is the implementation of time:
>
> ∇Z←L (OP time) R;start;end
> start←⎕AI
> →(0≠⎕NC 'L')/twoargs
> Z←OP R
> →finish
> twoargs:
> Z←L OP R
> finish:
> end←⎕AI
> 'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
> ∇
>
>
> The unmodified version of GNU APL runs this in *5037.00* milliseconds on
> my machine.
>
> I then set out to minimise the amount of cloning of values, taking
> advantage of the existing temp functionality. Once I had done this, the
> execution time was reduced to *2577.00* ms.
>
> I then used the Threading Building Blocks
> <https://www.threadingbuildingblocks.org/> library to parallelise two
> operations: the clone operation and the monadic
> SkalarFunction::eval_skalar_B(). After this, on my 4-core machine, the
> runtime was reduced to *1430.00* ms.
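>
> For illustration, the parallel copy is essentially a tbb::parallel_for
> over the ravel. A simplified sketch (the Cell type and init_from helper
> here are stand-ins, not the real GNU APL names):
>
> #include <tbb/parallel_for.h>
> #include <tbb/blocked_range.h>
> #include <cstddef>
>
> struct Cell { double value; };                    // stand-in for the APL cell type
>
> static void init_from(Cell* dst, const Cell* src) // hypothetical per-cell copy
> {
>     *dst = *src;
> }
>
> void parallel_clone(const Cell* src, Cell* dst, std::size_t count)
> {
>     // Split [0, count) into chunks; TBB runs the lambda on each chunk.
>     tbb::parallel_for(tbb::blocked_range<std::size_t>(0, count),
>         [=](const tbb::blocked_range<std::size_t>& r) {
>             for (std::size_t i = r.begin(); i != r.end(); ++i)
>                 init_from(dst + i, src + i);
>         });
> }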
>
> Threading Building Blocks is available from the application repositories
> of at least Arch Linux and Ubuntu, and I'm sure it's available elsewhere
> too. To test it on OS X I had to download it separately.
>
> To summarise:
>
>    - Standard: 5037.00 ms
>    - Reduced cloning: 2577.00 ms
>    - Parallel: 1430.00 ms
>
> I have attached the patch, but it's definitely not something that should
> be applied blindly. I have hacked around in several parts of the code, some
> of which I can't say I understand fully, so see it as a proof-of-concept,
> nothing else.
>
> Note that the code that implements the parallelism using TBB is pretty
> ugly, and the code ends up being duplicated in the parallel and
> non-parallel versions. This can, of course, be encapsulated much more
> cleanly if one wants to make this generic; see the sketch below.
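>
> As a rough sketch, one shape such a wrapper could take (names made up):
>
> #include <tbb/parallel_for.h>
> #include <tbb/blocked_range.h>
> #include <cstddef>
>
> // Run body(i) for every i in [0, count): in parallel above some
> // threshold, serially below it, so each loop body is written once.
> template <typename Body>
> void for_each_index(std::size_t count, const Body& body)
> {
>     const std::size_t threshold = 10000;          // made-up cut-over point
>     if (count >= threshold)
>         tbb::parallel_for(tbb::blocked_range<std::size_t>(0, count),
>             [&](const tbb::blocked_range<std::size_t>& r) {
>                 for (std::size_t i = r.begin(); i != r.end(); ++i)
>                     body(i);
>             });
>     else
>         for (std::size_t i = 0; i < count; ++i)
>             body(i);
> }
>
> Both the clone loop and eval_skalar_B() could then call for_each_index()
> with their per-element work as the body.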
>
> Another thing: TBB is incredibly efficient, especially on Intel CPUs. I'd
> be very interested to see how OpenMP performs on this same code.
>
> Regards,
> Elias
>
>
-- 
"The secret to creativity is knowing how to hide your sources."
   Albert Einstein


http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/
