Hello,

I would like to share some benchmarks on my dual-core (yes, I am a poor guy) machine. The benchmark measures A+A with A←1048576⍴2. The y-axis shows the CPU cycle counter, which was recorded every 4096 iterations (so we have 256 samples on the x-axis). The first value was 0 (before the loop was entered) and the final value was taken after exiting the loop (in SkalarFunction::eval_skalar_AB()).
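
As a rough illustration of the sampling scheme described above, the following sketch adds two vectors and records a timestamp every 4096 iterations, plus one final sample after the loop. It is not the GNU APL code: `add_with_samples` is a hypothetical name, and `std::chrono::steady_clock` is an assumption standing in for the CPU cycle counter so that the sketch stays portable.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of GNU APL: computes z[i] = a[i] + b[i]
// and records the elapsed time at the start of every `interval`-th
// iteration, plus one final sample after the loop has finished.
// Assumes b is at least as long as a.
// std::chrono::steady_clock is an assumption standing in for the CPU
// cycle counter used in the original measurement.
std::vector<std::int64_t> add_with_samples(const std::vector<double>& a,
                                           const std::vector<double>& b,
                                           std::vector<double>& z,
                                           std::size_t interval = 4096)
{
    using timer = std::chrono::steady_clock;
    std::vector<std::int64_t> samples;
    const auto start = timer::now();

    // Elapsed nanoseconds since `start`.
    auto elapsed = [&] {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
                   timer::now() - start).count();
    };

    z.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i % interval == 0) samples.push_back(elapsed());
        z[i] = a[i] + b[i];
    }
    samples.push_back(elapsed());   // final value, after exiting the loop
    return samples;
}
```

With 1048576 elements and the default interval this yields 256 in-loop samples and one final sample.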
The way I read these results is that the inner loop for scalar functions scales linearly on a 2-core machine.

If the total time scales worse than that, then either the sequential part is too big (Amdahl's law) or something else is wrong. For example, in one of my first tests everything compiled fine but still only one core was used.
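
For reference, Amdahl's law can be written as a one-line function. This is a minimal sketch; the name `amdahl_speedup` and its parameters are chosen here for illustration only.

```cpp
// Amdahl's law: expected speedup on `cores` cores when a fraction
// `parallel_fraction` (between 0 and 1) of the single-core runtime
// can be parallelised and the rest stays sequential.
double amdahl_speedup(double parallel_fraction, int cores)
{
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

On two cores, a fully parallel workload gives a speedup of 2, while a workload that is only half parallel gives 1 / (0.5 + 0.25), or about 1.33.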
/// Jürgen

On 03/14/2014 05:18 PM, David Lamkins wrote:
This is interesting. The parallel speedup on your machine using TBB is in the same ballpark as on my machine using OpenMP, and they're both delivering less than a 2:1 speedup.

I informally ran some experiments two nights ago to try to characterize the behavior. On my machine, with OpenMP #pragmas on the scalar loops, the ratio of multi-threaded to single-threaded runtime held stubbornly at about 0.7 regardless of the size of the problem. I tried integer and float data, addition and power, with ravels up to 100 million elements. (My smallest test set was a million elements; I still need to try smaller sets to see whether I can find a knee where the thread setup overhead dominates and results in a runtime ratio greater than 1.)

I'm not sure what this means, yet. I'd hoped to see some further improvement as the ravel size increased, despite the internal inefficiencies. TBH, I didn't find and annotate the copy loop(s); that might have a lot to do with my results. (I did find and annotate another loop in check_value(), though. Maybe parallelizing that will improve your results.) I'm hoping that the poor showing so far isn't a result of memory bandwidth limitations.

I hope to spend some more time on this over the weekend.

P.S.: I will note that the nice part about using OpenMP is that there's no hand-coding necessary.
All you do is add #pragmas to your program; the compiler takes care of the rewrites.

---------- Forwarded message ----------
From: "Elias Mårtenson" <loke...@gmail.com>
To: "bug-apl@gnu.org" <bug-apl@gnu.org>
Date: Fri, 14 Mar 2014 22:22:15 +0800
Subject: [Bug-apl] Performance optimisations: Results

Hello guys,

I've spent some time experimenting with various performance optimisations and I would like to share my latest results with you:

I've run a lot of tests using Callgrind, which is part of the Valgrind <http://valgrind.org/> tool (documentation here <http://valgrind.org/docs/manual/cl-manual.html>). In doing so, I've concluded that a disproportionate amount of time is spent copying values (this can be parallelised; more about that below).

I set out to see how much faster I could make a simple test program that applies a monadic scalar function. Here is my test program:

    ∇Z←testinv;tmp
    src←10000 4000⍴÷⍳100
    'Starting'
    tmp←{--------⍵} time src
    Z←1
    ∇

This program calls my time operator, which simply shows the amount of time it took to execute the operation. This is of course needed for benchmarking. For completeness, here is the implementation of time:

    ∇Z←L (OP time) R;start;end
    start←⎕AI
    →(0≠⎕NC 'L')/twoargs
    Z←OP R
    →finish
    twoargs:
    Z←L OP R
    finish:
    end←⎕AI
    'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
    ∇

The unmodified version of GNU APL runs this in *5037.00* milliseconds on my machine.

I then set out to minimise the amount of cloning of values, taking advantage of the existing temp functionality. Once I had done this, the execution time was reduced to *2577.00* ms.

I then used the Threading Building Blocks <https://www.threadingbuildingblocks.org/> library to parallelise two operations: the clone operation and the monadic SkalarFunction::eval_skalar_B(). After this, on my 4-core machine, the runtime was reduced to *1430.00* ms.

Threading Building Blocks is available from the application repositories of at least Arch Linux and Ubuntu, and I'm sure it's available elsewhere too. To test it on OS X I had to download it separately.

To summarise:

  * Standard: 5037.00 ms
  * Reduced cloning: 2577.00 ms
  * Parallel: 1430.00 ms

I have attached the patch, but it's definitely not something that should be applied blindly. I have hacked around in several parts of the code, some of which I can't say I understand fully, so see it as a proof of concept, nothing else.

Note that the code that implements the parallelism using TBB is pretty ugly, and the code ends up being duplicated in the parallel and non-parallel versions. This can, of course, be encapsulated much more nicely if one wants to make this generic.

Another thing: TBB is incredibly efficient, especially on Intel CPUs. I'd be very interested to see how OpenMP performs on this same code.

Regards,
Elias

--
"The secret to creativity is knowing how to hide your sources."
   Albert Einstein

http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/
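
To make the OpenMP approach mentioned in this thread concrete, here is a minimal sketch of a scalar loop annotated with a single pragma. It is not the GNU APL source: `add_ravels` is a hypothetical stand-in for a dyadic scalar loop, and the build flag named in the comment is the GCC one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dyadic scalar loop; not the GNU APL source.
// Assumes b is at least as long as a.
// When built with OpenMP enabled (for example -fopenmp with GCC), the
// iterations are divided among the available threads. Without OpenMP the
// pragma is ignored and the loop runs sequentially with the same result.
std::vector<double> add_ravels(const std::vector<double>& a,
                               const std::vector<double>& b)
{
    const long n = static_cast<long>(a.size());
    std::vector<double> z(a.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        z[i] = a[i] + b[i];   // each iteration writes a distinct element
    }
    return z;
}
```

Each iteration touches only its own element of `z`, so no synchronisation is needed inside the loop; that independence is what makes a one-line annotation sufficient here.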
<<attachment: two-cores.png>>
0, 168 1, 344610 2, 673064 3, 994497 4, 1316056 5, 1638490 6, 1959482 7, 2281776 8, 2603300 9, 2928142 10, 3248644 11, 3569132 12, 3889956 13, 4211144 14, 4531310 15, 4851546 16, 5175296 17, 5495462 18, 5814326 19, 6134856 20, 6454000 21, 6773466 22, 7093933 23, 7433902 24, 7756784 25, 8076390 26, 8395310 27, 8714636 28, 9033192 29, 9352896 30, 9672544 31, 9994747 32, 10313660 33, 10633336 34, 10952536 35, 11272303 36, 11591251 37, 11911312 38, 12232934 39, 12551910 40, 12871285 41, 13191038 42, 13510350 43, 13829144 44, 14148456 45, 14470274 46, 14789236 47, 15108212 48, 15426971 49, 15745394 50, 16062991 51, 16382261 52, 16756110 53, 17081295 54, 17400432 55, 17719674 56, 18038454 57, 18357031 58, 18673998 59, 18991546 60, 19313728 61, 19634048 62, 19953269 63, 20272378 64, 20590878 65, 20909924 66, 21229110 67, 21550823 68, 21869946 69, 22189146 70, 22508654 71, 22827924 72, 23146956 73, 23466030 74, 23788394 75, 24106040 76, 24424883 77, 24744342 78, 25062975 79, 25381552 80, 25700374 81, 26037060 82, 26357114 83, 26675530 84, 26993652 85, 27310794 86, 27629147 87, 27948151 88, 28266924 89, 28588630 90, 28907130 91, 29226134 92, 29545180 93, 29864100 94, 30182726 95, 30501394 96, 30823590 97, 31141572 98, 31459708 99, 31778229 100, 32096435 101, 32415082 102, 32735178 103, 33053468 104, 33374929 105, 33694066 106, 34013756 107, 34332760 108, 34651519 109, 34970082 110, 35299404 111, 35621082 112, 35939904 113, 36259055 114, 36578241 115, 36896601 116, 37215598 117, 37534336 118, 37856539 119, 38174528 120, 38493798 121, 38811948 122, 39129146 123, 39448171 124, 39766440 125, 40087068 126, 40405589 127, 40723837 128, 41042512 129, 41361544 130, 41679624 131, 41997004 132, 42316526 133, 42637273 134, 42955878 135, 43273916 136, 43592948 137, 43911826 138, 44230368 139, 44594116 140, 44916886 141, 45233048 142, 45549532 143, 45866786 144, 46182598 145, 46498144 146, 46815489 147, 47137230 148, 47453630 149, 47769505 150, 48086640 151, 48401927 152, 48717515 153, 
49034475 154, 49350896 155, 49669886 156, 49985684 157, 50302966 158, 50618344 159, 50933988 160, 51250220 161, 51567040 162, 51885834 163, 52201086 164, 52517444 165, 52832815 166, 53148340 167, 53464600 168, 53781112 169, 54118526 170, 54434282 171, 54751228 172, 55067194 173, 55382726 174, 55698804 175, 56014147 176, 56330120 177, 56649726 178, 56967400 179, 57283884 180, 57601404 181, 57918350 182, 58234666 183, 58550338 184, 58869594 185, 59184944 186, 59500378 187, 59816960 188, 60132821 189, 60449431 190, 60765964 191, 61083932 192, 61399800 193, 61716424 194, 62032768 195, 62349315 196, 62664980 197, 62981800 198, 63309197 199, 63628061 200, 63944965 201, 64260952 202, 64577590 203, 64894431 204, 65209830 205, 65525404 206, 65844016 207, 66160773 208, 66476788 209, 66792327 210, 67109336 211, 67424826 212, 67739868 213, 68058438 214, 68375216 215, 68691322 216, 69008170 217, 69324766 218, 69640186 219, 69956320 220, 70272090 221, 70591780 222, 70908096 223, 71223348 224, 71539720 225, 71855182 226, 72170448 227, 72529289 228, 72854768 229, 73172932 230, 73491873 231, 73812788 232, 74132177 233, 74450908 234, 74770129 235, 75093508 236, 75410881 237, 75728058 238, 76045123 239, 76365569 240, 76683824 241, 77002506 242, 77324191 243, 77642572 244, 77961282 245, 78279817 246, 78596630 247, 78915305 248, 79233868 249, 79552634 250, 79875516 251, 80194058 252, 80512754 253, 80831597 254, 81151700 255, 81470620 266, 81834662
0, 329994 1, 700805 2, 1055285 3, 1401722 4, 1750224 5, 2096297 6, 2433767 7, 2764090 8, 3091312 9, 3417225 10, 3743502 11, 4069653 12, 4397246 13, 4722543 14, 5047273 15, 5399695 16, 5724985 17, 6049575 18, 6373717 19, 6700295 20, 7025263 21, 7349384 22, 7674674 23, 8000055 24, 8324981 25, 8650236 26, 8976373 27, 9301397 28, 9628493 29, 9953895 30, 10278905 31, 10605672 32, 10931907 33, 11257043 34, 11582494 35, 11907399 36, 12232318 37, 12557321 38, 12882086 39, 13207215 40, 13533835 41, 13860651 42, 14185801 43, 14510629 44, 14888741 45, 15213534 46, 15538747 47, 15864107 48, 16189026 49, 16513959 50, 16838878 51, 17163097 52, 17488149 53, 17814510 54, 18139898 55, 18465265 56, 18790891 57, 19115691 58, 19440932 59, 19766194 60, 20091400 61, 20416277 62, 20740832 63, 21065807 64, 21390649 65, 21714623 66, 22039920 67, 22364965 68, 22690087 69, 23016049 70, 23341507 71, 23665915 72, 24054205 73, 24380426 74, 24704785 75, 25029984 76, 25354462 77, 25680172 78, 26005098 79, 26330276 80, 26654068 81, 26978490 82, 27303843 83, 27630302 84, 27955235 85, 28279839 86, 28606942 87, 29037477 88, 29370565 89, 29695519 90, 30020396 91, 30345644 92, 30671074 93, 30996931 94, 31321409 95, 31645971 96, 31971198 97, 32297503 98, 32622933 99, 32947271 100, 33290887 101, 33616562 102, 33941705 103, 34266995 104, 34592383 105, 34917015 106, 35241899 107, 35566125 108, 35890715 109, 36215193 110, 36541386 111, 36866599 112, 37192981 113, 37519013 114, 37844212 115, 38169551 116, 38495499 117, 38821944 118, 39147766 119, 39472755 120, 39797807 121, 40122061 122, 40447596 123, 40773404 124, 41099667 125, 41425055 126, 41750394 127, 42074837 128, 329973 129, 692237 130, 1050119 131, 1394449 132, 1742629 133, 2109072 134, 2439416 135, 2768787 136, 3095575 137, 3420998 138, 3746267 139, 4072194 140, 4481806 141, 4805766 142, 5132162 143, 5458481 144, 5782434 145, 6106835 146, 6430802 147, 6760726 148, 7083335 149, 7406868 150, 7731563 151, 8056468 152, 8381562 153, 8705018 154, 9031946 
155, 9355192 156, 9678354 157, 10000473 158, 10324314 159, 10646748 160, 10969630 161, 11298203 162, 11620567 163, 11943323 164, 12265253 165, 12588065 166, 12913068 167, 13236972 168, 13584578 169, 13910085 170, 14231560 171, 14554190 172, 14878570 173, 15200416 174, 15522122 175, 15845025 176, 16170476 177, 16493617 178, 16815806 179, 17138037 180, 17460394 181, 17782072 182, 18104555 183, 18432176 184, 18754428 185, 19076148 186, 19397770 187, 19720260 188, 20043128 189, 20365877 190, 20693547 191, 21016156 192, 21339353 193, 21661857 194, 21984158 195, 22305402 196, 22627871 197, 22970878 198, 23291898 199, 23614227 200, 23937928 201, 24260355 202, 24582208 203, 24904404 204, 25231283 205, 25553115 206, 25875157 207, 26198284 208, 26521894 209, 26844797 210, 27166972 211, 27488230 212, 27815116 213, 28136311 214, 28457975 215, 29644202 216, 29980587 217, 30302195 218, 30624433 219, 30947336 220, 31269945 221, 31593408 222, 31915772 223, 32314625 224, 32637157 225, 32959605 226, 33282704 227, 33604459 228, 33927138 229, 34250272 230, 34577991 231, 34899067 232, 35220472 233, 35542605 234, 35865158 235, 36189307 236, 36510761 237, 36833664 238, 37163714 239, 37485427 240, 37808246 241, 38130841 242, 38453506 243, 38776654 244, 39099004 245, 39425561 246, 39747190 247, 40069449 248, 40392170 249, 40714128 250, 41035918 251, 41384420 252, 41709689 253, 42031696 254, 42353647 255, 42673624 266, 43046192
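
The two data sets above can be compared by their final counter values. The sketch below assumes that the first data set is the single-threaded run and the second is the two-thread run; the data itself does not say so. The function names are chosen here for illustration.

```cpp
// Speedup as the ratio of baseline runtime to improved runtime.
double speedup(double baseline, double improved)
{
    return baseline / improved;
}

// Parallel fraction p implied by an observed speedup s on `cores` cores,
// obtained by solving Amdahl's law s = 1 / ((1 - p) + p / cores) for p.
double implied_parallel_fraction(double s, int cores)
{
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / cores);
}
```

Under that assumption, `speedup(81834662, 43046192)` is about 1.90, and `implied_parallel_fraction` of that speedup on 2 cores is about 0.95.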