Hi Elias,
I believe the gain of coalescing functions (or slicing the values
involved) is somewhat limited and occurs only when your APL values
are small. For large values, computing one function after the other
has better cache locality. And it has a
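For concreteness, "one function after the other" would look roughly like
the following C++ sketch, where each scalar function makes its own
complete pass over the value (names and types are illustrative only,
not taken from GNU APL's sources):

    #include <vector>
    #include <cstddef>

    // Each scalar function of -1+2+X runs as a separate pass over the
    // whole (large) value, reusing one intermediate buffer.
    std::vector<double> eval_separate(const std::vector<double> &X)
    {
        std::vector<double> t(X.size());
        for (std::size_t i = 0; i < X.size(); ++i) t[i] = 2.0 + X[i];  // 2+X
        for (std::size_t i = 0; i < t.size(); ++i) t[i] = 1.0 + t[i];  // 1+(2+X)
        for (std::size_t i = 0; i < t.size(); ++i) t[i] = -t[i];       // -(1+(2+X))
        return t;
    }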
Thanks, that's interesting indeed.
What about the idea of coalescing multiple functions so that each thread
can stream multiple operations in a row without synchronising? To me, it
would seem to be hugely beneficial if the expression -1+2+X could stream
the three operations (two additions, one negation).
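The streamed variant Elias describes would, as a sketch, fuse the three
scalar functions into a single pass, so that each thread applies all of
them to its own elements without synchronising in between (again
illustrative only, not GNU APL code):

    #include <vector>
    #include <cstddef>

    // -1+2+X evaluated element by element in one fused pass; with
    // OpenMP each thread streams its own slice, with no barrier
    // between the three scalar functions.
    std::vector<double> eval_fused(const std::vector<double> &X)
    {
        std::vector<double> z(X.size());
        const long long n = static_cast<long long>(X.size());
    #pragma omp parallel for
        for (long long i = 0; i < n; ++i)
            z[i] = -(1.0 + (2.0 + X[i]));
        return z;
    }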
Hi Elias,
I am working on it.
As a preparation, I have created a new command ]PSTAT that shows how
many CPU cycles the different scalar functions take. You can run the
new workspace ScalarBenchmark_1.apl to see the results (SVN 444).
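]PSTAT's own measurement code is not shown here; one common way to
count CPU cycles around a scalar function is the x86 time-stamp
counter, roughly as in this sketch (illustrative only, not ]PSTAT's
actual implementation):

    #include <x86intrin.h>   // __rdtsc()
    #include <cstddef>

    // Average CPU cycles per element for one scalar function applied
    // to N doubles.
    template <typename Fun>
    unsigned long long cycles_per_element(Fun f, double *data, std::size_t N)
    {
        const unsigned long long start = __rdtsc();
        for (std::size_t i = 0; i < N; ++i)
            data[i] = f(data[i]);
        return N ? (__rdtsc() - start) / N : 0;
    }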
Have the results of this been integrated into the interpreter?
On 1 August 2014 21:57, Juergen Sauermann wrote:
> Hi Elias,
>
> yes - actually a lot. I haven't looked through all files, but
> at 80, 60, and small core counts.
>
> The good news is that all results look plausible now. There are so
Hi Elias,
yes - actually a lot. I haven't looked through all files, but I have
looked at the 80, 60, and small core counts.
The good news is that all results look plausible now. There are some
variations in the data, of course, but the trend is clear:
The total time for OMP (the rightmost value in the plot, i.
Were you able to deduce anything from the test results?
On 11 May 2014 23:02, "Juergen Sauermann" wrote:
> Hi Elias,
>
> thanks, already interesting. If you could loop around the core count:
>
> for ((i=1; $i<=80; ++i)); do
>   ./Parallel $i
>   ./Parallel_OMP $i
> done
>
> then I could understand the data better.
Hi,
I guess I know what went wrong. The workload per thread was so small
(reading the CPU cycle counter and that was it) that the first threads
had already finished while the tasks were still being distributed.
Due to the lack of core binding, some cores would therefore be used
several times and c
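The core binding that was missing could, for example, be done like this
on Linux with GCC's OpenMP (a sketch; the function name is made up and
this is not the benchmark's actual code):

    #include <pthread.h>
    #include <sched.h>
    #include <omp.h>

    // Pin OpenMP thread i to core i so that a core-count benchmark
    // really runs on that many distinct cores.
    // Compile with: g++ -fopenmp -pthread
    void bind_threads_to_cores()
    {
    #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
        }
    }

With GCC's libgomp the same effect can also be had without code changes
by setting the GOMP_CPU_AFFINITY environment variable, for example
GOMP_CPU_AFFINITY="0-79".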
Hi Elias,
thanks, already interesting. If you could loop around the core count:
for ((i=1; $i<=80; ++i)); do
  ./Parallel $i
  ./Parallel_OMP $i
done
then I could understand the data better. I am also not sure if something
is wrong with the benchmark program. On my new 4-core with OMP I