Re: [Bug-apl] 80 core performance results

2014-08-23 Thread Juergen Sauermann

  
  
Hi Elias,
  
  I believe the gain from coalescing functions (or slicing the values
  involved) is rather limited and occurs
  only when your APL values are small. For large values, computing
  one function after the other has better cache
  locality. Coalescing also has a price: the runtime parser gets slower
  (you cannot always detect these sequences at ⎕FX time,
  and error reporting becomes dubious). And the scheme only works for
  scalar-like functions, so the number of
  functions in these sequences is small.
  
  What can be saved by slicing is essentially the memory allocation for
  the intermediate results and the fork/sync times for
  the intermediate functions. Memory allocation should already be
  reasonably fast, so there is little to gain there. And the
  fork/sync times, which are currently the biggest problem, need to
  go down significantly anyway (otherwise we can forget
  about parallel APL). The gain on fork/sync times is
  O(log(core-count) × (coalescing-length − 1)), which is not
  much: coalescing a chain of, say, 3 scalar functions saves 2 of the
  3 fork/sync rounds, each of which costs O(log(core-count)).
  
  The penalty of non-localization can be huge. For example:
  
    ]PSTAT 13
  ╔════════════════════════════════════════════════════════╗
  ║          Performance Statistics (CPU cycles)           ║
  ╠══════════╦══════════════════════╦══════════════════════╣
  ║          ║      first pass      ║  subsequent passes   ║
  ║ Function ╟──────┬──────┬────────╫──────┬──────┬────────╢
  ║          ║ N    │ μ    │ σ÷μ %  ║ N    │ μ    │ σ÷μ %  ║
  ╠══════════╬══════╪══════╪════════╬══════╪══════╪════════╣
  ║ A + B    ║    3 │  511 │   49 % ║ 6047 │   39 │   69 % ║
  ╚══════════╩══════╧══════╧════════╩══════╧══════╧════════╝
  
  The above shows 3 runs of A+B with integer, real, and complex data.
  The left columns show the (average)
  number of cycles for the first ravel element of each vector, while
  the right columns show the subsequent
  ravel elements. That is, in 1 2 3 4 5 + 1 2 3 4 5, the left columns
  show the average time for 1+1 while the
  right columns show the times for 2+2, 3+3, 4+4, and 5+5. This
  pattern is typical for all scalar functions, and
  my best explanation for it is (instruction) caching.
  
  Now the risk with coalescing is that if a function has a large
  instruction footprint, it could evict other
  functions from the cache, so that we would pay the first-pass cost
  (511 cycles above) on every pass, not only on
  the first, instead of the 39 cycles of the subsequent passes.
  
  I am planning to add more performance counters over time so that
  we have a more solid basis for this kind of
  discussion.
  
  /// Jürgen
  
  

On 08/22/2014 06:24 PM, Elias Mårtenson wrote:


  Thanks, that's interesting indeed.


What about the idea of coalescing multiple functions so
  that each thread can stream multiple operations in a row
  without synchronising? To me, it would seem hugely
  beneficial if the expression -1+2+X could stream the three
  operations (two additions, one negation) when generating the
  output. Would such a feature require much re-architecting of
  the application?
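
[For illustration only -- this is not GNU APL's evaluator, and the
  names are invented -- a coalesced (fused) pass over the ravel of
  -1+2+X, so that each element is read and written once instead of
  once per intermediate result, might look roughly like this C++
  sketch:

    #include <cstddef>
    #include <vector>

    // one fused pass: negate, add, add -- evaluated right to left
    // as in the APL expression -1+2+X
    static std::vector<double> fused_neg_add_add(const std::vector<double> & X)
    {
        std::vector<double> Z(X.size());
        for (std::size_t i = 0; i < X.size(); ++i)
            Z[i] = -(1.0 + (2.0 + X[i]));
        return Z;
    }
]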


Regards,
  Elias
  
  

On 22 August 2014 21:46, Juergen Sauermann wrote:
  
 Hi Elias,

I am working on it.

As a preparation I have created a new command ]PSTAT
that shows how many CPU cycles
the different scalar functions take. You can run the new
workspace ScalarBenchmark_1.apl to
see the results (SVN 444).

These numbers are very important for determining when to
switch from sequential to parallel execution.
The idea is to feed these numbers back into the
interpreter so that machines can tune themselves.
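
[A rough sketch of what such self-tuning could look like -- invented
names, not the actual interpreter code: run a scalar function in
parallel only when the measured per-element cost amortizes the
fork/sync overhead.

    #include <cstdint>

    // numbers as measured by ]PSTAT for one scalar function
    struct PerfCounters
    {
        uint64_t cycles_per_element;  // e.g. 39 for A+B, subsequent passes
        uint64_t fork_sync_cycles;    // cost of starting/joining the pool
    };

    // switch to parallel execution only above the break-even length
    static bool worth_parallelizing(const PerfCounters & pc,
                                    uint64_t ravel_length,
                                    unsigned core_count)
    {
        const uint64_t sequential = pc.cycles_per_element * ravel_length;
        const uint64_t parallel   = sequential / core_count
                                  + pc.fork_sync_cycles;
        return parallel < sequential;
    }
]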

The other thing is the lesson learned from your
benchmark results. As far as I can see, semaphores are
far too
slow for syncing the threads. The picture that is
currently evolving in my head is this: instead of 2
thread states
(blocked on semaphore/running), there
should be 3 states:

1. blocked on semaphore,
2. busy waiting on some flag in userspace, and
3. running (performing parallel work).
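
[A minimal sketch of that 3-state scheme -- hypothetical, not GNU APL
code -- could pair a userspace atomic flag with a semaphore fallback,
so that wake-ups during a computation avoid the kernel entirely:

    #include <atomic>
    #include <semaphore.h>

    class Worker
    {
    public:
        Worker() : go(false) { sem_init(&sem, 0, 0); }
        ~Worker()            { sem_destroy(&sem); }

        // master side: wake a worker in state 2 (cheap, userspace only)
        void kick()    { go.store(true, std::memory_order_release); }

        // master side: wake a worker in state 1 (expensive, kernel call)
        void unblock() { sem_post(&sem); }

        // worker side: wait for work in state 1 or 2, then return
        // to state 3 (running)
        void wait(bool busy_wait)
        {
            if (busy_wait)   // state 2: spin on a flag in userspace
            {
                while (!go.load(std::memory_order_acquire))
                    ;        // busy wait
                go.store(false, std::memory_order_relaxed);
            }
            else             // state 1: block on the semaphore
            {
                sem_wait(&sem);
            }
        }

    private:
        std::atomic<bool> go;
        sem_t sem;
    };
]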

[Bug-apl] Optimizations revived

2014-08-23 Thread Juergen Sauermann

  
  
Hi,
  
  I have revived Elias' in-place optimization for A⍴B and ,B,
  now using a different
  way of figuring out whether B is still in use. SVN 445.
  
  /// Jürgen
  

  




Re: [Bug-apl] Optimizations revived

2014-08-23 Thread Elias Mårtenson
Cool :-)

Speaking of this, a while ago I spent quite a bit of time trying to
figure out an easy way to get generic copy-on-write semantics, but I
never came to a satisfactory conclusion. I might revisit it later.

Have you thought about it? It is, after all, related to this specific
optimisation.

Regards,
Elias


On 23 August 2014 23:29, Juergen Sauermann wrote:

>  Hi,
>
> I have revived Elias' in-place optimization for *A⍴B* and *,B*, now using
> a different
> way of figuring out whether *B* is still in use. SVN 445.
>
> /// Jürgen
>
>


Re: [Bug-apl] Optimizations revived

2014-08-23 Thread Juergen Sauermann

  
  
Hi Elias,

Normally, APL values are not written to. The exception is indexed
assignment.

I believe the clone() call in Symbol::resolve()
can be skipped completely.
This is probably the most frequent clone() case. I
suppose copy-on-write semantics are
achieved once all clone() calls are gone. Many of the
remaining clone() calls are specific
to certain functions, so their performance impact should be small.
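
[A minimal sketch of that end state -- invented names, not the real
Value classes: a shared handle that clones lazily, so clone() happens
only on a write to a value that somebody else still holds:

    #include <memory>
    #include <vector>

    struct Cells { std::vector<int> ravel; };  // stand-in for an APL value

    class SharedValue       // copy-on-write handle
    {
    public:
        explicit SharedValue(Cells * c) : ptr(c) {}

        // reads never clone
        const Cells & read() const { return *ptr; }

        // indexed assignment etc.: clone only if the value is shared
        Cells & write()
        {
            if (ptr.use_count() > 1)
                ptr = std::make_shared<Cells>(*ptr);   // the only clone left
            return *ptr;
        }

    private:
        std::shared_ptr<Cells> ptr;
    };
]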

I haven't done the above before the 1.4 release because I didn't
want to ship
a not-so-well-tested optimization.

/// Jürgen


On 08/23/2014 05:32 PM, Elias Mårtenson wrote:


  Cool :-)


Speaking of this, a while ago I spent quite a bit of time
  trying to figure out an easy way to get generic
  copy-on-write semantics, but I never came to a satisfactory
  conclusion. I might revisit it later.


Have you thought about it? It is, after all, related to
  this specific optimisation.


Regards,
Elias
  
  


  On 23 August 2014 23:29, Juergen Sauermann wrote:
  
 Hi,

I have revived Elias' in-place optimization for A⍴B
and ,B, now using a different
way of figuring out whether B is still in use. SVN 445.

/// Jürgen

   
  


  


  




[Bug-apl] Uh oh... SVN 445

2014-08-23 Thread David Lamkins
I ran into this after updating to 445.

This is in a pendent function.

  ''≡0⍴value
0
  8⎕cr value
┌→───┐
│unix│
└────┘
  ''≡0⍴'unix'
1
  8⎕cr 'unix'
┌→───┐
│unix│
└────┘


-- 
"The secret to creativity is knowing how to hide your sources."
   Albert Einstein


http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/


[Bug-apl] Seeking clues regarding quote-quad prompt

2014-08-23 Thread David B. Lamkins
Back on the subject of aplwrap integration:

I'm seeing a GNU APL behavior that I don't understand and would
appreciate some hints on where to look. I don't necessarily consider
the following behavior to be buggy; I just want to figure out how
and why it's happening so I can dig into the code.

Quick background: aplwrap spawns GNU APL with pipes for stdin, stdout
and stderr. Pretty much everything works as expected, except for some
puzzling behavior w.r.t. a ⍞ prompt followed by a ⍞ input.

What I'm seeing (by dumping stdout and stderr) is that the prompt is
showing up on both stdout and stderr.

From what I've been able to read so far, I think this is how it happens:

With --rawCIN, get_user_line() calls no_readline() with the prompt text.
no_readline() then ungetc()s the entire prompt so it'll be available for
⍞ input.
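
[As I understand it, the push-back mechanism amounts to something like
the following simplification -- this is not the actual GNU APL source,
and note that ISO C only guarantees a single character of ungetc()
push-back, so multi-character push-back relies on the C library being
more generous:

    #include <cstdio>
    #include <cstring>

    // push the prompt back onto stdin so that subsequent reads
    // (i.e. the ⍞ input) see it first; ungetc() takes one character
    // at a time, so walk the prompt backwards to preserve its order
    static void push_back_prompt(const char * prompt)
    {
        for (std::size_t i = std::strlen(prompt); i > 0; --i)
            std::ungetc(prompt[i - 1], stdin);
    }
]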

I think that aplwrap must see the pushed-back prompt and echo it to
stdout. That's fine. I can deal with that.

But then almost the same prompt appears on stderr. I can't figure out
how that happens. I say "almost the same" because the prompt text on
stderr may have pad characters in place of blanks, assuming that the
prompt used a nested vector. For that reason, I'm convinced that aplwrap
isn't somehow involved; the stderr prompt must come from GNU APL.

What I can't understand is how the ⍞ prompt *ever* shows up on stderr.
From what I've read, it looks like the prompt always goes to COUT.

Clues will be much appreciated...