> Message: 5
> Date: Mon, 1 Apr 2013 15:40:48 +0800
> From: Daniel Hartwig <mand...@gmail.com>
> To: guile-devel@gnu.org
> Subject: Re: Extremly slow for format & string-join
> On 1 April 2013 14:59, Daniel Llorens <daniel.llor...@bluewin.ch> wrote:
>> How can it be slower to allocate the result at once?
>
> Shrug. I do not know much of array internals. You probably have much
> more experience there than I.

Not much with the implementation, no :-/

> Except for the curious profile output, I suspect the overhead is due
> to such factors as repeated application of MAPFUNC and consequent
> arithmetic to access the shared arrays contents

mapfunc is used only to compute the strides and bounds; it isn't kept
beyond make-shared-array. But I hadn't considered that the profile might
be wrong. Indeed, the slow part is not make-typed-array but array-copy!:

scheme@(guile-user)> (define s "1234567890")
scheme@(guile-user)> (define t (make-shared-array s (lambda (i j) (list j)) 1000000 10))
scheme@(guile-user)> (define a (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> (define b (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> ,time (array-copy! t a)
;; 1.718000s real time, 1.710000s run time.  0.000000s spent in GC.
scheme@(guile-user)> ,time (array-copy! a b)
;; 1.673000s real time, 1.670000s run time.  0.000000s spent in GC.

Since I have no intuition for these numbers, I thought maybe it's really
this slow, or a cache problem, who knows:

scheme@(guile-user)> (import (rnrs bytevectors))
scheme@(guile-user)> (define x (make-bytevector 40000000))
scheme@(guile-user)> ,time (define y (bytevector-copy x))
;; 0.018000s real time, 0.020000s run time.  0.000000s spent in GC.

In NumPy (using doubles):

In [11]: %time a = np.zeros([1000000, 10])
CPU times: user 0.04 s, sys: 0.05 s, total: 0.09 s
Wall time: 0.09 s

In [12]: %time b = a+1
CPU times: user 0.04 s, sys: 0.05 s, total: 0.09 s
Wall time: 0.09 s

So it's really bad.

I had a look at libguile/array-map.c. There are three parts in there:

[1] scm_ramapc(). This is a general array traversal function. It does
linearization, so the (array-copy! a b) call above should resolve to a
single call to racp().

[2] array-copy!, array-fill!, array-map!, array-for-each, etc. These
all use scm_ramapc().

[3] A bunch of specializations: scm_ra_sum, scm_ra_difference, and so
on.

First, I think that all of [3] should go; it's dead code. This is the
first patch.

Second, array-map! and array-for-each cons on each application of proc.
The quick & dirty solution is to add 1-arg, 2-arg, etc. cases to
ramap() and rafe(). array-index-map! does its own traversal and can't
be linearized, so it can't be fixed as easily.

There are odd cases, too. For example, array-equal? calls
array_compare, which recurses on itself down to the last rank. This
means there's a function call on each and every array element.

I don't know whether fixing these problems piecemeal is worthwhile, or
whether the whole thing should be rewritten with a different approach:
either move to Scheme, where we have macros and can inline the inner
loops, or use a code generator to produce fixed-rank cases, etc.

Third, none of the above causes the slowness of array-copy!. I noticed
that there's a double indirection in element access in racp(). The
second patch removes it. Actually, this double indirection goes on all
over array-map.c and I don't understand why it's needed... It's only a
bit faster than before, though:

scheme@(guile-user)> (define s "1234567890")
scheme@(guile-user)> (define t (make-shared-array s (lambda (i j) (list j)) 1000000 10))
scheme@(guile-user)> (define a (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> (define b (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> ,time (array-copy! t a)
;; 1.187000s real time, 1.190000s run time.  0.000000s spent in GC.
scheme@(guile-user)> ,time (array-copy! a b)
;; 1.107000s real time, 1.110000s run time.  0.000000s spent in GC.

There's still the overhead of impl->, etc.
I'm thinking that one can do a direct memory copy when the array types
are the same, or even call memcpy() when the strides allow. I think
these should be relatively common cases.

Regards,

Daniel
0001-Remove-dead-code-in-array-map.c.patch
0002-Remove-double-indirection-in-element-access-in-array.patch