> Message: 5
> Date: Mon, 1 Apr 2013 15:40:48 +0800
> From: Daniel Hartwig <mand...@gmail.com>
> To: guile-devel@gnu.org
> Subject: Re: Extremly slow for format & string-join
> On 1 April 2013 14:59, Daniel Llorens <daniel.llor...@bluewin.ch> wrote:
>> How can it be slower to allocate the result at once?
>
> Shrug. I do not know much of array internals. You probably have much
> more experience there than I.

Not much with the implementation, no :-/

> Except for the curious profile output, I suspect the overhead is due
> to such factors as repeated application of MAPFUNC and consequent
> arithmetic to access the shared arrays contents

mapfunc is used only to compute the strides and bounds; it isn't kept
beyond make-shared-array. But I hadn't considered that the profile might
be wrong. Indeed, the slow part is not make-typed-array but array-copy!:

scheme@(guile-user)> (define s "1234567890")
scheme@(guile-user)> (define t (make-shared-array s (lambda (i j) (list j)) 1000000 10))
scheme@(guile-user)> (define a (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> (define b (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> ,time (array-copy! t a)
;; 1.718000s real time, 1.710000s run time.  0.000000s spent in GC.
scheme@(guile-user)> ,time (array-copy! a b)
;; 1.673000s real time, 1.670000s run time.  0.000000s spent in GC.

Since I have no intuition for these numbers, I thought maybe it's really
this slow, or a cache problem, who knows:

scheme@(guile-user)> (import (rnrs bytevectors))
scheme@(guile-user)> (define x (make-bytevector 40000000))
scheme@(guile-user)> ,time (define y (bytevector-copy x))
;; 0.018000s real time, 0.020000s run time.  0.000000s spent in GC.

In NumPy (using doubles):

In [11]: %time a = np.zeros([1000000, 10])
CPU times: user 0.04 s, sys: 0.05 s, total: 0.09 s
Wall time: 0.09 s

In [12]: %time b = a+1
CPU times: user 0.04 s, sys: 0.05 s, total: 0.09 s
Wall time: 0.09 s

So it's really bad.

I had a look at libguile/array-map.c. There are three parts in there:

[1] scm_ramapc(). This is a general array traversal function. It does
linearization, so the (array-copy! a b) call above should resolve to a
single call to racp().

[2] array-copy!, array-fill!, array-map!, array-for-each, etc. These
all use scm_ramapc().

[3] A bunch of specializations: scm_ra_sum, scm_ra_difference, and so
on.

First, I think that all of [3] should go; it's dead code. This is the
first patch.

Second, array-map! and array-for-each cons on each application of proc.
The quick & dirty solution is to add 1-arg, 2-arg, etc. cases to
ramap() and rafe(). array-index-map! does its own traversal and can't
be linearized, so it can't be fixed as easily.

There are odd cases, too. For example, array-equal? calls
array_compare, which recurses on itself down to the last rank. This
means there's a function call on each and every array element.

I don't know whether fixing these problems piecemeal is worthwhile, or
whether the whole thing should be rewritten with a different approach:
either move to Scheme, where we have macros and can inline the inner
loops, or use a code generator to produce fixed-rank cases, etc.

Third, none of the above causes the slowness of array-copy!. I noticed
that there's a double indirection in element access in racp(). The
second patch removes it. Actually, this double indirection goes on all
over array-map.c and I don't understand why it's needed... It's only a
bit faster than before, though:

scheme@(guile-user)> (define s "1234567890")
scheme@(guile-user)> (define t (make-shared-array s (lambda (i j) (list j)) 1000000 10))
scheme@(guile-user)> (define a (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> (define b (make-typed-array 'a *unspecified* 1000000 10))
scheme@(guile-user)> ,time (array-copy! t a)
;; 1.187000s real time, 1.190000s run time.  0.000000s spent in GC.
scheme@(guile-user)> ,time (array-copy! a b)
;; 1.107000s real time, 1.110000s run time.  0.000000s spent in GC.

There's still the overhead of impl->, etc.
I'm thinking that one can do a direct memory copy when the array types
are the same, or even call memcpy() when the strides allow. I think
these should be relatively common cases.

Regards,

Daniel
0001-Remove-dead-code-in-array-map.c.patch
0002-Remove-double-indirection-in-element-access-in-array.patch