> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk <[email protected]>
> wrote:
>
>> In [45]: %timeit np.add.reduce(a, axis=None)
>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> In [43]: %timeit dotsum(a)
>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> But theoretically, sum should be faster than a dot product by a fair bit.
>>
>> Isn’t parallelisation implemented for it?
>
> I cannot reproduce that:
>
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> In [4]: %timeit dotsum(a)
> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> But almost certainly it is indeed due to optimizations, since .dot uses
> BLAS, which is highly optimized (at least on some platforms, clearly
> better on yours than on mine!).
>
> I thought .sum() was optimized too, but perhaps less so?
I can confirm that it does not seem to use multithreading, at least: with the
conda-installed numpy+BLAS I almost exactly reproduce your numbers, whereas
linked against my own OpenBLAS build I get:
In [3]: %timeit np.add.reduce(a, axis=None)
19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# OMP_NUM_THREADS=1
In [4]: %timeit dotsum(a)
20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# OMP_NUM_THREADS=8
In [4]: %timeit dotsum(a)
9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
add.reduce shows no difference between the two settings and always stays at
<= 100 % CPU usage.
dotsum still scales better with larger matrices, e.g. ~4x for 1000x1000.
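
For anyone following along: the dotsum definition isn't shown in this thread,
but a minimal sketch of a dot-based sum along these lines (assuming a 2D
array; the names here are illustrative) would be

    import numpy as np

    def dotsum(arr):
        # Row sums via a BLAS matrix-vector product (where multithreaded
        # BLAS can kick in), followed by a final scalar reduction.
        ones = np.ones(arr.shape[1], dtype=arr.dtype)
        return arr.dot(ones).sum()

    a = np.random.random((1000, 1000))
    # Run with e.g. OMP_NUM_THREADS=1 vs OMP_NUM_THREADS=8 to compare
    # single- and multithreaded BLAS against np.add.reduce(a, axis=None).
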
Cheers,
Derek