Hey!

The benchmark you posted, Cayetano, is:

  julia -e 'using Pkg; Pkg.add("BenchmarkTools"); using BenchmarkTools; N = 1000; A = rand(N, N); B = rand(N, N); @btime $A*$B'

This is a matrix multiplication that gets delegated to the underlying
BLAS, right?  Running it under ‘perf record’ confirms it:

--8<---------------cut here---------------start------------->8---
Samples: 139K of event 'cycles:u', Event count (approx.): 99624880590
Overhead  Command      Shared Object             Symbol
  35.27%  .julia-real  libblas.so.3.9.0          [.] dgemm_
   3.99%  .julia-real  libjulia-internal.so.1.8  [.] gc_mark_loop
   2.60%  .julia-real  libjulia-internal.so.1.8  [.] apply_cl
   1.06%  .julia-real  libjulia-internal.so.1.8  [.] jl_get_binding_
--8<---------------cut here---------------end--------------->8---

We’re using libblas.so (presumably from the ‘lapack’ package) and not
OpenBLAS, so no wonder it’s slow.

Could it be that:

  "LIBBLAS=-lopenblas"
  "LIBBLASNAME=libopenblas"

is ineffective?

I think we have a lead!

Ludo’.
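
PS: A rough way to re-check which BLAS shows up in a profile after
changing the build flags. This is only a sketch: the helper name
‘blas_objects’ is made up for illustration, and it simply assumes the
shared-object names in ‘perf report’ text contain “blas”. It reads that
text on standard input and prints the distinct matching names.

```shell
# Hypothetical helper: print the distinct BLAS-like shared objects
# named in 'perf report' text read from standard input.
blas_objects() {
    # Match names such as libblas.so.3.9.0 or libopenblas.so.0.
    grep -Eo 'lib[A-Za-z0-9_]*blas[A-Za-z0-9_.-]*\.so[0-9.]*' | sort -u
}
```

Something like ‘perf report --stdio | blas_objects’, run next to the
perf.data file, should do. Fed the four entries quoted above, it prints
only libblas.so.3.9.0. Assuming Julia 1.7 or later, this should also
list the BLAS libraries that LinearAlgebra actually loaded:

  julia -e 'using LinearAlgebra; display(BLAS.get_config())'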