On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
> 	Thomas
>
> 2017-03-01  Thomas Koenig  <tkoe...@gcc.gnu.org>
>
> 	PR fortran/78379
> 	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
> 	reals.  Add fma to target options.
> 	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
> 	* generated/matmul_c10.c: Regenerated.
> 	* generated/matmul_c16.c: Regenerated.
> 	* generated/matmul_c4.c: Regenerated.
> 	* generated/matmul_c8.c: Regenerated.
> 	* generated/matmul_i1.c: Regenerated.
> 	* generated/matmul_i16.c: Regenerated.
> 	* generated/matmul_i2.c: Regenerated.
> 	* generated/matmul_i4.c: Regenerated.
> 	* generated/matmul_i8.c: Regenerated.
> 	* generated/matmul_r10.c: Regenerated.
> 	* generated/matmul_r16.c: Regenerated.
> 	* generated/matmul_r4.c: Regenerated.
> 	* generated/matmul_r8.c: Regenerated.
Actually, I see a problem, but one not related to this patch.

I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm) = NULL;

  if (matmul_p == NULL)
    {
      matmul_p = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_p = matmul_'rtype_code`_avx512f;
	      goto tailcall;
	    }
#endif  /* HAVE_AVX512F */
...
    }

tailcall:
  (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

So, even assuming all the matmul_p = stores are atomic, if you call matmul
from two or more threads at about the same time for the first time, the
first one may set matmul_p to vanilla and another thread may then run that
vanilla version (uselessly slow), etc.

As you don't care about the if (matmul_p == NULL) part being done in
multiple threads concurrently, I guess you could e.g. do:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm);  // <-- no NULL initializer needed
					  //     for a static variable

  void (*matmul_fn) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm);

  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
  if (matmul_fn == NULL)
    {
      matmul_fn = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_fn = matmul_'rtype_code`_avx512f;
	      goto finish;
	    }
#endif  /* HAVE_AVX512F */
...
    finish:
      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
    }

  (*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);

(i.e. make sure you read matmul_p exactly once per call and store it at
most once per thread).

	Jakub
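
For reference, here is a minimal standalone sketch of the load-once/store-once
dispatch pattern suggested above. impl_generic, impl_fast and
cpu_has_fast_path are made-up stand-ins for the vanilla/AVX512F variants and
the __cpu_model check; only __atomic_load_n/__atomic_store_n and
__ATOMIC_RELAXED are the real GCC builtins used in the snippet:

  /* Hypothetical implementations; in libgfortran these would be the
     vanilla and AVX512F matmul variants.  */
  #include <stddef.h>

  static int impl_generic (int x) { return x + 1; }  /* fallback          */
  static int impl_fast (int x)    { return x + 1; }  /* "tuned" variant   */

  static int cpu_has_fast_path (void) { return 1; }  /* pretend cpuid check */

  int
  dispatch (int x)
  {
    static int (*impl_p) (int);   /* static, so implicitly NULL */
    int (*impl_fn) (int);

    /* Read the cached pointer exactly once per call, atomically.  */
    impl_fn = __atomic_load_n (&impl_p, __ATOMIC_RELAXED);
    if (impl_fn == NULL)
      {
	/* Several threads may get here concurrently on the first calls;
	   each computes the same answer, so the racing relaxed stores
	   are harmless and tsan-clean.  */
	impl_fn = impl_generic;
	if (cpu_has_fast_path ())
	  impl_fn = impl_fast;
	__atomic_store_n (&impl_p, impl_fn, __ATOMIC_RELAXED);
      }

    return (*impl_fn) (x);
  }

  int
  main (void)
  {
    return dispatch (41) == 42 ? 0 : 1;  /* both variants just add 1 here */
  }

After the first call the relaxed atomic load is the only access to the shared
pointer, so there is no longer a plain read racing with a plain write; at
worst a few first-time callers select the same pointer and store it
redundantly.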