I use clang 9 to build code that contains many arm64 intrinsics. I use
vmlaq_f32 to perform multiply-accumulate operations on the float32x4_t data
type. I expected an fmla instruction to be generated, but clang generates
an fmul and an fadd instruction instead. For a simple function this is not
an issue, but for a function that uses a lot of NEON registers clang 9
generates inefficient code that frequently stores NEON registers to and
loads them from the stack. If clang generated fmla instructions, the 32
NEON registers would be more than enough.
    BTW: I have tested a function that uses vmlaq_f32 heavily. If I build
it for armv7-a, clang generates very efficient code (it emits vmla
instructions in this case), but if I build it for armv8-a, the generated
code looks very inefficient, with many stores to and loads from the stack.
    Is there a way to force clang 9 to generate an fmla instruction for
vmlaq_f32? Thanks.
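
For illustration, here is a minimal sketch of the kind of kernel described
above. The function names mla_unfused and mla_fused are hypothetical and do
not come from the original code. The first function uses vmlaq_f32 as in the
question. The second uses vfmaq_f32, which ACLE defines as the fused
multiply-add intrinsic and which AArch64 compilers typically lower to fmla.
Whether switching intrinsics (or building with -ffp-contract=fast) removes
the stack spills in the real code has not been verified here. A scalar
fallback is included only so that the sketch compiles on non-ARM hosts.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* acc[i] += a[i] * b[i] using vmlaq_f32 (the intrinsic from the question).
 * Assumption: n is a multiple of 4. */
void mla_unfused(float *acc, const float *a, const float *b, size_t n)
{
#if HAVE_NEON
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vmlaq_f32(vacc, vld1q_f32(a + i), vld1q_f32(b + i));
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback for hosts without NEON. */
    for (size_t i = 0; i < n; ++i)
        acc[i] += a[i] * b[i];
#endif
}

/* Same computation using vfmaq_f32, the fused multiply-add intrinsic.
 * Assumption: n is a multiple of 4. Only used on AArch64 in this sketch. */
void mla_fused(float *acc, const float *a, const float *b, size_t n)
{
#if HAVE_NEON && defined(__aarch64__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vfmaq_f32(vacc, vld1q_f32(a + i), vld1q_f32(b + i));
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback for hosts without AArch64 NEON. */
    for (size_t i = 0; i < n; ++i)
        acc[i] += a[i] * b[i];
#endif
}
```

Compiling both functions with -O2 -S for an armv8-a target and comparing
the assembly should show which instructions each intrinsic produces. Note
that a fused multiply-add rounds once rather than twice, so results can
differ in the last bit from the fmul/fadd sequence.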