On 6/6/19 12:45 PM, Peter Maydell wrote: > Convert the VFP VMLA instruction to decodetree. > > This is the first of the VFP 3-operand data processing instructions, > so we include in this patch the code which loops over the elements > for an old-style VFP vector operation. The existing code to do this > looping uses the deprecated cpu_F0s/F0d/F1s/F1d TCG globals; since > we are going to be converting instructions one at a time anyway > we can take the opportunity to make the new loop use TCG temporaries, > which means we can do that conversion one operation at a time > rather than needing to do it all in one go. > > We include an UNDEF check which was missing in the old code: > short-vector operations (with stride or length non-zero) were > deprecated in v7A and must UNDEF in v8A, so if the MVFR0 FPShVec > field does not indicate that support for short vectors is present > we UNDEF the operations that would use them. (This is a change > of behaviour for Cortex-A7, Cortex-A15 and the v8 CPUs, which > previously were all incorrectly allowing short-vector operations.) > > Note that the conversion fixes a bug in the old code for the > case of VFP short-vector "mixed scalar/vector operations". These > happen where the destination register is in a vector bank but > but the second operand is in a scalar bank. For example > vmla.f64 d10, d1, d16 with length 2 stride 2 > is equivalent to the pair of scalar operations > vmla.f64 d10, d1, d16 > vmla.f64 d8, d3, d16 > where the destination and first input register cycle through > their vector but the second input is scalar (d16). In the > old decoder the gen_vfp_F1_mul() operation uses cpu_F1{s,d} > as a temporary output for the multiply, which trashes the > second input operand. For the fully-scalar case (where we > never do a second iteration) and the fully-vector case > (where the loop loads the new second input operand) this > doesn't matter, but for the mixed scalar/vector case we > will end up using the wrong value for later loop iterations. > In the new code we use TCG temporaries and so avoid the bug. > This bug is present for all the multiply-accumulate insns > that operate on short vectors: VMLA, VMLS, VNMLA, VNMLS. > > Note 2: the expression used to calculate the next register > number in the vector bank is not in fact correct; we leave > this behaviour unchanged from the old decoder and will > fix this bug later in the series. > > Signed-off-by: Peter Maydell <peter.mayd...@linaro.org> > --- > target/arm/cpu.h | 5 + > target/arm/translate-vfp.inc.c | 205 +++++++++++++++++++++++++++++++++ > target/arm/translate.c | 14 ++- > target/arm/vfp.decode | 6 + > 4 files changed, 224 insertions(+), 6 deletions(-)
Reviewed-by: Richard Henderson <richard.hender...@linaro.org> r~