On Wed, 9 Jun 2021 at 00:17, Richard Henderson
<richard.hender...@linaro.org> wrote:
>
> On 6/7/21 9:57 AM, Peter Maydell wrote:
> > +#define DO_VDUP(OP, ESIZE, TYPE, H)                                     \
> > +    void HELPER(mve_##OP)(CPUARMState *env, void *vd, uint32_t val)     \
> > +    {                                                                   \
> > +        TYPE *d = vd;                                                   \
> > +        uint16_t mask = mve_element_mask(env);                          \
> > +        unsigned e;                                                     \
> > +        for (e = 0; e < 16 / ESIZE; e++, mask >>= ESIZE) {              \
> > +            uint64_t bytemask = mask_to_bytemask##ESIZE(mask);          \
> > +            d[H(e)] &= ~bytemask;                                       \
> > +            d[H(e)] |= (val & bytemask);                                \
> > +        }                                                               \
> > +        mve_advance_vpt(env);                                           \
> > +    }
> > +
> > +DO_VDUP(vdupb, 1, uint8_t, H1)
> > +DO_VDUP(vduph, 2, uint16_t, H2)
> > +DO_VDUP(vdupw, 4, uint32_t, H4)
>
> Hmm.  I think the masking should be done at either uint32_t or uint64_t.
> Doing it byte-by-byte is wasteful.

Mmm. I think some of this structure is a holdover from an initial
misinterpretation of the spec: I had read it as saying that all these ops
look at the predicate bit for the LS byte of the element to decide whether
the entire element is acted upon, in which case you do need to work
element-by-element with the right size. (This is actually true for some
operations, but mostly the predicate bits do bytewise masking and can give
you a partial chunk of a result element, as here.)
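
To illustrate the distinction, here is a rough standalone sketch, not the
QEMU code. The function names are made up for this example, and it assumes
one predicate bit per byte lane within a single 32-bit chunk. The first
merge applies the predicate bytewise, so it can update half of a 16-bit
element; the second keys off only the predicate bit for each element's LS
byte and updates whole 16-bit elements.

```c
#include <stdint.h>

/* Hypothetical helper: expand 4 predicate bits (one per byte lane)
 * into a 32-bit mask with 0xff in every enabled byte. */
uint32_t pred_to_bytemask32(unsigned pred)
{
    uint32_t m = 0;
    for (unsigned i = 0; i < 4; i++) {
        if (pred & (1u << i)) {
            m |= 0xffu << (i * 8);
        }
    }
    return m;
}

/* Bytewise predication: each byte of 'val' is merged into 'old'
 * independently, so an element may be only partially updated. */
uint32_t merge_bytewise(uint32_t old, uint32_t val, unsigned pred)
{
    uint32_t bm = pred_to_bytemask32(pred);
    return (old & ~bm) | (val & bm);
}

/* Whole-element predication for 16-bit elements: only the predicate
 * bit for the LS byte of each element (bits 0 and 2) is consulted,
 * and it enables or disables the entire element. */
uint32_t merge_by_element16(uint32_t old, uint32_t val, unsigned pred)
{
    unsigned ls = pred & 0x5;          /* keep bits 0 and 2 only */
    unsigned widened = ls | (ls << 1); /* copy each to its MS-byte lane */
    return merge_bytewise(old, val, widened);
}
```

With old = 0x11223344, val = 0xaabbccdd and pred = 0b0001, the bytewise
merge gives 0x112233dd (only the low byte of the low halfword changes),
whereas the whole-element merge gives 0x1122ccdd.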

-- PMM
