On Wed, 9 Jun 2021 at 00:17, Richard Henderson <richard.hender...@linaro.org> wrote:
>
> On 6/7/21 9:57 AM, Peter Maydell wrote:
> > +#define DO_VDUP(OP, ESIZE, TYPE, H) \
> > +    void HELPER(mve_##OP)(CPUARMState *env, void *vd, uint32_t val) \
> > +    { \
> > +        TYPE *d = vd; \
> > +        uint16_t mask = mve_element_mask(env); \
> > +        unsigned e; \
> > +        for (e = 0; e < 16 / ESIZE; e++, mask >>= ESIZE) { \
> > +            uint64_t bytemask = mask_to_bytemask##ESIZE(mask); \
> > +            d[H(e)] &= ~bytemask; \
> > +            d[H(e)] |= (val & bytemask); \
> > +        } \
> > +        mve_advance_vpt(env); \
> > +    }
> > +
> > +DO_VDUP(vdupb, 1, uint8_t, H1)
> > +DO_VDUP(vduph, 2, uint16_t, H2)
> > +DO_VDUP(vdupw, 4, uint32_t, H4)
>
> Hmm.  I think the masking should be done at either uint32_t or uint64_t.
> Doing it byte-by-byte is wasteful.

Mmm. I think some of this structure is a holdover from an initial
misinterpretation of the spec: that all these ops looked at the
predicate bit for the LS byte of the element to decide whether the
entire element was acted upon, in which case you do need to work
element-by-element with the right size. (This is actually true for
some operations, but mostly the predicate bits do bytewise masking
and can give you a partial chunk of a result element, as here.)

-- PMM
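
For illustration only, here is a minimal sketch of what a merge done at
uint64_t width with bytewise predication could look like. It is not
the code from the patch or from QEMU: expand_pred8(), merge64() and
vdup_sketch() are hypothetical names, the vector is assumed to be held
as two uint64_t in little-endian element order (so the H1/H2/H4 host
endianness adjustment is ignored), and the 16-bit mask is assumed to
carry one predicate bit per vector byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expand 8 predicate bits into a 64-bit byte mask,
 * so that predicate bit i set gives 0xff in byte i of the result. */
static uint64_t expand_pred8(uint8_t pred)
{
    uint64_t m = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (pred & (1u << i)) {
            m |= (uint64_t)0xff << (8 * i);
        }
    }
    return m;
}

/* Write 'val' into '*d' only in the bytes selected by the predicate. */
static void merge64(uint64_t *d, uint64_t val, uint8_t pred)
{
    uint64_t bytemask = expand_pred8(pred);
    *d = (*d & ~bytemask) | (val & bytemask);
}

/* Hypothetical predicated VDUP over a 128-bit vector held as two
 * uint64_t halves. 'esize' is the element size in bytes (1, 2 or 4);
 * 'mask' has one predicate bit per byte of the vector. */
static void vdup_sketch(uint64_t d[2], uint32_t val, unsigned esize,
                        uint16_t mask)
{
    uint64_t rep;

    assert(esize == 1 || esize == 2 || esize == 4);

    /* Replicate the element across all 64 bits once, up front. */
    switch (esize) {
    case 1:
        rep = (val & 0xff) * 0x0101010101010101ULL;
        break;
    case 2:
        rep = (val & 0xffff) * 0x0001000100010001ULL;
        break;
    default:
        rep = (uint64_t)val * 0x0000000100000001ULL;
        break;
    }

    /* Two 64-bit merges cover the vector, whatever the element size. */
    merge64(&d[0], rep, mask & 0xff);
    merge64(&d[1], rep, mask >> 8);
}
```

With a zeroed vector, val = 0xAABBCCDD, esize = 4 and mask = 0x0003,
only the low two bytes of the first 32-bit element are written, which
is the "partial chunk of a result element" case. An operation that
keys off the predicate bit for the LS byte of each element would need
a different mask expansion, which this sketch does not cover.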