On 17.05.19 18:16, Richard Henderson wrote: > On 5/15/19 1:31 PM, David Hildenbrand wrote: >> +#define DEF_VFAE(BITS) >> \ >> +static int vfae##BITS(void *v1, const void *v2, const void *v3, uint8_t m5) >> > > > First, because this *is* complicated stuff, can we find a way to use inline > functions instead of an undebuggable macro for this? Perhaps a different set > of wrappers than s390_vec_read_element##BITS, which always return uint32_t, so > that they have a constant signature?
For vfene I have for now +static inline uint64_t s390_vec_read_element(const S390Vector *v, uint8_t enr, + uint8_t es) +{ + switch (es) { + case MO_8: + return s390_vec_read_element8(v, enr); + case MO_16: + return s390_vec_read_element16(v, enr); + case MO_32: + return s390_vec_read_element32(v, enr); + case MO_64: + return s390_vec_read_element64(v, enr); + default: + g_assert_not_reached(); + } +} + Which we could reuse here. I'll try to look into using a inline function instead, passing in the element size and other flags, so the compiler can specialize. Thanks! > >> + if (zs && !data) { >> + if (cc == 3) { >> + first_byte = i * (BITS / 8); >> + cc = 0; /* match for zero */ >> + } else if (cc != 0) { >> + cc = 2; /* matching elements before match for zero */ >> + } >> + if (!rt) { >> + break; >> + } >> + } > > So here we are computing the second intermediate result. > >> + /* try to match with any other element from the other vector */ >> + for (j = 0; j < (128 / BITS); j++) { >> + if (data == s390_vec_read_element##BITS(v3, j)) { >> + any_equal = true; >> + break; >> + } >> + } > > And here the first intermediate result, > >> + /* invert the result if requested */ >> + any_equal = in ^ any_equal; > > ... inverted, if requested, > >> + if (cc == 3 && any_equal) { >> + first_byte = i * (BITS / 8); >> + cc = 1; /* matching elements, no match for zero */ >> + if (!zs && !rt) { >> + break; >> + } >> + } > >> + /* indicate bit vector if requested */ >> + if (rt && any_equal) { >> + s390_vec_write_element##BITS(&tmp, i, (uint##BITS##_t)-1ull); >> + } > > ... writing out (some of) the results of the first intermediate result. > >> + } >> + if (!rt) { >> + s390_vec_write_element8(&tmp, 7, first_byte); >> + } > > ... writing out the rest of the first intermediate result. > > I wonder if it wouldn't be clearer, within the loop, to do > > if (any_equal) { > if (cc == 3) { > first_byte = ...; > cc = 1; > } > if (rt) { > write element -1; > } else if (!zs) { > break; > } > } > > I also think that, if we create a bunch more of these wrappers: > >> +DEF_VFAE_HELPER(8) >> +DEF_VFAE_HELPER(16) >> +DEF_VFAE_HELPER(32) > > then RT and ZS can be passed in as constant parameters to the above, and then > the compiler will fold away all of the stuff that's not needed for each > different case. Which, I think, is significant. These are practically > different instructions with the different modifiers. > > > r~ > -- Thanks, David / dhildenb