This has been an active topic of discussion on irc recently.  I'll try to
summarise here the position I took there:
Ramana Radhakrishnan <raman...@nvidia.com> writes:
>> On 6 Aug 2024, at 4:14 PM, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>
>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> On 5 Aug 2024, at 18:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>>
>>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>>>> On 5 Aug 2024, at 12:01, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>>>>
>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>> This patch folds the SVE intrinsic svdiv into a vector of 1's in case
>>>>>>> 1) the predicate is svptrue and
>>>>>>> 2) dividend and divisor are equal.
>>>>>>> This is implemented in the gimple_folder for signed and unsigned
>>>>>>> integers.  Corresponding test cases were added to the existing test
>>>>>>> suites.
>>>>>>>
>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>>>>>>> regression.
>>>>>>> OK for mainline?
>>>>>>>
>>>>>>> Please also advise whether it makes sense to implement the same
>>>>>>> optimization for float types and if so, under which conditions?
>>>>>>
>>>>>> I think we should instead use const_binop to try to fold the division
>>>>>> whenever the predicate is all-true, or if the function uses _x
>>>>>> predication.  (As a follow-on, we could handle _z and _m too, using
>>>>>> VEC_COND_EXPR.)
>>>>>
>>>>> From what I can see const_binop only works on constant arguments.
>>>>
>>>> Yeah, it only produces a result for constant arguments.  I see now
>>>> that that isn't the case that the patch is interested in, sorry.
>>>>
>>>>> Is fold_binary a better interface to use?  I think it’d hook into the
>>>>> match.pd machinery for divisions at some point.
>>>>
>>>> We shouldn't use that from gimple folders AIUI, but perhaps I misremember.
>>>> (I realise we'd be using it only to test whether the result is constant,
>>>> but even so.)
>>>>
>>>> Have you (plural) come across a case where svdiv is used with equal
>>>> non-constant arguments?  If it's just being done on first principles
>>>> then how about starting with const_binop instead?  If possible, it'd be
>>>> good to structure it so that we can reuse the code for svadd, svmul,
>>>> svsub, etc.
>>>
>>> We’ve had a bit of internal discussion on this to get our ducks in a row.
>>> We are interested in having more powerful folding of SVE intrinsics
>>> generally and we’d like some advice on how best to approach this.
>>> Prathamesh suggested adding code to fold intrinsics to standard GIMPLE
>>> codes where possible when they are _x-predicated or have a ptrue
>>> predicate.  Hopefully that would allow us to get all the match.pd and
>>> fold-const.cc optimizations “for free”.
>>> Would that be a reasonable direction rather than adding custom folding
>>> code to individual intrinsics such as svdiv?
>>> We’d need to ensure that the midend knows how to expand such GIMPLE codes
>>> with VLA types and that the required folding rules exist in match.pd
>>> (though maybe they work already for VLA types?)
>>
>> Expansion shouldn't be a problem, since we already rely on that for
>> autovectorisation.
>>
>> But I think this comes back to what we discussed earlier, in the context
>> of whether we should replace divisions by constants with multi-instruction
>> alternatives.  My comment there was:
>>
>>   If people want to write out a calculation in natural arithmetic, it
>>   would be better to write the algorithm in scalar code and let the
>>   vectoriser handle it.  That gives the opportunity for many more
>>   optimisations than just this one.
>
> It’s been a while and apologies if I’m coming in a bit late in this and
> possibly that thinking has moved on.  I’ve always viewed ACLE as an
> extension to the language and thus fair game for compilers to optimise.
> For folks who really really need that instruction there’s also inline
> asm :)

But the language already provides division via /.  GCC doesn't support
that yet for VLA SVE vectors, but clang does, and Tejas is looking at
adding the corresponding support to GCC.
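To make the contrast concrete, here's a minimal sketch (the generic form
is what clang accepts for VLA SVE types today and what Tejas's work would
add to GCC; the function names are mine, for illustration only):

  #include <arm_sve.h>

  /* Generic C division: expresses the dataflow only, leaving
     instruction selection entirely to the compiler.  */
  svint32_t div_generic (svint32_t a, svint32_t b)
  {
    return a / b;
  }

  /* Intrinsic form: a request for the SDIV instruction in particular
     (modulo tweaks that are known to be improvements).  */
  svint32_t div_intrinsic (svint32_t a, svint32_t b)
  {
    return svdiv_x (svptrue_b32 (), a, b);
  }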
If people just want to add vectors, divide vectors, etc., without any
preference about implementation, IMO it's better to let them express that
directly with generic features, rather than force them to use
target-specific instruction-derived intrinsics like svdiv_x.  That would
also make the code more portable across targets.

So, on the "there's also inline asm" point: I'd argue that (with Tejas's
work) there's also generic C/C++ for people who just want to express
dataflow and let the compiler do the instruction selection.

One of the advantages of intrinsics (at least as currently implemented
for SVE, but I think in practice more generally) is that they let
programmers do vector instruction selection while leaving the compiler
to do things like register allocation, loop control, ivopts, and
addressing mode selection.  They can act as a form of high-level assembly.

So I think there are plausibly two constituencies: people who want
intrinsics to generate the corresponding architecture instructions (with
tweaks if the compiler can be sure that they are improvements) and people
who are using them to express an algorithm without any real opinion on
implementation.  The second group is probably best served by cross-target
SIMD frameworks, and so the users of the intrinsics would be the
frameworks themselves, rather than the end users (the users of the
frameworks).

If we say that svdiv_x is just an SVE-specific way of writing /, then
(after Tejas's work) we would have two ways of serving the second
constituency and no way of serving the first constituency (given the
limitations of inline asm wrt intrinsics).

In case it sounds otherwise, I'm not saying that we shouldn't optimise
intrinsics.  I'm instead saying that we should only interfere with the
user's vector instruction selection if the change would be an improvement
in all realistic scenarios.  I don't think lowering to generic gimple
division gives that level of confidence.

> The approach for implementing the ACLE intrinsics for both AArch32 and
> AArch64 used to be:
>
> 1. express the intrinsics with GNU C / C++ (see implementations in
>    arm_neon.h) if feasible and semantics match up.
> 2. fall back to gimple folding / representation if semantics matched up.
> 3. RTL unspecs (if no representation feasible, fall back to it)
>
> In the case of SVE VLA intrinsics there is no GNU C feasible, but if
> there was gimple representation possible shouldn't we go to that?

This should be feasible for VLA too.  ACLE already defines an optional
feature for it, which is what Tejas is working on implementing.

> With Advanced SIMD the behaviour the user sees is as per 1 above (see
> the implementation of the basic arithmetic operations for neon in GNU C).
> Is there any reason that SVE needs to be different in its treatment in
> the backend?

Integer division seems especially dangerous though, given the multiple
ways of trying to code-generate it.  Also, gimple sometimes carries
assumptions about undefined behaviour, such as undefined overflow,
whereas the intrinsics are defined to behave in the same way as the
underlying instructions.  For example, INT_MIN / -1 is well-defined when
performed by svdiv_x.
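A small example of that semantic gap (hypothetical function names; the
behaviour comes from the SDIV instruction, which defines INT_MIN / -1 to
wrap and division by zero to yield 0):

  #include <arm_sve.h>

  /* Follows SDIV: every lane is well-defined, including INT_MIN / -1
     (which wraps to INT_MIN) and division by zero (which gives 0).  */
  svint32_t vec_div_minus1 (svint32_t x)
  {
    return svdiv_x (svptrue_b32 (), x, svdup_s32 (-1));
  }

  /* The scalar C equivalent: undefined behaviour when x == INT_MIN,
     which a generic gimple division would be entitled to exploit.  */
  int scalar_div_minus1 (int x)
  {
    return x / -1;
  }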
------

To capture here a more general point from the irc discussion:

I think that if something can be expressed in gimple, we should also find
a way of expressing it in generic C/C++, if that seems useful to
programmers.  SIMD frameworks can then use those generic C/C++ features
rather than having to specialise for each target.  Yes, this would
require patches to frameworks, but that doesn't seem unreasonable.  The
point of the SIMD frameworks is to act as a bridge between the user and
the compiler.  And users will need to recompile to get optimised code
whatever approach we take, since they'll be relying on new compiler
behaviour to get the optimisation.

Let's say that a SIMD framework supports N different vector targets.  At
the moment, the process seems to be that each target defines its own
target-specific intrinsic for (say) vector addition.  SIMD frameworks
then have a 1->N map from addition to the target intrinsics.  Then, if
the assumption is that intrinsics just express dataflow, the expectation
is that the compiler will map those N target intrinsics back to +,
giving an N->1 mapping.  This seems like a hopelessly indirect way of
implementing a 1->1 mapping (programmer plus to gimple plus).  And, if a
framework tries to emulate one target's intrinsics using another, the
framework can end up with an N->N mapping.

One example given on irc (originally to make a different point) was:

  https://github.com/simd-everywhere/simde/blob/master/simde/arm/sve/add.h

This contains an emulation of svadd_s8_x.  In C and C++, that should
just be:

  return op1 + op2;

But the implementation is instead a 37-line function (and one that does
actually use vector + for two of the alternatives).  Providing generic
C/C++ features would help to avoid that, and would mean that the code
doesn't need to be expressed using references to target-specific
instructions when the intent is not to use those instructions in
particular.
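In other words (a sketch only: simde's simde_svint8_t is really a struct
on non-SVE targets, so this assumes the framework's type is, or wraps, a
type that supports the generic operators):

  /* With generic vector operators available on the framework's types,
     the whole dispatch could collapse to a single generic addition.  */
  static inline simde_svint8_t
  simde_svadd_s8_x (simde_svbool_t pg, simde_svint8_t op1, simde_svint8_t op2)
  {
    (void) pg;  /* _x predication: inactive lanes are don't-care.  */
    return op1 + op2;
  }

Thanks,
Richard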