On Fri, Sep 11, 2020 at 8:43 AM Richard Biener <richard.guent...@gmail.com> wrote:
> On Fri, Sep 11, 2020 at 8:27 AM Anton Youdkevitch
> <anton.youdkevi...@bell-sw.com> wrote:
> >
> > Richard,
> >
> > On Thu, Sep 10, 2020 at 12:03 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> >>
> >> On Wed, Sep 9, 2020 at 5:51 PM Anton Youdkevitch
> >> <anton.youdkevi...@bell-sw.com> wrote:
> >> >
> >> > The ThunderX2 chip has an odd property: nested scalar FP min and max
> >> > are slower than the logically equivalent sequence of compares and
> >> > branches.
> >>
> >> Always for any input data?
> >
> > If you mean data that exercises all the combinations of taken/not-taken
> > branches, then yes — the results for synthetic benchmarks are always
> > the same (+60%). I didn't check Inf/NaN inputs, though, as performance
> > is not a concern in those cases.
>
> I specifically was suggesting to measure the effect of branch mispredicts.
> You'll have the case of the first branch being mispredicted, the second
> branch being mispredicted and both branches being mispredicted.
> So how's the worst case behaving in comparison to the FP min/max
> back-to-back case?

Yes, I measured all four cases. However, since the data was static, this
might just have trained the branch predictor. The thing is that even the
best case has 3 FP instructions vs. 2 FP mins/maxes and is still almost
two times faster. The worst case has 5 FP instructions.

> Btw, did you try to use conditional moves / conditional compares (IIRC
> arm has some weird ccmp that might or might not come in handy)?

I did. FP conditional moves are notoriously slow on TX2. The
implementation that uses FP cmovs is several times worse than the min/max
or branch-based ones.

> >> > Here is the patch where I'm trying to implement that transformation.
> >> > Please advise whether the "combine" pass (actually after the pass
> >> > itself) is the appropriate place to do this.
> >> >
> >> > I was considering the possibility of implementing this in aarch64.md
> >> > (which would be much cleaner) but didn't manage to figure out how
> >> > to make fmin/fmax survive until the later passes and replace them
> >> > only then.
> >>
> >> +      || !SCALAR_FLOAT_MODE_P (GET_MODE (SET_SRC (PATTERN (insn)))))
> >> +    continue;
> >> ...
> >> +      if (code1 != SMIN && code1 != UMIN &&
> >> +          code1 != SMAX && code1 != UMAX)
> >> +        continue;
> >>
> >> you shouldn't see U{MIN,MAX} for float data.
> >
> > OK, thanks. Will fix that.
> >
> >> May I suggest to instead do this in a peephole2 or in another late
> >> machine-specific pass?
> >
> > Yes, sure, I'm basically asking for any suggestion. My idea is to move
> > it as late as possible since messing with control flow is generally a
> > bad idea. The current implementation is just a proof of concept. Do you
> > think postponing it until, let's say, shorten or peephole2 would be
> > enough?
>
> I think doing it as late as possible, possibly after sched2, is best
> since presumably the slowness really depends on back-to-back
> min(max(..)) (what about min (min (..))?), so if there's enough other
> instructions inbetween they behave reasonably again.

OK, understood. Thanks!

> Did you try if scheduling some insns inbetween the min/max operations
> would improve things? Thus, might it be reasonable to adjust the
> machine description to artificially constrain min/max latency?

Good point, thanks. The main difference between the branched and
non-branched versions is CPU utilization, so proper scheduling might
(should?) change this.

> >> Are nested vector FP min/max fast?
> >
> > The vector min/max are as fast as the scalar ones (ironically); it is
> > utilizing vector compare-and-branch that would be much slower: not only
> > does the ASIMD compare not affect the CC register (so additional
> > processing is required), but the mixed case also needs branches to deal
> > with every individual element of the vector. It seemed pretty much a
> > dead end, so I didn't bother to touch it.
>
> OK, I wasn't thinking of applying the same transform to vector code but
> using vector instructions in place of the scalar ones instead of branchy
> code. But if that doesn't make a difference ...

No, in this case it does not.

--
Thanks,
Anton

> Richard.