On Fri, Sep 11, 2020 at 8:43 AM Richard Biener <richard.guent...@gmail.com>
wrote:

> On Fri, Sep 11, 2020 at 8:27 AM Anton Youdkevitch
> <anton.youdkevi...@bell-sw.com> wrote:
> >
> > Richard,
> >
> > On Thu, Sep 10, 2020 at 12:03 PM Richard Biener <richard.guent...@gmail.com> wrote:
> >>
> >> On Wed, Sep 9, 2020 at 5:51 PM Anton Youdkevitch
> >> <anton.youdkevi...@bell-sw.com> wrote:
> >> >
> >> > ThunderX2 chip has an odd property that nested scalar FP min and max are
> >> > slower than logically the same sequence of compares and branches.
> >>
> >> Always for any input data?
> >
> > If you mean the data that makes it choose all the combinations of
> > taken/not taken branches then yes — the results for synthetics are always
> > the same (+60%). I didn't check Inf/NaNs, though, as in such
> > cases performance is not a concern.
>
> I specifically was suggesting to measure the effect of branch mispredicts.
> You'll have the case of the first branch being mispredicted, the second
> branch being mispredicted and both branches being mispredicted.
> So how's the worst case behaving in comparison to the FP min/max
> back-to-back case?
>
Yes, I measured all four cases. However, since the data was static, this
might just be training the branch predictor. The thing is that even the
best case has 3 FP insns vs 2 FP mins/maxes and is still almost twice as
fast. The worst case has 5 FP insns.


>
> Btw, did you try to use conditional moves / conditional compares (IIRC
> arm has some weird ccmp that might or might not come in handy)?
>
I did. FP conditional moves are notoriously slow on TX2. The implementation
that uses FP cmoves is several times worse than the min/max or branchy ones.


>
> >> > Here is the patch where I'm trying to implement that transformation.
> >> > Please advise if the "combine" pass (actually after the pass itself) is
> >> > the appropriate place to do this.
> >> >
> >> > I was considering the possibility to implement this in aarch64.md
> >> > (which would be much cleaner) but didn't manage to figure out how
> >> > to make fmin/fmax survive until later passes and replace them only
> >> > then.
> >>
> >> +             || !SCALAR_FLOAT_MODE_P (GET_MODE (SET_SRC (PATTERN (insn)))))
> >> +           continue;
> >> ...
> >> +         if (code1 != SMIN && code1 != UMIN &&
> >> +             code1 != SMAX && code1 != UMAX)
> >> +           continue;
> >>
> >> you shouldn't see U{MIN,MAX} for float data.
> >
> > OK, thanks. Will fix that.
> >
> >>
> >>
> >> May I suggest to instead do this in a peephole2 or in another late
> >> machine-specific pass?
> >
> > Yes, sure, I'm basically asking for any suggestion. My idea is to move
> > it as late as possible, since messing with control flow is generally a
> > bad idea. The current implementation is just a proof of concept. Do you
> > think postponing it until, say, shorten or peephole2 would be enough?
>
> I think doing it as late as possible, possibly after sched2, is best,
> since presumably the slowness really depends on back-to-back
> min(max(..)) (what about min (min (..))?), so if there are enough other
> instructions in between they behave reasonably again.
>
OK, understood. Thanks!


>
> Did you try if scheduling some insns in between the min/max operations
> would improve things?  Thus, might it be reasonable to adjust the
> machine description to artificially constrain min/max latency?
>
Good point, thanks. The main difference between the branched and
non-branched versions is CPU utilization, so proper scheduling might
(should?) change this.


>
> >>
> >>
> >> Are nested vector FP min/max fast?
> >
> > The vector min/max are (ironically) as fast as the scalar ones; it is
> > utilizing the vector compare and branch that would be much slower: it's
> > not just that the ASIMD compare does not affect the CC register and
> > additional processing is required, but also the number of branches
> > needed to deal with all the individual elements of the vector in the
> > mixed case. It seemed pretty much a dead end so I didn't bother to
> > touch it.
>
> OK, I wasn't thinking of applying the same transform to vector code but
> using vector instructions in place of the scalar ones instead of branchy
> code. But if that doesn't make a difference ...
>
No, in this case it does not.

-- 
  Thanks,
  Anton


>
> Richard.
>
> > --
> >   Thanks,
> >   Anton
> >
> >
> >
> >>
> >> Richard.
> >>
> >>
> >> >
> >> > --
> >> >   Thanks,
> >> >   Anton
>
