On Wed, Sep 9, 2020 at 5:51 PM Anton Youdkevitch
<anton.youdkevi...@bell-sw.com> wrote:
>
> ThunderxT2 chip has an odd property that nested scalar FP min and max are
> slower than logically the same sequence of compares and branches.

Always for any input data?

> Here is the patch where I'm trying to implement that transformation.
> Please advise if the "combine" pass (actually after the pass itself) is the
> appropriate place to do this.
>
> I was considering the possibility to implement this in aarch64.md
> (which would be much cleaner) but didn't manage to figure out how
> to make fmin/fmax survive until later passes and replace them only
> then.

+         || !SCALAR_FLOAT_MODE_P (GET_MODE (SET_SRC (PATTERN (insn)))))
+       continue;
...
+      if (code1 != SMIN && code1 != UMIN &&
+          code1 != SMAX && code1 != UMAX)
+        continue;

You shouldn't see U{MIN,MAX} for float data.  May I suggest to instead
do this in a peephole2 or in another late machine-specific pass?

Are nested vector FP min/max fast?

Richard.

> --
> Thanks,
> Anton