On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote: > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > On Wed, 16 Sep 2015, Alan Lawrence wrote: > > > > > On 16/09/15 17:10, Bill Schmidt wrote: > > > > > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > > > > On 16/09/15 15:28, Bill Schmidt wrote: > > > > > > 2015-09-16 Bill Schmidt <wschm...@linux.vnet.ibm.com> > > > > > > > > > > > > * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, > > > > > > UNSPEC_REDUC_SMIN, > > > > > > UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, > > > > > > UNSPEC_REDUC_SMAX_SCAL, > > > > > > UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL, > > > > > > UNSPEC_REDUC_UMIN_SCAL): New enumerated constants. > > > > > > (reduc_smax_v2di): New define_expand. > > > > > > (reduc_smax_scal_v2di): Likewise. > > > > > > (reduc_smin_v2di): Likewise. > > > > > > (reduc_smin_scal_v2di): Likewise. > > > > > > (reduc_umax_v2di): Likewise. > > > > > > (reduc_umax_scal_v2di): Likewise. > > > > > > (reduc_umin_v2di): Likewise. > > > > > > (reduc_umin_scal_v2di): Likewise. > > > > > > (reduc_smax_v4si): Likewise. > > > > > > (reduc_smin_v4si): Likewise. > > > > > > (reduc_umax_v4si): Likewise. > > > > > > (reduc_umin_v4si): Likewise. > > > > > > (reduc_smax_v8hi): Likewise. > > > > > > (reduc_smin_v8hi): Likewise. > > > > > > (reduc_umax_v8hi): Likewise. > > > > > > (reduc_umin_v8hi): Likewise. > > > > > > (reduc_smax_v16qi): Likewise. > > > > > > (reduc_smin_v16qi): Likewise. > > > > > > (reduc_umax_v16qi): Likewise. > > > > > > (reduc_umin_v16qi): Likewise. > > > > > > (reduc_smax_scal_<mode>): Likewise. > > > > > > (reduc_smin_scal_<mode>): Likewise. > > > > > > (reduc_umax_scal_<mode>): Likewise. > > > > > > (reduc_umin_scal_<mode>): Likewise. > > > > > > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be > > > > > used if > > > > > the _scal are present. The non-_scal's were previously defined as > > > > > producing a > > > > > vector with one element holding the result and the other elements all > > > > > zero, and > > > > > this was only ever used with a vec_extract immediately after; the > > > > > _scal > > > > > pattern > > > > > now includes the vec_extract as well. Hence the non-_scal patterns are > > > > > deprecated / considered legacy, as per md.texi. > > > > > > > > Thanks -- I had misread the description of the non-scalar versions, > > > > missing the part where the other elements are zero. What I really > > > > want/need is an optab defined as computing the maximum value in all > > > > elements of the vector. > > > > > > Yes, indeed. It seems reasonable to me that this would coexist with an > > > optab > > > which computes only a single value (i.e. a scalar). > > > > So just to clarify - you need to reduce the vector with max to a scalar > > but want the (same) result in all vector elements? > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > reduction to scalar, followed by a scalar broadcast to get the value > into all positions. It happens that our most efficient expansion to > reduce to scalar will naturally produce the value in all positions. > Using the reduc_smax_scal_<mode> expander, we are required to extract > this into a scalar and then perform the broadcast. Those two operations > are unnecessary. We can clean up after the fact, but the cost model > will still reflect the unnecessary activity. > > > > > > At that point it might be appropriate to change the cond-reduction code to > > > generate the reduce-to-vector in all cases, and optabs.c expand it to > > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along > > > with > > > the (parallel) changes to cost model already proposed, does that cover > > > all the > > > cases? It does add a new tree code, yes, but I'm feeling that could be > > > justified if we go down this route. > > > > I'd rather have combine recognize an insn that does both (if it > > exists). As I understand powerpc doesn't have reduction to scalar > > (I think no ISA actually has this, but all ISAs have reduce to > > one vector element plus a cheap way of extraction (effectively a subreg)) > > but it's reduction already has all result vector elements set to the > > same value (as opposed to having some undefined or zero or whatever). > > The extraction is not necessarily so cheap. For floating-point values > in a register, we can indeed use a subreg. But in this case we are > talking about integers, and an extract must move them from the vector > registers to the GPRs. With POWER8, we can do this with a 5-cycle move > instruction. Prior to that, we can only do this with a very expensive > path through memory, as previous processors had no direct path between > the vector and integer registers. > > > > > > However, another point that springs to mind: if you reduce a loop > > > containing > > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these > > > using shifts, and I think will also use shifts for platforms not > > > possessing a > > > reduc_plus/min/max. If shifts could be changed for rotates, the code there > > > would do your reduce-to-a-vector-of-identical-elements in the > > > midend...can we > > > (sensibly!) bring all of these together? > > If you think it's reasonable, I will take a look at whether this can > work. For targets like PowerPC and AArch32, perhaps we are better off > not defining an expander at all, but letting the vector epilogue code > expose the sequence of shifts/max-mins at the GIMPLE level. For this > case the vectorizer would then not generate the unnecessary extract and > broadcast.
Well, I don't see an easy way to do this either. We really need either a whole-vector rotate or a double-vector shift to do the expansion early, neither of which has a defined standard pattern name. So either way we would need to introduce a new GIMPLE code and a new optab entry. I'm still leaning towards interest in exploring the reduce-to-vector approach. Thanks, Bill > > However, I'd want to let Alan Hayward finish revising and committing his > patch first. > > Regards, > Bill > > > > > > > > Perhaps the practical thing is to have the vectorizer also do an > > > > add_stmt_cost with some new token that indicates the cost model should > > > > make an adjustment if the back end doesn't need the extract/broadcast. > > > > Targets like PowerPC and AArch32 could then subtract the unnecessary > > > > cost, and remove the unnecessary code in simplify-rtx. > > > > > > I think it'd be good if we could do it before simplify-rtx, really, > > > although > > > I'm not sure I have a strong argument as to why, as long as we can cost it > > > appropriately. > > > > > > > In any case, I will remove implementing the deprecated optabs, and I'll > > > > also try to look at Alan L's patch shortly. > > > > > > That'd be great, thanks :) > > > > As for the cost modeling the simple solution is of course to have > > these as "high-level" costs (a new enum entry), but the sustainable > > solution is to have the target see the vectorized code and decide > > for itself. Iff we'd go down the simple route then the target > > may as well create the vectorized code for that special high-level > > operation itself. > > > > Richard. > > >