On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, October 15, 2024 1:20 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: RE: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > VEC_PERM_EXPR
> >
> > On Tue, 15 Oct 2024, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Tuesday, October 15, 2024 12:13 PM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends
> > > > using VEC_PERM_EXPR
> > > >
> > > > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the look,
> > > > >
> > > > > The 10/15/2024 09:54, Richard Biener wrote:
> > > > > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > This patch series adds support for a target to do a direct conversion
> > > > > > > for zero extends using permutes.
> > > > > > >
> > > > > > > To do this it uses a target hook use_permute_for_promotion which must
> > > > > > > be implemented by targets.  This hook is used to indicate:
> > > > > > >
> > > > > > > 1. can a target do this for the given modes.
> > > > > >
> > > > > > can_vec_perm_const_p?
> > > > > >
> > > > > > > 3. can the target convert between various vector modes with a
> > > > > > >    VIEW_CONVERT.
> > > > > >
> > > > > > We have modes_tieable_p for this I think.
> > > > > >
> > > > >
> > > > > Yes, though the reason I didn't use either of them was because they are
> > > > > reporting a capability of the backend.  In which case the hook, which is
> > > > > already backend specific, should answer these two.
> > > > >
> > > > > I initially had these checks there, but they didn't seem to add value:
> > > > > for promotions the masks are only dependent on the input and output
> > > > > modes, so they really don't change.
> > > > >
> > > > > When you have, say, a loop that does lots of conversions from char to
> > > > > int, it seemed like a waste to retest the same permute constants over
> > > > > and over again.
> > > > >
> > > > > I can add them back in if you prefer...
> > > > >
> > > > > > > 2. is it profitable for the target to do it.
> > > > > >
> > > > > > So you say the target can do both ways but both zip and tbl are
> > > > > > permute instructions so I really fail to see the point and why
> > > > > > the target itself doesn't choose to use tbl for unpack.
> > > > > >
> > > > > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > > > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > > > > >
> > > > >
> > > > > Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> > > > > The reason for exposing them as VEC_PERM was to trigger further
> > > > > optimizations.
> > > > >
> > > > > If you remember the ticket about LOAD_LANES, with this optimization and
> > > > > an open encoding of LOAD_LANES we stop using it in cases where there's a
> > > > > zero extend after the LOAD_LANES, because then you're doing effectively
> > > > > two permutes and the LOAD_LANES is no longer beneficial.
> > > > > There are other examples, load and replicate etc.
> > > > >
> > > > > > That said, I'm not against supporting VEC_PERM code gen from
> > > > > > unsigned promotion but I don't see why we should do this when
> > > > > > the target advertises VEC_UNPACK_* support or direct conversion
> > > > > > support?
> > > > > >
> > > > > > Esp. with adding a "local" cost related hook which cannot take
> > > > > > into account context.
> > > > > >
> > > > >
> > > > > To summarize a long story:
> > > > >
> > > > > Yes, I open encode zero extends as permutes to allow further
> > > > > optimizations.  One could convert vec_unpacks to convert optabs and use
> > > > > that, but that is an opaque value that can't be further optimized.
> > > > >
> > > > > The hook isn't really a costing thing in the general sense.  It's
> > > > > literally just "do you want permutes, yes or no".  The reason it gets the
> > > > > modes is simply that I don't think a single-level extend is worth it, but
> > > > > I can just change it to never try to do this on more than one level.
> > > >
> > > > When you mention LOAD_LANES we do not expose "permutes" in them on GIMPLE
> > > > either, so why should we for VEC_UNPACK_*.
> > >
> > > I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I hope
> > > to correct in GCC-16.  Or at least the time we pick LOAD_LANES is too early.
> > > So I don't think pointing to this is a convincing argument.  It's only VLA
> > > that I think needs the IL, because you have to mask the group of operations
> > > and it may be hard to reconcile that later on.
> > >
> > > > At what level are the simplifications you see happening then?
> > >
> > > Well, they are currently happening outside of the vectorizer passes
> > > themselves, more specifically in this case because VN runs match
> > > simplifications.
> >
> > But match doesn't simplify permutes against .LOAD_LANES?  So it's about
> > "other" permutes (from loads) that get simplified?
>
> Yes, or other permutes after the zero extend.  I shouldn't have mentioned
> LOAD_LANES; I think that moved the discussion to the wrong place.
>
> > > If the concern is that that's late I can lift it to a pattern I suppose.
> > > I didn't use a pattern because similar changes in this area always just
> > > happened at codegen.
> >
> > I was wondering how this plays with my idea of having us "lower"
> > or rather "code generate" to an intermediate SLP representation where
> > we split SLP groups on vector boundaries and are then free to
> > perform permute optimizations that need to know the vector type.
> >
> > That said - match could as well combine VEC_UNPACK_* with a VEC_PERMUTE
> > with the catch that this duplicates patterns for the
> > VEC_UNPACK_*/VEC_PERMUTE duality we have.
> >
> > > > I do realize we have two ways of expressing zero-extending widenings
> > > > (also truncations btw) and that's always bad - so we could decide to
> > > > _always_ use VEC_PERMs as the canonical representation because those
> > > > combine more easily.  And either match VEC_PERMs back to vec_unpack
> > > > at RTL expansion time or require targets to expose those as constant
> > > > vec_perms as well.
> > > > There are targets like GCN where you can't do
> > > > unpacking with permutes of course, so we can't do away with them
> > > > (we could possibly force those targets to expose widening/truncation
> > > > solely with [us]ext and trunc patterns of course).
> > >
> > > Ok, so your objection is that you don't want to have a different way of
> > > doing a single-step zero extend vs. a multi-step zero extend.
> >
> > My objection is mainly that we do this based on a target decision and
> > without immediate effect on the vector loop and its costing - it's not
> > that we are then able to see we can combine the permutes with others,
> > say in SLP permute optimization.
>
> I can fix that by lifting the code up as a pattern so it does affect costing
> directly and also gets seen by the vectorizer's permute simplification.
>
> I agree that that would be a better place for it.  Does that address the
> issue?  Then at least the target decision directly affects vectorization
> like other patterns.
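>
> For concreteness, the open-coded form is a single permute against a zero
> vector followed by a view-convert.  For a little-endian V16QI -> V2DI zero
> extend the first result looks roughly like this (a hand-written sketch of
> the shape, not actual dump output):
>
>   _1 = VEC_PERM_EXPR <x_16qi, { 0, ... }, { 0, 16, 16, 16, 16, 16, 16, 16,
>                                             1, 16, 16, 16, 16, 16, 16, 16 }>;
>   _2 = VIEW_CONVERT_EXPR <vector(2) long long unsigned int> (_1);
>
> Every index >= 16 selects a zero byte from the second operand, so each
> eight-byte group spells out one zero-extended value directly.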

So - how can you teach the SLP permute optimization to treat converts as
permutes?  I think since you can't really do this as a pattern it doesn't
fit a VEC_PERM SLP node either?  Or maybe you can have
VEC_PERM <{a}, {0}, { [0:0], [1:0] }> followed by a node with a
VIEW_CONVERT_EXPR to a wider element type?  So it might be fully
implementable in SLP permute optimization?

> > > At the moment my patch doesn't care: if you return an unconditional true
> > > then for that target you get VEC_PERM for everything and the vectorizer
> > > won't ever spit out VEC_UNPACKU.
> > >
> > > You're arguing that this should be the default, even if the target does
> > > not support it, and then we have to somehow undo it during vec_lowering?
> >
> > I argued that we possibly should do this by default and all targets
> > that can vec_unpack but not vec_perm_const with such a permute can
> > either implement the missing vec_perm_const or they are of the kind
> > that cannot use a permute for this (!modes_tieable_p).
>
> Ok, and I assume this would catch targets like GCN?  I don't know much about
> what can be converted or not there.  I'll go check their modes_tieable_p.

GCN can't pun a V8HI to a V4SI vector, yes.

> > > Otherwise if the target doesn't support the permute it'll be scalarized...
> > >
> > > I guess sure...  But then...
> > >
> > > > There are targets like GCN where you can't do
> > > > unpacking with permutes of course, so we can't do away with them
> > > > (we could possibly force those targets to expose widening/truncation
> > > > solely with [us]ext and trunc patterns of course).
> > >
> > > I guess if can_vec_perm_const_p fails we can undo it...  But it feels like
> > > we lose an element of preference here.  A target *could* do the permute,
> > > but not do it efficiently.
> >
> > It can do it the same way it would do the vec_unpack?  Or what am I
> > missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?
>
> It replicates a series of them, yeah.  What I meant with the above is what
> should happen for targets that haven't implemented vec_perm_const, but I
> suppose the previous paragraph addresses this.
>
> > > > > I think there's a lot of merit in open-encoding zero extends, but one
> > > > > reason this is beneficial on AArch64 for instance is that we can
> > > > > consume the zero register and rewrite the indices to a single-register
> > > > > TBL.  Two-register TBLs are slower on some implementations.
> > > >
> > > > But this latter fact can be done by optimizing the RTL?
> > >
> > > Sure, and we do so today.  That's why the example output in the cover
> > > letter has only one input register.  The point of this blurb was more to
> > > point out that the optimization being beneficial may depend on a specific
> > > uarch and as such I believe that a certain element of target buy-in is
> > > needed.
> >
> > If it's dependent on uarch then even more so - why not simply
> > expand vec_unpack as tbl then?
>
> We expand them as ZIPs, because these don't require a lookup table index.
> However, again, these are only single-level unpacks.  It doesn't work for
> this case of multi-level unpacks.  For something like byte -> long, or worse
> byte -> double, the number of instructions to match in combine would exceed
> its limit.
>
> Additionally they require a lot of patterns.  So simply, we cannot recombine
> multi-level unpacks in RTL.
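>
> (As an aside on the single-register TBL: the architecture defines TBL to
> write zeroes for out-of-range index bytes, so once the second operand is
> known to be all-zero, the indices that pointed at it can be rewritten to
> out-of-range values and the zero source dropped.  For the byte -> long case
> the first index vector then becomes, roughly - the exact constants here are
> illustrative, not lifted from compiler output:
>
>   { 0, 255, 255, 255, 255, 255, 255, 255, 1, 255, 255, 255, 255, 255, 255, 255 }
>
> which needs only the one source register.)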
>
> The backend however will do something sensible given a VEC_PERM_EXPR.
> But I think this is just a detail we're getting into.
>
> It sounds like you're ok with doing it unconditionally for any target that
> supports the permutes, and lifting it pre-analysis (like in a pattern) so
> it's costed?

I _think_ that I'd be OK to do this as canonicalization, but as said it
requires buy-in and work in all targets.  We should be able to get rid of
VEC_UNPACK_HI/LO as a tree code then; GCN doesn't (cannot) implement any of
those but uses [sz]ext/trunc exclusively IIRC.

Richard.

> Did I understand that right?
>
> Thanks for the discussion so far.
>
> Tamar
>
> > > If you want me to do it unconditionally, sure, I can do that...
> > >
> > > If so, can I get a review on the other patches anyway?  They are mostly
> > > independent; they only have some dependencies on the output of the tests.
> >
> > Sure, I'm behind stuff - sorry.
> >
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > > > Richard.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > > > >
> > > > > > > Using permutations has a big benefit for multi-step zero extensions
> > > > > > > because they both reduce the number of needed instructions and
> > > > > > > increase throughput, as the dependency chain is removed.
> > > > > > >
> > > > > > > Concretely on AArch64 this changes:
> > > > > > >
> > > > > > > void test4(unsigned char *x, long long *y, int n) {
> > > > > > >     for(int i = 0; i < n; i++) {
> > > > > > >         y[i] = x[i];
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > from generating:
> > > > > > >
> > > > > > > .L4:
> > > > > > >         ldr     q30, [x4], 16
> > > > > > >         add     x3, x3, 128
> > > > > > >         zip1    v1.16b, v30.16b, v31.16b
> > > > > > >         zip2    v30.16b, v30.16b, v31.16b
> > > > > > >         zip1    v2.8h, v1.8h, v31.8h
> > > > > > >         zip1    v0.8h, v30.8h, v31.8h
> > > > > > >         zip2    v1.8h, v1.8h, v31.8h
> > > > > > >         zip2    v30.8h, v30.8h, v31.8h
> > > > > > >         zip1    v26.4s, v2.4s, v31.4s
> > > > > > >         zip1    v29.4s, v0.4s, v31.4s
> > > > > > >         zip1    v28.4s, v1.4s, v31.4s
> > > > > > >         zip1    v27.4s, v30.4s, v31.4s
> > > > > > >         zip2    v2.4s, v2.4s, v31.4s
> > > > > > >         zip2    v0.4s, v0.4s, v31.4s
> > > > > > >         zip2    v1.4s, v1.4s, v31.4s
> > > > > > >         zip2    v30.4s, v30.4s, v31.4s
> > > > > > >         stp     q26, q2, [x3, -128]
> > > > > > >         stp     q28, q1, [x3, -96]
> > > > > > >         stp     q29, q0, [x3, -64]
> > > > > > >         stp     q27, q30, [x3, -32]
> > > > > > >         cmp     x4, x5
> > > > > > >         bne     .L4
> > > > > > >
> > > > > > > and instead we get:
> > > > > > >
> > > > > > > .L4:
> > > > > > >         add     x3, x3, 128
> > > > > > >         ldr     q23, [x4], 16
> > > > > > >         tbl     v5.16b, {v23.16b}, v31.16b
> > > > > > >         tbl     v4.16b, {v23.16b}, v30.16b
> > > > > > >         tbl     v3.16b, {v23.16b}, v29.16b
> > > > > > >         tbl     v2.16b, {v23.16b}, v28.16b
> > > > > > >         tbl     v1.16b, {v23.16b}, v27.16b
> > > > > > >         tbl     v0.16b, {v23.16b}, v26.16b
> > > > > > >         tbl     v22.16b, {v23.16b}, v25.16b
> > > > > > >         tbl     v23.16b, {v23.16b}, v24.16b
> > > > > > >         stp     q5, q4, [x3, -128]
> > > > > > >         stp     q3, q2, [x3, -96]
> > > > > > >         stp     q1, q0, [x3, -64]
> > > > > > >         stp     q22, q23, [x3, -32]
> > > > > > >         cmp     x4, x5
> > > > > > >         bne     .L4
> > > > > > >
> > > > > > > Tests are added in the AArch64 patch introducing the hook.  The
> > > > > > > testsuite also already had about 800 runtime tests that get affected
> > > > > > > by this.
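> > > > > > >
> > > > > > > In essence the selector for one output vector is built as below (a
> > > > > > > simplified sketch for little-endian only; steps, ratio and nunits_in
> > > > > > > are illustrative names, and the real code below also handles
> > > > > > > big-endian and the remaining chunks):
> > > > > > >
> > > > > > >   /* steps = output-vector bits / input element bits,
> > > > > > >      ratio = output element bits / input element bits,
> > > > > > >      nunits_in = number of lanes in the input vector.  */
> > > > > > >   vec_perm_builder sel (steps, 1, 1);
> > > > > > >   sel.quick_grow (steps);
> > > > > > >   /* Flood fill: indices >= nunits_in pick lanes of the zero vector.  */
> > > > > > >   for (unsigned i = 0; i < steps; ++i)
> > > > > > >     sel[i] = nunits_in;
> > > > > > >   /* Scatter the value lanes into the low slot of each group.  */
> > > > > > >   for (unsigned i = 0, lane = 0; i < steps; i += ratio, ++lane)
> > > > > > >     sel[i] = lane;
> > > > > > >
> > > > > > > Each output element then reads one input lane in its low slot and
> > > > > > > zeros elsewhere, which is what makes the result safe to view-convert
> > > > > > > to the wider element type.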
> > > > > > >
> > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu -m32, -m64 and no
> > > > > > > issues.
> > > > > > >
> > > > > > > Ok for master?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Tamar
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         * target.def (use_permute_for_promotion): New.
> > > > > > >         * doc/tm.texi.in: Document it.
> > > > > > >         * doc/tm.texi: Regenerate.
> > > > > > >         * targhooks.cc (default_use_permute_for_promotion): New.
> > > > > > >         * targhooks.h (default_use_permute_for_promotion): New.
> > > > > > >         * tree-vect-stmts.cc (vectorizable_conversion): Support direct
> > > > > > >         conversion with permute.
> > > > > > >         (vect_create_vectorized_promotion_stmts): Likewise.
> > > > > > >         (supportable_widening_operation): Likewise.
> > > > > > >         (vect_gen_perm_mask_any): Allow vector permutes where input
> > > > > > >         registers are half the width of the result per the GCC 14
> > > > > > >         relaxation of VEC_PERM_EXPR.
> > > > > > >
> > > > > > > ---
> > > > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > > > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > > > > > > --- a/gcc/doc/tm.texi
> > > > > > > +++ b/gcc/doc/tm.texi
> > > > > > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> > > > > > >  all zeros.  GCC can then try to branch around the instruction instead.
> > > > > > >  @end deftypefn
> > > > > > >
> > > > > > > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > > > > > > +This hook returns true if the operation promoting @var{in_type} to
> > > > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > > > > > > +a signed type the operation will be done as the related unsigned type and
> > > > > > > +converted to @var{out_type}.  If the target supports the needed permute,
> > > > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is
> > > > > > > +beneficial to do so, the hook should return true, else false should be
> > > > > > > +returned.
> > > > > > > +@end deftypefn
> > > > > > > +
> > > > > > >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> > > > > > >  This hook should initialize target-specific data structures in preparation
> > > > > > >  for modeling the costs of vectorizing a loop or basic block.  The default
> > > > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > > > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > > > > > > --- a/gcc/doc/tm.texi.in
> > > > > > > +++ b/gcc/doc/tm.texi.in
> > > > > > > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> > > > > > >
> > > > > > >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > > > > > >
> > > > > > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > > > > > > +
> > > > > > >  @hook TARGET_VECTORIZE_CREATE_COSTS
> > > > > > >
> > > > > > >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > > > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > > > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > > > > > > --- a/gcc/target.def
> > > > > > > +++ b/gcc/target.def
> > > > > > > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> > > > > > >   (unsigned ifn),
> > > > > > >   default_empty_mask_is_expensive)
> > > > > > >
> > > > > > > +/* Function to say whether a target supports and prefers to use permutes
> > > > > > > +   for zero extensions or truncates.  */
> > > > > > > +DEFHOOK
> > > > > > > +(use_permute_for_promotion,
> > > > > > > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > > > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > > > > > > +a signed type the operation will be done as the related unsigned type and\n\
> > > > > > > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > > > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is\n\
> > > > > > > +beneficial to do so, the hook should return true, else false should be returned.",
> > > > > > > + bool,
> > > > > > > + (const_tree in_type, const_tree out_type),
> > > > > > > + default_use_permute_for_promotion)
> > > > > > > +
> > > > > > >  /* Target builtin that implements vector gather operation.  */
> > > > > > >  DEFHOOK
> > > > > > >  (builtin_gather,
> > > > > > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > > > > > > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > > > > > > --- a/gcc/targhooks.h
> > > > > > > +++ b/gcc/targhooks.h
> > > > > > > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> > > > > > >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> > > > > > >  extern bool default_empty_mask_is_expensive (unsigned);
> > > > > > >  extern bool default_conditional_operation_is_expensive (unsigned);
> > > > > > > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > > > > > >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> > > > > > >
> > > > > > >  /* OpenACC hooks.  */
> > > > > > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > > > > > > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > > > > > > --- a/gcc/targhooks.cc
> > > > > > > +++ b/gcc/targhooks.cc
> > > > > > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> > > > > > >    return ifn == IFN_MASK_STORE;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* By default no targets prefer permutes over multi-step extension.  */
> > > > > > > +
> > > > > > > +bool
> > > > > > > +default_use_permute_for_promotion (const_tree, const_tree)
> > > > > > > +{
> > > > > > > +  return false;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /* By default consider masked stores to be expensive.  */
> > > > > > >
> > > > > > >  bool
> > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> > > > > > >    gimple *new_stmt1, *new_stmt2;
> > > > > > >    vec<tree> vec_tmp = vNULL;
> > > > > > >
> > > > > > > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final
> > > > > > > +     type in one go.  */
> > > > > > > +  if (ch1 == VEC_PERM_EXPR
> > > > > > > +      && op_type == unary_op)
> > > > > > > +    {
> > > > > > > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > > > +      bool failed_p = false;
> > > > > > > +
> > > > > > > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > > > > > > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > > > +        {
> > > > > > > +          tree vectype_in = TREE_TYPE (vop0);
> > > > > > > +          tree vectype_out = TREE_TYPE (vec_dest);
> > > > > > > +          machine_mode mode_in = TYPE_MODE (vectype_in);
> > > > > > > +          machine_mode mode_out = TYPE_MODE (vectype_out);
> > > > > > > +          unsigned bitsize_in = element_precision (vectype_in);
> > > > > > > +          unsigned tot_in, tot_out;
> > > > > > > +          unsigned HOST_WIDE_INT count;
> > > > > > > +
> > > > > > > +          /* We can't really support VLA here as the indexes depend on
> > > > > > > +             the VL.  VLA should really use widening instructions like
> > > > > > > +             widening loads.  */
> > > > > > > +          if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > > > > > > +              || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > > > > > > +              || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > > > > > > +              || !TYPE_UNSIGNED (vectype_in)
> > > > > > > +              || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                              vectype_out))
> > > > > > > +            {
> > > > > > > +              failed_p = true;
> > > > > > > +              break;
> > > > > > > +            }
> > > > > > > +
> > > > > > > +          unsigned steps = tot_out / bitsize_in;
> > > > > > > +          tree zero = build_zero_cst (vectype_in);
> > > > > > > +
> > > > > > > +          unsigned chunk_size
> > > > > > > +            = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > > > > > > +                         TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > > > > > > +          unsigned step_size = chunk_size * (tot_out / tot_in);
> > > > > > > +          unsigned nunits = tot_out / bitsize_in;
> > > > > > > +
> > > > > > > +          vec_perm_builder sel (steps, 1, 1);
> > > > > > > +          sel.quick_grow (steps);
> > > > > > > +
> > > > > > > +          /* Flood fill with the out-of-range value first.  */
> > > > > > > +          for (unsigned long i = 0; i < steps; ++i)
> > > > > > > +            sel[i] = count;
> > > > > > > +
> > > > > > > +          tree var;
> > > > > > > +          tree elem_in = TREE_TYPE (vectype_in);
> > > > > > > +          machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > > > > > > +          unsigned long idx = 0;
> > > > > > > +          tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > > > > > > +                                                            elem_in, nunits);
> > > > > > > +
> > > > > > > +          for (unsigned long j = 0; j < chunk_size; j++)
> > > > > > > +            {
> > > > > > > +              if (WORDS_BIG_ENDIAN)
> > > > > > > +                for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > > > > > > +                  sel[i] = idx;
> > > > > > > +              else
> > > > > > > +                for (int i = 0; i < (int) steps; i += step_size, idx++)
> > > > > > > +                  sel[i] = idx;
> > > > > > > +
> > > > > > > +              vec_perm_indices indices (sel, 2, steps);
> > > > > > > +
> > > > > > > +              tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > > > > > > +              auto vec_oprnd = make_ssa_name (vc_in);
> > > > > > > +              auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > > > > > > +                                                   vop0, zero, perm_mask);
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +
> > > > > > > +              tree intvect_out = unsigned_type_for (vectype_out);
> > > > > > > +              var = make_ssa_name (intvect_out);
> > > > > > > +              new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > > > > > > +                                                           intvect_out,
> > > > > > > +                                                           vec_oprnd));
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +
> > > > > > > +              gcc_assert (ch2.is_tree_code ());
> > > > > > > +
> > > > > > > +              var = make_ssa_name (vectype_out);
> > > > > > > +              if (ch2 == VIEW_CONVERT_EXPR)
> > > > > > > +                new_stmt = gimple_build_assign (var,
> > > > > > > +                                                build1 (VIEW_CONVERT_EXPR,
> > > > > > > +                                                        vectype_out,
> > > > > > > +                                                        vec_oprnd));
> > > > > > > +              else
> > > > > > > +                new_stmt = gimple_build_assign (var, (tree_code) ch2,
> > > > > > > +                                                vec_oprnd);
> > > > > > > +
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +              vec_tmp.safe_push (var);
> > > > > > > +            }
> > > > > > > +        }
> > > > > > > +
> > > > > > > +      if (!failed_p)
> > > > > > > +        {
> > > > > > > +          vec_oprnds0->release ();
> > > > > > > +          *vec_oprnds0 = vec_tmp;
> > > > > > > +          return;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > >    vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > > >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > > >      {
> > > > > > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > > > > > >           || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > > > > > >         goto unsupported;
> > > > > > >
> > > > > > > +      /* Check to see if the target can use a permute to perform the
> > > > > > > +         zero extension.  */
> > > > > > > +      intermediate_type = unsigned_type_for (vectype_out);
> > > > > > > +      if (TYPE_UNSIGNED (vectype_in)
> > > > > > > +          && VECTOR_TYPE_P (intermediate_type)
> > > > > > > +          && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > > > > > > +          && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                          intermediate_type))
> > > > > > > +        {
> > > > > > > +          code1 = VEC_PERM_EXPR;
> > > > > > > +          code2 = FLOAT_EXPR;
> > > > > > > +          break;
> > > > > > > +        }
> > > > > > > +
> > > > > > >        fltsz = GET_MODE_SIZE (lhs_mode);
> > > > > > >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > > > > > >          {
> > > > > > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> > > > > > >    tree mask_type;
> > > > > > >
> > > > > > >    poly_uint64 nunits = sel.length ();
> > > > > > > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > > > > > > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > > > > > > +              || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> > > > > > >
> > > > > > >    mask_type = build_vector_type (ssizetype, nunits);
> > > > > > >    return vec_perm_indices_to_tree (mask_type, sel);
> > > > > > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> > > > > > >        break;
> > > > > > >
> > > > > > >      CASE_CONVERT:
> > > > > > > -      c1 = VEC_UNPACK_LO_EXPR;
> > > > > > > -      c2 = VEC_UNPACK_HI_EXPR;
> > > > > > > +      {
> > > > > > > +        tree cvt_type = unsigned_type_for (vectype_out);
> > > > > > > +        if (TYPE_UNSIGNED (vectype_in)
> > > > > > > +            && VECTOR_TYPE_P (cvt_type)
> > > > > > > +            && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > > > > > > +            && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                            cvt_type))
> > > > > > > +          {
> > > > > > > +            *code1 = VEC_PERM_EXPR;
> > > > > > > +            *code2 = VIEW_CONVERT_EXPR;
> > > > > > > +            return true;
> > > > > > > +          }
> > > > > > > +        c1 = VEC_UNPACK_LO_EXPR;
> > > > > > > +        c2 = VEC_UNPACK_HI_EXPR;
> > > > > > > +      }
> > > > > > >        break;
> > > > > > >
> > > > > > >      case FLOAT_EXPR:
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Richard Biener <rguent...@suse.de>
> > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)