On Tue, Jan 12, 2021 at 11:42:44AM +0100, Uros Bizjak wrote:
> > The following patch adds patterns (in the end I went with define_insn rather
> > than the combiner define_split + define_insn_and_split I had initially hoped
> > for, or a define_insn_and_split) to represent (so far 128-bit only) permutations
> > like { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 } where the second
> > operand is CONST0_RTX CONST_VECTOR as pmovzx.
> > define_split didn't work (due to the combiner not trying combine_split_insn
> > when i1 is NULL) but in the end was quite large, and the reason for not
> > trying to split this afterwards is the different vector mode of the output,
> > and lowpart_subreg on the result is undesirable,
> > so we'd need to split it into two instructions and hope some later pass
> > optimizes the move into just rewriting the uses using lowpart_subreg.
> 
> You can use post-reload define_insn_and_split here. This way,
> gen_lowpart on all arguments, including output, can be used. So,
> instead of generating an insn template, the patterns you introduced
> should split to "real" sse4_1 zero-extend insns. This approach is
> preferred to avoid having several pseudo-insns in .md files that do
> the same thing with slightly different patterns. There are many
> examples of post-reload splitters that use gen_lowpart in i386.md.

Ok, will change it that way.

> OTOH, perhaps some of the new testcases can be handled in x86
> target_fold_builtin? In the long term, maybe target_fold_shuffle can
> be introduced to map __builtin_shuffle to various target builtins, so
> the builtin can be processed further in target_fold_builtin. As
> pointed out below, vector insn patterns can be quite complex, and push
> RTL combiners to their limits, so perhaps they can be more efficiently
> handled by tree passes.

My primary motivation was to generate good code from __builtin_shuffle here,
and trying to find the best permutation and then map it back from insns to
builtins would be a nightmare.
I'll see how many targets I need to modify to try the variant where the
middle-end does not force_reg CONST0_RTX...
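
For readers following the thread, here is a rough sketch of why the
permutation { 0 16 1 17 ... 7 23 } with an all-zero second operand is the
same thing as a byte-to-word zero extension (what pmovzxbw computes). It is
plain C that only models two-operand shuffle semantics on byte arrays; it is
not the patch and it does not use any GCC internals. The helper names
(shuffle2_u8, zext_low_u8_to_u16, interleave_with_zero_is_zext) are made up
for this illustration, and the 16-bit lanes are written out in little-endian
byte order, as on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N = 16 };  /* bytes in a 128-bit vector */

/* Illustrative model of a two-operand byte shuffle:
   out[i] = concat(a, b)[mask[i] % (2 * N)].  */
static void shuffle2_u8(const uint8_t a[N], const uint8_t b[N],
                        const uint8_t mask[N], uint8_t out[N])
{
    for (int i = 0; i < N; i++) {
        unsigned idx = mask[i] % (2 * N);
        out[i] = idx < N ? a[idx] : b[idx - N];
    }
}

/* Zero-extend the low 8 bytes of a to eight 16-bit lanes,
   stored as little-endian byte pairs (low byte first).  */
static void zext_low_u8_to_u16(const uint8_t a[N], uint8_t out[N])
{
    for (int i = 0; i < N / 2; i++) {
        out[2 * i] = a[i];    /* low byte of lane i */
        out[2 * i + 1] = 0;   /* high byte is zero */
    }
}

/* Returns 1 if interleaving the low half of a with zeros gives the
   same bytes as the zero extension, 0 otherwise.  */
static int interleave_with_zero_is_zext(const uint8_t a[N])
{
    static const uint8_t mask[N] = {
        0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23
    };
    const uint8_t zero[N] = { 0 };  /* stands in for the CONST0_RTX operand */
    uint8_t shuffled[N], extended[N];

    shuffle2_u8(a, zero, mask, shuffled);
    zext_low_u8_to_u16(a, extended);
    return memcmp(shuffled, extended, N) == 0;
}
```

Calling interleave_with_zero_is_zext on any 16-byte input should return 1,
since every odd mask index (16..23) selects a byte of the zero operand.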

        Jakub
