On Tue, Jan 12, 2021 at 11:42:44AM +0100, Uros Bizjak wrote:
> > The following patch adds patterns (in the end I went with define_insn rather
> > than combiner define_split + define_insn_and_split I initially hoped or
> > define_insn_and_split) to represent (so far 128-bit only) permutations
> > like { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 } where the second
> > operand is CONST0_RTX CONST_VECTOR as pmovzx.
> > define_split didn't work (due to the combiner not trying combine_split_insn
> > when i1 is NULL) but in the end was quite large, and the reason for not
> > trying to split this afterwards is the different vector mode of the output,
> > and lowpart_subreg on the result is undesirable,
> > so we'd need to split it into two instructions and hope some later pass
> > optimizes the move into just rewriting the uses using lowpart_subreg.
>
> You can use post-reload define_insn_and_split here. This way,
> gen_lowpart on all arguments, including output, can be used. So,
> instead of generating an insn template, the patterns you introduced
> should split to "real" sse4_1 zero-extend insns. This approach is
> preferred to avoid having several pseudo-insns in .md files that do
> the same thing with slightly different patterns. There are many
> examples of post-reload splitters that use gen_lowpart in i386.md.

Ok, will change it that way.

> OTOH, perhaps some of the new testcases can be handled in x86
> target_fold_builtin? In the long term, maybe target_fold_shuffle can
> be introduced to map __builtin_shuffle to various target builtins, so
> the builtin can be processed further in target_fold_builtin. As
> pointed out below, vector insn patterns can be quite complex, and push
> RTL combiners to their limits, so perhaps they can be more efficiently
> handled by tree passes.

My primary motivation was to generate good code from __builtin_shuffle here,
and trying to find the best permutation and map it back from insns to
builtins would be a nightmare.
I'll see how many targets I need to modify to try the no middle-end
force_reg for CONST0_RTX case...

	Jakub
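
For illustration only, the kind of source the thread is about might look
roughly like the sketch below: a __builtin_shuffle of a byte vector with an
all-zero second operand, using the { 0 16 1 17 ... 7 23 } selector quoted
above. The type names and the helper zext_low_bytes are hypothetical and not
taken from the patch; the sketch assumes GCC's vector extensions and a
little-endian target such as x86, where the interleaved result read as eight
16-bit lanes equals the zero-extended low eight bytes. Whether a given
compiler version actually emits pmovzxbw for it depends on the patterns
under discussion.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type names: 128-bit vectors of 16 bytes / 8 halfwords. */
typedef uint8_t  v16qi __attribute__((vector_size(16)));
typedef uint16_t v8hi  __attribute__((vector_size(16)));

/* Interleave the low 8 bytes of x with zero bytes.  Selector indices
   0..15 pick from x, 16..31 pick from the zero vector.  On a
   little-endian target each (byte, 0) pair reads back as the byte
   zero-extended to 16 bits.  */
static v8hi zext_low_bytes(v16qi x)
{
    const v16qi zero = { 0 };
    const v16qi sel  = { 0, 16, 1, 17, 2, 18, 3, 19,
                         4, 20, 5, 21, 6, 22, 7, 23 };
    v16qi r = __builtin_shuffle(x, zero, sel);

    /* Reinterpret the 16 result bytes as 8 halfwords. */
    v8hi out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```

Compiling something like this with -O2 -msse4.1 and inspecting the assembly
is one way to check whether the shuffle is recognized as a zero extension.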