"yangyang (ET)" <yangyang...@huawei.com> writes:
> Hi,
>
> I have revised the patch based on your suggestions, and the following are
> some points that I think need to be mentioned in the mail.
>
>> > @@ -502,17 +504,18 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
>> >        veclen = node->simdclone->vecsize_int;
>> >      else
>> >        veclen = node->simdclone->vecsize_float;
>> > -    veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t));
>> > -    if (veclen > node->simdclone->simdlen)
>> > +    veclen = exact_div (veclen, GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t)));
>> > +    if (known_gt (veclen, node->simdclone->simdlen))
>> >        veclen = node->simdclone->simdlen;
>>
>> Although again it probably doesn't make a difference in practice, the
>> known/maybe situation is similar here.  When comparing:
>>
>> - an SVE vector of 2 + 2 * (VQ - 1) doubles and
>> - an Advanced SIMD vector of 2 doubles
>>
>> the Advanced SIMD version is conceptually ordered <= the SVE one, in the
>> sense that the SVE vector always contains a whole number of Advanced SIMD
>> vectors whereas the Advanced SIMD vector might not contain a whole number
>> of SVE vectors.
>>
>> In other words, the number of lanes in the Advanced SIMD vector is known_le
>> the number of lanes in the SVE vector, and the number of lanes in the SVE
>> vector is known_ge and maybe_gt (but not known_gt) the number of lanes in
>> the Advanced SIMD vector.  So for these kinds of comparison, known_gt can
>> look a bit unexpected, even if (as here) it's probably fine in practice.
>>
>> There's currently a hard-coded assumption in this code and in the
>> vectoriser that both constant-length software vectors and constant-length
>> hardware vectors are a power of 2 in size.  This means that the > above
>> is effectively testing whether veclen contains a whole number of
>> node->simdclone->simdlens, not just whether veclen is bigger.
>>
>> So when adding the initial autovec support for SVE, we generally rewrote
>> these kinds of test to use multiple_p instead of comparisons.  I think
>> that makes the intent more obvious (but others might disagree :-)).
>>
>> That was a long-winded way of saying that I suggest we use:
>>
>>   if (multiple_p (veclen, node->simdclone->simdlen))
>>     veclen = node->simdclone->simdlen;
>>
>> instead.
>
> As far as I understand, multiple_p (a, b) is equal to known_ge (a, b) due to
> the hard-coded assumption, and !multiple_p (b, a) is equal to known_gt (a, b).
Yes to the first part, but !multiple_p (b, a) is equal to known_ge (a, b)
(== known_le (b, a)) rather than known_gt (a, b).  E.g.:

  multiple_p (2, 2 + 2X)

is false (for X > 0) but:

  known_gt (2 + 2X, 2)

is also false (for X == 0).  In both cases, the conditions have an
implicit “for all X”.

> So it seems better to use !multiple_p (node->simdclone->simdlen, veclen)
> here, while using multiple_p (veclen, node->simdclone->simdlen) is OK since
> doing the assignment for the eq situation won't change anything.  Anyway, I
> used multiple_p (veclen, node->simdclone->simdlen) in the revised patch.

!multiple_p is fine with me too FWIW, if you prefer that.

>> >      }
>> >    TREE_TYPE (TREE_TYPE (fndecl)) = t;
>> >    if (!node->definition)
>> > @@ -663,8 +669,9 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
>> >        veclen = sc->vecsize_int;
>> >      else
>> >        veclen = sc->vecsize_float;
>> > -    veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
>> > -    if (veclen > sc->simdlen)
>> > +    veclen = exact_div (veclen,
>> > +                        GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)));
>> > +    if (known_gt (veclen, sc->simdlen))
>> >        veclen = sc->simdlen;
>>
>> Similarly here.
>
> Similarly here, use multiple_p (veclen, sc->simdlen) in the revised patch
> in addition.

Here too !multiple_p sounds OK if you prefer that.
>> > @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
>> >                               iter, NULL_TREE, NULL_TREE);
>> >        adjustments->register_replacement (&(*adjustments->m_adj_params)[j], r);
>> >
>> > -      if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
>> > -        j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
>> > +      if (known_lt (simd_clone_subparts (vectype), node->simdclone->simdlen))
>> > +        {
>> > +          j += vector_unroll_factor (node->simdclone->simdlen,
>> > +                                     simd_clone_subparts (vectype)) - 1;
>> > +        }
>>
>> And similarly here:
>>
>>   if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))
>
> Similarly here, use multiple_p (node->simdclone->simdlen,
> simd_clone_subparts (vectype)) in the revised patch in addition.

To be clear, I think multiple_p is still natural here though.

>> > /* Wrapper for poly_int arguments to target macros, so that if a target
>> >    doesn't need polynomial-sized modes, its header file can continue to
>> >    treat the argument as a normal constant.  This should go away once
>> > @@ -3902,12 +3902,12 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
>> >         n = n->simdclone->next_clone)
>> >      {
>> >        unsigned int this_badness = 0;
>> > -      if (n->simdclone->simdlen > vf
>> > +      if (known_gt (n->simdclone->simdlen, vf)
>>
>> I think !constant_multiple_p rather than known_gt here.  Until we add
>> support for vectorising simd clones with partial vectors, a whole number
>> of simdlens must fit within VF.
>> So maybe:
>>
>>   unsigned int num_calls;
>>   if (!constant_multiple_p (n->simdclone->simdlen, vf, &num_calls)
>>
>> and then…
>>
>> >           || n->simdclone->nargs != nargs)
>> >         continue;
>> > -      if (n->simdclone->simdlen < vf)
>> > -        this_badness += (exact_log2 (vf)
>> > -                         - exact_log2 (n->simdclone->simdlen)) * 1024;
>> > +      if (known_lt (n->simdclone->simdlen, vf))
>> > +        this_badness += exact_log2
>> > +          (vector_unroll_factor (vf, n->simdclone->simdlen)) * 1024;
>>
>> …use num_calls instead of vector_unroll_factor here.
>
> The revised patch writes this code as:
>
>   unsigned int num_calls;
>   if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)
>       || n->simdclone->nargs != nargs)
>     continue;
>   if (!multiple_p (n->simdclone->simdlen, vf))
>     this_badness += exact_log2 (num_calls) * 1024;

I think this should be:

  if (num_calls != 1)
    this_badness += exact_log2 (num_calls) * 1024;

(Or we could do the addition unconditionally; both work.)

>> >       if (n->simdclone->inbranch)
>> >         this_badness += 2048;
>> >       int target_badness = targetm.simd_clone.usable (n);
>> > @@ -3988,19 +3988,19 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
>> >       arginfo[i].vectype = get_vectype_for_scalar_type (vinfo, arg_type,
>> >                                                         slp_node);
>> >       if (arginfo[i].vectype == NULL
>> > -         || (simd_clone_subparts (arginfo[i].vectype)
>> > -             > bestn->simdclone->simdlen))
>> > +         || (known_gt (simd_clone_subparts (arginfo[i].vectype),
>> > +                       bestn->simdclone->simdlen)))
>>
>> Here too I think we want constant_multiple_p:
>>
>>   || !constant_multiple_p (bestn->simdclone->simdlen,
>>                            simd_clone_subparts (arginfo[i].vectype))
>
> Use multiple_p here since the multiple is not needed.

True, but in the case of vectorisation, we need to generate a constant
number of copies at compile time.  If we don't enforce a constant multiple,
we might end up trying to use an Advanced SIMD routine when vectorising
for SVE.
The reason we don't have a two-argument version of constant_multiple_p is
that so far nothing has needed it (at least AFAIK).  There's no conceptual
problem with adding it though.  I'm happy to do that if it would help.

>> >       if (ratype)
>> >         new_temp = create_tmp_var (ratype);
>> >       else if (useless_type_conversion_p (vectype, rtype))
>> > @@ -4264,12 +4266,13 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
>> >
>> >       if (vec_dest)
>> >         {
>> > -         if (simd_clone_subparts (vectype) < nunits)
>> > +         if (known_lt (simd_clone_subparts (vectype), nunits))
>>
>> multiple_p (nunits, …) here too.  It would also be possible to do:
>>
>>   if (constant_multiple_p (nunits, simd_clone_subparts (vectype), &k)
>>
>> and avoid the separate vector_unroll_factor.
>
> Use !multiple_p (simd_clone_subparts (vectype), nunits) here, since it seems
> that the following code should not be executed in the eq situation.

Ah, yes, sorry for missing that.

>> >           {
>> >             unsigned int k, l;
>> >             poly_uint64 prec = GET_MODE_BITSIZE (TYPE_MODE (vectype));
>> >             poly_uint64 bytes = GET_MODE_SIZE (TYPE_MODE (vectype));
>> > -           k = nunits / simd_clone_subparts (vectype);
>> > +           k = vector_unroll_factor (nunits,
>> > +                                     simd_clone_subparts (vectype));
>> >             gcc_assert ((k & (k - 1)) == 0);
>> >             for (l = 0; l < k; l++)
>> >               {
>> > @@ -4295,16 +4298,18 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
>> >             vect_clobber_variable (vinfo, stmt_info, gsi, new_temp);
>> >             continue;
>> >           }
>> > -      else if (simd_clone_subparts (vectype) > nunits)
>> > +      else if (known_gt (simd_clone_subparts (vectype), nunits))
>>
>> multiple_p here too.
>
> Similarly, use !multiple_p (nunits, simd_clone_subparts (vectype)) here.

OK.
> @@ -614,8 +618,9 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
>           veclen = sc->vecsize_int;
>         else
>           veclen = sc->vecsize_float;
> -       veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (parm_type));
> -       if (veclen > sc->simdlen)
> +       veclen = exact_div (veclen,
> +                           GET_MODE_BITSIZE (SCALAR_TYPE_MODE (parm_type)));
> +       if (known_gt (veclen, sc->simdlen))
>           veclen = sc->simdlen;
>         adj.op = IPA_PARAM_OP_NEW;
>         adj.param_prefix_index = IPA_PARAM_PREFIX_SIMD;

I think this should use multiple_p (or !multiple_p) too.

> @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
>                               iter, NULL_TREE, NULL_TREE);
>        adjustments->register_replacement (&(*adjustments->m_adj_params)[j], r);
>
> -      if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
> -        j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
> +      if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))
> +        {
> +          j += vector_unroll_factor (node->simdclone->simdlen,
> +                                     simd_clone_subparts (vectype)) - 1;
> +        }
>      }
>
>    tree name;

Minor nit, but: the original formatting without braces is preferred for
single statements, even if the statement spans multiple lines.

Otherwise the patch looks good, thanks.

Richard