On 14/01/2022 07:08, Richard Biener wrote:
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
On 13/01/2022 14:25, Richard Biener wrote:
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
On 13/01/2022 12:36, Richard Biener wrote:
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
This time to the list too (sorry for double email)
Hi,
The original patch '[vect] Re-analyze all modes for epilogues' skipped modes
that should not be skipped, since it used the vector mode provided by
autovectorize_vector_modes to derive the minimum VF required for it. However,
those modes should only really be used to dictate vector size, so instead this
patch looks for the mode in 'used_vector_modes' with the largest element size,
and constructs a vector mode with the same size as the current
vector_modes[mode_i]. Since we are using the largest element size, the NUNITS
for this mode is the smallest possible VF required for an epilogue with this
mode, and should thus skip only the modes we are certain cannot be used.
Passes bootstrap and regression on x86_64 and aarch64.
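For readers following the arithmetic, the heuristic described above can be sketched outside of GCC as follows. This is only an illustration under stated assumptions: `ModeInfo`, `min_nunits` and `can_skip` are hypothetical names, sizes are plain byte counts rather than poly_ints, and the strict "larger than" comparison follows the wording used later in this thread rather than the committed code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a vector mode: total vector size and element
// size, both in bytes.  Not a GCC type.
struct ModeInfo
{
  unsigned vector_bytes;
  unsigned element_bytes;
};

// Smallest NUNITS a vector of VECTOR_BYTES bytes can have, given the modes
// the main loop used: divide by the largest element size seen.
unsigned
min_nunits (unsigned vector_bytes, const std::vector<ModeInfo> &used)
{
  assert (vector_bytes > 0);
  unsigned max_elem = 1; // one byte, the smallest element size
  for (const ModeInfo &m : used)
    if (m.element_bytes > max_elem)
      max_elem = m.element_bytes;
  // Integer division; a result of 0 never triggers a skip below.
  return vector_bytes / max_elem;
}

// Skip a candidate epilogue vector size only when even its smallest
// possible NUNITS is larger than the VF of the main loop.
bool
can_skip (unsigned candidate_vector_bytes,
          const std::vector<ModeInfo> &used, unsigned first_vf)
{
  return min_nunits (candidate_vector_bytes, used) > first_vf;
}
```

With 16-byte vectors and a main loop that used both 1-byte and 4-byte elements, the smallest NUNITS is 4, so the candidate is skipped only if the main loop's VF is below 4.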
Clearly
+ /* To make sure we are conservative as to what modes we skip, we
+ should check the smallest possible NUNITS which would be
+ derived from the mode in USED_VECTOR_MODES with the largest
+ element size. */
+ scalar_mode max_elsize_mode = GET_MODE_INNER (vector_modes[mode_i]);
+ for (vec_info::mode_set::iterator i =
+ first_loop_vinfo->used_vector_modes.begin ();
+ i != first_loop_vinfo->used_vector_modes.end (); ++i)
+ {
+ if (VECTOR_MODE_P (*i)
+ && GET_MODE_SIZE (GET_MODE_INNER (*i))
+ > GET_MODE_SIZE (max_elsize_mode))
+ max_elsize_mode = GET_MODE_INNER (*i);
+ }
can be done once before iterating over the modes for the epilogue.
True, I'll start with QImode instead of the inner of vector_modes[mode_i] too,
since we can't guarantee the mode is a VECTOR_MODE_P, and it is actually better
too since we can't possibly guarantee the element size of the
USED_VECTOR_MODES is smaller than that of the first vector mode...
Richard maybe knows whether we should take care to look at the
size of the vector mode as well since related_vector_mode when
passed 0 as nunits produces a vector mode with the same size
as vector_modes[mode_i] but not all used_vector_modes may be
of the same size
I suspect that should be fine though, since if we use the largest element size
of all used_vector_modes then that should give us the least possible number
of NUNITS and thus only conservatively skip. That said, that does assume that
no vector mode used may be larger than the size of the loop's vector_mode. Can
I assume that?
No idea, but I would lean towards a no ;) I think the loop's vector_mode
doesn't have to match vector_modes[mode_i] either, does it? At least
autodetected_vector_mode will not be QImode based.
The mode doesn't, but both vector modes surely have to be the same vector size;
I'm not referring to the element size here.
What I was trying to ask was whether all vector modes in used_vector_modes had
the same vector size as the loop's vector mode (and the vector_modes[mode_i] it
originated from).
Definitely not I think.
Hmmm I'm still struggling to understand what we use that initial
vector_mode for then. I thought it was a combination of limiting vector
size (By that I mean NUNITS * element size) and ISA choice.
If we can use vector modes within a loop with a size different from the
initial one, is it at least a guarantee that the input vector mode's
size is an upper bound to the sizes of the modes in used_vector_modes?
(and you probably also want to exclude
VECTOR_BOOLEAN_TYPE_P from the search?)
Yeah I think so too, thanks!
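The review points above (compute the maximum once, start from a QImode-sized element, and leave boolean vectors out of the search) could look roughly like the sketch below. It is a simplified model, not GCC code: `Mode`, its flags and both function names are hypothetical, and the `is_boolean_vector` flag merely stands in for whatever predicate the real patch would use.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified description of a machine mode.  Not a GCC type.
struct Mode
{
  bool is_vector;          // stands in for VECTOR_MODE_P
  bool is_boolean_vector;  // stands in for a boolean/mask vector check
  unsigned vector_bytes;
  unsigned element_bytes;
};

// Largest element size among the data vector modes the main loop used.
// Starts at 1 byte (QImode-sized) so the result does not depend on any
// particular candidate mode.
unsigned
max_element_bytes (const std::vector<Mode> &used_modes)
{
  unsigned max_bytes = 1;
  for (const Mode &m : used_modes)
    {
      if (!m.is_vector || m.is_boolean_vector)
        continue;
      if (m.element_bytes > max_bytes)
        max_bytes = m.element_bytes;
    }
  return max_bytes;
}

// The maximum is computed once, then reused for every candidate epilogue
// mode instead of being recomputed inside the loop over candidates.
std::vector<unsigned>
min_nunits_per_candidate (const std::vector<Mode> &candidates,
                          const std::vector<Mode> &used_modes)
{
  const unsigned max_bytes = max_element_bytes (used_modes);
  assert (max_bytes > 0);
  std::vector<unsigned> result;
  for (const Mode &c : candidates)
    result.push_back (c.vector_bytes / max_bytes);
  return result;
}
```

Note that this sketch still divides each candidate's own size by the maximum element size, so it does not answer the open question of whether used modes may be larger than the candidate.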
I keep going back to thinking (as I brought up in the bugzilla ticket), maybe
we ought to only skip if the NUNITS of the vector mode with the same vector
size as vector_modes[mode_i] is larger than first_info_vf, or just don't skip
at all...
The question is how much work we do before realizing the chosen mode
cannot be used because there's not enough iterations? Maybe we can
improve there easily?
IIUC the VF can change depending on whether we decide to use SLP, so really we
can only check it after we have determined whether or not to use SLP, so
either:
* When SLP fully succeeds, so somewhere between the last 'goto again;' and
return success, but there is very little left to do there
* When SLP fails: here we could save on some work.
Hmm, yeah. Guess it's quite expensive then in the end so worth to
avoid doing useless stuff. I do wonder whether we could cache
analysis fails (and VFs in case of success but worse cost) of the
main loop analysis.
Hmmm, a quick look doesn't show any cases where the main loop may fail
where epilogue may succeed (other than costing). So this could save on
some analysis. Though it is potentially a band-aid if we end up having
to skip for VF, since that wouldn't fail for the main loop, unless ofc
it's a known iteration count. It sounds like a lot of work for maybe a
bit of saving?
Also for targets that for the main loop do not perform cost
comparison (like x86) but have lots of vector modes the previous
mode of operation really made sense (start at next_mode_i or
mode_i when unrolling).
Are you hinting at maybe creating different paths here based on some target
configurable thing? Could be something we ask vector_costs?
That would be an option, yes. We could re-use the VECT_COMPARE_COSTS
bit from autovectorize_vector_modes, if we are not supposed to compare
costs then the old scheme makes sense. We could of course also ask
the target for the first (auto-detect) mode to try for the epilogue,
telling it the first loops mode and VF (again if not comparing costs)
with a new target hook.
But at this point let's try to fix the skipping heuristic and if that
fails just go back to the old iteration scheme, at least for the first
mode to try? Thus, maybe set vector_modes[0] to that previously chosen
next mode and iterate from that. Shouldn't matter for aarch64 since
we'd compare costs of the other modes anyway.
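One way to picture the fallback suggested above is a small helper that chooses where the epilogue mode iteration starts. This is purely a sketch: `COMPARE_COSTS` is a hypothetical constant standing in for the VECT_COMPARE_COSTS bit, `first_epilogue_mode` is an invented name, and returning `num_modes` to mean "no modes left to try" is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical flag standing in for the VECT_COMPARE_COSTS bit that
// autovectorize_vector_modes reports.
constexpr unsigned COMPARE_COSTS = 1u << 0;

// Pick the index of the first vector mode to try for the epilogue.
// When costs are compared, every mode is re-analyzed from index 0 (with
// the skipping heuristic applied separately).  Otherwise fall back to the
// old scheme and continue from the mode after the one the main loop chose.
std::size_t
first_epilogue_mode (unsigned flags, std::size_t next_mode_i,
                     std::size_t num_modes)
{
  assert (num_modes > 0);
  if (flags & COMPARE_COSTS)
    return 0;
  // Clamp so that a result equal to NUM_MODES means "nothing left to try".
  return std::min (next_mode_i, num_modes);
}
```

A caller would then loop from the returned index up to `num_modes`, so a target that does not compare costs analyzes no more modes than it did under the old scheme.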
I like the idea of pushing the old 'next mode' to vector_modes[mode_i];
aarch64 will still analyze more than it needs to though. So yeah, the
skipping might be preferable if we can get it to work. The
original patch seems to work fine on the current testsuite for aarch64 and
x86_64. Though, I am not confident in the 'max element size of
USED_VECTOR_MODES' heuristic if we can't guarantee that the vector size of the
input vector_mode for the main loop is an upper bound for the sizes in
used_vector_modes...