On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Andrew Pinski <pins...@gmail.com> writes:
> > I need some help with the vector cost model for aarch64.
> > I am adding V2HI and V4QI mode support by emulating it using the
> > native V4HI/V8QI instructions (similarly to how MMX is done using
> > SSE).  The problem is that I am running into a cost model issue with
> > gcc.target/aarch64/pr98772.c (wminus is similar to
> > gcc.dg/vect/slp-gap-1.c, just with slightly different address
> > offsets).  It seems like the cost model is overestimating the number
> > of loads for the V8QI case.
> > With the new cost model usage (-march=armv9-a+nosve), I get:
> > ```
> > t.c:7:21: note: ***** Analysis succeeded with vector mode V4QI
> > t.c:7:21: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > t.c:7:21: note: Issue info for V4QI loop:
> > t.c:7:21: note: load operations = 2
> > t.c:7:21: note: store operations = 1
> > t.c:7:21: note: general operations = 4
> > t.c:7:21: note: reduction latency = 0
> > t.c:7:21: note: estimated min cycles per iteration = 2.000000
> > t.c:7:21: note: Issue info for V8QI loop:
> > t.c:7:21: note: load operations = 12
> > t.c:7:21: note: store operations = 1
> > t.c:7:21: note: general operations = 6
> > t.c:7:21: note: reduction latency = 0
> > t.c:7:21: note: estimated min cycles per iteration = 4.333333
> > t.c:7:21: note: Weighted cycles per iteration of V4QI loop ~= 4.000000
> > t.c:7:21: note: Weighted cycles per iteration of V8QI loop ~= 4.333333
> > t.c:7:21: note: Preferring loop with lower cycles per iteration
> > t.c:7:21: note: ***** Preferring vector mode V4QI to vector mode V8QI
> > ```
> >
> > That is totally wrong: instead of vectorizing using V8QI we vectorize
> > using V4QI, and the resulting code is worse.
> >
> > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > backend (note I have not finished the changelog or the testcases;
> > I have follow-up patches that add the testcases already).
> > Is there something I am missing here, or are we just overestimating
> > the V8QI cost, and is it something easy to fix?
>
> Trying it locally, I get:
>
> foo.c:15:23: note: ***** Analysis succeeded with vector mode V4QI
> foo.c:15:23: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> foo.c:15:23: note: Issue info for V4QI loop:
> foo.c:15:23: note: load operations = 2
> foo.c:15:23: note: store operations = 1
> foo.c:15:23: note: general operations = 4
> foo.c:15:23: note: reduction latency = 0
> foo.c:15:23: note: estimated min cycles per iteration = 2.000000
> foo.c:15:23: note: Issue info for V8QI loop:
> foo.c:15:23: note: load operations = 8
> foo.c:15:23: note: store operations = 1
> foo.c:15:23: note: general operations = 6
> foo.c:15:23: note: reduction latency = 0
> foo.c:15:23: note: estimated min cycles per iteration = 3.000000
> foo.c:15:23: note: Weighted cycles per iteration of V4QI loop ~= 4.000000
> foo.c:15:23: note: Weighted cycles per iteration of V8QI loop ~= 3.000000
> foo.c:15:23: note: Preferring loop with lower cycles per iteration
>
> The function is:
>
> extern void
> wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2)
> {
>   for (int y = 0; y < 4; y++)
>     {
>       for (int x = 0; x < 4; x++)
>         d[x + y*4] = pix1[x] + pix2[x];
>       pix1 += 16;
>       pix2 += 16;
>     }
> }
>
> For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> Conceptually, we handle those two iterations by loading 4 V8QIs from
> pix1 and pix2 (32 bytes each), with mitigations against overrun,
> and then permute the result into single V8QIs.
>
> vectorizable_load doesn't seem to be smart enough to realise that only
> 2 of those 4 loads are actually used in the permutation, and so only 2
> loads should be costed for each of pix1 and pix2.
Though it has code to do that.

Richard.

> Thanks,
> Richard
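
Judging by the dumps quoted above, the "weighted" figures appear to scale
both candidates to the same amount of work: the V4QI loop's 2.0 estimated
cycles per iteration is doubled to 4.0 because its VF is half that of the
V8QI loop, while the V8QI estimate is used as-is. So the over-counted loads
in the V8QI column (12 or 8 instead of 4, pushing the estimate to 4.333
rather than 3.0 cycles in Andrew's run) are exactly what tips the comparison
towards V4QI.

For concreteness, here is a hand-written sketch of the strategy Richard
describes, using NEON ACLE intrinsics rather than the vectorizer's internal
representation. It is only an illustration, not what GCC actually emits: it
assumes the permute can be done with a single zip, and the helper name
wplus_v8qi_step is made up. It covers one pair of outer iterations of wplus
and needs only two 8-byte loads from each source pointer:

```
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical helper: two outer iterations of wplus (y and y+1), assuming
   d, pix1 and pix2 have already been advanced to the start of iteration y.
   Only two V8QI loads are needed from each of pix1 and pix2.  */
static void
wplus_v8qi_step (uint16_t *d, const uint8_t *pix1, const uint8_t *pix2)
{
  /* Two 8-byte loads per source; only the first four bytes of each load
     are used.  (The second load still reads a few bytes beyond the last
     element the scalar code touches, which is the overrun that has to be
     mitigated.)  */
  uint8x8_t p1_lo = vld1_u8 (pix1);        /* pix1[0..7]   */
  uint8x8_t p1_hi = vld1_u8 (pix1 + 16);   /* pix1[16..23] */
  uint8x8_t p2_lo = vld1_u8 (pix2);
  uint8x8_t p2_hi = vld1_u8 (pix2 + 16);

  /* Permute the used halves into single V8QI vectors:
     { pix1[0..3], pix1[16..19] } and { pix2[0..3], pix2[16..19] }.  */
  uint8x8_t p1 = vreinterpret_u8_u32 (vzip1_u32 (vreinterpret_u32_u8 (p1_lo),
                                                 vreinterpret_u32_u8 (p1_hi)));
  uint8x8_t p2 = vreinterpret_u8_u32 (vzip1_u32 (vreinterpret_u32_u8 (p2_lo),
                                                 vreinterpret_u32_u8 (p2_hi)));

  /* Widening add (8 x u8 -> 8 x u16) and one 16-byte store to d[0..7].  */
  vst1q_u16 (d, vaddl_u8 (p1, p2));
}
```

The four scalar iterations of wplus would then amount to two such steps,
called with (d, pix1, pix2) and (d + 8, pix1 + 32, pix2 + 32). Each step
does four loads (two per source pointer), two permutes, one widening add
and one store, which is what Richard's point about costing only 2 loads
per pointer per V8QI iteration corresponds to.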