Andrew Pinski <pins...@gmail.com> writes:
> I need some help with the vector cost model for aarch64.
> I am adding V2HI and V4QI mode support by emulating it using the
> native V4HI/V8QI instructions (similarly to how mmx with SSE is
> done).  The problem is I am running into a cost model issue with
> gcc.target/aarch64/pr98772.c (wminus is similar to
> gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> address).
> It seems like the cost model is overestimating the number of loads
> for the V8QI case.
> With the new cost model usage (-march=armv9-a+nosve), I get:
> ```
> t.c:7:21: note: ***** Analysis succeeded with vector mode V4QI
> t.c:7:21: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> t.c:7:21: note: Issue info for V4QI loop:
> t.c:7:21: note:   load operations = 2
> t.c:7:21: note:   store operations = 1
> t.c:7:21: note:   general operations = 4
> t.c:7:21: note:   reduction latency = 0
> t.c:7:21: note:   estimated min cycles per iteration = 2.000000
> t.c:7:21: note: Issue info for V8QI loop:
> t.c:7:21: note:   load operations = 12
> t.c:7:21: note:   store operations = 1
> t.c:7:21: note:   general operations = 6
> t.c:7:21: note:   reduction latency = 0
> t.c:7:21: note:   estimated min cycles per iteration = 4.333333
> t.c:7:21: note: Weighted cycles per iteration of V4QI loop ~= 4.000000
> t.c:7:21: note: Weighted cycles per iteration of V8QI loop ~= 4.333333
> t.c:7:21: note: Preferring loop with lower cycles per iteration
> t.c:7:21: note: ***** Preferring vector mode V4QI to vector mode V8QI
> ```
>
> That is totally wrong: instead of vectorizing using V8QI we
> vectorize using V4QI, and the resulting code is worse.
>
> Attached is my current patch for adding V4QI/V2HI to the aarch64
> backend.  (Note I have not finished up the changelog or the
> testcases; I have secondary patches that add the testcases already.)
> Is there something I am missing here, or are we just overestimating
> the V8QI cost, and is it something easy to fix?
Trying it locally, I get:

foo.c:15:23: note: ***** Analysis succeeded with vector mode V4QI
foo.c:15:23: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
foo.c:15:23: note: Issue info for V4QI loop:
foo.c:15:23: note:   load operations = 2
foo.c:15:23: note:   store operations = 1
foo.c:15:23: note:   general operations = 4
foo.c:15:23: note:   reduction latency = 0
foo.c:15:23: note:   estimated min cycles per iteration = 2.000000
foo.c:15:23: note: Issue info for V8QI loop:
foo.c:15:23: note:   load operations = 8
foo.c:15:23: note:   store operations = 1
foo.c:15:23: note:   general operations = 6
foo.c:15:23: note:   reduction latency = 0
foo.c:15:23: note:   estimated min cycles per iteration = 3.000000
foo.c:15:23: note: Weighted cycles per iteration of V4QI loop ~= 4.000000
foo.c:15:23: note: Weighted cycles per iteration of V8QI loop ~= 3.000000
foo.c:15:23: note: Preferring loop with lower cycles per iteration

The function is:

extern void
wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y*4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

For V8QI we need a VF of 2, so that there are 8 elements to store to d.
Conceptually, we handle those two iterations by loading 4 V8QIs from
pix1 and pix2 (32 bytes each), with mitigations against overrun, and
then permuting the results to single V8QIs.  vectorizable_load doesn't
seem to be smart enough to realise that only 2 of those 4 loads are
actually used in the permutation, and so only 2 loads should be costed
for each of pix1 and pix2.

Thanks,
Richard