https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124580

            Bug ID: 124580
           Summary: RISCV: Redundant memory loads in x264_pixel_sad
                    function
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: guohuawen7 at gmail dot com
                CC: chenzhongyao.hit at gmail dot com
  Target Milestone: ---

In SPEC2017's 525.x264_r benchmark, function x264_pixel_sad_x3_8x8 redundantly
loads the same fenc data three times:

void x264_pixel_sad_x3_8x8( uint8_t *fenc, uint8_t *pix0, uint8_t *pix1,
                            uint8_t *pix2, int i_stride, int scores[3] )
{
    scores[0] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix0, i_stride );
    scores[1] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix1, i_stride );
    scores[2] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix2, i_stride );
}

The currently generated code performs 24 loads of fenc rows (8 rows × 3 calls),
as can be seen at https://godbolt.org/z/oWEvh8des. Since there are no data
dependencies between the three SAD calculations, this could be optimized to
load each fenc row only once (8 loads in total).

The optimization logic can be represented as follows:

orig:
Loop 1 (8 iterations): load fenc[y], load pix0[y], SAD -> sum0
Loop 2 (8 iterations): load fenc[y], load pix1[y], SAD -> sum1
Loop 3 (8 iterations): load fenc[y], load pix2[y], SAD -> sum2
Total fenc loads: 8 * 3 = 24

my goal:
Loop 1 (8 iterations):
  load fenc[y] // loaded once
  load pix0[y], SAD -> sum0
  load pix1[y], SAD -> sum1
  load pix2[y], SAD -> sum2
Total fenc loads: 8 * 1 = 8
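
For illustration, the transformation above can be written by hand in C. This is
only a sketch of the intended result, not what GCC emits today and not x264's
actual code: sad_8x8 is a hypothetical scalar stand-in for x264_pixel_sad_8x8,
sad_x3_8x8_fused is a made-up name, and FENC_STRIDE is assumed to be 16. The
fused version keeps the three sums in locals and writes scores only after all
loads are done, so no store sits between two reads of the same fenc row.

```c
#include <stdint.h>
#include <stdlib.h>

#define FENC_STRIDE 16 /* assumed value, for this sketch only */

/* Hypothetical scalar stand-in for x264_pixel_sad_8x8. */
int sad_8x8(const uint8_t *pix1, int stride1,
            const uint8_t *pix2, int stride2)
{
    int sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}

/* Hand-fused variant: each fenc element is read once per row and reused
   for all three candidate blocks. */
void sad_x3_8x8_fused(const uint8_t *fenc, const uint8_t *pix0,
                      const uint8_t *pix1, const uint8_t *pix2,
                      int i_stride, int scores[3])
{
    int sum0 = 0, sum1 = 0, sum2 = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int f = fenc[x];            /* loaded once */
            sum0 += abs(f - pix0[x]);
            sum1 += abs(f - pix1[x]);
            sum2 += abs(f - pix2[x]);
        }
        fenc += FENC_STRIDE;
        pix0 += i_stride;
        pix1 += i_stride;
        pix2 += i_stride;
    }
    /* Stores happen only after every load has completed. */
    scores[0] = sum0;
    scores[1] = sum1;
    scores[2] = sum2;
}
```

The fused function should produce the same three scores as three separate
sad_8x8 calls with FENC_STRIDE as the first stride.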

I would like to implement this optimization. Could you advise which part of GCC
(which pass or component) would be the most appropriate place for it?
