Hi,

I was looking at why, in the vectorized DCT kernel of FFmpeg, GCC's insn
selection fails to produce XOP fused-multiply-add vector insns: DOM
detects a redundant expression and eliminates it, and that makes it
impossible for combine to form the higher-level insns.

The DCT kernel looks like this:

static void
dct_unquantize_h263_inter_c (DCTELEM * block, int qscale, int nCoeffs)
{
  int i, level, qmul, qadd;

  qadd = (qscale - 1) | 1;
  qmul = qscale << 1;

  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];

      if (level < 0)
        level = level * qmul + qadd;
      else
        level = level * qmul - qadd;

      block[i] = level;
    }
}

The expression "level * qmul" appears in both branches, so it is
redundant and gets computed once, ahead of the condition:

      tmp = level * qmul;
      if (level < 0)
        level = tmp + qadd;
      else
        level = tmp - qadd;
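
As a sanity check on that transformation, here is a minimal sketch that
compares the loop body as written with the form where the multiply is
computed once. The function names, the qscale range 1..31 and the level
range -2048..2047 are assumptions made for illustration, not taken from
FFmpeg. Note that the sign test is done on the original value of level,
so the two forms agree whatever the sign of qmul.

```c
#include <assert.h>
#include <stdint.h>

/* Loop body as written in the kernel: the multiply appears in both arms. */
static int32_t unquant_original(int32_t level, int32_t qmul, int32_t qadd)
{
  if (level < 0)
    return level * qmul + qadd;
  else
    return level * qmul - qadd;
}

/* Loop body once the shared multiply is computed a single time.
   The condition still tests the original level, not the product. */
static int32_t unquant_hoisted(int32_t level, int32_t qmul, int32_t qadd)
{
  int32_t prod = level * qmul;
  return level < 0 ? prod + qadd : prod - qadd;
}

/* Hypothetical helper: asserts that both forms agree over the assumed
   ranges and returns the number of (qscale, level) pairs checked. */
static long check_equivalence(void)
{
  long checked = 0;
  for (int qscale = 1; qscale <= 31; qscale++)
    {
      int32_t qadd = (qscale - 1) | 1;
      int32_t qmul = qscale << 1;
      for (int32_t level = -2048; level <= 2047; level++)
        {
          assert (unquant_original (level, qmul, qadd)
                  == unquant_hoisted (level, qmul, qadd));
          checked++;
        }
    }
  return checked;
}
```

Calling check_equivalence () from a test driver exercises every pair in
those ranges; the values are small enough that no signed overflow occurs.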

On this code GCC fails to combine the + and the - with the *, as they
both depend on the same multiplication.  However, if I modify the DCT
kernel to artificially remove the redundancy:

      if (level < 0)
        level = level * qmul + qadd;
      else
        level = level * qadd - qmul;

the kernel is vectorized with the expected insns:

        vpmacsdd        %xmm1, %xmm6, %xmm0, %xmm3
        vpmacsdd        %xmm5, %xmm1, %xmm0, %xmm2
        vpcomltd        %xmm4, %xmm0, %xmm0
        vpcmov  %xmm0, %xmm2, %xmm3, %xmm0

Here is the slower and larger code generated for the original DCT,
with one * and two +:

        vpmulld %xmm6, %xmm0, %xmm1
        vpcomltd        %xmm3, %xmm0, %xmm0
        vpaddd  %xmm5, %xmm1, %xmm2
        vpaddd  %xmm4, %xmm1, %xmm1
        vpcmov  %xmm0, %xmm1, %xmm2, %xmm0
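
To make the trade-off concrete, here is a rough per-lane scalar model of
the two shapes for the original kernel: one multiply feeding two adds,
versus the multiply duplicated so that each arm becomes a single
multiply-accumulate.  This is only a sketch: the helper names are made
up, and it assumes the two vpaddd operands hold qadd and -qadd, which
the listing above does not show.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit lane of a multiply-accumulate: a * b + c. */
static int32_t lane_mac(int32_t a, int32_t b, int32_t c)
{
  return a * b + c;
}

/* One lane of a "less than" compare: all ones if a < b, else zero. */
static uint32_t lane_lt(int32_t a, int32_t b)
{
  return a < b ? UINT32_MAX : 0u;
}

/* One lane of a bitwise select: bits of x where mask is set, else y.
   The conversion back to int32_t is implementation-defined in ISO C;
   GCC defines it as wrapping modulo 2^32. */
static int32_t lane_select(uint32_t mask, int32_t x, int32_t y)
{
  return (int32_t) (((uint32_t) x & mask) | ((uint32_t) y & ~mask));
}

/* Shape emitted today: multiply once, then two adds, then select.
   Assumption: the second add uses a precomputed -qadd. */
static int32_t lane_mul_then_add(int32_t level, int32_t qmul, int32_t qadd)
{
  int32_t prod = level * qmul;
  int32_t neg_arm = prod + qadd;
  int32_t pos_arm = prod + (-qadd);
  return lane_select (lane_lt (level, 0), neg_arm, pos_arm);
}

/* Wanted shape: the multiply is duplicated so that each arm is a
   single multiply-accumulate, then select. */
static int32_t lane_two_macs(int32_t level, int32_t qmul, int32_t qadd)
{
  int32_t neg_arm = lane_mac (level, qmul, qadd);
  int32_t pos_arm = lane_mac (level, qmul, -qadd);
  return lane_select (lane_lt (level, 0), neg_arm, pos_arm);
}
```

Both functions return the same value for any inputs that do not
overflow; the second one just does the multiply twice, which is the
redundancy combine would have to reintroduce.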

Is there a simple way to teach combine how to introduce redundancy to
generate higher level insns?

Thanks,
Sebastian Pop
--
AMD / Open Source Compiler Engineering / GNU Tools
