Hi,
I was looking at why GCC's insn selection fails to produce the XOP
integer multiply-add vector insns (vpmacsdd) in the vectorized DCT
kernel of FFmpeg: DOM detects a redundant expression and eliminates it,
and that makes it impossible for combine to form the higher level insns.
The DCT kernel looks like this:
    static void
    dct_unquantize_h263_inter_c (DCTELEM * block, int qscale, int nCoeffs)
    {
      int i, level, qmul, qadd;

      qadd = (qscale - 1) | 1;
      qmul = qscale << 1;

      for (i = 0; i <= nCoeffs; i++)
        {
          level = block[i];
          if (level < 0)
            level = level * qmul + qadd;
          else
            level = level * qmul - qadd;
          block[i] = level;
        }
    }
The expression "level * qmul" is computed in both arms of the
conditional, so it is redundant, and it gets hoisted out of the
condition:
    tmp = level * qmul;
    if (level < 0)
      level = tmp + qadd;
    else
      level = tmp - qadd;
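
As a sanity check on the transformation described above, here is a
minimal sketch that compares the two forms for a single coefficient.
The function names, the temporary "tmp" and the tested range of levels
are hypothetical and chosen only for illustration; the product is kept
in a temporary so that the sign test still looks at the original level.

```c
#include <assert.h>
#include <stdint.h>

/* Form written in the source: the multiply appears in both arms. */
static int32_t unquant_original(int32_t level, int32_t qmul, int32_t qadd)
{
    if (level < 0)
        return level * qmul + qadd;
    else
        return level * qmul - qadd;
}

/* Form after the redundant multiply has been hoisted: one shared
   product, then an add or a subtract.  "tmp" is a hypothetical name. */
static int32_t unquant_hoisted(int32_t level, int32_t qmul, int32_t qadd)
{
    int32_t tmp = level * qmul;

    if (level < 0)
        return tmp + qadd;
    else
        return tmp - qadd;
}

/* Compare both forms for one qscale over an arbitrary range of levels
   (assumed here, not taken from FFmpeg).  Returns the number of levels
   that were checked. */
static int check_forms_agree(int32_t qscale)
{
    int32_t qadd = (qscale - 1) | 1;
    int32_t qmul = qscale << 1;
    int checked = 0;

    for (int32_t level = -2048; level <= 2047; level++) {
        assert(unquant_original(level, qmul, qadd)
               == unquant_hoisted(level, qmul, qadd));
        checked++;
    }
    return checked;
}
```

Both forms compute the same value; the difference only matters to
combine, which now sees one multiply with two users.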
On this code GCC fails to combine the + and the - with the *, because
both depend on the result of the same multiplication. However, if I
modify the DCT kernel to artificially remove the redundancy (which of
course changes what the kernel computes):
    if (level < 0)
      level = level * qmul + qadd;
    else
      level = level * qadd - qmul;
the kernel is vectorized with the expected insns:
    vpmacsdd %xmm1, %xmm6, %xmm0, %xmm3
    vpmacsdd %xmm5, %xmm1, %xmm0, %xmm2
    vpcomltd %xmm4, %xmm0, %xmm0
    vpcmov   %xmm0, %xmm2, %xmm3, %xmm0
Here is the larger and slower code generated for the original DCT
kernel, five insns instead of four, with one multiply and two adds:
    vpmulld  %xmm6, %xmm0, %xmm1
    vpcomltd %xmm3, %xmm0, %xmm0
    vpaddd   %xmm5, %xmm1, %xmm2
    vpaddd   %xmm4, %xmm1, %xmm1
    vpcmov   %xmm0, %xmm1, %xmm2, %xmm0
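
To make the two instruction shapes easier to compare, here is a rough
per-lane scalar model in C. It assumes the usual semantics of the XOP
insns (vpmacsdd as a multiply-add that keeps the low 32 bits, vpcomltd
as a signed compare producing an all-ones mask, vpcmov as a bitwise
select); the helper names are hypothetical, and both functions model
the original kernel, so the four-insn variant is what one would hope
combine could produce rather than actual GCC output.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of vpmacsdd (assumed semantics): a * b + c, low 32 bits. */
static uint32_t lane_macs(uint32_t a, uint32_t b, uint32_t c)
{
    return a * b + c;
}

/* One lane of vpcomltd (assumed): all ones if a < b as signed ints. */
static uint32_t lane_comlt(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b ? UINT32_MAX : 0u;
}

/* One lane of vpcmov (assumed): bitwise select under a mask. */
static uint32_t lane_cmov(uint32_t mask, uint32_t if_set, uint32_t if_clear)
{
    return (if_set & mask) | (if_clear & ~mask);
}

/* Four-insn shape: two multiply-adds, one compare, one select. */
static uint32_t lane_fused(uint32_t level, uint32_t qmul, uint32_t qadd)
{
    uint32_t neg = lane_macs(level, qmul, qadd);       /* level < 0 arm  */
    uint32_t pos = lane_macs(level, qmul, 0u - qadd);  /* level >= 0 arm */
    uint32_t mask = lane_comlt(level, 0u);

    return lane_cmov(mask, neg, pos);
}

/* Five-insn shape: one multiply, one compare, two adds, one select. */
static uint32_t lane_unfused(uint32_t level, uint32_t qmul, uint32_t qadd)
{
    uint32_t prod = level * qmul;
    uint32_t mask = lane_comlt(level, 0u);
    uint32_t pos = prod + (0u - qadd);
    uint32_t neg = prod + qadd;

    return lane_cmov(mask, neg, pos);
}

/* Check that both shapes agree over an arbitrary range of levels and
   return how many levels were checked. */
static int check_lanes_agree(uint32_t qmul, uint32_t qadd)
{
    int checked = 0;

    for (int32_t level = -2048; level <= 2047; level++) {
        uint32_t l = (uint32_t)level;

        assert(lane_fused(l, qmul, qadd) == lane_unfused(l, qmul, qadd));
        checked++;
    }
    return checked;
}
```

In this model the fused shape performs the multiply twice, once per
multiply-add, which is exactly the redundancy that has been removed
before combine runs.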
Is there a simple way to teach combine how to introduce redundancy to
generate higher level insns?
Thanks,
Sebastian Pop
--
AMD / Open Source Compiler Engineering / GNU Tools