On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
>
> 在 2024/3/26 下午5:48, Xi Ruoyao 写道:
> > The latency of LA464 and LA664 division instructions depends on the
> > input. When I updated the costs in r14-6642, I unintentionally set the
> > division costs to the best-case latency (when the first operand is 0).
> > Per a recent discussion [1] we should use "something sensible" instead
> > of it.
> >
> > Use the average of the minimum and maximum latency observed instead.
> > This enables multiplication to reciprocal sequence reduction and speeds
> > up the following test case for about 30%:
> >
> > int
> > main (void)
> > {
> > unsigned long stat = 0xdeadbeef;
> > for (int i = 0; i < 100000000; i++)
> > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> > asm(""::"r"(stat));
> > }
> >
> > [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
>
> The test case div-const-reduction.c is modified to assemble the instruction
> sequence as follows:
> lu12i.w $r12,999997440>>12 # 0x3b9ac000
> ori $r12,$r12,2567
> mod.w $r13,$r13,$r12
>
> This sequence of instructions takes 5 clock cycles.
Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
I'll try to make a better heuristic as Richard suggests...
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University