On Tue, 26 Sept 2023 at 14:18, Jeff Law <jeffreya...@gmail.com> wrote:

>  But the Coremark code is what it is.  This isn't a whole lot
> different than the work in the 90s which rewrote loops and compromised
> some of the spec benchmarks, or the eqntott hack to simplify that one
> key loop in eqntott.

I think the stated purpose of the benchmark matters.  If dhrystone had been
pushed as an abstraction-penalty benchmark, it would have been fine to
present results with WPA, inlining and dead code elimination as ordinary
dhrystone results.  But since it's supposed to exercise specific hardware
features, and not have the tests for these optimized away, that's not
appropriate.

So, first, we make the compiled program perform the work that the benchmark
was supposed to include in the measurement, just more efficiently.
Second, we not only optimize the benchmark, but also make the target-optimized
code generation available for other programs.  For new programs targeted at
GNU C, that is minimally achieved by providing a built-in function, and more
generally, for new code, by being able to replicate the idiom from coremark
that is recognized by GCC.  The mere existence of a C idiom in a popular
benchmark also makes that idiom a common one, if it wasn't one already.
As we find new patterns that are used to implement CRC which would
be better replaced with a target-specific implementation, we can add these.
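
For reference, the kind of bitwise idiom being discussed looks roughly like
the sketch below.  This is a generic bit-at-a-time CRC-16 update in the
style of such loops, not a copy of the coremark source; the polynomial
(here the reflected CCITT one, 0x8408), names and types are placeholders
and vary between code bases.

#include <stdint.h>

/* Generic bit-at-a-time CRC-16 update over one byte (reflected form).
   Loops of roughly this shape are what the CRC detection is meant to
   recognize and replace with better target code.  */
uint16_t
crc16_update_byte (uint8_t data, uint16_t crc)
{
  for (int i = 0; i < 8; i++)
    {
      int feedback = (data ^ crc) & 1;
      crc >>= 1;
      data >>= 1;
      if (feedback)
        crc ^= 0x8408;
    }
  return crc;
}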

This is similar to rotate operations, which are supported directly by some
processors; even for other targets there are generally preferred ways to
expand the code, but there are a couple of different variants depending on
the available instruction set, registers, and the microarchitecture
(pipeline, latency etc).  We started out with one pattern that was
recognized, and as new patterns were identified in C code, we improved GCC
to recognize these other patterns.
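
For illustration, one commonly recognized rotate idiom looks like the
following sketch (a generic example, not a list of the exact patterns in
GCC's sources):

#include <stdint.h>

/* Classic shift-and-or rotate idiom.  Writing the right-shift count as
   (-n & 31) keeps the behaviour defined for n == 0 as well; compilers
   commonly turn this shape into a single rotate instruction where one
   is available.  */
static inline uint32_t
rotl32 (uint32_t x, unsigned int n)
{
  return (x << (n & 31)) | (x >> (-n & 31));
}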

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including
> 4 implementations (IIRC) in the kernel itself.

I have always assumed that such code must exist (CRCs are useful for a number
of problems, and besides, they wouldn't have been included in coremark as
a part of the workload if they were not relevant), but it is good to have
confirmation, and even better to have code that can detect and analyse an
entire class of idioms that is in such widespread use.

This still leaves room for further improvements, like detecting fully-unrolled
code, table lookups, or slice-by-N, and replacing them with better
target-optimized code where this is indicated by the optimization flags, to
save execution time or code/rodata size.  Not something we have to tackle
now, but just because we don't do it now doesn't mean we couldn't address
these in the future if that appears worthwhile.
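
To make the table-lookup case concrete, the kind of code such detection
would have to recognize (or that an expander might emit) typically looks
like this sketch; polynomial, table handling and naming are generic
placeholders, not what GCC would actually generate:

#include <stdint.h>
#include <stddef.h>

/* Byte-at-a-time table-driven CRC-32 (reflected, polynomial 0xEDB88320).
   The table would normally live in rodata; it is filled at run time here
   only to keep the sketch self-contained.  Initial value and final xor
   conventions are left to the caller.  */
static uint32_t crc32_table[256];

static void
crc32_init (void)
{
  for (uint32_t i = 0; i < 256; i++)
    {
      uint32_t c = i;
      for (int k = 0; k < 8; k++)
        c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
      crc32_table[i] = c;
    }
}

static uint32_t
crc32_update (uint32_t crc, const uint8_t *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    crc = (crc >> 8) ^ crc32_table[(crc ^ buf[i]) & 0xff];
  return crc;
}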

> I can easily see creating a clmul RTL opcode for targets which support
> it and hoisting the clmul vs lookup table selection into generic code.
> I'm still pondering if we're likely to ever see cases where we want a
> vector clmul intrinsic or support in the autovectorizer for clmul.
> We've certainly got targets with vector clmul in the ISA, the question
> is using it.

If we aim for optimal code, I think it more likely that we want to detect a
block CRC computation, and have a target expander decide to do that
with inline code or a library call that uses vectorized clmul.  At the time
we add such block-CRC expansion code, it also makes sense to add a
builtin for block CRC so that new GNU C programs can directly request
that functionality without having to go through the cargo cult of matching
a supported idiom.

Now, the library might be written in GNU C, and for that it might be useful
to have a vector clmul intrinsic so that we can express this algorithm more
easily.
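
As a minimal illustration of the primitive such an intrinsic would expose,
here is a carry-less 64x64 multiply using x86's existing PCLMULQDQ
intrinsic; this is only an example of the underlying operation that
clmul-based CRC folding is built from, not the (not yet existing) vector
clmul intrinsic discussed above, and certainly not a complete block-CRC
implementation.

#include <stdint.h>
#include <wmmintrin.h>   /* PCLMULQDQ; compile with -mpclmul.  */

/* Carry-less (GF(2)[x]) multiplication of two 64-bit values.  The two
   64-bit halves of the __m128i result together hold the 127-bit
   polynomial product.  */
static inline __m128i
clmul64 (uint64_t a, uint64_t b)
{
  return _mm_clmulepi64_si128 (_mm_set_epi64x (0, (long long) a),
                               _mm_set_epi64x (0, (long long) b),
                               0x00);
}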

> Probably the biggest task in that space right now is to see if we can
> avoid the symbolic execution engine by re-using ranger.

I'll be interested to see what you'll come up with, but if it comes to falling
back on the symbolic execution engine, its compile-time cost isn't much of an
issue if you only use it where a proper match is likely.  So whatever
heuristics are used before deciding to use the engine matter.  Can all the
cases detected by the engine be recognized as a loop with a reduction?  We
might use different heuristics for different optimization levels, i.e. allow
more false negatives at -O1, and more false positives at -O2 /
-fexpensive-optimizations.

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.  While the original target was Coremark, we've
> found similar bitwise implementations of CRCs all over the place.
> There's no good reason that code should have to change.
>
> The idea of exposing a CRC builtin was an intermediate step that would
> allow those willing to change their code or writing new code to write
> their CRC in a trivial way and let the compiler figure out a sensible
> implementation while we clean up the CRC loop detection/validation code.

While I concur that we want existing code to be able to take advantage of
this optimization by GCC recognizing whatever method the code uses to
compute CRC (within reason, we are still bound by the laws of
computability and by resource constraints for the compilation), I would
like to stress that I don't think the builtin will lose its value over time.
It can be used in tests to make sure we exercise the code generation for the
internal function.  It can be used in new GNU C code to make it easier to
do a CRC computation without worrying about the implementation.  If
an accord with other major compiler suppliers (e.g. the LLVM community)
is reached, it might even see more widespread use.

> Which is basically what I was suggesting by moving to the various expand
> interfaces.  That code does not need to be RISC-V specific and in fact
> it's much better if it is not and can be shared across architectures.

For the RISC-V implementation, I started out with writing some variants in
C and assembler optimized for latency, register use, and code size, picking
the one that's best for speed for constant polynomials (crcu16_1 is still
viable for other cases, but I didn't have time to implement *and test*
-Os and variable-polynomial paths), and then I crafted an expander so that
I'd get the desired assembler at the end of compilation.  I had this idea that
every port wants its own implementation (with the previous port going a
completely different route than RISC-V).  But I suppose you are right that
a lot of GCC's ports that are in mainline are just like RISC-V in that they
will work well with a lookup table, and to get an initial version that works,
it makes sense to start with a single or a handful of generic
implementation(s), and do target-specific refinements later.
I suppose for such an initial implementation, RISC and x86 32/64-bit
targets will be fine with the table lookup, while small-address-space targets
like avr would be better off with a library function along the lines of
crcu16_1, tailored to their ALU and register sizes.  The 'For base
instruction set' variant should actually work with int or short just as well
as with long; long just happens to be faster for RISC-V as it doesn't need
extra extensions, while avr will likely be much better off with short.

FWIW, for the table lookup code, there are probably some places where
force_operand is convenient to make the code portable across targets.
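
Roughly the kind of expander fragment I have in mind (a sketch only, with
placeholder names like table_base and idx, not code from the patch):

/* Inside a target-independent expander: build the address
   table_base + idx * 4 and let force_operand emit whatever insns the
   target needs to turn it into a valid operand, instead of open-coding
   target-specific address arithmetic.  */
rtx addr = gen_rtx_PLUS (Pmode, table_base,
                         gen_rtx_MULT (Pmode, idx, GEN_INT (4)));
addr = force_operand (addr, NULL_RTX);
rtx mem = gen_rtx_MEM (SImode, addr);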
