On Tue, 26 Sept 2023 at 14:18, Jeff Law <jeffreya...@gmail.com> wrote:
> But the Coremark code is what it is.  This isn't a whole lot
> different than the work in the 90s which rewrote loops and compromised
> some of the spec benchmarks, or the eqntott hack to simplify that one
> key loop in eqntott.

I think the stated purpose of the benchmark matters.  If dhrystone had
been pushed as an abstraction-penalty benchmark, it would have been fine
to present results with WPA, inlining and dead code elimination as
ordinary dhrystone results.  But since it's supposed to exercise specific
hardware features, and not have the tests for these optimized away,
that's not appropriate.

So, first, we make the compiled program perform the work that the
benchmark was supposed to include in the measurement, just more
efficiently.  Second, we not only optimize the benchmark, but also make
the target-optimized code generation available for other programs.  For
new programs targeted at GNU C, that is minimally achieved by providing a
built-in function, and in general for new code, by being able to
replicate the idiom from Coremark that is recognized by GCC.  The mere
existence of a C idiom in a popular benchmark also makes this idiom a
common idiom, if it hasn't already been that before.

As we find new patterns that are used to implement CRC which would be
better replaced with a target-specific implementation, we can add these.
This is similar to rotate operations, which are supported directly by
some processors; even for other targets there are generally preferred
ways to expand the code, but there are a couple of different variants
depending on the available instruction set, registers, and the
microarchitecture (pipeline, latency etc.).  We started out with one
pattern that was recognized, and as new patterns were identified in C
code, we improved GCC to recognize these other patterns.

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including
> 4 implementations (IIRC) in the kernel itself.

I have always assumed that such must exist (CRCs are useful for a number
of problems, and besides, they wouldn't have been included in Coremark as
part of the workload if they were not relevant), but it is good to have
confirmation, and even better to have code that can detect and analyse an
entire class of idioms that is in such widespread use.

This still leaves room for further improvements, like detecting
fully-unrolled code, table lookup, or slice-by-N, and replacing them with
better target-optimized code where this is indicated by the optimization
flags, to save execution time or code/rodata size.  Not something we have
to tackle now, but just because we don't do it now doesn't mean we
couldn't address these in the future if that appears worthwhile.

> I can easily see creating a clmul RTL opcode for targets which support
> it and hoisting the clmul vs lookup table selection into generic code.
> I'm still pondering if we're likely to ever see cases where we want a
> vector clmul intrinsic or support in the autovectorizer for clmul.
> We've certainly got targets with vector clmul in the ISA, the question
> is using it.

If we aim for optimal code, I think it is more likely that we want to
detect a block CRC computation, and have a target expander decide whether
to do that with inline code or a library call that uses vectorized clmul.
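For reference, the kind of bit-at-a-time loop I mean by "the idiom from
Coremark" looks roughly like this; it's only a sketch (the polynomial,
shift direction and data width all vary between codebases, and this is
not the Coremark source):

/* Generic MSB-first bit-at-a-time CRC-16 update over one byte, using
   the 0x1021 polynomial.  The shape of the loop, not the particular
   parameters, is what the recognition has to catch.  */
unsigned short
crc16_update (unsigned short crc, unsigned char data)
{
  int i;

  crc ^= (unsigned short) data << 8;
  for (i = 0; i < 8; i++)
    {
      if (crc & 0x8000)
        crc = (crc << 1) ^ 0x1021;
      else
        crc <<= 1;
    }
  return crc;
}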
At the time we add such block-CRC expansion code, it also makes sense to
add a builtin for block CRC so that new GNU C programs can directly
request that functionality without having to go through the cargo cult of
matching a supported idiom.  Now, the library might be written in GNU C,
and for that it might be useful to have a vector clmul intrinsic so that
we can express this algorithm more easily.

> Probably the biggest task in that space right now is to see if we can
> avoid the symbolic execution engine by re-using ranger.

I'll be interested to see what you come up with, but if it comes to
reverting to the symbolic execution engine, the compile time cost isn't
much as long as it is only used on a proper match, so whatever heuristics
are applied before deciding to use the engine matter.  Can all the cases
detected by the engine be recognized as a loop with a reduction?  We
might use different heuristics for different optimization levels, i.e.
allow more false negatives at -O1, and more false positives at -O2 /
-fexpensive-optimizations.

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.  While the original target was Coremark, we've
> found similar bitwise implementations of CRCs all over the place.
> There's no good reason that code should have to change.
>
> The idea of exposing a CRC builtin was an intermediate step that would
> allow those willing to change their code or writing new code to write
> their CRC in a trivial way and let the compiler figure out a sensible
> implementation while we clean up the CRC loop detection/validation code.

While I concur that we want existing code to be able to take advantage of
this optimization by GCC recognizing whatever method the code uses to
compute a CRC (within reason; we are still bound by the laws of
computability and by resource constraints for the compilation), I would
like to stress that I don't think the builtin will lose its value over
time.  It can be used in tests to make sure we exercise the code
generation for the internal function.  It can be used in new GNU C code
to make it easier to do a CRC computation without worrying about the
implementation.  And if an accord with other major compiler suppliers
(e.g. the LLVM community) is reached, it might even see more widespread
use.

> Which is basically what I was suggesting by moving to the various expand
> interfaces.  That code does not need to be RISC-V specific and in fact
> it's much better if it is not and can be shared across architectures.

For the RISC-V implementation, I started out with writing some variants
in C and assembler optimized for latency, register use, and code size,
picking the one that's best for speed for constant polynomials (crcu16_1
is still viable for other cases, but I didn't have time to implement *and
test* -Os and variable-polynomial paths), and then I crafted an expander
so that I'd get the desired assembler at the end of compilation.  I had
this idea that every port wants its own implementation (with the previous
port going a completely different route than RISC-V).  But I suppose you
are right that a lot of the GCC ports that are in mainline are just like
RISC-V in that they will work well with a lookup table (a sketch of what
I mean follows below), and to get an initial version that works, it makes
sense to start with a single or a handful of generic implementation(s),
and do target-specific refinements later.
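To make that concrete, the byte-at-a-time table-lookup scheme I have in
mind is roughly the following.  This is only a sketch (MSB-first, 0x1021
polynomial, table built at run time); it is not the code from the patch,
and a generic expander would emit the equivalent of the inner update
against a precomputed table in rodata rather than call a function:

/* Sketch of a generic byte-at-a-time table-lookup CRC-16 (MSB-first,
   polynomial 0x1021).  crc16_init_table must run once before
   crc16_lookup is used.  */
static unsigned short crc16_table[256];

static void
crc16_init_table (void)
{
  unsigned int byte, bit;

  for (byte = 0; byte < 256; byte++)
    {
      unsigned short crc = byte << 8;
      for (bit = 0; bit < 8; bit++)
        crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1);
      crc16_table[byte] = crc;
    }
}

unsigned short
crc16_lookup (unsigned short crc, const unsigned char *buf,
              unsigned long len)
{
  unsigned long i;

  for (i = 0; i < len; i++)
    crc = (crc << 8) ^ crc16_table[(crc >> 8) ^ buf[i]];
  return crc;
}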
I suppose for such an initial implementation, RISC and x86* 32/64-bit
targets will be fine with the table lookup, while small-address-space
targets like avr would be better off with a library function along the
lines of crcu16_1, tailored to their ALU and register sizes.  The 'For
base instruction set' variant should actually work with int or short just
as well as with long; long just happens to be faster for RISC-V as it
doesn't need extra extensions, while avr will likely be much better off
with short.

FWIW, for the table lookup code, there are probably some places where
force_operand is convenient to make the code portable across targets.
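Coming back to the builtin mentioned above, what I would picture new
GNU C code writing once a block-CRC builtin exists is something like the
following; the builtin name and argument order are invented here purely
for illustration and are not a proposed interface:

/* Hypothetical use of a block-CRC builtin; __builtin_crc16_block is a
   placeholder name, not an existing or agreed-upon GCC builtin.  The
   compiler would be free to expand it to a table lookup, clmul-based
   inline code, or a library call, depending on target and options.  */
unsigned short
checksum_frame (const unsigned char *buf, unsigned long len)
{
  unsigned short crc = 0xffff;  /* initial value depends on the CRC variant */

  crc = __builtin_crc16_block (crc, buf, len, 0x1021);
  return crc;
}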