On 9/23/23 17:05, Joern Rennecke wrote:
Mariam Harutyunyan:
+++ b/gcc/ChangeLog
@@ -1,3 +1,45 @@
+2023-08-03  Mariam Arutunian  <mariamarutun...@gmail.com>
+

It is common courtesy to include all authors in the list of authors
for the ChangeLog; also,
this may help people in the future understand the history of the code better.
While most of your patch is new, it still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
And stripping out the comment why, currently, we can't use linkonce
for crc tables on the RISC-V target is
not helpful to someone who wants to understand the code.
Thanks for pointing this out, Joern. Neither Mariam nor I was aware that some of your code was used to seed this work. We are happy to include you as a co-author.


Mariam -- the way to do this is to add a "Co-Author:" line to your commit message. Upstream commit:

5ff4431675c0d0c800d4a983254e94a6b401c14d

shows the right format.
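
For reference, the trailer generally looks like the line below (name and address here are placeholders; the commit above shows the exact spelling the ChangeLog tooling expects):

    Co-Authored-By: Full Name <full.name@example.com>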



See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html
Yes, this was the gist of the discussion I had with Richi. CRC optimization has a lot in common with distribution, reduction and final value replacement.

For early testing and development we focused on detecting a CRC function and replacing it at a function level. That allowed Mariam to avoid a class of problems for a while and focus on an end-to-end solution. Once that was working reasonably well, Mariam went back and enhanced the first-pass filter and validation so that it would work if the CRC loop was embedded inside a larger function, either at the source level or due to inlining.

The placement we chose was just after loop distribution. I think there's some wiggle room in terms of pass placement.

The one thing I'm hesitant to do is make this part of an existing pass. I don't see many benefits of integrating inside loop dist, and I very much like its clean separation.



Mariam Harutyunyan:
It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.

This sounds like this is a finished work, although defining built-in
functions to calculate the CRC of single data elements and recognizers
for some C idioms that do these calculations
is just a starting point.
More precisely, it was carved out of a larger piece of work in the hopes that it would be useful independently and allow upstreaming of the backend work. Alexander was fairly dismissive of the utility of doing that. There was some interest from Cauldron attendees in potentially reviving that idea.



Alexander Monakov :

Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where
- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all
we would not be spending effort on this, correct?

It is a stated goal of Coremark to test performance for CRC.  They do
not use a library call to implement CRC, but a specific bit-banging
algorithm they have chosen.  That algorithm is, for the vast majority
of processors, not representative of the target's performance potential
in calculating CRCs; thus if a compiler fails to translate this into
the CRC implementation that would be used for performance code, the
compiler frustrates this goal of Coremark to give a measure of CRC
calculation performance.

All true. But the Coremark code is what it is. This isn't a whole lot
different than the work in the 90s which rewrote loops and compromised
some of the SPEC benchmarks, or the eqntott hack to simplify that one
key loop in eqntott.

What ultimately pushed us to keep moving forward on this effort was discovering numerous CRC loop implementations out in the wild, including 4 implementations (IIRC) in the kernel itself.
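
For concreteness, here's a minimal sketch (my own illustration, not code from Coremark or the kernel; the reflected CRC-32 polynomial is just an example) of the bit-at-a-time idiom this work aims to recognize:

    #include <stdint.h>
    #include <stddef.h>

    uint32_t
    crc32_bitwise (uint32_t crc, const uint8_t *buf, size_t len)
    {
      for (size_t i = 0; i < len; i++)
        {
          crc ^= buf[i];
          for (int bit = 0; bit < 8; bit++)
            /* One conditional polynomial reduction per message bit.  */
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
      return crc;
    }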



At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Some processors have specialized instructions for CRC computations.
They do. But I think the discussion about providing intrinsic access to a clmul instruction is orthogonal to the discussion about whether or not to provide a builtin for crc.

I can easily see creating a clmul RTL opcode for targets which support it and hoisting the clmul vs lookup table selection into generic code. I'm still pondering if we're likely to ever see cases where we want a vector clmul intrinsic or support in the autovectorizer for clmul. We've certainly got targets with vector clmul in the ISA; the question is whether we'd actually use it.
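
To make the distinction concrete, carry-less multiplication is just polynomial multiplication over GF(2). A portable sketch (my own, not a proposed GCC API) looks like this; a clmul instruction or RTL opcode would collapse it to a single operation:

    #include <stdint.h>

    /* Carry-less (XOR-accumulating) multiply of two 32-bit values.  */
    uint64_t
    clmul32 (uint32_t a, uint32_t b)
    {
      uint64_t acc = 0;
      for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
          acc ^= (uint64_t) a << i;
      return acc;
    }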



Instead, efficient CRC loops have the following structure:
- they carry an unreduced remainder in the loop, performing final reduction
  modulo polynomial only once after the loop — this halves the CLMUL count;
- they overlap multiple CLMUL chains to make the loop throughput-bound
rather than latency-bound. The typical unroll factor is about 4x-8x.

If you want to recognize a loop that does a CRC on a block, it makes
sense to start with recognizing the CRC computation for single array
elements first.  We have to learn to
walk before we can run.

Note that my initial patch already contained a match.pd stanza to
recognize two chained single-element CRC calculations.
Mariam's code goes well beyond simple match.pd recognition of a block. The hope is to do a few rounds of internal review/cleanup, then post it for the world to see.

Probably the biggest task in that space right now is to see if we can avoid the symbolic execution engine by re-using ranger.


Jeff Law:
The intention is to provide a useful
builtin_crc while at the same time putting one side of the
infrastructure we need for automatic detection of CRC loops and turning
them into table lookups or CLMULs.

Note that when optimizing for size, for a target with tiny memory, or
when using a non-constant (or constant but undiscoverable by the
compiler) polynomial, we can't use the table lookup.  But still, even if
we don't have CLMUL, we can generally speed up CRC computation over
the Coremark algorithm by using something more suited to the target,
like the crcu16_1 function I
put into comments in my patch.
Sure. And once the CRC loop is discovered and an IFN created, the backend would have the opportunity to guide code generation.
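
To make that concrete, the per-byte step of the table-lookup form is roughly the following (my own sketch, not the pass's actual output; reflected CRC-32 again used purely as an example): one XOR, one shift, and one indexed load per byte.

    #include <stdint.h>
    #include <stddef.h>

    uint32_t
    crc32_by_table (uint32_t crc, const uint8_t *buf, size_t len,
                    const uint32_t table[256])
    {
      for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ table[(crc ^ buf[i]) & 0xff];
      return crc;
    }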






Alexander Monakov:
Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?

We can provide a fallback implementation for all targets with table
lookup and/or shifts.
And as I've stated before, the latency of clmuls is dropping. I wouldn't be terribly surprised to see single-cycle clmul implementations showing up within the next 18-24 months. It's really just a matter of gate budget vs. expected value.



Alexander Monakov:
I think offering a conventional library for CRC has substantial advantages.
Are you volunteering?  It would make our work to emit code for block
CRC easier if we could
just use a library call when we recognize a block CRC (although making
that recognizer is likely still considerable work if we want to get
good coverage over different coding styles).

Although maybe Oleg Endo's library, as mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
might be suitable?  What is the license for that?
But going this route as the primary direction means that any such code has to change its implementation to take advantage of a faster CRC implementation.

To reiterate, the real goal here is to take code as-is and make it significantly faster. While the original target was Coremark, we've found similar bitwise implementations of CRCs all over the place. There's no good reason that code should have to change.

The idea of exposing a CRC builtin was an intermediate step that would allow those willing to change their code, or those writing new code, to express their CRC in a trivial way and let the compiler figure out a sensible implementation while we clean up the CRC loop detection/validation code.


Yes, the builtin functions naturally have lots of narrow operations,
and thus the operands are narrow.
Note that in my implementation of crcsihi4 / crchihi4, I use a Pmode
paradoxical subreg of the output operand, so that the move itself can
be in Pmode.

I suppose we could make the output mode wider than the data size, but
that can only be for the targets that want it; we can't bend the
target-independent built-in for the convenience of RISC-V.
So the builtin function expanding machinery would have to know to use
the mode of the pattern instead of the one implied by the type.
I'm not at all suggesting we do something weird in the middle end for RISC-V. My point was we should look for a better solution. This feels like it ought to work more like the existing widening/narrowing capabilities we have in the expansion code.



Jeff Law :

I wonder if all the table-based generation support code should move into a
generic file so that it could be used by other targets.  I don't offhand
see anything in there that is RISC-V specific.

Well, the avoidance of linkonce is.  OTOH, other targets might not
have weak.  On the gripping hand, making the table static and
duplicating it all over would be sad.
Right. But making the table a DECL node with an initializer allows us to clean up a whole mess of things -- including the ability to take advantage of linkonce on targets that support it. The whole point is to take what was fairly target specific and generalize it so that it's useful elsewhere.
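
For what it's worth, computing that initializer is cheap. Here's a rough sketch (illustrative only, not the actual generic code; reflected CRC-32 convention assumed) of how the 256 entries could be generated:

    #include <stdint.h>

    static void
    build_crc32_table (uint32_t table[256], uint32_t poly)
    {
      for (uint32_t i = 0; i < 256; i++)
        {
          uint32_t crc = i;
          for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ poly : crc >> 1;
          /* Entry i is the CRC state after consuming the single byte i.  */
          table[i] = crc;
        }
    }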



I suppose we could make it a string constant, to take advantage of
string commoning on targets where that is supported.  However, that
would likely get us into trouble on targets with addressable units
other than 8 bits.
I think we can leave this as a follow-up. I'm not convinced it's a major issue.



However, the code generation actually is somewhat target-specific
beyond these technicalities.
The widths of arithmetic used, and how we order it with memory
accesses (which might need arithmetic transformations), depend on the
facilities available and latency considerations to get good code.
Well, we can still have some generic implementation, but for optimal
performance it makes sense to look at what assembly code you get and
consider target-specific code generation if you can do better.
The widths and such, as well as "does this instruction support immediates", should be handled by using the existing expansion interfaces. That's the broader point.

As written, the code makes various assumptions about the ISA. By moving to standard expansion interfaces, most of those problems are solved.
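
As a very rough illustration of what I mean (a fragment, not code from the patch; expand_binop is one of the existing interfaces I'm referring to), something like

      /* Let the expander pick the insn, legitimize any immediate, and
         widen or fall back to a libcall as the target requires.  */
      crc = expand_binop (word_mode, xor_optab, crc, data,
                          NULL_RTX, 1, OPTAB_LIB_WIDEN);

handles the "does this instruction support immediates" question for us instead of hard-coding RISC-V instructions.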


The other port we used this optimization for actually has dedicated
crc instructions in some configurations, and is generally
memory-tight, so the fallback is tightly coded library functions.
ACK. And I think we can get to a point where those ports that have CRC functions (presumably with a limited set of polynomials) are trivially handled.



It wouldn't be on an Alpha, since that doesn't have indexed
addressing.  And that has nothing to do with 12 bits; tab is a symbol
(usually needs lots of bits, and multiple machine instructions
to set up for RISC-V), and ix is an index.
Right. This isn't about 12-bit immediates or anything like that, just a general need to be careful about how we generate code to ensure it's target independent.


..
And here I don't think we can necessarily depend on the target
supporting shifts by a constant amount.
..
Similarly here, some targets may not support AND with immediate or may
not have enough bits to handle all the immediates that can arise here.

Yes, it's RISC-V specific code in a RISC-V file.  If we want to make
this more general, we might be better off constructing some trees or
gimple and expanding them.  We might indeed do a check if the internal
function or a library function is defined, and substitute a table
lookup otherwise.  (or leave it alone for -Os in that case).
Which is basically what I was suggesting by moving to the various expand interfaces. That code does not need to be RISC-V specific and in fact it's much better if it is not and can be shared across architectures.



So did we ever figure out why we're treating the CRC optab as a
conversion optab?

Input and output are different sizes.  At least for some of the
variants.  So you get an ICE if you treat it as something other than a
conversion.
That's what I was guessing.   Thanks for the confirmation.

jeff
