On Sun, 24 Sept 2023 at 12:41, Alexander Monakov <amona...@ispras.ru> wrote:
>
>
> On Sun, 24 Sep 2023, Joern Rennecke wrote:
>
> > It is a stated goal of coremark to test performance for CRC.
>
> I would expect a good CRC benchmark to print CRC throughput in
> bytes per cycle or megabytes per second.
>
> I don't see where Coremark states that goal. In the readme at
> https://github.com/eembc/coremark/blob/main/README.md
> it enumerates the three subcategories (linked list, matrix ops,
> state machine) and indicates that CRC is used for validation.

At https://www.eembc.org/coremark/index.php , they state under the
Details heading:

...
Replacing the antiquated Dhrystone benchmark, Coremark contains
implementations of the following algorithms: list processing (find and
sort), matrix manipulation (common matrix operations), state machine
(determine if an input stream contains valid numbers), and CRC (cyclic
redundancy check).
...
The CRC algorithm serves a dual function; it provides a workload
commonly seen in embedded applications and ensures correct operation
of the CoreMark benchmark, essentially providing a self-checking
mechanism.
...

They also point to a whitepaper there, which states:

Since CRC is also a commonly used function in embedded applications, this
calculation is included in the timed portion of the CoreMark.

> If it claims that elsewhere, the way its code employs CRC does not
> correspond to real-world use patterns, like in the Linux kernel for
> protocol and filesystem checksumming, or decompression libraries.

That may be so, but we should still strive to optimize the code so that
it serves coremark's intended purpose.

> It is, however, representative of the target CPU's ability to run
> those basic bitwise ops with good overlap with the rest of computation,
> which is far more relevant for the real-world performance of the CPU.

That depends on how much CRC calculation your application does.  You can
disable specific compiler optimizations in GCC for specialized testing.

> > thus if a compiler fails to translate this into the CRC implementation
> > that would be used for performance code, the compiler frustrates this
> > goal of coremark to give a measure of CRC calculation performance.
>
> Are you seriously saying that if a customer chooses CPU A over CPU B
> based on Coremark scores, and then discovers that actual performance
> in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
> entirely fair and the benchmarks scores they saw were not misleading?

Using coremark as a yardstick for any one application is always likely
to give an inaccurate assessment - unless your application is identical
to coremark.  I don't see why whatever implementation is chosen for the
short-length CRC in coremark should be closer to or farther from
slice-by-N CRC; I would expect that to be pseudo-random.  Unless CPU B
has worse GCC support or no suitable hardware instruction for
short-length CRC, in which case the manufacturer might consider
improving support (particularly if it's about GCC support ;-)
Actually, if the CRC optimization is implemented via table lookup on
both CPU A and CPU B, it gets a bit closer to slice-by-N, since both do
table lookups, although with slice-by-N you trade latency for register
pressure.
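For concreteness, the byte-at-a-time table lookup I mean is the classic
Sarwate scheme, sketched below for the common reflected CRC-32
polynomial (this is an illustrative sketch, not the code any particular
compiler emits); slice-by-N is the same idea with N tables, processing N
bytes per iteration:

```c
#include <stdint.h>
#include <stddef.h>

/* Byte-at-a-time table-driven CRC-32 (reflected polynomial 0xEDB88320):
 * one table lookup per input byte. */
static uint32_t crc32_table[256];

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
        crc32_table[i] = c;
    }
}

static uint32_t crc32(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = (crc >> 8) ^ crc32_table[(crc ^ *buf++) & 0xff];
    return ~crc;
}
```

With slice-by-N you XOR N independent lookups per step, which shortens
the dependency chain (less latency per byte) at the cost of more live
values and bigger tables - the latency/register-pressure trade mentioned
above.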

Any single benchmark can't be a good performance predictor for all applications.
If you care a lot about performance for a particular load, you should
benchmark that load,
or something that is known to be a close proxy.

> > > At best we might have
> > > a discussion on providing a __builtin_clmul for carry-less multiplication
> > > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> >
> > Some processors have specialized instructions for CRC computations.
>
> Only for one or two fixed polynomials. For that matter, some processors
> have instructions for AES and SHA, but that doesn't change that clmul is
> a more fundamental and flexible primitive than "CRC".

So it is, but when analyzing user programs that haven't been written by
experts with a focus on performance, CRC is more likely to come up than
clmul.  I agree that it would make sense to have a builtin for clmul
that can be used uniformly across architectures that support this
operation, but I'm not volunteering to write a patch for that.
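For reference, carry-less multiplication is just polynomial
multiplication over GF(2); a portable fallback for such a builtin could
be as simple as the sketch below (the function name `clmul32` is made up
for illustration - targets with PCLMULQDQ, PMULL, or RISC-V Zbc would
use the native instruction instead):

```c
#include <stdint.h>

/* Carry-less (polynomial) 32x32->64 multiplication over GF(2): like
 * ordinary long multiplication, but partial products are combined with
 * XOR instead of addition.  Hypothetical portable fallback. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}
```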

> If only the "walk before you run" logic was applied in favor of
> implementing a portable clmul builtin prior to all this.

I started writing the CRC patch for an architecture that doesn't have
clmul as a base instruction, so a clmul builtin would not have helped.
>
> > A library can be used to implement built-ins in gcc (we still need to
> > define one for block operations, one step at a time...).  However,
> > someone or something needs to rewrite the existing code to use the
> > library.  It is commonly accepted that an efficient way to do this is
> > to make a compiler do the necessary transformations, as long as it can
> > be made to churn out good enough code.
>
> How does this apply to the real world? Among CRC implementations in the
> Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
> a single instance where bitwise CRC is used. It's in the table
> initialization function in xz-utils. The compiler would transform that
> to copying one table into another. Is that a valuable transform?

As long as the target sets sensible costs, we can compare the cost of
the analyzed code to that of the target implementation, and choose not
to do the transformation if there is no gain.
There might be corner cases where we could gain when we see different
table-based implementations mixed, or table-based and non-table-based
implementations, and we'd need WPA to detect that.  Well, we can worry
about that when we see it.  Or not, depending on the causes and the
impact.  Is it the user's fault, and/or insignificant?  Or is it a
problem emerging from throwing lots of code together, where we could get
a significant gain from merging the implementations?
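For concreteness, the bit-at-a-time CRC that Alexander mentions seeing
in table-initialization code looks roughly like the sketch below (an
illustration, not the actual xz-utils source); if the compiler rewrites
this into a table-driven CRC, the net effect inside a table initializer
is indeed copying one table into another:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320): no lookup
 * table, just shifts and conditional XORs - the form a compiler-side
 * CRC recognizer would have to match. */
static uint32_t crc32_bitwise(uint32_t crc, const unsigned char *buf,
                              size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}
```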

> > We can provide a fallback implementation for all targets with table
> > lookup and/or shifts .
>
> How would it help when they are compiled with LLVM, or GCC version
> earlier than 14?

It wouldn't, at least not initially.  That's in the nature of
improving compilers, which leaves older versions behind.  In the case
of using LLVM, it might help those users too if/when LLVM catches up
(if they haven't already implemented a comparable or superior
optimization by the time we release an improved GCC).
