Hi,

On 5/23/26 03:44, Eric Biggers wrote:
> Otherwise this looks good. Really there's a good chance this driver is no longer useful (if it ever was) and should just be deleted, but that would be a separate effort.

I happen to have one (well, two) of these, so this is relevant to my interests.
tl;dr: the crypto drivers are most likely unused, the hardware is great, but the crypto subsystem cannot use it efficiently.
Below drivers/crypto/nx, there are three drivers in a trenchcoat:

- an NX crypto driver that is not endian safe, can therefore only be used on big endian systems, and that implements a bunch of AES modes plus SHA256/SHA512, all of them synchronous.
- an scomp driver with an IBM specific compression algorithm
- a gzip driver that does not integrate with the crypto subsystem and provides its own userspace interface.

The "big endian only" thing is a massive restriction, this is how IBM separates enterprise and hobbyist customers, so if there are users of this module, then they both have enterprise support contracts.
The gzip mode is really useful, with 4 GB of random data I get

$ time ./nx_gzip test.bin
real 0m2.989s
user 0m1.317s
sys 0m1.665s

$ time gzip -9k test.bin
real 2m57.468s
user 2m55.325s
sys 0m1.682s

so 3 GB/s vs 22 MB/s. Even if I had a workload where I could use all the CPU cores in parallel, offloading is still faster, 120W cheaper and leaves the CPU free as a bonus, so I think that's a no-brainer.
The "842" compression is mainly designed to be fast, the marketing material claims > 25 GB/s, which makes sense, this unit sits on a 128 bit wide bus clocked at 2 GHz, and the algorithm is designed around that. On the other hand it is fairly niche.
I couldn't find numbers for the AES and SHA units, I'd expect them to be in the same ballpark, but I cannot measure them easily. CPU is ~500 MB/s for SHA1 and SHA512, ~300 MB/s for SHA256, that should be easy to beat (even a primitive 2-way SHA256 would be at 4 GB/s, and I doubt IBM left it at that).
POWER11 introduces new opcodes, which will shake things up, but these machines are on a fairly long replacement cycle.
The main problem with getting the advertised performance is feeding requests fast enough. Large requests are easy, but the optimum strategy for feeding small requests is just to start submitting, poll old requests for completion in between, and start requesting interrupts only if nothing is complete and it looks like the unit will be busy for a while.
That's not what is currently implemented, and I doubt it could be implemented with the current kernel interfaces, so getting decent performance inside the kernel would require some redesign.
I suppose that also explains the synchronous implementation: we are submitting the request and polling for completion, so overhead is fairly minimal and should break even at a few hundred bytes, but obviously that is not the ideal way to run this thing.
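
To make the "submit, poll in between, only then ask for interrupts" strategy a bit more concrete, here is a rough userspace simulation. Everything in it is made up for illustration: struct sim_unit, RING-style counters, BUSY_THRESHOLD and the helper names are hypothetical, none of this is the NX interface or an existing kernel API, and the "interrupt" is just a counter that makes the simulated unit finish one request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cutoff for "the unit will be busy for a while". */
#define BUSY_THRESHOLD 4

/* Simulated accelerator state; not a real hardware or driver structure. */
struct sim_unit {
	unsigned submitted;  /* requests handed to the unit */
	unsigned finished;   /* requests the unit has completed */
	unsigned reaped;     /* completions the submitter has noticed */
	unsigned irq_waits;  /* times we fell back to waiting for an interrupt */
};

/* Non-blocking poll: notice everything that is already complete. */
static unsigned poll_completions(struct sim_unit *u)
{
	unsigned n = u->finished - u->reaped;

	u->reaped = u->finished;
	return n;
}

/* Stand-in for sleeping until an interrupt: the unit finishes one request. */
static void wait_for_irq(struct sim_unit *u)
{
	assert(u->finished < u->submitted);
	u->finished++;
	u->irq_waits++;
}

/*
 * Submit `count` small requests. Between submissions we poll for old
 * completions, and we only wait for an interrupt when nothing completed
 * and enough requests are still in flight. Returns the total number of
 * interrupt waits recorded in the unit.
 */
static unsigned submit_batch(struct sim_unit *u, unsigned count, bool unit_is_fast)
{
	for (unsigned i = 0; i < count; i++) {
		unsigned reaped = poll_completions(u);
		unsigned in_flight = u->submitted - u->reaped;

		if (reaped == 0 && in_flight >= BUSY_THRESHOLD) {
			wait_for_irq(u);
			poll_completions(u);
		}

		u->submitted++;
		if (unit_is_fast)
			u->finished = u->submitted; /* completes before the next poll */
	}
	return u->irq_waits;
}
```

In this toy model a unit that keeps up never triggers an interrupt wait, while a stalled unit only starts costing interrupts once BUSY_THRESHOLD requests are outstanding. A real implementation would additionally have to bound the polling and deal with completion ordering.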
The endianness issues are trivial to fix (really just needs a sprinkle of cpu_to_beXX/beXX_to_cpu when putting the job control blocks together, like nx-842 does); if you have a definition of what you would consider a "real world" workload for AES I could run that to gather some numbers.
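
For illustration, the kind of change I mean looks roughly like the sketch below. It is userspace code, so demo_cpu_to_be32()/demo_be32_to_cpu() are hand-written stand-ins for the kernel's cpu_to_be32()/be32_to_cpu(), and struct demo_job_block with its two fields is a hypothetical layout, not the real NX control block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel's cpu_to_be32(): returns a value whose
 * in-memory representation is big endian regardless of host endianness.
 */
static uint32_t demo_cpu_to_be32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v >> 24), (uint8_t)(v >> 16),
		(uint8_t)(v >> 8), (uint8_t)v
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Userspace stand-in for be32_to_cpu(): the inverse conversion. */
static uint32_t demo_be32_to_cpu(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Hypothetical job control block; NOT the real NX layout. */
struct demo_job_block {
	uint32_t function_code; /* stored big endian for the hardware */
	uint32_t input_length;  /* stored big endian for the hardware */
};

/* Convert every field at the point where the block is put together. */
static void demo_fill_job(struct demo_job_block *jb, uint32_t fc, uint32_t len)
{
	jb->function_code = demo_cpu_to_be32(fc);
	jb->input_length = demo_cpu_to_be32(len);
}
```

On a big endian host the conversions are no-ops, which is why the missing calls go unnoticed there; on little endian they produce the byte order the hardware expects.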
So far however, no one bothered fixing this, and I'm pretty meh about it myself since I don't have SHA/AES workloads in the kernel, only in userspace.
Other than that, if you decide to remove the driver from the crypto subsystem, then nx-gzip should be kept (and probably moved somewhere else), because it is not a crypto driver, it just shares a bunch of headers with them.
Simon
