On Wed, 25 Feb 2026 08:08:34 +0000
Fuad Tabba <[email protected]> wrote:

...
> I also noticed that the read path already expects and handles
> unaligned addresses. If you look at load_unaligned_zeropad() (called
> above the write), it explicitly loads an unaligned word and handles
> potential page-crossing faults. The write path lacked the equivalent
> put_unaligned() wrapper, leaving it exposed to UB.

Not really, the read side is doing reads that might go past the '\0'
that terminates the string.
If they are misaligned they can fault even though the string is
correctly terminated.
OTOH the write side must not write beyond the terminating '\0'
so the memory must always be there.

I didn't look at exactly how the 'word at a time' version terminates.
For strlen() doing it at all is pretty marginal for 32bit.
For strcpy() it may depend on whether the byte writes get merged
in the cpu's store buffer.
On 64bit you have to try quite hard to stop the compiler making
'a right pigs breakfast' of generating the constants; it is pretty
bad on x86-64, mips64 is a lot worse.
On LE with fast 'bit scan' you can quickly determine the number of
partial bytes - so they can be written without re-reading from
memory.
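
As a rough sketch of the 'bit scan' point (this is not the kernel's
word-at-a-time code; the helper names are made up for illustration and
__builtin_ctzll() assumes gcc or clang), on a 64bit LE system the index
of the first zero byte falls straight out of the zero-byte mask:

```c
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/*
 * Non-zero if any byte of v is zero. The lowest set bit is the
 * top bit of the first (lowest addressed on LE) zero byte; bits
 * above that may be false positives and must be ignored.
 */
static inline uint64_t zero_byte_mask(uint64_t v)
{
        return (v - ONES) & ~v & HIGHS;
}

/*
 * Little-endian only: index (0..7) of the first zero byte.
 * mask must be non-zero, ctz of zero is undefined.
 */
static inline unsigned int first_zero_byte(uint64_t mask)
{
        return (unsigned int)__builtin_ctzll(mask) / 8;
}
```

The result is the number of non-zero bytes before the terminator, all
of which are already in the register holding the loaded word.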

The actual problem with that version of strlen() is that it is
only faster for strings above (about, IIRC) 64 bytes long.
So in the kernel it is pretty much a complete waste of time.
The same is probably true for strscpy().

My guess is that the fastest code uses the 'unrolled once' loop:
        do {
                if ((dst[len] = src[len]) == 0)
                        break;
                if ((dst[len + 1] = src[len + 1]) == 0) {
                        len++;
                        break;
                }
        } while ((len += 2) < lim);
(Provided it gets compiled reasonably).
Without the writes that was pretty much the best strlen() on the few cpus
I tried it on.
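
For anyone wanting to try it, the loop above can be wrapped up as
below. This is only a sketch: the function name and the return
convention are assumptions for illustration, not anything in the
kernel. It assumes lim is even and non-zero, since the loop body always
runs once and touches both len and len + 1 before the limit is checked.

```c
#include <stddef.h>

/*
 * Hypothetical wrapper around the 'unrolled once' loop.
 * Copies bytes from src to dst including the terminating '\0'.
 * Returns the index of the '\0' if one was copied, otherwise lim
 * (in which case dst is NOT terminated).
 * Assumes lim is even and non-zero.
 */
static size_t copy_unrolled(char *dst, const char *src, size_t lim)
{
        size_t len = 0;

        do {
                if ((dst[len] = src[len]) == 0)
                        break;
                if ((dst[len + 1] = src[len + 1]) == 0) {
                        len++;
                        break;
                }
        } while ((len += 2) < lim);

        return len;
}
```

A caller that needs strscpy()-like behaviour would still have to
handle the 'no terminator within lim' case itself.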

        David
