On Sat, Jan 20, 2024 at 7:13 AM Jeff Davis <pg...@j-davis.com> wrote: > > On Fri, 2024-01-19 at 13:38 -0800, Jeff Davis wrote: > > One post-commit question on 0aba255440: why do > > haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How > > does byteswapping affect whether a zero byte exists or not? > > I missed that it was used later when finding the rightmost one > position. > > The placement of the comment was slightly confusing. Is: > > haszero64(pg_bswap64(chunk)) == pg_bswap64(haszero64(chunk)) > > ? If so, perhaps we can do the byte swapping outside of the loop, which > might save a few cycles on longer strings and would be more readable.
The above identity is not true for this haszero64 macro. I phrased it as "The rest of the bytes are indeterminate", but that's not very clear. It can only be true if it set bytes for all and only those bytes where the input had zeros. In the NetBSD strlen source, there is a comment telling of a way to do this: ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f)) https://github.com/NetBSD/src/blob/trunk/common/lib/libc/arch/x86_64/string/strlen.S (They don't actually use it of course, since x86_64 is little-endian) >From the commentary there, it sounds like 1 or 2 more instructions. One unmentioned assumption I had was that the byte swap would be a single instruction on all platforms where we care about performance (*). If that's not the case, we could switch to the above macro for big-endian machines. It'd be less readable since we'd then need an additional #ifdef for counting leading, rather than trailing zeros (that would avoid byte-swapping entirely). Either way, I'm afraid big-endian is stuck doing a bit of extra work somewhere. That work could be amortized by doing a quick check in the loop and afterwards completely redoing the zero check (or a bytewise check same as the unaligned path), but that would penalize short strings. (*) 32-bit platforms don't take this path, but mamba's build failed because the previously-misspelled symbol was still in the source file. We could also #ifdef around the whole aligned-path function, although it's redundant. I hope this makes it more clear. Maybe the comment could use some work.