On Mon, May 30, 2016 at 11:10 AM, George Spelvin <li...@sciencehorizons.net> wrote:
>
> I understand, but 64x64-bit multiply on 32-bit is pretty annoyingly
> expensive. In time, code size, and register pressure which bloats
> surrounding code.

Side note, the code seems to work fairly well, but I do worry a bit about the three large multiplies in link_path_walk().

There are two in fold_hash(), and one comes from "find_zero()".

It turns out to work fairly well on at least modern big-core x86 CPUs, because the multiplier is fairly beefy: low latency (3-4 cycles in the current crop) and fully pipelined. Even Atom should be 5 cycles and a multiplication result every two cycles for 64-bit results.

Maybe we don't care, because looking around, the modern ARM and POWER cores do similarly, but I just wanted to point out that that code does seem to fairly heavily rely on "everybody has big and pipelined hw multipliers" for performance.

.. and it's probably true that transistors are cheap, and crypto and other uses have made CPU designers spend the effort on good multipliers. I just remember a time when you definitely couldn't rely on fast multiplies.

Linus
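
For readers who want to see where the three 64-bit multiplies come from, here is a rough sketch in C. It is not the kernel source: the names fold_hash_sketch() and first_zero_byte_sketch() are hypothetical, the constants are assumptions, and the zero-byte finder assumes a little-endian word that really does contain a zero byte.

```c
#include <stdint.h>

/* Assumed odd 64-bit multiplier (golden-ratio style); illustrative only. */
#define SKETCH_MULT 0x61C8864680B583EBull

/* Mix two 64-bit words down to 32 bits: two full 64-bit multiplies. */
static uint32_t fold_hash_sketch(uint64_t x, uint64_t y)
{
	y ^= x * SKETCH_MULT;		/* multiply 1 */
	y *= SKETCH_MULT;		/* multiply 2 */
	return (uint32_t)(y >> 32);	/* keep the well-mixed high half */
}

/*
 * Index (0..7) of the first zero byte in a little-endian word.
 * Assumes the word contains at least one zero byte.
 */
static unsigned first_zero_byte_sketch(uint64_t w)
{
	const uint64_t ones  = 0x0101010101010101ull;
	const uint64_t highs = 0x8080808080808080ull;
	uint64_t bits = (w - ones) & ~w & highs; /* lowest set bit marks first zero byte */

	bits = (bits - 1) & ~bits;	/* all bits below that marker */
	bits >>= 7;			/* 0xff in every byte before the zero */
	/* multiply 3: sums the set bytes into the top byte */
	return (unsigned)((bits * 0x0001020304050608ull) >> 56);
}
```

On a 32-bit target each of those `uint64_t` products has to be built from several 32x32 multiplies plus adds, which is the cost George is pointing at; on a 64-bit core with a pipelined multiplier each one is a single instruction.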