A benchmark was supposedly made of the new duffcopy/duffzero that claimed a significant speedup for larger copies: https://github.com/golang/go/commit/5cf281a9b791f0f10efd1574934cbb19ea1b33da
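(For context, a rough way to sanity-check such a claim is a throughput harness like the sketch below. `copyThroughput` and the size list are made up here, not taken from the commit's benchmark; Go's built-in `copy` lowers to `runtime.memmove`, so timing it exercises the routine under discussion.)

```go
package main

import (
	"fmt"
	"time"
)

// copyThroughput times Go's built-in copy for a given block size and
// returns approximate throughput in GB/s. This is a rough sketch, not
// the benchmark referenced in the commit above.
func copyThroughput(size, iters int) float64 {
	src := make([]byte, size)
	dst := make([]byte, size)
	start := time.Now()
	for i := 0; i < iters; i++ {
		copy(dst, src) // copy of []byte lowers to runtime.memmove
	}
	elapsed := time.Since(start).Seconds()
	return float64(size) * float64(iters) / elapsed / 1e9
}

func main() {
	// A few sizes bracketing the few-hundred-byte range where SIMD
	// copies are most often claimed to win.
	for _, size := range []int{64, 384, 511, 4096} {
		fmt.Printf("%5d bytes: %6.2f GB/s\n", size, copyThroughput(size, 100000))
	}
}
```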
I have no clue whether this holds true or not. My intention to re-enable duffcopy and continue to use duffzero is mostly to avoid differences and to ensure that the note handlers stay floating-point free in the future. Whether duffcopy/duffzero in its current form is an actual optimization or just added complexity, I cannot say. A test was made in #cat-v out of annoyance, where the result seemed to be that it was indeed faster to use MOVUPS, but I don't remember the details.

Best regards,
Kenny Levinsen

> On 23 Feb 2016, at 16:27, erik quanstrom <quans...@quanstro.net> wrote:
>
> On Tue Feb 23 02:36:41 PST 2016, kennylevin...@gmail.com wrote:
>> Ah, no - it is not a system-wide adjustment, but an adjustment of the plan9-specific
>> runtime.sighandler implementation and everything called by it directly. Notes
>> that don't exit the process are queued and should run outside the actual note
>> handler.
>>
>> I think the "magic" code will be isolated, and might fend off accidental
>> future additions of floating-point registers. The magic-ness also only
>> revolves around avoiding duffzero and duffcopy in some way. I also think
>> that removing conditionals in the compiler will be a positive thing.
>>
>> I still do not know the feasibility of my plan: whether it is possible to do
>> cleanly, or possible at all. Maybe someone smarter than me with knowledge of
>> the matter could chime in and call me an idiot?
>>
>> Avoiding duffcopy should be easy with a simple memmove implementation. If
>> done right, we can also remove the plan9-specific runtime.memmove and only
>> use the slow memmove in sighandler. (The global runtime.memmove is
>> implemented using MOVUPS just like duffcopy. Duffcopy is used for block
>> copies by the compiler in some cases, although I must admit to not knowing
>> all the cases yet.)
>>
>> Avoiding duffzero without compiler assistance is a bit trickier - global
>> variables, stacks of assembly functions, something like that.
>
> fwiw, on modern amd64 machines, using the xmm and ymm registers has a benefit
> only in a narrow range of sizes (384-511 bytes) and a subset of
> (mis-)alignments that i've forgotten, at least for the exact test setup i
> used on 3-4 different µarches. intel claims rep; movs is the
> (architecturally) fastest way to go.
>
> i am not sure any of this makes much difference, as it's hard to know what a
> real-world memory access pattern looks like, and that seems to dominate all
> but gigantic moves, for which rep; movs is actually no slower than even the
> trickiest use of ymm registers.
>
> - erik
>
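(The "simple memmove" idea from the quoted message could be sketched as below: a plain byte loop with the copy direction chosen to handle overlap, so it needs no XMM/SSE registers and therefore no floating-point state, which is the property a Plan 9 note handler would need. `moveWithin` is a hypothetical helper operating on offsets within one buffer, not the runtime's actual memmove.)

```go
package main

import "fmt"

// moveWithin copies n bytes inside buf from offset src to offset dst,
// one byte at a time, choosing the copy direction so that overlapping
// ranges behave like memmove. No SIMD registers are touched.
func moveWithin(buf []byte, dst, src, n int) {
	if dst == src || n == 0 {
		return
	}
	if dst < src {
		// Forward copy is safe when the destination precedes the source.
		for i := 0; i < n; i++ {
			buf[dst+i] = buf[src+i]
		}
	} else {
		// Backward copy when the ranges overlap the other way.
		for i := n - 1; i >= 0; i-- {
			buf[dst+i] = buf[src+i]
		}
	}
}

func main() {
	buf := []byte("abcdefgh")
	moveWithin(buf, 2, 0, 4) // overlapping move of "abcd" two bytes right
	fmt.Println(string(buf)) // prints "ababcdgh"
}
```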