Arnd Bergmann <a...@arndb.de> writes:
> On Mon, Jul 29, 2019 at 11:52 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
>> On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote:
>> > For the record:
>> >
>> > https://godbolt.org/z/z57VU7
>> >
>> > This seems consistent with what Michael found so I don't think a revert
>> > is entirely unreasonable.
>>
>> Try this:
>>
>> https://godbolt.org/z/6_ZfVi
>>
>> This matters in non-trivial loops, for example.  But all current cases
>> where such non-trivial loops are done with cache block instructions are
>> actually written in real assembler already, using two registers.
>> Because performance matters.  Not that I recommend writing code as
>> critical as memset in C with inline asm :-)
>
> Upon a second look, I think the issue is that the "Z" is an input argument
> when it should be an output. clang decides that it can make a copy of the
> input and pass that into the inline asm. This is not the most efficient
> way, but it seems entirely correct according to the constraints.
>
> Changing it to an output "=Z" constraint seems to make it work:
>
> https://godbolt.org/z/FwEqHf
>
> Clang still doesn't use the optimum form, but it passes the correct pointer.

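For anyone following along, here is a rough sketch of what the "=Z" form
might look like. It is not the actual patch: zero_block(), zero_block_demo()
and BLOCK_SIZE are made-up names, the 128-byte block size is an assumption,
and the memset() branch is only there so the sketch builds on non-powerpc
machines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache block size; the real value depends on the CPU. */
#define BLOCK_SIZE 128

/* Hypothetical helper: zero the cache block that starts at addr. */
static inline void zero_block(void *addr)
{
	/* dcbz clears a whole block, so insist on a block-aligned pointer. */
	assert(((uintptr_t)addr % BLOCK_SIZE) == 0);

#if defined(__powerpc__) || defined(__powerpc64__)
	/*
	 * "=Z" makes *addr an output memory operand, so the compiler has to
	 * pass the address of the caller's object rather than of a copy.
	 */
	__asm__ __volatile__("dcbz %y0" : "=Z"(*(uint8_t *)addr) : : "memory");
#else
	/* Portable stand-in, not equivalent in performance to dcbz. */
	memset(addr, 0, BLOCK_SIZE);
#endif
}

/* Dirty an aligned buffer, zero it, and return its first byte. */
static int zero_block_demo(void)
{
	static uint8_t buf[BLOCK_SIZE] __attribute__((aligned(BLOCK_SIZE)));

	memset(buf, 0xff, sizeof(buf));
	zero_block(buf);
	return buf[0];
}
```

With an input "Z" operand the compiler is only told that the asm reads the
byte, which is why handing it a copy is allowed by the constraints. Declaring
the operand as an output says that the asm writes to that location, so the
original address has to be used.
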
Thanks Arnd. This seems like a better solution. I'll drop the revert I
have staged.

Segher, does this look OK to you?

Nathan/Nick, is one of you able to test this with your clang CI?

cheers