Re: Help with an ABI peculiarity
Hi Richard,

> On 20 Jan 2022, at 22:32, Richard Sandiford wrote:
>
> Iain Sandoe writes:
>>> On 10 Jan 2022, at 10:46, Richard Sandiford wrote:
>>> An alternative might be to make promote_function_arg a “proper”
>>> ABI hook, taking a cumulative_args_t and a function_arg_info.
>>> Perhaps the return case should become a separate hook at the
>>> same time.
>>>
>>> That would probably require more extensive changes than just
>>> updating the call sites, and I haven't really checked how much
>>> work it would be, but hopefully it wouldn't be too bad.
>>>
>>> The new hook would still be called before function_arg, but that
>>> should no longer be a problem, since the new hook arguments would
>>> give the target the information it needs to decide whether the
>>> argument is passed in registers.
>>
>> Yeah, this was my next port of call (I have looked at it ~10 times and then
>> decided “not today, maybe there’s a simpler way”).

… and I did not have a chance to look at this in the meantime …

> BTW, finally catching up on old email, I see this is essentially also
> the approach that Maxim was taking with the TARGET_FUNCTION_ARG_BOUNDARY
> patches. What's the situation with those?

I have the patches plus amendments to make use of their new functionality on
the development branch, which is actually in pretty good shape (not much
difference in testsuite results from other Darwin sub-ports).

Maxim and I need to discuss amending the TARGET_FUNCTION_ARG_BOUNDARY
changes to account for Richard (B)’s comments.

Likewise, I need to tweak the support for heap allocation of nested function
trampolines to account for review comments.

As always, it’s a question of fitting everything in…

thanks
Iain
Re: Help with an ABI peculiarity
Iain Sandoe writes:
> Hi Richard,
>
>> On 20 Jan 2022, at 22:32, Richard Sandiford wrote:
>>
>> Iain Sandoe writes:
>>>> On 10 Jan 2022, at 10:46, Richard Sandiford wrote:
>>>> An alternative might be to make promote_function_arg a “proper”
>>>> ABI hook, taking a cumulative_args_t and a function_arg_info.
>>>> Perhaps the return case should become a separate hook at the
>>>> same time.
>>>>
>>>> That would probably require more extensive changes than just
>>>> updating the call sites, and I haven't really checked how much
>>>> work it would be, but hopefully it wouldn't be too bad.
>>>>
>>>> The new hook would still be called before function_arg, but that
>>>> should no longer be a problem, since the new hook arguments would
>>>> give the target the information it needs to decide whether the
>>>> argument is passed in registers.
>>>
>>> Yeah, this was my next port of call (I have looked at it ~10 times and then
>>> decided “not today, maybe there’s a simpler way”).
>
> … and I did not have a chance to look at this in the meantime …
>
>> BTW, finally catching up on old email, I see this is essentially also
>> the approach that Maxim was taking with the TARGET_FUNCTION_ARG_BOUNDARY
>> patches. What's the situation with those?
>
> I have the patches plus amendments to make use of their new functionality on
> the development branch, which is actually in pretty good shape (not much
> difference in testsuite results from other Darwin sub-ports).
>
> Maxim and I need to discuss amending the TARGET_FUNCTION_ARG_BOUNDARY
> changes to account for Richard (B)’s comments.
>
> Likewise, I need to tweak the support for heap allocation of nested function
> trampolines to account for review comments.

Sounds great.

> As always, it’s a question of fitting everything in…

Yeah :-) The question probably sounded pushier than it was meant to,
sorry. I just wanted to check that you or Maxim weren't still waiting
on reviews.

Richard
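
For readers following the thread, here is a minimal, self-contained sketch of the idea discussed above: a promotion hook that receives the running argument state and the argument description, so it can tell whether the argument will travel in a register. Every name and type below is a hypothetical stand-in invented for illustration; none of it is GCC's actual cumulative_args_t, function_arg_info, or hook signature. The promotion rule itself (widen sub-word arguments only when they are passed in registers) is likewise an assumption chosen to show why the extra information matters.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, NOT GCC's real types. */
enum sketch_mode {
    SKETCH_MODE_QI = 8,   /* 8-bit integer  */
    SKETCH_MODE_HI = 16,  /* 16-bit integer */
    SKETCH_MODE_SI = 32,  /* 32-bit integer */
    SKETCH_MODE_DI = 64   /* 64-bit integer */
};

/* Running state of argument layout (models a cumulative-args object). */
struct sketch_cum_args {
    int next_reg;  /* index of the next free argument register */
    int num_regs;  /* number of argument registers in this ABI */
};

/* Description of one argument (models an argument-info object). */
struct sketch_arg_info {
    enum sketch_mode mode;
    bool named;
};

/* True if the next argument would still fit in a register. */
static bool sketch_arg_in_reg(const struct sketch_cum_args *cum)
{
    return cum->next_reg < cum->num_regs;
}

/* Hypothetical hook: because it sees the cumulative state, it can
   promote only those small arguments that are passed in registers
   and leave stack-passed ones at their natural width. */
enum sketch_mode
sketch_promote_function_arg(const struct sketch_cum_args *cum,
                            const struct sketch_arg_info *arg)
{
    if (arg->mode < SKETCH_MODE_SI && sketch_arg_in_reg(cum))
        return SKETCH_MODE_SI;
    return arg->mode;
}
```

A hook without the cumulative state would have to give the same answer for both the register and the stack case, which is the limitation the proposal in the thread aims to remove.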
Re: reordering of trapping operations and volatile
On Tuesday, 18 Jan 2022 at 09:31 +0100, Richard Biener wrote:
> On Mon, Jan 17, 2022 at 3:11 PM Michael Matz via Gcc wrote:
> > Hello,
> >
> > On Sat, 15 Jan 2022, Martin Uecker wrote:
> >
> > > > Because it interferes with existing optimisations. An explicit
> > > > checkpoint has a clear meaning. Using every volatile access that way
> > > > will hurt performance of code that doesn't require that behaviour for
> > > > correctness.
> > >
> > > This is why I would like to understand better what real use cases of
> > > performance sensitive code actually make use of volatile and are
> > > negatively affected. Then one could discuss the tradeoffs.
> >
> > But you seem to ignore whatever we say in this thread. There are now
> > multiple examples that demonstrate problems with your proposal as imagined
> > (for lack of a _concrete_ proposal with wording from you), problems that
> > don't involve volatile at all. They all stem from the fact that you order
> > UB with respect to all side effects (because you haven't said how you want
> > to avoid such total ordering with all side effects).

Again, this is simply not what I am proposing. I don't want to order UB
with all side effects.

You are right, there is not yet a specific proposal. But at the moment
I simply wanted to understand the impact of reordering traps and volatile.

> > As I said upthread: you need to define a concept of time at whose
> > granularity you want to limit the effects of UB, and the borders of each
> > time step can't simply be (all) the existing side effects. Then you need
> > to have wording of what it means for UB to occur within such time step, in
> > particular if multiple UB happens within one (for best results it should
> > simply be UB, not individual instances of different UBs).
> >
> > If you look at the C++ proposal (thanks Jonathan) I think you will find
> > that if you replace 'std::observable' with 'sequence point containing a
> > volatile access' that you basically end up with what you wanted. The
> > crucial point being that the time steps (epochs in that proposal) aren't
> > defined by all side effects but by a specific and explicit thing only (new
> > function in the proposal, volatile accesses in an alternative).
> >
> > FWIW: I think for a new language feature reusing volatile accesses as the
> > clock ticks is the worse choice: if you intend that feature to be used
> > for writing safer programs (a reasonable thing) I think being explicit and
> > at the same time null-overhead is better (i.e. a new internal
> > function/keyword/builtin, specified to have no effects except moving the
> > clock forward). volatile accesses obviously already exist and hence are
> > easier to integrate into the standard, but in a given new/safe program,
> > whenever you see a volatile access you would always need to ask 'is this
> > for clock ticks, or is it a "real" volatile access for memmap IO'.
>
> I guess Martin wants to have accesses to volatiles handled the same as
> function calls where we do not know whether the function call will return
> or terminate the program normally. As if the volatile access could have
> a similar effect (it might actually reboot the machine or so - but of course
> that and anything else I can imagine would be far from "normal termination
> of the program"). That's technically possible to implement with a yet unknown
> amount of work.

Yes, thanks! Semantically, this is equivalent to what I want.

> Btw, I'm not sure we all agree that (*) in the following program doesn't make
> it invoke UB and thus the compiler is not free to re-order the
> offending statement to before the exit (0) call. Thus UB is only "realized"
> if a stmt containing it is executed in the abstract machine.
>
> int main()
> {
>    exit(0);
>    1 / 0; /// (*)
> }

Yes, this is not clear, although there seems to be some understanding
that there is a difference between compile-time UB and run-time UB, and
I think the standard should make it clear which is which.

Martin
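
To make the reordering question concrete, here is a small sketch of the pattern the thread is about: a volatile access followed by an operation that may trap. The names `checkpoint` and `divide_after_checkpoint` are invented for this illustration and do not come from the thread. Under the semantics Martin describes, the volatile store would be treated like a call that might not return, so the division could not be moved ahead of it; whether current compilers actually move it is not something this sketch demonstrates.

```c
/* Hypothetical illustration; names are not from the thread. */
volatile int checkpoint;

int divide_after_checkpoint(int a, int b)
{
    /* Volatile store: an observable side effect in the abstract machine. */
    checkpoint = 1;

    /* Undefined behaviour when b == 0; on many targets this traps.
       The open question is whether the trap may be observed before
       the volatile store above has taken effect. */
    return a / b;
}
```

With a non-zero divisor the function is well defined and the ordering question never arises; it only matters for the `b == 0` case, which is why the discussion centres on how far "back in time" the effects of UB may reach.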
Re: [libc-coord] Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc
On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> The goal is that the new interfaces will be usable as an optimization
> by compilers if a program uses the return value of the non "eq"
> variant as a boolean.

So I'm curious, but can you demonstrate that it can be implemented
noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
obvious way to save any operations.

Joerg
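
As an illustration of the use case quoted above, the sketch below shows a caller that only tests the strcmp result for equality, next to the same test written against an "eq" style function. `strcmpeq_sketch` is a hypothetical stand-in defined here in terms of strcmp; it assumes the proposed `__strcmpeq` would follow the same contract as glibc's `__memcmpeq` (zero if equal, some non-zero value otherwise), which the messages in this digest do not spell out.

```c
#include <string.h>

/* Hypothetical stand-in for the proposed interface. Assumed contract:
   returns 0 if the strings are equal and a non-zero value otherwise,
   with no ordering information in the result. */
static int strcmpeq_sketch(const char *a, const char *b)
{
    return strcmp(a, b) != 0;
}

/* The caller only asks "equal or not", so the sign of the strcmp
   result is never observed. */
int is_help_flag(const char *arg)
{
    return strcmp(arg, "--help") == 0;
}

/* What a compiler could emit instead under the assumption above. */
int is_help_flag_eq(const char *arg)
{
    return strcmpeq_sketch(arg, "--help") == 0;
}
```

Both functions return the same value for every input; the difference is only in how much information the callee is required to compute.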
Re: [libc-coord] Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc
On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations as with `__memcmpeq`, but there still needs to be some logic
that tracks which comes first, the mismatch or the null terminator.

It's not quite as much as `memcmp` vs `__memcmpeq`, but we can still save.
Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return method of checking equals + strlen from:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm1
vpmovmskb %ymm1, %ecx
incl %ecx
jz L(keep_going)
tzcntl %ecx, %ecx
movzbl (%rdi, %rcx), %eax
movzbl (%rsi, %rcx), %ecx
subl %ecx, %eax
vzeroupper
ret
```

To

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm2
vpmovmskb %ymm2, %ecx
incl %ecx
jz L(keep_going)
vpmovmskb %ymm1, %eax
blsi %ecx, %ecx
andn %eax, %ecx, %eax
vzeroupper
ret
```

Testing this with comparisons where the mismatch or the end of the string
falls within the first 32 bytes (the common case), throughput is about the
same but latency drops by ~20%.

Another benefit is that we can reuse this exact return logic throughout, as
the memory offset is no longer required. This simplifies the page cross
logic a great deal and will net us some serious code size reduction for
the common usage of strcmp.

I think, though, that I was a bit over-optimistic about the performance
benefits, as I was using `memcmp` vs `__memcmpeq` as a reference.

I'll put together a patch for just `__strcmpeq` and post the results here.
I think the wide-character versions have more expensive return value checks,
so if the character versions show a benefit we can expect it to translate.

>
> Joerg
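
The following scalar C sketch models the intent of the "eq" return path described in this message: build a bitmask of positions that are equal and not the terminator, find the first position where that stops being true, and report only whether that position was a mismatch. It is a hedged illustration written for this digest, not a transliteration of the assembly and not glibc code; the function name, the fixed 32-byte block, and the `done` out-parameter are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 32-byte step. Both inputs must be
   readable for 32 bytes. Sets *done to 0 if all 32 positions are equal
   and non-null (the caller would continue with the next block), else
   sets *done to 1 and returns 0 for "equal" or 1 for "different". */
int block_cmpeq_sketch(const unsigned char a[32],
                       const unsigned char b[32],
                       int *done)
{
    uint32_t eq = 0;       /* bit i set: a[i] == b[i]            */
    uint32_t nonnull = 0;  /* bit i set: a[i] is not the NUL byte */

    for (int i = 0; i < 32; i++) {
        if (a[i] == b[i])
            eq |= (uint32_t)1 << i;
        if (a[i] != 0)
            nonnull |= (uint32_t)1 << i;
    }

    /* Positions where the comparison may simply continue. */
    uint32_t go = eq & nonnull;

    /* Adding 1 carries through the low run of set bits; the result is 0
       only if all 32 bits were set (unsigned wrap-around). */
    uint32_t stop = go + 1;
    if (stop == 0) {
        *done = 0;
        return 0;
    }

    /* Isolate the lowest set bit: the first mismatch or terminator. */
    uint32_t first = stop & (0u - stop);

    *done = 1;
    /* If the bytes at that position are equal, it was the terminator in
       both strings, so the strings are equal; otherwise they differ. */
    return (first & ~eq) != 0;
}
```

Note that the result carries no ordering information and no byte offset, which matches the observation above that the memory offset is no longer needed on the return path.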
gcc-10-20220121 is now available
Snapshot gcc-10-20220121 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/10-20220121/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-10 revision c26accdc937e5c4afde2eda5f2aae7820958eb00

You'll find:

 gcc-10-20220121.tar.xz               Complete GCC

  SHA256=3458deb45e0d0c4373514bc94772c72539601684e250b6d9a09c0b11d22824dd
  SHA1=ec7a390502872f7ffaa031e7039923e180706a3e

Diffs from 10-20220114 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.