Re: Help with an ABI peculiarity

2022-01-21 Thread Iain Sandoe
Hi Richard,

> On 20 Jan 2022, at 22:32, Richard Sandiford  wrote:
> 
> Iain Sandoe  writes:
>>> On 10 Jan 2022, at 10:46, Richard Sandiford wrote:
>>> An alternative might be to make promote_function_arg a “proper”
>>> ABI hook, taking a cumulative_args_t and a function_arg_info.
>>> Perhaps the return case should become a separate hook at the
>>> same time.
>>> 
>>> That would probably require more extensive changes than just
>>> updating the call sites, and I haven't really checked how much
>>> work it would be, but hopefully it wouldn't be too bad.
>>> 
>>> The new hook would still be called before function_arg, but that
>>> should no longer be a problem, since the new hook arguments would
>>> give the target the information it needs to decide whether the
>>> argument is passed in registers.
>> 
>> Yeah, this was my next port of call (I have looked at it ~10 times and then
>> decided “not today, maybe there’s a simpler way”).

… and I did not have a chance to look at this in the meantime …

> BTW, finally catching up on old email, I see this is essentially also
> the approach that Maxim was taking with the TARGET_FUNCTION_ARG_BOUNDARY
> patches.  What's the situation with those? 

I have the patches plus amendments to make use of their new functionality on the
development branch, which is actually in pretty good shape (not much difference
in testsuite results from other Darwin sub-ports).

Maxim and I need to discuss amending the TARGET_FUNCTION_ARG_BOUNDARY
changes to account for Richard (B)’s comments.

Likewise, I need to tweak the support for heap allocation of nested function
trampolines to account for review comments.

As always, it’s a question of fitting everything in…
thanks
Iain



Re: Help with an ABI peculiarity

2022-01-21 Thread Richard Sandiford via Gcc
Iain Sandoe  writes:
> Hi Richard,
>> On 20 Jan 2022, at 22:32, Richard Sandiford wrote:
>> Iain Sandoe writes:
>>>> On 10 Jan 2022, at 10:46, Richard Sandiford wrote:
>>>> An alternative might be to make promote_function_arg a “proper”
>>>> ABI hook, taking a cumulative_args_t and a function_arg_info.
>>>> Perhaps the return case should become a separate hook at the
>>>> same time.
>>>>
>>>> That would probably require more extensive changes than just
>>>> updating the call sites, and I haven't really checked how much
>>>> work it would be, but hopefully it wouldn't be too bad.
>>>>
>>>> The new hook would still be called before function_arg, but that
>>>> should no longer be a problem, since the new hook arguments would
>>>> give the target the information it needs to decide whether the
>>>> argument is passed in registers.
>>> 
>>> Yeah, this was my next port of call (I have looked at it ~10 times and then
>>> decided “not today, maybe there’s a simpler way”).
>
> … and I did not have a chance to look at this in the meantime …
>
>> BTW, finally catching up on old email, I see this is essentially also
>> the approach that Maxim was taking with the TARGET_FUNCTION_ARG_BOUNDARY
>> patches.  What's the situation with those? 
>
> I have the patches plus amendments to make use of their new functionality on
> the development branch, which is actually in pretty good shape (not much
> difference in testsuite results from other Darwin sub-ports).
>
> Maxim and I need to discuss amending the TARGET_FUNCTION_ARG_BOUNDARY
> changes to account for Richard (B)’s comments.
>
> Likewise, I need to tweak the support for heap allocation of nested function
> trampolines to account for review comments.

Sounds great.

> As always, it’s a question of fitting everything in…

Yeah :-)  The question probably sounded pushier than it was meant to,
sorry.  I just wanted to check that you or Maxim weren't still waiting
on reviews.

Richard


Re: reordering of trapping operations and volatile

2022-01-21 Thread Martin Uecker via Gcc
Am Dienstag, den 18.01.2022, 09:31 +0100 schrieb Richard Biener:
> On Mon, Jan 17, 2022 at 3:11 PM Michael Matz via Gcc  wrote:
> > Hello,
> > 
> > On Sat, 15 Jan 2022, Martin Uecker wrote:
> > 
> > > > Because it interferes with existing optimisations. An explicit
> > > > checkpoint has a clear meaning. Using every volatile access that way
> > > > will hurt performance of code that doesn't require that behaviour for
> > > > correctness.
> > > 
> > > This is why I would like to understand better what real use cases of
> > > performance sensitive code actually make use of volatile and are
> > > negatively affected. Then one could discuss the tradeoffs.
> > 
> > But you seem to ignore whatever we say in this thread.  There are now
> > multiple examples that demonstrate problems with your proposal as imagined
> > (for lack of a _concrete_ proposal with wording from you), problems that
> > don't involve volatile at all.  They all stem from the fact that you order
> > UB with respect to all side effects (because you haven't said how you want
> > to avoid such total ordering with all side effects).

Again, this is simply not what I am proposing. I don't
want to order UB with all side effects.

You are right, there is not yet a specific proposal. But
at the moment I simply wanted to understand the impact of
reordering traps and volatile.

> > As I said upthread: you need to define a concept of time at whose
> > granularity you want to limit the effects of UB, and the borders of each
> > time step can't simply be (all) the existing side effects.  Then you need
> > to have wording of what it means for UB to occur within such time step, in
> > particular if multiple UB happens within one (for best results it should
> > simply be UB, not individual instances of different UBs).
> > 
> > If you look at the C++ proposal (thanks Jonathan) I think you will find
> > that if you replace 'std::observable' with 'sequence point containing a
> > volatile access' that you basically end up with what you wanted.  The
> > crucial point being that the time steps (epochs in that proposal) aren't
> > defined by all side effects but by a specific and explicit thing only (new
> > function in the proposal, volatile accesses in an alternative).
> > 
> > FWIW: I think for a new language feature reusing volatile accesses as the
> > clock ticks is the worse choice: if you intend that feature to be used
> > for writing safer programs (a reasonable thing) I think being explicit and
> > at the same time null-overhead is better (i.e. a new internal
> > function/keyword/builtin, specified to have no effects except moving the
> > clock forward).  volatile accesses obviously already exist and hence are
> > easier to integrate into the standard, but in a given new/safe program,
> > whenever you see a volatile access you would always need to ask 'is this
> > for clock ticks, or is it a "real" volatile access for memmap IO'.
> 
> I guess Martin wants to have accesses to volatiles handled the same as
> function calls where we do not know whether the function call will return
> or terminate the program normally.  As if the volatile access could have
> a similar effect (it might actually reboot the machine or so - but of course
> that and anything else I can imagine would be far from "normal termination
> of the program").  That's technically possible to implement with a yet unknown
> amount of work.

Yes. thanks! Semantically this is equivalent to what I want.
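
A minimal sketch of that equivalence, not taken from any patch. The names
`device_reg` and `notify_may_not_return` are hypothetical. The first function
shows the existing rule for a call that may not return; the second shows the
volatile case that the thread asks to be treated the same way.

```c
#include <stdlib.h>

/* Hypothetical memory-mapped register; the name is an assumption. */
volatile int device_reg;

/* Hypothetical stand-in for a function defined in another translation
   unit: the caller cannot assume that it returns. */
void notify_may_not_return(int divisor)
{
    if (divisor == 0)
        exit(0);
}

/* The division may trap, so it must not be moved above the call. */
int div_after_call(int a, int b)
{
    notify_may_not_return(b);
    return a / b;
}

/* Under the semantics discussed above, the volatile store would act
   like the call: the possibly trapping division stays after it. */
int div_after_volatile(int a, int b)
{
    device_reg = 1;
    return a / b;
}
```

With b == 0, the first function exits before the division is reached. The
open question in this thread is whether the store in the second function
must likewise be observable before any trap.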

> Btw, I'm not sure we all agree that (*) in the following program doesn't make
> it invoke UB and thus the compiler is not free to re-order the offending
> statement to before the exit (0) call.  Thus UB is only "realized" if a stmt
> containing it is executed in the abstract machine.
> 
> int main()
> {
>    exit(0);
>    1 / 0;  /// (*)
> }

Yes, this is not clear, although there seems to be some
understanding that there is a difference between
compile-time UB and run-time UB and I think the
standard should make it clear what is what.

Martin







Re: [libc-coord] Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc

2022-01-21 Thread Joerg Sonnenberger
On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> The goal is that the new interfaces will be usable as an optimization
> by compilers if a program uses the return value of the non "eq"
> variant as a boolean.

So I'm curious, but can you demonstrate that it can be implemented
noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
obvious way to save any operations.

Joerg


Re: [libc-coord] Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc

2022-01-21 Thread Noah Goldstein via Gcc
On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger  wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations as with `__memcmpeq`, but there still needs to be some
logic that tracks which comes first: the mismatch or the null terminator.

The saving is not quite as large as for `memcmp` vs `__memcmpeq`, but
there is still something to save.
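
For context, a sketch of the kind of caller this targets and of the relaxed
contract. `same_key` and `strcmpeq_model` are hypothetical names used only
for illustration; the model is a plain portable loop, not glibc code.

```c
#include <string.h>

/* Typical caller: only "equal or not" is used, never the ordering. */
int same_key(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Hypothetical portable model of the relaxed contract: returns zero
   if the strings are equal and some nonzero value otherwise.  The
   sign of a nonzero result carries no meaning. */
int strcmpeq_model(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a != *b;
}
```

A compiler that sees the result used only as a boolean, as in `same_key`,
could call the "eq" variant instead, since the ordering is never observed.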

Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return method of checking equals + strlen from:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm1
vpmovmskb %ymm1, %ecx
incl %ecx
jz L(keep_going)
tzcntl %ecx, %ecx
movzbl (%rdi, %rcx), %eax
movzbl (%rsi, %rcx), %ecx
subl %ecx, %eax
vzeroupper
ret
```

to:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm2
vpmovmskb %ymm2, %ecx
incl %ecx
jz L(keep_going)
vpmovmskb %ymm1, %eax
blsi %ecx, %ecx
andn %eax, %ecx, %eax
vzeroupper
ret
```

Testing this with comparisons where the mismatch or the end of the string
falls in the first 32 bytes (the common case), throughput is about the same
but latency drops by ~20%.
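
A scalar C model of the mask logic in the second sequence, as I read its
intent. It is a sketch, not a transcription of the assembly, and it makes no
claim about instruction operand order. `cmpeq_block` and `BLOCK` are
hypothetical names. Bit i of `eq` is set when the bytes at offset i match;
bit i of `nz` is set when the first string's byte is not the terminator.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32  /* bytes per step, matching one 256-bit vector */

/* Returns 1 if all 32 bytes match with no terminator (keep scanning).
   Otherwise returns 0 and stores 1 in *differs if the strings differ,
   or 0 if they are equal up to the terminator. */
int cmpeq_block(const unsigned char *a, const unsigned char *b,
                int *differs)
{
    uint32_t eq = 0, nz = 0;
    for (size_t i = 0; i < BLOCK; i++) {
        eq |= (uint32_t)(a[i] == b[i]) << i;
        nz |= (uint32_t)(a[i] != 0) << i;
    }

    /* Bytes that match and are not the terminator. */
    uint32_t keep = eq & nz;

    /* Adding 1 carries through the low run of ones; the result is 0
       only when every byte in the block keeps going. */
    uint32_t t = keep + 1u;
    if (t == 0)
        return 1;

    /* Isolate the lowest set bit: the first mismatch or terminator. */
    uint32_t stop = t & (0u - t);

    /* Nonzero only if the bytes at the stopping position differ. */
    *differs = (stop & ~eq) != 0;
    return 0;
}
```

No byte is loaded by index to form the result, which is why the memory
offset is no longer needed on the return path.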

Another benefit is we can reuse this exact return logic throughout as memory
offset is no longer required. This simplifies the page cross logic a great
deal and will net us some serious code size reduction for the common usage
of strcmp.

I think, though, that I was a bit overoptimistic about the performance
benefits, as I was using `memcmp` vs `__memcmpeq` as a reference. I'll put
together a patch for just `__strcmpeq` and post the results here. I think the
wide-character versions have more expensive return value checks, so if the
character versions show a benefit we can expect it to translate.



>
> Joerg


gcc-10-20220121 is now available

2022-01-21 Thread GCC Administrator via Gcc
Snapshot gcc-10-20220121 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/10-20220121/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-10 revision c26accdc937e5c4afde2eda5f2aae7820958eb00

You'll find:

 gcc-10-20220121.tar.xz   Complete GCC

  SHA256=3458deb45e0d0c4373514bc94772c72539601684e250b6d9a09c0b11d22824dd
  SHA1=ec7a390502872f7ffaa031e7039923e180706a3e

Diffs from 10-20220114 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.