Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 14:59, Rich Felker wrote:
>> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 16/04/2020 12:37, Rich Felker wrote:
>> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>> >>>>> My preference would be that it work just like the i386 AT_SYSINFO
>> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >>>>> provides a stub in the vdso that performs either scv or the old
>> >>>>> mechanism with the same calling convention. Then if the kernel doesn't
>> >>>>> provide it (because the kernel is too old) libc would have to provide
>> >>>>> its own stub that uses the legacy method and matches the calling
>> >>>>> convention of the one the kernel is expected to provide.
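>> >>>>>
>> >>>>> Roughly like this (just a sketch; all names are invented, and the
>> >>>>> real thing would keep the pointer at a fixed TCB offset so the
>> >>>>> syscall asm can call through it directly):
>> >>>>>
>> >>>>>     /* Either the kernel's vdso stub or our legacy fallback,
>> >>>>>        sharing one calling convention.  */
>> >>>>>     typedef long (*syscall_fn) (long nr, long a1, long a2,
>> >>>>>                                 long a3, long a4, long a5,
>> >>>>>                                 long a6);
>> >>>>>
>> >>>>>     /* libc stub using the legacy trap, matching the calling
>> >>>>>        convention of the kernel-provided stub.  */
>> >>>>>     extern long __syscall_legacy (long, long, long, long,
>> >>>>>                                   long, long, long);
>> >>>>>
>> >>>>>     static syscall_fn __syscall_entry = __syscall_legacy;
>> >>>>>
>> >>>>>     void __syscall_entry_init (void *vdso_stub)
>> >>>>>     {
>> >>>>>       if (vdso_stub)  /* kernel new enough to provide it */
>> >>>>>         __syscall_entry = (syscall_fn) vdso_stub;
>> >>>>>     }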
>> >>>>
>> >>>> What about pthread cancellation and the requirement of checking the
>> >>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> >>>> still to use the musl strategy on glibc (BZ#12683), and for i686 it
>> >>>> requires always using the old int $128 for programs that use
>> >>>> cancellation (static case) or just threads (dynamic mode, which
>> >>>> should be more common on glibc).
>> >>>>
>> >>>> Using the i686 strategy of a vDSO bridge symbol would require always
>> >>>> falling back to 'sc' to keep using the same cancellation strategy
>> >>>> (thus defeating this optimization in such cases).
>> >>>
>> >>> Yes, I assumed it would be the same, ignoring the new syscall
>> >>> mechanism for cancellable syscalls. While there are some exceptions,
>> >>> cancellable syscalls are generally not hot paths but things that are
>> >>> expected to block and to have significant amounts of work to do in
>> >>> kernelspace, so saving a few tens of cycles is rather pointless.
>> >>>
>> >>> It's possible to do a branch/multiple versions of the syscall asm for
>> >>> cancellation, but it would require extending the cancellation handler
>> >>> to support checking against multiple independent address ranges or
>> >>> using some alternate markup of them.
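>> >>>
>> >>> For illustration, the multi-range check could look something like
>> >>> this (symbol names invented; the ranges would be emitted by the asm
>> >>> markup around each syscall sequence):
>> >>>
>> >>>     #include <stddef.h>
>> >>>     #include <stdint.h>
>> >>>
>> >>>     struct cancel_range { uintptr_t start, end; };
>> >>>
>> >>>     /* e.g. one range for the sc sequences, one for the scv ones */
>> >>>     extern const struct cancel_range __cancel_ranges[];
>> >>>     extern const size_t __cancel_nranges;
>> >>>
>> >>>     /* Called from the cancellation signal handler with the
>> >>>        interrupted program counter.  */
>> >>>     static int pc_in_cancellable_syscall (uintptr_t pc)
>> >>>     {
>> >>>       for (size_t i = 0; i < __cancel_nranges; i++)
>> >>>         if (pc >= __cancel_ranges[i].start
>> >>>             && pc < __cancel_ranges[i].end)
>> >>>           return 1;
>> >>>       return 0;
>> >>>     }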
>> >>
>> >> The main issue is that, at least for glibc, dynamic linking is way
>> >> more common than static linking, and once the program becomes
>> >> multithreaded the fallback will always be used.
>> >
>> > I'm not relying on static linking optimizing out the cancellable
>> > version. I'm talking about how cancellable syscalls are pretty much
>> > all "heavy" operations to begin with where a few tens of cycles are in
>> > the realm of "measurement noise" relative to the dominating time
>> > costs.
>>
>> Yes I am aware, but at the same time I am not sure how it plays out in
>> the real world. For instance, some workloads might issue kernel query
>> syscalls, such as recv, where buffer copying might not be the dominant
>> factor. So if the idea is optimizing the syscall mechanism, we should
>> try to leverage it as a whole in libc.
>
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.
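>
> (A crude way to measure it, if you want to try: a nonblocking recv on
> an empty socketpair returns EAGAIN without copying anything, so it
> times mostly kernel entry/exit plus the socket layers.  Sketch:
>
>     #include <stdio.h>
>     #include <time.h>
>     #include <sys/socket.h>
>
>     int main(void)
>     {
>       int sv[2];
>       char c;
>       struct timespec t0, t1;
>       socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
>       clock_gettime(CLOCK_MONOTONIC, &t0);
>       for (int i = 0; i < 1000000; i++)
>         recv(sv[0], &c, 1, MSG_DONTWAIT);  /* EAGAIN every time */
>       clock_gettime(CLOCK_MONOTONIC, &t1);
>       printf("%.1f ns/call\n",
>              ((t1.tv_sec - t0.tv_sec) * 1e9
>               + (t1.tv_nsec - t0.tv_nsec)) / 1e6);
>       return 0;
>     }
>
> The syscall instruction itself is a small slice of that.)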
>
>> >> And besides the cancellation performance issue, a new vDSO bridge
>> >> mechanism will still require setting up some extra bridge for the
>> >> case of an older kernel. In the scheme you suggested:
>> >>
>> >> __asm__("indirect call" ... with common clobbers);
>> >>
>> >> The indirect call will be either the vDSO bridge or a libc-provided
>> >> stub that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure
>> >> this is really a gain over:
>> >>
>> >> if (hwcap & PPC_FEATURE2_SCV) {
>> >>     __asm__(... with some clobbers);
>> >> } else {
>> >>     __asm__(... with different clobbers);
>> >> }
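>> >>
>> >> Spelled out a bit more (illustrative only: the exact clobber lists
>> >> and the scv error convention are assumptions, not the final ABI):
>> >>
>> >>     register long r0 __asm__ ("r0") = nr;
>> >>     register long r3 __asm__ ("r3") = arg1;
>> >>
>> >>     if (hwcap & PPC_FEATURE2_SCV)
>> >>       /* assumes scv returns a negative errno directly in r3 */
>> >>       __asm__ __volatile__ ("scv 0"
>> >>                             : "+r" (r3) : "r" (r0)
>> >>                             : "r4", "r5", "r6", "r7", "r8", "r9",
>> >>                               "r10", "r11", "r12", "lr", "ctr",
>> >>                               "cr0", "cr1", "cr5", "cr6", "cr7",
>> >>                               "memory");
>> >>     else
>> >>       /* sc flags errors in cr0.SO with a positive errno in r3,
>> >>          so normalize to the negative-errno convention */
>> >>       __asm__ __volatile__ ("sc\n\t"
>> >>                             "bns+ 1f\n\t"
>> >>                             "neg %0, %0\n"
>> >>                             "1:"
>> >>                             : "+r" (r3) : "r" (r0)
>> >>                             : "r4", "r5", "r6", "r7", "r8", "r9",
>> >>                               "r10", "r11", "r12", "cr0", "memory");
>> >>     return r3;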
>> >
>> > If the indirect call can be made roughly as efficiently as the sc
>> > sequence now (which already has some cost due to handling the nasty
>> > error return convention, making the indirect call likely just as
>> > small or smaller), it's O(1) additional code size (and thus icache
>> > usage) rather than O(n) where n is the number of syscall points.
>> >
>> > Of course it would work just as well (for avoiding O(n) growth) to
>> > have a direct call to an out-of-line branch like you suggested.
>>
>> Yes, but does it really matter to optimize this specific use case for
>> size? glibc, for instance, tries to leverage the syscall mechanism by
>> adding some complex preprocessor asm directives, and it optimizes the
>> syscall code size in most cases. For instance, kill in the static case
>> generates on x86_64:
>>
>> 0000000000000000 <__kill>:
>> 0: b8 3e 00 00 00 mov $0x3e,%eax
>> 5: 0f 05 syscall
>> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
>> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13>
>> 13: c3 retq
>>
>> While on musl:
>>
>> 0000000000000000 <kill>:
>> 0: 48 83 ec 08 sub $0x8,%rsp
>> 4: 48 63 ff movslq %edi,%rdi
>> 7: 48 63 f6 movslq %esi,%rsi
>> a: b8 3e 00 00 00 mov $0x3e,%eax
>> f: 0f 05 syscall
>> 11: 48 89 c7 mov %rax,%rdi
>> 14: e8 00 00 00 00 callq 19 <kill+0x19>
>> 19: 5a pop %rdx
>> 1a: c3 retq
>
> Wow, that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kind of a hack taking advantage of the
> kernel not looking at the high bits, I think), but the gratuitous
> stack adjustment and refusal to generate a tail call aren't. I'll see
> if we can track down what's going on and get it fixed.
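>
> For context, the musl source is just (paraphrasing from memory):
>
>     int kill(pid_t pid, int sig)
>     {
>         return syscall(SYS_kill, pid, sig);
>     }
>
> with syscall() expanding to __syscall_ret(__syscall(...)). The movslq
> pair is the int arguments being widened to the wrapper's longs, and
> the callq is the out-of-line __syscall_ret that sets errno; only the
> sub/pop and the missed tail call are the gratuitous part.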
>
>> But I hardly think it pays off the required code complexity. The same
>> goes for providing an O(1) bridge: it will require additional
>> complexity to write and set up correctly.
>
> In some sense I agree, but inline instructions are a lot more
> expensive on ppc (being 32-bit each), and it might take out-of-lining
> anyway to get rid of stack frame setups if that ends up being a
> problem.
>
>> >> Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
>> >> TCB member (as we do on glibc) and if we could make the asm clever
>> >> enough not to require different clobbers (although I am not sure if
>> >> that would be possible).
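>> >>
>> >> Something like this, to sketch the TCB idea (field and register
>> >> names are illustrative, not glibc's actual layout):
>> >>
>> >>     /* AT_HWCAP2 copied into the TCB once at startup, so the check
>> >>        is a single thread-pointer-relative load.  */
>> >>     struct tcb_head { unsigned long hwcap2; /* ... */ };
>> >>
>> >>     /* r13 is the ppc64 thread pointer */
>> >>     register struct tcb_head *__thread_self __asm__ ("r13");
>> >>
>> >>     #define HAVE_SCV() (__thread_self->hwcap2 & PPC_FEATURE2_SCV)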
>> >
>> > The easy way not to require different clobbers is just using the union
>> > of the clobbers, no? Does the proposed new method clobber any
>> > call-saved registers that would make it painful (requiring new call
>> > frames to save them in)?
>>
>> As far I can tell, it should be ok.
>
> Note that because lr is clobbered we need at least one normally
> call-clobbered register that's not syscall-clobbered to save lr in.
> Otherwise stack frame setup is required to spill it.
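>
> For example (illustrative; rN stands for whichever register that
> turns out to be), the wrapper could do:
>
>     mflr  rN     /* save return address; scv clobbers lr */
>     scv   0
>     mtlr  rN     /* restore lr without touching the stack */
>
> If no such register exists, every syscall point needs a frame just to
> spill lr.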

The kernel would like to use r9-r12 for itself. We could do with fewer
registers, but we have some delay establishing the stack (it depends on
a load which depends on a mfspr), and entry code tends to be quite
store-heavy, whereas on the caller side you have r1 already set up
(modulo stack updates), and the system call is a long delay during
which the store queue has significant time to drain.
My feeling is it would be better for the kernel to have these scratch
registers.
Thanks,
Nick