Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-09-14 Thread Peter Zijlstra
On Fri, Sep 14, 2018 at 02:56:46PM +0200, Florian Weimer wrote:
> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > Especially the extended switch case is problematic as the compiler tends to
> > generate a jump table, which then requires the use of retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Userspace is equally susceptible to spectre-v2. Ideally we'd recompile
world with retpoline, but given the amount of inline asm in say things
like openssl and similar projects, validating that there are indeed no
indirect calls/jumps left is nontrivial.

There are currently pending patches to otherwise address user-user
spectre-v2 attacks.



Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Peter Zijlstra
On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Sep 2018, John Stultz wrote:
> > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > all.  John, what's the scenario under which we need it?
> > 
> > So my memory is probably a bit foggy, but I recall that as we
> > accelerated gettimeofday, we found that even on systems that claimed
> > to have synced TSCs, they were actually just slightly out of sync.
> > Enough that right after cycles_last had been updated, a read on
> > another cpu could come in just behind cycles_last, resulting in a
> > negative interval causing lots of havoc.
> > 
> > So the sanity check is needed to avoid that case.
> 
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.

But, if the gtod code can observe this, then why doesn't the code that
checks the sync see it?


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Peter Zijlstra
On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > > lack TSC_ADJUST.
> > > > 
> > > > But, if the gtod code can observe this, then why doesn't the code that
> > checks the sync see it?
> > > 
> > > Because it depends where the involved CPUs are in the topology. The sync
> > > code might just run on the same package and simply not see it. Yes, w/o
> > > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> > 
> > Even with TSC adjust the TSC can be slightly off by design on multi-socket
> > systems.
> 
> Here are the gory details:
> 
>
> https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89a...@mail.gmail.com/
> 
> The changelog has an explanation as well.
> 
> d8bb6f4c1670 ("x86: tsc prevent time going backwards")
> 
> I still have one of the machines which is affected by this.

Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
current code:

notrace static u64 vread_tsc(void)
{
        u64 ret = (u64)rdtsc_ordered();
        u64 last = gtod->cycle_last;

        if (likely(ret >= last))
                return ret;

        /*
         * GCC likes to generate cmov here, but this branch is extremely
         * predictable (it's just a function of time and the likely is
         * very likely) and there's a data dependence, so force GCC
         * to generate a branch instead.  I don't barrier() because
         * we don't actually need a barrier, and if this function
         * ever gets inlined it will generate worse code.
         */
        asm volatile ("");
        return last;
}

That does:

lfence
rdtsc
load gtod->cycle_last

Which obviously allows us to observe a cycles_last that is later than
the rdtsc itself, and thus time can trivially go backwards.

The new code:

        last = gtod->cycle_last;
        cycles = vgetcyc(gtod->vclock_mode);
        if (unlikely((s64)cycles < 0))
                return vdso_fallback_gettime(clk, ts);
        if (cycles > last)
                ns += (cycles - last) * gtod->mult;

looks like:

load gtod->cycle_last
lfence
rdtsc

which avoids that possibility, the cycle_last load must have completed
before the rdtsc.
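
For completeness, a rough sketch of how such a read side hangs together, with the clamp and a seqcount-style retry; the struct and field names are made up for illustration and memory barriers are omitted:

#include <asm/msr.h>    /* rdtsc_ordered() */

struct clock_page {
        u32 seq;                /* even: stable, odd: update in progress */
        u64 cycle_last;         /* TSC value at the last timekeeper update */
        u64 base_ns;            /* nanoseconds accumulated up to cycle_last */
        u32 mult, shift;
};

static u64 read_ns(const volatile struct clock_page *cp)
{
        u64 cycles, last, ns;
        u32 seq;

        do {
                seq    = cp->seq;
                last   = cp->cycle_last;        /* load cycle_last ...      */
                cycles = rdtsc_ordered();       /* ... before lfence; rdtsc */
                ns     = cp->base_ns;
                if (cycles > last)              /* clamp: never go backwards */
                        ns += ((cycles - last) * cp->mult) >> cp->shift;
        } while ((seq & 1) || seq != cp->seq);  /* retry if the writer ran  */

        return ns;
}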




Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Peter Zijlstra
On Tue, Sep 18, 2018 at 03:23:08PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > > I still have one of the machines which is affected by this.
> > 
> > Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> > current code:
> 
> The load order of last vs. rdtsc does not matter at all.
> 
> CPU0                                    CPU1
> 
> now0 = rdtsc_ordered();
> ...
> tk->cycle_last = now0;
> 
> gtod->seq++;
> gtod->cycle_last = tk->cycle_last;
> ...
> gtod->seq++;
>                                         seq_begin(gtod->seq);
>                                         now1 = rdtsc_ordered();
> 
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.

Yeah, somehow I forgot about rseq.. maybe I should go sleep or
something.


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Peter Zijlstra
On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> I was hoping to hear this from you :-) If I am to suggest how we can
> move forward I'd propose:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).
> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

No, and it is not the hardware you have to worry about (mostly), it is
the frigging PoS firmware people put on it.

Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
after which it still can be, but bets are off). But even relatively
recent systems fail the TSC sync test because firmware messes it up by
writing to either MSR_TSC or MSR_TSC_ADJUST.

But the thing is, if the TSC is not synced, you cannot use it for
timekeeping, full stop. So having a single page is fine, it either
contains a mult/shift that is valid, or it indicates TSC is messed up
and you fall back to something else.

There is no in-between there.

For sched_clock we can still use the global page, because the rate will
still be the same for each cpu, it's just offset between CPUs and the
code compensates for that.
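
A rough sketch of that sched_clock arrangement: one shared rate, plus a per-cpu offset folded in by the reader (all names here are made up, and the multiply-overflow handling real code needs is omitted):

#include <asm/msr.h>            /* rdtsc_ordered() */
#include <linux/percpu.h>

struct shared_clock_page {
        u32 mult, shift;        /* one rate, valid for every CPU */
};

static struct shared_clock_page clock_page;     /* single page for the VM */
static DEFINE_PER_CPU(s64, tsc_offset_ns);      /* absorbs the constant skew */

static u64 my_sched_clock(void)
{
        u64 ns = ((u64)rdtsc_ordered() * clock_page.mult) >> clock_page.shift;

        /*
         * Same rate everywhere, only the offset differs per CPU: good enough
         * for sched_clock (local monotonicity), not for timekeeping.
         */
        return ns + this_cpu_read(tsc_offset_ns);
}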


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Peter Zijlstra
On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
> > On Oct 4, 2018, at 1:11 AM, Peter Zijlstra  wrote:
> > 
> >> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> >> I was hoping to hear this from you :-) If I am to suggest how we can
> >> move forward I'd propose:
> >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> >> is supported).
> >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> >> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> >> we think that all the buggy hardware is already gone?
> > 
> > No, and it is not the hardware you have to worry about (mostly), it is
> > the frigging PoS firmware people put on it.
> > 
> > Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> > after which it still can be, but bets are off). But even relatively
> > recent systems fail the TSC sync test because firmware messes it up by
> > writing to either MSR_TSC or MSR_TSC_ADJUST.
> > 
> > But the thing is, if the TSC is not synced, you cannot use it for
> > timekeeping, full stop. So having a single page is fine, it either
> > contains a mult/shift that is valid, or it indicates TSC is messed up
> > and you fall back to something else.
> > 
> > There is no in-between there.
> > 
> > For sched_clock we can still use the global page, because the rate will
> > still be the same for each cpu, it's just offset between CPUs and the
> > code compensates for that.
> 
> But if we’re in a KVM guest, then the clock will jump around on the
> same *vCPU* when the vCPU migrates.

Urgh yes..

> But I don’t see how kvmclock helps here, since I don’t think it’s used
> for sched_clock.

I get hits on kvm_sched_clock, but haven't looked further.

Anyway, Most of the argument still holds, either TSC is synced or it is
not and it _really_ should not be used. Not much middle ground there.


Re: [PATCH v3 1/7] seqnum_ops: Introduce Sequence Number Ops

2021-02-04 Thread Peter Zijlstra
On Wed, Feb 03, 2021 at 11:11:57AM -0700, Shuah Khan wrote:
> +static inline u32 seqnum32_inc(struct seqnum32 *seq)
> +{
> +        atomic_t val = ATOMIC_INIT(seq->seqnum);
> +
> +        seq->seqnum = (u32) atomic_inc_return(&val);
> +        if (seq->seqnum >= UINT_MAX)
> +                pr_info("Sequence Number overflow %u detected\n",
> +                        seq->seqnum);
> +        return seq->seqnum;
> +}

What kind of broken garbage is that?
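
For context: the snippet above copies the counter into an on-stack atomic_t, increments that private copy, and writes the result back non-atomically, so concurrent callers can lose updates; and because seq->seqnum is a u32, the ">= UINT_MAX" check can only ever fire at exactly UINT_MAX. A minimal sketch of what a plain wrapping counter looks like when the struct actually holds an atomic_t (purely illustrative, not a proposed API):

#include <linux/atomic.h>

struct seqnum32 {
        atomic_t seqnum;
};

static inline u32 seqnum32_inc(struct seqnum32 *seq)
{
        /*
         * atomic_inc_return() wraps from INT_MAX to INT_MIN, which is the
         * documented atomic_t behaviour; a counter that is allowed to wrap
         * needs no overflow message.
         */
        return (u32)atomic_inc_return(&seq->seqnum);
}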


Re: [PATCH v3 00/11] Introduce Simple atomic counters

2020-10-09 Thread Peter Zijlstra
On Fri, Oct 09, 2020 at 09:55:55AM -0600, Shuah Khan wrote:
> This patch series is a result of discussion at the refcount_t BOF
> the Linux Plumbers Conference. In this discussion, we identified
> a need for looking closely and investigating atomic_t usages in
> the kernel when it is used strictly as a counter without it
> controlling object lifetimes and state changes.
> 
> There are a number of atomic_t usages in the kernel where atomic_t api
> is used strictly for counting and not for managing object lifetime. In
> some cases, atomic_t might not even be needed.

Then the right patch is to not use atomic_t in those cases.

> The purpose of these counters is to clearly differentiate atomic_t
> counters from atomic_t usages that guard object lifetimes, hence prone
> to overflow and underflow errors. It allows tools that scan for underflow
> and overflow on atomic_t usages to detect overflow and underflows to scan
> just the cases that are prone to errors.

Guarding lifetimes is what we got refcount_t for.

> Simple atomic counters api provides interfaces for simple atomic counters
> that just count, and don't guard resource lifetimes. The interfaces are
> built on top of atomic_t api, providing a smaller subset of atomic_t
> interfaces necessary to support simple counters.

To what actual purpose?!? AFAICT it's pointless wrappery, it gets us
nothing.

> Counter wraps around to INT_MIN when it overflows and should not be used
> to guard resource lifetimes, device usage and open counts that control
> state changes, and pm states. Overflowing to INT_MIN is consistent with
> the atomic_t api, which it is built on top of.
> 
> Using counter_atomic* to guard lifetimes could lead to use-after free
> when it overflows and undefined behavior when used to manage state
> changes and device usage/open states.
> 
> This patch series introduces Simple atomic counters. Counter atomic ops
> leverage atomic_t and provide a sub-set of atomic_t ops.

Thanks for Cc'ing the atomic maintainers :/

NAK.


Re: [PATCH v3 00/11] Introduce Simple atomic counters

2020-10-10 Thread Peter Zijlstra
On Fri, Oct 09, 2020 at 01:45:43PM -0700, Kees Cook wrote:
> On Fri, Oct 09, 2020 at 09:37:46PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2020 at 09:55:55AM -0600, Shuah Khan wrote:
> > > Simple atomic counters api provides interfaces for simple atomic counters
> > > that just count, and don't guard resource lifetimes. The interfaces are
> > > built on top of atomic_t api, providing a smaller subset of atomic_t
> > > interfaces necessary to support simple counters.
> > 
> > To what actual purpose?!? AFAICT it's pointless wrappery, it gets us
> > nothing.
> 
> It's not pointless. There is value is separating types for behavioral
> constraint to avoid flaws. atomic_t provides a native operation. We gained
> refcount_t for the "must not wrap" type, and this gets us the other side
> of that behavioral type, which is "wrapping is expected". Separating the
> atomic_t uses allows for a clearer path to being able to reason about
> code flow, whether it be a human or a static analyzer.

refcount_t got us actual runtime exceptions that atomic_t doesn't. This
proposal gets us nothing.

atomic_t is very much expected to wrap.
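
For reference, a minimal sketch of that behavioural difference (the saturation details are simplified):

#include <linux/atomic.h>
#include <linux/refcount.h>

static atomic_t   stat_cnt = ATOMIC_INIT(0);    /* plain counter: wraps silently */
static refcount_t obj_ref  = REFCOUNT_INIT(1);  /* lifetime count: checked */

static void example(void)
{
        atomic_inc(&stat_cnt);   /* at INT_MAX this wraps to INT_MIN, no warning */
        refcount_inc(&obj_ref);  /* on overflow refcount_t saturates and WARNs
                                  * instead of wrapping -- the runtime exception
                                  * atomic_t does not have */
}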

> The counter wrappers add nothing to the image size, and only serve to
> confine the API to one that cannot be used for lifetime management.

It doesn't add anything period. It doesn't get us new behaviour, it
splits a 'can wrap' use-case from a 'can wrap' type. That's sodding
pointless.

Worse, it mixes 2 unrelated cases into one type, which just makes a
mockery of things (all the inc_return users are not statistics, some
might even mis-behave if they wrap).

> Once conversions are done, we have a clean line between refcounting
> and statistical atomics, which means we have a much lower chance of
> introducing new flaws (and maybe we'll fix flaws during the conversion,
> which we've certainly seen before when doing this stricter type/language
> changes).
> 
> I don't see why this is an objectionable goal.

People can and will always find a way to mess things up.

Only add types when you get behavioural changes, otherwise it's
pointless noise.

My NAK stands.


Re: [PATCH v3 00/11] Introduce Simple atomic counters

2020-10-14 Thread Peter Zijlstra
On Tue, Oct 13, 2020 at 08:12:20PM -0600, Shuah Khan wrote:

> They don't add any new behavior. As Kees mentioned, they do give us a
> way to clearly differentiate atomic usages that can wrap.

No it doesn't! atomic_t can wrap, this thing can wrap, no distinction.

All it does is fragment the API and sow confusion. FOR NO BENEFIT.

> > Worse, it mixes 2 unrelated cases into one type, which just makes a
> > mockery of things (all the inc_return users are not statistics, some
> > might even mis-behave if they wrap).
> > 
> 
> You are right that all inc_return usages aren't statistics. There are
> 3 distinct usages:
> 
> 1. Stats
> 2. Cases where wrapping is fine
> 3. Cases where wrapping could be a problem. In which case, this API
>shouldn't be used.

And yet, afaict patch 4 is case 3...

> There is no need to keep inc_return in this API as such. I included it
> so it can be used for above cases 1 and 2, so the users don't have to
> call inc() followed by read(). It can be left out of the API.
> 
> The atomic_t usages in the kernel fall into the following categories:
> 
> 1. Stats (tolerance for accuracy determines whether they need to be
>atomic or not). RFC version included non-atomic API for cases
>when lossiness is acceptable. All these cases use/need just init
>and inc. There are two variations in this case:
> 
>a. No checks for wrapping. Use signed value.
>b. No checks for wrapping, but return unsigned.
> 
> 2. Reference counters that release resources and wrapping could result
>in use-after-free type problems. There are two variations in this
>case:
> 
>a. Increments and decrements aren't bounded.
>b. Increments and decrements are bounded.
> 
>Currently tools that flag unsafe atomic_t usages that are candidates
>for refcount_t conversions don't make a distinction between the two.
> 
>The second case, since increments and decrements are bounded, it is
>safe to continue to use it. At the moment there is no good way to
>tell them apart other than looking at each of these cases.
> 
> 3. Reference counters that manage/control states. Wrapping is a problem
>in this case, as it could lead to undefined behavior. These cases
>don't use test and free, use inc/dec. At the moment there is no good
>way to tell them apart other than looking at each of these cases.
>This is addressed by REFCOUNT_SATURATED case.

Wrong! The atomic usage in mutex doesn't fall in any of those
categories.


The only thing you're all saying that makes sense is that unintentional
wrapping can have bad consequences, the rest is pure confusion.

Focus on the non-wrapping cases, _everything_ else is not going
anywhere.

So audit the kernel, find the cases that should not wrap, categorize and
create APIs for them that trap the wrapping. But don't go around
confusing things that don't need confusion.


Re: [PATCH v3 00/11] Introduce Simple atomic counters

2020-10-16 Thread Peter Zijlstra
On Wed, Oct 14, 2020 at 04:31:42PM -0700, Kees Cook wrote:
> On Wed, Oct 14, 2020 at 11:17:20AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2020 at 08:12:20PM -0600, Shuah Khan wrote:
> > 
> > > They don't add any new behavior. As Kees mentioned, they do give us a
> > > way to clearly differentiate atomic usages that can wrap.
> > 
> > No it doesn't! atomic_t can wrap, this thing can wrap, no distinction.
> > 
> > All it does is fragment the API and sow confusion. FOR NO BENEFIT.
> 
> I really don't see it this way. It's a distinct subset of the atomic_t
> API. The trouble that has existed here has been with an atomic_t being
> originally used NOT for lifetime management, that mutates into something
> like that because of the available API, but doing so without realizing
> it. atomic_t gets used for all kinds of algorithms, and the "counter"
> type is way too easily accidentally transformed into a "lifetime
> tracker" and we get bugs.
> 
> If we have a distinct type for wrapping-counters that limits the API,
> then it is much harder for folks to shoot themselves in the foot. I don't
> see why this is so bad: we end up with safer usage, more easily auditable
> code behavior ("how was this atomic_t instance _intended_ to be used?"),
> and no change in binary size.

I simply do not believe in that. Maybe I've seen too much code like:

  atomic_dec_and_test(&counter->cnt);

or god forbid (thanks Will):

  typedef union { refcount_t ref; counter_atomic_t cnt; atomic_t atm; } my_awesome_type_t;

People simply aren't too fussed. They'll do whatever it takes (to get
the job done with minimal effort).

You think you've cleaned things up, but the moment you turn your back,
it'll descend into madness at break-neck speed. This is part economics
and part the fact that C is crap (see the above two examples).

Can't we use static-analysis/robots for this? The constant feedback from
those is the only thing that helps stem the tide of crap.

Create a variable attribute __wrap, and have the robot complain when
atomic_*_and_test() are used on a variable declared with __wrap. Or
better yet, when a __wrap heads a condition that leads to call_rcu() /
*free*().
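
A sketch of what that could look like; "__wrap" here is entirely hypothetical and only exists for an external checker to key on:

#include <linux/atomic.h>

/*
 * Hypothetical marker: expands to nothing for the compiler, but a
 * static-analysis tool / robot could flag atomic_*_and_test(), call_rcu()
 * or kfree() reachable from variables declared with it.
 */
#define __wrap

struct foo_stats {
        __wrap atomic_t rx_packets;     /* allowed to wrap; never guards a lifetime */
};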

> > > There is no need to keep inc_return in this API as such. I included it
> > > so it can be used for above cases 1 and 2, so the users don't have to
> > > call inc() followed by read(). It can be left out of the API.
> 
> I go back and forth on this, but after looking at these instances,
> it makes sense to have inc_return(), for where counters are actually
> "serial numbers". An argument could be made[1], however, that such uses
> should not end up in the position of _reusing_ earlier identifiers, which
> means it actually can't wrap. (And some cases just need u64 to make this
> happen[2] -- and in that specific case, don't even need to be atomic_t).
> 
> [1] https://lore.kernel.org/lkml/202010071334.8298F3FA7@keescook/
> [2] https://git.kernel.org/linus/d1e7fd6462ca9fc76650fbe6ca800e35b24267da

Sure, there's valid cases where wrapping is ok, but it's a distinct
use-case and really shouldn't be mixed up in the statistics one --
they're distinct.

The fact that patch #4 appears to get this wrong seems like a fairly big
clue.

> > Wrong! The atomic usage in mutex doesn't fall in any of those
> > categories.
> 
> But the atomic usage in mutex is *IN* mutex -- it's a separate data
> type, etc. We don't build mutexes manually, so why build counters
> manually?

The claim was: atomic usage in the kernel falls in these 3 categories.

 - mutex uses atomic (atomic_long_t);
 - mutex is in the kernel;
 - mutex's use of atomic does not fit the listed categories.

Therefore, the claim is false. In fact most of the locking primitives
use atomic one way or another. And yes, some people do build their own.

> > The only thing you're all saying that makes sense is that unintentional
> > wrapping can have bad consequences, the rest is pure confusion.
> > 
> > Focus on the non-wrapping cases, _everything_ else is not going
> > anywhere.
> 
> I view this as a way to do so: this subset of wrapping cases is being
> identified and removed from the pool of all the atomic_t cases so that
> they will have been classified, and we can continue to narrow down all
> the atomic_t uses to find any potentially mis-used non-wrapping cases.
> 
> The other option is adding some kind of attribute to the declarations
> (which gets us the annotation) but doesn't provide a limit to the API.
> (e.g. no counter should ever call dec_return).

See above; robots can do exactly that.

> > So 

Re: [PATCH] sched/wait: introduce wait_event_freezable_hrtimeout

2019-01-18 Thread Peter Zijlstra
On Fri, Jan 18, 2019 at 10:19:41AM -0500, Joel Fernandes wrote:
> You should document the variable names in the header comments.
> 
> Also, this new API appears to conflict with definition of 'freezable' in
> wait_event_freezable I think,
> 
> wait_event_freezable - sleep or freeze until condition is true.
> 
> wait_event_freezable_hrtimeout - sleep but make sure freezer is not blocked.
> (your API)
> 
> It seems to me these are 2 different definitions of 'freezing' (or I could be
> completely confused). But in the first case it calls try_to_freeze after
> schedule(). In the second case (your new API), it calls freezable_schedule().
> 
> So I am wondering why is there this difference in the 'meaning of freezable'.
> In the very least, any such subtle differences should be documented in the
> header comments IMO.

Also; someone was recently hating on the whole freezing thing (again,
and rightfully). I just cannot remember who; Rafael you remember?


Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-02-27 Thread Peter Zijlstra
On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote:
>  arch/x86/hyperv/hv_init.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..92291c18d716 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>  
> + /* Cleanup page reference before reset the page */
> + hv_hypercall_pg = NULL;
> + wmb();

What do we need that SFENCE for? And why does it lack a comment?

> +
>   /* Reset the hypercall page */
>   hypercall_msr.as_uint64 = 0;
>   wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> -- 
> 2.20.1
> 


Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-03-05 Thread Peter Zijlstra
On Wed, Feb 27, 2019 at 10:55:46PM +0800, Kairui Song wrote:
> On Wed, Feb 27, 2019 at 8:02 PM Peter Zijlstra  wrote:
> >
> > On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote:
> > >  arch/x86/hyperv/hv_init.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > > index 7abb09e2eeb8..92291c18d716 100644
> > > --- a/arch/x86/hyperv/hv_init.c
> > > +++ b/arch/x86/hyperv/hv_init.c
> > > @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
> > >   /* Reset our OS id */
> > >   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> > >
> > > + /* Cleanup page reference before reset the page */
> > > + hv_hypercall_pg = NULL;
> > > + wmb();
> >
> > What do we need that SFENCE for? And why does it lack a comment?
> 
> Hi, that's for ensuring the hv_hypercall_pg is reset to NULL before
> the following wrmsr call. The wrmsr call will make the pointer address
> invalid.

WRMSR is a serializing instruction (except for TSC_DEADLINE and the
X2APIC).

> I can add some comment in V2 if this is OK.

Barriers must always have a comment.



Re: [PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV

2019-03-05 Thread Peter Zijlstra
On Tue, Mar 05, 2019 at 08:17:03PM +0800, Kairui Song wrote:
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..34aa1e953dfc 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -406,6 +406,12 @@ void hyperv_cleanup(void)
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>  
> + /* Cleanup hypercall page reference before reset the page */
> + hv_hypercall_pg = NULL;
> +
> + /* Make sure page reference is cleared before wrmsr */

This comment forgets to tell us who cares about this. And why the wrmsr
itself isn't serializing enough.

> + wmb();
> +
>   /* Reset the hypercall page */
>   hypercall_msr.as_uint64 = 0;
>   wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

That looks like a fake MSR; and you're telling me that VMEXIT doesn't
serialize?


Re: [PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV

2019-03-05 Thread Peter Zijlstra
On Tue, Mar 05, 2019 at 09:34:13PM +0800, Kairui Song wrote:
> On Tue, Mar 5, 2019 at 8:33 PM Peter Zijlstra  wrote:
> >
> > On Tue, Mar 05, 2019 at 08:17:03PM +0800, Kairui Song wrote:
> > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > > index 7abb09e2eeb8..34aa1e953dfc 100644
> > > --- a/arch/x86/hyperv/hv_init.c
> > > +++ b/arch/x86/hyperv/hv_init.c
> > > @@ -406,6 +406,12 @@ void hyperv_cleanup(void)
> > >   /* Reset our OS id */
> > >   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> > >
> > > + /* Cleanup hypercall page reference before reset the page */
> > > + hv_hypercall_pg = NULL;
> > > +
> > > + /* Make sure page reference is cleared before wrmsr */
> >
> > This comment forgets to tell us who cares about this. And why the wrmsr
> > itself isn't serializing enough.
> >
> > > + wmb();
> > > +
> > >   /* Reset the hypercall page */
> > >   hypercall_msr.as_uint64 = 0;
> > >   wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> >
> > That looks like a fake MSR; and you're telling me that VMEXIT doesn't
> > serialize?
> 
> Thanks for the review, seem I being a bit paranoid on this. Will drop
> it and send a v3 if no one has any other complaint.

Also; our wrmsr implementation has a "memory" clobber on, which means
you don't even need a compiler barrier to force emit that store before
the wrmsr.
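
Roughly, a simplified sketch of such a wrapper (illustrative, not the exact kernel source):

#include <linux/types.h>

/*
 * The "memory" clobber tells the compiler the asm may read/write any memory,
 * so prior stores (like hv_hypercall_pg = NULL) must be emitted before the
 * wrmsr -- no extra barrier() needed.
 */
static inline void my_wrmsrl(unsigned int msr, u64 val)
{
        asm volatile("wrmsr"
                     : /* no outputs */
                     : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                     : "memory");
}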

You might want to retain the comment that explains this ordering though.

hv_hypercall_pg = NULL;
/*
 * text that explains why hv_hypercall_pg must be set NULL
 * before the wrmsr
 */
wrmsrl(...);



Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Peter Zijlstra
On Thu, Aug 01, 2019 at 07:16:19PM -0700, john.hubb...@gmail.com wrote:

> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions"). That commit
> has an extensive description of the problem and the planned steps to
> solve it, but the highlights are:

That is one horridly mangled Changelog there :-/ It looks like it's
partially duplicated.

Anyway; no objections to any of that, but I just wanted to mention that
there are other problems with long term pinning that haven't been
mentioned, notably they inhibit compaction.

A long time ago I proposed an interface to mark pages as pinned, such
that we could run compaction before we actually did the pinning.


Re: Inconsistent lock state with Hyper-V memory balloon?

2014-11-10 Thread Peter Zijlstra
On Sat, Nov 08, 2014 at 02:36:54PM +, Sitsofe Wheeler wrote:
> I've been trying to use the Hyper-V balloon driver to allow the host to
> reclaim unused memory but have been hitting issues. With a Hyper-V 2012
> R2 guest with 4GBytes of RAM, dynamic memory on, 1GByte minimum 10GByte
> maximum, 8 vcpus, running a 3.18.0-rc3 kernel with no swap configured
> the following lockdep splat occurred:
> 
> =================================
> [ INFO: inconsistent lock state ]
> 3.18.0-rc3.x86_64 #159 Not tainted
> ---------------------------------
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>  (bdev_lock){+.?...}, at: [] nr_blockdev_pages+0x1c/0x80
> {SOFTIRQ-ON-W} state was registered at:
>   [] __lock_acquire+0x87d/0x1c60
>   [] lock_acquire+0xfc/0x150
>   [] _raw_spin_lock+0x39/0x50
>   [] nr_blockdev_pages+0x1c/0x80
>   [] si_meminfo+0x47/0x70
>   [] eventpoll_init+0x11/0x10a
>   [] do_one_initcall+0xf9/0x1a7
>   [] kernel_init_freeable+0x1d4/0x268
>   [] kernel_init+0xe/0x100
>   [] ret_from_fork+0x7c/0xb0
> irq event stamp: 2660283708
> hardirqs last  enabled at (2660283708): [] free_hot_cold_page+0x175/0x190
> hardirqs last disabled at (2660283707): [] free_hot_cold_page+0xa5/0x190
> softirqs last  enabled at (2660132034): [] _local_bh_enable+0x4a/0x50
> softirqs last disabled at (2660132035): [] irq_exit+0x58/0xc0
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(bdev_lock);
>   <Interrupt>
>     lock(bdev_lock);
> 
>  *** DEADLOCK ***
> 
> no locks held by swapper/0/0.
> 
> 
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc3.x86_64 #159
> Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
>  8266ac90 880107403af8 816db3ef 
>  81c134c0 880107403b58 816d6fd3 0001
>  0001 8801 81010e6f 0046
> Call Trace:
>[] dump_stack+0x4e/0x68
>  [] print_usage_bug+0x1f3/0x204
>  [] ? save_stack_trace+0x2f/0x50
>  [] ? print_irq_inversion_bug+0x200/0x200
>  [] mark_lock+0x176/0x2e0
>  [] __lock_acquire+0x7c3/0x1c60
>  [] ? lookup_address+0x28/0x30
>  [] ? _lookup_address_cpa.isra.3+0x3b/0x40
>  [] ? __debug_check_no_obj_freed+0x89/0x220
>  [] lock_acquire+0xfc/0x150
>  [] ? nr_blockdev_pages+0x1c/0x80
>  [] _raw_spin_lock+0x39/0x50
>  [] ? nr_blockdev_pages+0x1c/0x80
>  [] nr_blockdev_pages+0x1c/0x80
>  [] si_meminfo+0x47/0x70
>  [] post_status.isra.3+0x6d/0x190
>  [] ? trace_hardirqs_on+0xd/0x10
>  [] ? __free_pages+0x2f/0x60
>  [] ? free_balloon_pages.isra.5+0x8f/0xb0
>  [] balloon_onchannelcallback+0x212/0x380
>  [] vmbus_on_event+0x173/0x1d0
>  [] tasklet_action+0x127/0x160
>  [] __do_softirq+0x18a/0x340
>  [] irq_exit+0x58/0xc0
>  [] hyperv_vector_handler+0x45/0x60
>  [] hyperv_callback_vector+0x72/0x80
>[] ? native_safe_halt+0x6/0x10
>  [] ? trace_hardirqs_on+0xd/0x10
>  [] default_idle+0x51/0xf0
>  [] arch_cpu_idle+0xf/0x20
>  [] cpu_startup_entry+0x217/0x3f0
>  [] rest_init+0xc9/0xd0
>  [] ? rest_init+0x5/0xd0
>  [] start_kernel+0x438/0x445
>  [] ? set_init_arg+0x57/0x57
>  [] ? early_idt_handlers+0x120/0x120
>  [] x86_64_start_reservations+0x2a/0x2c
>  [] x86_64_start_kernel+0x13e/0x14d
> 
> Any help deciphering the above is greatly appreciated!

It's fairly simple: the first trace shows where bdev_lock was taken with
softirqs enabled, and the second trace shows where it's taken from
softirqs. Combine the two and you've got a recursive deadlock.

I don't know the block layer very well, but a quick glance at the code
shows that bdev_lock isn't meant to be used from softirq context,
therefore the hyperv stuff is broken.

So complain to the hyperv people.
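
The pattern in a nutshell (illustrative code only, not the actual block layer):

#include <linux/spinlock.h>

/*
 * If a lock is ever taken from softirq context, every process-context user
 * must disable bottom halves around it; otherwise a softirq arriving while
 * the lock is held self-deadlocks on that CPU.
 */
static DEFINE_SPINLOCK(example_lock);

static void process_context_user(void)
{
        spin_lock_bh(&example_lock);    /* NOT plain spin_lock() */
        /* ... nr_blockdev_pages()-style work ... */
        spin_unlock_bh(&example_lock);
}

static void softirq_context_user(void)  /* e.g. from a tasklet */
{
        spin_lock(&example_lock);       /* BHs are already disabled here */
        /* ... */
        spin_unlock(&example_lock);
}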


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-12 Thread Peter Zijlstra
On Tue, Jun 11, 2019 at 10:20:03PM +0100, Dmitry Safonov wrote:
> KVM support may be compiled as dynamic module, which triggers the
> following splat on modprobe:
> 
>  KVM: vmx: using Hyper-V Enlightened VMCS
>  BUG: using smp_processor_id() in preemptible [] code: modprobe/466 
> caller is debug_smp_processor_id+0x17/0x19
>  CPU: 0 PID: 466 Comm: modprobe Kdump: loaded Not tainted 4.19.43 #1
>  Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
>  Call Trace:
>   dump_stack+0x61/0x7e
>   check_preemption_disabled+0xd4/0xe6
>   debug_smp_processor_id+0x17/0x19
>   set_hv_tscchange_cb+0x1b/0x89
>   kvm_arch_init+0x14a/0x163 [kvm]
>   kvm_init+0x30/0x259 [kvm]
>   vmx_init+0xed/0x3db [kvm_intel]
>   do_one_initcall+0x89/0x1bc
>   do_init_module+0x5f/0x207
>   load_module+0x1b34/0x209b
>   __ia32_sys_init_module+0x17/0x19
>   do_fast_syscall_32+0x121/0x1fa
>   entry_SYSENTER_compat+0x7f/0x91
> 
> The easiest solution seems to be disabling preemption while setting up
> reenlightenment MSRs. While at it, fix hv_cpu_*() callbacks.
> 
> Fixes: 93286261de1b4 ("x86/hyperv: Reenlightenment notifications
> support")
> 
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Cathy Avery 
> Cc: Haiyang Zhang 
> Cc: "H. Peter Anvin" 
> Cc: Ingo Molnar 
> Cc: "K. Y. Srinivasan" 
> Cc: "Michael Kelley (EOSG)" 
> Cc: Mohammed Gamal 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Roman Kagan 
> Cc: Sasha Levin 
> Cc: Stephen Hemminger 
> Cc: Thomas Gleixner 
> Cc: Vitaly Kuznetsov 
> 
> Cc: de...@linuxdriverproject.org
> Cc: k...@vger.kernel.org
> Cc: linux-hyp...@vger.kernel.org
> Cc: x...@kernel.org
> Reported-by: Prasanna Panchamukhi 
> Signed-off-by: Dmitry Safonov 
> ---
>  arch/x86/hyperv/hv_init.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 1608050e9df9..0bdd79ecbff8 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
>  static int hv_cpu_init(unsigned int cpu)
>  {
>   u64 msr_vp_index;
> - struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> + struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>   void **input_arg;
>   struct page *pg;
>  
> @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
>  
>   hv_get_vp_index(msr_vp_index);
>  
> - hv_vp_index[smp_processor_id()] = msr_vp_index;
> + hv_vp_index[cpu] = msr_vp_index;
>  
>   if (msr_vp_index > hv_max_vp_index)
>   hv_max_vp_index = msr_vp_index;
> @@ -182,7 +182,6 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   struct hv_reenlightenment_control re_ctrl = {
>   .vector = HYPERV_REENLIGHTENMENT_VECTOR,
>   .enabled = 1,
> - .target_vp = hv_vp_index[smp_processor_id()]
>   };
>   struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>  
> @@ -196,7 +195,11 @@ void set_hv_tscchange_cb(void (*cb)(void))
>   /* Make sure callback is registered before we write to MSRs */
>   wmb();
>  
> + preempt_disable();
> + re_ctrl.target_vp = hv_vp_index[smp_processor_id()];
>   wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> + preempt_enable();
> +
>   wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
>  }
>  EXPORT_SYMBOL_GPL(set_hv_tscchange_cb);

This looks bogus, MSRs are a per-cpu resource, you had better know what
CPUs you're on and be stuck to it when you do wrmsr. This just fudges
the code to make the warning go away and doesn't fix the actual problem
afaict.


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Peter Zijlstra
On Wed, Jun 12, 2019 at 12:17:24PM +0200, Vitaly Kuznetsov wrote:
> Dmitry Safonov  writes:
> 
> > KVM support may be compiled as dynamic module, which triggers the
> > following splat on modprobe:
> >
> >  KVM: vmx: using Hyper-V Enlightened VMCS
> >  BUG: using smp_processor_id() in preemptible [] code: modprobe/466 
> > caller is debug_smp_processor_id+0x17/0x19
> >  CPU: 0 PID: 466 Comm: modprobe Kdump: loaded Not tainted 4.19.43 #1
> >  Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
> >  Call Trace:
> >   dump_stack+0x61/0x7e
> >   check_preemption_disabled+0xd4/0xe6
> >   debug_smp_processor_id+0x17/0x19
> >   set_hv_tscchange_cb+0x1b/0x89
> >   kvm_arch_init+0x14a/0x163 [kvm]
> >   kvm_init+0x30/0x259 [kvm]
> >   vmx_init+0xed/0x3db [kvm_intel]
> >   do_one_initcall+0x89/0x1bc
> >   do_init_module+0x5f/0x207
> >   load_module+0x1b34/0x209b
> >   __ia32_sys_init_module+0x17/0x19
> >   do_fast_syscall_32+0x121/0x1fa
> >   entry_SYSENTER_compat+0x7f/0x91
> 
> Hm, I never noticed this one, you probably need something like
> CONFIG_PREEMPT enabled to see it.

CONFIG_DEBUG_PREEMPT

> > @@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
> >  static int hv_cpu_init(unsigned int cpu)
> >  {
> > u64 msr_vp_index;
> > -   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> > +   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
> > void **input_arg;
> > struct page *pg;
> >  
> > @@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
> >  
> > hv_get_vp_index(msr_vp_index);
> >  
> > -   hv_vp_index[smp_processor_id()] = msr_vp_index;
> > +   hv_vp_index[cpu] = msr_vp_index;
> >  
> > if (msr_vp_index > hv_max_vp_index)
> > hv_max_vp_index = msr_vp_index;
> 
> The above is unrelated cleanup (as cpu == smp_processor_id() for
> CPUHP_AP_ONLINE_DYN callbacks), right? As I'm pretty sure these can't be
> preempted.

Yeah, makes sense though.

> > @@ -182,7 +182,6 @@ void set_hv_tscchange_cb(void (*cb)(void))
> > struct hv_reenlightenment_control re_ctrl = {
> > .vector = HYPERV_REENLIGHTENMENT_VECTOR,
> > .enabled = 1,
> > -   .target_vp = hv_vp_index[smp_processor_id()]
> > };
> > struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
> >  
> > @@ -196,7 +195,11 @@ void set_hv_tscchange_cb(void (*cb)(void))
> > /* Make sure callback is registered before we write to MSRs */
> > wmb();
> >  
> > +   preempt_disable();
> > +   re_ctrl.target_vp = hv_vp_index[smp_processor_id()];
> > wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> > +   preempt_enable();
> > +
> 
> My personal preference would be to do something like
>    int cpu = get_cpu();
> 
>    ... set things up ...
> 
>    put_cpu();

If it doesn't matter, how about this then?

---
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1608050e9df9..e58c693a9fce 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(hv_max_vp_index);
 static int hv_cpu_init(unsigned int cpu)
 {
u64 msr_vp_index;
-   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
+   struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
void **input_arg;
struct page *pg;
 
@@ -103,7 +103,7 @@ static int hv_cpu_init(unsigned int cpu)
 
hv_get_vp_index(msr_vp_index);
 
-   hv_vp_index[smp_processor_id()] = msr_vp_index;
+   hv_vp_index[cpu] = msr_vp_index;
 
if (msr_vp_index > hv_max_vp_index)
hv_max_vp_index = msr_vp_index;
@@ -182,7 +182,7 @@ void set_hv_tscchange_cb(void (*cb)(void))
struct hv_reenlightenment_control re_ctrl = {
.vector = HYPERV_REENLIGHTENMENT_VECTOR,
.enabled = 1,
-   .target_vp = hv_vp_index[smp_processor_id()]
+   .target_vp = hv_vp_index[raw_smp_processor_id()]
};
struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
 


Re: [PATCH] x86/hyperv: Disable preemption while setting reenlightenment vector

2019-06-14 Thread Peter Zijlstra
On Fri, Jun 14, 2019 at 12:50:51PM +0100, Dmitry Safonov wrote:
> On 6/14/19 11:08 AM, Vitaly Kuznetsov wrote:
> > Peter Zijlstra  writes:
> > 
> >> @@ -182,7 +182,7 @@ void set_hv_tscchange_cb(void (*cb)(void))
> >>struct hv_reenlightenment_control re_ctrl = {
> >>.vector = HYPERV_REENLIGHTENMENT_VECTOR,
> >>.enabled = 1,
> >> -  .target_vp = hv_vp_index[smp_processor_id()]
> >> +  .target_vp = hv_vp_index[raw_smp_processor_id()]
> >>};
> >>struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
> >>  
> > 
> > Yes, this should do, thanks! I'd also suggest to leave a comment like
> > /* 
> >  * This function can get preempted and migrate to a different CPU
> >  * but this doesn't matter. We just need to assign
> >  * reenlightenment notification to some online CPU. In case this
> >  * CPU goes offline, hv_cpu_die() will re-assign it to some
> >  * other online CPU.
> >  */
> 
> What if the cpu goes down just before wrmsrl()?
> I mean, hv_cpu_die() will reassign another cpu, but this thread will be
> resumed on some other cpu and will write the number of a cpu which is at that
> moment already down?
> 
> (probably I miss something)
> 
> And I presume it's guaranteed that during hv_cpu_die() no other cpu may
> go down:
> : new_cpu = cpumask_any_but(cpu_online_mask, cpu);
> : re_ctrl.target_vp = hv_vp_index[new_cpu];
> : wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));

Then cpus_read_lock() is the right interface, not preempt_disable().

I know you probably can't change the HV interface, but I'm thinking its
rather daft you have to specify a CPU at all for this. The HV can just
pick one and send the notification there, who cares.
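
i.e., something along these lines (sketch only; whether this is sufficient for the hv_cpu_die() interplay is a separate question):

        /*
         * Hold the hotplug read lock so the CPU picked from cpu_online_mask
         * cannot go away before the MSR write lands.
         */
        cpus_read_lock();

        re_ctrl.target_vp = hv_vp_index[cpumask_first(cpu_online_mask)];
        wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));

        cpus_read_unlock();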


Re: [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore.

2017-11-15 Thread Peter Zijlstra
On Thu, Oct 26, 2017 at 04:07:49PM +0200, Martijn Coenen wrote:
> Because we have disabled RT priority inheritance for
> the regular binder domain, the following can happen:
> 
> 1) thread A (prio 98) calls into thread B
> 2) because RT prio inheritance is disabled, thread B
>runs at the lowest nice (prio 100) instead
> 3) thread B calls back into A; A will run at prio 100
>for the duration of the transaction

That sounds wrong.. Why would you ever downgrade A?



Re: [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node.

2017-11-15 Thread Peter Zijlstra
On Thu, Oct 26, 2017 at 04:07:46PM +0200, Martijn Coenen wrote:
> This change adds flags to flat_binder_object.flags
> to allow indicating a minimum scheduling policy for
> the node. It also clarifies the valid value range
> for the priority bits in the flags.
> 
> Internally, we use the priority map that the kernel
> uses, e.g. [0..99] for real-time policies and [100..139]
> for the SCHED_NORMAL/SCHED_BATCH policies.

I will break that without consideration if I have to. That really isn't
something anybody outside of the core code should rely on.


Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

2017-11-15 Thread Peter Zijlstra
On Thu, Oct 26, 2017 at 04:07:45PM +0200, Martijn Coenen wrote:
> +/**
> + * binder_set_priority() - sets the scheduler priority of a task
> + * @task:    task to set priority on
> + * @desired: desired priority to run at
> + *
> + * The scheduler policy of tasks is changed explicitly, because we want to
> + * support a few distinct features:
> + * 1) If the requested priority is higher than the maximum allowed priority,
> + *we want to "clip" at the highest supported priority.
> + * 2) For a future patch, we need to allow changing the priority of a task
> + *with a different UID; when we make a binder transaction from process A
> + *to process B with different UIDs, A must be able to set B's priority
> + *before B wakes up to handle the call. If B were to raise its own priority
> + *after waking up, a race condition exists where B gets preempted before
> + *it can raise its own priority.
> + *
> + * Feature 2) sounds like something a rt_mutex would solve, for example by
> + * having the caller proxy lock an rt_mutex on behalf of the callee, and then
> + * sleeping on it. But we have a few requirements that don't work with this
> + * approach:
> + * 1) binder supports a "minimum node priority", meaning that all transactions
> + *into a node must run at this priority at a minimum. This means that the
> + *desired priority for handling a transaction is not necessarily equal to
> + *the priority of the caller.

Isn't that the same as running everything in that node (whatever that
is) at that given prio?

> + * 2) binder supports asynchronous transactions, where the caller is not blocked
> + *on transaction completion; so, it also can't be blocked on an rt_mutex.

This is not a theoretically sound thing to do... For a system to be
real-time it not only needs to guarantee forward progress but also have
bounded runtime.

Elevating another task's priority to your own and then not blocking
effectively injects time. At the very least this time should be taking
into consideration when doing runtime analysis of the system.

Also, since its async, why do we care? The original task is still making
progress. Only at the point where we start to wait for completion does
it make sense to boost. Otherwise you end up having to charge one task
double time.

> + * 3) similarly, there may not necessarily be a thread waiting for
> + *transactions at the time the call is made, so we don't know who to proxy-
> + *lock the lock for.

Non-issue; you have the exact same problem now. Your interface takes a
@task argument, so you have to have a @task either way around.

> + * 4) binder supports nested transactions, where A can call into B, and B can
> + *call back into A before returning a reply to the original transaction.
> + *This means that if A is blocked on an rt_mutex B holds, B must first wake
> + *up A to handle a new transaction, and only then can it proxy-lock and try
> + *to acquire the new rt_mutex. This leaves a race condition where B
> + *temporarily runs at its original priority.

That doesn't sound like something a well formed (RT) program should do
in the first place.

You could play horrible games with lock ownership, but YUCK!!

> + * 5) rt_mutex does not currently support PI for CFS tasks.

Neither do you.. just inheriting the nice value is not correct for a WFQ
style scheduler.

> + */


What you're proposing is a bunch of make-work-ish hacks on a system that
isn't designed for RT.


Re: [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node.

2017-11-15 Thread Peter Zijlstra
On Thu, Oct 26, 2017 at 04:07:48PM +0200, Martijn Coenen wrote:
> Allows a binder node to specify whether it wants to
> inherit real-time scheduling policy from a caller. This
> inheritance may not always be desirable, for example in
> cases where the binder call runs untrusted and therefore
> potentially unbounded code.

Isn't that backwards? That is, binder should enforce not inheriting
across a trust boundary, not let a node opt out voluntarily.


Re: [PATCH v3 3/6] ANDROID: binder: improve priority inheritance.

2017-11-15 Thread Peter Zijlstra
On Thu, Oct 26, 2017 at 04:07:47PM +0200, Martijn Coenen wrote:
> By raising the priority of a thread selected for
> a transaction *before* we wake it up.
> 
> Delay restoring the priority when doing a reply
> until after we wake-up the process receiving
> the reply.

What happens if a thread dies?


Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

2017-11-16 Thread Peter Zijlstra
On Thu, Nov 16, 2017 at 10:18:00AM +0100, Martijn Coenen wrote:
> Thanks Peter for looking at this, more inline.

So I think I'm having a very hard time understanding things because I've
no idea how android works. I've made a number of assumptions below;
please bear with me.

> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra  wrote:
> >> + * 1) binder supports a "minimum node priority", meaning that all transactions
> >> + *into a node must run at this priority at a minimum. This means that the
> >> + *desired priority for handling a transaction is not necessarily equal to
> >> + *the priority of the caller.
> >
> > Isn't that the same as running everything in that node (whatever that
> > is) at that given prio?
> 
> Not for synchronous transactions; in that case it's max(min_node_prio,
> caller_prio). 

But that's exactly what you get!? How would it ever get below
min_node_prio? PI only ever (temporarily) raises prio, it never lowers
it.

> One reason min_node_prio exists is that for certain
> critical nodes, we don't want calls into it to run at a very low
> priority, because the implementation of said calls may take locks that
> are contended. I realize one possible solution to that is to use PI
> mutexes in userspace, but Android doesn't support that today.

Fix that?

> >> + * 2) binder supports asynchronous transactions, where the caller is not blocked
> >> + *on transaction completion; so, it also can't be blocked on an rt_mutex.
> >
> > This is not a theoretically sound thing to do... For a system to be
> > real-time it not only needs to guarantee forward progress but also have
> > bounded runtime.
> >
> > Elevating another task's priority to your own and then not blocking
> > effectively injects time. At the very least this time should be taking
> > into consideration when doing runtime analysis of the system.
> 
> In case of asynchronous transactions, we never inherit the priority of
> the caller. We run at max(default_prio, min_node_prio). For the
> default node configuration, this means that we don't change priority
> at all for async transactions (they all run at nice 0). But for nodes
> that have a minimum priority set, we want calls into the node to run
> at that priority instead.

I'm confused -- again. Why not run those threads at the min prio to
begin with?

Also, this doesn't seem related to PI.

> > Also, since its async, why do we care? The original task is still making
> > progress. Only at the point where we start to wait for completion does
> > it make sense to boost. Otherwise you end up having to charge one task
> > double time.
> 
> Yeah, so this is actually not about the caller; it's about certain
> critical nodes that need to process events at a high priority. For
> example, consider a node that receives sensor events using async
> transactions.

I'm not understanding; are you saying that some app does an async
request for sensor data, then later this sensor triggers and we need to
route the data back to the app?

But above you said that async requests do not boost the priority; so
then I would expect this sensor's data to land on regular min-prio
thread for dispatch.

I'm unclear of how sensors/devices are hooked up. But I'm assuming you
have a thread that poll()s the device, and this thread is part of a
system service process and that service in turn is a node on the RPC
network.

I would expect the service to have set the thread(s) at the relevant
priority for the device(s) it handles (aka. min prio ?).

> It's imperative that the threads handling these events
> run at real-time priority from the moment they wake up.

Sure, see above, ensure the thread(s) that poll()s and handles the
relevant device(s) are of appropriate priority a priory. Then you don't
need to fix anything in a hurry.

> Binder's thread model has a generic threadpool for the entire process;
> threads aren't dedicated to specific nodes. So to avoid such a thread
> being preempted, we need to raise its priority even before waking it
> up, and the only place to do it is from the caller.

*blink* what? Binder has its own threads? Its not just an RPC/ORB?

> >> + * 3) similarly, there may not necessarily be a thread waiting for
> >> + *transactions at the time the call is made, so we don't know who to proxy-
> >> + *lock the lock for.
> >
> > Non-issue; you have the exact same problem now. Your interface takes a
> > @task argument, so you have to have a @task either way around.
> 
> But in my cur

Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

2017-11-16 Thread Peter Zijlstra
On Thu, Nov 16, 2017 at 02:03:13PM +0100, Martijn Coenen wrote:
> On Thu, Nov 16, 2017 at 12:27 PM, Peter Zijlstra  wrote:

> > But that's exactly what you get!? How would it ever get below
> > min_node_prio? PI only ever (temporarily) raises prio, it never lowers
> > it.
> 
> Ah I guess we have different definitions of inheritance then.

Well, I go by the one described in all the real-time computing texts;
also found on Wikipedia FWIW:

  https://en.wikipedia.org/wiki/Priority_inheritance

> This behavior is also related to binder's threadpool - all those
> threads run at a default prio of 120 (nice 0), and call into the
> kernel to wait for work. If somebody then makes a call with a lower
> prio (>120), we do change the binder thread that handles that call to
> use the same low prio as well. Why run at nice 0 if the caller runs at
> nice 10?

Not immediately saying it doesn't make sense in the sync-rpc scenario;
just saying that calling it PI is terribly confusing.

> Yes, in case of asynchronous transactions it's not PI at all.

It might be good to untangle some of that.

> > I'm not understanding; are you saying that some app does an async
> > request for sensor data, then later this sensor triggers and we need to
> > route the data back to the app?
> 
> The way this typically works is that an interested app creates a
> binder node to receive sensor data on (basically a callback). The app
> then passes this node into the Sensor HAL and says "call me back when
> you have relevant sensor data". When the Sensor HAL has data, it makes
> calls on this node with the relevant data. Because the receiving app
> process has a few generic binder threads that may receive any type of
> data, not just sensor data, we want to raise the priority only when
> handling sensor data, and restore it to the default when the thread is
> done.

I'm still struggling with the fact that _binder_ has threads and not the
apps themselves. Totally weird that.

But yes, that sounds roughly similar. App requests data through
subscription, event happens/gets delivered, app does something.

> > That smells like fail; if you need to raise your prio it means you are
> > low(er) prio and could at that point be subject to inversion.
> 
> True - it does expose a race condition, but it's still better than not
> raising the priority at all. I'm looking into selecting a busy thread
> and boost that. Though in general we don't run into this very often by
> sizing the threadpool large enough.

Right; but real-time systems are about guarantees, not about mostly and
on average.

But yes, you touch upon one way to have mixed priority thread pools. You
need a max prio receiver thread that is basically always runnable; this
is the one thread doing poll(). It is the thread you can immediately
assign your rt_mutex ownership to etc..

This thread then selects a victim thread to handle the received msg. In
case of no idle thread available, it will queue the msg (on priority
obviously) and assign ownership of the rt_mutex to any one random
thread.

This thread will then get boosted, complete the current work and find
the highest prio message to continue with.

And while this sounds pretty straightforward, it gets quite tricky when
you get a second message to queue etc.. IIRC you have to always 'boost' the
lowest priority running thread, such that we end up guaranteeing to run
the N (size of thread pool) highest prio msgs or somesuch.
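
Roughly, the shape described above, as a user-space sketch: assume POSIX
threads, with the receiver created SCHED_FIFO at max prio by whoever sets up
the pool; read_msg()/handle_msg() are made-up stand-ins for the actual
transport, and the rt_mutex hand-off / boosting is deliberately left out, so
this only shows the queueing and wakeup structure:

	#include <pthread.h>
	#include <poll.h>
	#include <stddef.h>

	struct msg {
		int prio;		/* higher number = more important here */
		struct msg *next;
		/* payload elided */
	};

	/* assumed helpers, not real APIs */
	extern struct msg *read_msg(struct pollfd *fds);
	extern void handle_msg(struct msg *m);

	static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t work_avail = PTHREAD_COND_INITIALIZER;
	static struct msg *queue;	/* sorted, highest prio first */

	static void queue_msg(struct msg *m)
	{
		struct msg **p = &queue;

		while (*p && (*p)->prio >= m->prio)
			p = &(*p)->next;
		m->next = *p;
		*p = m;
	}

	/* the one always-runnable max prio thread; does all the poll()ing */
	static void *receiver(void *arg)
	{
		struct pollfd *fds = arg;
		struct msg *m;

		for (;;) {
			poll(fds, 1, -1);
			m = read_msg(fds);

			pthread_mutex_lock(&pool_lock);
			queue_msg(m);			  /* queue on priority */
			pthread_cond_signal(&work_avail); /* kick one worker */
			pthread_mutex_unlock(&pool_lock);
		}
		return NULL;
	}

	/* pool workers; always continue with the highest prio message queued */
	static void *worker(void *arg)
	{
		struct msg *m;

		for (;;) {
			pthread_mutex_lock(&pool_lock);
			while (!queue)
				pthread_cond_wait(&work_avail, &pool_lock);
			m = queue;
			queue = m->next;
			pthread_mutex_unlock(&pool_lock);

			handle_msg(m);
		}
		return NULL;
	}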

Ideally you can use dedicated threads for the RT part of the workload
and avoid all this pain.

> > Regular RPCs are synchronous; if A asks B to do something, A blocks. B
> > can then not turn around and ask A something. That is called a deadlock.
> 
> In case of binder, this actually works, we call it "recursive calls";
> the same thread in A that called into B can receive an incoming call
> from B and handle that. Of course you still need to be careful when
> that happens - eg if you hold a lock when making the first outbound
> call, you better make sure you can't take that same lock on any
> inbound call that can result from it. "Don't hold locks over
> synchronous binder calls" is a well-adapted Android design paradigm.

Sounds really weird to me though; I'd be curious to know why this
'feature' was created.

> >> What parts of the system aren't designed for RT?
> >
> > None of it as far I can tell. The shared thread pools, the quick fix
> > priority, the nested transaction stuff, those all reek of utter fail and
> > confusion.
> 
> Ack. FWIW I don't think it's really fail and confusion - I believe
> these were conscious design choices for binder and at the time I doubt
> that RT was even 

Re: Will the name of hyperv_clocksource_tsc_page or hyperv_clocksource pages change?

2018-06-22 Thread Peter Zijlstra
On Fri, Jun 22, 2018 at 03:17:25AM +, Alma Eyre (Sonata Software North 
America) wrote:
> Hello,
> 
> This is Alma supporting Azure for Japanese customers. I had a question
> from a customer that I could not find the answers for. I saw this
> github(https://github.com/torvalds/linux/commit/88c9281a9fba67636ab26c1fd6afbc78a632374f)
> page, and I was wondering if someone on this list might be able to
> answer the question.
> 
> Will the name of hyperv_clocksource_tsc_page or hyperv_clocksource
> pages change?

https://github.com/torvalds/linux/blob/e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba/Documentation/process/stable-api-nonsense.rst
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/2] cpu-hotplug: export cpu_hotplug_enable/cpu_hotplug_disable

2015-05-19 Thread Peter Zijlstra
On Tue, May 19, 2015 at 02:15:41PM +0200, Vitaly Kuznetsov wrote:
> Loaded Hyper-V module will use these functions to disable CPU hotplug
> under certain circumstances.

What's wrong with get_online_cpus() ?
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/1] cpu-hotplug: export cpu_hotplug_enable/cpu_hotplug_disable

2015-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2015 at 04:57:05PM +, KY Srinivasan wrote:

> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 94bbe46..8f35ee6 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -190,17 +190,19 @@ void cpu_hotplug_done(void)
> >  void cpu_hotplug_disable(void)
> >  {
> > cpu_maps_update_begin();
> > -   cpu_hotplug_disabled = 1;
> > +   cpu_hotplug_disabled++;
> > cpu_maps_update_done();
> >  }
> > +EXPORT_SYMBOL_GPL(cpu_hotplug_disable);
> > 
> >  void cpu_hotplug_enable(void)
> >  {
> > cpu_maps_update_begin();
> > -   cpu_hotplug_disabled = 0;
> > +   if (cpu_hotplug_disabled)
> > +   cpu_hotplug_disabled--;

No that just papers over bugs.

> > cpu_maps_update_done();
> >  }
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/1] cpu-hotplug: export cpu_hotplug_enable/cpu_hotplug_disable

2015-06-24 Thread Peter Zijlstra
On Wed, Jun 24, 2015 at 11:43:33AM +0200, Vitaly Kuznetsov wrote:
> Peter Zijlstra  writes:
> > On Tue, Jun 23, 2015 at 04:57:05PM +, KY Srinivasan wrote:
> >
> >> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> >> > index 94bbe46..8f35ee6 100644
> >> > --- a/kernel/cpu.c
> >> > +++ b/kernel/cpu.c
> >> > @@ -190,17 +190,19 @@ void cpu_hotplug_done(void)
> >> >  void cpu_hotplug_disable(void)
> >> >  {
> >> >  cpu_maps_update_begin();
> >> > -cpu_hotplug_disabled = 1;
> >> > +cpu_hotplug_disabled++;
> >> >  cpu_maps_update_done();
> >> >  }
> >> > +EXPORT_SYMBOL_GPL(cpu_hotplug_disable);
> >> > 
> >> >  void cpu_hotplug_enable(void)
> >> >  {
> >> >  cpu_maps_update_begin();
> >> > -cpu_hotplug_disabled = 0;
> >> > +if (cpu_hotplug_disabled)
> >> > +cpu_hotplug_disabled--;
> >
> > No that just papers over bugs.
> 
> Yes, but these bugs are not in Linux. I don't see any other way for a
> platform to enable/disable cpu hotplug in runtime.

Any code using this had better be open source, if not I really don't
care anyway.

That really ought to be:

WARN_ON(--cpu_hotplug_disabled < 0);

You do not make underflows go away.
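
I.e. something along these lines; the names mirror kernel/cpu.c, but take it
as an illustration of the counting discipline, not a tested patch:

	void cpu_hotplug_disable(void)
	{
		cpu_maps_update_begin();
		cpu_hotplug_disabled++;
		cpu_maps_update_done();
	}
	EXPORT_SYMBOL_GPL(cpu_hotplug_disable);

	void cpu_hotplug_enable(void)
	{
		cpu_maps_update_begin();
		/* an underflow is a caller bug; warn instead of hiding it */
		WARN_ON(--cpu_hotplug_disabled < 0);
		cpu_maps_update_done();
	}
	EXPORT_SYMBOL_GPL(cpu_hotplug_enable);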
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-10 Thread Peter Zijlstra
On Sat, Sep 10, 2016 at 09:16:59AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 08, 2016 at 09:12:50AM -0700, Todd Kjos wrote:
> > In Android systems, the display pipeline relies on low
> > latency binder transactions and is therefore sensitive to
> > delays caused by contention for the global binder lock.
> > Jank is significantly reduced by disabling preemption
> > while the global binder lock is held.
> 
> That's now how preempt_disable is supposed to use.  It is for critical

not, that's supposed to be _not_. Just to be absolutely clear, this is
NOT how you're supposed to use preempt_disable().

> sections that use per-cpu or similar resources.
> 
> > 
> > Originally-from: Riley Andrews 
> > Signed-off-by: Todd Kjos 

> > @@ -389,7 +390,11 @@ static int task_get_unused_fd_flags(struct
> > binder_proc *proc, int flags)
> >   rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE);
> >   unlock_task_sighand(proc->tsk, &irqs);
> > 
> > - return __alloc_fd(files, 0, rlim_cur, flags);
> > + preempt_enable_no_resched();
> > + ret = __alloc_fd(files, 0, rlim_cur, flags);
> > + preempt_disable();

And the fact that people want to use preempt_enable_no_resched() shows
that they're absolutely clueless.

This is so broken it's not funny.

NAK NAK NAK


___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-13 Thread Peter Zijlstra
On Mon, Sep 12, 2016 at 08:44:09PM -0700, Arve Hjønnevåg wrote:

> A previous attempt to fix this problem, changed the lock to use
> rt_mutex instead of mutex, but this apparently did not work as well as
> this patch. I believe the added overhead was noticeable, and it did
> not work when the preempted thread was in a different cgroup (I don't
> know if this is still the case).

Do you actually have RR/FIFO/DL tasks? Currently PI isn't
defined/implemented for OTHER.

cgroups should be irrelevant, PI is unaware of them.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Tue, Sep 13, 2016 at 12:53:27PM -0700, Arve Hjønnevåg wrote:
> On Tue, Sep 13, 2016 at 12:32 AM, Peter Zijlstra  wrote:
> > On Mon, Sep 12, 2016 at 08:44:09PM -0700, Arve Hjønnevåg wrote:
> >
> >> A previous attempt to fix this problem, changed the lock to use
> >> rt_mutex instead of mutex, but this apparently did not work as well as
> >> this patch. I believe the added overhead was noticeable, and it did
> >> not work when the preempted thread was in a different cgroup (I don't
> >> know if this is still the case).
> >
> > Do you actually have RR/FIFO/DL tasks? Currently PI isn't
> > defined/implemented for OTHER.
> >
> 
> Most of the tasks here are not RR/FIFO/DL tasks. I don't see anything
> in the rtmutex code or documentation that indicates that they don't
> work for normal tasks. From what I can tell the priority gets boosted
> in every case. This may not work as well for CFS tasks as for realtime
> tasks, but it should at least help when there is a large priority
> difference.

It does something (it used to explicitly ignore OTHER) but it's not
something well defined or usable.

> > cgroups should be irrelevant, PI is unaware of them.
> 
> I don't think cgroups are irrelevant. PI being unaware of them
> explains the problem I described. If the task that holds the lock is
> in a cgroup that has a low cpu.shares value, then boosting the task's
> priority within that group does not necessarily make it any more likely to
> run.

See, the problem is that 'priority' is a meaningless concept for
OTHER/CFS.

In any case, typically only RT tasks care about PI, and the traditional
Priority Inheritance algorithm only really works correctly for FIFO. As
is RR has issues and DL is a crude hack, CFS is really just an accident
by not explicitly exempting it.

We could define a meaningful something for CFS and implement that, but
it isn't currently done.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2016 at 09:10:01AM +0200, Peter Zijlstra wrote:
> We could define a meaningful something for CFS and implement that, but
> it isn't currently done.

So the generalization of the Priority Inheritance Protocol is Proxy
Execution Protocol, which basically lets the boosted task run _as_ the
task on the block chain as picked by the schedule function (treating all
of them as runnable). Where 'as' means it really consumes scheduling
resources of the (blocked) donor task.

Since the scheduling function for FIFO is: pick the highest prio one and
go for it, this trivially reduces to PI for FIFO.

Now, Proxy Execution Protocol is 'trivial' to implement on UP, but has a
few wobbles on SMP.

But we can use it to define a sensible definition for a WFQ class
scheduler (like CFS). For these the scheduling function basically runs
the boosted task as every donor task on the block chain gets their slice.
Alternatively, since it treats all 'blocked' tasks as runnable, the
total weight of the boosted task is its own weight plus the sum of
weight on the block chain.

Which is something that shouldn't be too hard to implement, but very
much isn't what happens today.
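
In pseudo-code, the weight rule for the boosted (proxy) task would be
something like the below; task_weight() and the block chain iterator are
made-up helpers, nothing of the sort exists today:

	/* illustrative only: none of these helpers exist in the current kernel */
	static unsigned long proxy_weight(struct task_struct *owner)
	{
		unsigned long weight = task_weight(owner);	/* its own weight */
		struct task_struct *donor;

		/* every blocked donor on the block chain lends its weight */
		for_each_blocked_donor(owner, donor)
			weight += task_weight(donor);

		return weight;
	}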
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2016 at 09:10:01AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 13, 2016 at 12:53:27PM -0700, Arve Hjønnevåg wrote:

> > Most of the tasks here are not RR/FIFO/DL tasks. I don't see anything
> > in the rtmutex code or documentation that indicates that they don't
> > work for normal tasks. From what I can tell the priority gets boosted
> > in every case. This may not work as well for CFS tasks as for realtime
> > tasks, but it should at least help when there is a large priority
> > difference.
> 
> It does something (it used to explicitly ignore OTHER) but it's not
> something well defined or usable.

I looked again, and while it updates the ->prio field for OTHER tasks,
that does not seem to cause a change to the actual weight field (which
is derived from ->static_prio).

So it really should not do anything.. as I remembered it.
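
For reference, set_load_weight() (abridged) is essentially:

	static void set_load_weight(struct task_struct *p)
	{
		int prio = p->static_prio - MAX_RT_PRIO;
		struct load_weight *load = &p->se.load;

		/* only ->static_prio is consulted; ->prio never enters into it */
		load->weight = scale_load(sched_prio_to_weight[prio]);
		load->inv_weight = sched_prio_to_wmult[prio];
	}

so whatever rtmutex writes into ->prio for an OTHER task never reaches the
weight CFS actually schedules by.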
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Tue, Sep 13, 2016 at 12:53:27PM -0700, Arve Hjønnevåg wrote:
> On Tue, Sep 13, 2016 at 12:32 AM, Peter Zijlstra  wrote:
> > cgroups should be irrelevant, PI is unaware of them.
> 
> I don't think cgroups are irrelevant. PI being unaware of them
> explains the problem I described. If the task that holds the lock is
> in a cgroup that has a low cpu.shares value, then boosting the task's
> priority within that group does not necessarily make it any more likely to
> run.

Thing is, for FIFO/DL the important parameters (prio and deadline resp.)
are not cgroup dependent.

For CFS you're right, and as per usual, cgroups will be a royal pain.
While we can compute the total weight in the block chain, getting that
back to a task which is stuck in a cgroup is just not going to work
well.

The only 'solution' I can come up with in a hurry is, when the task is
boosted, move it to the root cgroup. That of course has a whole heap of
problems all on its own.

/me curses @ cgroups.. bloody stupid things.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2016 at 06:11:03PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 13, 2016 at 12:53:27PM -0700, Arve Hjønnevåg wrote:
> > On Tue, Sep 13, 2016 at 12:32 AM, Peter Zijlstra  
> > wrote:
> > > cgroups should be irrelevant, PI is unaware of them.
> > 
> > I don't think cgroups are irrelevant. PI being unaware of them
> > explains the problem I described. If the task that holds the lock is
> > in a cgroup that has a low cpu.shares value, then boosting the task's
> > priority within that group does not necessarily make it any more likely to
> > run.
> 
> Thing is, for FIFO/DL the important parameters (prio and deadline resp.)
> are not cgroup dependent.
> 
> For CFS you're right, and as per usual, cgroups will be a royal pain.
> While we can compute the total weight in the block chain, getting that
> back to a task which is stuck in a cgroup is just not going to work
> well.

Not to mention the fact that the weight depends on the current running
state. Having those tasks block insta changes the actual weight.

> /me curses @ cgroups.. bloody stupid things.

More of that.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

2016-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2016 at 06:13:40PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 14, 2016 at 06:11:03PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 13, 2016 at 12:53:27PM -0700, Arve Hjønnevåg wrote:
> > > On Tue, Sep 13, 2016 at 12:32 AM, Peter Zijlstra  
> > > wrote:
> > > > cgroups should be irrelevant, PI is unaware of them.
> > > 
> > > I don't think cgroups are irrelevant. PI being unaware of them
> > > explains the problem I described. If the task that holds the lock is
> > > in a cgroup that has a low cpu.shares value, then boosting the task's
> > > priority within that group does not necessarily make it any more likely to
> > > run.
> > 
> > Thing is, for FIFO/DL the important parameters (prio and deadline resp.)
> > are not cgroup dependent.
> > 
> > For CFS you're right, and as per usual, cgroups will be a royal pain.
> > While we can compute the total weight in the block chain, getting that
> > back to a task which is stuck in a cgroup is just not going to work
> > well.
> 
> Not to mention the fact that the weight depends on the current running
> state. Having those tasks block insta changes the actual weight.
> 
> > /me curses @ cgroups.. bloody stupid things.
> 
> More of that.


Something a little like so... completely untested, and has a fair number
of corner cases still left open I think..


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,8 +1691,8 @@ struct task_struct {
struct seccomp seccomp;
 
 /* Thread group tracking */
-   u32 parent_exec_id;
-   u32 self_exec_id;
+   u32 parent_exec_id;
+   u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
spinlock_t alloc_lock;
@@ -1710,6 +1710,11 @@ struct task_struct {
struct task_struct *pi_top_task;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+   unsigned long pi_weight;
+#ifdef CONFIG_CGROUP_SCHED
+   struct task_group *orig_task_group;
+   unsigned long normal_weight;
+#endif
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -1977,6 +1982,8 @@ static inline int tsk_nr_cpus_allowed(st
return p->nr_cpus_allowed;
 }
 
+extern unsigned long task_pi_weight(struct task_struct *p);
+
 #define TNF_MIGRATED   0x01
 #define TNF_NO_GROUP   0x02
 #define TNF_SHARED 0x04
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -20,6 +20,19 @@
 #include "rtmutex_common.h"
 
 /*
+ * !(rt_prio || dl_prio)
+ */
+static inline bool other_prio(int prio)
+{
+   return prio >= MAX_RT_PRIO;
+}
+
+static inline bool other_task(struct task_struct *task)
+{
+   return other_prio(task->prio);
+}
+
+/*
  * lock->owner state tracking:
  *
  * lock->owner holds the task_struct pointer of the owner. Bit 0
@@ -226,6 +239,11 @@ rt_mutex_enqueue(struct rt_mutex *lock,
 
rb_link_node(&waiter->tree_entry, parent, link);
rb_insert_color(&waiter->tree_entry, &lock->waiters);
+   /*
+* We want the lock->waiters_weight to reflect the sum of all queued
+* waiters.
+*/
+   lock->waiters_weight += waiter->weight;
 }
 
 static void
@@ -239,6 +257,7 @@ rt_mutex_dequeue(struct rt_mutex *lock,
 
rb_erase(&waiter->tree_entry, &lock->waiters);
RB_CLEAR_NODE(&waiter->tree_entry);
+   lock->waiters_weight -= waiter->weight;
 }
 
 static void
@@ -265,6 +284,18 @@ rt_mutex_enqueue_pi(struct task_struct *
 
rb_link_node(&waiter->pi_tree_entry, parent, link);
rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+
+   /*
+* Since a task can own multiple locks, and we have one pi_waiter
+* for every lock, propagate the lock->waiters_weight, which is the sum
+* of all weights queued on that lock, into the top waiter, and add
+* that to the owning task's pi_weight.
+*
+* This results in pi_weight being the sum of all blocked waiters
+* on this task.
+*/
+   waiter->top_weight = waiter->lock->waiters_weight;
+   task->pi_weight += waiter->top_weight;
 }
 
 static void
@@ -278,6 +309,7 @@ rt_mutex_dequeue_pi(struct task_struct *
 
rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
RB_CLEAR_NODE(&waiter->pi_tree_entry);
+   task->pi_weight -= waiter->top_weight;
 }
 
 static void rt_mutex_adjust_prio(struct task_struct *p)
@@ -497,7 +529,7 @@ static int rt_mutex_adjust_prio_chain(st
 * detection is enabled we continue, but stop the
 * requeueing in the chain walk.
 */
-   if (top_waiter != task_top_pi_waiter(task))

Re: [PATCH v2 03/13] ANDROID: binder: add support for RT prio inheritance.

2017-08-31 Thread Peter Zijlstra
On Thu, Aug 31, 2017 at 10:04:20AM +0200, Martijn Coenen wrote:
> Adds support for SCHED_BATCH/SCHED_FIFO/SCHED_RR priority inheritance
> to the binder driver. The desired behavior is as follows:

You fail to support SCHED_DEADLINE, that's not optional.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 03/13] ANDROID: binder: add support for RT prio inheritance.

2017-08-31 Thread Peter Zijlstra
On Thu, Aug 31, 2017 at 10:27:58AM +0200, Martijn Coenen wrote:
> On Thu, Aug 31, 2017 at 10:18 AM, Peter Zijlstra  wrote:
> > You fail to support SCHED_DEADLINE, that's not optional.
> 
> The reason I didn't include it is that we don't use SCHED_DEADLINE in
> Android userspace. 

AFAIK people are actively working on fixing that.

I still have to look at the actual patches, but it all sounds very
dodgy. Probably won't have time until after LPC.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 03/13] ANDROID: binder: add support for RT prio inheritance.

2017-08-31 Thread Peter Zijlstra
On Thu, Aug 31, 2017 at 02:00:54PM +0200, Martijn Coenen wrote:
> Appreciate if you could take a look! In particular I'm curious why
> this approach is considered "dodgy", or "engineered sideways" as
> Thomas pointed out earlier. 

The fact that you could not reuse rt_mutex makes me very nervous. Other
than that I cannot say anything because I've not actually looked at any
of the code yet.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 0/3] Modernize tasklet callback API

2020-07-16 Thread Peter Zijlstra
On Wed, Jul 15, 2020 at 08:08:44PM -0700, Kees Cook wrote:
> Hi,
> 
> This is the infrastructure changes to prepare the tasklet API for
> conversion to passing the tasklet struct as the callback argument instead
> of an arbitrary unsigned long. The first patch details why this is useful
> (it's the same rationale as the timer_struct changes from a bit ago:
> less abuse during memory corruption attacks, more in line with existing
> ways of doing things in the kernel, save a little space in struct,
> etc). Notably, the existing tasklet API use is much less messy, so there
> is less to clean up.

I would _MUCH_ rather see tasklets go the way of the dodo, esp. given
that:

>  drivers/input/keyboard/omap-keypad.c   |  2 +-
>  drivers/input/serio/hil_mlc.c  |  2 +-
>  drivers/net/wan/farsync.c  |  4 +--
>  drivers/s390/crypto/ap_bus.c   |  2 +-
>  drivers/staging/most/dim2/dim2.c   |  2 +-
>  drivers/staging/octeon/ethernet-tx.c   |  2 +-
>  drivers/tty/vt/keyboard.c  |  2 +-
>  drivers/usb/gadget/udc/snps_udc_core.c |  6 ++---
>  drivers/usb/host/fhci-sched.c  |  2 +-
>  include/linux/interrupt.h  | 37 ++
>  kernel/backtracetest.c |  2 +-
>  kernel/debug/debug_core.c  |  2 +-
>  kernel/irq/resend.c|  2 +-
>  kernel/softirq.c   | 18 -
>  net/atm/pppoatm.c  |  2 +-
>  net/iucv/iucv.c|  2 +-
>  sound/drivers/pcsp/pcsp_lib.c  |  2 +-
>  17 files changed, 66 insertions(+), 25 deletions(-)

there appear to be hardly any users left.. Can't we stage an extinction
event here instead?
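
(For reference, the conversion the cover letter is preparing for looks
roughly like the below, assuming a driver-private struct foo with an
embedded tasklet member; names are illustrative:

	#include <linux/interrupt.h>

	struct foo {
		struct tasklet_struct tasklet;
		/* ... driver state ... */
	};

	/* old style: the callback gets an opaque unsigned long cookie
	 *
	 *	static void foo_do_work(unsigned long data)
	 *	{
	 *		struct foo *foo = (struct foo *)data;
	 *		...
	 *	}
	 *	tasklet_init(&foo->tasklet, foo_do_work, (unsigned long)foo);
	 */

	/* new style: the callback gets the tasklet_struct itself */
	static void foo_do_work(struct tasklet_struct *t)
	{
		struct foo *foo = from_tasklet(foo, t, tasklet);

		/* ... do the work using foo ... */
	}

	static void foo_init(struct foo *foo)
	{
		tasklet_setup(&foo->tasklet, foo_do_work);
	}

Still tasklets, though.)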
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel