from:"Xunlei Pang"

Re: [PATCH v4] sched/deadline: remove useless param from setup_new_dl_entity

2016-07-21 Thread Xunlei Pang

On 2016/07/21 at 22:46, Juri Lelli wrote:
> On 21/07/16 15:36, Juri Lelli wrote:
>> On 21/07/16 15:21, Juri Lelli wrote:
>>> Hi,
>>>
>>> On 18/07/16 21:37, Xunlei Pang wrote:
>>>> On 2016/07/18 at 21:04, Juri Lelli wrote:
>>>>> On 15/07/16 18:39, Xunlei Pang wrote:
>>>>>> On 2016/07/13 at 18:58, Juri Lelli wrote:
>>>>> [...]
>>>>>
>>>>>> Since this is only called for queued cases now, there is no need to
>>>>>> check boosted stuff here. As enqueue_task(ENQUEUE_REPLENISH)
>>>>>> is called before check_class_changed() in rt_mutex_setprio().
>>>>>>
>>>>> But we don't do the same in setscheduler, right?
>>>> If p is deadline PI-boosted, setscheduler() won't call change its 
>>>> sched_class.
>>>> If p isn't deadline PI-boosted, then pi_task is NULL.
>>>>
>>>> So, I think the added code won't hit. Did I miss something?
>>>>
>>> No, I think you are right.
>>>
>> Oh, and we need to filter the call after rt_mutex_setprio has
>> already issued a replenishment.
>>
> Does something like this on top of v4 make sense?
>
> ---
>  kernel/sched/deadline.c | 22 +++---
>  1 file changed, 7 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index dc56f5be0112..6f05ac78711c 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -350,9 +350,8 @@ static inline void setup_new_dl_entity(struct 
> sched_dl_entity *dl_se)
>  {
>   struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>   struct rq *rq = rq_of_dl_rq(dl_rq);
> - struct task_struct *pi_task;
> - struct sched_dl_entity *pi_se = dl_se;
>  
> + WARN_ON(dl_se->dl_boosted);
>   WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
>  
>   /*
> @@ -364,21 +363,12 @@ static inline void setup_new_dl_entity(struct 
> sched_dl_entity *dl_se)
>   return;
>  
>   /*
> -  * Use the scheduling parameters of the top pi-waiter task,
> -  * if we have one from which we can inherit a deadline.
> -  */
> - if (dl_se->dl_boosted &&
> - (pi_task = rt_mutex_get_top_task(dl_task_of(dl_se))) &&
> - dl_prio(pi_task->normal_prio))
> - pi_se = &pi_task->dl;
> -
> - /*
>* We use the regular wall clock time to set deadlines in the
>* future; in fact, we must consider execution overheads (time
>* spent on hardirq context, etc.).
>*/
> - dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> - dl_se->runtime = pi_se->dl_runtime;
> + dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> + dl_se->runtime = dl_se->dl_runtime;
>  }
>  
>  /*
> @@ -1734,9 +1724,11 @@ static void switched_to_dl(struct rq *rq, struct 
> task_struct *p)
>   if (task_on_rq_queued(p)) {
>   /*
>* If p is not queued we will update its parameters at next
> -  * wakeup.
> +  * wakeup. If p is dl_boosted we already updated its params in
> +  * rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH).
>*/
> - if (dl_time_before(p->dl.deadline, rq_clock(rq)))
> + if (dl_time_before(p->dl.deadline, rq_clock(rq)) &&
> + !p->dl.dl_boosted)

Hi Juri,

It looks good to me, only one question:
For on_rq boosted to deadline, p->dl.deadline has been updated after 
rq_lock(rq) by
rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH) and no rq clock update
afterwards, so dl_time_before() will be false, seems p->dl.dl_boosted check is 
needless.

Regards,
Xunlei

>   setup_new_dl_entity(&p->dl);
>  
>   if (rq->curr != p) {

Re: [V4 PATCH 1/2] x86/panic: Replace smp_send_stop() with kdump friendly version in panic path

2016-09-20 Thread Xunlei Pang

On 2016/08/15/ at 19:22, Hidehiro Kawai wrote:
> Hi Dave,
>
> Thank you for the review.
>
>> From: Dave Young [mailto:dyo...@redhat.com]
>> Sent: Friday, August 12, 2016 12:17 PM
>>
>> Thanks for the update.
>> On 08/10/16 at 05:09pm, Hidehiro Kawai wrote:
>>> Daniel Walker reported problems which happens when
>>> crash_kexec_post_notifiers kernel option is enabled
>>> (https://lkml.org/lkml/2015/6/24/44).
>>>
>>> In that case, smp_send_stop() is called before entering kdump routines
>>> which assume other CPUs are still online.  As the result, for x86,
>>> kdump routines fail to save other CPUs' registers  and disable
>>> virtualization extensions.
>> Seems you simplified the changelog, but I think a little more details
>> will be helpful to understand the patch. You know sometimes lkml.org
>> does not work well.
> So, I'll try another archives when I post patch set next time.

Hi Hidehiro Kawai,

What's the status of this patch set, are you going to send an updated version?

Regards,
Xunlei

>>> To fix this problem, call a new kdump friendly function,
>>> crash_smp_send_stop(), instead of the smp_send_stop() when
>>> crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a
>>> weak function, and it just call smp_send_stop().  Architecture
>>> codes should override it so that kdump can work appropriately.
>>> This patch only provides x86-specific version.
>>>
>>> For Xen's PV kernel, just keep the current behavior.
>> Could you explain a bit about above Xen PV kernel behavior?
>>
>> BTW, this version looks better,  I think I'm fine with this version
>> besides of the questions about changelog.
> As for Dom0 kernel, it doesn't use crash_kexec routines, and
> it relies on panic notifier chain.  At the end of the chain,
> xen_panic_event is called, and it issues a hypercall which
> requests Hypervisor to execute kdump.  This means whether
> crash_kexec_panic_notifiers is set or not, panic notifiers
> are called after smp_send_stop.  Even if we save registers
> in Dom0 kernel, they seem to be ignored (Hypervisor is responsible
> for that).  This is why I kept the current behavior for Xen.
>
> For PV DomU kernel, kdump is not supported.  For PV HVM
> DomU, I'm not sure what will happen on panic because I
> couldn't boot PV HVM DomU and test it.  But I think it will
> work similarly to baremetal kernels with extra cleanups
> for Hypervisor.
>
> Best regards,
>
> Hidehiro Kawai
>
>>> Changes in V4:
>>> - Keep to use smp_send_stop if crash_kexec_post_notifiers is not set
>>> - Rename panic_smp_send_stop to crash_smp_send_stop
>>> - Don't change the behavior for Xen's PV kernel
>>>
>>> Changes in V3:
>>> - Revise comments, description, and symbol names
>>>
>>> Changes in V2:
>>> - Replace smp_send_stop() call with crash_kexec version which
>>>   saves cpu states and cleans up VMX/SVM
>>> - Drop a fix for Problem 1 at this moment
>>>
>>> Reported-by: Daniel Walker 
>>> Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" 
>>> option)
>>> Signed-off-by: Hidehiro Kawai 
>>> Cc: Dave Young 
>>> Cc: Baoquan He 
>>> Cc: Vivek Goyal 
>>> Cc: Eric Biederman 
>>> Cc: Masami Hiramatsu 
>>> Cc: Daniel Walker 
>>> Cc: Xunlei Pang 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: "H. Peter Anvin" 
>>> Cc: Borislav Petkov 
>>> Cc: David Vrabel 
>>> Cc: Toshi Kani 
>>> Cc: Andrew Morton 
>>> ---
>>>  arch/x86/include/asm/kexec.h |1 +
>>>  arch/x86/include/asm/smp.h   |1 +
>>>  arch/x86/kernel/crash.c  |   22 +---
>>>  arch/x86/kernel/smp.c|5 
>>>  kernel/panic.c   |   47 
>>> --
>>>  5 files changed, 66 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>>> index d2434c1..282630e 100644
>>> --- a/arch/x86/include/asm/kexec.h
>>> +++ b/arch/x86/include/asm/kexec.h
>>> @@ -210,6 +210,7 @@ struct kexec_entry64_regs {
>>>
>>>  typedef void crash_vmclear_fn(void);
>>>  extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
>>> +extern void kdump_nmi_shootdown_cpus(void);
>>>
>>>  #endif /* __ASSEMBLY__ */
>>>
>>&g

Re: [PATCH v5] sched/deadline: remove useless param from setup_new_dl_entity

2016-08-05 Thread Xunlei Pang

On 2016/08/05 at 18:09, Juri Lelli wrote:
> setup_new_dl_entity() takes two parameters, but it only actually uses
> one of them, under a different name, to setup a new dl_entity, after:
>
>  2f9f3fdc928 "sched/deadline: Remove dl_new from struct sched_dl_entity"
>
> as we currently do
>
>  setup_new_dl_entity(&p->dl, &p->dl)
>
> However, before Luca's change we were doing
>
>  setup_new_dl_entity(dl_se, pi_se)
>
> in update_dl_entity() for a dl_se->new entity: we were using pi_se's
> parameters (the potential PI donor) for setting up a new entity.
>
> This change removes the useless second parameter of setup_new_dl_entity().
> While we are at it we also optimize things further calling setup_new_dl_
> entity() only for already queued tasks, since (as pointed out by Xunlei)
> we already do the very same update at tasks wakeup time anyway. By doing
> so, we don't need to worry about a potential PI donor anymore, as rt_
> mutex_setprio() takes care of that already for us.
>
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Steven Rostedt 
> Cc: Luca Abeni 
> Cc: Xunlei Pang 
> Signed-off-by: Juri Lelli 

Reviewed-by: Xunlei Pang 


>
> ---
>  Changes from v4:
>- remove dead code checking for boosted status, as never triggered
>  after rt_mutex_setprio()
>
>  Changes from v3:
>- call setup_new_dl_entity only if task is enqueued already
>
>  Changes from v2:
>- optimize by calling rt_mutex_get_top_task only if the task is
>  boosted (as suggested by Steve)
>
>  Changes from v1:
>- Steve pointed out that we were actually using the second parameter
>  to permorm initialization
>- Luca confirmed that behavior is slightly changed w.r.t. before his
>  change
>- changelog updated and original behavior restored
> ---
>  kernel/sched/deadline.c | 35 ++-
>  1 file changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fcb7f0217ff4..9491dbe039e8 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -346,12 +346,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct 
> task_struct *p,
>   * one, and to (try to!) reconcile itself with its own scheduling
>   * parameters.
>   */
> -static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
> -struct sched_dl_entity *pi_se)
> +static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>  {
>   struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>   struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> + WARN_ON(dl_se->dl_boosted);
>   WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
>  
>   /*
> @@ -367,8 +367,8 @@ static inline void setup_new_dl_entity(struct 
> sched_dl_entity *dl_se,
>* future; in fact, we must consider execution overheads (time
>* spent on hardirq context, etc.).
>*/
> - dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> - dl_se->runtime = pi_se->dl_runtime;
> + dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> + dl_se->runtime = dl_se->dl_runtime;
>  }
>  
>  /*
> @@ -1720,19 +1720,28 @@ static void switched_from_dl(struct rq *rq, struct 
> task_struct *p)
>   */
>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  {
> - if (dl_time_before(p->dl.deadline, rq_clock(rq)))
> - setup_new_dl_entity(&p->dl, &p->dl);
>  
> - if (task_on_rq_queued(p) && rq->curr != p) {
> + if (task_on_rq_queued(p)) {
> + /*
> +  * If p is not queued we will update its parameters at next
> +  * wakeup. If p is dl_boosted we already updated its params in
> +  * rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH),
> +  * p's deadline being now already after rq_clock(rq).
> +  */
> + if (dl_time_before(p->dl.deadline, rq_clock(rq)))
> + setup_new_dl_entity(&p->dl);
> +
> + if (rq->curr != p) {
>  #ifdef CONFIG_SMP
> - if (tsk_nr_cpus_allowed(p) > 1 && rq->dl.overloaded)
> - queue_push_tasks(rq);
> + if (tsk_nr_cpus_allowed(p) > 1 && rq->dl.overloaded)
> + queue_push_tasks(rq);
>  #else
> - if (dl_task(rq->curr))
> - check_preempt_curr_dl(rq, p, 0);
> - else
> - resched_curr(rq);
> + if (dl_task(rq->curr))
> + check_preempt_curr_dl(rq, p, 0);
> + else
> + resched_curr(rq);
>  #endif
> + }
>   }
>  }
>

Re: [PATCH v2 1/2] kexec: Introduce "/sys/kernel/kexec_crash_low_size"

2016-08-24 Thread Xunlei Pang

On 2016/08/24 at 16:20, Dave Young wrote:
> On 08/23/16 at 06:11pm, Yinghai Lu wrote:
>> On Wed, Aug 17, 2016 at 1:20 AM, Dave Young  wrote:
>>> On 08/17/16 at 09:50am, Xunlei Pang wrote:
>>>> "/sys/kernel/kexec_crash_size" only handles crashk_res, it
>>>> is fine in most cases, but sometimes we have crashk_low_res.
>>>> For example, when "crashkernel=size[KMG],high" combined with
>>>> "crashkernel=size[KMG],low" is used for 64-bit x86.
>>>>
>>>> Like crashk_res, we introduce the corresponding sysfs file
>>>> "/sys/kernel/kexec_crash_low_size" for crashk_low_res.
>>>>
>>>> So, the exact total reserved memory is the sum of the two.
>>>>
>>>> crashk_low_res can also be shrunk via this new interface,
>>>> and users should be aware of what they are doing.
>> ...
>>>> @@ -218,6 +238,7 @@ static struct attribute * kernel_attrs[] = {
>>>>  #ifdef CONFIG_KEXEC_CORE
>>>>   &kexec_loaded_attr.attr,
>>>>   &kexec_crash_loaded_attr.attr,
>>>> + &kexec_crash_low_size_attr.attr,
>>>>   &kexec_crash_size_attr.attr,
>>>>   &vmcoreinfo_attr.attr,
>>>>  #endif
>> would be better if you can use attribute_group .is_visible to control 
>> showing of
>> crash_low_size only when the crash_base is above 4G.
> I have same feeling that it looks odd to show low in sysfs in case no
> crashkernel=,high being used. Even if crashkernel=,high is used only in
> x86 the resource crashk_low is in common code. What do you think to move
> it to x86?

If want to put some restriction on it, I'd prefer to move crashk_low to arch 
x86, to make
it x86-specific.

We can show the interface unconditionally. If it isn't used, its size is 0, it 
doesn't matter.

Regards,
Xunlei

>
> Thanks
> Dave
>
>> Thanks
>>
>> Yinghai
>>
>> ___
>> kexec mailing list
>> ke...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-22 Thread Xunlei Pang

On 06/22/2017 at 01:44 AM, Michael Holzheu wrote:
> Am Fri,  9 Jun 2017 10:17:05 +0800
> schrieb Xunlei Pang :
>
>> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
>> is now defined as follows:
>> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
>> It was changed by the CONFIG_CRASH_CORE feature.
>>
>> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
>> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
>>
>> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related 
>> code under CONFIG_CRASH_CORE")
>> Cc: Dave Young 
>> Cc: Dave Anderson 
>> Cc: Hari Bathini 
>> Cc: Gustavo Luiz Duarte 
>> Signed-off-by: Xunlei Pang 
> Hello Xunlei,
>
> As you already know on s390 we create the ELF header in the new kernel.
> Therefore we don't use the per-cpu buffers for ELF notes to store
> the register state.
>
> For RHEL7 we still store the registers in machine_kexec.c:add_elf_notes().
> Though we also use the ELF header from new kernel ...
>
> We assume your original problem with the "kmem -s" failure
> was caused by the memory overwrite due to the invalid size of the
> "crash_notes" per-cpu buffers.
>
> Therefore your patch looks good for RHEL7 but for upstream we propose the
> patch below.

Hi Michael,

Yes, we already did this way.
Thanks for the confirmation, the patch below looks good to me.

Regards,
Xunlei

> ---
> [PATCH] s390/crash: Remove unused KEXEC_NOTE_BYTES
>
> After commmit 692f66f26a4c19 ("crash: move crashkernel parsing and vmcore
> related code under CONFIG_CRASH_CORE") the KEXEC_NOTE_BYTES macro is not
> used anymore and for s390 we create the ELF header in the new kernel
> anyway. Therefore remove the macro.
>
> Reported-by: Xunlei Pang 
> Reviewed-by: Mikhail Zaslonko 
> Signed-off-by: Michael Holzheu 
> ---
>  arch/s390/include/asm/kexec.h | 18 --
>  include/linux/crash_core.h|  5 +
>  include/linux/kexec.h |  9 -
>  3 files changed, 5 insertions(+), 27 deletions(-)
>
> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> index 2f924bc30e35..dccf24ee26d3 100644
> --- a/arch/s390/include/asm/kexec.h
> +++ b/arch/s390/include/asm/kexec.h
> @@ -41,24 +41,6 @@
>  /* The native architecture */
>  #define KEXEC_ARCH KEXEC_ARCH_S390
>  
> -/*
> - * Size for s390x ELF notes per CPU
> - *
> - * Seven notes plus zero note at the end: prstatus, fpregset, timer,
> - * tod_cmp, tod_reg, control regs, and prefix
> - */
> -#define KEXEC_NOTE_BYTES \
> - (ALIGN(sizeof(struct elf_note), 4) * 8 + \
> -  ALIGN(sizeof("CORE"), 4) * 7 + \
> -  ALIGN(sizeof(struct elf_prstatus), 4) + \
> -  ALIGN(sizeof(elf_fpregset_t), 4) + \
> -  ALIGN(sizeof(u64), 4) + \
> -  ALIGN(sizeof(u64), 4) + \
> -  ALIGN(sizeof(u32), 4) + \
> -  ALIGN(sizeof(u64) * 16, 4) + \
> -  ALIGN(sizeof(u32), 4) \
> - )
> -
>  /* Provide a dummy definition to avoid build failures. */
>  static inline void crash_setup_regs(struct pt_regs *newregs,
>   struct pt_regs *oldregs) { }
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 541a197ba4a2..4090a42578a8 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -10,6 +10,11 @@
>  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
>  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
>  
> +/*
> + * The per-cpu notes area is a list of notes terminated by a "NULL"
> + * note header.  For kdump, the code in vmcore.c runs in the context
> + * of the second kernel to combine them into one note.
> + */
>  #define CRASH_CORE_NOTE_BYTES   ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  
> \
>CRASH_CORE_NOTE_NAME_BYTES +   \
>CRASH_CORE_NOTE_DESC_BYTES)
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index c9481ebcbc0c..65888418fb69 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -63,15 +63,6 @@
>  #define KEXEC_CORE_NOTE_NAME CRASH_CORE_NOTE_NAME
>  
>  /*
> - * The per-cpu notes area is a list of notes terminated by a "NULL"
> - * note header.  For kdump, the code in vmcore.c runs in the context
> - * of the second kernel to combine them into one note.
> - */
> -#ifndef KEXEC_NOTE_BYTES
> -#define KEXEC_NOTE_BYTES CRASH_CORE_NOTE_BYTES
> -#endif
> -
> -/*
>   * This structure is used to hold the arguments that are used when loading
>   * kernel binaries.
>   */

Re: [PATCH v5 28/32] x86/mm, kexec: Allow kexec to be used with SME

2017-05-25 Thread Xunlei Pang

On 04/19/2017 at 05:21 AM, Tom Lendacky wrote:
> Provide support so that kexec can be used to boot a kernel when SME is
> enabled.
>
> Support is needed to allocate pages for kexec without encryption.  This
> is needed in order to be able to reboot in the kernel in the same manner
> as originally booted.

Hi Tom,

Looks like kdump will break, I didn't see the similar handling for kdump cases, 
see kernel:
kimage_alloc_crash_control_pages(), kimage_load_crash_segment(), etc.

We need to support kdump with SME, kdump 
kernel/initramfs/purgatory/elfcorehdr/etc
are all loaded into the reserved memory(see crashkernel=X) by userspace 
kexec-tools.
I think a straightforward way would be to mark the whole reserved memory range 
without
encryption before loading all the kexec segments for kdump, I guess we can 
handle this
easily in arch_kexec_unprotect_crashkres().

Moreover, now that "elfcorehdr=X" is left as decrypted, it needs to be remapped 
to the
encrypted data.

Regards,
Xunlei

>
> Additionally, when shutting down all of the CPUs we need to be sure to
> flush the caches and then halt. This is needed when booting from a state
> where SME was not active into a state where SME is active (or vice-versa).
> Without these steps, it is possible for cache lines to exist for the same
> physical location but tagged both with and without the encryption bit. This
> can cause random memory corruption when caches are flushed depending on
> which cacheline is written last.
>
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/include/asm/init.h  |1 +
>  arch/x86/include/asm/irqflags.h  |5 +
>  arch/x86/include/asm/kexec.h |8 
>  arch/x86/include/asm/pgtable_types.h |1 +
>  arch/x86/kernel/machine_kexec_64.c   |   35 
> +-
>  arch/x86/kernel/process.c|   26 +++--
>  arch/x86/mm/ident_map.c  |   11 +++
>  include/linux/kexec.h|   14 ++
>  kernel/kexec_core.c  |7 +++
>  9 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> index 737da62..b2ec511 100644
> --- a/arch/x86/include/asm/init.h
> +++ b/arch/x86/include/asm/init.h
> @@ -6,6 +6,7 @@ struct x86_mapping_info {
>   void *context;   /* context for alloc_pgt_page */
>   unsigned long pmd_flag;  /* page flag for PMD entry */
>   unsigned long offset;/* ident mapping offset */
> + unsigned long kernpg_flag;   /* kernel pagetable flag override */
>  };
>  
>  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index ac7692d..38b5920 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -58,6 +58,11 @@ static inline __cpuidle void native_halt(void)
>   asm volatile("hlt": : :"memory");
>  }
>  
> +static inline __cpuidle void native_wbinvd_halt(void)
> +{
> + asm volatile("wbinvd; hlt" : : : "memory");
> +}
> +
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 70ef205..e8183ac 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -207,6 +207,14 @@ struct kexec_entry64_regs {
>   uint64_t r15;
>   uint64_t rip;
>  };
> +
> +extern int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages,
> +gfp_t gfp);
> +#define arch_kexec_post_alloc_pages arch_kexec_post_alloc_pages
> +
> +extern void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages);
> +#define arch_kexec_pre_free_pages arch_kexec_pre_free_pages
> +
>  #endif
>  
>  typedef void crash_vmclear_fn(void);
> diff --git a/arch/x86/include/asm/pgtable_types.h 
> b/arch/x86/include/asm/pgtable_types.h
> index ce8cb1c..0f326f4 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -213,6 +213,7 @@ enum page_cache_mode {
>  #define PAGE_KERNEL  __pgprot(__PAGE_KERNEL | _PAGE_ENC)
>  #define PAGE_KERNEL_RO   __pgprot(__PAGE_KERNEL_RO | _PAGE_ENC)
>  #define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC | _PAGE_ENC)
> +#define PAGE_KERNEL_EXEC_NOENC   __pgprot(__PAGE_KERNEL_EXEC)
>  #define PAGE_KERNEL_RX   __pgprot(__PAGE_KERNEL_RX | _PAGE_ENC)
>  #define PAGE_KERNEL_NOCACHE  __pgprot(__PAGE_KERNEL_NOCACHE | _PAGE_ENC)
>  #define PAGE_KERNEL_LARGE__pgprot(__PAGE_KERNEL_LARGE | _PAGE_ENC)
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index 085c3b3..11c0ca9 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -86,7 +86,7 @@ static int init_transition_pgtable(struct kimage *image, 
> pgd_t *pgd)
>   set_pmd(pmd, __pmd(__pa(

Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-05-25 Thread Xunlei Pang

On 05/26/2017 at 10:49 AM, Dave Young wrote:
> Ccing Xunlei he is reading the patches see what need to be done for
> kdump. There should still be several places to handle to make kdump work.
>
> On 05/18/17 at 07:01pm, Borislav Petkov wrote:
>> On Tue, Apr 18, 2017 at 04:22:12PM -0500, Tom Lendacky wrote:
>>> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
>>> determine if SME is active.
>> But why do user-space tools need to know that?
>>
>> I mean, when we load the kdump kernel, we do it with the first kernel,
>> with the kexec_load() syscall, AFAICT. And that code does a lot of
>> things during that init, like machine_kexec_prepare()->init_pgtable() to
>> prepare the ident mapping of the second kernel, for example.
>>
>> What I'm aiming at is that the first kernel knows *exactly* whether SME
>> is enabled or not and doesn't need to tell the second one through some
>> sysfs entries - it can do that during loading.
>>
>> So I don't think we need any userspace things at all...
> If kdump kernel can get the SME status from hardware register then this
> should be not necessary and this patch can be dropped.

Yes, I also agree with dropping this one.

Regards,
Xunlei

[PATCH v3] cpuidle/coupled: Add sanity check for safe_state_index

2015-08-30 Thread Xunlei Pang

From: Xunlei Pang 

Since we are using cpuidle_driver::safe_state_index directly as the
target state index, it is better to add the sanity check at the point
of registering the driver.

Signed-off-by: Xunlei Pang 
---
v2->v3:
Move the code to a new cpuidle_coupled_state_verify() depending on 
CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED, to avoid using #ifdef in function 
bodies. Thanks Rafael's comments on this.

 drivers/cpuidle/coupled.c | 22 ++
 drivers/cpuidle/cpuidle.h |  6 ++
 drivers/cpuidle/driver.c  |  4 
 3 files changed, 32 insertions(+)

diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 1523e2d..344058f 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -187,6 +187,28 @@ bool cpuidle_state_is_coupled(struct cpuidle_driver *drv, 
int state)
 }
 
 /**
+ * cpuidle_coupled_state_verify - check if the coupled states are correctly 
set.
+ * @drv: struct cpuidle_driver for the platform
+ *
+ * Returns 0 for valid state values, a negative error code otherwise:
+ *  * -EINVAL if any coupled state(safe_state_index) is wrongly set.
+ */
+int cpuidle_coupled_state_verify(struct cpuidle_driver *drv)
+{
+   int i;
+
+   for (i = drv->state_count - 1; i >= 0; i--) {
+   if (cpuidle_state_is_coupled(drv, i) &&
+   (drv->safe_state_index == i ||
+drv->safe_state_index < 0 ||
+drv->safe_state_index >= drv->state_count))
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+/**
  * cpuidle_coupled_set_ready - mark a cpu as ready
  * @coupled: the struct coupled that contains the current cpu
  */
diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
index 178c5ad..f87f399 100644
--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -35,6 +35,7 @@ extern void cpuidle_remove_sysfs(struct cpuidle_device *dev);
 
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
 bool cpuidle_state_is_coupled(struct cpuidle_driver *drv, int state);
+int cpuidle_coupled_state_verify(struct cpuidle_driver *drv);
 int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int next_state);
 int cpuidle_coupled_register_device(struct cpuidle_device *dev);
@@ -46,6 +47,11 @@ bool cpuidle_state_is_coupled(struct cpuidle_driver *drv, 
int state)
return false;
 }
 
+static inline int cpuidle_coupled_state_verify(struct cpuidle_driver *drv)
+{
+   return 0;
+}
+
 static inline int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int next_state)
 {
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 5db1478..389ade4 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -227,6 +227,10 @@ static int __cpuidle_register_driver(struct cpuidle_driver 
*drv)
if (!drv || !drv->state_count)
return -EINVAL;
 
+   ret = cpuidle_coupled_state_verify(drv);
+   if (ret)
+   return ret;
+
if (cpuidle_disabled())
return -ENODEV;
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] cpuidle/coupled: Add sanity check for safe_state_index

2015-09-03 Thread Xunlei Pang

Hi Rafael,

On 3 September 2015 at 09:35, Rafael J. Wysocki  wrote:
> On Monday, August 31, 2015 11:34:05 AM Xunlei Pang wrote:
>> From: Xunlei Pang 
>>
>> Since we are using cpuidle_driver::safe_state_index directly as the
>> target state index, it is better to add the sanity check at the point
>> of registering the driver.
>>
>> Signed-off-by: Xunlei Pang 
>
> I'm queuing this one up for the next PM pull request, thanks!

Thanks!

Regards,
Xunlei

>
>
>> ---
>> v2->v3:
>> Move the code to a new cpuidle_coupled_state_verify() depending on
>> CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED, to avoid using #ifdef in function
>> bodies. Thanks Rafael's comments on this.
>>
>>  drivers/cpuidle/coupled.c | 22 ++
>>  drivers/cpuidle/cpuidle.h |  6 ++
>>  drivers/cpuidle/driver.c  |  4 
>>  3 files changed, 32 insertions(+)
>>
>> diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
>> index 1523e2d..344058f 100644
>> --- a/drivers/cpuidle/coupled.c
>> +++ b/drivers/cpuidle/coupled.c
>> @@ -187,6 +187,28 @@ bool cpuidle_state_is_coupled(struct cpuidle_driver 
>> *drv, int state)
>>  }
>>
>>  /**
>> + * cpuidle_coupled_state_verify - check if the coupled states are correctly 
>> set.
>> + * @drv: struct cpuidle_driver for the platform
>> + *
>> + * Returns 0 for valid state values, a negative error code otherwise:
>> + *  * -EINVAL if any coupled state(safe_state_index) is wrongly set.
>> + */
>> +int cpuidle_coupled_state_verify(struct cpuidle_driver *drv)
>> +{
>> + int i;
>> +
>> + for (i = drv->state_count - 1; i >= 0; i--) {
>> + if (cpuidle_state_is_coupled(drv, i) &&
>> + (drv->safe_state_index == i ||
>> +  drv->safe_state_index < 0 ||
>> +  drv->safe_state_index >= drv->state_count))
>> + return -EINVAL;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/**
>>   * cpuidle_coupled_set_ready - mark a cpu as ready
>>   * @coupled: the struct coupled that contains the current cpu
>>   */
>> diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
>> index 178c5ad..f87f399 100644
>> --- a/drivers/cpuidle/cpuidle.h
>> +++ b/drivers/cpuidle/cpuidle.h
>> @@ -35,6 +35,7 @@ extern void cpuidle_remove_sysfs(struct cpuidle_device 
>> *dev);
>>
>>  #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
>>  bool cpuidle_state_is_coupled(struct cpuidle_driver *drv, int state);
>> +int cpuidle_coupled_state_verify(struct cpuidle_driver *drv);
>>  int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
>>   struct cpuidle_driver *drv, int next_state);
>>  int cpuidle_coupled_register_device(struct cpuidle_device *dev);
>> @@ -46,6 +47,11 @@ bool cpuidle_state_is_coupled(struct cpuidle_driver *drv, 
>> int state)
>>   return false;
>>  }
>>
>> +static inline int cpuidle_coupled_state_verify(struct cpuidle_driver *drv)
>> +{
>> + return 0;
>> +}
>> +
>>  static inline int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
>>   struct cpuidle_driver *drv, int next_state)
>>  {
>> diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
>> index 5db1478..389ade4 100644
>> --- a/drivers/cpuidle/driver.c
>> +++ b/drivers/cpuidle/driver.c
>> @@ -227,6 +227,10 @@ static int __cpuidle_register_driver(struct 
>> cpuidle_driver *drv)
>>   if (!drv || !drv->state_count)
>>   return -EINVAL;
>>
>> + ret = cpuidle_coupled_state_verify(drv);
>> + if (ret)
>> + return ret;
>> +
>>   if (cpuidle_disabled())
>>   return -ENODEV;
>>
>>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-20 Thread Xunlei Pang

On 6/21/18 1:01 AM, bseg...@google.com wrote:
> Xunlei Pang  writes:
> 
>> I noticed the group constantly got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very simple to reproduce:
>>   mkdir /sys/fs/cgroup/cpu/test
>>   cd /sys/fs/cgroup/cpu/test
>>   echo 10 > cpu.cfs_quota_us
>>   echo $$ > tasks
>> then repeat:
>>   cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>
>> The current condition to judge clock drift in expire_cfs_rq_runtime()
>> is wrong, the two runtime_expires are actually the same when clock
>> drift happens, so this condtion can never hit. The orginal design was
>> correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"),
>> but was changed to be the current one due to its locking issue.
>>
>> This patch introduces another way, it adds a new field in both structures
>> cfs_rq and cfs_bandwidth to record the expiration update sequence, and
>> use them to figure out if clock drift happens(true if they equal).
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
>> cfs_b->quota/period")
>> Cc: Ben Segall 
> 
> Reviewed-By: Ben Segall 

Thanks Ben :-)

Hi Peter, could you please have a look at them?

Cc sta...@vger.kernel.org

> 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/fair.c  | 14 --
>>  kernel/sched/sched.h |  6 --
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..e6bb68d52962 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct 
>> cfs_bandwidth *cfs_b)
>>  now = sched_clock_cpu(smp_processor_id());
>>  cfs_b->runtime = cfs_b->quota;
>>  cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> +cfs_b->expires_seq++;
>>  }
>>  
>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  struct task_group *tg = cfs_rq->tg;
>>  struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>  u64 amount = 0, min_amount, expires;
>> +int expires_seq;
>>  
>>  /* note: this is a positive sum as runtime_remaining <= 0 */
>>  min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  cfs_b->idle = 0;
>>  }
>>  }
>> +expires_seq = cfs_b->expires_seq;
>>  expires = cfs_b->runtime_expires;
>>  raw_spin_unlock(&cfs_b->lock);
>>  
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * spread between our sched_clock and the one on which runtime was
>>   * issued.
>>   */
>> -if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +if (cfs_rq->expires_seq != expires_seq) {
>> +cfs_rq->expires_seq = expires_seq;
>>  cfs_rq->runtime_expires = expires;
>> +}
>>  
>>  return cfs_rq->runtime_remaining > 0;
>>  }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * has not truly expired.
>>   *
>>   * Fortunately we can check determine whether this the case by checking
>> - * whether the global deadline has advanced. It is valid to compare
>> - * cfs_b->runtime_expires without any locks since we only care about
>> - * exact equality, so a partial write will still work.
>> + * whether the global deadline(cfs_b->expires_seq) has advanced.
>>   */
>> -
>> -if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> +if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>  /* extend local deadline, drift is bounded above by 2 ticks */
>>  cfs_rq->runtime_expires += TICK_NSEC;
>>  } else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e97

Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-21 Thread Xunlei Pang

On 6/21/18 4:08 PM, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 11:56:56AM +0800, Xunlei Pang wrote:
>>>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
>>>> cfs_b->quota/period")
>>>> Cc: Ben Segall 
>>>
>>> Reviewed-By: Ben Segall 
>>
>> Thanks Ben :-)
>>
>> Hi Peter, could you please have a look at them?
> 
> I grabbed them, thanks guys!
> 
>>
>> Cc sta...@vger.kernel.org
> 
> For both or just the first?
> 

Both, thanks!

[PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-22 Thread Xunlei Pang

We use per-cgroup cpu usage statistics similar to "cgroup rstat",
and encountered a problem that user and sys usages are wrongly
split sometimes.

Run tasks with some random run-sleep pattern for a long time, and
when tick-based time and scheduler sum_exec_runtime hugely drifts
apart(scheduler sum_exec_runtime is less than tick-based time),
the current implementation of cputime_adjust() will produce less
sys usage than the actual use after changing to run a different
workload pattern with high sys. This is because total tick-based
utime and stime are used to split the total sum_exec_runtime.

Same problem exists on utime and stime from "/proc//stat".

[Example]
Run some random run-sleep patterns for minutes, then change to run
high sys pattern, and watch.
1) standard "top"(which is the correct one):
   4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) our tool parsing utime and stime from "/proc//stat":
   20.5 usr, 78.4 sys
We can see "20.5 usr" displayed in 2) was incorrect, it recovers
gradually with time: 9.7 usr, 89.5 sys

This patch fixes the issue by calculating using all kinds of time
elapsed since last parse in cputime_adjust(), and accumulate the
corresponding results calculated into prev_cputime. A new field
of task_cputime type is added in structure prev_cputime to record
previous task_cputime so that we can get the elapsed time deltas.

Signed-off-by: Xunlei Pang 
---
 include/linux/sched.h | 33 +++
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c| 61 ---
 3 files changed, 51 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..5108ac8414e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,22 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime: time spent in user mode, in nanoseconds
+ * @stime: time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+   u64 utime;
+   u64 stime;
+   unsigned long long  sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -236,26 +252,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
u64 utime;
u64 stime;
+   struct task_cputime cputime;
raw_spinlock_t  lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in nanoseconds
- * @stime: time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-   u64 utime;
-   u64 stime;
-   unsigned long long  sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp   utime
 #define prof_exp   stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct 
task_struct *tsk,
atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
prev->utime = prev->stime = 0;
+   prev->cputime.utime = 0;
+   prev->cputime.stime = 0;
+   prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   prev_cputime_clear(prev);
raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-25 Thread Xunlei Pang

On 6/22/18 6:35 PM, kbuild test robot wrote:
> Hi Xunlei,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on v4.18-rc1 next-20180622]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Xunlei-Pang/sched-cputime-Ensure-correct-utime-and-stime-proportion/20180622-153720
> reproduce: make htmldocs
> 
> All warnings (new ones prefixed by >>):
> 
>WARNING: convert(1) not found, for SVG to PDF conversion install 
> ImageMagick (https://www.imagemagick.org)
>mm/mempool.c:228: warning: Function parameter or member 'pool' not 
> described in 'mempool_init'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ibss' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.connect' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.keys' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ie' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ie_len' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.bssid' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ssid' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.default_key' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.default_mgmt_key' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.prev_bssid_valid' not described in 'wireless_dev'
>include/net/mac80211.h:2282: warning: Function parameter or member 
> 'radiotap_timestamp.units_pos' not described in 'ieee80211_hw'
>include/net/mac80211.h:2282: warning: Function parameter or member 
> 'radiotap_timestamp.accuracy' not described in 'ieee80211_hw'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.rates' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.rts_cts_rate_idx' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.use_rts' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.use_cts_prot' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.short_preamble' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.skip_table' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.jiffies' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.vif' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.hw_key' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.flags' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.enqueue_time' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 'ack' 
> not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'ack.cookie' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.rates' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.ack_signal' not described in 'ieee80211_tx_info'
>incl

Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-29 Thread Xunlei Pang

Hi Cong,

On 7/28/18 8:24 AM, Cong Wang wrote:
> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
> we should sync its ->expires_seq too. However it is missing
> for distribute_cfs_runtime(), especially the slack timer call path.

I don't think it's a problem, as expires_seq will get synced in
assign_cfs_rq_runtime().

Thanks,
Xunlei

> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Cc: Xunlei Pang 
> Cc: Ben Segall 
> Cc: Linus Torvalds 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Signed-off-by: Cong Wang 
> ---
>  kernel/sched/fair.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f0a0be4d344..910c50db3d74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4857,7 +4857,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  }
>  
>  static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> - u64 remaining, u64 expires)
> + u64 remaining, u64 expires, int expires_seq)
>  {
>   struct cfs_rq *cfs_rq;
>   u64 runtime;
> @@ -4880,6 +4880,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth 
> *cfs_b,
>  
>   cfs_rq->runtime_remaining += runtime;
>   cfs_rq->runtime_expires = expires;
> + cfs_rq->expires_seq = expires_seq;
>  
>   /* we check whether we're throttled above */
>   if (cfs_rq->runtime_remaining > 0)
> @@ -4905,7 +4906,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth 
> *cfs_b,
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int 
> overrun)
>  {
>   u64 runtime, runtime_expires;
> - int throttled;
> + int throttled, expires_seq;
>  
>   /* no need to continue the timer with no bandwidth constraint */
>   if (cfs_b->quota == RUNTIME_INF)
> @@ -4933,6 +4934,7 @@ static int do_sched_cfs_period_timer(struct 
> cfs_bandwidth *cfs_b, int overrun)
>   cfs_b->nr_throttled += overrun;
>  
>   runtime_expires = cfs_b->runtime_expires;
> + expires_seq = cfs_b->expires_seq;
>  
>   /*
>* This check is repeated as we are holding onto the new bandwidth while
> @@ -4946,7 +4948,7 @@ static int do_sched_cfs_period_timer(struct 
> cfs_bandwidth *cfs_b, int overrun)
>   raw_spin_unlock(&cfs_b->lock);
>   /* we can't nest cfs_b->lock while distributing bandwidth */
>   runtime = distribute_cfs_runtime(cfs_b, runtime,
> -  runtime_expires);
> +  runtime_expires, expires_seq);
>   raw_spin_lock(&cfs_b->lock);
>  
>   throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> @@ -5055,6 +5057,7 @@ static __always_inline void 
> return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>   u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> + int expires_seq;
>   u64 expires;
>  
>   /* confirm we're still not at a refresh boundary */
> @@ -5068,12 +5071,13 @@ static void do_sched_cfs_slack_timer(struct 
> cfs_bandwidth *cfs_b)
>   runtime = cfs_b->runtime;
>  
>   expires = cfs_b->runtime_expires;
> + expires_seq = cfs_b->expires_seq;
>   raw_spin_unlock(&cfs_b->lock);
>  
>   if (!runtime)
>   return;
>  
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime, expires, expires_seq);
>  
>   raw_spin_lock(&cfs_b->lock);
>   if (expires == cfs_b->runtime_expires)
>

Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-31 Thread Xunlei Pang

On 7/31/18 1:55 AM, Cong Wang wrote:
> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang  wrote:
>>
>> Hi Cong,
>>
>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>> we should sync its ->expires_seq too. However it is missing
>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>
>> I don't think it's a problem, as expires_seq will get synced in
>> assign_cfs_rq_runtime().
> 
> Sure, but there is a small window during which they are not synced.
> Why do you want to wait until the next assign_cfs_rq_runtime() when
> you already know runtime_expires is synced?
> 
> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
> inside __account_cfs_rq_runtime(), which means the check of
> cfs_rq->expires_seq is not accurate for unthrottling case if the clock
> drift happens soon enough?
> 

expire_cfs_rq_runtime():
if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
/* global deadline is ahead, expiration has passed */
cfs_rq->runtime_remaining = 0;
}

So if clock drift happens soon, then expires_seq decides the correct
thing we should do: if cfs_b->expires_seq advanced, then clear the stale
cfs_rq->runtime_remaining from the slack timer of the past period, then
assign_cfs_rq_runtime() will refresh them afterwards, otherwise it is a
real clock drift. I am still not getting where the race is?

Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-31 Thread Xunlei Pang

On 8/1/18 4:55 AM, Cong Wang wrote:
> On Tue, Jul 31, 2018 at 10:13 AM  wrote:
>>
>> Xunlei Pang  writes:
>>
>>> On 7/31/18 1:55 AM, Cong Wang wrote:
>>>> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang  
>>>> wrote:
>>>>>
>>>>> Hi Cong,
>>>>>
>>>>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>>>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>>>>> we should sync its ->expires_seq too. However it is missing
>>>>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>>>>
>>>>> I don't think it's a problem, as expires_seq will get synced in
>>>>> assign_cfs_rq_runtime().
>>>>
>>>> Sure, but there is a small window during which they are not synced.
>>>> Why do you want to wait until the next assign_cfs_rq_runtime() when
>>>> you already know runtime_expires is synced?
>>>>
>>>> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
>>>> inside __account_cfs_rq_runtime(), which means the check of
>>>> cfs_rq->expires_seq is not accurate for unthrottling case if the clock
>>>> drift happens soon enough?
>>>>
>>>
>>> expire_cfs_rq_runtime():
>>> if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>> /* extend local deadline, drift is bounded above by 2 ticks */
>>> cfs_rq->runtime_expires += TICK_NSEC;
>>> } else {
>>> /* global deadline is ahead, expiration has passed */
>>> cfs_rq->runtime_remaining = 0;
>>> }
>>>
>>> So if clock drift happens soon, then expires_seq decides the correct
>>> thing we should do: if cfs_b->expires_seq advanced, then clear the stale
>>> cfs_rq->runtime_remaining from the slack timer of the past period, then
>>> assign_cfs_rq_runtime() will refresh them afterwards, otherwise it is a
>>> real clock drift. I am still not getting where the race is?
> 
> But expires_seq is supposed to be the same here, after
> distribute_cfs_runtime(), therefore runtime_remaining is not supposed
> to be cleared.
> 
> Which part do I misunderstand? expires_seq should not be same here?
> Or you are saying a wrongly clear of runtime_remaning is fine?
> 

Let's see the unthrottle cases.
1. for the periodic timer
distribute_cfs_runtime updates the throttled cfs_rq->runtime_expires to
be a new value, so expire_cfs_rq_runtime does nothing because of:
  rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires < 0

Afterwards assign_cfs_rq_runtime() will sync its expires_seq.

2. for the slack timer
the two expires_seq should be the same, so if clock drift happens soon,
expire_cfs_rq_runtime regards it as true clock drift:
  cfs_rq->runtime_expires += TICK_NSEC
If it happens that global expires_seq advances, it also doesn't matter,
expire_cfs_rq_runtime will clear the stale expire_cfs_rq_runtime as
expected.

> 
>>
>> Nothing /important/ goes wrong because distribute_cfs_runtime only fills
>> runtime_remaining up to 1, not a real amount.
> 
> No, runtime_remaining is updated right before expire_cfs_rq_runtime():
> 
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> expire_cfs_rq_runtime(cfs_rq);
> 
> so almost certainly it can't be 1.

I think Ben means it firstly gets a distributtion of 1 to run after
unthrottling, soon it will have a negative runtime_remaining, and go
to assign_cfs_rq_runtime().

Thanks,
Xunlei

> 
> Which means the following check could be passed:
> 
>  4655 if (cfs_rq->runtime_remaining < 0)
>  4656 return;
> 
> therefore we are reaching the clock drift logic code inside
> expire_cfs_rq_runtime()
> where expires_seq is supposed to be same as they should be sync'ed.
> Therefore without patch, we wrongly clear the runtime_remainng?
> 
> Thanks.
>

Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-24 Thread Xunlei Pang

On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>>  cat-20755 [022] d...  1370.106496: cputime_adjust: task
>> tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
>>  cat-20755 [022] d...  1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>>  cat-20773 [005] d...  1371.109825: cputime_adjust: task
>> tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
>>  cat-20773 [005] d...  1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries(all the data below is in ns):
>> task tick-based utime: 36256000->36256700 increased 700
>> task tick-based stime: 255100  ->354700   increased 99600
>> scheduler rtime:   333060702626->334063718912 increased 1003016286
>>
>> The application actually runs almost 100%sys at the moment, we can
>> use the task tick-based utime and stime increased to double check:
>> 99600/(700+99600) > 99%sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> But for the current cputime_adjust(), we get the following adjusted
>> utime and stime increase in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254  ->3236489210   increased 909421956
>>
>> so 909421956/(93594330+909421956)=91%sys as the shell script shows above.
>>
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust() always
>> passes the whole times to scale_stime() to split the whole utime and
>> stime. In this patch, we pass all the increased deltas in 1) within
>> user's sample period to scale_stime() instead and accumulate the
>> corresponding results to the previous saved adjusted utime and stime,
>> so guarantee the accurate usr and sys increase within the user sample
>> period.
> 
> But why it this a problem?
> 
> Since its sample based there's really nothing much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
> 
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
> 
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
> 

Basically I am ok with the current implementation, except for one
scenario we've met: when kernel went wrong for some reason with 100% sys
suddenly for seconds(even trigger softlockup), the statistics monitor
didn't reflect the fact, which confused people. One example with our
per-cgroup top, we ever noticed "20% usr, 80% sys" displayed while in
fact the kernel was in some busy loop(100% sys) at that moment, and the
tick based time are of course all sys samples in such case.

[PATCH v2 1/3] sched/deadline: Zero out positive runtime after throttling constrained tasks

2017-05-11 Thread Xunlei Pang

When a contrained task is throttled by dl_check_constrained_dl(),
it may carry the remaining positive runtime, as a result when
dl_task_timer() fires and calls replenish_dl_entity(), it will
not be replenished correctly due to the positive dl_se->runtime.

This patch assigns its runtime to 0 if positive after throttling.

Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task 
activated after the deadline)
Acked-by: Daniel Bristot de Oliveira 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/deadline.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a2ce590..d3d291e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -723,6 +723,8 @@ static inline void dl_check_constrained_dl(struct 
sched_dl_entity *dl_se)
if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
return;
dl_se->dl_throttled = 1;
+   if (dl_se->runtime > 0)
+   dl_se->runtime = 0;
}
 }
 
-- 
1.8.3.1

[PATCH v2 3/3] sched/deadline: Add statistics to track runtime underruns

2017-05-11 Thread Xunlei Pang

Add accounting to track cases that runtime isn't running
out, and export the information in "/proc//sched".

Specifically, the patch adds three members "nr_underrun_sched",
"nr_underrun_block", and "nr_underrun_yield" in sched_dl_entity:
 -@nr_underrun_sched hints some scheduling issue.
 -@nr_underrun_block hints some block reason. E.g. long sleep.
 -@nr_underrun_yield hints the yield reason.

This is helpful to spot/debug deadline issues, for example,
I launched three 50% dl tasks on my dual-core machine, plus
several buggy contrained dl tasks that Daniel is trying to
address in "sched/deadline: Use the revised wakeup rule for
suspending constrained dl tasks", then I observed one 50%
deadline task's proc sched output:
$ cat /proc/3389/sched |grep underrun
dl.nr_underrun_sched:981
dl.nr_underrun_block:  0
dl.nr_underrun_yield:  0

Very large "dl.nr_underrun_sched" hints it's very likely that
there is some underlying scheduling issue.

Note that we don't use CONFIG_SCHED_DEBUG as the accounting
added has little overhead(also happens infrequently).

Suggested-by: Steven Rostedt 
Signed-off-by: Xunlei Pang 
---
 include/linux/sched.h   | 10 ++
 kernel/sched/core.c |  3 +++
 kernel/sched/deadline.c | 12 +---
 kernel/sched/debug.c|  3 +++
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba080e5..e17928f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -455,6 +455,16 @@ struct sched_dl_entity {
 * own bandwidth to be enforced, thus we need one timer per task.
 */
struct hrtimer  dl_timer;
+
+   /*
+* Accounting for periods that run less than @dl_runtime:
+* @nr_underrun_sched hints some scheduling issue.
+* @nr_underrun_block hints some block reason. E.g. long sleep.
+* @nr_underrun_yield hints the yield reason.
+*/
+   u64 nr_underrun_sched;
+   u64 nr_underrun_block;
+   u64 nr_underrun_yield;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bccd819..6214ada 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4004,6 +4004,9 @@ static struct task_struct *find_process_by_pid(pid_t pid)
dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
dl_se->flags = attr->sched_flags;
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+   dl_se->nr_underrun_sched = 0;
+   dl_se->nr_underrun_block = 0;
+   dl_se->nr_underrun_yield = 0;
 
/*
 * Changing the parameters of a task is 'tricky' and we're not doing
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5691149..a7ddc03 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -394,8 +394,10 @@ static void replenish_dl_entity(struct sched_dl_entity 
*dl_se,
dl_se->runtime = pi_se->dl_runtime;
}
 
-   if (dl_se->dl_yielded && dl_se->runtime > 0)
+   if (dl_se->dl_yielded && dl_se->runtime > 0) {
dl_se->runtime = 0;
+   ++dl_se->nr_underrun_yield;
+   }
 
/*
 * We keep moving the deadline away until we get some
@@ -723,8 +725,10 @@ static inline void dl_check_constrained_dl(struct 
sched_dl_entity *dl_se)
if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
return;
dl_se->dl_throttled = 1;
-   if (dl_se->runtime > 0)
+   if (dl_se->runtime > 0) {
dl_se->runtime = 0;
+   ++dl_se->nr_underrun_block;
+   }
}
 }
 
@@ -733,8 +737,10 @@ int dl_runtime_exceeded(struct rq *rq, struct 
sched_dl_entity *dl_se)
 {
bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
 
-   if (dmiss && dl_se->runtime > 0)
+   if (dmiss && dl_se->runtime > 0) {
dl_se->runtime = 0;
+   ++dl_se->nr_underrun_sched;
+   }
 
return (dl_se->runtime <= 0);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 38f0193..904b43f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -957,6 +957,9 @@ void proc_sched_show_task(struct task_struct *p, struct 
seq_file *m)
if (p->policy == SCHED_DEADLINE) {
P(dl.runtime);
P(dl.deadline);
+   P(dl.nr_underrun_sched);
+   P(dl.nr_underrun_block);
+   P(dl.nr_underrun_yield);
}
 #undef PN_SCHEDSTAT
 #undef PN
-- 
1.8.3.1

[PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline

2017-05-11 Thread Xunlei Pang

dl_runtime_exceeded() only checks negative runtime, actually
when the current deadline past, we should start a new period
and zero out the remaining runtime as well.

This patch improves dl_runtime_exceeded() to achieve that.

Fixes: 269ad8015a6b ("sched/deadline: Avoid double-accounting in case of missed 
deadlines")
Cc: Luca Abeni 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/deadline.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d3d291e..5691149 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -729,8 +729,13 @@ static inline void dl_check_constrained_dl(struct 
sched_dl_entity *dl_se)
 }
 
 static
-int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 {
+   bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
+
+   if (dmiss && dl_se->runtime > 0)
+   dl_se->runtime = 0;
+
return (dl_se->runtime <= 0);
 }
 
@@ -781,7 +786,7 @@ static void update_curr_dl(struct rq *rq)
dl_se->runtime -= delta_exec;
 
 throttle:
-   if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
+   if (dl_runtime_exceeded(rq, dl_se) || dl_se->dl_yielded) {
dl_se->dl_throttled = 1;
__dequeue_task_dl(rq, curr, 0);
if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
-- 
1.8.3.1

Re: [PATCH] sched/deadline: Zero out positive runtime after throttling constrained tasks

2017-05-11 Thread Xunlei Pang

On 05/11/2017 at 09:38 AM, Xunlei Pang wrote:
> On 05/10/2017 at 09:36 PM, Steven Rostedt wrote:
>> On Wed, 10 May 2017 21:03:37 +0800
>> Xunlei Pang  wrote:
>>
>>> When a contrained task is throttled by dl_check_constrained_dl(),
>>> it may carry the remaining positive runtime, as a result when
>>> dl_task_timer() fires and calls replenish_dl_entity(), it will
>>> not be replenished correctly due to the positive dl_se->runtime.
>>>
>>> This patch assigns its runtime to 0 if positive after throttling.
>>>
>>> Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task 
>>> activated after the deadline)
>>> Cc: Daniel Bristot de Oliveira 
>>> Signed-off-by: Xunlei Pang 
>>> ---
>>>  kernel/sched/deadline.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index a2ce590..d3d291e 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -723,6 +723,8 @@ static inline void dl_check_constrained_dl(struct 
>>> sched_dl_entity *dl_se)
>>> if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
>>> return;
>>> dl_se->dl_throttled = 1;
>>> +   if (dl_se->runtime > 0)
>>> +   dl_se->runtime = 0;
>> This makes sense to me, but should we have any accounting for runtime
>> that was missed due to wakeups and such?
> It sounds a good idea, will try to add that to "/proc//sched".
>
> Looks like we should also catch and handle the deadline miss in 
> dl_runtime_exceeded(),
> I guess we can add the accounting together.

Hi all,

Thanks for all your valuable review!

FYI: I just sent v2 added two more patches.

Regards,
Xunlei

Re: [PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline

2017-05-11 Thread Xunlei Pang

On 05/12/2017 at 01:57 PM, luca abeni wrote:
> Hi again,
>
> (sorry for the previous email; I replied from gmail and I did not
> realize I was sending it in html).
>
>
> On Fri, 12 May 2017 11:32:08 +0800
> Xunlei Pang  wrote:
>
>> dl_runtime_exceeded() only checks negative runtime, actually
>> when the current deadline past, we should start a new period
>> and zero out the remaining runtime as well.
> In this case, I think global EDF wants to allow the task to run with
> its remaining runtime even also missing a deadline, so I think this
> change is not correct.
> (when using global EDF, tasks scheduled on multiple CPUs can miss their
> deadlines... Setting the runtime to 0 as soon as a deadline is missed
> would break global EDF scheduling)

Hi Luca,

Thanks for the comment, looks like I neglected the theoretical analysis.

Cited from Documentation/scheduler/sched-deadline.txt:
 "As a matter of fact, in this case it is possible to provide an upper bound
 for tardiness (defined as the maximum between 0 and the difference
 between the finishing time of a job and its absolute deadline).
 More precisely, it can be proven that using a global EDF scheduler the
 maximum tardiness of each task is smaller or equal than
((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_m
 where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
 is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
 utilization[12]."

And
 "As seen, enforcing that the total utilization is smaller than M does not
 guarantee that global EDF schedules the tasks without missing any deadline
 (in other words, global EDF is not an optimal scheduling algorithm). However,
 a total utilization smaller than M is enough to guarantee that non real-time
 tasks are not starved and that the tardiness of real-time tasks has an upper
 bound[12] (as previously noted). Different bounds on the maximum tardiness
 experienced by real-time tasks have been developed in various papers[13,14],
 but the theoretical result that is important for SCHED_DEADLINE is that if
 the total utilization is smaller or equal than M then the response times of
 the tasks are limited."

Do you mean there is some tardiness allowed in theory(global EDF is not an
optimal scheduling algorithm), thus missed deadline is allowed for global EDF?

>
> Which kind of issue is this patch fixing? If it is something you saw
> with deadline-constrained tasks, maybe you can add a check for
> deadline!=period?

No, It's an issue(I thought so back then) during making the last patch.

Regards,
Xunlei

>
>   Luca
>> This patch improves dl_runtime_exceeded() to achieve that.
>>
>> Fixes: 269ad8015a6b ("sched/deadline: Avoid double-accounting in case
>> of missed deadlines") Cc: Luca Abeni 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/deadline.c | 9 +++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index d3d291e..5691149 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -729,8 +729,13 @@ static inline void
>> dl_check_constrained_dl(struct sched_dl_entity *dl_se) }
>>  
>>  static
>> -int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>> +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>>  {
>> +bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
>> +
>> +if (dmiss && dl_se->runtime > 0)
>> +dl_se->runtime = 0;
>> +
>>  return (dl_se->runtime <= 0);
>>  }
>>  
>> @@ -781,7 +786,7 @@ static void update_curr_dl(struct rq *rq)
>>  dl_se->runtime -= delta_exec;
>>  
>>  throttle:
>> -if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
>> +if (dl_runtime_exceeded(rq, dl_se) || dl_se->dl_yielded) {
>>  dl_se->dl_throttled = 1;
>>  __dequeue_task_dl(rq, curr, 0);
>>  if (unlikely(dl_se->dl_boosted
>> || !start_dl_timer(curr)))

Re: [PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline

2017-05-12 Thread Xunlei Pang

On 05/12/2017 at 03:01 PM, luca abeni wrote:
> Hi,
>
> On Fri, 12 May 2017 14:53:33 +0800
> Xunlei Pang  wrote:
>
>> On 05/12/2017 at 01:57 PM, luca abeni wrote:
>>> Hi again,
>>>
>>> (sorry for the previous email; I replied from gmail and I did not
>>> realize I was sending it in html).
>>>
>>>
>>> On Fri, 12 May 2017 11:32:08 +0800
>>> Xunlei Pang  wrote:
>>>  
>>>> dl_runtime_exceeded() only checks negative runtime, actually
>>>> when the current deadline past, we should start a new period
>>>> and zero out the remaining runtime as well.  
>>> In this case, I think global EDF wants to allow the task to run with
>>> its remaining runtime even also missing a deadline, so I think this
>>> change is not correct.
>>> (when using global EDF, tasks scheduled on multiple CPUs can miss
>>> their deadlines... Setting the runtime to 0 as soon as a deadline
>>> is missed would break global EDF scheduling)  
>> Hi Luca,
>>
>> Thanks for the comment, looks like I neglected the theoretical
>> analysis.
>>
>> Cited from Documentation/scheduler/sched-deadline.txt:
>>  "As a matter of fact, in this case it is possible to provide an
>> upper bound for tardiness (defined as the maximum between 0 and the
>> difference between the finishing time of a job and its absolute
>> deadline). More precisely, it can be proven that using a global EDF
>> scheduler the maximum tardiness of each task is smaller or equal than
>> ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_m
>>  where WCET_max = max{WCET_i} is the maximum WCET,
>> WCET_min=min{WCET_i} is the minimum WCET, and U_max = max{WCET_i/P_i}
>> is the maximum utilization[12]."
>>
>> And
>>  "As seen, enforcing that the total utilization is smaller than M
>> does not guarantee that global EDF schedules the tasks without
>> missing any deadline (in other words, global EDF is not an optimal
>> scheduling algorithm). However, a total utilization smaller than M is
>> enough to guarantee that non real-time tasks are not starved and that
>> the tardiness of real-time tasks has an upper bound[12] (as
>> previously noted). Different bounds on the maximum tardiness
>> experienced by real-time tasks have been developed in various
>> papers[13,14], but the theoretical result that is important for
>> SCHED_DEADLINE is that if the total utilization is smaller or equal
>> than M then the response times of the tasks are limited."
>>
>> Do you mean there is some tardiness allowed in theory(global EDF is
>> not an optimal scheduling algorithm), thus missed deadline is allowed
>> for global EDF?
> Right.
>
> With the admission test currently used by the kernel (sum of
> utilizations <= 1), tasks are guaranteed to have a tardiness smaller
> than a theoretical maximum... But this theoretical maximum can be larger
> than 0.
>
> If you want to strictly respect all of the deadlines, you need a
> stricter admission test (for example, the one based on WCET_max that is
> mentioned above).

Understood.

I think in Patch 3, it is still worthy to add the accounting in 
dl_runtime_exceeded(),
to track the dl scheduling tardiness(after all tardiness is not a good thing) 
like:
if (dl_time_before(dl_se->deadline, rq_clock(rq)) && dl_se->runtime > 0)
++dl_se->nr_underrun_sched;

Maybe changing the name to use "nr_underrun_tardy" is better, large value
does need our attention. What do you think?

Regards,
Xunlei

Re: [PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline

2017-05-14 Thread Xunlei Pang

On 05/13/2017 at 04:58 AM, luca abeni wrote:
> On Fri, 12 May 2017 15:19:55 +0800
> Xunlei Pang  wrote:
> [...]
>>>>  "As seen, enforcing that the total utilization is smaller than M
>>>> does not guarantee that global EDF schedules the tasks without
>>>> missing any deadline (in other words, global EDF is not an optimal
>>>> scheduling algorithm). However, a total utilization smaller than M
>>>> is enough to guarantee that non real-time tasks are not starved
>>>> and that the tardiness of real-time tasks has an upper bound[12]
>>>> (as previously noted). Different bounds on the maximum tardiness
>>>> experienced by real-time tasks have been developed in various
>>>> papers[13,14], but the theoretical result that is important for
>>>> SCHED_DEADLINE is that if the total utilization is smaller or equal
>>>> than M then the response times of the tasks are limited."
>>>>
>>>> Do you mean there is some tardiness allowed in theory(global EDF is
>>>> not an optimal scheduling algorithm), thus missed deadline is
>>>> allowed for global EDF?  
>>> Right.
>>>
>>> With the admission test currently used by the kernel (sum of
>>> utilizations <= 1), tasks are guaranteed to have a tardiness smaller
>>> than a theoretical maximum... But this theoretical maximum can be
>>> larger than 0.
>>>
>>> If you want to strictly respect all of the deadlines, you need a
>>> stricter admission test (for example, the one based on WCET_max
>>> that is mentioned above).  
>> Understood.
>>
>> I think in Patch 3, it is still worthy to add the accounting in
>> dl_runtime_exceeded(), to track the dl scheduling tardiness(after all
>> tardiness is not a good thing) like: if
>> (dl_time_before(dl_se->deadline, rq_clock(rq)) && dl_se->runtime > 0)
>> ++dl_se->nr_underrun_sched;
>>
>> Maybe changing the name to use "nr_underrun_tardy" is better, large
>> value does need our attention. What do you think?
> I do not know, I never used statistics like these...
>
> If there are enough people having a good usecase for these statistics,
> it might be worth adding them, but I do not know other people's
> opinions about this.
>

Hi Luca,

Thanks for the feedback.

I think I can defer the statistics patch after Daniel's "sched/deadline: Use the
revised wakeup rule for suspending constrained dl tasks", since there will
be another underrun case in the fix, let's wait for other's opinions then.

Regards,
Xunlei

Re: linux-next: build warning after merge of the akpm-current tree

2017-05-14 Thread Xunlei Pang

On 05/15/2017 at 09:56 AM, Stephen Rothwell wrote:
> Hi Andrew,
>
> After merging the akpm-current tree, today's linux-next build (arm
> multi_v7_defconfig) produced this warning:
>
> In file included from include/asm-generic/bug.h:15:0,
>  from arch/arm/include/asm/bug.h:59,
>  from include/linux/bug.h:4,
>  from include/linux/elfcore.h:5,
>  from include/linux/crash_core.h:5,
>  from kernel/crash_core.c:9:
> kernel/crash_core.c: In function 'vmcoreinfo_append_str':
> include/linux/kernel.h:757:16: warning: comparison of distinct pointer types 
> lacks a cast
>   (void) (&min1 == &min2);   \
> ^
> include/linux/kernel.h:760:2: note: in expansion of macro '__min'
>   __min(typeof(x), typeof(y),   \
>   ^
> kernel/crash_core.c:360:6: note: in expansion of macro 'min'
>   r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
>   ^
>
> Introduced by commit
>
>   fc7d2b44367f ("powerpc/fadump: use the correct VMCOREINFO_NOTE_SIZE for 
> phdr")
>

Hi Stephen/Andrew,

Sorry for the trouble.

The following patch will fix it, do you want to me to send it out separately? 
or help merge it into
fc7d2b44367f ("powerpc/fadump: use the correct VMCOREINFO_NOTE_SIZE for phdr") 
directly?

Thanks,
Xunlei

===

From: Xunlei Pang 
Subject: [PATCH] crash: Fix build warning

linux-next build (arm multi_v7_defconfig) produced this warning:

In file included from include/asm-generic/bug.h:15:0,
 from arch/arm/include/asm/bug.h:59,
 from include/linux/bug.h:4,
 from include/linux/elfcore.h:5,
 from include/linux/crash_core.h:5,
 from kernel/crash_core.c:9:
kernel/crash_core.c: In function 'vmcoreinfo_append_str':
include/linux/kernel.h:757:16: warning: comparison of distinct pointer
types lacks a cast
  (void) (&min1 == &min2);   \
^
include/linux/kernel.h:760:2: note: in expansion of macro '__min'
  __min(typeof(x), typeof(y),   \
  ^
kernel/crash_core.c:360:6: note: in expansion of macro 'min'
  r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
  ^

This patch fixes it.

Signed-off-by: Xunlei Pang 
---
 kernel/crash_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 4a4a4ba..6db80fc 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -357,7 +357,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
 r = vscnprintf(buf, sizeof(buf), fmt, args);
 va_end(args);
 
-r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
+r = min(r, (size_t)VMCOREINFO_BYTES - vmcoreinfo_size);
 
 memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
 
-- 
1.8.3.1

Re: [PATCH v2 1/3] sched/deadline: Zero out positive runtime after throttling constrained tasks

2017-05-14 Thread Xunlei Pang

On 05/12/2017 at 11:32 AM, Xunlei Pang wrote:
> When a contrained task is throttled by dl_check_constrained_dl(),
> it may carry the remaining positive runtime, as a result when
> dl_task_timer() fires and calls replenish_dl_entity(), it will
> not be replenished correctly due to the positive dl_se->runtime.
>
> This patch assigns its runtime to 0 if positive after throttling.
>
> Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task 
> activated after the deadline)
> Acked-by: Daniel Bristot de Oliveira 
> Signed-off-by: Xunlei Pang 

Hi Peter,

According to the previous discussion with Luca, please ignore the last two 
patches of this series.
Could you please only help review or pick up this one?

Regards,
Xunlei

> ---
>  kernel/sched/deadline.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a2ce590..d3d291e 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -723,6 +723,8 @@ static inline void dl_check_constrained_dl(struct 
> sched_dl_entity *dl_se)
>   if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
>   return;
>   dl_se->dl_throttled = 1;
> + if (dl_se->runtime > 0)
> + dl_se->runtime = 0;
>   }
>  }
>

[PATCH] crash: Fix linux-next build warning

2017-05-14 Thread Xunlei Pang

linux-next build (arm multi_v7_defconfig) produced this warning:

In file included from include/asm-generic/bug.h:15:0,
 from arch/arm/include/asm/bug.h:59,
 from include/linux/bug.h:4,
 from include/linux/elfcore.h:5,
 from include/linux/crash_core.h:5,
 from kernel/crash_core.c:9:
kernel/crash_core.c: In function 'vmcoreinfo_append_str':
include/linux/kernel.h:757:16: warning: comparison of distinct pointer
types lacks a cast
  (void) (&min1 == &min2);   \
^
include/linux/kernel.h:760:2: note: in expansion of macro '__min'
  __min(typeof(x), typeof(y),   \
  ^
kernel/crash_core.c:360:6: note: in expansion of macro 'min'
  r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
  ^

This patch fixes it with an explicit "size_t" type-casting on
VMCOREINFO_BYTES.

Andrew will help merge it into commit fc7d2b44367f ("powerpc/fadump:
use the correct VMCOREINFO_NOTE_SIZE for phdr") before sending it to
stable.

Signed-off-by: Xunlei Pang 
---
 kernel/crash_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 4a4a4ba..6db80fc 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -357,7 +357,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
r = vscnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
 
-   r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
+   r = min(r, (size_t)VMCOREINFO_BYTES - vmcoreinfo_size);
 
memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
 
-- 
1.8.3.1

Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-05 Thread Xunlei Pang

Hi Roman,

On 2018/12/4 AM 2:00, Roman Gushchin wrote:
> On Mon, Dec 03, 2018 at 04:01:17PM +0800, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min other than 0.
>> Apply the same for low.
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/page_counter.c | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  return;
>>  
>>  if (c->min || atomic_long_read(&c->min_usage)) {
>> -if (usage <= c->min)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->min);
> 
> This change makes sense in the combination with the patch 3, but not as a
> standlone "fix". It's not a bug, it's a required thing unless you start 
> scanning
> proportionally to memory.low/min excess.
> 
> Please, reflect this in the commit message. Or, even better, merge it into
> the patch 3.

The more I looked the more I think it's a bug, but anyway I'm fine with
merging it into patch 3 :-)

> 
> Also, please, make sure that cgroup kselftests are passing after your changes.

Sure, will do and send v2. Thanks for your inputs.

[PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-03 Thread Xunlei Pang

When usage exceeds min, min usage should be min other than 0.
Apply the same for low.

Signed-off-by: Xunlei Pang 
---
 mm/page_counter.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
return;
 
if (c->min || atomic_long_read(&c->min_usage)) {
-   if (usage <= c->min)
-   protected = usage;
-   else
-   protected = 0;
-
+   protected = min(usage, c->min);
old_protected = atomic_long_xchg(&c->min_usage, protected);
delta = protected - old_protected;
if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
}
 
if (c->low || atomic_long_read(&c->low_usage)) {
-   if (usage <= c->low)
-   protected = usage;
-   else
-   protected = 0;
-
+   protected = min(usage, c->low);
old_protected = atomic_long_xchg(&c->low_usage, protected);
delta = protected - old_protected;
if (delta)
-- 
2.13.5 (Apple Git-94)

[PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang

There may be cgroup memory overcommitment, it will become
even common in the future.

Let's enable kswapd to reclaim low-protected memory in case
of memory pressure, to mitigate the global direct reclaim
pressures which could cause jitters to the response time of
lantency-sensitive groups.

Signed-off-by: Xunlei Pang 
---
 mm/vmscan.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..3d412eb91f73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
 
count_vm_event(PAGEOUTRUN);
 
+retry:
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
@@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
sc.priority--;
} while (sc.priority >= 1);
 
+   if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
+   sc.priority = DEF_PRIORITY;
+   sc.memcg_low_reclaim = 1;
+   sc.memcg_low_skipped = 0;
+   goto retry;
+   }
+
if (!sc.nr_reclaimed)
pgdat->kswapd_failures++;
 
-- 
2.13.5 (Apple Git-94)

[PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection

2018-12-03 Thread Xunlei Pang

When memcgs get reclaimed after its usage exceeds min, some
usages below the min may also be reclaimed in the current
implementation, the amount is considerably large during kswapd
reclaim according to my ftrace results.

This patch calculates the part over hard protection limit,
and allows only this part of usages to be reclaimed.

Signed-off-by: Xunlei Pang 
---
 include/linux/memcontrol.h |  7 +--
 mm/memcontrol.c|  9 +++--
 mm/vmscan.c| 17 +++--
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab2120155a4..637ef975792f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -334,7 +334,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-   struct mem_cgroup *memcg);
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -818,7 +819,9 @@ static inline void memcg_memory_event_mm(struct mm_struct 
*mm,
 }
 
 static inline enum mem_cgroup_protection mem_cgroup_protected(
-   struct mem_cgroup *root, struct mem_cgroup *memcg)
+   struct mem_cgroup *root,
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess)
 {
return MEMCG_PROT_NONE;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..ca96f68e07a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5694,6 +5694,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * mem_cgroup_protected - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @min_excess: store the number of pages exceeding hard protection
  *
  * WARNING: This function is not stateless! It can only be used as part
  *  of a top-down tree iteration, not for isolated queries.
@@ -5761,7 +5762,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-   struct mem_cgroup *memcg)
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess)
 {
struct mem_cgroup *parent;
unsigned long emin, parent_emin;
@@ -5827,8 +5829,11 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
return MEMCG_PROT_MIN;
else if (usage <= elow)
return MEMCG_PROT_LOW;
-   else
+   else {
+   if (emin)
+   *min_excess = usage - emin;
return MEMCG_PROT_NONE;
+   }
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3d412eb91f73..e4fa7a2a63d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -66,6 +66,9 @@ struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
 
+   /* How many pages hard protection allows */
+   unsigned long min_excess;
+
/*
 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 * are scanned.
@@ -2503,10 +2506,14 @@ static void shrink_node_memcg(struct pglist_data 
*pgdat, struct mem_cgroup *memc
unsigned long nr_to_scan;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
-   unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+   unsigned long nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;
 
+   nr_to_reclaim = sc->nr_to_reclaim;
+   if (sc->min_excess)
+   nr_to_reclaim = min(nr_to_reclaim, sc->min_excess);
+
get_scan_count(lruvec, memcg, sc, nr, lru_pages);
 
/* Record the original scan target for proportional adjustments later */
@@ -2544,6 +2551,10 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
 
cond_resched();
 
+   /* Abort proportional reclaim when hard protection applies */
+   if (sc->min_excess && nr_reclaimed >= sc->min_excess)
+   break;
+
if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;
 
@@ -2725,8 +2736,9 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;
+   unsigned long excess = 0;
 
-

Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-03 Thread Xunlei Pang

On 2018/12/3 下午7:54, Michal Hocko wrote:
> On Mon 03-12-18 16:01:17, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min other than 0.
>> Apply the same for low.
> 
> Why? What is the actual problem.

children_min_usage tracks the total children usages under min,
it's natural that min should be added into children_min_usage
when above min, I can't image why 0 is added, is there special
history I missed?

See mem_cgroup_protected(), when usage exceeds min, emin is
calculated as "parent_emin*min/children_min_usage".

> 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/page_counter.c | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  return;
>>  
>>  if (c->min || atomic_long_read(&c->min_usage)) {
>> -if (usage <= c->min)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->min);
>>  old_protected = atomic_long_xchg(&c->min_usage, protected);
>>  delta = protected - old_protected;
>>  if (delta)
>> @@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  }
>>  
>>  if (c->low || atomic_long_read(&c->low_usage)) {
>> -if (usage <= c->low)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->low);
>>  old_protected = atomic_long_xchg(&c->low_usage, protected);
>>  delta = protected - old_protected;
>>  if (delta)
>> -- 
>> 2.13.5 (Apple Git-94)
>>
>

Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang

On 2018/12/3 下午7:56, Michal Hocko wrote:
> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>> There may be cgroup memory overcommitment, it will become
>> even common in the future.
>>
>> Let's enable kswapd to reclaim low-protected memory in case
>> of memory pressure, to mitigate the global direct reclaim
>> pressures which could cause jitters to the response time of
>> lantency-sensitive groups.
> 
> Please be more descriptive about the problem you are trying to handle
> here. I haven't actually read the patch but let me emphasise that the
> low limit protection is important isolation tool. And allowing kswapd to
> reclaim protected memcgs is going to break the semantic as it has been
> introduced and designed.

We have two types of memcgs: online groups(important business)
and offline groups(unimportant business). Online groups are
all configured with MAX low protection, while offline groups
are not at all protected(with default 0 low).

When offline groups are overcommitted, the global memory pressure
suffers. This will cause the memory allocations from online groups
constantly go to the slow global direct reclaim in order to reclaim
online's page caches, as kswap is not able to reclaim low-protection
memory. low is not hard limit, it's reasonable to be reclaimed by
kswapd if there's no other reclaimable memory.

> 
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/vmscan.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 62ac0c488624..3d412eb91f73 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
>> int classzone_idx)
>>  
>>  count_vm_event(PAGEOUTRUN);
>>  
>> +retry:
>>  do {
>>  unsigned long nr_reclaimed = sc.nr_reclaimed;
>>  bool raise_priority = true;
>> @@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
>> int classzone_idx)
>>  sc.priority--;
>>  } while (sc.priority >= 1);
>>  
>> +if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
>> +sc.priority = DEF_PRIORITY;
>> +sc.memcg_low_reclaim = 1;
>> +sc.memcg_low_skipped = 0;
>> +goto retry;
>> +}
>> +
>>  if (!sc.nr_reclaimed)
>>  pgdat->kswapd_failures++;
>>  
>> -- 
>> 2.13.5 (Apple Git-94)
>>
>

Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang

On 2018/12/4 AM 1:22, Michal Hocko wrote:
> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>> On 2018/12/3 下午7:56, Michal Hocko wrote:
>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>> There may be cgroup memory overcommitment, it will become
>>>> even common in the future.
>>>>
>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>> of memory pressure, to mitigate the global direct reclaim
>>>> pressures which could cause jitters to the response time of
>>>> lantency-sensitive groups.
>>>
>>> Please be more descriptive about the problem you are trying to handle
>>> here. I haven't actually read the patch but let me emphasise that the
>>> low limit protection is important isolation tool. And allowing kswapd to
>>> reclaim protected memcgs is going to break the semantic as it has been
>>> introduced and designed.
>>
>> We have two types of memcgs: online groups(important business)
>> and offline groups(unimportant business). Online groups are
>> all configured with MAX low protection, while offline groups
>> are not at all protected(with default 0 low).
>>
>> When offline groups are overcommitted, the global memory pressure
>> suffers. This will cause the memory allocations from online groups
>> constantly go to the slow global direct reclaim in order to reclaim
>> online's page caches, as kswap is not able to reclaim low-protection
>> memory. low is not hard limit, it's reasonable to be reclaimed by
>> kswapd if there's no other reclaimable memory.
> 
> I am sorry I still do not follow. What role do offline cgroups play.
> Those are certainly not low mem protected because mem_cgroup_css_offline
> will reset them to 0.
> 

Oh, I meant "offline groups" to be "offline-business groups", memcgs
refered to here are all "online state" from kernel's perspective.

Re: [PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection

2018-12-03 Thread Xunlei Pang

On 2018/12/3 PM 7:57, Michal Hocko wrote:
> On Mon 03-12-18 16:01:19, Xunlei Pang wrote:
>> When memcgs get reclaimed after its usage exceeds min, some
>> usages below the min may also be reclaimed in the current
>> implementation, the amount is considerably large during kswapd
>> reclaim according to my ftrace results.
> 
> And here again. Describe the setup and the behavior please?
> 

step 1
mkdir -p /sys/fs/cgroup/memory/online
cd /sys/fs/cgroup/memory/online
echo 512M > memory.max
echo 40960 > memory.min
echo $$ > tasks
dd if=/dev/sda of=/dev/null


while true; do sleep 1; cat memory.current ; cat memory.min; done


step 2
create global memory pressure by allocating annoymous and cached
pages to constantly trigger kswap: dd if=/dev/sdb of=/dev/null

step 3
Then observe "online" groups, hundreds of kbytes a little over
memory.min can cause tens of MiB to be reclaimed by kswapd.

Here is one of test results I got:
cat memory.current; cat memory.min; echo;
409485312   // current
40960   // min

385052672   // See current got over reclaimed for 23MB
40960   // min

Its corresponding ftrace output I monitored:
kswapd_0-281   [000]    304.706632: shrink_node_memcg:
min_excess=24, nr_reclaimed=6013, sc->nr_to_reclaim=147, exceeds
5989pages

Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-04 Thread Xunlei Pang

On 2018/12/4 PM 3:25, Michal Hocko wrote:
> On Tue 04-12-18 10:40:29, Xunlei Pang wrote:
>> On 2018/12/4 AM 1:22, Michal Hocko wrote:
>>> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>>>> On 2018/12/3 下午7:56, Michal Hocko wrote:
>>>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>>>> There may be cgroup memory overcommitment, it will become
>>>>>> even common in the future.
>>>>>>
>>>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>>>> of memory pressure, to mitigate the global direct reclaim
>>>>>> pressures which could cause jitters to the response time of
>>>>>> lantency-sensitive groups.
>>>>>
>>>>> Please be more descriptive about the problem you are trying to handle
>>>>> here. I haven't actually read the patch but let me emphasise that the
>>>>> low limit protection is important isolation tool. And allowing kswapd to
>>>>> reclaim protected memcgs is going to break the semantic as it has been
>>>>> introduced and designed.
>>>>
>>>> We have two types of memcgs: online groups(important business)
>>>> and offline groups(unimportant business). Online groups are
>>>> all configured with MAX low protection, while offline groups
>>>> are not at all protected(with default 0 low).
>>>>
>>>> When offline groups are overcommitted, the global memory pressure
>>>> suffers. This will cause the memory allocations from online groups
>>>> constantly go to the slow global direct reclaim in order to reclaim
>>>> online's page caches, as kswap is not able to reclaim low-protection
>>>> memory. low is not hard limit, it's reasonable to be reclaimed by
>>>> kswapd if there's no other reclaimable memory.
>>>
>>> I am sorry I still do not follow. What role do offline cgroups play.
>>> Those are certainly not low mem protected because mem_cgroup_css_offline
>>> will reset them to 0.
>>>
>>
>> Oh, I meant "offline groups" to be "offline-business groups", memcgs
>> refered to here are all "online state" from kernel's perspective.
> 
> What is offline-business group? Please try to explain the actual problem
> in much more details and do not let us guess.
> 

Maybe I choosed the wrong word, let me rephase it, and
here is an example.

root 200GB
   /  \
important(100GB)  unimportant(100GB+DYNAMIC)
  / |  \ /  \
docker0 docker1...  normal(100GB) oversold(DYNAMIC)
  /  |  \  / |  \
 j0 j1 ...w0 w1 ...

"DYNAMIC" is controlled by the cluster job scheduler dynamically,
it periodically samples the available system memory(/proc/meminfo
"MemAvailable"), and use part of that to launch oversold jobs
under some special conditions. When "oversold" is active, the
whole system is put under heavy global memory pressure although
memcgs are not.

IOW "DYNAMIC" is primarily borrowed from "dockers" temporarily,
oversold workers will be killed in a timely fashion if "dockers"
needs their memory back suddenly which is rare.

If kswapd doesn't reclaim low-protected memory configured among
"important" dockers, memory allocations from dockers will trap
into global direct reclaim constantly which harms their performance
and response time. The inactive caches from dockers are allowed
to be reclaimed although they are under low-protected(we used a
simple MAX setting), we allow the inactive low-protected memory
to be reclaimed immediately and asynchronously as long as there's
no unprotected reclaimable memory. Its's also friendly to disk IO.

For really latency-sensitive docker, memory.min is supposed to be
used to guarantee its memory QoS.

Thanks

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-08 Thread Xunlei Pang

Hi Peter,

On 7/5/18 9:21 PM, Xunlei Pang wrote:
> On 7/5/18 6:46 PM, Peter Zijlstra wrote:
>> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>>> tick-based whole utime is utime_0, tick-based whole stime
>>> is stime_0, scheduler time is rtime_0. 
>>
>>> For a long time, the process runs mainly in userspace with
>>> run-sleep patterns, and because two different clocks, it
>>> is possible to have the following condition:
>>>   rtime_0 < utime_0 (as with little stime_0)
>>
>> I don't follow... what?
>>
>> Why are you, and why do you think it makes sense to, compare rtime_0
>> against utime_0 ?
>>
>> The [us]time_0 are, per your earlier definition, ticks. They're not an
>> actual measure of time. Do not compare the two, that makes no bloody
>> sense.
>>
> 
> [us]time_0 is task_struct:utime{stime}, I cited directly from
> cputime_adjust(), both in nanoseconds. I assumed "rtime_0 < utime_0"
> here to simple the following proof to help explain the problem we met.
> 

Please see the enclosure for the reproducer cputime_adjust.tgz
(process_top.sh, usr_sys.c):
gcc usr_sys.c -o usr_sys
Firstly, the function consume_sys() in usr_sys.c yields 100% sys
which can be verified as follows:
$ taskset -c 0 ./usr_sys 1
$ ./process_top.sh $(pidof usr_sys)
0.0 usr,  100.0 sys
0.0 usr,  100.0 sys


Tested on my local box on 4.17.0 by executing "taskset -c 0 ./usr_sys",
then executing "./process_top.sh $(pidof usr_sys)" to watch.
1) Before this patch
50.0 usr,  0.0 sys
50.0 usr,  1.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
49.0 usr,  4.0 sys //switch to consume 100% sys, ignore this line
12.0 usr,  88.0 sys
11.0 usr,  89.0 sys
10.0 usr,  90.0 sys
10.0 usr,  90.0 sys
9.0 usr,  91.0 sys
8.0 usr,  91.0 sys

Obviously there were around 10% sys wrongly goes to usr

2) After this patch
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
11.0 usr,  76.0 sys //switch to consume 100% sys, ignore this line
1.0 usr,  100.0 sys
0.0 usr,  100.0 sys
1.0 usr,  100.0 sys
0.0 usr,  100.0 sys
0.0 usr,  100.0 sys

So it displayed the correct result as we expected after this patch.

Thanks,
Xunlei


cputime_adjust.tgz
Description: GNU Zip compressed data

[PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-09 Thread Xunlei Pang

If users access "/proc/pid/stat", the utime and stime ratio in the
current SAMPLE period are excepted, but currently cputime_adjust()
always calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example,  a process runs for a while with "50% usr, 0% sys", then
followed by "100% sys". For later while, the following is excepted:
  0.0 usr,  100.0 sys
but we got:
  10.0 usr,  90.0 sys

This patch uses the accurate ratio in cputime_adjust() to address the
issue. A new task_cputime type field is added in prev_cputime to record
previous task_cputime so that we can get the elapsed times as the accurate
ratio.

Signed-off-by: Xunlei Pang 
---
v1->v2:
 - Rewrite the changelog.

 include/linux/sched.h | 34 
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c| 61 ---
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..9cb76005b638 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime: time spent in user mode, in nanoseconds
+ * @stime: time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+   u64 utime;
+   u64 stime;
+   unsigned long long  sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
  * @stime: time spent in system mode
+ * @cputime: previous task_cputime to calculate utime/stime
  * @lock: protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
u64 utime;
u64 stime;
+   struct task_cputime cputime;
raw_spinlock_t  lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in nanoseconds
- * @stime: time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-   u64 utime;
-   u64 stime;
-   unsigned long long  sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp   utime
 #define prof_exp   stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct 
task_struct *tsk,
atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
prev->utime = prev->stime = 0;
+   prev->cputime.utime = 0;
+   prev->cputime.stime = 0;
+   prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   prev_cputime_clear(prev);
raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
u64 *ut, u64 *st)
 {
-   u64 rtime, stime, utime;
+   u64 rtime_delta, stime_delta, utime_delta;
unsigned long flags;
 
/* Serialize concurrent callers such that we can honour our guarantees 
*/
raw_spin_lock_irqsave(&prev->lock, flags);
-   rti

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-09 Thread Xunlei Pang

Hi Peter,

On 7/9/18 6:48 PM, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 01:52:38PM +0800, Xunlei Pang wrote:
>> Please see the enclosure for the reproducer cputime_adjust.tgz
> 
> No, I'm not going to reverse engineer something if you cannot even
> explain what the problem is.
> 

I rewrote the whole changelog and sent v2, please have a look.

Thanks,
Xunlei

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-26 Thread Xunlei Pang

On 6/22/18 3:15 PM, Xunlei Pang wrote:
> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
> and encountered a problem that user and sys usages are wrongly
> split sometimes.
> 
> Run tasks with some random run-sleep pattern for a long time, and
> when tick-based time and scheduler sum_exec_runtime hugely drifts
> apart(scheduler sum_exec_runtime is less than tick-based time),
> the current implementation of cputime_adjust() will produce less
> sys usage than the actual use after changing to run a different
> workload pattern with high sys. This is because total tick-based
> utime and stime are used to split the total sum_exec_runtime.
> 
> Same problem exists on utime and stime from "/proc//stat".
> 
> [Example]
> Run some random run-sleep patterns for minutes, then change to run
> high sys pattern, and watch.
> 1) standard "top"(which is the correct one):
>4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> 2) our tool parsing utime and stime from "/proc//stat":
>20.5 usr, 78.4 sys
> We can see "20.5 usr" displayed in 2) was incorrect, it recovers
> gradually with time: 9.7 usr, 89.5 sys
> 
High sys probably means there's something abnormal on the kernel
path, it may hide issues, so we should make it fairly reliable.
It can easily hit this problem with our per-cgroup statistics.

Hi Peter, any comment on this patch?

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-27 Thread Xunlei Pang

On 6/26/18 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 08:19:49PM +0800, Xunlei Pang wrote:
>> On 6/22/18 3:15 PM, Xunlei Pang wrote:
>>> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
>>> and encountered a problem that user and sys usages are wrongly
>>> split sometimes.
>>>
>>> Run tasks with some random run-sleep pattern for a long time, and
>>> when tick-based time and scheduler sum_exec_runtime hugely drifts
>>> apart(scheduler sum_exec_runtime is less than tick-based time),
>>> the current implementation of cputime_adjust() will produce less
>>> sys usage than the actual use after changing to run a different
>>> workload pattern with high sys. This is because total tick-based
>>> utime and stime are used to split the total sum_exec_runtime.
>>>
>>> Same problem exists on utime and stime from "/proc//stat".
>>>
>>> [Example]
>>> Run some random run-sleep patterns for minutes, then change to run
>>> high sys pattern, and watch.
>>> 1) standard "top"(which is the correct one):
>>>4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>> 2) our tool parsing utime and stime from "/proc//stat":
>>>20.5 usr, 78.4 sys
>>> We can see "20.5 usr" displayed in 2) was incorrect, it recovers
>>> gradually with time: 9.7 usr, 89.5 sys
>>>
>> High sys probably means there's something abnormal on the kernel
>> path, it may hide issues, so we should make it fairly reliable.
>> It can easily hit this problem with our per-cgroup statistics.
>>
>> Hi Peter, any comment on this patch?
> 
> Well, no, because the Changelog is incomprehensible and the patch
> doesn't really have useful comments, so I'll have to reverse engineer
> the entire thing, and I've just not had time for that.
> 

Let me try the best to describe it again.

There are two types of run time for a process:
1) task_struct::utime, task_struct::stime in ticks.
2) scheduler task_struct::se.sum_exec_runtime(rtime) in ns.

In case of no vtime accounting, the utime/stime fileds of
/proc/pid/stat are calculated by cputime_adjust(), which
splits the precise rtime in the proportion of tick-based
utime and stime.

However cputime_adjust() always does the split using the
total utime/stime of the process, this may cause wrong
splitting in some cases, e.g.

A typical statistic collector accesses "/proc/pid/stat".
1) moment t0
After accessed /proc/pid/stat in t0:
tick-based whole utime is utime_0, tick-based whole stime
is stime_0, scheduler time is rtime_0. The ajusted utime
caculated by cputime_adjust() is autime_0, ajusted stime
is astime_0, so astime_0=rtime_0*stime_0/(utime_0+stime_0).

For a long time, the process runs mainly in userspace with
run-sleep patterns, and because two different clocks, it
is possible to have the following condition:
  rtime_0 < utime_0 (as with little stime_0)

2) moment t1(after dt, i.e. t0+dt)
Then the process suddenly runs 100% sys workload afterwards
lasting "dt", when accessing /proc/pid/stat at t1="t0+dt",
both rtime_0 and stime_0 increase "dt", thus cputime_ajust()
does the calculation for new adjusted astime_1 as follows:
  (rtime_0+dt)*(stime_0+dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt+stime_0*dt+dt*dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt-utime_0*dt)/(utime_0+stime_0+dt) + dt
< rtime_0*stime_0/(utime_0+stime_0+dt) + dt (for rtime_0 < utime_0)
< rtime_0*stime_0/(utime_0+stime_0) + dt
< astime_0+dt

The actual astime_1 should be "astime_0+dt"(as it runs 100%
sys during dt), but the caculated figure by cputime_adjust()
becomes much smaller, as a result the statistics collector
shows less cpu sys usage than the actual one.

That's why we occasionally observed the incorrect cpu usage
described in the changelog:
[Example]
Run some random run-sleep patterns for minutes, then change
to run high sys pattern, and watch.
1) standard "top"(which is the correct one):
   4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) Parsing utime and stime from "/proc//stat":
   20.5 usr, 78.4 sys
We can see "20.5 usr" displayed in 2) was incorrect, it recovers
gradually with time: 9.7 usr, 89.5 sys

Thanks,
Xunlei

[PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-18 Thread Xunlei Pang

I noticed the group frequently got throttled even it consumed
low cpu usage, this caused some jitters on the response time
to some of our business containers enabling cpu quota.

It's very easy to reproduce:
mkdir /sys/fs/cgroup/cpu/test
cd /sys/fs/cgroup/cpu/test
echo 10 > cpu.cfs_quota_us
echo $$ > tasks
then repeat:
cat cpu.stat |grep nr_throttled  // nr_throttled will increase

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after period timer is re-armed.

The global expiration should be advanced accordingly when the
bandwidth period timer is restarted.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f384264e832..bb006e671609 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+   u64 overrun;
+
lockdep_assert_held(&cfs_b->lock);
 
-   if (!cfs_b->period_active) {
-   cfs_b->period_active = 1;
-   hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-   hrtimer_start_expires(&cfs_b->period_timer, 
HRTIMER_MODE_ABS_PINNED);
-   }
+   if (cfs_b->period_active)
+   return;
+
+   cfs_b->period_active = 1;
+   overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+   cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+   cfs_b->expires_seq++;
+   hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-- 
2.14.1.40.g8e62ba1

[PATCH 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-18 Thread Xunlei Pang

The current condition to judge clock drift in expire_cfs_rq_runtime()
is wrong, the two runtime_expires are actually the same when clock
drift happens, so this condtion can never hit. The orginal design was
correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"),
but was changed to be the current one due to its locking issue.

This patch introduces another way, it adds a new field in both structure
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
use them to figure out if clock drift happens(true if they equal).

This fix is also needed by the following patch.

Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
cfs_b->quota/period")
Cc: Ben Segall 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c  | 14 --
 kernel/sched/sched.h |  6 --
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..9f384264e832 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth 
*cfs_b)
now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+   cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
u64 amount = 0, min_amount, expires;
+   int expires_seq;
 
/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4629,6 +4631,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
}
}
expires = cfs_b->runtime_expires;
+   expires_seq = cfs_b->expires_seq;
raw_spin_unlock(&cfs_b->lock);
 
cfs_rq->runtime_remaining += amount;
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * spread between our sched_clock and the one on which runtime was
 * issued.
 */
-   if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+   if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
cfs_rq->runtime_expires = expires;
+   cfs_rq->expires_seq = expires_seq;
+   }
 
return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * has not truly expired.
 *
 * Fortunately we can check determine whether this the case by checking
-* whether the global deadline has advanced. It is valid to compare
-* cfs_b->runtime_expires without any locks since we only care about
-* exact equality, so a partial write will still work.
+* whether the global deadline(cfs_b->expires_seq) has advanced.
 */
-
-   if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+   if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf2361c..e977e04f8daf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
u64 runtime;
s64 hierarchical_quota;
u64 runtime_expires;
+   int expires_seq;
 
-   int idle;
-   int period_active;
+   short   idle;
+   short   period_active;
struct hrtimer  period_timer;
struct hrtimer  slack_timer;
struct list_headthrottled_cfs_rq;
@@ -551,6 +552,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
+   int expires_seq;
u64 runtime_expires;
s64 runtime_remaining;
 
-- 
2.14.1.40.g8e62ba1

Re: [PATCH 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-18 Thread Xunlei Pang

On 6/19/18 2:44 AM, bseg...@google.com wrote:
> Xunlei Pang  writes:
> 
>> The current condition to judge clock drift in expire_cfs_rq_runtime()
>> is wrong, the two runtime_expires are actually the same when clock
>> drift happens, so this condtion can never hit. The orginal design was
>> correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"),
>> but was changed to be the current one due to its locking issue.
>>
>> This patch introduces another way, it adds a new field in both structure
>> cfs_rq and cfs_bandwidth to record the expiration update sequence, and
>> use them to figure out if clock drift happens(true if they equal).
> 
> It might just be simplest to revert the comparison change - if we read a
> torn value, the worst that happens is we extend incorrectly, and that
> is exactly what happens if we just read the old value.
> 
> An extra int isn't exactly the worst thing though, so whichever.

I tried that, it might still consume the old runtime in the worst case,
I choosed this way considering more cfs_b->runtime_expires change in the
2nd patch.

The extra fields added can gurantee the correct control, also it does
not increase the total size of the two structures.

Thanks,
Xunlei

> 
>>
>> This fix is also needed by the following patch.
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
>> cfs_b->quota/period")
>> Cc: Ben Segall 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/fair.c  | 14 --
>>  kernel/sched/sched.h |  6 --
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..9f384264e832 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct 
>> cfs_bandwidth *cfs_b)
>>  now = sched_clock_cpu(smp_processor_id());
>>  cfs_b->runtime = cfs_b->quota;
>>  cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> +cfs_b->expires_seq++;
>>  }
>>  
>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  struct task_group *tg = cfs_rq->tg;
>>  struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>  u64 amount = 0, min_amount, expires;
>> +int expires_seq;
>>  
>>  /* note: this is a positive sum as runtime_remaining <= 0 */
>>  min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4629,6 +4631,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  }
>>  }
>>  expires = cfs_b->runtime_expires;
>> +expires_seq = cfs_b->expires_seq;
>>  raw_spin_unlock(&cfs_b->lock);
>>  
>>  cfs_rq->runtime_remaining += amount;
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * spread between our sched_clock and the one on which runtime was
>>   * issued.
>>   */
>> -if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
>>  cfs_rq->runtime_expires = expires;
>> +cfs_rq->expires_seq = expires_seq;
>> +}
>>  
>>  return cfs_rq->runtime_remaining > 0;
>>  }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * has not truly expired.
>>   *
>>   * Fortunately we can check determine whether this the case by checking
>> - * whether the global deadline has advanced. It is valid to compare
>> - * cfs_b->runtime_expires without any locks since we only care about
>> - * exact equality, so a partial write will still work.
>> + * whether the global deadline(cfs_b->expires_seq) has advanced.
>>   */
>> -
>> -if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> +if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>  /* extend local deadline, drift is bounded above by 2 ticks */
>>  cfs_rq->runtime_expires += TICK_NSEC;
>>  } else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e977e04f8daf 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -334,9 +334,10 @@ struct cfs_bandwidth {
>>  u64 runtime;
>>  s64

Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-18 Thread Xunlei Pang

On 6/19/18 2:58 AM, bseg...@google.com wrote:
> Xunlei Pang  writes:
> 
>> I noticed the group frequently got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very easy to reproduce:
>> mkdir /sys/fs/cgroup/cpu/test
>> cd /sys/fs/cgroup/cpu/test
>> echo 10 > cpu.cfs_quota_us
>> echo $$ > tasks
>> then repeat:
>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
> 
> If this is after the first patch, then this is no longer what should
> happen, and instead it would incorrectly /keep/ old local cfs_rq
> runtime, and not __refill global runtime until the period.

Exactly, I can improve the changelog.

> 
> 
>>
>> The global expiration should be advanced accordingly when the
>> bandwidth period timer is restarted.
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/fair.c | 15 ++-
>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9f384264e832..bb006e671609 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>  
>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  {
>> +u64 overrun;
>> +
>>  lockdep_assert_held(&cfs_b->lock);
>>  
>> -if (!cfs_b->period_active) {
>> -cfs_b->period_active = 1;
>> -hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> -hrtimer_start_expires(&cfs_b->period_timer, 
>> HRTIMER_MODE_ABS_PINNED);
>> -}
>> +if (cfs_b->period_active)
>> +return;
>> +
>> +cfs_b->period_active = 1;
>> +overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> +cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> 
> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
> much like tg_set_cfs_bandwidth

It's a little different, e.g. periods like below:

   Pm
   |
   v
  |-|-|
Pn-1   PnPn+1

Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
"Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
__refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
expires at "Pn+1").

Regards,
Xunlei

> 
>> +cfs_b->expires_seq++;
>> +hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>  }
>>  
>>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-18 Thread Xunlei Pang

On 6/19/18 12:36 PM, Cong Wang wrote:
> On Mon, Jun 18, 2018 at 2:16 AM, Xunlei Pang  wrote:
>> I noticed the group frequently got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very easy to reproduce:
>> mkdir /sys/fs/cgroup/cpu/test
>> cd /sys/fs/cgroup/cpu/test
>> echo 10 > cpu.cfs_quota_us
>> echo $$ > tasks
>> then repeat:
>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>
>> The global expiration should be advanced accordingly when the
>> bandwidth period timer is restarted.
>>
> 
> I observed the same problem and already sent some patches:
> 
> https://lkml.org/lkml/2018/5/22/37
> https://lkml.org/lkml/2018/5/22/38
> https://lkml.org/lkml/2018/5/22/35
> 

Looks they are related to large slice setting and unused slack
under large number of cpus, the issue I described is a little
different.

Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-19 Thread Xunlei Pang

On 6/20/18 1:49 AM, bseg...@google.com wrote:
> Xunlei Pang  writes:
> 
>> On 6/19/18 2:58 AM, bseg...@google.com wrote:
>>> Xunlei Pang  writes:
>>>
>>>> I noticed the group frequently got throttled even it consumed
>>>> low cpu usage, this caused some jitters on the response time
>>>> to some of our business containers enabling cpu quota.
>>>>
>>>> It's very easy to reproduce:
>>>> mkdir /sys/fs/cgroup/cpu/test
>>>> cd /sys/fs/cgroup/cpu/test
>>>> echo 10 > cpu.cfs_quota_us
>>>> echo $$ > tasks
>>>> then repeat:
>>>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>>>
>>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>>
>>> If this is after the first patch, then this is no longer what should
>>> happen, and instead it would incorrectly /keep/ old local cfs_rq
>>> runtime, and not __refill global runtime until the period.
>>
>> Exactly, I can improve the changelog.
>>
>>>
>>>
>>>>
>>>> The global expiration should be advanced accordingly when the
>>>> bandwidth period timer is restarted.
>>>>
>>>> Signed-off-by: Xunlei Pang 
>>>> ---
>>>>  kernel/sched/fair.c | 15 ++-
>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 9f384264e832..bb006e671609 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq 
>>>> *cfs_rq)
>>>>  
>>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>  {
>>>> +  u64 overrun;
>>>> +
>>>>lockdep_assert_held(&cfs_b->lock);
>>>>  
>>>> -  if (!cfs_b->period_active) {
>>>> -  cfs_b->period_active = 1;
>>>> -  hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>> -  hrtimer_start_expires(&cfs_b->period_timer, 
>>>> HRTIMER_MODE_ABS_PINNED);
>>>> -  }
>>>> +  if (cfs_b->period_active)
> 
>>>> +  return;
>>>> +
>>>> +  cfs_b->period_active = 1;
>>>> +  overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>> +  cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>>>
>>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>>> much like tg_set_cfs_bandwidth
>>
>> It's a little different, e.g. periods like below:
>>
>>Pm
>>|
>>v
>>   |-|-|
>> Pn-1   PnPn+1
>>
>> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
>> "Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
>> __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
>> expires at "Pn+1").
>>
>> Regards,
>> Xunlei
>>
> 
> Hmm, yeah, it would wind up out of sync with the hrtimer unless we
> restarted the period boundaries.
> 
> Now that I look at it even more closely, I don't see how you're getting
> problems like that, besides wasting tiny amounts of cpu time in expire:
> we only disable the period timer when there has been a full period with
> no activity, so we should have a full cfs_b->runtime, and a
> runtime_expires from the period that was spent entirely idle (so it
> should always be in the past). But with a working extend-local-deadline
> check like after patch 1 it shouldn't /matter/ what the actual value of
> runtime_expires is, we'll just waste tiny amounts of cpu time
> incrementing runtime_expires every account_cfs_rq_runtime call rather
> than aborting early.
> 
> Actually, I think arguably the real issue may be in patch 1:

Ben,

Thanks for the comments.

Yes, the problematic throttle issue is solved by patch 1, and it's a
different issue in patch 2. It wants to address another stale runtime
issue(I will update the changelog in v2), specifically:
An ever-running cfs_rq tends to retain some runtime(min_cfs_rq_runtime),
it could be even more(sysct

[PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-20 Thread Xunlei Pang

When period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than rq clock and go to the clock
drift logic, wasting needless cpu cycles on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6bb68d52962..f167aca066cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+   u64 overrun;
+
lockdep_assert_held(&cfs_b->lock);
 
-   if (!cfs_b->period_active) {
-   cfs_b->period_active = 1;
-   hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-   hrtimer_start_expires(&cfs_b->period_timer, 
HRTIMER_MODE_ABS_PINNED);
-   }
+   if (cfs_b->period_active)
+   return;
+
+   cfs_b->period_active = 1;
+   overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+   cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+   cfs_b->expires_seq++;
+   hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-- 
2.14.1.40.g8e62ba1

[PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-20 Thread Xunlei Pang

I noticed the group constantly got throttled even it consumed
low cpu usage, this caused some jitters on the response time
to some of our business containers enabling cpu quota.

It's very simple to reproduce:
  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 10 > cpu.cfs_quota_us
  echo $$ > tasks
then repeat:
  cat cpu.stat |grep nr_throttled  // nr_throttled will increase

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after period timer is re-armed.

The current condition to judge clock drift in expire_cfs_rq_runtime()
is wrong, the two runtime_expires are actually the same when clock
drift happens, so this condtion can never hit. The orginal design was
correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"),
but was changed to be the current one due to its locking issue.

This patch introduces another way, it adds a new field in both structures
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
use them to figure out if clock drift happens(true if they equal).

Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
cfs_b->quota/period")
Cc: Ben Segall 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c  | 14 --
 kernel/sched/sched.h |  6 --
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..e6bb68d52962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth 
*cfs_b)
now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+   cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
u64 amount = 0, min_amount, expires;
+   int expires_seq;
 
/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
+   expires_seq = cfs_b->expires_seq;
expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);
 
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * spread between our sched_clock and the one on which runtime was
 * issued.
 */
-   if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+   if (cfs_rq->expires_seq != expires_seq) {
+   cfs_rq->expires_seq = expires_seq;
cfs_rq->runtime_expires = expires;
+   }
 
return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * has not truly expired.
 *
 * Fortunately we can check determine whether this the case by checking
-* whether the global deadline has advanced. It is valid to compare
-* cfs_b->runtime_expires without any locks since we only care about
-* exact equality, so a partial write will still work.
+* whether the global deadline(cfs_b->expires_seq) has advanced.
 */
-
-   if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+   if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf2361c..e977e04f8daf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
u64 runtime;
s64 hierarchical_quota;
u64 runtime_expires;
+   int expires_seq;
 
-   int idle;
-   int period_active;
+   short   idle;
+   short   period_active;
struct hrtimer  period_timer;
struct hrtimer  slack_timer;
struct list_headthrottled_cfs_rq;
@@ -551,6 +552,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
+   int expires_seq;
u64 runtime_expires;
s64 runtime_remaining;
 
-- 
2.14.1.40.g8e62ba1

Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted

2018-06-20 Thread Xunlei Pang

On 6/20/18 2:06 PM, Xunlei Pang wrote:
> On 6/20/18 1:49 AM, bseg...@google.com wrote:
>> Xunlei Pang  writes:
>>
>>> On 6/19/18 2:58 AM, bseg...@google.com wrote:
>>>> Xunlei Pang  writes:
>>>>
>>>>> I noticed the group frequently got throttled even it consumed
>>>>> low cpu usage, this caused some jitters on the response time
>>>>> to some of our business containers enabling cpu quota.
>>>>>
>>>>> It's very easy to reproduce:
>>>>> mkdir /sys/fs/cgroup/cpu/test
>>>>> cd /sys/fs/cgroup/cpu/test
>>>>> echo 10 > cpu.cfs_quota_us
>>>>> echo $$ > tasks
>>>>> then repeat:
>>>>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>>>>
>>>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>>>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>>>
>>>> If this is after the first patch, then this is no longer what should
>>>> happen, and instead it would incorrectly /keep/ old local cfs_rq
>>>> runtime, and not __refill global runtime until the period.
>>>
>>> Exactly, I can improve the changelog.
>>>
>>>>
>>>>
>>>>>
>>>>> The global expiration should be advanced accordingly when the
>>>>> bandwidth period timer is restarted.
>>>>>
>>>>> Signed-off-by: Xunlei Pang 
>>>>> ---
>>>>>  kernel/sched/fair.c | 15 ++-
>>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 9f384264e832..bb006e671609 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq 
>>>>> *cfs_rq)
>>>>>  
>>>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>>  {
>>>>> + u64 overrun;
>>>>> +
>>>>>   lockdep_assert_held(&cfs_b->lock);
>>>>>  
>>>>> - if (!cfs_b->period_active) {
>>>>> - cfs_b->period_active = 1;
>>>>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> - hrtimer_start_expires(&cfs_b->period_timer, 
>>>>> HRTIMER_MODE_ABS_PINNED);
>>>>> - }
>>>>> + if (cfs_b->period_active)
>>
>>>>> + return;
>>>>> +
>>>>> + cfs_b->period_active = 1;
>>>>> + overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> + cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>>>>
>>>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>>>> much like tg_set_cfs_bandwidth
>>>
>>> It's a little different, e.g. periods like below:
>>>
>>>Pm
>>>|
>>>v
>>>   |-|-|
>>> Pn-1   PnPn+1
>>>
>>> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
>>> "Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
>>> __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
>>> expires at "Pn+1").
>>>
>>> Regards,
>>> Xunlei
>>>
>>
>> Hmm, yeah, it would wind up out of sync with the hrtimer unless we
>> restarted the period boundaries.
>>
>> Now that I look at it even more closely, I don't see how you're getting
>> problems like that, besides wasting tiny amounts of cpu time in expire:
>> we only disable the period timer when there has been a full period with
>> no activity, so we should have a full cfs_b->runtime, and a
>> runtime_expires from the period that was spent entirely idle (so it
>> should always be in the past). But with a working extend-local-deadline
>> check like after patch 1 it shouldn't /matter/ what the actual value of
>> runtime_expires is, we'll just waste tiny amounts of cpu time
>> incrementing runtime_expires every account_cfs_rq_runtime call rather
>> than ab

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-03 Thread Xunlei Pang

On 7/2/18 11:21 PM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Jun 26, 2018 at 05:49:08PM +0200, Peter Zijlstra wrote:
>> Well, no, because the Changelog is incomprehensible and the patch
>> doesn't really have useful comments, so I'll have to reverse engineer
>> the entire thing, and I've just not had time for that.
> 
> Just as an additional data point, we also sometimes see artifacts from
> cpu_adjust_time() in the form of per-task user or sys time getting
> stuck for some period (in extreme cases for over a minute) while the
> application isn't doing anything differently.  We're telling the users
> that it's an inherent sampling artifact but it'd be nice to improve it
> if possible without adding noticeable overhead.  No idea whether this
> patch's approach is a good one tho.

The patch has no noticeable overhead except the extra cputime fileds
added into task_struct. We've been running this patch on our servers
for months, looks good till now.

Thanks!

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-05 Thread Xunlei Pang

On 7/5/18 6:46 PM, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>> tick-based whole utime is utime_0, tick-based whole stime
>> is stime_0, scheduler time is rtime_0. 
> 
>> For a long time, the process runs mainly in userspace with
>> run-sleep patterns, and because two different clocks, it
>> is possible to have the following condition:
>>   rtime_0 < utime_0 (as with little stime_0)
> 
> I don't follow... what?
> 
> Why are you, and why do you think it makes sense to, compare rtime_0
> against utime_0 ?
> 
> The [us]time_0 are, per your earlier definition, ticks. They're not an
> actual measure of time. Do not compare the two, that makes no bloody
> sense.
> 

[us]time_0 is task_struct:utime{stime}, I cited directly from
cputime_adjust(), both in nanoseconds. I assumed "rtime_0 < utime_0"
here to simple the following proof to help explain the problem we met.

Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-16 Thread Xunlei Pang

On 7/17/18 1:41 AM, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
>> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
>>> Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Gitweb: 
>>> https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Author: Xunlei Pang 
>>> AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
>>> Committer:  Ingo Molnar 
>>> CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
>>>
>>> sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
>>>
>>> If users access "/proc/pid/stat", the utime and stime ratio in the
>>> current SAMPLE period are excepted, but currently cputime_adjust()
>>> always calculates with the ratio of the WHOLE lifetime of the process.
>>>
>>> This results in inaccurate utime and stime in "/proc/pid/stat". For
>>> example,  a process runs for a while with "50% usr, 0% sys", then
>>> followed by "100% sys". For later while, the following is excepted:
>>>
>>>   0.0 usr,  100.0 sys
>>>
>>> but we get:
>>>
>>>   10.0 usr,  90.0 sys
>>>
>>> This patch uses the accurate ratio in cputime_adjust() to address the
>>> issue. A new 'task_cputime' type field is added in prev_cputime to record
>>> previous 'task_cputime' so that we can get the elapsed times as the accurate
>>> ratio.
>>
>> Ingo, please make this one go away.
> 
> Sure, I've removed it from sched/core.
> 

Hi Ingo, Peter, Frederic,

I captured some runtime data using trace to explain it, hope this can
illustrate the motive behind my patch. Anyone could help improve my
changelog is appreciated if you think so after reading.


Here are the simple trace_printk I added to capture the real data:

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..b65c1f250941 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -611,6 +611,9 @@ void cputime_adjust(struct task_cputime *curr,
struct prev_cputime *prev,
stime = curr->stime;
utime = curr->utime;

+   if (!strncmp(current->comm, "cat", 3))
+   trace_printk("task tick-based utime %lld stime %lld,
scheduler rtime %lld\n", utime, stime, rtime);
+
/*
 * If either stime or utime are 0, assume all runtime is userspace.
 * Once a task gets some ticks, the monotonicy code at 'update:'
@@ -651,9 +654,14 @@ void cputime_adjust(struct task_cputime *curr,
struct prev_cputime *prev,
stime = rtime - utime;
}

+   if (!strncmp(current->comm, "cat", 3))
+   trace_printk("result: old utime %lld stime %lld, new
utime %lld stime %lld\n",
+   prev->utime, prev->stime, utime, stime);
+
prev->stime = stime;
prev->utime = utime;
 out:
*ut = prev->utime;
*st = prev->stime;
raw_spin_unlock_irqrestore(&prev->lock, flags);


Using the reproducer I described in the changelog, it runs for a while
with "50% usr, 0% sys", then followed by "100% sys". A shell script
accesses its /proc/pid/stat at the interval of one second, and got:
50.0 usr,  0.0 sys
51.0 usr,  0.0 sys
50.0 usr,  0.0 sys
...
9.0 usr,  91.0 sys
9.0 usr,  91.0 sys


The trace data corresponds to the last sample period:
trace entry 1:
 cat-20755 [022] d...  1370.106496: cputime_adjust: task
tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
 cat-20755 [022] d...  1370.106497: cputime_adjust: result:
old utime 330729718142 stime 2306983867, new utime 330733635372 stime
2327067254

trace entry 2:
 cat-20773 [005] d...  1371.109825: cputime_adjust: task
tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
 cat-20773 [005] d...  1371.109826: cputime_adjust: result:
old utime 330733635372 stime 2327067254, new utime 330827229702 stime
3236489210

1) expected behaviour
Let's compare the last two trace entries(all the data below is in ns):
task tick-based utime: 36256000->36256700 increased 700
task tick-based stime: 255100  ->354700   increased 99600
scheduler rtime:   333060702626->334063718912 increased 1003016286

The application actually runs almost 100%sys at the moment, we can
use the task tick-based utime and stime increased to double check:
99600/(700+99600) > 99%sys

2) the current cputime_adjust() inaccurate result
But for the current cputime_adjust(), we get the following adjusted
utime and stime increase in this sample period:
adjusted utime: 330733635372->

[PATCH 1/2] rtc/ab8500: set uie_unsupported flag

2014-12-08 Thread Xunlei Pang

Currently, ab8500 doesn't set uie_unsupported of rtc_device, while
it doesn't support UIE, see ab8500_rtc_set_alarm().

Thus, when going through rtc_update_irq_enable()->rtc_timer_enqueue(),
there's a chance it has an alarm timer1 queued before which is going to
fired, so this update timer2 will be queued because it isn't the leftmost
one, which means rtc_timer_enqueue() will return 0.

This will result in two problems:
1) UIE EMUL will not be used.
2) When the alarm timer1 is fired, in rtc_timer_do_work() timer2 will
   fail to set the alarm time, so this rtc will disfunctional due to
   timer2 with the earliest expires in the timerqueue.

So, rtc drivers must set this flag if they don't support UIE.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-ab8500.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/rtc/rtc-ab8500.c b/drivers/rtc/rtc-ab8500.c
index 727e2f5..866e0ef 100644
--- a/drivers/rtc/rtc-ab8500.c
+++ b/drivers/rtc/rtc-ab8500.c
@@ -504,6 +504,8 @@ static int ab8500_rtc_probe(struct platform_device *pdev)
return err;
}
 
+   rtc->uie_unsupported = 1;
+
return 0;
 }
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] rtc: Refine rtc_timer_do_work() to consider other set alarm failures

2014-12-08 Thread Xunlei Pang

rtc_timer_do_work() only judges -ETIME failure of__rtc_set_alarm(),
but doesn't handle other failures like -EIO, -EBUSY, etc.

If there is a failure other than -ETIME, the next rtc_timer will stay
in the timerqueue. Then later rtc_timers will be enqueued directly because
they have a later expires time, so the alarm irq will never be programmed.

When such failures happen, this patch will retry __rtc_set_alarm(), if
still can't program the alarm time, it will remove current rtc_timer from
timerqueue and fetch next one, thus preventing it from affecting other
rtc timers.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/interface.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index af19213..77a1529 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -898,11 +898,24 @@ again:
if (next) {
struct rtc_wkalrm alarm;
int err;
+   int retry = 3;
+
alarm.time = rtc_ktime_to_tm(next->expires);
alarm.enabled = 1;
+reprogram:
err = __rtc_set_alarm(rtc, &alarm);
if (err == -ETIME)
goto again;
+   else if (err) {
+   if (retry-- > 0)
+   goto reprogram;
+
+   timer = container_of(next, struct rtc_timer, node);
+   timerqueue_del(&rtc->timerqueue, &timer->node);
+   timer->enabled = 0;
+   dev_err(&rtc->dev, "__rtc_set_alarm: err=%d\n", err);
+   goto again;
+   }
} else
rtc_alarm_disable(rtc);
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 3/3] time: clocksource: Add a comment to CLOCK_SOURCE_SUSPEND_NONSTOP

2015-01-23 Thread Xunlei Pang

Hi Thomas,

On 23 January 2015 at 05:07, Thomas Gleixner  wrote:
> On Thu, 22 Jan 2015, Xunlei Pang wrote:
>> When doing timekeeping_resume(), if the nonstop clocksource
>> wraps back, "cycle_delta" will miss the wrap time.
>>
>> It's hard to determine the right CLOCKSOURCE_MASK(xxx) or
>> something to add code for inspecting such behavior, and we
>> don't have many existent nonstop clocksources, so just add
>> a comment to indicate that if have this flag set, people
>> are aware that this nonstop clocksource won't wrap back
>> during system suspend/resume.
>
> -ENOPARSE
>
> What has the CLOCKSOURCE_MASK() to do with this and why is the fact
> relevant, that we only have a few suspend_nonstop clock sources?

Before this, I tried to add some code to catch such problem at the
time of registering
the clocksource, like using the CLOCKSOURCE_MASK(), for example 64bit counter
will never wrap for us. But there may be other values like CLOCKSOURCE_MASK(56),
I just can't figure out exactly how to do this judge.

So, I think we can only add a comment to let the developer be aware of
this when registering
nonstop clocksource, that's what I want to express.

>
>> +
>> +/*
>> + * When setting this flag, you're also supposed to mean that it doesn't
>> + * wrap back during system suspend/resume. See timekeeping_resume().
>
> -ENOPARSE
>
> I guess what you want to say here is:
>
> /*
>  * clocksource continues to run during suspend and is guaranteed not to
>  * wrap around during long suspend periods.
>  */
>

Yes, this description is way better :-)

Thanks,
Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/3] time: Don't bother to run rtc_resume() for the nonstop clocksource

2015-01-23 Thread Xunlei Pang

On 23 January 2015 at 02:30, John Stultz  wrote:
> On Thu, Jan 22, 2015 at 4:01 AM, Xunlei Pang  wrote:
>> If a system does not provide a persistent_clock(), the time
>> will be updated on resume by rtc_resume(). With the addition
>> of the non-stop clocksources for suspend timing, those systems
>> set the time on resume in timekeeping_resume(), but may not
>> provide a valid persistent_clock().
>>
>> This results in the rtc_resume() logic thinking no one has set
>> the time and it then will over-write the suspend time again,
>> which is not necessary and only increases clock error.
>>
>> So, fix this for rtc_resume().
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>> v1->v2:
>> Modify according to Thomas' feedback.
>>
>>  drivers/rtc/class.c |  2 +-
>>  include/linux/timekeeping.h |  7 +++
>>  kernel/time/timekeeping.c   | 16 +---
>>  3 files changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c
>> index 472a5ad..6100af5 100644
>> --- a/drivers/rtc/class.c
>> +++ b/drivers/rtc/class.c
>> @@ -102,7 +102,7 @@ static int rtc_resume(struct device *dev)
>> struct timespec64   sleep_time;
>> int err;
>>
>> -   if (has_persistent_clock())
>> +   if (timekeeping_sleeptime_injected())
>> return 0;
>>
>> rtc_hctosys_ret = -ENODEV;
>> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
>> index 9b63d13..2b87c64 100644
>> --- a/include/linux/timekeeping.h
>> +++ b/include/linux/timekeeping.h
>> @@ -238,6 +238,13 @@ extern void getnstime_raw_and_real(struct timespec 
>> *ts_raw,
>>   */
>>  extern bool persistent_clock_exist;
>>  extern int persistent_clock_is_local;
>> +extern bool timekeeping_sleeptime_inject;
>> +
>> +/* Used by rtc_resume() */
>> +static inline bool timekeeping_sleeptime_injected(void)
>> +{
>> +   return timekeeping_sleeptime_inject;
>> +}
>>
>
> Again, there's no reason to make this a static inline, nor the
> sleeptime_inject variable global.
> Just make it a function in timekeeping and provide definition so it
> can be used by the rtc resume.


Hi John,

Thanks for your comments, I've refined it, is this one ok?

---
 drivers/rtc/class.c |  2 +-
 include/linux/timekeeping.h |  1 +
 kernel/time/timekeeping.c   | 32 
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c
index 472a5ad..6100af5 100644
--- a/drivers/rtc/class.c
+++ b/drivers/rtc/class.c
@@ -102,7 +102,7 @@ static int rtc_resume(struct device *dev)
  struct timespec64 sleep_time;
  int err;

- if (has_persistent_clock())
+ if (timekeeping_sleeptime_injected())
  return 0;

  rtc_hctosys_ret = -ENODEV;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 9b63d13..17a460d 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -225,6 +225,7 @@ static inline void timekeeping_clocktai(struct timespec *ts)
 /*
  * RTC specific
  */
+extern bool timekeeping_sleeptime_injected(void);
 extern void timekeeping_inject_sleeptime64(struct timespec64 *delta);

 /*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a93185..b02133e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1125,12 +1125,26 @@ static void
__timekeeping_inject_sleeptime(struct timekeeper *tk,
  tk_debug_account_sleep_time(delta);
 }

+static bool sleeptime_inject;
+
+#if defined(CONFIG_RTC_CLASS) && \
+ defined(CONFIG_PM_SLEEP) && \
+ defined(CONFIG_RTC_HCTOSYS_DEVICE)
+/**
+ * Used by rtc_resume().
+ */
+bool timekeeping_sleeptime_injected(void)
+{
+ return sleeptime_inject;
+}
+
 /**
  * timekeeping_inject_sleeptime64 - Adds suspend interval to
timeekeeping values
  * @delta: pointer to a timespec64 delta value
  *
  * This hook is for architectures that cannot support read_persistent_clock
- * because their RTC/persistent clock is only accessible when irqs are enabled.
+ * because their RTC/persistent clock is only accessible when irqs are enabled,
+ * and also don't have an effective nonstop clocksource.
  *
  * This function should only be called by rtc_resume(), and allows
  * a suspend offset to be injected into the timekeeping values.
@@ -1140,13 +1154,6 @@ void timekeeping_inject_sleeptime64(struct
timespec64 *delta)
  struct timekeeper *tk = &tk_core.timekeeper;
  unsigned long flags;

- /*
- * Make sure we don't set the clock twice, as timekeeping_resume()
- * already did it
- */
- if (has_persistent_clock())
- return;
-
  raw_spin_lock_irqsave(&timekeeper_lock,

Re: [PATCH 1/5] sched/deadline: Modify cpudl::free_cpus to reflect rd->online

2015-01-23 Thread Xunlei Pang

Hi Peter, Juri,

Could you please give some comments on these 5 patches?
Thanks for your time.

Regards,
Xunlei

On 19 January 2015 at 12:49, Xunlei Pang  wrote:
> Currently, cpudl::free_cpus contains all cpus during init, see
> cpudl_init(). When calling cpudl_find(), we have to add rd->span
> to avoid selecting the cpu outside current root domain, because
> cpus_allowed is undependable when performing clustered scheduling
> using the cpuset, see find_later_rq().
>
> This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
> changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
> so we can avoid the rd->span operation when calling cpudl_find()
> in find_later_rq().
>
> Signed-off-by: Xunlei Pang 
> ---
>  kernel/sched/cpudeadline.c | 28 
>  kernel/sched/cpudeadline.h |  2 ++
>  kernel/sched/deadline.c|  5 ++---
>  3 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 539ca3c..fd9d3fb 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -107,7 +107,9 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
> int best_cpu = -1;
> const struct sched_dl_entity *dl_se = &p->dl;
>
> -   if (later_mask && cpumask_and(later_mask, later_mask, cp->free_cpus)) 
> {
> +   if (later_mask &&
> +   cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed) &&
> +   cpumask_and(later_mask, later_mask, cpu_active_mask)) {
> best_cpu = cpumask_any(later_mask);
> goto out;
> } else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
> @@ -186,6 +188,26 @@ out:
>  }
>
>  /*
> + * cpudl_set_freecpu - Set the cpudl.free_cpus
> + * @cp: the cpudl max-heap context
> + * @cpu: rd attached cpu
> + */
> +void cpudl_set_freecpu(struct cpudl *cp, int cpu)
> +{
> +   cpumask_set_cpu(cpu, cp->free_cpus);
> +}
> +
> +/*
> + * cpudl_clear_freecpu - Clear the cpudl.free_cpus
> + * @cp: the cpudl max-heap context
> + * @cpu: rd attached cpu
> + */
> +void cpudl_clear_freecpu(struct cpudl *cp, int cpu)
> +{
> +   cpumask_clear_cpu(cpu, cp->free_cpus);
> +}
> +
> +/*
>   * cpudl_init - initialize the cpudl structure
>   * @cp: the cpudl max-heap context
>   */
> @@ -203,7 +225,7 @@ int cpudl_init(struct cpudl *cp)
> if (!cp->elements)
> return -ENOMEM;
>
> -   if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +   if (!zalloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> kfree(cp->elements);
> return -ENOMEM;
> }
> @@ -211,8 +233,6 @@ int cpudl_init(struct cpudl *cp)
> for_each_possible_cpu(i)
> cp->elements[i].idx = IDX_INVALID;
>
> -   cpumask_setall(cp->free_cpus);
> -
> return 0;
>  }
>
> diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
> index 020039b..1a0a6ef 100644
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -24,6 +24,8 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>struct cpumask *later_mask);
>  void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
>  int cpudl_init(struct cpudl *cp);
> +void cpudl_set_freecpu(struct cpudl *cp, int cpu);
> +void cpudl_clear_freecpu(struct cpudl *cp, int cpu);
>  void cpudl_cleanup(struct cpudl *cp);
>  #endif /* CONFIG_SMP */
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index b52092f..e7b2722 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1165,9 +1165,6 @@ static int find_later_rq(struct task_struct *task)
>  * We have to consider system topology and task affinity
>  * first, then we can look for a suitable cpu.
>  */
> -   cpumask_copy(later_mask, task_rq(task)->rd->span);
> -   cpumask_and(later_mask, later_mask, cpu_active_mask);
> -   cpumask_and(later_mask, later_mask, &task->cpus_allowed);
> best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
> task, later_mask);
> if (best_cpu == -1)
> @@ -1562,6 +1559,7 @@ static void rq_online_dl(struct rq *rq)
> if (rq->dl.overloaded)
> dl_set_overload(rq);
>
> +   cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
> if (rq->dl.dl_nr_running > 0)
> cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 
> 1);
>  }
> @@ -1573,6 +1571,7 @@ static void rq_offline_dl(struct rq *rq)
> dl_clear_overload(rq);
>
> cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
> +   cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
>  }
>
>  void init_sched_dl_class(void)
> --
> 1.9.1
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 6/9] rtc/mxc: Modify rtc_update_alarm() not to touch the alarm time

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

rtc_class_ops's set_alarm() shouldn't deal with the alarm date,
as this is handled in the rtc core.

See rtc_dev_ioctl()'s RTC_ALM_SET and RTC_WKALM_SET cases.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-mxc.c | 22 --
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/drivers/rtc/rtc-mxc.c b/drivers/rtc/rtc-mxc.c
index 3c3f8d1..a7b218f 100644
--- a/drivers/rtc/rtc-mxc.c
+++ b/drivers/rtc/rtc-mxc.c
@@ -173,29 +173,18 @@ static void set_alarm_or_time(struct device *dev, int 
time_alarm, u32 time)
  * This function updates the RTC alarm registers and then clears all the
  * interrupt status bits.
  */
-static int rtc_update_alarm(struct device *dev, struct rtc_time *alrm)
+static void rtc_update_alarm(struct device *dev, struct rtc_time *alrm)
 {
-   struct rtc_time alarm_tm, now_tm;
-   unsigned long now, time;
+   unsigned long time;
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
void __iomem *ioaddr = pdata->ioaddr;
 
-   now = get_alarm_or_time(dev, MXC_RTC_TIME);
-   rtc_time_to_tm(now, &now_tm);
-   alarm_tm.tm_year = now_tm.tm_year;
-   alarm_tm.tm_mon = now_tm.tm_mon;
-   alarm_tm.tm_mday = now_tm.tm_mday;
-   alarm_tm.tm_hour = alrm->tm_hour;
-   alarm_tm.tm_min = alrm->tm_min;
-   alarm_tm.tm_sec = alrm->tm_sec;
-   rtc_tm_to_time(&alarm_tm, &time);
+   rtc_tm_to_time(alrm, &time);
 
/* clear all the interrupt status bits */
writew(readw(ioaddr + RTC_RTCISR), ioaddr + RTC_RTCISR);
set_alarm_or_time(dev, MXC_RTC_ALARM, time);
-
-   return 0;
 }
 
 static void mxc_rtc_irq_enable(struct device *dev, unsigned int bit,
@@ -346,11 +335,8 @@ static int mxc_rtc_set_alarm(struct device *dev, struct 
rtc_wkalrm *alrm)
 {
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
-   int ret;
 
-   ret = rtc_update_alarm(dev, &alrm->time);
-   if (ret)
-   return ret;
+   rtc_update_alarm(dev, &alrm->time);
 
memcpy(&pdata->g_rtc_alarm, &alrm->time, sizeof(struct rtc_time));
mxc_rtc_irq_enable(dev, RTC_ALM_BIT, alrm->enabled);
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 9/9] alpha: change to use rtc_class_ops's set_mmss64()

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

Change alpha_rtc_set_mmss() and remote_set_mmss() to use
rtc_class_ops's set_mmss64().

Signed-off-by: Xunlei Pang 
---
 arch/alpha/kernel/rtc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/kernel/rtc.c b/arch/alpha/kernel/rtc.c
index c8d284d..f535a3f 100644
--- a/arch/alpha/kernel/rtc.c
+++ b/arch/alpha/kernel/rtc.c
@@ -116,7 +116,7 @@ alpha_rtc_set_time(struct device *dev, struct rtc_time *tm)
 }
 
 static int
-alpha_rtc_set_mmss(struct device *dev, unsigned long nowtime)
+alpha_rtc_set_mmss(struct device *dev, time64_t nowtime)
 {
int retval = 0;
int real_seconds, real_minutes, cmos_minutes;
@@ -211,7 +211,7 @@ alpha_rtc_ioctl(struct device *dev, unsigned int cmd, 
unsigned long arg)
 static const struct rtc_class_ops alpha_rtc_ops = {
.read_time = alpha_rtc_read_time,
.set_time = alpha_rtc_set_time,
-   .set_mmss = alpha_rtc_set_mmss,
+   .set_mmss64 = alpha_rtc_set_mmss,
.ioctl = alpha_rtc_ioctl,
 };
 
@@ -276,7 +276,7 @@ do_remote_mmss(void *data)
 }
 
 static int
-remote_set_mmss(struct device *dev, unsigned long now)
+remote_set_mmss(struct device *dev, time64_t now)
 {
union remote_data x;
if (smp_processor_id() != boot_cpuid) {
@@ -290,7 +290,7 @@ remote_set_mmss(struct device *dev, unsigned long now)
 static const struct rtc_class_ops remote_rtc_ops = {
.read_time = remote_read_time,
.set_time = remote_set_time,
-   .set_mmss = remote_set_mmss,
+   .set_mmss64 = remote_set_mmss,
.ioctl = alpha_rtc_ioctl,
 };
 #endif
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 2/9] time: Provide y2106 safe get_seconds() replacement

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

As part of addressing "y2038 problem" for in-kernel uses, this
patch adds safe get_seconds64() using time64_t.

After this patch, get_seconds() is deprecated and all its call sites
will be fixed using get_seconds64(), after that it can be removed.

Signed-off-by: Xunlei Pang 
---
 include/linux/timekeeping.h | 10 +-
 kernel/time/timekeeping.c   |  4 ++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 9b63d13..384d101 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -17,7 +17,7 @@ extern int do_sys_settimeofday(const struct timespec *tv,
 /*
  * Kernel time accessors
  */
-unsigned long get_seconds(void);
+extern time64_t get_seconds64(void);
 struct timespec current_kernel_time(void);
 /* does not take xtime_lock */
 struct timespec __current_kernel_time(void);
@@ -34,6 +34,14 @@ extern time64_t ktime_get_real_seconds(void);
 extern int __getnstimeofday64(struct timespec64 *tv);
 extern void getnstimeofday64(struct timespec64 *tv);
 
+/**
+ * Deprecated. Use get_seconds64().
+ */
+static inline unsigned long get_seconds(void)
+{
+   return (unsigned long)get_seconds64();
+}
+
 #if BITS_PER_LONG == 64
 /**
  * Deprecated. Use do_settimeofday64().
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a93185..ab021a3 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1678,13 +1678,13 @@ void getboottime(struct timespec *ts)
 }
 EXPORT_SYMBOL_GPL(getboottime);
 
-unsigned long get_seconds(void)
+time64_t get_seconds64(void)
 {
struct timekeeper *tk = &tk_core.timekeeper;
 
return tk->xtime_sec;
 }
-EXPORT_SYMBOL(get_seconds);
+EXPORT_SYMBOL(get_seconds64);
 
 struct timespec __current_kernel_time(void)
 {
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 3/9] rtc/test: Update driver to address y2038/y2106 issues

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

This driver has a number of y2038/y2106 issues.

This patch resolves them by:
- Repalce get_seconds() with get_seconds64()
- Replace rtc_time_to_tm() with rtc_time64_to_tm()
- Change test_rtc_set_mmss() to use rtc_class_ops's set_mmss64()

After this patch, this driver should not have any remaining
y2038/y2106 issues.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-test.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/rtc/rtc-test.c b/drivers/rtc/rtc-test.c
index 8f86fa9..056138b 100644
--- a/drivers/rtc/rtc-test.c
+++ b/drivers/rtc/rtc-test.c
@@ -30,13 +30,13 @@ static int test_rtc_set_alarm(struct device *dev,
 static int test_rtc_read_time(struct device *dev,
struct rtc_time *tm)
 {
-   rtc_time_to_tm(get_seconds(), tm);
+   rtc_time64_to_tm(get_seconds64(), tm);
return 0;
 }
 
-static int test_rtc_set_mmss(struct device *dev, unsigned long secs)
+static int test_rtc_set_mmss(struct device *dev, time64_t secs)
 {
-   dev_info(dev, "%s, secs = %lu\n", __func__, secs);
+   dev_info(dev, "%s, secs = %lld\n", __func__, (long long)secs);
return 0;
 }
 
@@ -60,7 +60,7 @@ static const struct rtc_class_ops test_rtc_ops = {
.read_time = test_rtc_read_time,
.read_alarm = test_rtc_read_alarm,
.set_alarm = test_rtc_set_alarm,
-   .set_mmss = test_rtc_set_mmss,
+   .set_mmss64 = test_rtc_set_mmss,
.alarm_irq_enable = test_rtc_alarm_irq_enable,
 };
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 8/9] rtc/mxc: Update driver to address y2038/y2106 issues

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

This driver has a number of y2038/y2106 issues.

This patch resolves them by:
- Replace rtc_time_to_tm() with rtc_time64_to_tm()
- Replace rtc_tm_to_time() with rtc_tm_to_time64()
- Change mxc_rtc_set_mmss() to use rtc_class_ops's set_mmss64()

After this patch, the driver should not have any remaining
y2038/y2106 issues.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-mxc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/rtc/rtc-mxc.c b/drivers/rtc/rtc-mxc.c
index 83cba23..09d422b 100644
--- a/drivers/rtc/rtc-mxc.c
+++ b/drivers/rtc/rtc-mxc.c
@@ -286,7 +286,7 @@ static int mxc_rtc_read_time(struct device *dev, struct 
rtc_time *tm)
 /*
  * This function sets the internal RTC time based on tm in Gregorian date.
  */
-static int mxc_rtc_set_mmss(struct device *dev, unsigned long time)
+static int mxc_rtc_set_mmss(struct device *dev, time64_t time)
 {
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
@@ -297,9 +297,9 @@ static int mxc_rtc_set_mmss(struct device *dev, unsigned 
long time)
if (is_imx1_rtc(pdata)) {
struct rtc_time tm;
 
-   rtc_time_to_tm(time, &tm);
+   rtc_time64_to_tm(time, &tm);
tm.tm_year = 70;
-   rtc_tm_to_time(&tm, &time);
+   time = rtc_tm_to_time64(&tm);
}
 
/* Avoid roll-over from reading the different registers */
@@ -347,7 +347,7 @@ static int mxc_rtc_set_alarm(struct device *dev, struct 
rtc_wkalrm *alrm)
 static struct rtc_class_ops mxc_rtc_ops = {
.release= mxc_rtc_release,
.read_time  = mxc_rtc_read_time,
-   .set_mmss   = mxc_rtc_set_mmss,
+   .set_mmss64 = mxc_rtc_set_mmss,
.read_alarm = mxc_rtc_read_alarm,
.set_alarm  = mxc_rtc_set_alarm,
.alarm_irq_enable   = mxc_rtc_alarm_irq_enable,
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 0/9] Provide y2038/y2106 safe rtc_class_ops.set_mmss64()

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

This patch series relies on a former patchset[1], it provides y2038/y2106
safe rtc_class_ops.set_mmss64(), and converts some possible users of set_mmss()
to use set_mmss64(), in the hope that making these users(i.e. rtc drivers)
y2038/y2106 safe.


The first version of this patchset is: [2].

[1] https://lkml.org/lkml/2014/11/18/218
[2] https://lkml.org/lkml/2014/11/27/341

Xunlei Pang (9):
  rtc: Provide y2038 safe rtc_class_ops.set_mmss() replacement
  time: Provide y2106 safe get_seconds() replacement
  rtc/test: Update driver to address y2038/y2106 issues
  rtc/ab3100: Update driver to address y2038/y2106 issues
  rtc/mc13xxx: Update driver to address y2038/y2106 issues
  rtc/mxc: Modify rtc_update_alarm() not to touch the alarm time
  rtc/mxc: Convert get_alarm_or_time()/set_alarm_or_time() to use
time64_t
  rtc/mxc: Update driver to address y2038/y2106 issues
  alpha: change to use rtc_class_ops's set_mmss64()

 arch/alpha/kernel/rtc.c |  8 +++
 drivers/rtc/interface.c |  9 +++-
 drivers/rtc/rtc-ab3100.c| 55 ++---
 drivers/rtc/rtc-mc13xxx.c   | 32 --
 drivers/rtc/rtc-mxc.c   | 55 +
 drivers/rtc/rtc-test.c  |  8 +++
 drivers/rtc/systohc.c   |  5 -
 include/linux/rtc.h |  1 +
 include/linux/timekeeping.h | 10 -
 kernel/time/timekeeping.c   |  4 ++--
 10 files changed, 93 insertions(+), 94 deletions(-)

-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 1/9] rtc: Provide y2038 safe rtc_class_ops.set_mmss() replacement

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

Currently the rtc_class_op's set_mmss() function takes a 32bit second
value (on 32bit systems), which is problematic for dates past y2038.

This patch provides a safe version named set_mmss64() using y2038 safe
time64_t.

After this patch, set_mmss() is deprecated and all its users will be
fixed to use set_mmss64(), it can be removed when having no users.

Cc: John Stultz 
Cc: Arnd Bergmann 
Signed-off-by: Xunlei Pang 
---
 drivers/rtc/interface.c | 9 -
 drivers/rtc/systohc.c   | 5 -
 include/linux/rtc.h | 1 +
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index 45bfc28ee..fca7882 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -72,7 +72,12 @@ int rtc_set_time(struct rtc_device *rtc, struct rtc_time *tm)
err = -ENODEV;
else if (rtc->ops->set_time)
err = rtc->ops->set_time(rtc->dev.parent, tm);
-   else if (rtc->ops->set_mmss) {
+   else if (rtc->ops->set_mmss64) {
+   time64_t secs64;
+
+   secs64 = rtc_tm_to_time64(tm);
+   err = rtc->ops->set_mmss64(rtc->dev.parent, secs64);
+   } else if (rtc->ops->set_mmss) {
unsigned long secs;
err = rtc_tm_to_time(tm, &secs);
if (err == 0)
@@ -98,6 +103,8 @@ int rtc_set_mmss(struct rtc_device *rtc, unsigned long secs)
 
if (!rtc->ops)
err = -ENODEV;
+   else if (rtc->ops->set_mmss64)
+   err = rtc->ops->set_mmss64(rtc->dev.parent, secs);
else if (rtc->ops->set_mmss)
err = rtc->ops->set_mmss(rtc->dev.parent, secs);
else if (rtc->ops->read_time && rtc->ops->set_time) {
diff --git a/drivers/rtc/systohc.c b/drivers/rtc/systohc.c
index bf3e242..e34a07b 100644
--- a/drivers/rtc/systohc.c
+++ b/drivers/rtc/systohc.c
@@ -35,7 +35,10 @@ int rtc_set_ntp_time(struct timespec now)
if (rtc) {
/* rtc_hctosys exclusively uses UTC, so we call set_time here,
 * not set_mmss. */
-   if (rtc->ops && (rtc->ops->set_time || rtc->ops->set_mmss))
+   if (rtc->ops &&
+   (rtc->ops->set_time ||
+rtc->ops->set_mmss64 ||
+rtc->ops->set_mmss))
err = rtc_set_time(rtc, &tm);
rtc_class_close(rtc);
}
diff --git a/include/linux/rtc.h b/include/linux/rtc.h
index 6d6be09..29093da 100644
--- a/include/linux/rtc.h
+++ b/include/linux/rtc.h
@@ -77,6 +77,7 @@ struct rtc_class_ops {
int (*read_alarm)(struct device *, struct rtc_wkalrm *);
int (*set_alarm)(struct device *, struct rtc_wkalrm *);
int (*proc)(struct device *, struct seq_file *);
+   int (*set_mmss64)(struct device *, time64_t secs);
int (*set_mmss)(struct device *, unsigned long secs);
int (*read_callback)(struct device *, int data);
int (*alarm_irq_enable)(struct device *, unsigned int enabled);
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 7/9] rtc/mxc: Convert get_alarm_or_time()/set_alarm_or_time() to use time64_t

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

We want to convert mxc_rtc_set_mmss() to use rtc_class_ops's set_mmss64(),
but it uses get_alarm_or_time()/set_alarm_or_time() internal interfaces
which are y2038 unsafe.

So here as a separate patch, it converts these two internal interfaces
of "mxc" to use safe time64_t to make some preparations.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-mxc.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/rtc/rtc-mxc.c b/drivers/rtc/rtc-mxc.c
index a7b218f..83cba23 100644
--- a/drivers/rtc/rtc-mxc.c
+++ b/drivers/rtc/rtc-mxc.c
@@ -106,7 +106,7 @@ static inline int is_imx1_rtc(struct rtc_plat_data *data)
  * This function is used to obtain the RTC time or the alarm value in
  * second.
  */
-static u32 get_alarm_or_time(struct device *dev, int time_alarm)
+static time64_t get_alarm_or_time(struct device *dev, int time_alarm)
 {
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
@@ -129,29 +129,28 @@ static u32 get_alarm_or_time(struct device *dev, int 
time_alarm)
hr = hr_min >> 8;
min = hr_min & 0xff;
 
-   return (((day * 24 + hr) * 60) + min) * 60 + sec;
+   return time64_t)day * 24 + hr) * 60) + min) * 60 + sec;
 }
 
 /*
  * This function sets the RTC alarm value or the time value.
  */
-static void set_alarm_or_time(struct device *dev, int time_alarm, u32 time)
+static void set_alarm_or_time(struct device *dev, int time_alarm, time64_t 
time)
 {
-   u32 day, hr, min, sec, temp;
+   u32 tod, day, hr, min, sec, temp;
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
void __iomem *ioaddr = pdata->ioaddr;
 
-   day = time / 86400;
-   time -= day * 86400;
+   day = div_s64_rem(time, 86400, &tod);
 
/* time is within a day now */
-   hr = time / 3600;
-   time -= hr * 3600;
+   hr = tod / 3600;
+   tod -= hr * 3600;
 
/* time is within an hour now */
-   min = time / 60;
-   sec = time - min * 60;
+   min = tod / 60;
+   sec = tod - min * 60;
 
temp = (hr << 8) + min;
 
@@ -175,12 +174,12 @@ static void set_alarm_or_time(struct device *dev, int 
time_alarm, u32 time)
  */
 static void rtc_update_alarm(struct device *dev, struct rtc_time *alrm)
 {
-   unsigned long time;
+   time64_t time;
struct platform_device *pdev = to_platform_device(dev);
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
void __iomem *ioaddr = pdata->ioaddr;
 
-   rtc_tm_to_time(alrm, &time);
+   time = rtc_tm_to_time64(alrm);
 
/* clear all the interrupt status bits */
writew(readw(ioaddr + RTC_RTCISR), ioaddr + RTC_RTCISR);
@@ -272,14 +271,14 @@ static int mxc_rtc_alarm_irq_enable(struct device *dev, 
unsigned int enabled)
  */
 static int mxc_rtc_read_time(struct device *dev, struct rtc_time *tm)
 {
-   u32 val;
+   time64_t val;
 
/* Avoid roll-over from reading the different registers */
do {
val = get_alarm_or_time(dev, MXC_RTC_TIME);
} while (val != get_alarm_or_time(dev, MXC_RTC_TIME));
 
-   rtc_time_to_tm(val, tm);
+   rtc_time64_to_tm(val, tm);
 
return 0;
 }
@@ -322,7 +321,7 @@ static int mxc_rtc_read_alarm(struct device *dev, struct 
rtc_wkalrm *alrm)
struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
void __iomem *ioaddr = pdata->ioaddr;
 
-   rtc_time_to_tm(get_alarm_or_time(dev, MXC_RTC_ALARM), &alrm->time);
+   rtc_time64_to_tm(get_alarm_or_time(dev, MXC_RTC_ALARM), &alrm->time);
alrm->pending = ((readw(ioaddr + RTC_RTCISR) & RTC_ALM_BIT)) ? 1 : 0;
 
return 0;
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 4/9] rtc/ab3100: Update driver to address y2038/y2106 issues

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

This driver has a number of y2038/y2106 issues.

This patch resolves them by:
- Replace rtc_tm_to_time() with rtc_tm_to_time64()
- Replace rtc_time_to_tm() with rtc_time64_to_tm()
- Change ab3100_rtc_set_mmss() to use rtc_class_ops's set_mmss64()

After this patch, the driver should not have any remaining
y2038/y2106 issues.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-ab3100.c | 55 
 1 file changed, 27 insertions(+), 28 deletions(-)

diff --git a/drivers/rtc/rtc-ab3100.c b/drivers/rtc/rtc-ab3100.c
index 1d0340f..9b725c5 100644
--- a/drivers/rtc/rtc-ab3100.c
+++ b/drivers/rtc/rtc-ab3100.c
@@ -43,21 +43,21 @@
 /*
  * RTC clock functions and device struct declaration
  */
-static int ab3100_rtc_set_mmss(struct device *dev, unsigned long secs)
+static int ab3100_rtc_set_mmss(struct device *dev, time64_t secs)
 {
u8 regs[] = {AB3100_TI0, AB3100_TI1, AB3100_TI2,
 AB3100_TI3, AB3100_TI4, AB3100_TI5};
unsigned char buf[6];
-   u64 fat_time = (u64) secs * AB3100_RTC_CLOCK_RATE * 2;
+   u64 hw_counter = secs * AB3100_RTC_CLOCK_RATE * 2;
int err = 0;
int i;
 
-   buf[0] = (fat_time) & 0xFF;
-   buf[1] = (fat_time >> 8) & 0xFF;
-   buf[2] = (fat_time >> 16) & 0xFF;
-   buf[3] = (fat_time >> 24) & 0xFF;
-   buf[4] = (fat_time >> 32) & 0xFF;
-   buf[5] = (fat_time >> 40) & 0xFF;
+   buf[0] = (hw_counter) & 0xFF;
+   buf[1] = (hw_counter >> 8) & 0xFF;
+   buf[2] = (hw_counter >> 16) & 0xFF;
+   buf[3] = (hw_counter >> 24) & 0xFF;
+   buf[4] = (hw_counter >> 32) & 0xFF;
+   buf[5] = (hw_counter >> 40) & 0xFF;
 
for (i = 0; i < 6; i++) {
err = abx500_set_register_interruptible(dev, 0,
@@ -75,7 +75,7 @@ static int ab3100_rtc_set_mmss(struct device *dev, unsigned 
long secs)
 
 static int ab3100_rtc_read_time(struct device *dev, struct rtc_time *tm)
 {
-   unsigned long time;
+   time64_t time;
u8 rtcval;
int err;
 
@@ -88,7 +88,7 @@ static int ab3100_rtc_read_time(struct device *dev, struct 
rtc_time *tm)
dev_info(dev, "clock not set (lost power)");
return -EINVAL;
} else {
-   u64 fat_time;
+   u64 hw_counter;
u8 buf[6];
 
/* Read out time registers */
@@ -98,22 +98,21 @@ static int ab3100_rtc_read_time(struct device *dev, struct 
rtc_time *tm)
if (err != 0)
return err;
 
-   fat_time = ((u64) buf[5] << 40) | ((u64) buf[4] << 32) |
+   hw_counter = ((u64) buf[5] << 40) | ((u64) buf[4] << 32) |
((u64) buf[3] << 24) | ((u64) buf[2] << 16) |
((u64) buf[1] << 8) | (u64) buf[0];
-   time = (unsigned long) (fat_time /
-   (u64) (AB3100_RTC_CLOCK_RATE * 2));
+   time = hw_counter / (u64) (AB3100_RTC_CLOCK_RATE * 2);
}
 
-   rtc_time_to_tm(time, tm);
+   rtc_time64_to_tm(time, tm);
 
return rtc_valid_tm(tm);
 }
 
 static int ab3100_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alarm)
 {
-   unsigned long time;
-   u64 fat_time;
+   time64_t time;
+   u64 hw_counter;
u8 buf[6];
u8 rtcval;
int err;
@@ -134,11 +133,11 @@ static int ab3100_rtc_read_alarm(struct device *dev, 
struct rtc_wkalrm *alarm)
 AB3100_AL0, buf, 4);
if (err)
return err;
-   fat_time = ((u64) buf[3] << 40) | ((u64) buf[2] << 32) |
+   hw_counter = ((u64) buf[3] << 40) | ((u64) buf[2] << 32) |
((u64) buf[1] << 24) | ((u64) buf[0] << 16);
-   time = (unsigned long) (fat_time / (u64) (AB3100_RTC_CLOCK_RATE * 2));
+   time = hw_counter / (u64) (AB3100_RTC_CLOCK_RATE * 2);
 
-   rtc_time_to_tm(time, &alarm->time);
+   rtc_time64_to_tm(time, &alarm->time);
 
return rtc_valid_tm(&alarm->time);
 }
@@ -147,17 +146,17 @@ static int ab3100_rtc_set_alarm(struct device *dev, 
struct rtc_wkalrm *alarm)
 {
u8 regs[] = {AB3100_AL0, AB3100_AL1, AB3100_AL2, AB3100_AL3};
unsigned char buf[4];
-   unsigned long secs;
-   u64 fat_time;
+   time64_t secs;
+   u64 hw_counter;
int err;
int i;
 
-   rtc_tm_to_time(&alarm->time, &secs);
-   fat_time = (u64) secs * AB3100_RTC_CLOCK_RATE * 2;
-   buf[0] = (fat_time >> 16) & 0xFF;
-   buf[1] = (fat_time >> 24) & 0xFF;
-   buf[2] = (fat_time >> 32) & 0xFF;
-   buf[3] = (fat_time >> 40) & 0xFF;
+   secs = rtc_tm_

[RFC PATCH v2 5/9] rtc/mc13xxx: Update driver to address y2038/y2106 issues

2015-01-13 Thread Xunlei Pang

From: Xunlei Pang 

This driver has a number of y2038/y2106 issues.

This patch resolves them by:
- Replace rtc_time_to_tm() with rtc_time64_to_tm()
- Change mc13xxx_rtc_set_mmss() to use rtc_class_ops's set_mmss64()

After this patch, the driver should not have any remaining
y2038/y2106 issues.

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-mc13xxx.c | 32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/rtc/rtc-mc13xxx.c b/drivers/rtc/rtc-mc13xxx.c
index 5bce904b..32df1d8 100644
--- a/drivers/rtc/rtc-mc13xxx.c
+++ b/drivers/rtc/rtc-mc13xxx.c
@@ -83,20 +83,19 @@ static int mc13xxx_rtc_read_time(struct device *dev, struct 
rtc_time *tm)
return ret;
} while (days1 != days2);
 
-   rtc_time_to_tm(days1 * SEC_PER_DAY + seconds, tm);
+   rtc_time64_to_tm((time64_t)days1 * SEC_PER_DAY + seconds, tm);
 
return rtc_valid_tm(tm);
 }
 
-static int mc13xxx_rtc_set_mmss(struct device *dev, unsigned long secs)
+static int mc13xxx_rtc_set_mmss(struct device *dev, time64_t secs)
 {
struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
unsigned int seconds, days;
unsigned int alarmseconds;
int ret;
 
-   seconds = secs % SEC_PER_DAY;
-   days = secs / SEC_PER_DAY;
+   days = div_s64_rem(secs, SEC_PER_DAY, &seconds);
 
mc13xxx_lock(priv->mc13xxx);
 
@@ -159,7 +158,7 @@ static int mc13xxx_rtc_read_alarm(struct device *dev, 
struct rtc_wkalrm *alarm)
 {
struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
unsigned seconds, days;
-   unsigned long s1970;
+   time64_t s1970;
int enabled, pending;
int ret;
 
@@ -189,10 +188,10 @@ out:
alarm->enabled = enabled;
alarm->pending = pending;
 
-   s1970 = days * SEC_PER_DAY + seconds;
+   s1970 = (time64_t)days * SEC_PER_DAY + seconds;
 
-   rtc_time_to_tm(s1970, &alarm->time);
-   dev_dbg(dev, "%s: %lu\n", __func__, s1970);
+   rtc_time64_to_tm(s1970, &alarm->time);
+   dev_dbg(dev, "%s: %lld\n", __func__, (long long)s1970);
 
return 0;
 }
@@ -200,8 +199,8 @@ out:
 static int mc13xxx_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alarm)
 {
struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
-   unsigned long s1970;
-   unsigned seconds, days;
+   time64_t s1970;
+   u32 seconds, days;
int ret;
 
mc13xxx_lock(priv->mc13xxx);
@@ -215,20 +214,17 @@ static int mc13xxx_rtc_set_alarm(struct device *dev, 
struct rtc_wkalrm *alarm)
if (unlikely(ret))
goto out;
 
-   ret = rtc_tm_to_time(&alarm->time, &s1970);
-   if (unlikely(ret))
-   goto out;
+   s1970 = rtc_tm_to_time64(&alarm->time);
 
-   dev_dbg(dev, "%s: o%2.s %lu\n", __func__, alarm->enabled ? "n" : "ff",
-   s1970);
+   dev_dbg(dev, "%s: o%2.s %lld\n", __func__, alarm->enabled ? "n" : "ff",
+   (long long)s1970);
 
ret = mc13xxx_rtc_irq_enable_unlocked(dev, alarm->enabled,
MC13XXX_IRQ_TODA);
if (unlikely(ret))
goto out;
 
-   seconds = s1970 % SEC_PER_DAY;
-   days = s1970 / SEC_PER_DAY;
+   days = div_s64_rem(s1970, SEC_PER_DAY, &seconds);
 
ret = mc13xxx_reg_write(priv->mc13xxx, MC13XXX_RTCDAYA, days);
if (unlikely(ret))
@@ -268,7 +264,7 @@ static irqreturn_t mc13xxx_rtc_update_handler(int irq, void 
*dev)
 
 static const struct rtc_class_ops mc13xxx_rtc_ops = {
.read_time = mc13xxx_rtc_read_time,
-   .set_mmss = mc13xxx_rtc_set_mmss,
+   .set_mmss64 = mc13xxx_rtc_set_mmss,
.read_alarm = mc13xxx_rtc_read_alarm,
.set_alarm = mc13xxx_rtc_set_alarm,
.alarm_irq_enable = mc13xxx_rtc_alarm_irq_enable,
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 2/9] time: Provide y2106 safe get_seconds() replacement

2015-01-14 Thread Xunlei Pang

On 14 January 2015 at 04:42, Thomas Gleixner  wrote:
> On Tue, 13 Jan 2015, Xunlei Pang wrote:
>
>> From: Xunlei Pang 
>>
>> As part of addressing "y2038 problem" for in-kernel uses, this
>> patch adds safe get_seconds64() using time64_t.
>>
>> After this patch, get_seconds() is deprecated and all its call sites
>> will be fixed using get_seconds64(), after that it can be removed.
>
> Why another interface?
>
> We already have ktime_get_real_seconds(). That handles 32bit
> correctly, while your new function does not.
>
> You cannot return a 64bit value unprotected against updates on 32bit,
> unless you want to implement a RNG.

Yes, ktime_get_real_seconds() should be used instead.

Thanks,
Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 3/9] rtc/test: Update driver to address y2038/y2106 issues

2015-01-14 Thread Xunlei Pang

On 14 January 2015 at 00:19, Alessandro Zummo  wrote:
> On Tue, 13 Jan 2015 23:44:51 +0800
> Xunlei Pang  wrote:
>
>> This patch resolves them by:
>> - Repalce get_seconds() with get_seconds64()
>> - Replace rtc_time_to_tm() with rtc_time64_to_tm()
>> - Change test_rtc_set_mmss() to use rtc_class_ops's set_mmss64()
>
>  so we no more have a non time64 test driver?|

Hi Alessandro,

We want to do away with set_mmss() eventually, time64 is the final version,
so I think this would be ok.

Thanks,
Xunlei

>
> --
>
>  Best regards,
>
>  Alessandro Zummo,
>   Tower Technologies - Torino, Italy
>
>   http://www.towertech.it
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 5/9] rtc/mc13xxx: Update driver to address y2038/y2106 issues

2015-01-14 Thread Xunlei Pang

Cc Uwe Kleine-Koenig 

On 13 January 2015 at 23:44, Xunlei Pang  wrote:
> From: Xunlei Pang 
>
> This driver has a number of y2038/y2106 issues.
>
> This patch resolves them by:
> - Replace rtc_time_to_tm() with rtc_time64_to_tm()
> - Change mc13xxx_rtc_set_mmss() to use rtc_class_ops's set_mmss64()
>
> After this patch, the driver should not have any remaining
> y2038/y2106 issues.
>
> Signed-off-by: Xunlei Pang 
> ---
>  drivers/rtc/rtc-mc13xxx.c | 32 ++--
>  1 file changed, 14 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/rtc/rtc-mc13xxx.c b/drivers/rtc/rtc-mc13xxx.c
> index 5bce904b..32df1d8 100644
> --- a/drivers/rtc/rtc-mc13xxx.c
> +++ b/drivers/rtc/rtc-mc13xxx.c
> @@ -83,20 +83,19 @@ static int mc13xxx_rtc_read_time(struct device *dev, 
> struct rtc_time *tm)
> return ret;
> } while (days1 != days2);
>
> -   rtc_time_to_tm(days1 * SEC_PER_DAY + seconds, tm);
> +   rtc_time64_to_tm((time64_t)days1 * SEC_PER_DAY + seconds, tm);
>
> return rtc_valid_tm(tm);
>  }
>
> -static int mc13xxx_rtc_set_mmss(struct device *dev, unsigned long secs)
> +static int mc13xxx_rtc_set_mmss(struct device *dev, time64_t secs)
>  {
> struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
> unsigned int seconds, days;
> unsigned int alarmseconds;
> int ret;
>
> -   seconds = secs % SEC_PER_DAY;
> -   days = secs / SEC_PER_DAY;
> +   days = div_s64_rem(secs, SEC_PER_DAY, &seconds);
>
> mc13xxx_lock(priv->mc13xxx);
>
> @@ -159,7 +158,7 @@ static int mc13xxx_rtc_read_alarm(struct device *dev, 
> struct rtc_wkalrm *alarm)
>  {
> struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
> unsigned seconds, days;
> -   unsigned long s1970;
> +   time64_t s1970;
> int enabled, pending;
> int ret;
>
> @@ -189,10 +188,10 @@ out:
> alarm->enabled = enabled;
> alarm->pending = pending;
>
> -   s1970 = days * SEC_PER_DAY + seconds;
> +   s1970 = (time64_t)days * SEC_PER_DAY + seconds;
>
> -   rtc_time_to_tm(s1970, &alarm->time);
> -   dev_dbg(dev, "%s: %lu\n", __func__, s1970);
> +   rtc_time64_to_tm(s1970, &alarm->time);
> +   dev_dbg(dev, "%s: %lld\n", __func__, (long long)s1970);
>
> return 0;
>  }
> @@ -200,8 +199,8 @@ out:
>  static int mc13xxx_rtc_set_alarm(struct device *dev, struct rtc_wkalrm 
> *alarm)
>  {
> struct mc13xxx_rtc *priv = dev_get_drvdata(dev);
> -   unsigned long s1970;
> -   unsigned seconds, days;
> +   time64_t s1970;
> +   u32 seconds, days;
> int ret;
>
> mc13xxx_lock(priv->mc13xxx);
> @@ -215,20 +214,17 @@ static int mc13xxx_rtc_set_alarm(struct device *dev, 
> struct rtc_wkalrm *alarm)
> if (unlikely(ret))
> goto out;
>
> -   ret = rtc_tm_to_time(&alarm->time, &s1970);
> -   if (unlikely(ret))
> -   goto out;
> +   s1970 = rtc_tm_to_time64(&alarm->time);
>
> -   dev_dbg(dev, "%s: o%2.s %lu\n", __func__, alarm->enabled ? "n" : "ff",
> -   s1970);
> +   dev_dbg(dev, "%s: o%2.s %lld\n", __func__, alarm->enabled ? "n" : 
> "ff",
> +   (long long)s1970);
>
> ret = mc13xxx_rtc_irq_enable_unlocked(dev, alarm->enabled,
> MC13XXX_IRQ_TODA);
> if (unlikely(ret))
> goto out;
>
> -   seconds = s1970 % SEC_PER_DAY;
> -   days = s1970 / SEC_PER_DAY;
> +   days = div_s64_rem(s1970, SEC_PER_DAY, &seconds);
>
> ret = mc13xxx_reg_write(priv->mc13xxx, MC13XXX_RTCDAYA, days);
> if (unlikely(ret))
> @@ -268,7 +264,7 @@ static irqreturn_t mc13xxx_rtc_update_handler(int irq, 
> void *dev)
>
>  static const struct rtc_class_ops mc13xxx_rtc_ops = {
> .read_time = mc13xxx_rtc_read_time,
> -   .set_mmss = mc13xxx_rtc_set_mmss,
> +   .set_mmss64 = mc13xxx_rtc_set_mmss,
> .read_alarm = mc13xxx_rtc_read_alarm,
> .set_alarm = mc13xxx_rtc_set_alarm,
> .alarm_irq_enable = mc13xxx_rtc_alarm_irq_enable,
> --
> 1.9.1
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 6/9] rtc/mxc: Modify rtc_update_alarm() not to touch the alarm time

2015-01-14 Thread Xunlei Pang

Cc Fabio Estevam 

On 13 January 2015 at 23:44, Xunlei Pang  wrote:
> From: Xunlei Pang 
>
> rtc_class_ops's set_alarm() shouldn't deal with the alarm date,
> as this is handled in the rtc core.
>
> See rtc_dev_ioctl()'s RTC_ALM_SET and RTC_WKALM_SET cases.
>
> Signed-off-by: Xunlei Pang 
> ---
>  drivers/rtc/rtc-mxc.c | 22 --
>  1 file changed, 4 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/rtc/rtc-mxc.c b/drivers/rtc/rtc-mxc.c
> index 3c3f8d1..a7b218f 100644
> --- a/drivers/rtc/rtc-mxc.c
> +++ b/drivers/rtc/rtc-mxc.c
> @@ -173,29 +173,18 @@ static void set_alarm_or_time(struct device *dev, int 
> time_alarm, u32 time)
>   * This function updates the RTC alarm registers and then clears all the
>   * interrupt status bits.
>   */
> -static int rtc_update_alarm(struct device *dev, struct rtc_time *alrm)
> +static void rtc_update_alarm(struct device *dev, struct rtc_time *alrm)
>  {
> -   struct rtc_time alarm_tm, now_tm;
> -   unsigned long now, time;
> +   unsigned long time;
> struct platform_device *pdev = to_platform_device(dev);
> struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
> void __iomem *ioaddr = pdata->ioaddr;
>
> -   now = get_alarm_or_time(dev, MXC_RTC_TIME);
> -   rtc_time_to_tm(now, &now_tm);
> -   alarm_tm.tm_year = now_tm.tm_year;
> -   alarm_tm.tm_mon = now_tm.tm_mon;
> -   alarm_tm.tm_mday = now_tm.tm_mday;
> -   alarm_tm.tm_hour = alrm->tm_hour;
> -   alarm_tm.tm_min = alrm->tm_min;
> -   alarm_tm.tm_sec = alrm->tm_sec;
> -   rtc_tm_to_time(&alarm_tm, &time);
> +   rtc_tm_to_time(alrm, &time);
>
> /* clear all the interrupt status bits */
> writew(readw(ioaddr + RTC_RTCISR), ioaddr + RTC_RTCISR);
> set_alarm_or_time(dev, MXC_RTC_ALARM, time);
> -
> -   return 0;
>  }
>
>  static void mxc_rtc_irq_enable(struct device *dev, unsigned int bit,
> @@ -346,11 +335,8 @@ static int mxc_rtc_set_alarm(struct device *dev, struct 
> rtc_wkalrm *alrm)
>  {
> struct platform_device *pdev = to_platform_device(dev);
> struct rtc_plat_data *pdata = platform_get_drvdata(pdev);
> -   int ret;
>
> -   ret = rtc_update_alarm(dev, &alrm->time);
> -   if (ret)
> -   return ret;
> +   rtc_update_alarm(dev, &alrm->time);
>
> memcpy(&pdata->g_rtc_alarm, &alrm->time, sizeof(struct rtc_time));
> mxc_rtc_irq_enable(dev, RTC_ALM_BIT, alrm->enabled);
> --
> 1.9.1
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 3/9] rtc/test: Update driver to address y2038/y2106 issues

2015-01-14 Thread Xunlei Pang

On 14 January 2015 at 23:03, Alessandro Zummo  wrote:
> On Wed, 14 Jan 2015 22:56:15 +0800
> Xunlei Pang  wrote:
>
>> We want to do away with set_mmss() eventually, time64 is the final version,
>> so I think this would be ok.
>
>  We'll get rid of set_mmss() from the test driver later on,
>  after everything else has been converted :)

Ok.

But on the other hand, we will have no test for  set_mmss64(),
because adding the set_mmss64() will make set_mmss() dysfunctional.

Thanks,
Xunlei
>
> --
>
>  Best regards,
>
>  Alessandro Zummo,
>   Tower Technologies - Torino, Italy
>
>   http://www.towertech.it
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 3/9] rtc/test: Update driver to address y2038/y2106 issues

2015-01-16 Thread Xunlei Pang

On 14 January 2015 at 23:39, Alessandro Zummo  wrote:
> On Wed, 14 Jan 2015 23:26:37 +0800
> Xunlei Pang  wrote:
>
>> But on the other hand, we will have no test for  set_mmss64(),
>> because adding the set_mmss64() will make set_mmss() dysfunctional.
>
>  add a module parameter

Hi Alessandro,

Thanks for your advice, how about the following one?

---
 drivers/rtc/rtc-test.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/rtc-test.c b/drivers/rtc/rtc-test.c
index 8f86fa9..3a2da4c 100644
--- a/drivers/rtc/rtc-test.c
+++ b/drivers/rtc/rtc-test.c
@@ -13,6 +13,10 @@
 #include 
 #include 

+static int test_mmss64;
+module_param(test_mmss64, int, 0644);
+MODULE_PARM_DESC(test_mmss64, "Test struct rtc_class_ops.set_mmss64().");
+
 static struct platform_device *test0 = NULL, *test1 = NULL;

 static int test_rtc_read_alarm(struct device *dev,
@@ -30,7 +34,13 @@ static int test_rtc_set_alarm(struct device *dev,
 static int test_rtc_read_time(struct device *dev,
  struct rtc_time *tm)
 {
- rtc_time_to_tm(get_seconds(), tm);
+ rtc_time64_to_tm(ktime_get_real_seconds(), tm);
+ return 0;
+}
+
+static int test_rtc_set_mmss64(struct device *dev, time64_t secs)
+{
+ dev_info(dev, "%s, secs = %lld\n", __func__, (long long)secs);
  return 0;
 }

@@ -55,7 +65,7 @@ static int test_rtc_alarm_irq_enable(struct device
*dev, unsigned int enable)
  return 0;
 }

-static const struct rtc_class_ops test_rtc_ops = {
+static struct rtc_class_ops test_rtc_ops = {
  .proc = test_rtc_proc,
  .read_time = test_rtc_read_time,
  .read_alarm = test_rtc_read_alarm,
@@ -101,6 +111,11 @@ static int test_probe(struct platform_device *plat_dev)
  int err;
  struct rtc_device *rtc;

+ if (test_mmss64) {
+ test_rtc_ops.set_mmss64 = test_rtc_set_mmss64;
+ test_rtc_ops.set_mmss = NULL;
+ }
+
  rtc = devm_rtc_device_register(&plat_dev->dev, "test",
  &test_rtc_ops, THIS_MODULE);
  if (IS_ERR(rtc)) {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/5] sched/rt: Consider deadline tasks in cpupri_find()

2015-01-18 Thread Xunlei Pang

Currently, RT global scheduling doesn't factor deadline
tasks, this may cause some problems.

See a case below:
On a 3 CPU system, CPU0 has one running deadline task,
CPU1 has one running low priority RT task or idle, CPU3
has one running high priority RT task. When another mid
priority RT task is woken on CPU3, it will be pushed to
CPU0(this also disturbs the deadline task on CPU0), while
it is reasonable to put it on CPU1.

This patch eliminates this issue by filtering CPUs that
have runnable deadline tasks, using cpudl->free_cpus in
cpupri_find().

NOTE: We want to make the most use of percpu local_cpu_mask
to save an extra mask allocation, so always passing a non-NULL
lowest_mask to cpupri_find().

Signed-off-by: Xunlei Pang 
---
 kernel/sched/core.c   |  2 ++
 kernel/sched/cpupri.c | 22 +-
 kernel/sched/cpupri.h |  1 +
 kernel/sched/rt.c |  9 +
 4 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ade2958..48c9576 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5652,6 +5652,8 @@ static int init_rootdomain(struct root_domain *rd)
 
if (cpupri_init(&rd->cpupri) != 0)
goto free_rto_mask;
+
+   rd->cpupri.cpudl = &rd->cpudl;
return 0;
 
 free_rto_mask:
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 981fcd7..40b8e81 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include "cpupri.h"
+#include "cpudeadline.h"
 
 /* Convert between a 140 based task->prio, and our 102 based cpupri */
 static int convert_prio(int prio)
@@ -54,7 +55,7 @@ static int convert_prio(int prio)
  * cpupri_find - find the best (lowest-pri) CPU in the system
  * @cp: The cpupri context
  * @p: The task
- * @lowest_mask: A mask to fill in with selected CPUs (or NULL)
+ * @lowest_mask: A mask to fill in with selected CPUs (not NULL)
  *
  * Note: This function returns the recommended CPUs as calculated during the
  * current invocation.  By the time the call returns, the CPUs may have in
@@ -103,24 +104,11 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
if (skip)
continue;
 
-   if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
+   cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);
+   cpumask_and(lowest_mask, lowest_mask, cp->cpudl->free_cpus);
+   if (cpumask_any(lowest_mask) >= nr_cpu_ids)
continue;
 
-   if (lowest_mask) {
-   cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);
-
-   /*
-* We have to ensure that we have at least one bit
-* still set in the array, since the map could have
-* been concurrently emptied between the first and
-* second reads of vec->mask.  If we hit this
-* condition, simply act as though we never hit this
-* priority level and continue on.
-*/
-   if (cpumask_any(lowest_mask) >= nr_cpu_ids)
-   continue;
-   }
-
return 1;
}
 
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 63cbb9c..acd7ccf 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -18,6 +18,7 @@ struct cpupri_vec {
 struct cpupri {
struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
int *cpu_to_pri;
+   struct cpudl *cpudl;
 };
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6725e3c..d28cfa4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1349,14 +1349,17 @@ out:
return cpu;
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
+   struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
+
/*
 * Current can't be migrated, useless to reschedule,
 * let's hope p can move out.
 */
if (rq->curr->nr_cpus_allowed == 1 ||
-   !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+   !cpupri_find(&rq->rd->cpupri, rq->curr, lowest_mask))
return;
 
/*
@@ -1364,7 +1367,7 @@ static void check_preempt_equal_prio(struct rq *rq, 
struct task_struct *p)
 * see if it is pushed or pulled somewhere else.
 */
if (p->nr_cpus_allowed != 1
-   && cpupri_find(&rq->rd->cpupri, p, NULL))
+   && cpupri_find(&rq->rd->cpupri, p, lowest_mask))
return;
 
/*
@@ -1526,8 +1529,6 @@ static struct task_struct 
*pi

[PATCH 2/5] sched/deadline: Remove cpu_active_mask from cpudl_find()

2015-01-18 Thread Xunlei Pang

cpu_active_mask is rarely changeable, so remove this operation
to gain a little performance.

If there is a change in cpu_active_mask, rq_online_dl() and
rq_offline_dl() should take care of it normally, so cpudl::
free_cpus carries enough information for us.

For the rare case(causing a task put onto a dying cpu) which
rq_offline_dl() can't handle timely, then it can be handled
through _cpu_down()->...->multi_cpu_stop()->migration_call()
->migrate_tasks(), preventing the task from hanging on the
dead cpu.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/cpudeadline.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index fd9d3fb..c6acb07 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -108,8 +108,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
const struct sched_dl_entity *dl_se = &p->dl;
 
if (later_mask &&
-   cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed) &&
-   cpumask_and(later_mask, later_mask, cpu_active_mask)) {
+   cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
best_cpu = cpumask_any(later_mask);
goto out;
} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/5] sched/deadline: Fix wrong cpudl_find() in check_preempt_equal_dl()

2015-01-18 Thread Xunlei Pang

In check_preempt_equal_dl(), cpudl_find() is called with a NULL
later_mask, thus cpudl_find() here doesn't check cpudl::free_cpus
at all.

This patch takles this issue by always passing a non-NULL later_mask
to cpudl_find(), thereby fixing this issue.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/cpudeadline.c |  8 +++-
 kernel/sched/deadline.c| 15 +++
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index c6acb07..f331fcf 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -97,7 +97,7 @@ static inline int cpudl_maximum(struct cpudl *cp)
  * cpudl_find - find the best (later-dl) CPU in the system
  * @cp: the cpudl max-heap context
  * @p: the task
- * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ * @later_mask: a mask to fill in with the selected CPUs (not NULL)
  *
  * Returns: int - best CPU (heap maximum if suitable)
  */
@@ -107,15 +107,13 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
int best_cpu = -1;
const struct sched_dl_entity *dl_se = &p->dl;
 
-   if (later_mask &&
-   cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
+   if (cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
best_cpu = cpumask_any(later_mask);
goto out;
} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
best_cpu = cpudl_maximum(cp);
-   if (later_mask)
-   cpumask_set_cpu(best_cpu, later_mask);
+   cpumask_set_cpu(best_cpu, later_mask);
}
 
 out:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e7b2722..82d900f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -943,14 +943,23 @@ out:
return cpu;
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 {
+   struct cpumask *later_mask =
+   this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
+
+   /* Make sure the mask is initialized first */
+   if (unlikely(!later_mask))
+   return;
+
/*
 * Current can't be migrated, useless to reschedule,
 * let's hope p can move out.
 */
if (rq->curr->nr_cpus_allowed == 1 ||
-   cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)
+   cpudl_find(&rq->rd->cpudl, rq->curr, later_mask) == -1)
return;
 
/*
@@ -958,7 +967,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct 
task_struct *p)
 * see if it is pushed or pulled somewhere else.
 */
if (p->nr_cpus_allowed != 1 &&
-   cpudl_find(&rq->rd->cpudl, p, NULL) != -1)
+   cpudl_find(&rq->rd->cpudl, p, later_mask) != -1)
return;
 
resched_curr(rq);
@@ -1145,8 +1154,6 @@ next_node:
return NULL;
 }
 
-static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
-
 static int find_later_rq(struct task_struct *task)
 {
struct sched_domain *sd;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 5/5] sched/rt: Optimize find_lowest_rq() to select a cache hot cpu

2015-01-18 Thread Xunlei Pang

In find_lowest_rq(), if we can't find a wake_affine cpu from
sched_domain, then we can actually determine a cache hot cpu
instead of simply calling "cpumask_any(lowest_mask)" which
always returns the first cpu in the mask.

So, we can determine the cache hot cpu during the interation of
sched_domain() in passing.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/rt.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d28cfa4..e6a42e6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1535,6 +1535,7 @@ static int find_lowest_rq(struct task_struct *task)
struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu  = task_cpu(task);
+   int cachehot_cpu = nr_cpu_ids;
 
/* Make sure the mask is initialized first */
if (unlikely(!lowest_mask))
@@ -1566,8 +1567,12 @@ static int find_lowest_rq(struct task_struct *task)
 
rcu_read_lock();
for_each_domain(cpu, sd) {
+   if (cachehot_cpu >= nr_cpu_ids)
+   cachehot_cpu = cpumask_first_and(lowest_mask,
+  sched_domain_span(sd));
+
if (sd->flags & SD_WAKE_AFFINE) {
-   int best_cpu;
+   int wakeaffine_cpu;
 
/*
 * "this_cpu" is cheaper to preempt than a
@@ -1579,16 +1584,20 @@ static int find_lowest_rq(struct task_struct *task)
return this_cpu;
}
 
-   best_cpu = cpumask_first_and(lowest_mask,
+   wakeaffine_cpu = cpumask_first_and(lowest_mask,
 sched_domain_span(sd));
-   if (best_cpu < nr_cpu_ids) {
+   if (wakeaffine_cpu < nr_cpu_ids) {
rcu_read_unlock();
-   return best_cpu;
+   return wakeaffine_cpu;
}
}
}
rcu_read_unlock();
 
+   /* most likely cache-hot */
+   if (cachehot_cpu < nr_cpu_ids)
+   return cachehot_cpu;
+
/*
 * And finally, if there were no matches within the domains
 * just give the caller *something* to work with from the compatible
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/5] sched/deadline: Modify cpudl::free_cpus to reflect rd->online

2015-01-18 Thread Xunlei Pang

Currently, cpudl::free_cpus contains all cpus during init, see
cpudl_init(). When calling cpudl_find(), we have to add rd->span
to avoid selecting the cpu outside current root domain, because
cpus_allowed is undependable when performing clustered scheduling
using the cpuset, see find_later_rq().

This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
so we can avoid the rd->span operation when calling cpudl_find()
in find_later_rq().

Signed-off-by: Xunlei Pang 
---
 kernel/sched/cpudeadline.c | 28 
 kernel/sched/cpudeadline.h |  2 ++
 kernel/sched/deadline.c|  5 ++---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 539ca3c..fd9d3fb 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -107,7 +107,9 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
int best_cpu = -1;
const struct sched_dl_entity *dl_se = &p->dl;
 
-   if (later_mask && cpumask_and(later_mask, later_mask, cp->free_cpus)) {
+   if (later_mask &&
+   cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed) &&
+   cpumask_and(later_mask, later_mask, cpu_active_mask)) {
best_cpu = cpumask_any(later_mask);
goto out;
} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
@@ -186,6 +188,26 @@ out:
 }
 
 /*
+ * cpudl_set_freecpu - Set the cpudl.free_cpus
+ * @cp: the cpudl max-heap context
+ * @cpu: rd attached cpu
+ */
+void cpudl_set_freecpu(struct cpudl *cp, int cpu)
+{
+   cpumask_set_cpu(cpu, cp->free_cpus);
+}
+
+/*
+ * cpudl_clear_freecpu - Clear the cpudl.free_cpus
+ * @cp: the cpudl max-heap context
+ * @cpu: rd attached cpu
+ */
+void cpudl_clear_freecpu(struct cpudl *cp, int cpu)
+{
+   cpumask_clear_cpu(cpu, cp->free_cpus);
+}
+
+/*
  * cpudl_init - initialize the cpudl structure
  * @cp: the cpudl max-heap context
  */
@@ -203,7 +225,7 @@ int cpudl_init(struct cpudl *cp)
if (!cp->elements)
return -ENOMEM;
 
-   if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+   if (!zalloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
kfree(cp->elements);
return -ENOMEM;
}
@@ -211,8 +233,6 @@ int cpudl_init(struct cpudl *cp)
for_each_possible_cpu(i)
cp->elements[i].idx = IDX_INVALID;
 
-   cpumask_setall(cp->free_cpus);
-
return 0;
 }
 
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index 020039b..1a0a6ef 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -24,6 +24,8 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
   struct cpumask *later_mask);
 void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
 int cpudl_init(struct cpudl *cp);
+void cpudl_set_freecpu(struct cpudl *cp, int cpu);
+void cpudl_clear_freecpu(struct cpudl *cp, int cpu);
 void cpudl_cleanup(struct cpudl *cp);
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b52092f..e7b2722 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1165,9 +1165,6 @@ static int find_later_rq(struct task_struct *task)
 * We have to consider system topology and task affinity
 * first, then we can look for a suitable cpu.
 */
-   cpumask_copy(later_mask, task_rq(task)->rd->span);
-   cpumask_and(later_mask, later_mask, cpu_active_mask);
-   cpumask_and(later_mask, later_mask, &task->cpus_allowed);
best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
task, later_mask);
if (best_cpu == -1)
@@ -1562,6 +1559,7 @@ static void rq_online_dl(struct rq *rq)
if (rq->dl.overloaded)
dl_set_overload(rq);
 
+   cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
if (rq->dl.dl_nr_running > 0)
cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
 }
@@ -1573,6 +1571,7 @@ static void rq_offline_dl(struct rq *rq)
dl_clear_overload(rq);
 
cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
+   cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
 }
 
 void init_sched_dl_class(void)
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] sched/rt: Consider deadline tasks in cpupri_find()

2015-01-28 Thread Xunlei Pang

Hi Steven,

On 28 January 2015 at 07:04, Steven Rostedt  wrote:
> On Tue, 27 Jan 2015 13:58:59 +0100
> Peter Zijlstra  wrote:
>
>
>> Not a bad idea, Cc'ed Steve who likes to look after the RT bits,
>> excessive quoting for him.
>
> Yep I do.
>
>
>> > index ade2958..48c9576 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -5652,6 +5652,8 @@ static int init_rootdomain(struct root_domain *rd)
>> >
>> > if (cpupri_init(&rd->cpupri) != 0)
>> > goto free_rto_mask;
>> > +
>> > +   rd->cpupri.cpudl = &rd->cpudl;
>>
>> This is disgusting though; it breaks the cpuri abstraction. Why not pass
>> in the mask in the one place you actually need it?
>
> Yeah, probably should change cpupri_init() to take the rd->cpudl as a
> parameter.

Ok, thanks for your advice.
I'll refine this patch and send it out separately soon.

Thanks,
Xunlei

>
> Rest looks good (ignoring Peter's other comment that he realized wasn't
> an issue).
>
> -- Steve
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/5] sched/deadline: Fix wrong cpudl_find() in check_preempt_equal_dl()

2015-01-28 Thread Xunlei Pang

Hi Peter,

On 28 January 2015 at 00:47, Peter Zijlstra  wrote:
> On Mon, Jan 19, 2015 at 04:49:38AM +0000, Xunlei Pang wrote:
>> In check_preempt_equal_dl(), cpudl_find() is called with a NULL
>> later_mask, thus cpudl_find() here doesn't check cpudl::free_cpus
>> at all.
>>
>> This patch takles this issue by always passing a non-NULL later_mask
>> to cpudl_find(), thereby fixing this issue.
>
> Fix what issue? Afaict this is an optimization not a fix.

Currently, check_preempt_equal_dl() invokes cpudl_find() with a NULL mask,
so cpudl_find() won't check cpudl::free_cpus. For example, 4 core
system, CPU0~CPU2
are all idle(free of deadline tasks), a deadline task is woken on CPU3
which already
has one running deadline task with the same deadline value, then
cpudl_find() will fail
causing CPU3 holding 2 deadline tasks while other cpus are idle,
obviously it should
be placed on one idle cpu.

Thanks,
Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/5] drivers/rtc/isl1208: Replace deprecated rtc_tm_to_time()

2015-04-15 Thread Xunlei Pang

From: Xunlei Pang 

isl1208_i2c_set_alarm() uses deprecated rtc_tm_to_time(),
which will overflow in year 2106 on 32-bit machines.

This patch solves this by:
 - Replacing rtc_tm_to_time() with rtc_tm_to_time64()

Cc: Herbert Valerio Riedel 
Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-isl1208.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/rtc/rtc-isl1208.c b/drivers/rtc/rtc-isl1208.c
index c3c549d..06113e8 100644
--- a/drivers/rtc/rtc-isl1208.c
+++ b/drivers/rtc/rtc-isl1208.c
@@ -370,19 +370,16 @@ isl1208_i2c_set_alarm(struct i2c_client *client, struct 
rtc_wkalrm *alarm)
struct rtc_time *alarm_tm = &alarm->time;
u8 regs[ISL1208_ALARM_SECTION_LEN] = { 0, };
const int offs = ISL1208_REG_SCA;
-   unsigned long rtc_secs, alarm_secs;
+   time64_t rtc_secs, alarm_secs;
struct rtc_time rtc_tm;
int err, enable;
 
err = isl1208_i2c_read_time(client, &rtc_tm);
if (err)
return err;
-   err = rtc_tm_to_time(&rtc_tm, &rtc_secs);
-   if (err)
-   return err;
-   err = rtc_tm_to_time(alarm_tm, &alarm_secs);
-   if (err)
-   return err;
+
+   rtc_secs = rtc_tm_to_time64(&rtc_tm);
+   alarm_secs = rtc_tm_to_time64(alarm_tm);
 
/* If the alarm time is before the current time disable the alarm */
if (!alarm->enabled || alarm_secs <= rtc_secs)
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/5] drivers/rtc/pcf8563: Replace deprecated rtc_time_to_tm() and rtc_tm_to_time()

2015-04-15 Thread Xunlei Pang

From: Xunlei Pang 

pcf8563_rtc_set_alarm() uses deprecated rtc_tm_to_time()
and rtc_time_to_tm(), which will overflow in year 2106
on 32-bit machines.

This patch solves this by:
 - Replacing rtc_time_to_tm() with rtc_time64_to_tm()
 - Replacing rtc_tm_to_time() with rtc_tm_to_time64()

Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-pcf8563.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/rtc/rtc-pcf8563.c b/drivers/rtc/rtc-pcf8563.c
index 96fb32e..4ce66ad 100644
--- a/drivers/rtc/rtc-pcf8563.c
+++ b/drivers/rtc/rtc-pcf8563.c
@@ -368,13 +368,13 @@ static int pcf8563_rtc_set_alarm(struct device *dev, 
struct rtc_wkalrm *tm)
struct i2c_client *client = to_i2c_client(dev);
unsigned char buf[4];
int err;
-   unsigned long alarm_time;
 
/* The alarm has no seconds, round up to nearest minute */
if (tm->time.tm_sec) {
-   rtc_tm_to_time(&tm->time, &alarm_time);
-   alarm_time += 60-tm->time.tm_sec;
-   rtc_time_to_tm(alarm_time, &tm->time);
+   time64_t alarm_time = rtc_tm_to_time64(&tm->time);
+
+   alarm_time += 60 - tm->time.tm_sec;
+   rtc_time64_to_tm(alarm_time, &tm->time);
}
 
dev_dbg(dev, "%s, min=%d hour=%d wday=%d mday=%d "
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/5] drivers/rtc/sunxi: Replace deprecated rtc_tm_to_time()

2015-04-15 Thread Xunlei Pang

From: Xunlei Pang 

sunxi_rtc_setalarm() uses deprecated rtc_tm_to_time(),
which will overflow in year 2106 on 32-bit machines.

This patch solves this by:
- Replacing rtc_tm_to_time() with rtc_tm_to_time64()

Also remove the unnecessary initial zeroing of some
local variables in sunxi_rtc_setalarm().

Cc: Carlo Caione 
Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-sunxi.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/rtc/rtc-sunxi.c b/drivers/rtc/rtc-sunxi.c
index 6e678fa..7f22753 100644
--- a/drivers/rtc/rtc-sunxi.c
+++ b/drivers/rtc/rtc-sunxi.c
@@ -269,14 +269,13 @@ static int sunxi_rtc_setalarm(struct device *dev, struct 
rtc_wkalrm *wkalrm)
struct sunxi_rtc_dev *chip = dev_get_drvdata(dev);
struct rtc_time *alrm_tm = &wkalrm->time;
struct rtc_time tm_now;
-   u32 alrm = 0;
-   unsigned long time_now = 0;
-   unsigned long time_set = 0;
-   unsigned long time_gap = 0;
-   unsigned long time_gap_day = 0;
-   unsigned long time_gap_hour = 0;
-   unsigned long time_gap_min = 0;
-   int ret = 0;
+   u32 alrm;
+   time64_t time_set, time_now;
+   unsigned long time_gap;
+   unsigned long time_gap_day;
+   unsigned long time_gap_hour;
+   unsigned long time_gap_min;
+   int ret;
 
ret = sunxi_rtc_gettime(dev, &tm_now);
if (ret < 0) {
@@ -284,13 +283,18 @@ static int sunxi_rtc_setalarm(struct device *dev, struct 
rtc_wkalrm *wkalrm)
return -EINVAL;
}
 
-   rtc_tm_to_time(alrm_tm, &time_set);
-   rtc_tm_to_time(&tm_now, &time_now);
+   time_set = rtc_tm_to_time64(alrm_tm);
+   time_now = rtc_tm_to_time64(&tm_now);
if (time_set <= time_now) {
dev_err(dev, "Date to set in the past\n");
return -EINVAL;
}
 
+   if (time_set > time_now + 255 * SEC_IN_DAY) {
+   dev_err(dev, "Day must be in the range 0 - 255\n");
+   return -EINVAL;
+   }
+
time_gap = time_set - time_now;
time_gap_day = time_gap / SEC_IN_DAY;
time_gap -= time_gap_day * SEC_IN_DAY;
@@ -299,11 +303,6 @@ static int sunxi_rtc_setalarm(struct device *dev, struct 
rtc_wkalrm *wkalrm)
time_gap_min = time_gap / SEC_IN_MIN;
time_gap -= time_gap_min * SEC_IN_MIN;
 
-   if (time_gap_day > 255) {
-   dev_err(dev, "Day must be in the range 0 - 255\n");
-   return -EINVAL;
-   }
-
sunxi_rtc_setaie(0, chip);
writel(0, chip->base + SUNXI_ALRM_DHMS);
usleep_range(100, 300);
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/5] drivers/rtc/pl030: Replace deprecated rtc_tm_to_time() and rtc_time_to_tm()

2015-04-15 Thread Xunlei Pang

From: Xunlei Pang 

The driver uses deprecated rtc_tm_to_time() and rtc_time_to_tm(),
which will overflow in year 2106 on 32-bit machines.

This patch solves this by:
 - Replacing rtc_tm_to_time() with rtc_tm_to_time64()
 - Replacing rtc_time_to_tm() with rtc_time64_to_tm()

Cc: Russell King 
Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-pl030.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/rtc/rtc-pl030.c b/drivers/rtc/rtc-pl030.c
index f85a1a9..d874b6e 100644
--- a/drivers/rtc/rtc-pl030.c
+++ b/drivers/rtc/rtc-pl030.c
@@ -39,32 +39,34 @@ static int pl030_read_alarm(struct device *dev, struct 
rtc_wkalrm *alrm)
 {
struct pl030_rtc *rtc = dev_get_drvdata(dev);
 
-   rtc_time_to_tm(readl(rtc->base + RTC_MR), &alrm->time);
+   rtc_time64_to_tm(readl(rtc->base + RTC_MR), &alrm->time);
return 0;
 }
 
 static int pl030_set_alarm(struct device *dev, struct rtc_wkalrm *alrm)
 {
struct pl030_rtc *rtc = dev_get_drvdata(dev);
-   unsigned long time;
+   time64_t time;
int ret;
 
/*
 * At the moment, we can only deal with non-wildcarded alarm times.
 */
ret = rtc_valid_tm(&alrm->time);
-   if (ret == 0)
-   ret = rtc_tm_to_time(&alrm->time, &time);
-   if (ret == 0)
-   writel(time, rtc->base + RTC_MR);
-   return ret;
+   if (ret)
+   return ret;
+
+   time = rtc_tm_to_time64(&alrm->time);
+   writel(time, rtc->base + RTC_MR);
+
+   return 0;
 }
 
 static int pl030_read_time(struct device *dev, struct rtc_time *tm)
 {
struct pl030_rtc *rtc = dev_get_drvdata(dev);
 
-   rtc_time_to_tm(readl(rtc->base + RTC_DR), tm);
+   rtc_time64_to_tm(readl(rtc->base + RTC_DR), tm);
 
return 0;
 }
@@ -80,14 +82,12 @@ static int pl030_read_time(struct device *dev, struct 
rtc_time *tm)
 static int pl030_set_time(struct device *dev, struct rtc_time *tm)
 {
struct pl030_rtc *rtc = dev_get_drvdata(dev);
-   unsigned long time;
-   int ret;
+   time64_t time;
 
-   ret = rtc_tm_to_time(tm, &time);
-   if (ret == 0)
-   writel(time + 1, rtc->base + RTC_LR);
+   time = rtc_tm_to_time64(tm);
+   writel(time + 1, rtc->base + RTC_LR);
 
-   return ret;
+   return 0;
 }
 
 static const struct rtc_class_ops pl030_ops = {
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 5/5] drivers/rtc/sa1100: Replace deprecated rtc_tm_to_time() and rtc_time_to_tm()

2015-04-15 Thread Xunlei Pang

From: Xunlei Pang 

The driver uses deprecated rtc_tm_to_time() and rtc_time_to_tm(),
which will overflow in year 2106 on 32-bit machines.

This patch solves this by:
 - Replacing rtc_tm_to_time() with rtc_tm_to_time64()
 - Replacing rtc_time_to_tm() with rtc_time64_to_tm()

Cc: CIH 
Cc: Nicolas Pitre 
Cc: Andrew Christian 
Cc: Richard Purdie 
Signed-off-by: Xunlei Pang 
---
 drivers/rtc/rtc-sa1100.c | 21 +
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/drivers/rtc/rtc-sa1100.c b/drivers/rtc/rtc-sa1100.c
index b6e1ca0..625a320 100644
--- a/drivers/rtc/rtc-sa1100.c
+++ b/drivers/rtc/rtc-sa1100.c
@@ -157,19 +157,14 @@ static int sa1100_rtc_alarm_irq_enable(struct device 
*dev, unsigned int enabled)
 
 static int sa1100_rtc_read_time(struct device *dev, struct rtc_time *tm)
 {
-   rtc_time_to_tm(RCNR, tm);
+   rtc_time64_to_tm(RCNR, tm);
return 0;
 }
 
 static int sa1100_rtc_set_time(struct device *dev, struct rtc_time *tm)
 {
-   unsigned long time;
-   int ret;
-
-   ret = rtc_tm_to_time(tm, &time);
-   if (ret == 0)
-   RCNR = time;
-   return ret;
+   RCNR = rtc_tm_to_time64(tm);
+   return 0;
 }
 
 static int sa1100_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alrm)
@@ -185,23 +180,17 @@ static int sa1100_rtc_read_alarm(struct device *dev, 
struct rtc_wkalrm *alrm)
 static int sa1100_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alrm)
 {
struct sa1100_rtc *info = dev_get_drvdata(dev);
-   unsigned long time;
-   int ret;
 
spin_lock_irq(&info->lock);
-   ret = rtc_tm_to_time(&alrm->time, &time);
-   if (ret != 0)
-   goto out;
RTSR = RTSR & (RTSR_HZE|RTSR_ALE|RTSR_AL);
-   RTAR = time;
+   RTAR = rtc_tm_to_time64(&alrm->time);
if (alrm->enabled)
RTSR |= RTSR_ALE;
else
RTSR &= ~RTSR_ALE;
-out:
spin_unlock_irq(&info->lock);
 
-   return ret;
+   return 0;
 }
 
 static int sa1100_rtc_proc(struct device *dev, struct seq_file *seq)
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RESEND v4 2/3] sched/rt: Fix wrong SMP scheduler behavior for equal prio cases

2015-03-31 Thread Xunlei Pang

On 27 March 2015 at 23:28, Steven Rostedt  wrote:
> On Mon,  9 Mar 2015 15:32:27 +0800
> Xunlei Pang  wrote:
>
>> From: Xunlei Pang 
>>
>> Currently, SMP RT scheduler has some trouble in dealing with
>> equal prio cases.
>>
>> For example, in check_preempt_equal_prio():
>> When RT1(current task) gets preempted by RT2, if there is a
>> migratable RT3 with same prio, RT3 will be pushed away instead
>> of RT1 afterwards, because RT1 will be enqueued to the tail of
>> the pushable list when going through succeeding put_prev_task_rt()
>> triggered by resched. This broke FIFO.
>>
>> Furthermore, this is also problematic for normal preempted cases
>> if there're some rt tasks queued with the same prio as current.
>> Because current will be put behind these tasks in the pushable
>> queue.
>>
>> So, if a task is running and gets preempted by a higher priority
>> task (or even with same priority for migrating), this patch ensures
>> that it is put ahead of any existing task with the same priority in
>> the pushable queue.
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/rt.c | 23 ---
>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index f4d4b07..86cd79f 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -347,11 +347,15 @@ static inline void set_post_schedule(struct rq *rq)
>>   rq->post_schedule = has_pushable_tasks(rq);
>>  }
>>
>> -static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
>> +static void enqueue_pushable_task(struct rq *rq,
>> + struct task_struct *p, bool head)
>
> Nit.
>
> static void
> enqueue_pushable_task(struct rq *rq, struct task_struct *p, bool head)
>
> Is a better breaking of the line.

I'll adjust it to be one line.

>
>>  {
>>   plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
>>   plist_node_init(&p->pushable_tasks, p->prio);
>> - plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
>> + if (head)
>> + plist_add_head(&p->pushable_tasks, &rq->rt.pushable_tasks);
>> + else
>> + plist_add_tail(&p->pushable_tasks, &rq->rt.pushable_tasks);
>>
>>   /* Update the highest prio pushable task */
>>   if (p->prio < rq->rt.highest_prio.next)
>> @@ -373,7 +377,8 @@ static void dequeue_pushable_task(struct rq *rq, struct 
>> task_struct *p)
>>
>>  #else
>>
>> -static inline void enqueue_pushable_task(struct rq *rq, struct task_struct 
>> *p)
>> +static inline void enqueue_pushable_task(struct rq *rq,
>> + struct task_struct *p, bool head)
>
> Same here.
>
>>  {
>>  }
>>
>> @@ -1248,7 +1253,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, 
>> int flags)
>>   enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
>>
>>   if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
>> - enqueue_pushable_task(rq, p);
>> + enqueue_pushable_task(rq, p, false);
>>  }
>>
>>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>> @@ -1494,8 +1499,12 @@ static void put_prev_task_rt(struct rq *rq, struct 
>> task_struct *p)
>>* The previous task needs to be made eligible for pushing
>>* if it is still active
>>*/
>> - if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
>> - enqueue_pushable_task(rq, p);
>> + if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1) {
>> + if (task_running(rq, p) && (preempt_count() & PREEMPT_ACTIVE))
>
> put_prev_task_rt() is called by put_prev_task() which is called by
> several functions: rt_mutex_setprio(), __sched_setscheduler(),
> sched_setnuma(), migrate_tasks(), and sched_move_task(). It's not part
> of being preempted.
>
> Now it is also called by pick_next_task_rt() which I'm assuming is what
> you want it to affect.
>
> The above definitely needs a comment about what it is doing. Also, I'm
> not so sure we care about testing task_running(). I'm thinking the
> check for PREEMPT_ACTIVE is good enough, as that would only be set from
> being called within preempt_schedule().

Indeed.

>
> Also, we could get rid of the if statement and do:
>
> enqueue_pushable_task(rq, p, !!(preemp

Re: [PATCH] sched/fair: Restore env status before goto redo in load_balance()

2015-03-31 Thread Xunlei Pang

On 27 March 2015 at 23:30, Peter Zijlstra  wrote:
> On Wed, Mar 18, 2015 at 02:31:02PM +0800, Xunlei Pang wrote:
>> From: Xunlei Pang 
>>
>> In load_balance(), some members of lb_env will be assigned with
>> new values in LBF_DST_PINNED case. But lb_env::flags may still
>> retain LBF_ALL_PINNED if no proper tasks were found afterwards
>> due to another balance, task affinity changing, etc, which can
>> really happen because busiest rq lock has already been released.
>
> Sure..
>
>> This is wrong, for example with env.dst_cpu assigned new_dst_cpu
>> when going back to "redo" label, it may cause should_we_balance()
>> to return false which is unreasonable.
>
> Why? You've got a very unlikely, very hard case, its unlikely that
> anything we do will substantially improve the situation, but you make
> the code uglier for it.
>
>> This patch restores proper status of env before "goto redo", and
>> improves "out_all_pinned" and "out_one_pinned" labels.
>
> That doesn't even begin to explain half of what the patch does.
>
>> @@ -6977,12 +6978,19 @@ more_balance:
>>   /* All tasks on this runqueue were pinned by CPU affinity */
>>   if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>   cpumask_clear_cpu(cpu_of(busiest), cpus);
>> - if (!cpumask_empty(cpus)) {
>> - env.loop = 0;
>> - env.loop_break = sched_nr_migrate_break;
>> - goto redo;
>> + if (env.new_dst_cpu != -1) {
>
> I really don't get this, how can this not be?
>
>> + env.new_dst_cpu = -1;
>> + cpumask_or(cpus, cpus,
>> + sched_group_cpus(sd->groups));
>> + cpumask_and(cpus, cpus, cpu_active_mask);
>
> More unexplained magic, why is this right?

When LBF_DST_PINNED was set, after going back to "more_balance",
things may change as the changelog describes, so it can hit
LBF_ALL_PINNED afterwards. Then env.cpus, env.dst_rq, env.dst_cpu held
the values assigned in the LBF_DST_PINNED case which is unreasonable.
When we want to redo, we must reset those values.

>
> The rest of the patch isn't much better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 07/11] sched: get CPU's usage statistic

2015-03-31 Thread Xunlei Pang

Vincent,
On 27 March 2015 at 23:37, Vincent Guittot  wrote:
> On 27 March 2015 at 16:12, Xunlei Pang  wrote:
>>> +static int get_cpu_usage(int cpu)
>>> +{
>>> +   unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>>> +   unsigned long capacity = capacity_orig_of(cpu);
>>> +
>>> +   if (usage >= SCHED_LOAD_SCALE)
>>> +   return capacity;
>>
>> Can "capacity" be greater than SCHED_LOAD_SCALE?
>> Why use SCHED_LOAD_SCALE instead of "capacity" in this judgement?
>
> Yes, SCHED_LOAD_SCALE is the default value but the capacity can be in
> the range [1536:512] for arm as an example

Right, I was confused between cpu capacity and
arch_scale_freq_capacity() in "Patch 04"  then. Thanks.

-Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 08/11] sched: replace capacity_factor by usage

2015-03-31 Thread Xunlei Pang

Hi Vincent,

On 27 March 2015 at 23:59, Vincent Guittot  wrote:
> On 27 March 2015 at 15:52, Xunlei Pang  wrote:
>> Hi Vincent,
>>
>> On 27 February 2015 at 23:54, Vincent Guittot
>>  wrote:
>>>  /**
>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env 
>>> *env, struct sd_lb_stats *sd
>>>
>>> /*
>>>  * In case the child domain prefers tasks go to siblings
>>> -* first, lower the sg capacity factor to one so that we'll 
>>> try
>>> +* first, lower the sg capacity so that we'll try
>>>  * and move all the excess tasks away. We lower the capacity
>>>  * of a group only if the local group has the capacity to 
>>> fit
>>> -* these excess tasks, i.e. nr_running < 
>>> group_capacity_factor. The
>>> -* extra check prevents the case where you always pull from 
>>> the
>>> -* heaviest group when it is already under-utilized 
>>> (possible
>>> -* with a large weight task outweighs the tasks on the 
>>> system).
>>> +* these excess tasks. The extra check prevents the case 
>>> where
>>> +* you always pull from the heaviest group when it is 
>>> already
>>> +* under-utilized (possible with a large weight task 
>>> outweighs
>>> +* the tasks on the system).
>>>  */
>>> if (prefer_sibling && sds->local &&
>>> -   sds->local_stat.group_has_free_capacity) {
>>> -   sgs->group_capacity_factor = 
>>> min(sgs->group_capacity_factor, 1U);
>>> -   sgs->group_type = group_classify(sg, sgs);
>>> +   group_has_capacity(env, &sds->local_stat) &&
>>> +   (sgs->sum_nr_running > 1)) {
>>> +   sgs->group_no_capacity = 1;
>>> +   sgs->group_type = group_overloaded;
>>> }
>>>
>>
>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>> returns true(but not overloaded)  for it, and assume sgs group has 2
>> tasks, should we still mark this group overloaded?
>
> yes, the load balance will then choose if it's worth pulling it or not
> depending of the load of each groups

Maybe I didn't make it clearly.
For example, CPU0~1 are SMT siblings,  CPU2~CPU3 are another pair.
CPU0 is idle, others each has 1 task. Then according to this patch,
CPU2~CPU3(as one group) will be viewed as overloaded(CPU0~CPU1 as
local group, and group_has_capacity() returns true here), so the
balancer may initiate an active task moving. This is different from
the current code as SD_PREFER_SIBLING logic does. Is this problematic?

>
>>
>> -Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 08/11] sched: replace capacity_factor by usage

2015-04-01 Thread Xunlei Pang

Hi Vincent,

On 1 April 2015 at 17:06, Vincent Guittot  wrote:
> On 1 April 2015 at 05:37, Xunlei Pang  wrote:
>> Hi Vincent,
>>
>> On 27 March 2015 at 23:59, Vincent Guittot  
>> wrote:
>>> On 27 March 2015 at 15:52, Xunlei Pang  wrote:
>>>> Hi Vincent,
>>>>
>>>> On 27 February 2015 at 23:54, Vincent Guittot
>>>>  wrote:
>>>>>  /**
>>>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct 
>>>>> lb_env *env, struct sd_lb_stats *sd
>>>>>
>>>>> /*
>>>>>  * In case the child domain prefers tasks go to siblings
>>>>> -* first, lower the sg capacity factor to one so that 
>>>>> we'll try
>>>>> +* first, lower the sg capacity so that we'll try
>>>>>  * and move all the excess tasks away. We lower the 
>>>>> capacity
>>>>>  * of a group only if the local group has the capacity to 
>>>>> fit
>>>>> -* these excess tasks, i.e. nr_running < 
>>>>> group_capacity_factor. The
>>>>> -* extra check prevents the case where you always pull 
>>>>> from the
>>>>> -* heaviest group when it is already under-utilized 
>>>>> (possible
>>>>> -* with a large weight task outweighs the tasks on the 
>>>>> system).
>>>>> +* these excess tasks. The extra check prevents the case 
>>>>> where
>>>>> +* you always pull from the heaviest group when it is 
>>>>> already
>>>>> +* under-utilized (possible with a large weight task 
>>>>> outweighs
>>>>> +* the tasks on the system).
>>>>>  */
>>>>> if (prefer_sibling && sds->local &&
>>>>> -   sds->local_stat.group_has_free_capacity) {
>>>>> -   sgs->group_capacity_factor = 
>>>>> min(sgs->group_capacity_factor, 1U);
>>>>> -   sgs->group_type = group_classify(sg, sgs);
>>>>> +   group_has_capacity(env, &sds->local_stat) &&
>>>>> +   (sgs->sum_nr_running > 1)) {
>>>>> +   sgs->group_no_capacity = 1;
>>>>> +   sgs->group_type = group_overloaded;
>>>>> }
>>>>>
>>>>
>>>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>>>> returns true(but not overloaded)  for it, and assume sgs group has 2
>>>> tasks, should we still mark this group overloaded?
>>>
>>> yes, the load balance will then choose if it's worth pulling it or not
>>> depending of the load of each groups
>>
>> Maybe I didn't make it clearly.
>> For example, CPU0~1 are SMT siblings,  CPU2~CPU3 are another pair.
>> CPU0 is idle, others each has 1 task. Then according to this patch,
>> CPU2~CPU3(as one group) will be viewed as overloaded(CPU0~CPU1 as
>> local group, and group_has_capacity() returns true here), so the
>> balancer may initiate an active task moving. This is different from
>> the current code as SD_PREFER_SIBLING logic does. Is this problematic?
>
> IMHO, it's not problematic, It's worth triggering a load balance if
> there is an imbalance between the 2  groups (as an example CPU0~1 has
> one low nice prio task but CPU1~2 have 2 high nice prio tasks) so the
> decision will be done when calculating the imbalance

Yes, but assuming the balancer calculated some imbalance, after moving
like CPU0~CPU1 have 1 low prio task and 1 high prio task, CPU2~CPU3
have 1 high piro task, seems it does no good because there's only 1
task per CPU after all.

So, is code below better( we may have more than 2 SMT siblings, like
Broadcom XLP processor having 4 SMT per core)?
 if (prefer_sibling && sds->local &&
group_has_capacity(env, &sds->local_stat) &&
(sgs->sum_nr_running > sds->local_stat.sum_nr_running + 1)) {
sgs->group_no_capacity = 1;
sgs->group_type = group_overloaded;
 }

Thanks,
-Xunlei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH 4/6] lib/plist: Provide plist_add_head() for nodes with the same prio

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

If there're multiple nodes with the same prio as @node, currently
plist_add() will add @node behind all of them. Now we need to add
@node before all of these nodes for SMP RT scheduler.

This patch adds a common __plist_add() for adding @node before or
after existing nodes with the same prio, then adds plist_add_head()
and plist_add_tail() inline wrapper functions for convenient uses.

Finally, change plist_add() to an inline wrapper using plist_add_tail()
which has the same behaviour as before.

Reviewed-by: Dan Streetman 
Reviewed-by: Steven Rostedt 
Signed-off-by: Xunlei Pang 
---
 include/linux/plist.h | 34 +-
 lib/plist.c   | 28 +++-
 2 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/include/linux/plist.h b/include/linux/plist.h
index 9788360..d060f28 100644
--- a/include/linux/plist.h
+++ b/include/linux/plist.h
@@ -138,7 +138,39 @@ static inline void plist_node_init(struct plist_node 
*node, int prio)
INIT_LIST_HEAD(&node->node_list);
 }
 
-extern void plist_add(struct plist_node *node, struct plist_head *head);
+extern void __plist_add(struct plist_node *node,
+   struct plist_head *head, bool is_head);
+
+/**
+ * plist_add_head - add @node to @head, before all existing same-prio nodes
+ *
+ * @node:  The plist_node to be added to @head
+ * @head:  The plist_head that @node is being added to
+ */
+static inline
+void plist_add_head(struct plist_node *node, struct plist_head *head)
+{
+   __plist_add(node, head, true);
+}
+
+/**
+ * plist_add_tail - add @node to @head, after all existing same-prio nodes
+ *
+ * @node:  The plist_node to be added to @head
+ * @head:  The plist_head that @node is being added to
+ */
+static inline
+void plist_add_tail(struct plist_node *node, struct plist_head *head)
+{
+   __plist_add(node, head, false);
+}
+
+static inline
+void plist_add(struct plist_node *node, struct plist_head *head)
+{
+   plist_add_tail(node, head);
+}
+
 extern void plist_del(struct plist_node *node, struct plist_head *head);
 
 extern void plist_requeue(struct plist_node *node, struct plist_head *head);
diff --git a/lib/plist.c b/lib/plist.c
index 3a30c53..c1ee2b0 100644
--- a/lib/plist.c
+++ b/lib/plist.c
@@ -66,12 +66,18 @@ static void plist_check_head(struct plist_head *head)
 #endif
 
 /**
- * plist_add - add @node to @head
+ * __plist_add - add @node to @head
  *
- * @node:  &struct plist_node pointer
- * @head:  &struct plist_head pointer
+ * @node:The plist_node to be added to @head
+ * @head:The plist_head that @node is being added to
+ * @is_head: True if adding to head of prio list, false otherwise
+ *
+ * For nodes of the same prio, @node will be added at the
+ * head of previously added nodes if @is_head is true, or
+ * it will be added at the tail of previously added nodes
+ * if @is_head is false.
  */
-void plist_add(struct plist_node *node, struct plist_head *head)
+void __plist_add(struct plist_node *node, struct plist_head *head, bool 
is_head)
 {
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
@@ -96,8 +102,20 @@ void plist_add(struct plist_node *node, struct plist_head 
*head)
struct plist_node, prio_list);
} while (iter != first);
 
-   if (!prev || prev->prio != node->prio)
+   if (!prev || prev->prio != node->prio) {
list_add_tail(&node->prio_list, &iter->prio_list);
+   } else if (is_head) {
+   /*
+* prev has the same priority as the node that is being
+* added. It is also the first node for this priority,
+* but the new node needs to be added ahead of it.
+* To accomplish this, replace prev in the prio_list
+* with node. Then set node_next to prev->node_list so
+* that the new node gets added before prev and not iter.
+*/
+   list_replace_init(&prev->prio_list, &node->prio_list);
+   node_next = &prev->node_list;
+   }
 ins_node:
list_add_tail(&node->node_list, node_next);
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH 2/6] sched/rt: Check to push task away when its affinity is changed

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

We may suffer from extra rt overload rq due to the affinity,
so when the affinity of any runnable rt task is changed, we
should check to trigger balancing, otherwise it will cause
some unnecessary delayed real-time response. Unfortunately,
current RT global scheduler does nothing about this.

For example: a 2-cpu system with two runnable FIFO tasks(same
rt_priority) bound on CPU0, let's name them rt1(running) and
rt2(runnable) respectively; CPU1 has no RTs. Then, someone sets
the affinity of rt2 to 0x3(i.e. CPU0 and CPU1), but after this,
rt2 still can't be scheduled until rt1 enters schedule(), this
definitely causes some/big response latency for rt2.

This patch introduces new sched_class::post_set_cpus_allowed()
for rt called after set_cpus_allowed_rt(). In this new function,
it triggers a push action when detecting such cases.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/core.c  |  3 +++
 kernel/sched/rt.c| 24 
 kernel/sched/sched.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a8..dc646a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4769,6 +4769,9 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 
cpumask_copy(&p->cpus_allowed, new_mask);
p->nr_cpus_allowed = cpumask_weight(new_mask);
+
+   if (p->sched_class->post_set_cpus_allowed)
+   p->sched_class->post_set_cpus_allowed(p);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6b40555..7b76747 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2136,6 +2136,29 @@ static void set_cpus_allowed_rt(struct task_struct *p,
update_rt_migration(&rq->rt);
 }
 
+static void post_set_cpus_allowed_rt(struct task_struct *p)
+{
+   struct rq *rq;
+
+   if (!task_on_rq_queued(p))
+   return;
+
+   rq = task_rq(p);
+   if (p->nr_cpus_allowed > 1 &&
+   rq->rt.rt_nr_running > 1 &&
+   rt_task(rq->curr) && !test_tsk_need_resched(rq->curr)) {
+   if (!task_running(rq, p)) {
+   push_rt_task(rq);
+   } else if (cpumask_test_cpu(task_cpu(p), &p->cpus_allowed)) {
+   /*
+* p(current) may get migratable due to the change
+* of its affinity, let's try to push it away.
+*/
+   check_preempt_equal_prio_common(rq);
+   }
+   }
+}
+
 /* Assumes rq->lock is held */
 static void rq_online_rt(struct rq *rq)
 {
@@ -2350,6 +2373,7 @@ const struct sched_class rt_sched_class = {
.select_task_rq = select_task_rq_rt,
 
.set_cpus_allowed   = set_cpus_allowed_rt,
+   .post_set_cpus_allowed  = post_set_cpus_allowed_rt,
.rq_online  = rq_online_rt,
.rq_offline = rq_offline_rt,
.post_schedule  = post_schedule_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..6f90645 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1191,6 +1191,7 @@ struct sched_class {
 
void (*set_cpus_allowed)(struct task_struct *p,
 const struct cpumask *newmask);
+   void (*post_set_cpus_allowed)(struct task_struct *p);
 
void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH 6/6] sched/rt: Requeue p back if the preemption of check_preempt_equal_prio_common() failed

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

In check_preempt_equal_prio_common(), it requeues "next" ahead
in the "run queue" and want to push current away. But when doing
the actual pushing, if the system state changes, the pushing may
fail as a result.

In this case, p finally becomes the new current and gets running,
while previous current was queued back waiting in the same "run
queue". This broke FIFO.

This patch adds a flag named RT_PREEMPT_PUSHAWAY for task_struct::
rt_preempt, sets it when doing check_preempt_equal_prio_common(),
and clears it if current is away(when it is dequeued). Thus we can
test this flag in p's post_schedule_rt() to judge if the pushing
has happened. If the pushing failed, requeue previous current back
to the head of its "run queue" and start a rescheduling.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/rt.c | 75 +--
 1 file changed, 67 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e7d66eb..94789f1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -258,6 +258,8 @@ int alloc_rt_sched_group(struct task_group *tg, struct 
task_group *parent)
 #ifdef CONFIG_SMP
 
 #define RT_PREEMPT_QUEUEAHEAD1UL
+#define RT_PREEMPT_PUSHAWAY  2UL
+#define RT_PREEMPT_MASK  3UL
 
 /*
  * p(current) was preempted, and to be put ahead of
@@ -268,6 +270,22 @@ static inline bool rt_preempted(struct task_struct *p)
return !!(p->rt_preempt & RT_PREEMPT_QUEUEAHEAD);
 }
 
+static inline struct task_struct *rt_preempting_target(struct task_struct *p)
+{
+   return (struct task_struct *) (p->rt_preempt & ~RT_PREEMPT_MASK);
+}
+
+/*
+ * p(new current) is preempting and pushing previous current away.
+ */
+static inline bool rt_preempting(struct task_struct *p)
+{
+   if ((p->rt_preempt & RT_PREEMPT_PUSHAWAY) && rt_preempting_target(p))
+   return true;
+
+   return false;
+}
+
 static inline void clear_rt_preempt(struct task_struct *p)
 {
p->rt_preempt = 0;
@@ -375,13 +393,17 @@ static inline int has_pushable_tasks(struct rq *rq)
return !plist_head_empty(&rq->rt.pushable_tasks);
 }
 
-static inline void set_post_schedule(struct rq *rq)
+static inline void set_post_schedule(struct rq *rq, struct task_struct *p)
 {
-   /*
-* We detect this state here so that we can avoid taking the RQ
-* lock again later if there is no need to push
-*/
-   rq->post_schedule = has_pushable_tasks(rq);
+   if (rt_preempting(p))
+   /* Forced post schedule */
+   rq->post_schedule = 1;
+   else
+   /*
+* We detect this state here so that we can avoid taking
+* the RQ lock again later if there is no need to push
+*/
+   rq->post_schedule = has_pushable_tasks(rq);
 }
 
 static void
@@ -430,6 +452,11 @@ static inline bool rt_preempted(struct task_struct *p)
return false;
 }
 
+static inline bool rt_preempting(struct task_struct *p)
+{
+   return false;
+}
+
 static inline void clear_rt_preempt(struct task_struct *p)
 {
 }
@@ -472,7 +499,7 @@ static inline int pull_rt_task(struct rq *this_rq)
return 0;
 }
 
-static inline void set_post_schedule(struct rq *rq)
+static inline void set_post_schedule(struct rq *rq, struct task_struct *p)
 {
 }
 #endif /* CONFIG_SMP */
@@ -1330,6 +1357,7 @@ static void dequeue_task_rt(struct rq *rq, struct 
task_struct *p, int flags)
dequeue_rt_entity(rt_se);
 
dequeue_pushable_task(rq, p);
+   clear_rt_preempt(p);
 }
 
 /*
@@ -1468,6 +1496,11 @@ static void check_preempt_equal_prio_common(struct rq 
*rq)
 * to try and push current away.
 */
requeue_task_rt(rq, next, 1);
+
+   get_task_struct(curr);
+   curr->rt_preempt |= RT_PREEMPT_PUSHAWAY;
+   next->rt_preempt = (unsigned long) curr;
+   next->rt_preempt |= RT_PREEMPT_PUSHAWAY;
resched_curr_preempted_rt(rq);
 }
 
@@ -1590,7 +1623,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
 
-   set_post_schedule(rq);
+   set_post_schedule(rq, p);
 
return p;
 }
@@ -2151,6 +2184,32 @@ skip:
 static void post_schedule_rt(struct rq *rq)
 {
push_rt_tasks(rq);
+
+   if (rt_preempting(current)) {
+   struct task_struct *target;
+
+   current->rt_preempt = 0;
+   target = rt_preempting_target(current);
+   if (!(target->rt_preempt & RT_PREEMPT_PUSHAWAY))
+   goto out;
+
+   /*
+* target still has RT_PREEMPT_PUSHAWAY set which
+* means it wasn't pushed away successfully if it
+* is still on this rq. thus restore former sta

[RFC PATCH 1/6] sched/rt: Provide new check_preempt_equal_prio_common()

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

When p is queued, there may be other tasks already queued at the
same priority in the "run queue", so we should peek the most front
one to do the equal priority preemption.

This patch modifies check_preempt_equal_prio() and provides new
check_preempt_equal_prio_common() to do the common preemption.

There are also other cases to be added calling the new interface
in the following patches.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/rt.c | 70 ++-
 1 file changed, 54 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 575da76..6b40555 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1366,33 +1366,66 @@ out:
return cpu;
 }
 
-static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
+static struct task_struct *peek_next_task_rt(struct rq *rq);
+
+static void check_preempt_equal_prio_common(struct rq *rq)
 {
+   struct task_struct *curr = rq->curr;
+   struct task_struct *next;
+
+   /* Current can't be migrated, useless to reschedule */
+   if (curr->nr_cpus_allowed == 1 ||
+   !cpupri_find(&rq->rd->cpupri, curr, NULL))
+   return;
+
/*
-* Current can't be migrated, useless to reschedule,
-* let's hope p can move out.
+* Can we find any task with the same priority as
+* curr? To accomplish this, firstly requeue curr
+* to the tail, then peek next, finally put curr
+* back to the head if a different task was peeked.
 */
-   if (rq->curr->nr_cpus_allowed == 1 ||
-   !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+   requeue_task_rt(rq, curr, 0);
+   next = peek_next_task_rt(rq);
+   if (next == curr)
+   return;
+
+   requeue_task_rt(rq, curr, 1);
+
+   if (next->prio != curr->prio)
return;
 
/*
-* p is migratable, so let's not schedule it and
-* see if it is pushed or pulled somewhere else.
+* Got the right next queued with the same priority
+* as current. If next is migratable, don't schedule
+* it as it will be pushed or pulled somewhere else.
 */
-   if (p->nr_cpus_allowed != 1
-   && cpupri_find(&rq->rd->cpupri, p, NULL))
+   if (next->nr_cpus_allowed != 1 &&
+   cpupri_find(&rq->rd->cpupri, next, NULL))
return;
 
/*
 * There appears to be other cpus that can accept
-* current and none to run 'p', so lets reschedule
-* to try and push current away:
+* current and none to run next, so lets reschedule
+* to try and push current away.
 */
-   requeue_task_rt(rq, p, 1);
+   requeue_task_rt(rq, next, 1);
resched_curr(rq);
 }
 
+static inline
+void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
+{
+   /*
+* p is migratable, so let's not schedule it and
+* see if it is pushed or pulled somewhere else.
+*/
+   if (p->nr_cpus_allowed != 1 &&
+   cpupri_find(&rq->rd->cpupri, p, NULL))
+   return;
+
+   check_preempt_equal_prio_common(rq);
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -1440,10 +1473,9 @@ static struct sched_rt_entity 
*pick_next_rt_entity(struct rq *rq,
return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *peek_next_task_rt(struct rq *rq)
 {
struct sched_rt_entity *rt_se;
-   struct task_struct *p;
struct rt_rq *rt_rq  = &rq->rt;
 
do {
@@ -1452,9 +1484,15 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
rt_rq = group_rt_rq(rt_se);
} while (rt_rq);
 
-   p = rt_task_of(rt_se);
-   p->se.exec_start = rq_clock_task(rq);
+   return rt_task_of(rt_se);
+}
 
+static inline struct task_struct *_pick_next_task_rt(struct rq *rq)
+{
+   struct task_struct *p;
+
+   p = peek_next_task_rt(rq);
+   p->se.exec_start = rq_clock_task(rq);
return p;
 }
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH 3/6] sched/rt: Check to push FIFO current away at each tick

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

There may be some non-migratable tasks queued in the "run queue"
with the same priority as current which is FIFO and migratable,
so at each tick we can check and try to push current away and
give these tasks a chance of running(we don't do this for tasks
queued with lower priority).

Signed-off-by: Xunlei Pang 
---
 kernel/sched/rt.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7b76747..ddd5b19 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2314,10 +2314,17 @@ static void task_tick_rt(struct rq *rq, struct 
task_struct *p, int queued)
 
/*
 * RR tasks need a special form of timeslice management.
-* FIFO tasks have no timeslices.
+* FIFO tasks have no timeslices. But if p(current) is a
+* FIFO task, try to push it away.
 */
-   if (p->policy != SCHED_RR)
+   if (p->policy != SCHED_RR) {
+   if (p->nr_cpus_allowed > 1 &&
+   rq->rt.rt_nr_running > 1 &&
+   !test_tsk_need_resched(p))
+   check_preempt_equal_prio_common(rq);
+
return;
+   }
 
if (--p->rt.time_slice)
return;
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH 5/6] sched/rt: Fix wrong SMP scheduler behavior for equal prio cases

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

We know, there are two main queues each cpu for RT scheduler:
Let's call them "run queue" and "pushable queue" respectively.

For RT tasks, the scheduler uses "plist" to manage the pushable queue,
so when there are multiple tasks queued at the same priority, they get
queued in the strict FIFO order.

Currently, when an rt task gets queued, it is put to the head or the
tail of its "run queue" at the same priority according to different
scenarios. Then if it is migratable, it will also and always be put to
the tail of its "pushable queue" at the same priority.

For one cpu, assuming initially it has some migratable tasks queued
at the same priority as current(RT) both in "run queue" and "pushable
queue" in the same order. At some time, when current gets preempted, it
will be put behind these tasks in the "pushable queue", while it still
stays ahead of these tasks in the "run queue". Afterwards, if there comes
a pull from other cpu or a push from local cpu, the task behind current
in the "run queue" will be removed from the "pushable queue" and gets
running, as the global rt scheduler fetches tasks from the head of the
"pushable queue" to do pulling or pushing.

Obviously, to maintain the same order between the two queues, when current
is preempted(not involving re-queue in the "run queue"), we want to put it
ahead of all those tasks queued at the same priority in the "pushable queue".

So, if a task is running and gets preempted by a higher priority
task or even with same priority for migrating, this patch ensures
that it is put ahead of any existing task with the same priority in
the "pushable queue".

The dealing logic used here:
 - Add a new variable named "rt_preempt"(define a new flag named
   RT_PREEMPT_QUEUEAHEAD for it) in task_struct, used by RT.
 - When doing preempted resched_curr() for current RT, set the flag.
   Create a new resched_curr_preempted_rt() for this function, and
   replace all the possible resched_curr() used for rt preemption with
   resched_curr_preempted_rt().
 - In put_prev_task_rt(), test RT_PREEMPT_QUEUEAHEAD if set, enqueue
   the task ahead in the "pushable queue".

Signed-off-by: Xunlei Pang 
---
 include/linux/sched.h|  5 +++
 include/linux/sched/rt.h | 16 
 kernel/sched/core.c  |  6 ++-
 kernel/sched/rt.c| 96 ++--
 4 files changed, 110 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f74d4cc..24e0f72 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1321,6 +1321,11 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+
+#ifdef CONFIG_SMP
+   unsigned long rt_preempt; /* Used by rt */
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 6341f5b..69e3c82 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -15,6 +15,22 @@ static inline int rt_task(struct task_struct *p)
return rt_prio(p->prio);
 }
 
+struct rq;
+
+#ifdef CONFIG_SMP
+extern void resched_curr_preempted_rt(struct rq *rq);
+
+static inline void resched_curr_preempted(struct rq *rq)
+{
+   resched_curr_preempted_rt(rq);
+}
+#else
+static inline void resched_curr_preempted(struct rq *rq)
+{
+   rsched_curr(rq);
+}
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dc646a6..64a1603 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1002,7 +1002,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct 
*p, int flags)
if (class == rq->curr->sched_class)
break;
if (class == p->sched_class) {
-   resched_curr(rq);
+   resched_curr_preempted(rq);
break;
}
}
@@ -1833,6 +1833,10 @@ static void __sched_fork(unsigned long clone_flags, 
struct task_struct *p)
 
INIT_LIST_HEAD(&p->rt.run_list);
 
+#ifdef CONFIG_SMP
+   p->rt_preempt = 0;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ddd5b19..e7d66eb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -254,8 +254,33 @@ int alloc_rt_sched_group(struct task_group *tg, struct 
task_group *parent)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+
 #ifdef CONFIG_SMP
 
+#define RT_PREEMPT_QUEUEAHEAD

[RFC PATCH RESEND 0/4] Maintain the FIFO order for same priority RT tasks

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

1. TWO FIFO QUEUES
We know, there are two main queues each cpu for RT SMP scheduler. Let's
name them "run queue" and "pushable queue" respectively. 

"run queue" is from the intra-cpu's perspective, while "pushable queue" is
from the inter-cpu's perspective. "pushable queue" also has a very close 
relationship with task affinity. The two queues separately maintain their
FIFO order for tasks queued at the same priority.

If we got a wrong FIFO order of any of the two queues, we say it broke FIFO.


2. THE WAY THEY WORK
Currently, when an rt task gets queued, it can be put to the head or
tail of its "run queue" at the same priority according to different
scenarios. Further if it is migratable, it will also and always be put
to the tail of its "pushable queue" at the same priority because "plist"
is used to manage this queue in strict FIFO order.

a) If a task got enqueued or dequeued, it got added to or removed from the
"run queue", and also "pushable queue"(added to tail) if it is migratable.

b) If a runnable task acquires cpu(current task), it stays in the front of
the "run queue", but got removed from the "pushable queue". 

c) If a runnable task releases cpu, it stays in the "run queue"(if it 
releases cpu involuntarily, stays just behind new current in this queue; 
if it releases cpu voluntarily like yield, usually was requeued behind 
other tasks queued at the same priority in this queue), and got added to 
the tail of its "pushable queue" at the same priority if it is migratable.

d) If a tasks's affinity is changed from one cpu to multiple cpus, it got
added to the tail of "pushable queue" at the same priority.

e) If a tasks's affinity is changed from multiple cpus to one cpu, it got
removed from the "pushable queue".


3. PROBLEMATIC CASES
Unfortunately, in current kernel implementations, as far as I know, c) 
described in "2. THE WAY THEY WORK" have some cases that broke FIFO of 
one of the two queues for tasks queued at the same priority.

- Case 1: current task gets preempted by a higher priority task
current will be enqueued behind other tasks with the same priority in the
"pushable queue", while it actually stays ahead of these tasks in the "run
queue". Afterwards, if there comes a pull from other cpu or a push from
local cpu, the task behind current in the "run queue" will be removed from
the "pushable queue" and gets running, as the global rt scheduler fetches
tasks from the head of the "pushable queue" to do the pulling or pushing.
The kernel broke FIFO in this case.

current task gets preempted by an equal priority task:
This is done in check_preempt_equal_prio(), and can be divided into 3 sub cases.
  - Case 2 : p is the only one has the same priority as current.
 p preempts current successfully, but the kernel still had the 
 same problem as in Case 1 during the preempting.
  - Case 3 : p is not the only one has the same priority as current, let's
 say q is already queued before p. so, in this case we should
 choose q to do the preempting, not p. So the kernel broke FIFO in 
 this case.
  - Case 4 : p is going to preempt current, but when doing the actual 
preempting,
 current isn't pushed away due to the change of system status. so
 eventually p becomes the new current and gets running, while 
 current is queued back waiting in the same "run queue". Obviously, 
 the kernel broke FIFO in this case.

NOTE: when a task's affinity is changed and it becomes migratable(see d) 
described 
in "2. THE WAY THEY WORK"), in current kernel implementation it will be queued 
behind other tasks with the same priority in the "pushable queue". This may 
seems 
contratry to the corresponding order in the "run queue". But from "pushable 
queue"'s 
prospective, when a task's affinity is changed and further got removed from or 
added to the "pushable queue" because of that, it means requeuing. In my view, 
I 
don't think this broke FIFO of the "pushable queue". Any discussion?

But for "case 1" and "case 2", the task eventually wasn't got removed from or 
added 
to the "pushable queue" due to affinity changing, so its "pushable queue" FIFO 
order should also stays unchanged, and must be the same as its "run queue" 
order.
So we say "case 1" and "case 2" broke FIFO of its "pushable queue".

4. SUMMARY
case 1 : broke its "pushable queue" FIFO order.
case 2 : broke its "pushable queue" FIFO order.
case 3 : broke its "run queue" FIFO

[RFC PATCH RESEND 1/4] sched/rt: Modify check_preempt_equal_prio() for multiple tasks queued at the same priority

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

In check_preempt_equal_prio(), when p is queued, there may be other
tasks already queued at the same priority in the "run queue", so we
should peek the most front one to do the preemption, not the p.

This patch modified it and moved the preemption job to a new function
named check_preempt_equal_prio_common() to make the logic clearer.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/rt.c | 70 ++-
 1 file changed, 54 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 575da76..0c0f4df 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1366,33 +1366,66 @@ out:
return cpu;
 }
 
-static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
+static struct task_struct *peek_next_task_rt(struct rq *rq);
+
+static void check_preempt_equal_prio_common(struct rq *rq)
 {
+   struct task_struct *curr = rq->curr;
+   struct task_struct *next;
+
+   /* Current can't be migrated, useless to reschedule */
+   if (curr->nr_cpus_allowed == 1 ||
+   !cpupri_find(&rq->rd->cpupri, curr, NULL))
+   return;
+
/*
-* Current can't be migrated, useless to reschedule,
-* let's hope p can move out.
+* Can we find any task with the same priority as
+* curr? To accomplish this, firstly requeue curr
+* to the tail, then peek next, finally put curr
+* back to the head if a different task was peeked.
 */
-   if (rq->curr->nr_cpus_allowed == 1 ||
-   !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+   requeue_task_rt(rq, curr, 0);
+   next = peek_next_task_rt(rq);
+   if (next == curr)
+   return;
+
+   requeue_task_rt(rq, curr, 1);
+
+   if (next->prio != curr->prio)
return;
 
/*
-* p is migratable, so let's not schedule it and
-* see if it is pushed or pulled somewhere else.
+* Got the right "next" queued with the same priority
+* as current. If next is migratable, don't schedule
+* it as it will be pushed or pulled somewhere else.
 */
-   if (p->nr_cpus_allowed != 1
-   && cpupri_find(&rq->rd->cpupri, p, NULL))
+   if (next->nr_cpus_allowed != 1 &&
+   cpupri_find(&rq->rd->cpupri, next, NULL))
return;
 
/*
 * There appears to be other cpus that can accept
-* current and none to run 'p', so lets reschedule
-* to try and push current away:
+* current and none to run next, so lets reschedule
+* to try and push current away.
 */
-   requeue_task_rt(rq, p, 1);
+   requeue_task_rt(rq, next, 1);
resched_curr(rq);
 }
 
+static inline
+void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
+{
+   /*
+* p is migratable, so let's not schedule it and
+* see if it is pushed or pulled somewhere else.
+*/
+   if (p->nr_cpus_allowed != 1 &&
+   cpupri_find(&rq->rd->cpupri, p, NULL))
+   return;
+
+   check_preempt_equal_prio_common(rq);
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -1440,10 +1473,9 @@ static struct sched_rt_entity 
*pick_next_rt_entity(struct rq *rq,
return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *peek_next_task_rt(struct rq *rq)
 {
struct sched_rt_entity *rt_se;
-   struct task_struct *p;
struct rt_rq *rt_rq  = &rq->rt;
 
do {
@@ -1452,9 +1484,15 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
rt_rq = group_rt_rq(rt_se);
} while (rt_rq);
 
-   p = rt_task_of(rt_se);
-   p->se.exec_start = rq_clock_task(rq);
+   return rt_task_of(rt_se);
+}
 
+static inline struct task_struct *_pick_next_task_rt(struct rq *rq)
+{
+   struct task_struct *p;
+
+   p = peek_next_task_rt(rq);
+   p->se.exec_start = rq_clock_task(rq);
return p;
 }
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH RESEND 2/4] lib/plist: Provide plist_add_head() for nodes with the same prio

2015-04-26 Thread Xunlei Pang

From: Xunlei Pang 

If there're multiple nodes with the same prio as @node, currently
plist_add() will add @node behind all of them. Now we need to add
@node before all of these nodes for SMP RT scheduler.

This patch adds a common __plist_add() for adding @node before or
after existing nodes with the same prio, then adds plist_add_head()
and plist_add_tail() inline wrapper functions for convenient uses.

Finally, change plist_add() to an inline wrapper using plist_add_tail()
which has the same behaviour as before.

Reviewed-by: Dan Streetman 
Reviewed-by: Steven Rostedt 
Signed-off-by: Xunlei Pang 
---
 include/linux/plist.h | 34 +-
 lib/plist.c   | 28 +++-
 2 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/include/linux/plist.h b/include/linux/plist.h
index 9788360..d060f28 100644
--- a/include/linux/plist.h
+++ b/include/linux/plist.h
@@ -138,7 +138,39 @@ static inline void plist_node_init(struct plist_node 
*node, int prio)
INIT_LIST_HEAD(&node->node_list);
 }
 
-extern void plist_add(struct plist_node *node, struct plist_head *head);
+extern void __plist_add(struct plist_node *node,
+   struct plist_head *head, bool is_head);
+
+/**
+ * plist_add_head - add @node to @head, before all existing same-prio nodes
+ *
+ * @node:  The plist_node to be added to @head
+ * @head:  The plist_head that @node is being added to
+ */
+static inline
+void plist_add_head(struct plist_node *node, struct plist_head *head)
+{
+   __plist_add(node, head, true);
+}
+
+/**
+ * plist_add_tail - add @node to @head, after all existing same-prio nodes
+ *
+ * @node:  The plist_node to be added to @head
+ * @head:  The plist_head that @node is being added to
+ */
+static inline
+void plist_add_tail(struct plist_node *node, struct plist_head *head)
+{
+   __plist_add(node, head, false);
+}
+
+static inline
+void plist_add(struct plist_node *node, struct plist_head *head)
+{
+   plist_add_tail(node, head);
+}
+
 extern void plist_del(struct plist_node *node, struct plist_head *head);
 
 extern void plist_requeue(struct plist_node *node, struct plist_head *head);
diff --git a/lib/plist.c b/lib/plist.c
index 3a30c53..c1ee2b0 100644
--- a/lib/plist.c
+++ b/lib/plist.c
@@ -66,12 +66,18 @@ static void plist_check_head(struct plist_head *head)
 #endif
 
 /**
- * plist_add - add @node to @head
+ * __plist_add - add @node to @head
  *
- * @node:  &struct plist_node pointer
- * @head:  &struct plist_head pointer
+ * @node:The plist_node to be added to @head
+ * @head:The plist_head that @node is being added to
+ * @is_head: True if adding to head of prio list, false otherwise
+ *
+ * For nodes of the same prio, @node will be added at the
+ * head of previously added nodes if @is_head is true, or
+ * it will be added at the tail of previously added nodes
+ * if @is_head is false.
  */
-void plist_add(struct plist_node *node, struct plist_head *head)
+void __plist_add(struct plist_node *node, struct plist_head *head, bool 
is_head)
 {
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
@@ -96,8 +102,20 @@ void plist_add(struct plist_node *node, struct plist_head 
*head)
struct plist_node, prio_list);
} while (iter != first);
 
-   if (!prev || prev->prio != node->prio)
+   if (!prev || prev->prio != node->prio) {
list_add_tail(&node->prio_list, &iter->prio_list);
+   } else if (is_head) {
+   /*
+* prev has the same priority as the node that is being
+* added. It is also the first node for this priority,
+* but the new node needs to be added ahead of it.
+* To accomplish this, replace prev in the prio_list
+* with node. Then set node_next to prev->node_list so
+* that the new node gets added before prev and not iter.
+*/
+   list_replace_init(&prev->prio_list, &node->prio_list);
+   node_next = &prev->node_list;
+   }
 ins_node:
list_add_tail(&node->node_list, node_next);
 
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 >

1 - 100 of 553 matches

Mail list logo