date:20170223

Re: [PATCH v2 4/5] perf: kretprobes: offset from reloc_sym if kernel supports it

2017-02-23 Thread Masami Hiramatsu

On Wed, 22 Feb 2017 19:23:40 +0530
"Naveen N. Rao"  wrote:

> We indicate support for accepting sym+offset with kretprobes through a
> line in ftrace README. Parse the same to identify support and choose the
> appropriate format for kprobe_events.
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  tools/perf/util/probe-event.c | 47 
> ---
>  tools/perf/util/probe-event.h |  2 ++
>  2 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index 35f5b7b7715c..f6bc61c47271 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -737,6 +737,41 @@ post_process_module_probe_trace_events(struct 
> probe_trace_event *tevs,
>   return ret;
>  }
>  
> +bool is_kretprobe_offset_supported(void)
> +{
> + FILE *fp;
> + char *buf = NULL;
> + size_t len = 0;
> + bool target_line = false;
> + static int supported = -1;
> +
> + if (supported >= 0)
> + return !!supported;
> +
> + if (asprintf(&buf, "%s/README", tracing_path) < 0)
> + return false;
> +
> + fp = fopen(buf, "r");
> + if (!fp)
> + goto end;
> +
> + zfree(&buf);
> + while (getline(&buf, &len, fp) > 0) {
> + target_line = !!strstr(buf, "place (kretprobe): ");
> + if (!target_line)
> + continue;
> + supported = 1;
> + }
> + if (supported == -1)
> + supported = 0;
> +
> + fclose(fp);
> +end:
> + free(buf);
> +
> + return !!supported;
> +}

Could you refactor probe_type_is_available() in probe-file.c and reuse it
here, so that the code for opening the README file is shared?

The rest looks good to me :)

Thank you,

> +
>  static int
>  post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
>  int ntevs)
> @@ -757,7 +792,9 @@ post_process_kernel_probe_trace_events(struct 
> probe_trace_event *tevs,
>   }
>  
>   for (i = 0; i < ntevs; i++) {
> - if (!tevs[i].point.address || tevs[i].point.retprobe)
> + if (!tevs[i].point.address)
> + continue;
> + if (tevs[i].point.retprobe && !is_kretprobe_offset_supported())
>   continue;
>   /* If we found a wrong one, mark it by NULL symbol */
>   if (kprobe_warn_out_range(tevs[i].point.symbol,
> @@ -1528,11 +1565,6 @@ static int parse_perf_probe_point(char *arg, struct 
> perf_probe_event *pev)
>   return -EINVAL;
>   }
>  
> - if (pp->retprobe && !pp->function) {
> - semantic_error("Return probe requires an entry function.\n");
> - return -EINVAL;
> - }
> -
>   if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
>   semantic_error("Offset/Line/Lazy pattern can't be used with "
>  "return probe.\n");
> @@ -2841,7 +2873,8 @@ static int find_probe_trace_events_from_map(struct 
> perf_probe_event *pev,
>   }
>  
>   /* Note that the symbols in the kmodule are not relocated */
> - if (!pev->uprobes && !pp->retprobe && !pev->target) {
> + if (!pev->uprobes && !pev->target &&
> + (!pp->retprobe || is_kretprobe_offset_supported())) {
>   reloc_sym = kernel_get_ref_reloc_sym();
>   if (!reloc_sym) {
>   pr_warning("Relocated base symbol is not found!\n");
> diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
> index 5d4e94061402..449d4f311355 100644
> --- a/tools/perf/util/probe-event.h
> +++ b/tools/perf/util/probe-event.h
> @@ -135,6 +135,8 @@ bool perf_probe_with_var(struct perf_probe_event *pev);
>  /* Check the perf_probe_event needs debuginfo */
>  bool perf_probe_event_need_dwarf(struct perf_probe_event *pev);
>  
> +bool is_kretprobe_offset_supported(void);
> +
>  /* Release event contents */
>  void clear_perf_probe_event(struct perf_probe_event *pev);
>  void clear_probe_trace_event(struct probe_trace_event *tev);
> -- 
> 2.11.0
> 


-- 
Masami Hiramatsu

Re: [PATCH] KVM: PPC: Book3S: Ratelimit copy data failure error messages

2017-02-23 Thread Vipin K Parashar


v2 of this patch, with 'printk_ratelimit' replaced by
'printk_ratelimited', is available on the mailing list.


https://patchwork.ozlabs.org/patch/728831/
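
For readers comparing the two versions: as far as I understand it,
printk_ratelimit() consults one rate-limit state shared by all callers,
while printk_ratelimited() keeps a separate state per call site. The
userspace sketch below models that per-call-site idea with an
interval/burst window. The struct, function and macro names are made up
for illustration, and the 5-tick window and 10-message burst are
assumptions, not values taken from the patch.

```c
#include <stdio.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages let through in this window */
	int missed;	/* messages suppressed so far */
};

/* Returns 1 if a message may be emitted at time 'now', 0 otherwise. */
static int ratelimit_allow(struct ratelimit *rs, long now)
{
	/* Start a fresh window once the previous one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}

/*
 * Per-call-site limiter: every expansion of this macro gets its own
 * static state, so one noisy call site cannot starve another.
 * The 5/10 defaults are assumptions chosen for the example.
 */
#define log_ratelimited(now, ...)					\
	do {								\
		static struct ratelimit rs_ = { 5, 10, 0, 0, 0 };	\
		if (ratelimit_allow(&rs_, (now)))			\
			fprintf(stderr, __VA_ARGS__);			\
	} while (0)
```

With a burst of 2 and an interval of 5 ticks, the first two calls in a
window pass, later ones are counted as missed, and the limiter opens
again once the window has elapsed.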



On Tuesday 14 February 2017 11:50 AM, Vipin K Parashar wrote:

Forwarded the same patch to k...@vger.kernel.org
and kvm-...@vger.kernel.org too.


On Tuesday 14 February 2017 12:26 AM, Vipin K Parashar wrote:

kvm_ppc_mmu_book3s_32/64 xlat() logs a "KVM can't copy data" error
upon failing to copy user data to kernel space. This floods the kernel
log when such failures occur within a short time period. Ratelimit this
error to avoid flooding the kernel log upon copy data failures.

Signed-off-by: Vipin K Parashar 
---
  arch/powerpc/kvm/book3s_32_mmu.c | 3 ++-
  arch/powerpc/kvm/book3s_64_mmu.c | 3 ++-
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index a2eb6d3..ca8f960 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -224,7 +224,8 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr,
 	ptem = kvmppc_mmu_book3s_32_get_ptem(sre, eaddr, primary);
 
 	if(copy_from_user(pteg, (void __user *)ptegp, sizeof(pteg))) {
-		printk(KERN_ERR "KVM: Can't copy data from 0x%lx!\n", ptegp);
+		if (printk_ratelimit())
+			printk(KERN_ERR "KVM: Can't copy data from 0x%lx!\n", ptegp);
 		goto no_page_found;
 	}
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index b9131aa..b420aca 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -265,7 +265,8 @@ static int kvmppc_mmu_book3s_64_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 		goto no_page_found;
 
 	if(copy_from_user(pteg, (void __user *)ptegp, sizeof(pteg))) {
-		printk(KERN_ERR "KVM can't copy data from 0x%lx!\n", ptegp);
+		if (printk_ratelimit())
+			printk(KERN_ERR "KVM can't copy data from 0x%lx!\n", ptegp);
 		goto no_page_found;
 	}

Re: [PATCH v2] KVM: PPC: Book3S: Ratelimit copy data failure error messages

2017-02-23 Thread Vipin K Parashar


This patch uses "printk_ratelimited" in place of
"printk_ratelimit" used in v1


On Thursday 16 February 2017 10:40 PM, Vipin K Parashar wrote:

kvm_ppc_mmu_book3s_32/64 xlat() logs a "KVM can't copy data" error
upon failing to copy user data to kernel space. This floods the kernel
log when such failures occur within a short time period. Ratelimit this
error to avoid flooding the kernel log upon copy data failures.

Signed-off-by: Vipin K Parashar 
---
  arch/powerpc/kvm/book3s_32_mmu.c | 3 ++-
  arch/powerpc/kvm/book3s_64_mmu.c | 3 ++-
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index a2eb6d3..1992676 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -224,7 +224,8 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr,
ptem = kvmppc_mmu_book3s_32_get_ptem(sre, eaddr, primary);

if(copy_from_user(pteg, (void __user *)ptegp, sizeof(pteg))) {
-   printk(KERN_ERR "KVM: Can't copy data from 0x%lx!\n", ptegp);
+   printk_ratelimited(KERN_ERR
+   "KVM: Can't copy data from 0x%lx!\n", ptegp);
goto no_page_found;
}

diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index b9131aa..7015357 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -265,7 +265,8 @@ static int kvmppc_mmu_book3s_64_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
goto no_page_found;

if(copy_from_user(pteg, (void __user *)ptegp, sizeof(pteg))) {
-   printk(KERN_ERR "KVM can't copy data from 0x%lx!\n", ptegp);
+   printk_ratelimited(KERN_ERR
+   "KVM: Can't copy data from 0x%lx!\n", ptegp);
goto no_page_found;
}

Re: [PATCH v2] KVM: PPC: Book3S: Ratelimit copy data failure error messages

2017-02-23 Thread Balbir Singh

On 17/02/17 04:10, Vipin K Parashar wrote:
> kvm_ppc_mmu_book3s_32/64 xlat() logs "KVM can't copy data" error
> upon failing to copy user data to kernel space. This floods kernel
> log once such fails occur in short time period. Ratelimit this
> error to avoid flooding kernel logs upon copy data failures.
> 
> Signed-off-by: Vipin K Parashar 
> ---

What causes the flooding, can it be triggered on demand from user
space? I presume you'll need to have permissions to /dev/kvm to
trigger it? Could you clarify the scope, is it just called
during emulation with KVM_PR?

Balbir Singh.

[PATCH] powerpc/xics: Adjust interrupt receive priority for offline cpus

2017-02-23 Thread Vaidyanathan Srinivasan

Offline CPUs need to receive IPIs through XIVE when they are
in stop state and wakeup from that state.

Reduce interrupt receive priority in order to receive XIVE
wakeup interrupts when in offline state.

LOWEST_PRIORITY would allow all interrupts to be delivered
as wakeup events.

Signed-off-by: Vaidyanathan Srinivasan 
---
 arch/powerpc/sysdev/xics/xics-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
index 69d858e..c674a9d 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -199,7 +199,7 @@ void xics_migrate_irqs_away(void)
xics_set_cpu_giq(xics_default_distrib_server, 0);
 
/* Allow IPIs again... */
-   icp_ops->set_priority(DEFAULT_PRIORITY);
+   icp_ops->set_priority(LOWEST_PRIORITY);
 
for_each_irq_desc(virq, desc) {
struct irq_chip *chip;
-- 
2.9.3
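
To make the effect of the one-line change easier to see, here is a
small model of the priority check. It assumes the usual XICS
convention that a numerically lower priority is more favored and that
an interrupt is presented only when it is more favored than the
priority the CPU has set. The numeric values below are my recollection
of the kernel's xics header and should be treated as assumptions, and
irq_is_presented() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Assumed values; check the kernel's xics header before relying on them. */
enum {
	IPI_PRIORITY     = 4,
	DEFAULT_PRIORITY = 5,
	LOWEST_PRIORITY  = 0xFF,
};

/*
 * Hypothetical model: an interrupt is presented to the CPU only if its
 * priority is more favored (numerically lower) than the CPU's current
 * processor priority.
 */
static bool irq_is_presented(unsigned int cpu_priority,
			     unsigned int irq_priority)
{
	return irq_priority < cpu_priority;
}
```

Under these assumptions, a CPU left at DEFAULT_PRIORITY accepts IPIs
but rejects interrupts queued at DEFAULT_PRIORITY itself, whereas a CPU
set to LOWEST_PRIORITY accepts both, which matches the commit message's
claim that all interrupts can then act as wakeup events.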

[PATCH v3 2/2] perf: kretprobes: offset from reloc_sym if kernel supports it

2017-02-23 Thread Naveen N. Rao

We indicate support for accepting sym+offset with kretprobes through a
line in ftrace README. Parse the same to identify support and choose the
appropriate format for kprobe_events.

Signed-off-by: Naveen N. Rao 
---
 tools/perf/util/probe-event.c | 49 ---
 tools/perf/util/probe-event.h |  2 ++
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 35f5b7b7715c..dd6b9ce0eef3 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -737,6 +737,43 @@ post_process_module_probe_trace_events(struct probe_trace_event *tevs,
return ret;
 }
 
+bool is_kretprobe_offset_supported(void)
+{
+   FILE *fp;
+   char *buf = NULL;
+   size_t len = 0;
+   bool target_line = false;
+   static int supported = -1;
+   int fd;
+
+   if (supported >= 0)
+   return !!supported;
+
+   fd = open_trace_file("README", false);
+   if (fd < 0)
+   return false;
+
+   fp = fdopen(fd, "r");
+   if (!fp) {
+   close(fd);
+   return false;
+   }
+
+   while (getline(&buf, &len, fp) > 0) {
+   target_line = !!strstr(buf, "place (kretprobe): ");
+   if (!target_line)
+   continue;
+   supported = 1;
+   }
+   if (supported == -1)
+   supported = 0;
+
+   fclose(fp);
+   free(buf);
+
+   return !!supported;
+}
+
 static int
 post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
   int ntevs)
@@ -757,7 +794,9 @@ post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
}
 
for (i = 0; i < ntevs; i++) {
-   if (!tevs[i].point.address || tevs[i].point.retprobe)
+   if (!tevs[i].point.address)
+   continue;
+   if (tevs[i].point.retprobe && !is_kretprobe_offset_supported())
continue;
/* If we found a wrong one, mark it by NULL symbol */
if (kprobe_warn_out_range(tevs[i].point.symbol,
@@ -1528,11 +1567,6 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
return -EINVAL;
}
 
-   if (pp->retprobe && !pp->function) {
-   semantic_error("Return probe requires an entry function.\n");
-   return -EINVAL;
-   }
-
if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
semantic_error("Offset/Line/Lazy pattern can't be used with "
   "return probe.\n");
@@ -2841,7 +2875,8 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,
}
 
/* Note that the symbols in the kmodule are not relocated */
-   if (!pev->uprobes && !pp->retprobe && !pev->target) {
+   if (!pev->uprobes && !pev->target &&
+   (!pp->retprobe || is_kretprobe_offset_supported())) {
reloc_sym = kernel_get_ref_reloc_sym();
if (!reloc_sym) {
pr_warning("Relocated base symbol is not found!\n");
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 5d4e94061402..449d4f311355 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -135,6 +135,8 @@ bool perf_probe_with_var(struct perf_probe_event *pev);
 /* Check the perf_probe_event needs debuginfo */
 bool perf_probe_event_need_dwarf(struct perf_probe_event *pev);
 
+bool is_kretprobe_offset_supported(void);
+
 /* Release event contents */
 void clear_perf_probe_event(struct perf_probe_event *pev);
 void clear_probe_trace_event(struct probe_trace_event *tev);
-- 
2.11.1

[PATCH v3 1/2] perf: probe: generalize probe event file open routine

2017-02-23 Thread Naveen N. Rao

...into a generic function for opening trace files.

Signed-off-by: Naveen N. Rao 
---
 tools/perf/util/probe-file.c | 20 +++-
 tools/perf/util/probe-file.h |  1 +
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 436b64731f65..1a62daceb028 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -70,7 +70,7 @@ static void print_both_open_warning(int kerr, int uerr)
}
 }
 
-static int open_probe_events(const char *trace_file, bool readwrite)
+int open_trace_file(const char *trace_file, bool readwrite)
 {
char buf[PATH_MAX];
int ret;
@@ -92,12 +92,12 @@ static int open_probe_events(const char *trace_file, bool readwrite)
 
 static int open_kprobe_events(bool readwrite)
 {
-   return open_probe_events("kprobe_events", readwrite);
+   return open_trace_file("kprobe_events", readwrite);
 }
 
 static int open_uprobe_events(bool readwrite)
 {
-   return open_probe_events("uprobe_events", readwrite);
+   return open_trace_file("uprobe_events", readwrite);
 }
 
 int probe_file__open(int flag)
@@ -899,6 +899,7 @@ bool probe_type_is_available(enum probe_type type)
size_t len = 0;
bool target_line = false;
bool ret = probe_type_table[type].avail;
+   int fd;
 
if (type >= PROBE_TYPE_END)
return false;
@@ -906,14 +907,16 @@ bool probe_type_is_available(enum probe_type type)
if (ret || probe_type_table[type].checked)
return ret;
 
-   if (asprintf(&buf, "%s/README", tracing_path) < 0)
+   fd = open_trace_file("README", false);
+   if (fd < 0)
return ret;
 
-   fp = fopen(buf, "r");
-   if (!fp)
-   goto end;
+   fp = fdopen(fd, "r");
+   if (!fp) {
+   close(fd);
+   return ret;
+   }
 
-   zfree(&buf);
while (getline(&buf, &len, fp) > 0 && !ret) {
if (!target_line) {
target_line = !!strstr(buf, " type: ");
@@ -928,7 +931,6 @@ bool probe_type_is_available(enum probe_type type)
probe_type_table[type].avail = ret;
 
fclose(fp);
-end:
free(buf);
 
return ret;
diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
index eba44c3e9dca..a17a82eff8a0 100644
--- a/tools/perf/util/probe-file.h
+++ b/tools/perf/util/probe-file.h
@@ -35,6 +35,7 @@ enum probe_type {
 
 /* probe-file.c depends on libelf */
 #ifdef HAVE_LIBELF_SUPPORT
+int open_trace_file(const char *trace_file, bool readwrite);
 int probe_file__open(int flag);
 int probe_file__open_both(int *kfd, int *ufd, int flag);
 struct strlist *probe_file__get_namelist(int fd);
-- 
2.11.1
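
The open/fdopen/getline pattern used in this series can be tried
outside of perf. The sketch below is a standalone userspace version: it
opens a file by path with plain open() instead of open_trace_file(),
wraps the descriptor in a FILE, and reports whether any line contains a
marker string. file_has_line_containing() is a hypothetical helper name
chosen for this example and does not exist in perf.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: true if any line of 'path' contains 'needle'. */
static bool file_has_line_containing(const char *path, const char *needle)
{
	char *line = NULL;
	size_t cap = 0;
	bool found = false;
	FILE *fp;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return false;

	/* On success the FILE owns fd; close it ourselves only on failure. */
	fp = fdopen(fd, "r");
	if (!fp) {
		close(fd);
		return false;
	}

	/* getline() grows 'line' as needed; stop at the first match. */
	while (!found && getline(&line, &cap, fp) > 0)
		found = strstr(line, needle) != NULL;

	fclose(fp);	/* also closes the underlying descriptor */
	free(line);
	return found;
}
```

A caller checking for the kretprobe marker would pass the README path
under its tracing directory and the string "place (kretprobe): ".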

Re: [PATCH 1/2] crypto: vmx - Use skcipher for cbc fallback

2017-02-23 Thread Herbert Xu

Paulo Flabiano Smorigo  wrote:
>
>fallback =
> -   crypto_alloc_blkcipher(alg, 0, CRYPTO_ALG_NEED_FALLBACK);
> +   crypto_alloc_skcipher(alg, 0, CRYPTO_ALG_NEED_FALLBACK);

You need to add CRYPTO_ALG_ASYNC to the mask in order to ensure
that you get a sync algorithm.
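
For anyone wondering why the mask matters: as I understand the crypto
API lookup, an implementation is eligible only if its flags agree with
the requested type on every bit set in the mask. The sketch below
models that rule in userspace. alg_matches() is a hypothetical helper,
and the flag values are my recollection of the kernel header, so treat
them as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values; verify against the kernel's crypto header. */
#define CRYPTO_ALG_ASYNC		0x00000080u
#define CRYPTO_ALG_NEED_FALLBACK	0x00000100u

/*
 * Hypothetical model of the lookup rule: bits selected by 'mask' must
 * be identical in the algorithm's flags and in the requested 'type'.
 * Bits outside the mask are "don't care".
 */
static bool alg_matches(uint32_t alg_flags, uint32_t type, uint32_t mask)
{
	return ((alg_flags ^ type) & mask) == 0;
}
```

With type 0 and a mask of only CRYPTO_ALG_NEED_FALLBACK, an async
implementation still matches because the async bit is not examined.
Adding CRYPTO_ALG_ASYNC to the mask forces that bit to be 0, which is
what restricts the fallback to synchronous implementations.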

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Michal Hocko

On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
[...]
> > There is a workaround in that a user could online the memory or have
> > a udev rule to online the memory by using the sysfs interface. The
> > sysfs interface to online memory goes through device_online() which
> > should updated the dev->offline flag. I'm not sure that having kernel
> > memory hotplug rely on userspace actions is the correct way to go.
> 
> Using udev rule for memory onlining is possible when you disable
> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> use memory hotplug to address memory pressure the loop through userspace
> is really slow and memory consuming, we may hit OOM before we manage to
> online newly added memory.

How does the in-kernel implementation prevent that?

> In addition to that, systemd/udev folks
> continuosly refused to add this udev rule to udev calling it stupid as
> it actually is an unconditional and redundant ping-pong between kernel
> and udev.

This is a policy and as such it doesn't belong to the kernel. The whole
auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
merged it.
-- 
Michal Hocko
SUSE Labs

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Vitaly Kuznetsov

Michal Hocko  writes:

> On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> [...]
>> > There is a workaround in that a user could online the memory or have
>> > a udev rule to online the memory by using the sysfs interface. The
>> > sysfs interface to online memory goes through device_online() which
>> > should updated the dev->offline flag. I'm not sure that having kernel
>> > memory hotplug rely on userspace actions is the correct way to go.
>> 
>> Using udev rule for memory onlining is possible when you disable
>> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> use memory hotplug to address memory pressure the loop through userspace
>> is really slow and memory consuming, we may hit OOM before we manage to
>> online newly added memory.
>
> How does the in-kernel implementation prevents from that?
>

Onlining memory on hot-plug is much more reliable, e.g. if we were able
to add it in add_memory_resource() we'll also manage to online it. With
a udev rule we may end up adding many blocks and then (as udev is
asynchronous) failing to online any of them. In-kernel operation is
synchronous.

>> In addition to that, systemd/udev folks
>> continuosly refused to add this udev rule to udev calling it stupid as
>> it actually is an unconditional and redundant ping-pong between kernel
>> and udev.
>
> This is a policy and as such it doesn't belong to the kernel. The whole
> auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> merged it.

I disagree.

First of all it's not a policy, it is a default. We have many other
defaults in kernel. When I add a network card or a storage, for example,
I don't need to go anywhere and 'enable' it before I'm able to use
it from userspace. And for memory (and CPUs) we, for some unknown reason
opted for something completely different. If someone is plugging new
memory into a box he probably wants to use it, I don't see much value in
waiting for a special confirmation from him. 

Second, this feature is optional. If you want to keep old behavior just
don't enable it.

Third, this solves real world issues. With Hyper-V it is very easy to
show udev failing on stress. No other solution to the issue was ever
suggested.

-- 
  Vitaly

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Michal Hocko

On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> > [...]
> >> > There is a workaround in that a user could online the memory or have
> >> > a udev rule to online the memory by using the sysfs interface. The
> >> > sysfs interface to online memory goes through device_online() which
> >> > should updated the dev->offline flag. I'm not sure that having kernel
> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> 
> >> Using udev rule for memory onlining is possible when you disable
> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> use memory hotplug to address memory pressure the loop through userspace
> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> online newly added memory.
> >
> > How does the in-kernel implementation prevents from that?
> >
> 
> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> to add it in add_memory_resource() we'll also manage to online it.

How does that differ from initiating online from the users?

> With
> udev rule we may end up adding many blocks and then (as udev is
> asynchronous) failing to online any of them.

Why would it fail?

> In-kernel operation is synchronous.

which doesn't mean anything as the context is preemptible AFAICS.

> >> In addition to that, systemd/udev folks
> >> continuosly refused to add this udev rule to udev calling it stupid as
> >> it actually is an unconditional and redundant ping-pong between kernel
> >> and udev.
> >
> > This is a policy and as such it doesn't belong to the kernel. The whole
> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> > merged it.
> 
> I disagree.
> 
> First of all it's not a policy, it is a default. We have many other
> defaults in kernel. When I add a network card or a storage, for example,
> I don't need to go anywhere and 'enable' it before I'm able to use
> it from userspace. An for memory (and CPUs) we, for some unknown reason
> opted for something completely different. If someone is plugging new
> memory into a box he probably wants to use it, I don't see much value in
> waiting for a special confirmation from him. 

This was not my decision so I can only guess but to me it makes sense.
Both memory and cpus can be physically present and offline which is a
perfectly reasonable state. So having a two-phase physical hotadd is
just built on top of the physical vs. logical distinction. I completely
understand that some usecases will really like to online the whole node
as soon as it appears present. But an automatic in-kernel implementation
has its downsides - e.g. if this operation fails in the middle you will
not know about that unless you check all the memblocks in sysfs. This is
really a poor interface.
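
To illustrate what "check all the memblocks in sysfs" means in
practice, here is a rough userspace sketch that walks a directory of
memoryN entries and counts the blocks whose state file does not read
"online". On a real system the base directory would be
/sys/devices/system/memory; count_not_online() is a hypothetical helper
written for this example, not an existing tool.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count memoryN entries under 'base' whose
 * 'state' file does not start with "online". Returns -1 if 'base'
 * cannot be opened.
 */
static int count_not_online(const char *base)
{
	DIR *dir = opendir(base);
	struct dirent *de;
	int count = 0;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		char path[4096];
		char state[32] = "";
		FILE *fp;

		/* Memory block directories are named memory<N>. */
		if (strncmp(de->d_name, "memory", 6) != 0 ||
		    de->d_name[6] < '0' || de->d_name[6] > '9')
			continue;

		snprintf(path, sizeof(path), "%s/%s/state", base, de->d_name);
		fp = fopen(path, "r");
		if (!fp)
			continue;	/* unreadable entries are skipped */

		/* "offline" and "going-offline" both count as not online. */
		if (fgets(state, sizeof(state), fp) &&
		    strncmp(state, "online", 6) != 0)
			count++;
		fclose(fp);
	}

	closedir(dir);
	return count;
}
```

A non-zero result after a hot-add request would be the only hint that
some blocks were added but never onlined, which is the visibility
problem described above.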

> Second, this feature is optional. If you want to keep old behavior just
> don't enable it.

It just adds unnecessary configuration noise as well.

> Third, this solves real world issues. With Hyper-V it is very easy to
> show udev failing on stress. 

What is the reason for these failures? Do you have any link handy?

> No other solution to the issue was ever suggested.

You mean like using ballooning for the memory overcommit like other more
reasonable virtualization solutions?

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 00/35] treewide trivial patches converting pr_warning to pr_warn

2017-02-23 Thread Rob Herring

On Fri, Feb 17, 2017 at 1:11 AM, Joe Perches  wrote:
> There are ~4300 uses of pr_warn and ~250 uses of the older
> pr_warning in the kernel source tree.
>
> Make the use of pr_warn consistent across all kernel files.
>
> This excludes all files in tools/ as there is a separate
> define pr_warning for that directory tree and pr_warn is
> not used in tools/.
>
> Done with 'sed s/\bpr_warning\b/pr_warn/' and some emacsing.
>
> Miscellanea:
>
> o Coalesce formats and realign arguments
>
> Some files not compiled - no cross-compilers
>
> Joe Perches (35):
>   alpha: Convert remaining uses of pr_warning to pr_warn
>   ARM: ep93xx: Convert remaining uses of pr_warning to pr_warn
>   arm64: Convert remaining uses of pr_warning to pr_warn
>   arch/blackfin: Convert remaining uses of pr_warning to pr_warn
>   ia64: Convert remaining use of pr_warning to pr_warn
>   powerpc: Convert remaining uses of pr_warning to pr_warn
>   sh: Convert remaining uses of pr_warning to pr_warn
>   sparc: Convert remaining use of pr_warning to pr_warn
>   x86: Convert remaining uses of pr_warning to pr_warn
>   drivers/acpi: Convert remaining uses of pr_warning to pr_warn
>   block/drbd: Convert remaining uses of pr_warning to pr_warn
>   gdrom: Convert remaining uses of pr_warning to pr_warn
>   drivers/char: Convert remaining use of pr_warning to pr_warn
>   clocksource: Convert remaining use of pr_warning to pr_warn
>   drivers/crypto: Convert remaining uses of pr_warning to pr_warn
>   fmc: Convert remaining use of pr_warning to pr_warn
>   drivers/gpu: Convert remaining uses of pr_warning to pr_warn
>   drivers/ide: Convert remaining uses of pr_warning to pr_warn
>   drivers/input: Convert remaining uses of pr_warning to pr_warn
>   drivers/isdn: Convert remaining uses of pr_warning to pr_warn
>   drivers/macintosh: Convert remaining uses of pr_warning to pr_warn
>   drivers/media: Convert remaining use of pr_warning to pr_warn
>   drivers/mfd: Convert remaining uses of pr_warning to pr_warn
>   drivers/mtd: Convert remaining uses of pr_warning to pr_warn
>   drivers/of: Convert remaining uses of pr_warning to pr_warn
>   drivers/oprofile: Convert remaining uses of pr_warning to pr_warn
>   drivers/platform: Convert remaining uses of pr_warning to pr_warn
>   drivers/rapidio: Convert remaining use of pr_warning to pr_warn
>   drivers/scsi: Convert remaining use of pr_warning to pr_warn
>   drivers/sh: Convert remaining use of pr_warning to pr_warn
>   drivers/tty: Convert remaining uses of pr_warning to pr_warn
>   drivers/video: Convert remaining uses of pr_warning to pr_warn
>   kernel/trace: Convert remaining uses of pr_warning to pr_warn
>   lib: Convert remaining uses of pr_warning to pr_warn
>   sound/soc: Convert remaining uses of pr_warning to pr_warn

Where's the removal of pr_warning so we don't have more sneak in?

Rob

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Vitaly Kuznetsov

Michal Hocko  writes:

> On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
>> 
>> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> > [...]
>> >> > There is a workaround in that a user could online the memory or have
>> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> > sysfs interface to online memory goes through device_online() which
>> >> > should updated the dev->offline flag. I'm not sure that having kernel
>> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> 
>> >> Using udev rule for memory onlining is possible when you disable
>> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> use memory hotplug to address memory pressure the loop through userspace
>> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> online newly added memory.
>> >
>> > How does the in-kernel implementation prevents from that?
>> >
>> 
>> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> to add it in add_memory_resource() we'll also manage to online it.
>
> How does that differ from initiating online from the users?
>
>> With
>> udev rule we may end up adding many blocks and then (as udev is
>> asynchronous) failing to online any of them.
>
> Why would it fail?
>
>> In-kernel operation is synchronous.
>
> which doesn't mean anything as the context is preemptible AFAICS.
>

It actually does.

Imagine the following example: you run a small guest (256M of memory)
and now there is a request to add 1000 128MB blocks to it. In case you
do it the old way you're very likely to get OOM somewhere in the middle
as you keep adding blocks which require kernel memory and nobody is
onlining them (or, at least, you're racing with the onliner). With the
in-kernel implementation we're going to online the first block when it's
added and only then go to the second.
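
A back-of-the-envelope calculation shows the scale of the problem. It
assumes 4 KiB pages and a 64-byte struct page, that the page array for
a block is allocated from already-online memory when the block is
added, and it ignores every other per-block cost. These are
assumptions for illustration only; memmap_bytes() is a hypothetical
helper.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES		4096u	/* assumption: 4 KiB pages */
#define STRUCT_PAGE_BYTES	64u	/* assumption: typical 64-bit size */

/*
 * Hypothetical estimate of the page-array (memmap) memory needed to
 * add 'nr_blocks' memory blocks of 'block_bytes' each.
 */
static uint64_t memmap_bytes(uint64_t block_bytes, uint64_t nr_blocks)
{
	return block_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES * nr_blocks;
}
```

Under these assumptions one 128 MiB block needs 2 MiB of page array,
so 1000 blocks need about 2000 MiB, far more than a 256 MiB guest has
if none of the new blocks are online yet.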

>> >> In addition to that, systemd/udev folks
>> >> continuosly refused to add this udev rule to udev calling it stupid as
>> >> it actually is an unconditional and redundant ping-pong between kernel
>> >> and udev.
>> >
>> > This is a policy and as such it doesn't belong to the kernel. The whole
>> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
>> > merged it.
>> 
>> I disagree.
>> 
>> First of all it's not a policy, it is a default. We have many other
>> defaults in kernel. When I add a network card or a storage, for example,
>> I don't need to go anywhere and 'enable' it before I'm able to use
>> it from userspace. An for memory (and CPUs) we, for some unknown reason
>> opted for something completely different. If someone is plugging new
>> memory into a box he probably wants to use it, I don't see much value in
>> waiting for a special confirmation from him. 
>
> This was not my decision so I can only guess but to me it makes sense.
> Both memory and cpus can be physically present and offline which is a
> perfectly reasonable state. So having a two phase physicall hotadd is
> just built on top of physical vs. logical distinction. I completely
> understand that some usecases will really like to online the whole node
> as soon as it appears present. But an automatic in-kernel implementation
> has its down sites - e.g. if this operation fails in the middle you will
> not know about that unless you check all the memblocks in sysfs. This is
> really a poor interface.

And how do you know that some blocks failed to online with udev? Who
handles these failures and how? And, last but not least, why do
these failures happen?

>
>> Second, this feature is optional. If you want to keep old behavior just
>> don't enable it.
>
> It just adds unnecessary configuration noise as well
>

For any particular user everything he doesn't need is 'noise'...

>> Third, this solves real world issues. With Hyper-V it is very easy to
>> show udev failing on stress. 
>
> What is the reason for this failures. Do you have any link handy?
>

The reason is running out of memory, swapping and being slow in
general. Again, think about the example I gave above: there is a request
to add many memory blocks and if we try to handle them all before any of
them gets onlined we will get OOM and may even kill the udev process.

>> No other solution to the issue was ever suggested.
>
> you mean like using ballooning for the memory overcommit like other more
> reasonable virtualization solutions?

Not sure how ballooning is related here. Hyper-V uses memory hotplug to
add memory to domains, I don't think we have any other solutions for
that. From hypervisor's point of view the memory was added when the
particular request succeeded, it is not aware of our 'logical/physical'
separation.

-- 
  Vitaly

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Michal Hocko

On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> >> Michal Hocko  writes:
> >> 
> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> >> > [...]
> >> >> > There is a workaround in that a user could online the memory or have
> >> >> > a udev rule to online the memory by using the sysfs interface. The
> >> >> > sysfs interface to online memory goes through device_online() which
> >> >> > should updated the dev->offline flag. I'm not sure that having kernel
> >> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> >> 
> >> >> Using udev rule for memory onlining is possible when you disable
> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> >> use memory hotplug to address memory pressure the loop through userspace
> >> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> >> online newly added memory.
> >> >
> >> > How does the in-kernel implementation prevents from that?
> >> >
> >> 
> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> >> to add it in add_memory_resource() we'll also manage to online it.
> >
> > How does that differ from initiating online from the users?
> >
> >> With
> >> udev rule we may end up adding many blocks and then (as udev is
> >> asynchronous) failing to online any of them.
> >
> > Why would it fail?
> >
> >> In-kernel operation is synchronous.
> >
> > which doesn't mean anything as the context is preemptible AFAICS.
> >
> 
> It actually does,
> 
> imagine the following example: you run a small guest (256M of memory)
> and now there is a request to add 1000 128mb blocks to it. 

Is a grow from 256M -> 128GB really something that happens in real life?
Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
which is an operation which has to allocate memory has to scale with the
currently available memory IMHO.

> In case you
> do it the old way you're very likely to get OOM somewhere in the middle
> as you keep adding blocks which require kernel memory and nobody is
> onlining it (or, at least you're racing with the onliner). With
> in-kernel implementation we're going to online the first block when it's
> added and only then go to the second.

Yes, adding a memory will cost you some memory and that is why I am
really skeptical when memory hotplug is used under a strong memory
pressure. This can lead to OOMs even when you online one block at the
time.

[...]
> > This was not my decision so I can only guess but to me it makes sense.
> > Both memory and cpus can be physically present and offline which is a
> > perfectly reasonable state. So having a two phase physical hotadd is
> > just built on top of physical vs. logical distinction. I completely
> > understand that some usecases will really like to online the whole node
> > as soon as it appears present. But an automatic in-kernel implementation
> > has its downsides - e.g. if this operation fails in the middle you will
> > not know about that unless you check all the memblocks in sysfs. This is
> > really a poor interface.
> 
> And how do you know that some blocks failed to online with udev?

Because the udev will run a code which can cope with that - retry if the
error is recoverable or simply report with all the details. Compare that
to crawling the system log to see that something has broken...

> Who
> handles these failures and how? And, the last but not least, why do
> these failures happen?

I haven't heard reports about the failures and from looking into the
code those are possible but very unlikely.
-- 
Michal Hocko
SUSE Labs

[PATCH] powernv-cpuidle: Validate DT property arrays are of same size

2017-02-23 Thread Gautham R. Shenoy

From: "Gautham R. Shenoy" 

The various properties associated with powernv idle states such as
names, flags, residency-ns, latencies-ns, psscr, psscr-mask are exposed
in the device-tree as property arrays such that the pointwise entries in each
of these arrays correspond to the properties of the same idle state.

This patch validates that the lengths of the property arrays are the
same. If there is a mismatch, the patch will ensure that we bail out and
not expose the platform idle states via cpuidle.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-powernv.c | 56 ---
 1 file changed, 53 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 3705930..c30b9fd 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -197,11 +197,25 @@ static inline void add_powernv_state(int index, const char *name,
stop_psscr_table[index].mask = psscr_mask;
 }
 
+/*
+ * Returns 0 if prop1_len == prop2_len. Else returns -1
+ */
+static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len,
+const char *prop2, int prop2_len)
+{
+   if (prop1_len == prop2_len)
+   return 0;
+
+   pr_warn("cpuidle-powernv: array sizes don't match for %s and %s\n",
+   prop1, prop2);
+   return -1;
+}
+
 static int powernv_add_idle_states(void)
 {
struct device_node *power_mgt;
int nr_idle_states = 1; /* Snooze */
-   int dt_idle_states;
+   int dt_idle_states, count;
u32 latency_ns[CPUIDLE_STATE_MAX];
u32 residency_ns[CPUIDLE_STATE_MAX];
u32 flags[CPUIDLE_STATE_MAX];
@@ -226,6 +240,19 @@ static int powernv_add_idle_states(void)
goto out;
}
 
+   count = of_property_count_u32_elems(power_mgt,
+   "ibm,cpu-idle-state-latencies-ns");
+
+   if (validate_dt_prop_sizes("flags", dt_idle_states,
+  "latencies-ns", count) != 0)
+   goto out;
+
+   count = of_property_count_strings(power_mgt,
+ "ibm,cpu-idle-state-names");
+   if (validate_dt_prop_sizes("flags", dt_idle_states,
+  "names", count) != 0)
+   goto out;
+
/*
 * Since snooze is used as first idle state, max idle states allowed is
 * CPUIDLE_STATE_MAX -1
@@ -260,6 +287,18 @@ static int powernv_add_idle_states(void)
has_stop_states = (flags[0] &
   (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
if (has_stop_states) {
+   count = of_property_count_u64_elems(power_mgt,
+   "ibm,cpu-idle-state-psscr");
+   if (validate_dt_prop_sizes("flags", dt_idle_states,
+  "psscr", count) != 0)
+   goto out;
+
+   count = of_property_count_u64_elems(power_mgt,
+   "ibm,cpu-idle-state-psscr-mask");
+   if (validate_dt_prop_sizes("flags", dt_idle_states,
+  "psscr-mask", count) != 0)
+   goto out;
+
if (of_property_read_u64_array(power_mgt,
"ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) {
pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
@@ -274,8 +313,19 @@ static int powernv_add_idle_states(void)
}
}
 
-   rc = of_property_read_u32_array(power_mgt,
-   "ibm,cpu-idle-state-residency-ns", residency_ns, dt_idle_states);
+   count = of_property_count_u32_elems(power_mgt,
+   "ibm,cpu-idle-state-residency-ns");
+
+   if (count < 0) {
+   rc = count;
+   } else if (validate_dt_prop_sizes("flags", dt_idle_states,
+ "residency-ns", count) != 0) {
+   goto out;
+   } else {
+   rc = of_property_read_u32_array(power_mgt,
+   "ibm,cpu-idle-state-residency-ns",
+   residency_ns, dt_idle_states);
+   }
 
for (i = 0; i < dt_idle_states; i++) {
unsigned int exit_latency, target_residency;
-- 
1.9.4
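
The check this patch adds can be illustrated outside the kernel. The
following is only a userspace sketch of the same idea (compare every
property's element count against the first one and bail out on a
mismatch); struct prop_len and validate_prop_lens() are hypothetical
names, not part of the patch or of any kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one device-tree property and its length. */
struct prop_len {
	const char *name;
	int count;	/* element count; negative means "could not be read" */
};

/*
 * Returns 0 if all n properties have the same, non-negative element
 * count as props[0]. Returns -1 on the first mismatch, or if props[0]
 * itself could not be read.
 */
static int validate_prop_lens(const struct prop_len *props, size_t n)
{
	size_t i;

	assert(props != NULL || n == 0);

	if (n == 0)
		return 0;
	if (props[0].count < 0)
		return -1;

	for (i = 1; i < n; i++) {
		if (props[i].count != props[0].count) {
			fprintf(stderr,
				"array sizes don't match for %s and %s\n",
				props[0].name, props[i].name);
			return -1;
		}
	}
	return 0;
}
```

Checking everything against a single reference property ("flags" in
the patch) keeps the warning message simple: it always names the
reference and the first property that disagrees with it.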

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Vitaly Kuznetsov

Michal Hocko  writes:

> On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
>> 
>> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> >> Michal Hocko  writes:
>> >> 
>> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> >> > [...]
>> >> >> > There is a workaround in that a user could online the memory or have
>> >> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> >> > sysfs interface to online memory goes through device_online() which
>> >> >> > should update the dev->offline flag. I'm not sure that having kernel
>> >> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> >> 
>> >> >> Using udev rule for memory onlining is possible when you disable
>> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> >> use memory hotplug to address memory pressure the loop through userspace
>> >> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> >> online newly added memory.
>> >> >
>> >> > How does the in-kernel implementation prevents from that?
>> >> >
>> >> 
>> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> >> to add it in add_memory_resource() we'll also manage to online it.
>> >
>> > How does that differ from initiating online from the users?
>> >
>> >> With
>> >> udev rule we may end up adding many blocks and then (as udev is
>> >> asynchronous) failing to online any of them.
>> >
>> > Why would it fail?
>> >
>> >> In-kernel operation is synchronous.
>> >
>> > which doesn't mean anything as the context is preemptible AFAICS.
>> >
>> 
>> It actually does,
>> 
>> imagine the following example: you run a small guest (256M of memory)
>> and now there is a request to add 1000 128mb blocks to it. 
>
> Is a grow from 256M -> 128GB really something that happens in real life?
> Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> which is an operation which has to allocate memory has to scale with the
> currently available memory IMHO.

With virtual machines this is very real and not exaggerated at
all. E.g. Hyper-V host can be tuned to automatically add new memory when
guest is running out of it. Even 100 blocks can represent an issue.

>
>> In case you
>> do it the old way you're very likely to get OOM somewhere in the middle
>> as you keep adding blocks which require kernel memory and nobody is
>> onlining it (or, at least you're racing with the onliner). With
>> in-kernel implementation we're going to online the first block when it's
>> added and only then go to the second.
>
> Yes, adding a memory will cost you some memory and that is why I am
> really skeptical when memory hotplug is used under a strong memory
> pressure. This can lead to OOMs even when you online one block at the
> time.

If you can't allocate anything then yes, of course it will fail. But if
you try to add many blocks without onlining at the same time the
probability of failure is orders of magnitude higher.

(a bit unrelated) I was actually thinking about the possible failure and
had the following idea in my head: we always keep everything allocated
for one additional memory block so when hotplug happens we use this
reserved space to add the block, online it and immediately reserve space
for the next one. I didn't do any coding yet.
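
A minimal sketch of this reservation idea, written as a userspace
model rather than kernel code. All names (struct hotplug_model,
refill_reserve(), hot_add_block()) and the MiB-based accounting are
assumptions made for illustration only; nothing like this exists in
the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the guest's memory; all sizes are in MiB. */
struct hotplug_model {
	long free_mb;		/* memory available for new allocations */
	long reserve_mb;	/* metadata pre-allocated for the next block */
	long meta_mb;		/* metadata needed to add one block */
	long block_mb;		/* usable size of one memory block */
};

/* Set aside metadata for one future block; false if memory is too low. */
static bool refill_reserve(struct hotplug_model *m)
{
	assert(m != NULL);

	if (m->reserve_mb >= m->meta_mb)
		return true;
	if (m->free_mb < m->meta_mb)
		return false;

	m->free_mb -= m->meta_mb;
	m->reserve_mb = m->meta_mb;
	return true;
}

/*
 * Add one block using only the reserve, online it, then try to
 * reserve space for the next block out of the newly onlined memory.
 * Returns false if there was no reserve to add the block with.
 */
static bool hot_add_block(struct hotplug_model *m)
{
	assert(m != NULL);

	if (m->reserve_mb < m->meta_mb)
		return false;

	m->reserve_mb = 0;		/* the add consumes the reserve */
	m->free_mb += m->block_mb;	/* onlining makes the block usable */
	(void)refill_reserve(m);	/* best effort for the next request */
	return true;
}
```

In this model a hot-add succeeds even when free_mb is 0, as long as
the reserve was filled earlier, which is the property the idea is
after.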

>
> [...]
>> > This was not my decision so I can only guess but to me it makes sense.
>> > Both memory and cpus can be physically present and offline which is a
>> > perfectly reasonable state. So having a two phase physical hotadd is
>> > just built on top of physical vs. logical distinction. I completely
>> > understand that some usecases will really like to online the whole node
>> > as soon as it appears present. But an automatic in-kernel implementation
>> > has its downsides - e.g. if this operation fails in the middle you will
>> > not know about that unless you check all the memblocks in sysfs. This is
>> > really a poor interface.
>> 
>> And how do you know that some blocks failed to online with udev?
>
> Because the udev will run a code which can cope with that - retry if the
> error is recoverable or simply report with all the details. Compare that
> to crawling the system log to see that something has broken...

I don't know much about udev, but the most common rule to online memory
I've met is:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

doesn't do anything smart.

In current RHEL7 it is even worse:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

so to online a new memory block we actually need to run a process.

>
>> Who
>> handles these failures and how? And, the last but not least, why do
>> these failures happen?
>
> I haven't heard reports about the failures and from looking into the
> code those are possible but very unlikely.

My point is - failures are possible, yes, but

Re: [PATCH 00/35] treewide trivial patches converting pr_warning to pr_warn

2017-02-23 Thread Joe Perches

On Thu, 2017-02-23 at 09:28 -0600, Rob Herring wrote:
> On Fri, Feb 17, 2017 at 1:11 AM, Joe Perches  wrote:
> > There are ~4300 uses of pr_warn and ~250 uses of the older
> > pr_warning in the kernel source tree.
> > 
> > Make the use of pr_warn consistent across all kernel files.
> > 
> > This excludes all files in tools/ as there is a separate
> > define pr_warning for that directory tree and pr_warn is
> > not used in tools/.
> > 
> > Done with 'sed s/\bpr_warning\b/pr_warn/' and some emacsing.
[]
> Where's the removal of pr_warning so we don't have more sneak in?

After all of these actually get applied,
and maybe a cycle or two later, one would
get sent.

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Michal Hocko

On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
[...]
> > Is a grow from 256M -> 128GB really something that happens in real life?
> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> > which is an operation which has to allocate memory has to scale with the
> > currently available memory IMHO.
> 
> With virtual machines this is very real and not exaggerated at
> all. E.g. Hyper-V host can be tuned to automatically add new memory when
> guest is running out of it. Even 100 blocks can represent an issue.

Do you have any reference to a bug report? I am really curious because
something really smells wrong and it is not clear that the chosen
solution is really the best one.
[...]
> > Because the udev will run a code which can cope with that - retry if the
> > error is recoverable or simply report with all the details. Compare that
> > to crawling the system log to see that something has broken...
> 
> I don't know much about udev, but the most common rule to online memory
> I've met is:
> 
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
> 
> doesn't do anything smart.

So what? Is there anything that prevents doing something smarter?
-- 
Michal Hocko
SUSE Labs
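
As an illustration of what a "smarter" userspace onliner could look
like, here is a rough sketch in C. It writes "online" to a memory
block's sysfs state file (normally
/sys/devices/system/memory/memoryN/state) and retries on errors that
might be transient. The function names, the retry count and the choice
of which errno values count as recoverable are assumptions, not
something taken from udev or from this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write "online" to the given state file once; 0 or negative errno. */
static int write_online_once(const char *state_path)
{
	static const char cmd[] = "online";
	ssize_t n;
	int fd, err = 0;

	fd = open(state_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, cmd, strlen(cmd));
	if (n < 0)
		err = -errno;
	else if ((size_t)n != strlen(cmd))
		err = -EIO;

	close(fd);
	return err;
}

/*
 * Try to online one memory block, retrying up to max_tries times.
 * Returns 0 on success or the last negative errno value so that the
 * caller can report exactly what went wrong.
 */
static int online_memory_block(const char *state_path, int max_tries)
{
	int err = -EINVAL;
	int i;

	assert(state_path != NULL);

	for (i = 0; i < max_tries; i++) {
		err = write_online_once(state_path);
		if (err == 0)
			return 0;
		/* Assumption: only these errors are worth retrying. */
		if (err != -EAGAIN && err != -EBUSY && err != -ENOMEM)
			break;
		sleep(1);
	}
	return err;
}
```

A helper like this can report a precise error per block, but it still
runs as a separate process for every event, so it does not by itself
address the cost of going through userspace under memory pressure.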

Re: [PATCH 00/35] treewide trivial patches converting pr_warning to pr_warn

2017-02-23 Thread Joe Perches

On Thu, 2017-02-23 at 17:41 +0000, Emil Velikov wrote:
> On 23 February 2017 at 17:18, Joe Perches  wrote:
> > On Thu, 2017-02-23 at 09:28 -0600, Rob Herring wrote:
> > > On Fri, Feb 17, 2017 at 1:11 AM, Joe Perches  wrote:
> > > > There are ~4300 uses of pr_warn and ~250 uses of the older
> > > > pr_warning in the kernel source tree.
> > > > 
> > > > Make the use of pr_warn consistent across all kernel files.
> > > > 
> > > > This excludes all files in tools/ as there is a separate
> > > > define pr_warning for that directory tree and pr_warn is
> > > > not used in tools/.
> > > > 
> > > > Done with 'sed s/\bpr_warning\b/pr_warn/' and some emacsing.
> > 
> > []
> > > Where's the removal of pr_warning so we don't have more sneak in?
> > 
> > After all of these actually get applied,
> > and maybe a cycle or two later, one would
> > get sent.
> > 
> 
> By which point you'll get a few reincarnations of it. So you'll have to
> do the same exercise again :-(

Maybe to one or two files.  Not a big deal.

> I guess the question is - are you expecting to get the series merged
> all together/via one tree ?

No.  The only person that could do that effectively is Linus.

> If not, your plan is perfectly reasonable.

Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

2017-02-23 Thread Vitaly Kuznetsov

Michal Hocko  writes:

> On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
> [...]
>> > Is a grow from 256M -> 128GB really something that happens in real life?
>> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
>> > which is an operation which has to allocate memory has to scale with the
>> > currently available memory IMHO.
>> 
>> With virtual machines this is very real and not exaggerated at
>> all. E.g. Hyper-V host can be tuned to automatically add new memory when
>> guest is running out of it. Even 100 blocks can represent an issue.
>
> Do you have any reference to a bug report. I am really curious because
> something really smells wrong and it is not clear that the chosen
> solution is really the best one.

Unfortunately I'm not aware of any publicly posted bug reports (CC:
K. Y. - he may have a reference) but I think I still remember everything
correctly. Not sure how deep you want me to go into details though...

Virtual guests under stress were getting into OOM easily and the OOM
killer was even killing the udev process trying to online the
memory. There was a workaround for the issue added to the hyper-v driver
doing memory add:

hv_mem_hot_add(...) {
...
 add_memory();
 wait_for_completion_timeout(..., 5*HZ);
 ...
}

the completion was done by observing for the MEM_ONLINE event. This, of
course, was slowing things down significantly and waiting for a
userspace action in kernel is not a nice thing to have (not speaking
about all other memory adding methods which had the same issue). Just
removing this wait was leading us to the same OOM as the hypervisor was
adding more and more memory and eventually even add_memory() was
failing, udev and other processes were killed,...

With the feature in place we have new memory available right after we do
add_memory(), everything is serialized.
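
To put rough numbers on the difference, here is a small model of the
two orderings. It is only back-of-the-envelope arithmetic: it assumes
that adding a 128 MiB block costs about 2 MiB of kernel metadata
(32768 pages of 4 KiB at 64 bytes of struct page each, typical for
x86_64 but an assumption here) and, optimistically, that all 256 MiB
of the guest are free. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Old way: blocks are added back to back and nothing is onlined in
 * the meantime. Returns how many blocks fit before memory runs out.
 */
static int add_all_then_online(long free_mb, long meta_mb, int nblocks)
{
	int added = 0;

	assert(meta_mb > 0);

	while (added < nblocks && free_mb >= meta_mb) {
		free_mb -= meta_mb;	/* metadata for the new block */
		added++;
	}
	return added;
}

/*
 * Serialized way: each block is onlined right after it is added, so
 * its memory is available before the next add starts.
 */
static int add_and_online_each(long free_mb, long meta_mb, long block_mb,
			       int nblocks)
{
	int added = 0;

	assert(meta_mb > 0);

	while (added < nblocks && free_mb >= meta_mb) {
		free_mb -= meta_mb;	/* metadata for the new block */
		free_mb += block_mb;	/* block is online and usable */
		added++;
	}
	return added;
}
```

With these assumptions, add_all_then_online(256, 2, 1000) stops after
128 blocks, while add_and_online_each(256, 2, 128, 1000) adds all
1000. Real guests have far less than 256 MiB free, so the first case
fails much earlier in practice.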

> [...]
>> > Because the udev will run a code which can cope with that - retry if the
>> > error is recoverable or simply report with all the details. Compare that
>> > to crawling the system log to see that something has broken...
>> 
>> I don't know much about udev, but the most common rule to online memory
>> I've met is:
>> 
>> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
>> 
>> doesn't do anything smart.
>
> So what? Is there anything that prevents doing something smarter?

Yes, the asynchronous nature of all this stuff. There is no way you can
stop other blocks from being added to the system while you're processing
something in userspace.

-- 
  Vitaly

Re: [PATCH v2 4/5] perf: kretprobes: offset from reloc_sym if kernel supports it

2017-02-23 Thread Naveen N. Rao

On 2017/02/23 06:10PM, Masami Hiramatsu wrote:
> On Wed, 22 Feb 2017 19:23:40 +0530
> "Naveen N. Rao"  wrote:
> 
> > We indicate support for accepting sym+offset with kretprobes through a
> > line in ftrace README. Parse the same to identify support and choose the
> > appropriate format for kprobe_events.
> > 
> > Signed-off-by: Naveen N. Rao 
> > ---
> >  tools/perf/util/probe-event.c | 47 
> > ---
> >  tools/perf/util/probe-event.h |  2 ++
> >  2 files changed, 42 insertions(+), 7 deletions(-)
> > 

[snip]

> 
> Could you reuse (refactoring) probe_type_is_available() in probe-file.c to share
> opening README file?

Done. I've sent patches to do that, please review.

> 
> Others looks good to me :)

Thanks. I hope that's an Ack for this patchset?

If so, and if Ingo/Michael agree, would it be ok to take the kernel bits 
through the powerpc tree like we did for kprobe_exceptions_notify() 
cleanup?


Regards,
Naveen

Re: [PATCH 00/35] treewide trivial patches converting pr_warning to pr_warn

2017-02-23 Thread Emil Velikov

On 23 February 2017 at 17:18, Joe Perches  wrote:
> On Thu, 2017-02-23 at 09:28 -0600, Rob Herring wrote:
>> On Fri, Feb 17, 2017 at 1:11 AM, Joe Perches  wrote:
>> > There are ~4300 uses of pr_warn and ~250 uses of the older
>> > pr_warning in the kernel source tree.
>> >
>> > Make the use of pr_warn consistent across all kernel files.
>> >
>> > This excludes all files in tools/ as there is a separate
>> > define pr_warning for that directory tree and pr_warn is
>> > not used in tools/.
>> >
>> > Done with 'sed s/\bpr_warning\b/pr_warn/' and some emacsing.
> []
>> Where's the removal of pr_warning so we don't have more sneak in?
>
> After all of these actually get applied,
> and maybe a cycle or two later, one would
> get sent.
>
By which point you'll get a few reincarnations of it. So you'll have to
do the same exercise again :-(

I guess the question is - are you expecting to get the series merged
all together/via one tree ? If not, your plan is perfectly reasonable.
Fwiw in the DRM subsystem, similar cleanups do purge the respective
macros/other with the final commit. But there one can pull the lot in
one go.

Regards,
Emil

[PATCH 01/12] drm/ast: Fix AST2400 POST failure without BMC FW or VBIOS

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

The current POST code for the AST2300/2400 family doesn't work properly
if the chip hasn't been initialized previously by either the BMC's own FW
or the VBIOS. This fixes it.

Signed-off-by: Y.C. Chen 
Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_post.c | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index 5331ee1..6c5391c 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -1638,12 +1638,44 @@ static void ast_init_dram_2300(struct drm_device *dev)
temp |= 0x73;
ast_write32(ast, 0x12008, temp);
 
+   param.dram_freq = 396;
param.dram_type = AST_DDR3;
+   temp = ast_mindwm(ast, 0x1e6e2070);
if (temp & 0x0100)
param.dram_type = AST_DDR2;
-   param.dram_chipid = ast->dram_type;
-   param.dram_freq = ast->mclk;
-   param.vram_size = ast->vram_size;
+switch (temp & 0x1800) {
+   case 0:
+   param.dram_chipid = AST_DRAM_512Mx16;
+   break;
+   default:
+   case 0x0800:
+   param.dram_chipid = AST_DRAM_1Gx16;
+   break;
+   case 0x1000:
+   param.dram_chipid = AST_DRAM_2Gx16;
+   break;
+   case 0x1800:
+   param.dram_chipid = AST_DRAM_4Gx16;
+   break;
+   }
+switch (temp & 0x0c) {
+default:
+   case 0x00:
+   param.vram_size = AST_VIDMEM_SIZE_8M;
+   break;
+
+   case 0x04:
+   param.vram_size = AST_VIDMEM_SIZE_16M;
+   break;
+
+   case 0x08:
+   param.vram_size = AST_VIDMEM_SIZE_32M;
+   break;
+
+   case 0x0c:
+   param.vram_size = AST_VIDMEM_SIZE_64M;
+   break;
+   }
 
if (param.dram_type == AST_DDR3) {
get_ddr3_info(ast, &param);
-- 
2.9.3
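
The strap decoding done by the two switch statements in this patch can
be sketched as a pair of pure functions. This is a standalone
illustration, not driver code: the enum and function names are
hypothetical (the patch itself uses the AST_DRAM_* and
AST_VIDMEM_SIZE_* constants), and the bit masks are simply the ones
visible in the diff above.

```c
#include <stdint.h>

/* Hypothetical local names for the DRAM chip types used in the patch. */
enum dram_chip {
	DRAM_512Mx16,
	DRAM_1Gx16,
	DRAM_2Gx16,
	DRAM_4Gx16
};

/* Bits 12:11 of the strap value select the DRAM chip type. */
static enum dram_chip decode_dram_chip(uint32_t strap)
{
	switch (strap & 0x1800) {
	case 0x0000:
		return DRAM_512Mx16;
	case 0x1000:
		return DRAM_2Gx16;
	case 0x1800:
		return DRAM_4Gx16;
	case 0x0800:
	default:
		return DRAM_1Gx16;
	}
}

/* Bits 3:2 of the strap value select 8, 16, 32 or 64 MiB of VRAM. */
static unsigned int decode_vram_mib(uint32_t strap)
{
	return 8u << ((strap >> 2) & 0x3);
}
```

Because each mask covers exactly two bits, all four values are handled
explicitly and the default labels in the patch can never be reached;
they only document the fallback.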

[PATCH 02/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Benjamin Herrenschmidt

From: Russell Currey 

The ast driver configures a window to enable access into BMC
memory space in order to read some configuration registers.

If this window is disabled, which it can be from the BMC side,
the ast driver can't function.

Closing this window is a necessity for security if a machine's
host side and BMC side are controlled by different parties;
i.e. a cloud provider offering machines "bare metal".

A recent patch went in to try to check if that window is open
but it does so by trying to access the registers in question
and testing if the result is 0xffffffff.

This method will trigger a PCIe error when the window is closed
which on some systems will be fatal (it will trigger an EEH
for example on POWER which will take out the device).

This patch improves this in two ways:

 - First, if the firmware has put properties in the device-tree
containing the relevant configuration information, we use these.

 - Otherwise, a bit in one of the SCU scratch registers (which
are readable via the VGA register space and writeable by the BMC)
will indicate if the BMC has closed the window. This bit has been
defined by Y.C Chen from Aspeed.

If the window is closed and the configuration isn't available from
the device-tree, some sane defaults are used. Those defaults are
hopefully sufficient for standard video modes used on a server.
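
The order of precedence described above can be summarised in a few
lines of C. This is only a sketch of the decision, with hypothetical
names and boolean inputs standing in for the real probing (device-tree
lookup, PCI ID check, scratch register read); it is not the code from
the patch.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three configuration sources. */
enum config_mode {
	USE_P2A,	/* read configuration through the P2A window */
	USE_DT,		/* read configuration from device-tree properties */
	USE_DEFAULTS	/* fall back to built-in defaults */
};

/*
 * Device-tree properties win if present. Otherwise the P2A window is
 * used only when the chip has a bridge and the BMC has not closed it.
 * In every other case the defaults are used.
 */
static enum config_mode pick_config_mode(bool have_dt_props,
					 bool has_p2a_bridge,
					 bool bmc_closed_window)
{
	if (have_dt_props)
		return USE_DT;
	if (has_p2a_bridge && !bmc_closed_window)
		return USE_P2A;
	return USE_DEFAULTS;
}
```

Checking the device-tree first means the window is never touched on
systems where firmware already provides the configuration, which is
what avoids the PCIe error on those machines.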

Signed-off-by: Russell Currey 
Signed-off-by: Benjamin Herrenschmidt 
--

v2. [BenH]
- Reworked on top of Aspeed P2A patch
- Cleanup overall detection via a "config_mode" and log the
  selected mode for diagnostics purposes
- Add a property for the SCU straps

v3. [BenH]
- Moved the config mode detection to a separate function
- Add reading of SCU 0x40 D[12] to detect the window is
  closed as to not trigger a bus error by just "trying".
  (change provided by Y.C. Chen)
v4. [BenH]
- Only devices with the AST2000 PCI ID have a P2A bridge
- Update the P2A presence test to account for VGA only
  mode as provided by Y.C. Chen.
---
 drivers/gpu/drm/ast/ast_drv.h  |   6 +-
 drivers/gpu/drm/ast/ast_main.c | 262 +
 drivers/gpu/drm/ast/ast_post.c |   7 +-
 3 files changed, 166 insertions(+), 109 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_drv.h b/drivers/gpu/drm/ast/ast_drv.h
index 7abda94..3bedcf7 100644
--- a/drivers/gpu/drm/ast/ast_drv.h
+++ b/drivers/gpu/drm/ast/ast_drv.h
@@ -113,7 +113,11 @@ struct ast_private {
struct ttm_bo_kmap_obj cache_kmap;
int next_cursor;
bool support_wide_screen;
-   bool DisableP2A;
+   enum {
+   ast_use_p2a,
+   ast_use_dt,
+   ast_use_defaults
+   } config_mode;
 
enum ast_tx_chip tx_chip_type;
u8 dp501_maxclk;
diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 533e762..36932a3 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -62,13 +62,83 @@ uint8_t ast_get_index_reg_mask(struct ast_private *ast,
return ret;
 }
 
+static void ast_detect_config_mode(struct drm_device *dev, u32 *scu_rev)
+{
+   struct device_node *np = dev->pdev->dev.of_node;
+   struct ast_private *ast = dev->dev_private;
+   uint32_t data, jregd0, jregd1;
+
+   /* Defaults */
+   ast->config_mode = ast_use_defaults;
+   *scu_rev = 0xffffffff;
+
+   /* Check if we have device-tree properties */
+   if (np && !of_property_read_u32(np, "ast,scu-revision-id", scu_rev)) {
+   /* We do, disable P2A access */
+   ast->config_mode = ast_use_dt;
+   DRM_INFO("Using device-tree for configuration\n");
+   return;
+   }
+
+   /* Not all families have a P2A bridge */
+   if (dev->pdev->device != PCI_CHIP_AST2000)
+   return;
+
+   /*
+* The BMC will set SCU 0x40 D[12] to 1 if the P2 bridge
+* is disabled. We force using P2A if VGA only mode bit
+* is set D[7]
+*/
+   jregd0 = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd0, 0xff);
+   jregd1 = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd1, 0xff);
+   if (!(jregd0 & 0x80) || !(jregd1 & 0x10)) {
+   /* Double check it's actually working */
+   data = ast_read32(ast, 0xf004);
+   if (data != 0xffffffff) {
+   /* P2A works, grab silicon revision */
+   ast->config_mode = ast_use_p2a;
+
+   DRM_INFO("Using P2A bridge for configuration\n");
+
+   /* Read SCU7c (silicon revision register) */
+   ast_write32(ast, 0xf004, 0x1e6e);
+   ast_write32(ast, 0xf000, 0x1);
+   *scu_rev = ast_read32(ast, 0x1207c);
+   return;
+   }
+   }
+
+   /* We have a P2A bridge but it's disabled */
+   DRM_INFO("P2A bridge disabled, using default

[PATCH 03/12] drm/ast: const'ify mode setting tables

2017-02-23 Thread Benjamin Herrenschmidt

And fix some comment alignment & space/tabs while at it

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_drv.h|   4 +-
 drivers/gpu/drm/ast/ast_mode.c   |   8 +--
 drivers/gpu/drm/ast/ast_tables.h | 106 +++
 3 files changed, 59 insertions(+), 59 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_drv.h b/drivers/gpu/drm/ast/ast_drv.h
index 3bedcf7..3fd9d6e 100644
--- a/drivers/gpu/drm/ast/ast_drv.h
+++ b/drivers/gpu/drm/ast/ast_drv.h
@@ -304,8 +304,8 @@ struct ast_vbios_dclk_info {
 };
 
 struct ast_vbios_mode_info {
-   struct ast_vbios_stdtable *std_table;
-   struct ast_vbios_enhtable *enh_table;
+   const struct ast_vbios_stdtable *std_table;
+   const struct ast_vbios_enhtable *enh_table;
 };
 
 extern int ast_mode_init(struct drm_device *dev);
diff --git a/drivers/gpu/drm/ast/ast_mode.c b/drivers/gpu/drm/ast/ast_mode.c
index e26c98f..1ff596e 100644
--- a/drivers/gpu/drm/ast/ast_mode.c
+++ b/drivers/gpu/drm/ast/ast_mode.c
@@ -80,9 +80,9 @@ static bool ast_get_vbios_mode_info(struct drm_crtc *crtc, struct drm_display_mo
 {
struct ast_private *ast = crtc->dev->dev_private;
u32 refresh_rate_index = 0, mode_id, color_index, refresh_rate;
+   const struct ast_vbios_enhtable *best = NULL;
u32 hborder, vborder;
bool check_sync;
-   struct ast_vbios_enhtable *best = NULL;
 
switch (crtc->primary->fb->bits_per_pixel) {
case 8:
@@ -146,7 +146,7 @@ static bool ast_get_vbios_mode_info(struct drm_crtc *crtc, struct drm_display_mo
refresh_rate = drm_mode_vrefresh(mode);
check_sync = vbios_mode->enh_table->flags & WideScreenMode;
do {
-   struct ast_vbios_enhtable *loop = vbios_mode->enh_table;
+   const struct ast_vbios_enhtable *loop = vbios_mode->enh_table;
 
while (loop->refresh_rate != 0xff) {
if ((check_sync) &&
@@ -225,7 +225,7 @@ static void ast_set_std_reg(struct drm_crtc *crtc, struct drm_display_mode *mode
struct ast_vbios_mode_info *vbios_mode)
 {
struct ast_private *ast = crtc->dev->dev_private;
-   struct ast_vbios_stdtable *stdtable;
+   const struct ast_vbios_stdtable *stdtable;
u32 i;
u8 jreg;
 
@@ -381,7 +381,7 @@ static void ast_set_dclk_reg(struct drm_device *dev, struct drm_display_mode *mo
 struct ast_vbios_mode_info *vbios_mode)
 {
struct ast_private *ast = dev->dev_private;
-   struct ast_vbios_dclk_info *clk_info;
+   const struct ast_vbios_dclk_info *clk_info;
 
clk_info = &dclk_table[vbios_mode->enh_table->dclk_index];
 
diff --git a/drivers/gpu/drm/ast/ast_tables.h b/drivers/gpu/drm/ast/ast_tables.h
index 3608d5a..a4ddf90 100644
--- a/drivers/gpu/drm/ast/ast_tables.h
+++ b/drivers/gpu/drm/ast/ast_tables.h
@@ -78,37 +78,37 @@
 #define VCLK97_75  0x19
 #define VCLK118_25 0x1A
 
-static struct ast_vbios_dclk_info dclk_table[] = {
-   {0x2C, 0xE7, 0x03}, /* 00: VCLK25_175 */
-   {0x95, 0x62, 0x03}, /* 01: VCLK28_322 */
-   {0x67, 0x63, 0x01}, /* 02: VCLK31_5 */
-   {0x76, 0x63, 0x01}, /* 03: VCLK36 */
-   {0xEE, 0x67, 0x01}, /* 04: VCLK40 */
-   {0x82, 0x62, 0x01}, /* 05: VCLK49_5 */
-   {0xC6, 0x64, 0x01}, /* 06: VCLK50 */
-   {0x94, 0x62, 0x01}, /* 07: VCLK56_25 */
-   {0x80, 0x64, 0x00}, /* 08: VCLK65 */
-   {0x7B, 0x63, 0x00}, /* 09: VCLK75 */
-   {0x67, 0x62, 0x00}, /* 0A: VCLK78_75 */
-   {0x7C, 0x62, 0x00}, /* 0B: VCLK94_5 */
-   {0x8E, 0x62, 0x00}, /* 0C: VCLK108 */
-   {0x85, 0x24, 0x00}, /* 0D: VCLK135 */
-   {0x67, 0x22, 0x00}, /* 0E: VCLK157_5 */
-   {0x6A, 0x22, 0x00}, /* 0F: VCLK162 */
-   {0x4d, 0x4c, 0x80}, /* 10: VCLK154 */
-   {0xa7, 0x78, 0x80}, /* 11: VCLK83.5 */
-   {0x28, 0x49, 0x80}, /* 12: VCLK106.5 */
-   {0x37, 0x49, 0x80}, /* 13: VCLK146.25 */
-   {0x1f, 0x45, 0x80}, /* 14: VCLK148.5 */
-   {0x47,

[PATCH 09/12] drm/ast: Rename ast_init_dram_2300 to ast_post_chip_2300

2017-02-23 Thread Benjamin Herrenschmidt

The function does more than initialize the DRAM and in turn
calls other functions to do the actual init. This will keep
things more consistent with the upcoming AST2500 POST code.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_post.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index c55067c..561fd7d 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -31,7 +31,7 @@
 
 #include "ast_dram_tables.h"
 
-static void ast_init_dram_2300(struct drm_device *dev);
+static void ast_post_chip_2300(struct drm_device *dev);
 
 void ast_enable_vga(struct drm_device *dev)
 {
@@ -381,7 +381,7 @@ void ast_post_gpu(struct drm_device *dev)
 
if (ast->config_mode == ast_use_p2a) {
if (ast->chip == AST2300 || ast->chip == AST2400)
-   ast_init_dram_2300(dev);
+   ast_post_chip_2300(dev);
else
ast_init_dram_reg(dev);
 
@@ -1589,7 +1589,7 @@ static void ddr2_init(struct ast_private *ast, struct ast2300_dram_param *param)
 
 }
 
-static void ast_init_dram_2300(struct drm_device *dev)
+static void ast_post_chip_2300(struct drm_device *dev)
 {
struct ast_private *ast = dev->dev_private;
struct ast2300_dram_param param;
-- 
2.9.3

[PATCH 10/12] drm/ast: POST code for the new AST2500

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

This is used when the BMC isn't running any code and thus has
to be initialized by the host.

The code originates from Aspeed (Y.C. Chen) and has been cleaned
up for coding style purposes by BenH.

Signed-off-by: Y.C. Chen 
Signed-off-by: Benjamin Herrenschmidt 
--

v2. - Fix bug in ddr_test_2500 reported by Emil Velikov
- Rebase on updated mmc_test factoring patch
- Fix missing else statement in 2500 POST code
---
 drivers/gpu/drm/ast/ast_dram_tables.h |  62 +
 drivers/gpu/drm/ast/ast_post.c| 417 +-
 2 files changed, 476 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_dram_tables.h b/drivers/gpu/drm/ast/ast_dram_tables.h
index cc04539..1d9c4e7 100644
--- a/drivers/gpu/drm/ast/ast_dram_tables.h
+++ b/drivers/gpu/drm/ast/ast_dram_tables.h
@@ -141,4 +141,66 @@ static const struct ast_dramstruct ast2100_dram_table_data[] = {
{ 0x, 0x },
 };
 
+/*
+ * AST2500 DRAM settings modules
+ */
+#define REGTBL_NUM   17
+#define REGIDX_010   0
+#define REGIDX_014   1
+#define REGIDX_018   2
+#define REGIDX_020   3
+#define REGIDX_024   4
+#define REGIDX_02C   5
+#define REGIDX_030   6
+#define REGIDX_214   7
+#define REGIDX_2E0   8
+#define REGIDX_2E4   9
+#define REGIDX_2E8   10
+#define REGIDX_2EC   11
+#define REGIDX_2F0   12
+#define REGIDX_2F4   13
+#define REGIDX_2F8   14
+#define REGIDX_RFC   15
+#define REGIDX_PLL   16
+
+static const u32 ast2500_ddr3_1600_timing_table[REGTBL_NUM] = {
+   0x64604D38,  /* 0x010 */
+   0x29690599,  /* 0x014 */
+   0x0300,  /* 0x018 */
+   0x,  /* 0x020 */
+   0x,  /* 0x024 */
+   0x02181E70,  /* 0x02C */
+   0x0040,  /* 0x030 */
+   0x0024,  /* 0x214 */
+   0x02001300,  /* 0x2E0 */
+   0x0EA0,  /* 0x2E4 */
+   0x000E001B,  /* 0x2E8 */
+   0x35B8C105,  /* 0x2EC */
+   0x08090408,  /* 0x2F0 */
+   0x9B000800,  /* 0x2F4 */
+   0x0E400A00,  /* 0x2F8 */
+   0x9971452F,  /* tRFC  */
+   0x71C1   /* PLL   */
+};
+
+static const u32 ast2500_ddr4_1600_timing_table[REGTBL_NUM] = {
+   0x63604E37,  /* 0x010 */
+   0xE97AFA99,  /* 0x014 */
+   0x00019000,  /* 0x018 */
+   0x0800,  /* 0x020 */
+   0x0400,  /* 0x024 */
+   0x0410,  /* 0x02C */
+   0x0101,  /* 0x030 */
+   0x0024,  /* 0x214 */
+   0x03002900,  /* 0x2E0 */
+   0x0EA0,  /* 0x2E4 */
+   0x000E001C,  /* 0x2E8 */
+   0x35B8C106,  /* 0x2EC */
+   0x08080607,  /* 0x2F0 */
+   0x9B000900,  /* 0x2F4 */
+   0x0E400A00,  /* 0x2F8 */
+   0x99714545,  /* tRFC  */
+   0x71C1   /* PLL   */
+};
+
 #endif
diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index 561fd7d..c15f643 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -32,6 +32,7 @@
 #include "ast_dram_tables.h"
 
 static void ast_post_chip_2300(struct drm_device *dev);
+static void ast_post_chip_2500(struct drm_device *dev);
 
 void ast_enable_vga(struct drm_device *dev)
 {
@@ -82,7 +83,8 @@ ast_set_def_ext_reg(struct drm_device *dev)
for (i = 0x81; i <= 0x9f; i++)
ast_set_index_reg(ast, AST_IO_CRTC_PORT, i, 0x00);
 
-   if (ast->chip == AST2300 || ast->chip == AST2400) {
+   if (ast->chip == AST2300 || ast->chip == AST2400 ||
+   ast->chip == AST2500) {
if (dev->pdev->revision >= 0x20)
ext_reg_info = extreginfo_ast2300;
else
@@ -106,7 +108,8 @@ ast_set_def_ext_reg(struct drm_device *dev)
 
/* Enable RAMDAC for A1 */
reg = 0x04;
-   if (ast->chip == AST2300 || ast->chip == AST2400)
+   if (ast->chip == AST2300 || ast->chip == AST2400 ||
+   ast->chip == AST2500)
reg |= 0x20;
ast_set_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xb6, 0xff, reg);
 }
@@ -380,7 +383,9 @@ void ast_post_gpu(struct drm_device *dev)
ast_set_def_ext_reg(dev);
 
if (ast->config_mode == ast_use_p2a) {
-   if (ast->chip == AST2300 || ast->chip == AST2400)
+   if (ast->chip == AST2500)
+   ast_post_chip_2500(dev);
+   else if (ast->chip == AST2300 || ast->chip

[PATCH 12/12] drm/ast: Call open_key before enable_mmio in POST code

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

open_key enables access to the registers used by enable_mmio.

Signed-off-by: Y.C. Chen 
Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_post.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index a5a7809..f7d4213 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -374,8 +374,8 @@ void ast_post_gpu(struct drm_device *dev)
pci_write_config_dword(ast->dev->pdev, 0x04, reg);
 
ast_enable_vga(dev);
-   ast_enable_mmio(dev);
ast_open_key(ast);
+   ast_enable_mmio(dev);
ast_set_def_ext_reg(dev);
 
if (ast->config_mode == ast_use_p2a) {
-- 
2.9.3

[PATCH 11/12] drm/ast: Fix test for VGA enabled

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

(Get better description from Aspeed)

Signed-off-by: Y.C. Chen 
Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_post.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index c15f643..a5a7809 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -59,13 +59,9 @@ bool ast_is_vga_enabled(struct drm_device *dev)
/* TODO 1180 */
} else {
ch = ast_io_read8(ast, AST_IO_VGA_ENABLE_PORT);
-   if (ch) {
-   ast_open_key(ast);
-   ch = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xb6, 0xff);
-   return ch & 0x04;
-   }
+   return !!(ch & 0x01);
}
-   return 0;
+   return false;
 }
 
 static const u8 extreginfo[] = { 0x0f, 0x04, 0x1c, 0xff };
-- 
2.9.3

[PATCH 08/12] drm/ast: Factor mmc_test code in POST code

2017-02-23 Thread Benjamin Herrenschmidt

There's some duplication for what are essentially copies of
two loops, so factor it out. The upcoming AST2500 POST code adds
more of them. Also clean up the return types of the test functions:
most of them return a boolean, some return a u32.

Signed-off-by: Benjamin Herrenschmidt 
--

v2. - Keep the split between the "test" and "test2" functions
  as they have a different exit condition in the loop and
  a different return type.
- Fix the return types across the call chain
---
 drivers/gpu/drm/ast/ast_post.c | 82 --
 1 file changed, 31 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index e802450..c55067c 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -445,85 +445,65 @@ static const u32 pattern[8] = {
0x7C61D253
 };
 
-static int mmc_test_burst(struct ast_private *ast, u32 datagen)
+static bool mmc_test(struct ast_private *ast, u32 datagen, u8 test_ctl)
 {
u32 data, timeout;
 
ast_moutdwm(ast, 0x1e6e0070, 0x);
-   ast_moutdwm(ast, 0x1e6e0070, 0x00c1 | (datagen << 3));
+   ast_moutdwm(ast, 0x1e6e0070, (datagen << 3) | test_ctl);
timeout = 0;
do {
data = ast_mindwm(ast, 0x1e6e0070) & 0x3000;
-   if (data & 0x2000) {
-   return 0;
-   }
+   if (data & 0x2000)
+   return false;
if (++timeout > TIMEOUT) {
ast_moutdwm(ast, 0x1e6e0070, 0x);
-   return 0;
+   return false;
}
} while (!data);
-   ast_moutdwm(ast, 0x1e6e0070, 0x);
-   return 1;
+   ast_moutdwm(ast, 0x1e6e0070, 0x0);
+   return true;
 }
 
-static int mmc_test_burst2(struct ast_private *ast, u32 datagen)
+static u32 mmc_test2(struct ast_private *ast, u32 datagen, u8 test_ctl)
 {
u32 data, timeout;
 
ast_moutdwm(ast, 0x1e6e0070, 0x);
-   ast_moutdwm(ast, 0x1e6e0070, 0x0041 | (datagen << 3));
+   ast_moutdwm(ast, 0x1e6e0070, (datagen << 3) | test_ctl);
timeout = 0;
do {
data = ast_mindwm(ast, 0x1e6e0070) & 0x1000;
if (++timeout > TIMEOUT) {
ast_moutdwm(ast, 0x1e6e0070, 0x0);
-   return -1;
+   return 0x;
}
} while (!data);
data = ast_mindwm(ast, 0x1e6e0078);
data = (data | (data >> 16)) & 0x;
-   ast_moutdwm(ast, 0x1e6e0070, 0x0);
+   ast_moutdwm(ast, 0x1e6e0070, 0x);
return data;
 }
 
-static int mmc_test_single(struct ast_private *ast, u32 datagen)
+
+static bool mmc_test_burst(struct ast_private *ast, u32 datagen)
 {
-   u32 data, timeout;
+   return mmc_test(ast, datagen, 0xc1);
+}
 
-   ast_moutdwm(ast, 0x1e6e0070, 0x);
-   ast_moutdwm(ast, 0x1e6e0070, 0x00c5 | (datagen << 3));
-   timeout = 0;
-   do {
-   data = ast_mindwm(ast, 0x1e6e0070) & 0x3000;
-   if (data & 0x2000)
-   return 0;
-   if (++timeout > TIMEOUT) {
-   ast_moutdwm(ast, 0x1e6e0070, 0x0);
-   return 0;
-   }
-   } while (!data);
-   ast_moutdwm(ast, 0x1e6e0070, 0x0);
-   return 1;
+static u32 mmc_test_burst2(struct ast_private *ast, u32 datagen)
+{
+   return mmc_test2(ast, datagen, 0x41);
 }
 
-static int mmc_test_single2(struct ast_private *ast, u32 datagen)
+static bool mmc_test_single(struct ast_private *ast, u32 datagen)
 {
-   u32 data, timeout;
+   return mmc_test(ast, datagen, 0xc5);
+}
 
-   ast_moutdwm(ast, 0x1e6e0070, 0x);
-   ast_moutdwm(ast, 0x1e6e0070, 0x0005 | (datagen << 3));
-   timeout = 0;
-   do {
-   data = ast_mindwm(ast, 0x1e6e0070) & 0x1000;
-   if (++timeout > TIMEOUT) {
-   ast_moutdwm(ast, 0x1e6e0070, 0x0);
-   return -1;
-   }
-   } while (!data);
-   data = ast_mindwm(ast, 0x1e6e0078);
-   data = (data | (data >> 16)) & 0x;
-   ast_moutdwm(ast, 0x1e6e0070, 0x0);
-   return data;
+static u32 mmc_test_single2(struct ast_private *ast, u32 datagen)
+{
+   return mmc_test2(ast, datagen, 0x05);
 }
 
 static int cbr_test(struct ast_private *ast)
@@ -601,16 +581,16 @@ static u32 cbr_scan2(struct ast_private *ast)
return data2;
 }
 
-static u32 cbr_test3(struct ast_private *ast)
+static bool cbr_test3(struct ast_private *ast)
 {
if (!mmc_test_burst(ast, 0))
-   return 0;
+   return false;
if (!mmc_test_single(ast, 0))
-   return 0;
-   return 1;
+   return false;
+   return true;
 }
 
-static u32 cbr_scan3(struct ast_private *ast)
+static bool c
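For readers who want to see the factoring in isolation, here is a minimal
standalone sketch of the same idea: one shared polling loop, with the test
mode passed in as test_ctl. The 0x3000/0x2000 masks and the 0xc1/0xc5 control
values come from the diff above; struct fake_mmc, fake_read(), fake_write()
and POLL_LIMIT are hypothetical stand-ins for the real register accessors and
the driver's TIMEOUT, so this is an illustration rather than the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_LIMIT 100u	/* hypothetical stand-in for the driver's TIMEOUT */

/* Hypothetical model of the test-control register, for illustration only. */
struct fake_mmc {
	uint32_t last_cmd;		/* last non-zero command written */
	unsigned int busy_polls;	/* reads that still report "not done" */
	uint32_t final_status;		/* status returned once the test ends */
};

static void fake_write(struct fake_mmc *m, uint32_t val)
{
	if (val)
		m->last_cmd = val;
}

static uint32_t fake_read(struct fake_mmc *m)
{
	if (m->busy_polls) {
		m->busy_polls--;
		return 0;
	}
	return m->final_status;
}

/* One shared polling loop; the caller selects the mode through test_ctl. */
static bool mmc_test(struct fake_mmc *m, uint32_t datagen, uint8_t test_ctl)
{
	uint32_t data, timeout = 0;

	fake_write(m, 0);
	fake_write(m, (datagen << 3) | test_ctl);
	do {
		data = fake_read(m) & 0x3000;
		if (data & 0x2000)	/* treated as a failed test */
			return false;
		if (++timeout > POLL_LIMIT) {
			fake_write(m, 0);
			return false;
		}
	} while (!data);
	fake_write(m, 0);
	return true;
}

/* The former copies become one-line wrappers. */
static bool mmc_test_burst(struct fake_mmc *m, uint32_t datagen)
{
	return mmc_test(m, datagen, 0xc1);
}

static bool mmc_test_single(struct fake_mmc *m, uint32_t datagen)
{
	return mmc_test(m, datagen, 0xc5);
}
```

The "test2" variants would follow the same pattern with their own loop, since
they poll a different bit and return a u32 rather than a boolean.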

[PATCH 06/12] drm/ast: Base support for AST2500

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

Add detection and mode setting updates for AST2500 generation chip,
code originally from Aspeed and slightly reworked for coding style
mostly by Ben. This doesn't contain the BMC DRAM POST code which
is in a separate patch.

Signed-off-by: Y.C. Chen 
Signed-off-by: Benjamin Herrenschmidt 
---

v2. Add 800Mhz default mclk for AST2500
---
 drivers/gpu/drm/ast/ast_drv.h|  2 ++
 drivers/gpu/drm/ast/ast_main.c   | 32 +++---
 drivers/gpu/drm/ast/ast_mode.c   | 30 -
 drivers/gpu/drm/ast/ast_tables.h | 58 +---
 4 files changed, 103 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_drv.h b/drivers/gpu/drm/ast/ast_drv.h
index 3fd9d6e..d1c1d53 100644
--- a/drivers/gpu/drm/ast/ast_drv.h
+++ b/drivers/gpu/drm/ast/ast_drv.h
@@ -64,6 +64,7 @@ enum ast_chip {
AST2150,
AST2300,
AST2400,
+   AST2500,
AST1180,
 };
 
@@ -80,6 +81,7 @@ enum ast_tx_chip {
 #define AST_DRAM_1Gx32   3
 #define AST_DRAM_2Gx16   6
 #define AST_DRAM_4Gx16   7
+#define AST_DRAM_8Gx16   8
 
 struct ast_fbdev;
 
diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index d194af3..f19669f 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -141,7 +141,10 @@ static int ast_detect_chip(struct drm_device *dev, bool *need_post)
ast->chip = AST1100;
DRM_INFO("AST 1180 detected\n");
} else {
-   if (dev->pdev->revision >= 0x30) {
+   if (dev->pdev->revision >= 0x40) {
+   ast->chip = AST2500;
+   DRM_INFO("AST 2500 detected\n");
+   } else if (dev->pdev->revision >= 0x30) {
ast->chip = AST2400;
DRM_INFO("AST 2400 detected\n");
} else if (dev->pdev->revision >= 0x20) {
@@ -195,6 +198,9 @@ static int ast_detect_chip(struct drm_device *dev, bool *need_post)
if (ast->chip == AST2400 &&
(scu_rev & 0x300) == 0x100) /* ast1400 */
ast->support_wide_screen = true;
+   if (ast->chip == AST2500 &&
+   scu_rev == 0x100)   /* ast2510 */
+   ast->support_wide_screen = true;
}
break;
}
@@ -289,7 +295,10 @@ static int ast_get_dram_info(struct drm_device *dev)
default:
ast->dram_bus_width = 16;
ast->dram_type = AST_DRAM_1Gx16;
-   ast->mclk = 396;
+   if (ast->chip == AST2500)
+   ast->mclk = 800;
+   else
+   ast->mclk = 396;
return 0;
}
 
@@ -298,7 +307,23 @@ static int ast_get_dram_info(struct drm_device *dev)
else
ast->dram_bus_width = 32;
 
-   if (ast->chip == AST2300 || ast->chip == AST2400) {
+   if (ast->chip == AST2500) {
+   switch (mcr_cfg & 0x03) {
+   case 0:
+   ast->dram_type = AST_DRAM_1Gx16;
+   break;
+   default:
+   case 1:
+   ast->dram_type = AST_DRAM_2Gx16;
+   break;
+   case 2:
+   ast->dram_type = AST_DRAM_4Gx16;
+   break;
+   case 3:
+   ast->dram_type = AST_DRAM_8Gx16;
+   break;
+   }
+   } else if (ast->chip == AST2300 || ast->chip == AST2400) {
switch (mcr_cfg & 0x03) {
case 0:
ast->dram_type = AST_DRAM_512Mx16;
@@ -521,6 +546,7 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
ast->chip == AST2200 ||
ast->chip == AST2300 ||
ast->chip == AST2400 ||
+   ast->chip == AST2500 ||
ast->chip == AST1180) {
dev->mode_config.max_width = 1920;
dev->mode_config.max_height = 2048;
diff --git a/drivers/gpu/drm/ast/ast_mode.c b/drivers/gpu/drm/ast/ast_mode.c
index 1ff596e..e4db1c72 100644
--- a/drivers/gpu/drm/ast/ast_mode.c
+++ b/drivers/gpu/drm/ast/ast_mode.c
@@ -271,7 +271,11 @@ static void ast_set_crtc_reg(struct drm_crtc *crtc, struct drm_display_mode *mod
 {
struct ast_private *ast = crtc->dev->dev_private;
u8 jreg05 = 0, jreg07 = 0, jreg09 = 0, jregAC = 0, jregAD = 0, jregAE = 0;
-   u16 temp;
+   u16 temp, precache = 0;
+
+   if ((ast->chip == AST2500) &&
+   (vbios_mode->enh_table->flags & AST2500PreCatchCRT))
+   precache = 40;
 
ast_set_index_reg_mask(ast, AST_IO_CRTC_PORT, 0x11, 0x7f, 0x00);
 
@@ -297,12 +301,12 @@ static void ast_set_crtc_reg(struct drm_crtc *crtc, struct drm_display_mode *mod
jregAD |= 0x01;  /* HBE D
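As a compact summary of the detection logic this patch adds, the sketch
below restates two of the decisions from the ast_main.c hunks: the PCI
revision thresholds (0x40 for AST2500, 0x30 for AST2400) and the AST2500
DRAM type decode from the low two bits of mcr_cfg. The enum, the function
names and the idea of expressing the DRAM type as a density in Gbit are
illustrative assumptions, not driver API.

```c
#include <stdint.h>

/* Hypothetical enum; the driver uses enum ast_chip instead. */
enum chip_gen { CHIP_OLDER, CHIP_AST2400, CHIP_AST2500 };

/* Newest generation is tested first, as in the patched ast_detect_chip(). */
static enum chip_gen chip_from_revision(uint8_t revision)
{
	if (revision >= 0x40)
		return CHIP_AST2500;
	if (revision >= 0x30)
		return CHIP_AST2400;
	return CHIP_OLDER;	/* remaining branches are outside the hunk shown */
}

/*
 * AST2500 decode: values 0..3 select the 1G/2G/4G/8G x16 types, which
 * (reading the AST_DRAM_nGx16 names as n Gbit, an assumption) is 1 << n.
 */
static unsigned int ast2500_dram_density_gbit(uint32_t mcr_cfg)
{
	return 1u << (mcr_cfg & 0x03);
}
```

Because the mask is 0x03, all four values are covered by explicit cases in
the patch, so the "default" label there only documents the fallback.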

Re: [PATCH 11/12] drm/ast: Fix test for VGA enabled

2017-02-23 Thread Benjamin Herrenschmidt

On Fri, 2017-02-24 at 09:53 +1100, Benjamin Herrenschmidt wrote:
> From: "Y.C. Chen" 
> 
> (Get better description from Aspeed)

And this should have been:

<<
The test to see if VGA was already enabled is doing an unnecessary
second test from a register that may or may not have been initialized
to a valid value. Remove it.
>>

If you prefer you can find the whole thing (already fixed up)
at g...@github.com:ozbenh/linux-ast.git

Cheers,
Ben.

> Signed-off-by: Y.C. Chen 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  drivers/gpu/drm/ast/ast_post.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ast/ast_post.c
> b/drivers/gpu/drm/ast/ast_post.c
> index c15f643..a5a7809 100644
> --- a/drivers/gpu/drm/ast/ast_post.c
> +++ b/drivers/gpu/drm/ast/ast_post.c
> @@ -59,13 +59,9 @@ bool ast_is_vga_enabled(struct drm_device *dev)
>   /* TODO 1180 */
>   } else {
>   ch = ast_io_read8(ast, AST_IO_VGA_ENABLE_PORT);
> - if (ch) {
> - ast_open_key(ast);
> - ch = ast_get_index_reg_mask(ast,
> AST_IO_CRTC_PORT, 0xb6, 0xff);
> - return ch & 0x04;
> - }
> + return !!(ch & 0x01);
>   }
> - return 0;
> + return false;
>  }
>  
>  static const u8 extreginfo[] = { 0x0f, 0x04, 0x1c, 0xff };

Re: [PATCH 01/12] drm/ast: Fix AST2400 POST failure without BMC FW or VBIOS

2017-02-23 Thread Benjamin Herrenschmidt

Note: The whole series with the fixed cset comment for patch 11
can be also found there:

https://github.com/ozbenh/linux-ast/commits/master

Cheers,
Ben.

[PATCH 04/12] drm/ast: Remove spurious include

2017-02-23 Thread Benjamin Herrenschmidt

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 36932a3..718c15b 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -32,8 +32,6 @@
 #include 
 #include 
 
-#include "ast_dram_tables.h"
-
 void ast_set_index_reg_mask(struct ast_private *ast,
uint32_t base, uint8_t index,
uint8_t mask, uint8_t val)
-- 
2.9.3

[PATCH 05/12] drm/ast: Fix calculation of MCLK

2017-02-23 Thread Benjamin Herrenschmidt

Some braces were missing causing an incorrect calculation.

Y.C. Chen from Aspeed provided me with the right formula
which I tested on AST2400 and 2500.

The MCLK isn't currently used by the driver (it will eventually be used
to filter modes) so the issue isn't catastrophic.

Also make the printed value a bit more meaningful.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/gpu/drm/ast/ast_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 718c15b..d194af3 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -352,7 +352,7 @@ static int ast_get_dram_info(struct drm_device *dev)
div = 0x1;
break;
}
-   ast->mclk = ref_pll * (num + 2) / (denum + 2) * (div * 1000);
+   ast->mclk = ref_pll * (num + 2) / ((denum + 2) * (div * 1000));
return 0;
 }
 
@@ -496,7 +496,9 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
if (ret)
goto out_free;
ast->vram_size = ast_get_vram_info(dev);
-   DRM_INFO("dram %d %d %d %08x\n", ast->mclk, ast->dram_type, ast->dram_bus_width, ast->vram_size);
+   DRM_INFO("dram MCLK=%u Mhz type=%d bus_width=%d size=%08x\n",
+ast->mclk, ast->dram_type,
+ast->dram_bus_width, ast->vram_size);
}
 
if (need_post)
-- 
2.9.3
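To see why the missing parentheses matter, here is a minimal standalone
sketch of the two expressions from the first hunk. In C, '/' and '*' have
equal precedence and associate left to right, so the old form multiplies by
(div * 1000) instead of dividing by it. The function names and the sample
PLL inputs are made up for illustration and are not read from hardware.

```c
#include <stdint.h>

/* Old expression: divide by (denum + 2), then multiply by (div * 1000). */
static uint32_t mclk_old(uint32_t ref_pll, uint32_t num,
			 uint32_t denum, uint32_t div)
{
	return ref_pll * (num + 2) / (denum + 2) * (div * 1000);
}

/* Fixed expression: the whole product is the divisor. */
static uint32_t mclk_fixed(uint32_t ref_pll, uint32_t num,
			   uint32_t denum, uint32_t div)
{
	return ref_pll * (num + 2) / ((denum + 2) * (div * 1000));
}
```

With the hypothetical inputs ref_pll = 24000, num = 64, denum = 1 and
div = 1, mclk_old() returns 528000000 while mclk_fixed() returns 528, a
value that makes sense next to the "Mhz" label in the new DRM_INFO line.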

[PATCH 07/12] drm/ast: Fixed vram size incorrect issue on POWER

2017-02-23 Thread Benjamin Herrenschmidt

From: "Y.C. Chen" 

The default value of the VGA scratch registers may be incorrect.
Initialize the hardware before getting the VRAM info.

Signed-off-by: Y.C. Chen 
---
 drivers/gpu/drm/ast/ast_main.c | 6 +++---
 drivers/gpu/drm/ast/ast_post.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index f19669f..8684f3c 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -516,6 +516,9 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
 
ast_detect_chip(dev, &need_post);
 
+   if (need_post)
+   ast_post_gpu(dev);
+
if (ast->chip != AST1180) {
ret = ast_get_dram_info(dev);
if (ret)
@@ -526,9 +529,6 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
 ast->dram_bus_width, ast->vram_size);
}
 
-   if (need_post)
-   ast_post_gpu(dev);
-
ret = ast_mm_init(ast);
if (ret)
goto out_free;
diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index 64549ce..e802450 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -79,7 +79,7 @@ ast_set_def_ext_reg(struct drm_device *dev)
const u8 *ext_reg_info;
 
/* reset scratch */
-   for (i = 0x81; i <= 0x8f; i++)
+   for (i = 0x81; i <= 0x9f; i++)
ast_set_index_reg(ast, AST_IO_CRTC_PORT, i, 0x00);
 
if (ast->chip == AST2300 || ast->chip == AST2400) {
-- 
2.9.3

Re: [PATCH] powerpc/xics: Adjust interrupt receive priority for offline cpus

2017-02-23 Thread Michael Neuling

On Thu, 2017-02-23 at 16:24 +0530, Vaidyanathan Srinivasan wrote:
> Offline CPUs need to receive IPIs through XIVE when they are
> in stop state and wakeup from that state.
> 
> Reduce interrupt receive priority in order to receive XIVE
> wakeup interrupts when in offline state.
> 
> LOWEST_PRIORITY would allow all interrupts to be delivered
> as wakeup events.

This needs to be expanded to explain why "DEFAULT" doesn't work in this case.

This also needs an explicit statement that "It fixes onlining of CPUs on
POWER9".  I'd even advocate for making that the patch subject. 

Also if it's the right fix, it needs a cc:stable.

Mikey


> Signed-off-by: Vaidyanathan Srinivasan 
> ---
>  arch/powerpc/sysdev/xics/xics-common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/sysdev/xics/xics-common.c
> b/arch/powerpc/sysdev/xics/xics-common.c
> index 69d858e..c674a9d 100644
> --- a/arch/powerpc/sysdev/xics/xics-common.c
> +++ b/arch/powerpc/sysdev/xics/xics-common.c
> @@ -199,7 +199,7 @@ void xics_migrate_irqs_away(void)
>   xics_set_cpu_giq(xics_default_distrib_server, 0);
>  
>   /* Allow IPIs again... */
> - icp_ops->set_priority(DEFAULT_PRIORITY);
> + icp_ops->set_priority(LOWEST_PRIORITY);
>  
>   for_each_irq_desc(virq, desc) {
>   struct irq_chip *chip;

Re: [PATCH] powerpc/xics: Adjust interrupt receive priority for offline cpus

2017-02-23 Thread Balbir Singh

On Thu, Feb 23, 2017 at 9:54 PM, Vaidyanathan Srinivasan
 wrote:
> Offline CPUs need to receive IPIs through XIVE when they are
> in stop state and wakeup from that state.
>
> Reduce interrupt receive priority in order to receive XIVE
> wakeup interrupts when in offline state.
>
> LOWEST_PRIORITY would allow all interrupts to be delivered
> as wakeup events.
>
> Signed-off-by: Vaidyanathan Srinivasan 
> ---
>  arch/powerpc/sysdev/xics/xics-common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
> b/arch/powerpc/sysdev/xics/xics-common.c
> index 69d858e..c674a9d 100644
> --- a/arch/powerpc/sysdev/xics/xics-common.c
> +++ b/arch/powerpc/sysdev/xics/xics-common.c
> @@ -199,7 +199,7 @@ void xics_migrate_irqs_away(void)
> xics_set_cpu_giq(xics_default_distrib_server, 0);
>
> /* Allow IPIs again... */
> -   icp_ops->set_priority(DEFAULT_PRIORITY);
> +   icp_ops->set_priority(LOWEST_PRIORITY);
>

Aren't IPIs at a higher priority than DEFAULT_PRIORITY? Like Mikey said,
I am not sure what is broken with the current implementation. Is this
true for all icp_ops? I presume you are using icp_opal. I suspect you'll
need to look at:

1. XIVE to see if EMULATION_PRIO is the issue
2. Check if only icp_opal is impacted

Balbir Singh.
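The question above can be made concrete with a tiny model of the priority
check. It assumes the usual XICS convention that numerically lower values
are more favoured and that an interrupt is presented only when its priority
is more favoured than the CPU's current priority (CPPR). The numeric values
of the three constants are assumptions believed to match asm/xics.h and
should be checked against the tree; irq_presented() is a made-up helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against arch/powerpc/include/asm/xics.h. */
#define IPI_PRIORITY		4
#define DEFAULT_PRIORITY	5
#define LOWEST_PRIORITY		0xff

/* Lower number = more favoured; equal priority is not presented. */
static bool irq_presented(uint8_t irq_priority, uint8_t cppr)
{
	return irq_priority < cppr;
}
```

Under this model an IPI at priority 4 already gets through with the CPPR at
DEFAULT_PRIORITY, while a source at DEFAULT_PRIORITY only gets through once
the CPPR is lowered to LOWEST_PRIORITY. That is why the changelog needs to
say which interrupt source is actually being blocked.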

Re: [PATCH 0/2] Allow configurable stack size (especially 32k on PPC64)

2017-02-23 Thread Hamish Martin

On 02/22/2017 07:25 PM, Michael Ellerman wrote:
> Hamish Martin  writes:
>> This patch series adds the ability to configure the THREAD_SHIFT value and
>> thereby alter the stack size on powerpc systems. We are particularly 
>> interested
>> in configuring for a 32k stack on PPC64.
> ...
>>
>> For instance for a 70 frame stack, the architecture overhead just for the 
>> stack
>> frames is:
>>70 * 16 bytes = 1120 bytes for PPC32, and
>>70 * 112 bytes = 7840 bytes for PPC64.
>> So a simple doubling of the PPC32 stack size leaves us with a shortfall of 
>> 5600
>> bytes (7840 - (2 * 1120)). In the example the stack frame overhead for PPC32 
>> is
>> 1120/8192 = 13.7% of the stack space, whereas for PPC64 it is 7840/16384 =
>> 47.8% of the space.
>>
>> The aim of this series is to provide the ability for users to configure for
>> larger stacks without altering the defaults in a way that would impact 
>> existing
>> users. However, given the inequity between the PPC32 and PPC64 stacks when
>> taking into account the respective minimum stack frame sizes, we believe
>> consideration should be given to having a large default. We would appreciate
>> any input or opinions on this issue.
>
> Thanks for the detailed explanation.
>
> The patches look fine, so I don't see any reason why we wouldn't merge
> this. I might make the config option depend on EXPERT, but that's just
> cosmetic.
>
>
> You're right about the difference in stack overhead between 32 & 64-bit.
> But I guess on the other hand we've been using 16K stacks on 64-bit for
> over 15 years, and although we have had some reports of stack overflow
> they're not a common problem.
>
> cheers
>
Yes, 15 years testing is hard to argue against, but in our case we feel 
we have a stack that is reasonable and would cause no problems on PPC32. 
This seems like a good compromise.
I think I was most keen to hear from you guys about whether it was a 
flat out crazy idea, or if it would open up a huge can of hidden worms. 
From your response that seems not to be the case.

Thanks for the input, Michael. I'll add the EXPERT dependency and resubmit.
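For reference, the stack-frame arithmetic from the cover letter quoted above
can be reproduced with a few lines of C. The minimum frame sizes (16 and 112
bytes) and the 70-frame depth are the figures from the cover letter; the
helper names are made up for this sketch.

```c
/* Minimum stack frame sizes quoted in the cover letter. */
#define PPC32_MIN_FRAME	16u
#define PPC64_MIN_FRAME	112u

/* Bytes of pure frame overhead for a call chain of 'frames' frames. */
static unsigned int frame_overhead(unsigned int frames, unsigned int min_frame)
{
	return frames * min_frame;
}

/* Overhead as a share of the stack, in tenths of a percent, rounded. */
static unsigned int overhead_permille(unsigned int overhead,
				      unsigned int stack_size)
{
	return (overhead * 1000u + stack_size / 2) / stack_size;
}
```

For 70 frames this gives 1120 bytes on PPC32 and 7840 bytes on PPC64, that
is roughly 13.7% of an 8k stack versus 47.9% of a 16k stack. With a 32k
stack the PPC64 share drops to about 23.9%.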

Re: [PATCH 0/2] Allow configurable stack size (especially 32k on PPC64)

2017-02-23 Thread Hamish Martin

On 02/22/2017 09:06 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2017-02-22 at 17:25 +1100, Michael Ellerman wrote:
>>
>> Thanks for the detailed explanation.
>>
>> The patches look fine, so I don't see any reason why we wouldn't merge
>> this. I might make the config option depend on EXPERT, but that's just
>> cosmetic.
>>
>>
>> You're right about the difference in stack overhead between 32 & 64-bit.
>> But I guess on the other hand we've been using 16K stacks on 64-bit for
>> over 15 years, and although we have had some reports of stack overflow
>> they're not a common problem.
>
> Right and in fact I wonder if we could generally help this for cases
> like this one (lots of stacked devices) by having the generic driver
> core break those chains by deferring new device registration to a work
> queue or kthread.
>
> That would help in a lot of cases. We do get some stupid deep chains
> in cases of many bus encapsulation.
>
> Cheers,
> Ben.
>
>
Agreed. In fact you suggested it back in 2008 somewhere in the thread I 
linked to in my cover letter for the series. That's obviously going to 
take a long time to happen, so for now I hope this series will allow 
some respite for users affected.

Thanks for your input.

[PATCH v2 0/2] Allow configurable stack size (especially 32k on PPC64)

2017-02-23 Thread Hamish Martin

This patch series adds the ability to configure the THREAD_SHIFT value and
thereby alter the stack size on powerpc systems. We are particularly interested
in configuring for a 32k stack on PPC64.

Using an NXP T2081 (e6500 PPC64 cores) we are observing stack overflows as a
result of applying a DTS overlay containing some I2C devices. Our scenario is
an ethernet switch chassis with plug-in cards. The I2C is driven from the T2081
through a PCA9548 mux on the main board. When we detect insertion of the plugin
card we schedule work for a call to of_overlay_create() to install a DTS
overlay for the plugin board. This DTS overlay contains a further PCA9548 mux
with more devices hanging off it including a PCA9539 GPIO expander. The
ultimate installed I2C tree is:

T2081 --- PCA9548 MUX --- PCA9548 MUX --- PCA9539 GPIO Expander

When we install the overlay the devices described in the overlay are probed and
we see a large number of stack frames used as a result. If this is coupled with
an interrupt happening that requires moderate to high stack use we observe
stack corruption. Here is an example long stack (from a 4.10-rc8 kernel) that
does not show corruption but does demonstrate the length and frame sizes
involved.

     Depth Size   Location    (72 entries)
     ----- ----   --------
  0)13872 128   .__raise_softirq_irqoff+0x1c/0x130
  1)13744 144   .raise_softirq+0x30/0x70
  2)13600 112   .invoke_rcu_core+0x54/0x70
  3)13488 336   .rcu_check_callbacks+0x294/0xde0
  4)13152 128   .update_process_times+0x40/0x90
  5)13024 144   .tick_sched_handle.isra.16+0x40/0xb0
  6)12880 144   .tick_sched_timer+0x6c/0xe0
  7)12736 272   .__hrtimer_run_queues+0x1a0/0x4b0
  8)12464 208   .hrtimer_interrupt+0xe8/0x2a0
  9)12256 160   .__timer_interrupt+0xdc/0x330
 10)12096 160   .timer_interrupt+0x138/0x190
 11)11936 752   exc_0x900_common+0xe0/0xe4
 12)11184 128   .ftrace_ops_no_ops+0x11c/0x230
 13)11056 176   .ftrace_ops_test.isra.12+0x30/0x50
 14)10880 160   .ftrace_ops_no_ops+0xd4/0x230
 15)10720 112   ftrace_call+0x4/0x8
 16)10608 176   .lock_timer_base+0x3c/0xf0
 17)10432 144   .try_to_del_timer_sync+0x2c/0x90
 18)10288 128   .del_timer_sync+0x60/0x80
 19)10160 256   .schedule_timeout+0x1fc/0x490
 20) 9904 208   .i2c_wait+0x238/0x290
 21) 9696 256   .mpc_xfer+0x4e4/0x570
 22) 9440 208   .__i2c_transfer+0x158/0x6d0
 23) 9232 192   .pca954x_reg_write+0x70/0x110
 24) 9040 160   .__i2c_mux_master_xfer+0xb4/0xf0
 25) 8880 208   .__i2c_transfer+0x158/0x6d0
 26) 8672 192   .pca954x_reg_write+0x70/0x110
 27) 8480 144   .pca954x_select_chan+0x68/0xa0
 28) 8336 160   .__i2c_mux_master_xfer+0x64/0xf0
 29) 8176 208   .__i2c_transfer+0x158/0x6d0
 30) 7968 144   .i2c_transfer+0x98/0x130
 31) 7824 320   .i2c_smbus_xfer_emulated+0x168/0x600
 32) 7504 208   .i2c_smbus_xfer+0x1c0/0x5d0
 33) 7296 192   .i2c_smbus_write_byte_data+0x50/0x70
 34) 7104 144   .pca953x_write_single+0x6c/0xe0
 35) 6960 192   .pca953x_gpio_direction_output+0xa4/0x160
 36) 6768 160   ._gpiod_direction_output_raw+0xec/0x460
 37) 6608 160   .gpiod_hog+0x98/0x250
 38) 6448 176   .of_gpiochip_add+0xdc/0x1c0
 39) 6272 256   .gpiochip_add_data+0x4f4/0x8c0
 40) 6016 144   .devm_gpiochip_add_data+0x64/0xf0
 41) 5872 208   .pca953x_probe+0x2b4/0x5f0
 42) 5664 144   .i2c_device_probe+0x224/0x2e0
 43) 5520 160   .really_probe+0x244/0x380
 44) 5360 160   .bus_for_each_drv+0x94/0x100
 45) 5200 160   .__device_attach+0x118/0x160
 46) 5040 144   .bus_probe_device+0xe8/0x100
 47) 4896 208   .device_add+0x500/0x6c0
 48) 4688 144   .i2c_new_device+0x1f8/0x240
 49) 4544 256   .of_i2c_register_device+0x160/0x280
 50) 4288 192   .i2c_register_adapter+0x238/0x630
 51) 4096 208   .i2c_mux_add_adapter+0x3f8/0x540
 52) 3888 192   .pca954x_probe+0x234/0x370
 53) 3696 144   .i2c_device_probe+0x224/0x2e0
 54) 3552 160   .really_probe+0x244/0x380
 55) 3392 160   .bus_for_each_drv+0x94/0x100
 56) 3232 160   .__device_attach+0x118/0x160
 57) 3072 144   .bus_probe_device+0xe8/0x100
 58) 2928 208   .device_add+0x500/0x6c0
 59) 2720 144   .i2c_new_device+0x1f8/0x240
 60) 2576 256   .of_i2c_register_device+0x160/0x280
 61) 2320 144   .of_i2c_notify+0x12c/0x1d0
 62) 2176 160   .notifier_call_chain+0x8c/0x100
 63) 2016 160   .__blocking_notifier_call_chain+0x6c/0xe0
 64) 1856 208   .__of_changeset_entry_notify+0xd8/0x140
 65) 1648 192   .__of_changeset_apply+0x7c/0x100
 66) 1456 272   .of_overlay_create+0x2e0/0x4b0
 67) 1184 128   .xem2_install_overlay+0x40/

[PATCH v2 1/2] powerpc: Move THREAD_SHIFT config to KConfig

2017-02-23 Thread Hamish Martin

Move the logic for defining THREAD_SHIFT to Kconfig in order to
allow override by users.

Signed-off-by: Hamish Martin 
Reviewed-by: Chris Packham 
---
 arch/powerpc/Kconfig   | 10 ++
 arch/powerpc/include/asm/thread_info.h | 10 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 281f4f1fcd1f..cd4dd9354b3c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -669,6 +669,16 @@ config PPC_256K_PAGES
 
 endchoice
 
+config THREAD_SHIFT
+   int "Thread shift" if EXPERT
+   range 13 15
+   default "15" if PPC_256K_PAGES
+   default "14" if PPC64
+   default "13"
+   help
+ Used to define the stack size. The default is almost always what you
+ want. Only change this if you know what you are doing.
+
 config FORCE_MAX_ZONEORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 87e4b2d8dcd4..2e17d668c472 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -10,15 +10,7 @@
 
 #ifdef __KERNEL__
 
-/* We have 8k stacks on ppc32 and 16k on ppc64 */
-
-#if defined(CONFIG_PPC64)
-#define THREAD_SHIFT   14
-#elif defined(CONFIG_PPC_256K_PAGES)
-#define THREAD_SHIFT   15
-#else
-#define THREAD_SHIFT   13
-#endif
+#define THREAD_SHIFT   CONFIG_THREAD_SHIFT
 
 #define THREAD_SIZE     (1 << THREAD_SHIFT)
 
-- 
2.11.0
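As a quick illustration of what the new option controls, the sketch below
mirrors the Kconfig defaults from the hunk above and the THREAD_SIZE
definition from thread_info.h. The function names and boolean parameters
are hypothetical; in the kernel these are compile-time config symbols.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same ordering as the "default" lines in the Kconfig entry. */
static unsigned int default_thread_shift(bool ppc_256k_pages, bool ppc64)
{
	if (ppc_256k_pages)
		return 15;
	if (ppc64)
		return 14;
	return 13;
}

/* THREAD_SIZE is 1 << THREAD_SHIFT, so each step doubles the stack. */
static size_t thread_size(unsigned int thread_shift)
{
	return (size_t)1 << thread_shift;
}
```

The permitted range of 13 to 15 therefore corresponds to 8k, 16k and 32k
stacks, and selecting 15 on PPC64 gives the 32k stack the series is after.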

[PATCH v2 2/2] powerpc64: Allow for THREAD_SIZE > 16k

2017-02-23 Thread Hamish Martin

Fix an assembler error when the THREAD_SIZE is greater than 16k.

Signed-off-by: Hamish Martin 
Reviewed-by: Chris Packham 
---
 arch/powerpc/kernel/head_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 1dc5eae2ced3..0ddc602b33a4 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -949,7 +949,8 @@ start_here_multiplatform:
LOAD_REG_ADDR(r3,init_thread_union)
 
/* set up a stack pointer */
-   addi    r1,r3,THREAD_SIZE
+   LOAD_REG_IMMEDIATE(r1,THREAD_SIZE)
+   add r1,r3,r1
li  r0,0
stdu    r0,-STACK_FRAME_OVERHEAD(r1)
 
-- 
2.11.0
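The assembler error presumably comes from the immediate field of addi, which
is a 16-bit signed value and so tops out at 32767. A 16k THREAD_SIZE fits,
but 32k (1 << 15 = 32768) is one past the limit, hence the switch to loading
the constant into a register first. The range check below is only an
illustration of that limit; fits_simm16() is a made-up helper.

```c
#include <stdbool.h>

/* True if 'value' can be encoded as a 16-bit signed immediate. */
static bool fits_simm16(long value)
{
	return value >= -32768 && value <= 32767;
}
```

fits_simm16(1L << 14) is true and fits_simm16(1L << 15) is false, which
matches the "greater than 16k" wording in the changelog.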

Re: [PATCH kernel v5 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-23 Thread David Gibson

On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not only updates a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 

I have some comments on this patch, but all the definite ones are
pretty minor and could be done as later cleanups.

I have some more serious queries, but they are just queries and
requests for clarification.  If there are satisfactory answers to
them, I'll add my R-b.


---
> Changes:
> v5:
> * changed error codes in multiple places
> * added bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not attached already to LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>  arch/powerpc/include/asm/kvm_host.h|   8 +
>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
>  include/uapi/linux/kvm.h   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c   | 307 
> -
>  arch/powerpc/kvm/book3s_64_vio_hv.c| 152 +-
>  arch/powerpc/kvm/powerpc.c |   2 +
>  virt/kvm/vfio.c|  60 ++
>  8 files changed, 555 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.

Re: [PATCH 02/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:

>  static int ast_get_dram_info(struct drm_device *dev)
>  {
> +   struct device_node *np = dev->pdev->dev.of_node;
> struct ast_private *ast = dev->dev_private;
> -   uint32_t data, data2;
> -   uint32_t denum, num, div, ref_pll;
> +   uint32_t mcr_cfg, mcr_scu_mpll, mcr_scu_strap;
> +   uint32_t denum, num, div, ref_pll, dsel;
>
> -   if (ast->DisableP2A)
> -   {
> +   switch (ast->config_mode) {
> +   case ast_use_dt:
> +   /*
> +* If some properties are missing, use reasonable
> +* defaults for AST2400
> +*/
> +   if (of_property_read_u32(np, "ast,mcr-configuration", 
> &mcr_cfg))
> +   mcr_cfg = 0x0577;
> +   if (of_property_read_u32(np, "ast,ast,mcr-scu-mpll",
> +&mcr_scu_mpll))
> +   mcr_scu_mpll = 0x50C0;
> +   if (of_property_read_u32(np, "ast,ast,mcr-scu-strap",

Are these properties supposed to repeat the prefix "ast,ast"?

We've chosen aspeed as the vendor prefix for Aspeed stuff.

> +&mcr_scu_strap))
> +   mcr_scu_strap = 0;
> +   break;
> +   case ast_use_p2a:
> +   ast_write32(ast, 0xf004, 0x1e6e);
> +   ast_write32(ast, 0xf000, 0x1);
> +   mcr_cfg = ast_read32(ast, 0x10004);
> +   mcr_scu_mpll = ast_read32(ast, 0x10120);
> +   mcr_scu_strap = ast_read32(ast, 0x10170);
> +   break;
> +   case ast_use_defaults:
> +   default:
> ast->dram_bus_width = 16;
> ast->dram_type = AST_DRAM_1Gx16;
> ast->mclk = 396;
> +   return 0;
> }
> -   else
> -   {
> -   ast_write32(ast, 0xf004, 0x1e6e);
> -   ast_write32(ast, 0xf000, 0x1);
> -   data = ast_read32(ast, 0x10004);
>
> -   if (data & 0x40)
> -   ast->dram_bus_width = 16;
> -   else
> -   ast->dram_bus_width = 32;
> -
> -   if (ast->chip == AST2300 || ast->chip == AST2400) {
> -   switch (data & 0x03) {
> -   case 0:
> -   ast->dram_type = AST_DRAM_512Mx16;
> -   break;
> -   default:
> -   case 1:
> -   ast->dram_type = AST_DRAM_1Gx16;
> -   break;
> -   case 2:
> -   ast->dram_type = AST_DRAM_2Gx16;
> -   break;
> -   case 3:
> -   ast->dram_type = AST_DRAM_4Gx16;
> -   break;
> -   }
> -   } else {
> -   switch (data & 0x0c) {
> -   case 0:
> -   case 4:
> -   ast->dram_type = AST_DRAM_512Mx16;
> -   break;
> -   case 8:
> -   if (data & 0x40)
> -   ast->dram_type = AST_DRAM_1Gx16;
> -   else
> -   ast->dram_type = AST_DRAM_512Mx32;
> -   break;
> -   case 0xc:
> -   ast->dram_type = AST_DRAM_1Gx32;
> -   break;
> -   }
> -   }
> +   if (mcr_cfg & 0x40)
> +   ast->dram_bus_width = 16;
> +   else
> +   ast->dram_bus_width = 32;
>
> -   data = ast_read32(ast, 0x10120);
> -   data2 = ast_read32(ast, 0x10170);
> -   if (data2 & 0x2000)
> -   ref_pll = 14318;
> -   else
> -   ref_pll = 12000;
> -
> -   denum = data & 0x1f;
> -   num = (data & 0x3fe0) >> 5;
> -   data = (data & 0xc000) >> 14;
> -   switch (data) {
> -   case 3:
> -   div = 0x4;
> +   if (ast->chip == AST2300 || ast->chip == AST2400) {
> +   switch (mcr_cfg & 0x03) {
> +   case 0:
> +   ast->dram_type = AST_DRAM_512Mx16;
> break;
> -   case 2:
> +   default:
> case 1:
> -   div = 0x2;
> +   ast->dram_type = AST_DRAM_1Gx16;
> break;
> -   default:
> -   div = 0x1;
> +   case 2:
> +   ast->dram_type = AST_DRAM_2Gx16;
> +   break;
> +

Re: [PATCH 03/12] drm/ast: const'ify mode setting tables

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> And fix some comment alignment & space/tabs while at it
>
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_drv.h|   4 +-
>  drivers/gpu/drm/ast/ast_mode.c   |   8 +--
>  drivers/gpu/drm/ast/ast_tables.h | 106 
> +++
>  3 files changed, 59 insertions(+), 59 deletions(-)

Re: [PATCH 09/12] drm/ast: Rename ast_init_dram_2300 to ast_post_chip_2300

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> The function does more than initialize the DRAM and in turn
> calls other functions to do the actual init. This will keep
> things more consistent with the upcoming AST2500 POST code.
>
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_post.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Re: [PATCH 10/12] drm/ast: POST code for the new AST2500

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> From: "Y.C. Chen" 
>
> This is used when the BMC isn't running any code and thus has
> to be initialized by the host.
>
> The code originates from Aspeed (Y.C. Chen) and has been cleaned
> up for coding style purposes by BenH.
>
> Signed-off-by: Y.C. Chen 
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> --
>
> v2. - Fix bug in ddr_test_2500 reported by Emil Velikov
> - Rebase on updated mmc_test factoring patch
> - Fix missing else statement in 2500 POST code
> ---
>  drivers/gpu/drm/ast/ast_dram_tables.h |  62 +
>  drivers/gpu/drm/ast/ast_post.c| 417 
> +-
>  2 files changed, 476 insertions(+), 3 deletions(-)
>

> +void ast_post_chip_2500(struct drm_device *dev)
> +{
> +   struct ast_private *ast = dev->dev_private;
> +   u32 temp;
> +   u8 reg;
> +
> +   reg = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd0, 0xff);
> +   if ((reg & 0x80) == 0) {/* vga only */
> +   /* Clear bus lock condition */
> +   ast_moutdwm(ast, 0x1e60, 0xAEED1A03);
> +   ast_moutdwm(ast, 0x1e600084, 0x0001);
> +   ast_moutdwm(ast, 0x1e600088, 0x);
> +   ast_moutdwm(ast, 0x1e6e2000, 0x1688A8A8);
> +   ast_write32(ast, 0xf004, 0x1e6e);
> +   ast_write32(ast, 0xf000, 0x1);
> +   ast_write32(ast, 0x12000, 0x1688a8a8);
> +   while (ast_read32(ast, 0x12000) != 0x1)
> +   ;
> +
> +   ast_write32(ast, 0x1, 0xfc600309);
> +   while (ast_read32(ast, 0x1) != 0x1)
> +   ;
> +
> +   /* Slow down CPU/AHB CLK in VGA only mode */
> +   temp = ast_read32(ast, 0x12008);
> +   temp |= 0x73;
> +   ast_write32(ast, 0x12008, temp);
> +
> +   /* Reset USB port to patch USB unknown device issue */

Really?!

> +   ast_moutdwm(ast, 0x1e6e2090, 0x2000);
> +   temp  = ast_mindwm(ast, 0x1e6e2094);
> +   temp |= 0x4000;
> +   ast_moutdwm(ast, 0x1e6e2094, temp);
> +   temp  = ast_mindwm(ast, 0x1e6e2070);
> +   if (temp & 0x0080) {
> +   ast_moutdwm(ast, 0x1e6e207c, 0x0080);
> +   mdelay(100);
> +   ast_moutdwm(ast, 0x1e6e2070, 0x0080);
> +   }
> +
> +   if (!ast_dram_init_2500(ast))
> +   DRM_ERROR("DRAM init failed !\n");
> +
> +   temp = ast_mindwm(ast, 0x1e6e2040);
> +   ast_moutdwm(ast, 0x1e6e2040, temp | 0x40);
> +   }
> +
> +   /* wait ready */
> +   do {
> +   reg = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd0, 
> 0xff);
> +   } while ((reg & 0x40) == 0);
> +}
> --
> 2.9.3
>

Re: [PATCH 12/12] drm/ast: Call open_key before enable_mmio in POST code

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> From: "Y.C. Chen" 
>
> open_key enables access to the registers used by enable_mmio
>
> Signed-off-by: Y.C. Chen 
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_post.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Re: [PATCH 08/12] drm/ast: Factor mmc_test code in POST code

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> There's some duplication for what's essentially copies of
> two loops, so factor it. The upcoming AST2500 POST code adds
> more of them. Also cleanup return types for the test functions,
> most of them return a boolean, some return a u32.
>
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> --
>
> v2. - Keep the split between the "test" and "test2" functions
>   as they have a different exit condition in the loop and
>   a different return type.
> - Fix the return types across the call chain
> ---
>  drivers/gpu/drm/ast/ast_post.c | 82 
> --
>  1 file changed, 31 insertions(+), 51 deletions(-)

Re: [PATCH 06/12] drm/ast: Base support for AST2500

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> From: "Y.C. Chen" 
>
> Add detection and mode setting updates for AST2500 generation chip,
> code originally from Aspeed and slightly reworked for coding style
> mostly by Ben. This doesn't contain the BMC DRAM POST code which
> is in a separate patch.
>
> Signed-off-by: Y.C. Chen 
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>
> v2. Add 800Mhz default mclk for AST2500
> ---
>  drivers/gpu/drm/ast/ast_drv.h|  2 ++
>  drivers/gpu/drm/ast/ast_main.c   | 32 +++---
>  drivers/gpu/drm/ast/ast_mode.c   | 30 -
>  drivers/gpu/drm/ast/ast_tables.h | 58 
> +---
>  4 files changed, 103 insertions(+), 19 deletions(-)
>

Re: [PATCH 01/12] drm/ast: Fix AST2400 POST failure without BMC FW or VBIOS

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> From: "Y.C. Chen" 
>
> The current POST code for the AST2300/2400 family doesn't work properly
> if the chip hasn't been initialized previously by either the BMC own FW
> or the VBIOS. This fixes it.
>
> Signed-off-by: Y.C. Chen 
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_post.c | 38 +++---
>  1 file changed, 35 insertions(+), 3 deletions(-)
>

Re: [PATCH 04/12] drm/ast: Remove spurrious include

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_main.c | 2 --
>  1 file changed, 2 deletions(-)

Re: [PATCH 05/12] drm/ast: Fix calculation of MCLK

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> Some braces were missing causing an incorrect calculation.
>
> Y.C. Chen from Aspeed provided me with the right formula
> which I tested on AST2400 and 2500.

Y. C. Chen, can you point out this calculation in the programming guide?

All of the PLL calculations I can find in the ast2400 documentation
are different to this one.

Cheers,

Joel

>
> The MCLK isn't currently used by the driver (it will eventually
> to filter modes) so the issue isn't catastrophic.
>
> Also make the printed value a bit more meaningful
>
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  drivers/gpu/drm/ast/ast_main.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
> index 718c15b..d194af3 100644
> --- a/drivers/gpu/drm/ast/ast_main.c
> +++ b/drivers/gpu/drm/ast/ast_main.c
> @@ -352,7 +352,7 @@ static int ast_get_dram_info(struct drm_device *dev)
> div = 0x1;
> break;
> }
> -   ast->mclk = ref_pll * (num + 2) / (denum + 2) * (div * 1000);
> +   ast->mclk = ref_pll * (num + 2) / ((denum + 2) * (div * 1000));
> return 0;
>  }
>
> @@ -496,7 +496,9 @@ int ast_driver_load(struct drm_device *dev, unsigned long 
> flags)
> if (ret)
> goto out_free;
> ast->vram_size = ast_get_vram_info(dev);
> -   DRM_INFO("dram %d %d %d %08x\n", ast->mclk, ast->dram_type, 
> ast->dram_bus_width, ast->vram_size);
> +   DRM_INFO("dram MCLK=%u Mhz type=%d bus_width=%d size=%08x\n",
> +ast->mclk, ast->dram_type,
> +ast->dram_bus_width, ast->vram_size);
> }
>
> if (need_post)
> --
> 2.9.3
>

Re: [PATCH 07/12] drm/ast: Fixed vram size incorrect issue on POWER

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
 wrote:
> From: "Y.C. Chen" 
>
> The default value of the VGA scratch register may be incorrect.
> The hardware should be initialized before getting the VRAM info.
>
> Signed-off-by: Y.C. Chen 

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_main.c | 6 +++---
>  drivers/gpu/drm/ast/ast_post.c | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)

Re: [PATCH 02/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Benjamin Herrenschmidt

On Fri, 2017-02-24 at 12:51 +1030, Joel Stanley wrote:
> Are these properties supposed to repeat the prefix "ast,ast"?
> 
> We've chosen aspeed as the vendor prefix for Aspeed stuff.

Argh no, that's a typo... must have worked in my tests because
the defaults are fine. I'll update.

Cheers,
Ben.

Re: [PATCH 02/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Benjamin Herrenschmidt

On Fri, 2017-02-24 at 12:51 +1030, Joel Stanley wrote:
> 
> Are these properties supposed to repeat the prefix "ast,ast"?
> 
> We've chosen aspeed as the vendor prefix for Aspeed stuff.

Sent my reply too early ... so yes, I can change that, our FW hasn't
merged the FW side yet. I'll respin now.

> > +   if (mcr_scu_strap & 0x2000)
> 
> This bit confused me. Bit 13 of the strap (SCU70) is the SPI mode.

The register is actually "MCR170: AST2000 Backward Compatible SCU
Hardware Strapping Value"

> > +   ref_pll = 14318;
> > +   else
> > +   ref_pll = 12000;
> > +
> > +   denum = mcr_scu_mpll & 0x1f;
> > +   num = (mcr_scu_mpll & 0x3fe0) >> 5;
> > +   dsel = (mcr_scu_mpll & 0xc000) >> 14;
> 
> These calculations don't make sense for the ast2400 or ast2500.

They do if you look at this:

MCR120: AST2000 Backward Compatible SCU MPLL Parameter

It's not the SCU version of the register it's the MCU "copy" of it
that maintains some kind of legacy layout. Hence "mcr_scu" prefix
not "scu".

> > +   switch (dsel) {
> > +   case 3:
> > +   div = 0x4;
> > +   break;
> > +   case 2:
> > +   case 1:
> > +   div = 0x2;
> > +   break;
> > +   default:
> > +   div = 0x1;
> > +   break;
> > }
> > +   ast->mclk = ref_pll * (num + 2) / (denum + 2) * (div *
> > 1000);
> > return 0;
> >  }

Re: [PATCH 05/12] drm/ast: Fix calculation of MCLK

2017-02-23 Thread Benjamin Herrenschmidt

On Fri, 2017-02-24 at 12:54 +1030, Joel Stanley wrote:
> On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
>  wrote:
> > Some braces were missing causing an incorrect calculation.
> > 
> > Y.C. Chen from Aspeed provided me with the right formula
> > which I tested on AST2400 and 2500.
> 
> Y. C. Chen, can you point out this calculation in the programming
> guide?
> 
> All of the PLL calculations I can find in the ast2400 documentation
> are different to this one.

Different PLL register, see my other email. I've checked the result
of the calculation on our AST2500 and AST2400 machines.

Cheers,
Ben.

> Cheers,
> 
> Joel
> 
> > 
> > The MCLK isn't currently used by the driver (it will eventually
> > to filter modes) so the issue isn't catastrophic.
> > 
> > Also make the printed value a bit more meaningful
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  drivers/gpu/drm/ast/ast_main.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/ast/ast_main.c
> > b/drivers/gpu/drm/ast/ast_main.c
> > index 718c15b..d194af3 100644
> > --- a/drivers/gpu/drm/ast/ast_main.c
> > +++ b/drivers/gpu/drm/ast/ast_main.c
> > @@ -352,7 +352,7 @@ static int ast_get_dram_info(struct drm_device
> > *dev)
> > div = 0x1;
> > break;
> > }
> > -   ast->mclk = ref_pll * (num + 2) / (denum + 2) * (div *
> > 1000);
> > +   ast->mclk = ref_pll * (num + 2) / ((denum + 2) * (div *
> > 1000));
> > return 0;
> >  }
> > 
> > @@ -496,7 +496,9 @@ int ast_driver_load(struct drm_device *dev,
> > unsigned long flags)
> > if (ret)
> > goto out_free;
> > ast->vram_size = ast_get_vram_info(dev);
> > -   DRM_INFO("dram %d %d %d %08x\n", ast->mclk, ast-
> > >dram_type, ast->dram_bus_width, ast->vram_size);
> > +   DRM_INFO("dram MCLK=%u Mhz type=%d bus_width=%d
> > size=%08x\n",
> > +ast->mclk, ast->dram_type,
> > +ast->dram_bus_width, ast->vram_size);
> > }
> > 
> > if (need_post)
> > --
> > 2.9.3
> >

[PATCH v5 2/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Benjamin Herrenschmidt

The ast driver configures a window to enable access into BMC
memory space in order to read some configuration registers.

If this window is disabled, which it can be from the BMC side,
the ast driver can't function.

Closing this window is a necessity for security if a machine's
host side and BMC side are controlled by different parties;
i.e. a cloud provider offering machines "bare metal".

A recent patch went in to try to check if that window is open
but it does so by trying to access the registers in question
and testing if the result is 0x.

This method will trigger a PCIe error when the window is closed
which on some systems will be fatal (it will trigger an EEH
for example on POWER which will take out the device).

This patch improves this in two ways:

 - First, if the firmware has put properties in the device-tree
containing the relevant configuration information, we use these.

 - Otherwise, a bit in one of the SCU scratch registers (which
are readable via the VGA register space and writeable by the BMC)
will indicate if the BMC has closed the window. This bit has been
defined by Y.C Chen from Aspeed.

If the window is closed and the configuration isn't available from
the device-tree, some sane defaults are used. Those defaults are
hopefully sufficient for standard video modes used on a server.

Signed-off-by: Russell Currey 
Signed-off-by: Benjamin Herrenschmidt 
--

v2. [BenH]
- Reworked on top of Aspeed P2A patch
- Cleanup overall detection via a "config_mode" and log the
  selected mode for diagnostics purposes
- Add a property for the SCU straps

v3. [BenH]
- Moved the config mode detection to a separate function
- Add reading of SCU 0x40 D[12] to detect the window is
  closed as to not trigger a bus error by just "trying".
  (change provided by Y.C. Chen)
v4. [BenH]
- Only devices with the AST2000 PCI ID have a P2A bridge
- Update the P2A presence test to account for VGA only
  mode as provided by Y.C. Chen.
v5. [BenH]
- Fixup prefix of OF properties based on Joel Stanley
  review comments.
---
 drivers/gpu/drm/ast/ast_drv.h  |   6 +-
 drivers/gpu/drm/ast/ast_main.c | 264 +
 drivers/gpu/drm/ast/ast_post.c |   7 +-
 3 files changed, 168 insertions(+), 109 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_drv.h b/drivers/gpu/drm/ast/ast_drv.h
index 7abda94..3bedcf7 100644
--- a/drivers/gpu/drm/ast/ast_drv.h
+++ b/drivers/gpu/drm/ast/ast_drv.h
@@ -113,7 +113,11 @@ struct ast_private {
struct ttm_bo_kmap_obj cache_kmap;
int next_cursor;
bool support_wide_screen;
-   bool DisableP2A;
+   enum {
+   ast_use_p2a,
+   ast_use_dt,
+   ast_use_defaults
+   } config_mode;
 
enum ast_tx_chip tx_chip_type;
u8 dp501_maxclk;
diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 533e762..fb99762 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -62,13 +62,84 @@ uint8_t ast_get_index_reg_mask(struct ast_private *ast,
return ret;
 }
 
+static void ast_detect_config_mode(struct drm_device *dev, u32 *scu_rev)
+{
+   struct device_node *np = dev->pdev->dev.of_node;
+   struct ast_private *ast = dev->dev_private;
+   uint32_t data, jregd0, jregd1;
+
+   /* Defaults */
+   ast->config_mode = ast_use_defaults;
+   *scu_rev = 0x;
+
+   /* Check if we have device-tree properties */
+   if (np && !of_property_read_u32(np, "aspeed,scu-revision-id",
+   scu_rev)) {
+   /* We do, disable P2A access */
+   ast->config_mode = ast_use_dt;
+   DRM_INFO("Using device-tree for configuration\n");
+   return;
+   }
+
+   /* Not all families have a P2A bridge */
+   if (dev->pdev->device != PCI_CHIP_AST2000)
+   return;
+
+   /*
+* The BMC will set SCU 0x40 D[12] to 1 if the P2 bridge
+* is disabled. We force using P2A if VGA only mode bit
+* is set D[7]
+*/
+   jregd0 = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd0, 0xff);
+   jregd1 = ast_get_index_reg_mask(ast, AST_IO_CRTC_PORT, 0xd1, 0xff);
+   if (!(jregd0 & 0x80) || !(jregd1 & 0x10)) {
+   /* Double check it's actually working */
+   data = ast_read32(ast, 0xf004);
+   if (data != 0x) {
+   /* P2A works, grab silicon revision */
+   ast->config_mode = ast_use_p2a;
+
+   DRM_INFO("Using P2A bridge for configuration\n");
+
+   /* Read SCU7c (silicon revision register) */
+   ast_write32(ast, 0xf004, 0x1e6e);
+   ast_write32(ast, 0xf000, 0x1);
+   *scu_rev = ast_read32(ast, 0x1207c);
+   return;
+   }
+

Re: [PATCH v5 2/12] drm/ast: Handle configuration without P2A bridge

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 1:11 PM, Benjamin Herrenschmidt
 wrote:
> The ast driver configures a window to enable access into BMC
> memory space in order to read some configuration registers.
>
> If this window is disabled, which it can be from the BMC side,
> the ast driver can't function.
>
> Closing this window is a necessity for security if a machine's
> host side and BMC side are controlled by different parties;
> i.e. a cloud provider offering machines "bare metal".
>
> A recent patch went in to try to check if that window is open
> but it does so by trying to access the registers in question
> and testing if the result is 0x.
>
> This method will trigger a PCIe error when the window is closed
> which on some systems will be fatal (it will trigger an EEH
> for example on POWER which will take out the device).
>
> This patch improves this in two ways:
>
>  - First, if the firmware has put properties in the device-tree
> containing the relevant configuration information, we use these.
>
>  - Otherwise, a bit in one of the SCU scratch registers (which
> are readable via the VGA register space and writeable by the BMC)
> will indicate if the BMC has closed the window. This bit has been
> defined by Y.C Chen from Aspeed.
>
> If the window is closed and the configuration isn't available from
> the device-tree, some sane defaults are used. Those defaults are
> hopefully sufficient for standard video modes used on a server.
>
> Signed-off-by: Russell Currey 
> Signed-off-by: Benjamin Herrenschmidt 
> --
>
> v2. [BenH]
> - Reworked on top of Aspeed P2A patch
> - Cleanup overall detection via a "config_mode" and log the
>   selected mode for diagnostics purposes
> - Add a property for the SCU straps
>
> v3. [BenH]
> - Moved the config mode detection to a separate function
> - Add reading of SCU 0x40 D[12] to detect the window is
>   closed as to not trigger a bus error by just "trying".
>   (change provided by Y.C. Chen)
> v4. [BenH]
> - Only devices with the AST2000 PCI ID have a P2A bridge
> - Update the P2A presence test to account for VGA only
>   mode as provided by Y.C. Chen.
> v5. [BenH]
> - Fixup prefix of OF properties based on Joel Stanley
>   review comments.

LGTM. Thanks for explaining the MCR regs, that stuff checks out now.

Acked-by: Joel Stanley 

> ---
>  drivers/gpu/drm/ast/ast_drv.h  |   6 +-
>  drivers/gpu/drm/ast/ast_main.c | 264 
> +
>  drivers/gpu/drm/ast/ast_post.c |   7 +-
>  3 files changed, 168 insertions(+), 109 deletions(-)
>

Re: [PATCH 05/12] drm/ast: Fix calculation of MCLK

2017-02-23 Thread Joel Stanley

On Fri, Feb 24, 2017 at 1:08 PM, Benjamin Herrenschmidt
 wrote:
> On Fri, 2017-02-24 at 12:54 +1030, Joel Stanley wrote:
>> On Fri, Feb 24, 2017 at 9:23 AM, Benjamin Herrenschmidt
>>  wrote:
>> > Some braces were missing causing an incorrect calculation.
>> >
>> > Y.C. Chen from Aspeed provided me with the right formula
>> > which I tested on AST2400 and 2500.
>>
>> Y. C. Chen, can you point out this calculation in the programming
>> guide?
>>
>> All of the PLL calculations I can find in the ast2400 documentation
>> are different to this one.
>
> Different PLL register, see my other email. I've checked the result
> of the calculation on our AST2500 and AST2400 machines.

I see now. It would be good to see the calculation added to Aspeed's
documentation in the future.

Acked-by: Joel Stanley 

Cheers,

Joel

Re: [PATCH kernel v5 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-23 Thread Alexey Kardashevskiy

On 24/02/17 13:14, David Gibson wrote:
> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>> clear it and move on as there is nothing really we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange not only updates a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy 
> 
> I have some comments on this patch, but all the definite ones are
> pretty minor and could be done as later cleanups.
> 
> I have some more serious queries, but they are just queries and
> requests for clarification.  If there are satisfactory answers to
> them, I'll add my R-b.
> 
> 
> ---
>> Changes:
>> v5:
>> * changed error codes in multiple places
>> * added bunch of WARN_ON() in places which should not really happen
>> * added a check that an iommu table is not attached already to LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed 
>> later
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>  arch/powerpc/include/asm/kvm_host.h|   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
>>  include/uapi/linux/kvm.h   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c   | 307 
>> -
>>  arch/powerpc/kvm/book3s_64_vio_hv.c| 152 +-
>>  arch/powerpc/kvm/powerpc.c |   2 +
>>  virt/kvm/vfio.c

Re: [PATCH kernel v5 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-23 Thread David Gibson

On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
> On 24/02/17 13:14, David Gibson wrote:
> > On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> > 
> > I have some comments on this patch, but all the definite ones are
> > pretty minor and could be done as later cleanups.
> > 
> > I have some more serious queries, but they are just queries and
> > requests for clarification.  If there are satisfactory answers to
> > them, I'll add my R-b.
> > 
> > 
> > ---
> >> Changes:
> >> v5:
> >> * changed error codes in multiple places
> >> * added bunch of WARN_ON() in places which should not really happen
> >> * added a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed 
> >> later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
> >>  include/uapi/linux

Re: [PATCH kernel v5 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-23 Thread Alexey Kardashevskiy

On 24/02/17 14:36, David Gibson wrote:
> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
>> On 24/02/17 13:14, David Gibson wrote:
>>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>> without passing them to user space which saves time on switching
>>>> to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in the real mode, if failed
>>>> it passes the request to the virtual mode to complete the operation.
>>>> If a virtual mode handler fails, the request is passed to
>>>> the user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>>>> clear it and move on as there is nothing really we can do about it -
>>>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>>>> will be mirrored automatically to the hardware and there is no interface
>>>> to report to the guest about possible failures.
>>>>
>>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is cached and referenced so we do not have to look up for it in real mode.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>> to the same LIOBN; we do not remove duplicates though as
>>>> iommu_table_ops::exchange does not just update a TCE entry (which is
>>>> shared among IOMMU groups) but also invalidates the TCE cache
>>>> (one per IOMMU group).
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy 
>>>
>>> I have some comments on this patch, but all the definite ones are
>>> pretty minor and could be done as later cleanups.
>>>
>>> I have some more serious queries, but they are just queries and
>>> requests for clarification.  If there are satisfactory answers to
>>> them, I'll add my R-b.
>>>
>>>
>>> ---
>>>> Changes:
>>>> v5:
>>>> * changed error codes in multiple places
>>>> * added bunch of WARN_ON() in places which should not really happen
>>>> * added a check that an iommu table is not attached already to LIOBN
>>>> * dropped explicit calls to iommu_tce_clear_param_check/
>>>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>>>> call them anyway (since the previous patch)
>>>> * if we fail to update a hardware IOMMU table for unexpected reason,
>>>> this just clears the entry
>>>>
>>>> v4:
>>>> * added note to the commit log about allowing multiple updates of
>>>> the same IOMMU table;
>>>> * instead of checking for if any memory was preregistered, this
>>>> returns H_TOO_HARD if a specific page was not;
>>>> * fixed comments from v3 about error handling in many places;
>>>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>>>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>>>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>>>> the first attached table only (makes the code simpler);
>>>>
>>>> v3:
>>>> * simplified not to use VFIO group notifiers
>>>> * reworked cleanup, should be cleaner/simpler now
>>>>
>>>> v2:
>>>> * reworked to use new VFIO notifiers
>>>> * now same iommu_table may appear in the list several times, to be fixed 
>>>> later
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>>>  arch/powerpc/include/asm/kvm_host.h|   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h

Re: [PATCH kernel v5 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-23 Thread Alexey Kardashevskiy

On 24/02/17 14:43, Alexey Kardashevskiy wrote:
> On 24/02/17 14:36, David Gibson wrote:
>> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
>>> On 24/02/17 13:14, David Gibson wrote:
 On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
>
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
>
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
>
> If we fail to update a hardware IOMMU table unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
>
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
>
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
>
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
>
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>
> Signed-off-by: Alexey Kardashevskiy 

>>>> I have some comments on this patch, but all the definite ones are
>>>> pretty minor and could be done as later cleanups.
>>>>
>>>> I have some more serious queries, but they are just queries and
>>>> requests for clarification.  If there are satisfactory answers to
>>>> them, I'll add my R-b.
>>>>
>>>>
>>>> ---
>>>>> Changes:
>>>>> v5:
>>>>> * changed error codes in multiple places
>>>>> * added bunch of WARN_ON() in places which should not really happen
>>>>> * added a check that an iommu table is not attached already to LIOBN
>>>>> * dropped explicit calls to iommu_tce_clear_param_check/
>>>>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>>>>> call them anyway (since the previous patch)
>>>>> * if we fail to update a hardware IOMMU table for unexpected reason,
>>>>> this just clears the entry
>>>>>
>>>>> v4:
>>>>> * added note to the commit log about allowing multiple updates of
>>>>> the same IOMMU table;
>>>>> * instead of checking for if any memory was preregistered, this
>>>>> returns H_TOO_HARD if a specific page was not;
>>>>> * fixed comments from v3 about error handling in many places;
>>>>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>>>>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>>>>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>>>>> the first attached table only (makes the code simpler);
>>>>>
>>>>> v3:
>>>>> * simplified not to use VFIO group notifiers
>>>>> * reworked cleanup, should be cleaner/simpler now
>>>>>
>>>>> v2:
>>>>> * reworked to use new VFIO notifiers
>>>>> * now same iommu_table may appear in the list several times, to be fixed 
>>>>> later
>>>>> ---
>>>>>  Docume

Re: [RFC NO-MERGE 2/2] arch/powerpc/CAS: Update to new option-vector-5 format for CAS

2017-02-23 Thread Suraj Jitindar Singh

On Thu, 2017-02-23 at 15:44 +1100, Paul Mackerras wrote:
> On Tue, Feb 21, 2017 at 05:06:11PM +1100, Suraj Jitindar Singh wrote:
> > 
> > The CAS process has been updated to change how the host to guest
> Once again, explain CAS; perhaps "The ibm,client-architecture-support
> (CAS) negotiation process has been updated for POWER9 to ..."
> 
> > 
> > negotiation is done for the new hash/radix mmu as well as the nest
> > mmu,
> > process tables and guest translation shootdown (GTSE).
> > 
> > The host tells the guest which options it supports in
> > ibm,arch-vec-5-platform-support. The guest then chooses a subset of
> > these
> > to request in the CAS call and these are agreed to in the
> > ibm,architecture-vec-5 property of the chosen node.
> > 
> > Thus we read ibm,arch-vec-5-platform-support and make our selection
> > before
> > calling CAS. We then parse the ibm,architecture-vec-5 property of
> > the
> > chosen node to check whether we should run as hash or radix.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/include/asm/prom.h | 16 ---
> >  arch/powerpc/kernel/prom_init.c | 99
> > +++--
> >  arch/powerpc/mm/init_64.c   | 31 ++---
> >  3 files changed, 130 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/prom.h
> > b/arch/powerpc/include/asm/prom.h
> > index 8af2546..19d2e84 100644
> > --- a/arch/powerpc/include/asm/prom.h
> > +++ b/arch/powerpc/include/asm/prom.h
> > @@ -158,12 +158,16 @@ struct of_drconf_cell {
> >  #define OV5_PFO_HW_ENCR 0x1120  /* PFO
> > Encryption Accelerator */
> >  #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-
> > Processors supported */
> >  #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation
> > supported */
> > -#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix
> > MMU supported */
> > -#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash
> > MMU supported */
> > -#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no
> > segmentation) */
> > -#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB
> > or proc table */
> > -#define OV5_MMU_SLB 0x1800  /* always use SLB
> > */
> > -#define OV5_MMU_GTSE   0x1808  /* Guest
> > translation shootdown */
> > +/* MMU Base Architecture */
> > +#define OV5_MMU_HASH_300   0x1800  /* ISA v3.00 Hash
> > MMU Only */
> This is actually legacy HPT as well as ISA v3.00 HPT.

True

> 
> > 
> > +#define OV5_MMU_RADIX_300  0x1840  /* ISA v3.00 Radix
> > MMU Only */
> > +#define OV5_MMU_EITHER_300 0x1880  /* ISA v3.00 Hash
> > or Radix Supported */
> I wonder if it would work better to have a define for the 2-bit field
> with subsidiary definitions for the field values.  Something like
> 
> #define OV5_MMU_SELECTION 0x18c0
> #define  OV5_MMU_HPT      0x00
> #define  OV5_MMU_RADIX    0x40
> #define  OV5_MMU_EITHER   0x80

Yep that's clearer

> 
> > 
> > +#define OV5_NMMU   0x1820  /* Nest MMU
> > Available */
> > +/* Hash Table Extensions */
> > +#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment
> > Tables Available */
> > +#define OV5_HASH_GTSE  0x1940  /* Guest
> > Translation Shoot Down Avail */
> > +/* Radix Table Extensions */
> > +#define OV5_RADIX_GTSE 0x1A40  /* Guest
> > Translation Shoot Down Avail */
> >  
> >  /* Option Vector 6: IBM PAPR hints */
> >  #define OV6_LINUX  0x02  /* Linux is our OS */
> > diff --git a/arch/powerpc/kernel/prom_init.c
> > b/arch/powerpc/kernel/prom_init.c
> > index 37b5a29..8272104 100644
> > --- a/arch/powerpc/kernel/prom_init.c
> > +++ b/arch/powerpc/kernel/prom_init.c
> > @@ -168,6 +168,8 @@ static unsigned long __initdata
> > prom_tce_alloc_start;
> >  static unsigned long __initdata prom_tce_alloc_end;
> >  #endif
> >  
> > +static bool __initdata prom_radix_disable;
> > +
> >  /* Platforms codes are now obsolete in the kernel. Now only used
> > within this
> >   * file and ultimately gone too. Feel free to change them if you
> > need, they
> >   * are not shared with anything outside of this file anymore
> > @@ -626,6 +628,12 @@ static void __init early_cmdline_parse(void)
> >     prom_memory_limit = ALIGN(prom_memory_limit,
> > 0x100);
> >  #endif
> >     }
> > +
> > +   opt = strstr(prom_cmd_line, "disable_radix");
> > +   if (opt) {
> > +   prom_debug("Radix disabled from cmdline\n");
> > +   prom_radix_disable = true;
> > +   }
> >  }
> >  
> >  #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
> > @@ -693,8 +701,10 @@ struct option_vector5 {
> >     __be16 reserved3;
> >     u8 subprocessors;
> >     u8 byte22;
> > -   u8 intarch;
> > +   u8 xive;
> >     u8 mmu;
> > +   u8 hash_ext;
> > +   u8 radix_ext;
> >  } __packed;
> >  
> >  struct option_vector6 {
> > @@ -849,9 +859,10 @@ struct ibm_arch_vec __cacheline_aligned
> > ibm_architecture_vec = {
> >     .reserved2 = 0,
> >     .res

[PATCH v2] powerpc/powernv: add hdat attribute to sysfs

2017-02-23 Thread Matt Brown

The HDAT data area is consumed by skiboot and turned into a device-tree.
In some cases we would like to look directly at the HDAT, so this patch
adds a sysfs node to allow it to be viewed.  This is not possible through
/dev/mem as it is reserved memory which is stopped by the /dev/mem filter.

Signed-off-by: Matt Brown 
---

Between v1 and v2 of the patch the following changes were made.
Changelog:
- moved hdat code into opal-hdat.c
- added opal-hdat to the makefile
- changed struct and variable names from camelcase
---
 arch/powerpc/include/asm/opal.h|  1 +
 arch/powerpc/platforms/powernv/Makefile|  1 +
 arch/powerpc/platforms/powernv/opal-hdat.c | 63 ++
 arch/powerpc/platforms/powernv/opal.c  |  2 +
 4 files changed, 67 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/opal-hdat.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5c7db0f..b26944e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -277,6 +277,7 @@ extern int opal_async_comp_init(void);
 extern int opal_sensor_init(void);
 extern int opal_hmi_handler_init(void);
 extern int opal_event_init(void);
+extern void opal_hdat_sysfs_init(void);
 
 extern int opal_machine_check(struct pt_regs *regs);
 extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..9a0c9d6 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -3,6 +3,7 @@ obj-y   += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o
+obj-y  += opal-hdat.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-hdat.c b/arch/powerpc/platforms/powernv/opal-hdat.c
new file mode 100644
index 0000000..bd305e0
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-hdat.c
@@ -0,0 +1,63 @@
+/*
+ * PowerNV OPAL in-memory console interface
+ *
+ * Copyright 2014 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+struct hdat_info {
+	char *base;
+	u64 size;
+};
+
+static struct hdat_info hdat_inf;
+
+/* Read function for HDAT attribute in sysfs */
+static ssize_t hdat_read(struct file *file, struct kobject *kobj,
+			 struct bin_attribute *bin_attr, char *to,
+			 loff_t pos, size_t count)
+{
+	if (!hdat_inf.base)
+		return -ENODEV;
+
+	return memory_read_from_buffer(to, count, &pos, hdat_inf.base,
+				       hdat_inf.size);
+}
+
+
+/* HDAT attribute for sysfs */
+static struct bin_attribute hdat_attr = {
+	.attr = {.name = "hdat", .mode = 0444},
+	.read = hdat_read
+};
+
+void __init opal_hdat_sysfs_init(void)
+{
+	u64 hdat_addr[2];
+
+	/* Check for the hdat-map prop in device-tree */
+	if (of_property_read_u64_array(opal_node, "hdat-map", hdat_addr, 2)) {
+		pr_debug("OPAL: Property hdat-map not found.\n");
+		return;
+	}
+
+	/* Print out hdat-map values. [0]: base, [1]: size */
+	pr_debug("OPAL: HDAT Base address: %#llx\n", hdat_addr[0]);
+	pr_debug("OPAL: HDAT Size: %#llx\n", hdat_addr[1]);
+
+	hdat_inf.base = phys_to_virt(hdat_addr[0]);
+	hdat_inf.size = hdat_addr[1];
+
+	if (sysfs_create_bin_file(opal_kobj, &hdat_attr) != 0)
+		pr_debug("OPAL: sysfs file creation for HDAT failed");
+
+}
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..cae3745 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -740,6 +740,8 @@ static int __init opal_init(void)
 		opal_sys_param_init();
 		/* Setup message log sysfs interface. */
 		opal_msglog_sysfs_init();
+		/* Create hdat object under sys/firmware/opal */
+		opal_hdat_sysfs_init();
 	}
 
 	/* Initialize platform devices: IPMI backend, PRD & flash interface */
-- 
2.9.3
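
For anyone wanting to try the new node from user space, here is a minimal
sketch of a reader. It is not part of the patch. The path
/sys/firmware/opal/hdat is an assumption based on the attribute name "hdat"
and the comment in opal_init(); read_blob() is a hypothetical helper name.
The bin_attribute above does not set .size, so the file size reported by
stat() is probably not useful; the sketch therefore reads until EOF instead
of trusting the size.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a whole binary file into a malloc'd buffer, growing the buffer
 * as needed. Returns NULL on failure; on success the caller frees the
 * buffer and *len_out holds the number of bytes read.
 */
static unsigned char *read_blob(const char *path, size_t *len_out)
{
	FILE *fp = fopen(path, "rb");
	unsigned char *buf = NULL;
	size_t len = 0, cap = 0;

	if (!fp)
		return NULL;

	for (;;) {
		size_t n;

		if (len == cap) {
			size_t ncap = cap ? cap * 2 : 4096;
			unsigned char *nbuf = realloc(buf, ncap);

			if (!nbuf) {
				free(buf);
				fclose(fp);
				return NULL;
			}
			buf = nbuf;
			cap = ncap;
		}

		n = fread(buf + len, 1, cap - len, fp);
		len += n;
		if (n == 0)
			break;	/* EOF or error, told apart below */
	}

	if (ferror(fp)) {
		free(buf);
		fclose(fp);
		return NULL;
	}

	fclose(fp);
	*len_out = len;
	return buf;
}
```

A caller would do something like read_blob("/sys/firmware/opal/hdat", &len)
as root and then dump or parse the buffer.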

[PATCH V2 1/2] arch/powerpc/prom_init: Parse the command line before calling CAS

2017-02-23 Thread Suraj Jitindar Singh

On POWER9 the hypervisor requires the guest to decide whether it would
like to use a hash or radix mmu model at the time it calls
ibm,client-architecture-support (CAS) based on what the hypervisor has
said it's allowed to do. It is possible to disable radix by passing
"disable_radix" on the command line. The next patch will add support for
the new CAS format, thus we need to parse the command line before calling
CAS so we can correctly select which mmu we would like to use.

Signed-off-by: Suraj Jitindar Singh 
Reviewed-by: Paul Mackerras 

---

V1 -> V2:
 - Reword commit message for clarity. No functional change
---
 arch/powerpc/kernel/prom_init.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d3db1bc..37b5a29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2993,6 +2993,11 @@ unsigned long __init prom_init(unsigned long r3, unsigned long r4,
 	 */
 	prom_check_initrd(r3, r4);
 
+	/*
+	 * Do early parsing of command line
+	 */
+	early_cmdline_parse();
+
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 	/*
 	 * On pSeries, inform the firmware about our capabilities
@@ -3009,11 +3014,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned long r4,
 	copy_and_flush(0, kbase, 0x100, 0);
 
 	/*
-	 * Do early parsing of command line
-	 */
-	early_cmdline_parse();
-
-	/*
 	 * Initialize memory management within prom_init
 	 */
 	prom_init_mem();
-- 
2.5.5
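
To illustrate why the ordering matters, here is a small user-space sketch,
not kernel code. radix_disabled() mirrors a strstr()-style check for
"disable_radix", and pick_mmu() is a hypothetical stand-in for the choice
that has to be made before CAS is called. Both names and the enum are
assumptions made for this example only.

```c
#include <stdbool.h>
#include <string.h>

enum mmu_choice { MMU_HASH, MMU_RADIX };

/*
 * Substring search, as strstr() does: the option is found anywhere in
 * the command line, including inside a longer word.
 */
static bool radix_disabled(const char *cmdline)
{
	return strstr(cmdline, "disable_radix") != NULL;
}

/*
 * The MMU request depends on the command line, so the command line has
 * to be parsed before the request is built and handed to the hypervisor.
 */
static enum mmu_choice pick_mmu(const char *cmdline, bool host_offers_radix)
{
	if (host_offers_radix && !radix_disabled(cmdline))
		return MMU_RADIX;
	return MMU_HASH;
}
```

If the command line were parsed only after the request had been sent, the
"disable_radix" case would be decided too late to have any effect.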

[PATCH V2 2/2] arch/powerpc/CAS: Update to new option-vector-5 format for CAS

2017-02-23 Thread Suraj Jitindar Singh

On POWER9 the ibm,client-architecture-support (CAS) negotiation process
has been updated to change how the host to guest negotiation is done for
the new hash/radix mmu as well as the nest mmu, process tables and guest
translation shootdown (GTSE).

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.

ibm,arch-vec-5-platform-support format:

index value pairs:  ... 

index: Option vector 5 byte number
val:   Some representation of supported values
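
As a rough illustration of the format above (a user-space sketch under
stated assumptions, not the code in this patch): the property is treated as
a flat array of (index, value) byte pairs. The OV5_* values are copied from
the prom.h hunk below; OV5_INDX()/OV5_FEAT() are defined locally as
"high byte = vector 5 byte number, low byte = bits". The function name
parse_platform_support() and the handling of radix_disable are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local helpers for this sketch: high byte = index, low byte = bits */
#define OV5_INDX(x)	((x) >> 8)
#define OV5_FEAT(x)	((x) & 0xff)

#define OV5_MMU_SUPPORT	0x18C0	/* MMU Mode Support Mask */
#define OV5_MMU_HASH	0x00	/* Hash MMU Only */
#define OV5_MMU_RADIX	0x40	/* Radix MMU Only */
#define OV5_MMU_EITHER	0x80	/* Hash or Radix Supported */
#define OV5_RADIX_GTSE	0x1A40	/* Guest Translation Shoot Down Avail */

struct platform_support {
	bool hash_mmu;
	bool radix_mmu;
	bool radix_gtse;
};

/* Walk the (index, value) byte pairs and record what the host offers */
static void parse_platform_support(const uint8_t *prop, size_t len,
				   bool radix_disable,
				   struct platform_support *s)
{
	size_t i;

	s->hash_mmu = s->radix_mmu = s->radix_gtse = false;

	for (i = 0; i + 1 < len; i += 2) {
		uint8_t index = prop[i];
		uint8_t val = prop[i + 1];

		if (index == OV5_INDX(OV5_MMU_SUPPORT)) {
			/* Only the top two bits select the MMU mode */
			switch (val & OV5_FEAT(OV5_MMU_SUPPORT)) {
			case OV5_MMU_EITHER:
				s->hash_mmu = true;
				s->radix_mmu = !radix_disable;
				break;
			case OV5_MMU_RADIX:
				/* Nothing to fall back to (assumption) */
				s->radix_mmu = true;
				break;
			case OV5_MMU_HASH:
				s->hash_mmu = true;
				break;
			default:
				break;	/* unknown encoding, ignore */
			}
		} else if (index == OV5_INDX(OV5_RADIX_GTSE)) {
			if (val & OV5_FEAT(OV5_RADIX_GTSE))
				s->radix_gtse = true;
		}
	}
}
```

Masking with OV5_FEAT(OV5_MMU_SUPPORT) before the switch means other bits
in the same byte (for example the nest MMU bit) do not disturb the mode
decision.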

Signed-off-by: Suraj Jitindar Singh 

---

V1 -> V2:
 - Fix error where whole byte was compared for mmu support instead of only the
   first two bits
 - Break platform support parsing into multiple functions for clarity
 - Instead of printing WARNING: messages on old hypervisors change to a debug
   message
---
 arch/powerpc/include/asm/prom.h |  17 --
 arch/powerpc/kernel/prom_init.c | 120 ++--
 arch/powerpc/mm/init_64.c   |  36 ++--
 3 files changed, 157 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 8af2546..d838b9d 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -158,12 +158,17 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_ENCR    0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
 #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
-#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix MMU supported */
-#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash MMU supported */
-#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no segmentation) */
-#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB or proc table */
-#define OV5_MMU_SLB        0x1800  /* always use SLB */
-#define OV5_MMU_GTSE       0x1808  /* Guest translation shootdown */
+/* MMU Base Architecture */
+#define OV5_MMU_SUPPORT    0x18C0  /* MMU Mode Support Mask */
+#define OV5_MMU_HASH       0x00    /* Hash MMU Only */
+#define OV5_MMU_RADIX      0x40    /* Radix MMU Only */
+#define OV5_MMU_EITHER     0x80    /* Hash or Radix Supported */
+#define OV5_NMMU           0x1820  /* Nest MMU Available */
+/* Hash Table Extensions */
+#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment Tables Available */
+#define OV5_HASH_GTSE      0x1940  /* Guest Translation Shoot Down Avail */
+/* Radix Table Extensions */
+#define OV5_RADIX_GTSE     0x1A40  /* Guest Translation Shoot Down Avail */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX          0x02    /* Linux is our OS */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 37b5a29..08cd1b8 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -168,6 +168,14 @@ static unsigned long __initdata prom_tce_alloc_start;
 static unsigned long __initdata prom_tce_alloc_end;
 #endif
 
+static bool __initdata prom_radix_disable;
+
+struct platform_support {
+   bool hash_mmu;
+   bool radix_mmu;
+   bool radix_gtse;
+};
+
 /* Platforms codes are now obsolete in the kernel. Now only used within this
  * file and ultimately gone too. Feel free to change them if you need, they
  * are not shared with anything outside of this file anymore
@@ -626,6 +634,12 @@ static void __init early_cmdline_parse(void)
 		prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
 #endif
 	}
+
+	opt = strstr(prom_cmd_line, "disable_radix");
+	if (opt) {
+		prom_debug("Radix disabled from cmdline\n");
+		prom_radix_disable = true;
+	}
 }
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
@@ -693,8 +707,10 @@ struct option_vector5 {
 	__be16 reserved3;
 	u8 subprocessors;
 	u8 byte22;
-	u8 intarch;
+	u8 xive;
 	u8 mmu;
+	u8 hash_ext;
+	u8 radix_ext;
 } __packed;
 
 struct option_vector6 {
@@ -849,9 +865,10 @@ struct ibm_arch_vec __cacheline_aligned ibm_architecture_vec = {
 		.reserved2 = 0,
 		.reserved3 = 0,
 		.subprocessors = 1,
-		.intarch = 0,
-		.mmu = OV5_FEAT(OV5_MMU_RADIX_300) | OV5_FEAT(OV5_MMU_HASH_300) |
-			OV5_FEAT(OV5_MMU_PROC_TBL) | OV5_FEAT(OV5_MMU_GTSE),
+		.xive = 0,
+		.mmu = 0,
+		.hash_ext = 0,
+		.radix_ext = 0,
 	},
 
 	/* option vector 6: IBM PAPR hints */
@@ -990,6 +1007,98 @@ stati

Re: [PATCH v2] powerpc/powernv: add hdat attribute to sysfs

2017-02-23 Thread Suraj Jitindar Singh

On Fri, 2017-02-24 at 15:28 +1100, Matt Brown wrote:
> The HDAT data area is consumed by skiboot and turned into a device-
> tree.
> In some cases we would like to look directly at the HDAT, so this
> patch
> adds a sysfs node to allow it to be viewed.  This is not possible
> through
> /dev/mem as it is reserved memory which is stopped by the /dev/mem
> filter.
> 
> Signed-off-by: Matt Brown 
Your first patch, nice work! :)

See below.
> ---
> 
> Between v1 and v2 of the patch the following changes were made.
> Changelog:
>   - moved hdat code into opal-hdat.c
>   - added opal-hdat to the makefile
>   - changed struct and variable names from camelcase
> ---
>  arch/powerpc/include/asm/opal.h|  1 +
>  arch/powerpc/platforms/powernv/Makefile|  1 +
>  arch/powerpc/platforms/powernv/opal-hdat.c | 63 ++
>  arch/powerpc/platforms/powernv/opal.c  |  2 +
>  4 files changed, 67 insertions(+)
>  create mode 100644 arch/powerpc/platforms/powernv/opal-hdat.c
> 
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 5c7db0f..b26944e 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -277,6 +277,7 @@ extern int opal_async_comp_init(void);
>  extern int opal_sensor_init(void);
>  extern int opal_hmi_handler_init(void);
>  extern int opal_event_init(void);
> +extern void opal_hdat_sysfs_init(void);
>  
>  extern int opal_machine_check(struct pt_regs *regs);
>  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index b5d98cb..9a0c9d6 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -3,6 +3,7 @@ obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
>  obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
>  obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
>  obj-y += opal-kmsg.o
> +obj-y += opal-hdat.o
>  
>  obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
>  obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o
> diff --git a/arch/powerpc/platforms/powernv/opal-hdat.c b/arch/powerpc/platforms/powernv/opal-hdat.c
> new file mode 100644
> index 0000000..bd305e0
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-hdat.c
> @@ -0,0 +1,63 @@
> +/*
> + * PowerNV OPAL in-memory console interface
> + *
> + * Copyright 2014 IBM Corp.
2014?
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
Check with someone maybe, but I thought we had to use V2.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct hdat_info {
> + char *base;
> + u64 size;
> +};
> +
> +static struct hdat_info hdat_inf;
> +
> +/* Read function for HDAT attribute in sysfs */
> +static ssize_t hdat_read(struct file *file, struct kobject *kobj,
I assume this is just misaligned in my mail client...
> +  struct bin_attribute *bin_attr, char *to,
> +  loff_t pos, size_t count)
> +{
> + if (!hdat_inf.base)
> + return -ENODEV;
> +
> + return memory_read_from_buffer(to, count, &pos, hdat_inf.base,
> + hdat_inf.size);
> +}
> +
> +
> +/* HDAT attribute for sysfs */
> +static struct bin_attribute hdat_attr = {
> + .attr = {.name = "hdat", .mode = 0444},
> + .read = hdat_read
> +};
> +
> +void __init opal_hdat_sysfs_init(void)
> +{
> + u64 hdat_addr[2];
> +
> + /* Check for the hdat-map prop in device-tree */
> + if (of_property_read_u64_array(opal_node, "hdat-map", hdat_addr, 2)) {
> + pr_debug("OPAL: Property hdat-map not found.\n");
> + return;
> + }
> +
> + /* Print out hdat-map values. [0]: base, [1]: size */
> + pr_debug("OPAL: HDAT Base address: %#llx\n", hdat_addr[0]);
> + pr_debug("OPAL: HDAT Size: %#llx\n", hdat_addr[1]);
> +
> + hdat_inf.base = phys_to_virt(hdat_addr[0]);
> + hdat_inf.size = hdat_addr[1];
> +
> + if (sysfs_create_bin_file(opal_kobj, &hdat_attr) != 0)
 
The explicit "!= 0" comparison is not required.
This can be replaced with:
"if (sysfs_create_bin_file(opal_kobj, &hdat_attr))"
> + pr_debug("OPAL: sysfs file creation for HDAT failed");
> +
> +}
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 2822935..cae3745 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -740,6 +740,8 @@

[PATCH v3] powerpc/powernv: add hdat attribute to sysfs

2017-02-23 Thread Matt Brown

The HDAT data area is consumed by skiboot and turned into a device-tree.
In some cases we would like to look directly at the HDAT, so this patch
adds a sysfs node to allow it to be viewed.  This is not possible through
/dev/mem as it is reserved memory which is stopped by the /dev/mem filter.

Signed-off-by: Matt Brown 
---

Changes from v2 to v3:
- fixed header comments
- simplified if statement

---
 arch/powerpc/include/asm/opal.h|  1 +
 arch/powerpc/platforms/powernv/Makefile|  1 +
 arch/powerpc/platforms/powernv/opal-hdat.c | 65 ++
 arch/powerpc/platforms/powernv/opal.c  |  2 +
 4 files changed, 69 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/opal-hdat.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5c7db0f..b26944e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -277,6 +277,7 @@ extern int opal_async_comp_init(void);
 extern int opal_sensor_init(void);
 extern int opal_hmi_handler_init(void);
 extern int opal_event_init(void);
+extern void opal_hdat_sysfs_init(void);
 
 extern int opal_machine_check(struct pt_regs *regs);
 extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..9a0c9d6 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -3,6 +3,7 @@ obj-y   += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o
+obj-y  += opal-hdat.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-hdat.c b/arch/powerpc/platforms/powernv/opal-hdat.c
new file mode 100644
index 0000000..3315dd3
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-hdat.c
@@ -0,0 +1,65 @@
+/*
+ * PowerNV OPAL HDAT interface
+ *
+ * Author: Matt Brown 
+ *
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+struct hdat_info {
+   char *base;
+   u64 size;
+};
+
+static struct hdat_info hdat_inf;
+
+/* Read function for HDAT attribute in sysfs */
+static ssize_t hdat_read(struct file *file, struct kobject *kobj,
+struct bin_attribute *bin_attr, char *to,
+loff_t pos, size_t count)
+{
+   if (!hdat_inf.base)
+   return -ENODEV;
+
+   return memory_read_from_buffer(to, count, &pos, hdat_inf.base,
+   hdat_inf.size);
+}
+
+
+/* HDAT attribute for sysfs */
+static struct bin_attribute hdat_attr = {
+   .attr = {.name = "hdat", .mode = 0444},
+   .read = hdat_read
+};
+
+void __init opal_hdat_sysfs_init(void)
+{
+   u64 hdat_addr[2];
+
+   /* Check for the hdat-map prop in device-tree */
+   if (of_property_read_u64_array(opal_node, "hdat-map", hdat_addr, 2)) {
+   pr_debug("OPAL: Property hdat-map not found.\n");
+   return;
+   }
+
+   /* Print out hdat-map values. [0]: base, [1]: size */
+   pr_debug("OPAL: HDAT Base address: %#llx\n", hdat_addr[0]);
+   pr_debug("OPAL: HDAT Size: %#llx\n", hdat_addr[1]);
+
+   hdat_inf.base = phys_to_virt(hdat_addr[0]);
+   hdat_inf.size = hdat_addr[1];
+
+   if (sysfs_create_bin_file(opal_kobj, &hdat_attr))
+   pr_debug("OPAL: sysfs file creation for HDAT failed\n");
+
+}
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..cae3745 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -740,6 +740,8 @@ static int __init opal_init(void)
opal_sys_param_init();
/* Setup message log sysfs interface. */
opal_msglog_sysfs_init();
+   /* Create hdat object under sys/firmware/opal */
+   opal_hdat_sysfs_init();
}
 
/* Initialize platform devices: IPMI backend, PRD & flash interface */
-- 
2.9.3
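
The hdat_read() handler in the patch above delegates its bounds handling to memory_read_from_buffer(). As a rough userspace sketch of that clamping behaviour (read_from_buffer is a hypothetical stand-in name, off_t replaces loff_t, and a plain -1 replaces the kernel's -EINVAL), the logic looks roughly like this:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/*
 * Userspace sketch of the bounds handling performed by the kernel's
 * memory_read_from_buffer(). The name read_from_buffer is hypothetical.
 */
static ssize_t read_from_buffer(void *to, size_t count, off_t *ppos,
				const void *from, size_t available)
{
	off_t pos;

	assert(to && ppos && from);
	pos = *ppos;

	if (pos < 0)
		return -1;	/* the kernel helper returns -EINVAL here */
	if ((size_t)pos >= available)
		return 0;	/* offset at or past the end: report EOF */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;	/* clamp to what is left */

	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (off_t)count;	/* advance the caller's offset */
	return (ssize_t)count;
}
```

With an 8-byte source buffer and an offset of 5, a request for 16 bytes is clamped to 3 and the offset advances to 8. A following read at offset 8 returns 0, which is how a reader of the sysfs file would see end-of-file.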

powerpc: Enable cpuhotplug with ESL=1 on POWER9

2017-02-23 Thread Vaidyanathan Srinivasan

The attached patch enables ESL=1 STOP2 for cpuhotplug.  This is a debug
patch that we could carry now until STOP states are discovered from
device tree.

Test run:

[  151.670021] CPU8 going offline with request psscr 003f0332
[  151.719856] CPU 8 offline: Remove Rx thread
[  189.200410] CPU8 coming online with psscr 203f0332, srr1 90261001

We request ESL=EC=1 and STOP2 and successfully wake up from that state
using an IPI from XIVE.

--Vaidy

[PATCH 2/2] powerpc/cpuhotplug: print psscr and srr1 value for debug

2017-02-23 Thread Vaidyanathan Srinivasan

This is a debug patch that helps trace various STOP
state transitions and look at srr1 and psscr at wakeup.

Signed-off-by: Vaidyanathan Srinivasan 
---
 arch/powerpc/platforms/powernv/smp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index e39e6c4..5b3f002 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -185,8 +185,12 @@ static void pnv_smp_cpu_kill_self(void)
ppc64_runlatch_off();
 
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   pr_info("CPU%d going offline with request psscr %016llx\n",
+   cpu, pnv_deepest_stop_psscr_val);
srr1 = power9_idle_stop(pnv_deepest_stop_psscr_val,
pnv_deepest_stop_psscr_mask);
+   pr_info("CPU%d coming online with psscr %016lx, srr1 %016lx\n",
+   cpu, mfspr(SPRN_PSSCR), srr1);
} else if (idle_states & OPAL_PM_WINKLE_ENABLED) {
srr1 = power7_winkle();
} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
-- 
2.9.3

[PATCH 1/2] powerpc/cpuhotplug: Force ESL=1 for offline cpus

2017-02-23 Thread Vaidyanathan Srinivasan

From: Gautham R. Shenoy 

ESL=1 loses some HYP SPR context and is not ideal for cpuidle,
however it can be used for offline cpus.

Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/cpuidle.h| 15 +++
 arch/powerpc/platforms/powernv/idle.c | 11 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index fd321eb4..31192d8 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -40,6 +40,21 @@
 #define ERR_EC_ESL_MISMATCH -1
 #define ERR_DEEP_STATE_ESL_MISMATCH -2
 
+/* Additional defs for debug and trace */
+
+#define RL_SHIFT  0
+#define MTL_SHIFT 4
+#define TR_SHIFT  8
+#define PSLL_SHIFT 16
+#define EC_SHIFT  20
+#define ESL_SHIFT 21
+#define INIT_PSSCR(ESL, EC, PSLL, TR, MTL, RL) (((ESL) << (ESL_SHIFT)) | \
+((EC) << (EC_SHIFT)) | \
+((PSLL) << (PSLL_SHIFT)) | \
+((TR) << (TR_SHIFT)) | \
+((MTL) << (MTL_SHIFT)) | \
+((RL) << (RL_SHIFT)))
+
 #ifndef __ASSEMBLY__
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 4ee837e..4f0663a 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -524,6 +524,17 @@ static int __init pnv_init_idle_states(void)
 
pnv_alloc_idle_core_states();
 
+   /* On POWER9 DD1, enter stop2 with ESL=EC=1 on Hotplug */
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
+   pnv_deepest_stop_psscr_val =
+   /* ESL, EC, PSLL, TR,  MTL, RL */
+   INIT_PSSCR(0x1, 0x1, 0xf, 0x3, 0x3, 0x2);
+   pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
+   pr_warn("Overriding deepest_stop_psscr to: val=0x%016llx,mask=0x%016llx\n",
+   pnv_deepest_stop_psscr_val,
+   pnv_deepest_stop_psscr_mask);
+   }
+
if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
ppc_md.power_save = power7_idle;
else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
-- 
2.9.3
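
As a worked check of the INIT_PSSCR() arithmetic in the patch above, the following standalone sketch packs the same six fields and shows that the arguments (0x1, 0x1, 0xf, 0x3, 0x3, 0x2) yield 0x3f0332, the request value seen in the cover letter's test run. The shifts are copied from the patch. The helper names init_psscr() and psscr_field(), and the field widths passed to the decoder, are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Field shifts as defined in the patch. */
#define RL_SHIFT	0
#define MTL_SHIFT	4
#define TR_SHIFT	8
#define PSLL_SHIFT	16
#define EC_SHIFT	20
#define ESL_SHIFT	21

/* Hypothetical function form of INIT_PSSCR(), widened to 64 bits. */
static uint64_t init_psscr(uint64_t esl, uint64_t ec, uint64_t psll,
			   uint64_t tr, uint64_t mtl, uint64_t rl)
{
	return (esl << ESL_SHIFT) | (ec << EC_SHIFT) |
	       (psll << PSLL_SHIFT) | (tr << TR_SHIFT) |
	       (mtl << MTL_SHIFT) | (rl << RL_SHIFT);
}

/* Hypothetical decoder: extract a 'width'-bit field at 'shift'. */
static uint64_t psscr_field(uint64_t val, unsigned int shift,
			    unsigned int width)
{
	assert(width > 0 && width < 64);
	return (val >> shift) & ((UINT64_C(1) << width) - 1);
}
```

Term by term: 0x200000 (ESL) | 0x100000 (EC) | 0xf0000 (PSLL) | 0x300 (TR) | 0x30 (MTL) | 0x2 (RL) = 0x3f0332. Decoding the low four bits with psscr_field(0x3f0332, RL_SHIFT, 4) gives 2, which matches the STOP2 request described in the cover letter.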

Re: [kernel-hardening] Re: [PATCH 1/2] powerpc: mm: support ARCH_MMAP_RND_BITS

2017-02-23 Thread Bhupesh Sharma

Hi Michael,

On Thu, Feb 16, 2017 at 10:19 AM, Bhupesh Sharma  wrote:
> Hi Michael,
>
> On Fri, Feb 10, 2017 at 4:41 PM, Bhupesh Sharma  wrote:
>> On Fri, Feb 10, 2017 at 4:31 PM, Michael Ellerman  wrote:
>>> Bhupesh Sharma  writes:
>>>
 HI Michael,

 On Thu, Feb 2, 2017 at 3:53 PM, Michael Ellerman  
 wrote:
> Bhupesh Sharma  writes:
>
>> powerpc: arch_mmap_rnd() uses hard-coded values, (23-PAGE_SHIFT) for
>> 32-bit and (30-PAGE_SHIFT) for 64-bit, to generate the random offset
>> for the mmap base address.
>>
>> This value represents a compromise between increased
>> ASLR effectiveness and avoiding address-space fragmentation.
>> Replace it with a Kconfig option, which is sensibly bounded, so that
>> platform developers may choose where to place this compromise.
>> Keep default values as new minimums.
>>
>> This patch makes sure that now powerpc mmap arch_mmap_rnd() approach
>> is similar to other ARCHs like x86, arm64 and arm.
>
> Thanks for looking at this, it's been on my TODO for a while.
>
> I have a half completed version locally, but never got around to testing
> it thoroughly.

 Sure :)

>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index a8ee573fe610..b4a843f68705 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -22,6 +22,38 @@ config MMU
>>   bool
>>   default y
>>
>> +config ARCH_MMAP_RND_BITS_MIN
>> +   default 5 if PPC_256K_PAGES && 32BIT
>> +   default 12 if PPC_256K_PAGES && 64BIT
>> +   default 7 if PPC_64K_PAGES && 32BIT
>> +   default 14 if PPC_64K_PAGES && 64BIT
>> +   default 9 if PPC_16K_PAGES && 32BIT
>> +   default 16 if PPC_16K_PAGES && 64BIT
>> +   default 11 if PPC_4K_PAGES && 32BIT
>> +   default 18 if PPC_4K_PAGES && 64BIT
>> +
>> +# max bits determined by the following formula:
>> +#  VA_BITS - PAGE_SHIFT - 4
>> +#  for e.g for 64K page and 64BIT = 48 - 16 - 4 = 28
>> +config ARCH_MMAP_RND_BITS_MAX
>> +   default 10 if PPC_256K_PAGES && 32BIT
>> +   default 26 if PPC_256K_PAGES && 64BIT
>> +   default 12 if PPC_64K_PAGES && 32BIT
>> +   default 28 if PPC_64K_PAGES && 64BIT
>> +   default 14 if PPC_16K_PAGES && 32BIT
>> +   default 30 if PPC_16K_PAGES && 64BIT
>> +   default 16 if PPC_4K_PAGES && 32BIT
>> +   default 32 if PPC_4K_PAGES && 64BIT
>> +
>> +config ARCH_MMAP_RND_COMPAT_BITS_MIN
>> +   default 5 if PPC_256K_PAGES
>> +   default 7 if PPC_64K_PAGES
>> +   default 9 if PPC_16K_PAGES
>> +   default 11
>> +
>> +config ARCH_MMAP_RND_COMPAT_BITS_MAX
>> +   default 16
>> +
>
> This is what I have below, which is a bit neater I think because each
> value is only there once (by defaulting to the COMPAT value).
>
> My max values are different to yours, I don't really remember why I
> chose those values, so we can argue about which is right.

 I am not sure how you derived these values, but I am not sure there
 should be differences between 64-BIT x86/ARM64 and PPC values for the
 MAX values.
>>>
>>> But your values *are* different to x86 and arm64.
>>>
>>> And why would they be the same anyway? x86 has a 47 bit address space,
>>> 64-bit powerpc is 46 bits, and arm64 is configurable from 36 to 48 bits.
>>>
>>> So your calculations above using VA_BITS = 48 should be using 46 bits.
>>>
>>> But if you fixed that, your formula basically gives 1/16th of the
>>> address space as the maximum range. Why is that the right amount?
>>>
>>> x86 uses 1/8th, and arm64 uses a mixture of 1/8th and 1/32nd (though
>>> those might be bugs).
>>>
>>> My values were more liberal, giving up to half the address space for 32
>>> & 64-bit. Maybe that's too generous, but my rationale was it's up to the
>>> sysadmin to tweak the values and they get to keep the pieces if it
>>> breaks.
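
The arithmetic being argued over here can be written down as a small sketch. The helper name mmap_rnd_bits_max() and its frac_shift parameter are hypothetical. The helper only encodes the relation from the thread: allowing the random offset to span 1/2^frac_shift of a 2^va_bits address space leaves va_bits - PAGE_SHIFT - frac_shift random bits.

```c
#include <assert.h>

/*
 * Hypothetical helper: maximum number of random mmap bits when the
 * random offset may cover at most 1/2^frac_shift of a 2^va_bits
 * address space with pages of size 2^page_shift.
 */
static int mmap_rnd_bits_max(int va_bits, int page_shift, int frac_shift)
{
	assert(va_bits > page_shift + frac_shift);
	return va_bits - page_shift - frac_shift;
}
```

With the numbers quoted in the thread: the patch's own example (48, 16, 4) gives 28. The same 1/16th fraction with the 46-bit powerpc address space gives 26 for 64K pages. The more liberal "half the address space" choice, (46, 16, 1), gives 29, which matches the 2^45 table entry for 64K pages.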
>>
>> I am not sure why would one want to use more than the practical limits
>> of 1/8th used by x86 - this causes additional burden of address space
>> fragmentation.
>>
>> So we need to balance between the randomness increase and the address
>> space fragmentation.
>>
> +config ARCH_MMAP_RND_BITS_MAX
> +   # On 64-bit up to 32T of address space (2^45)
> +   default 27 if 64BIT && PPC_256K_PAGES   # 256K (2^18), = 45 - 18 = 27
> +   default 29 if 64BIT && PPC_64K_PAGES    # 64K  (2^16), = 45 - 16 = 29
> +   default 31 if 64BIT && PPC_16K_PAGES    # 16K  (2^14), = 45 - 14 = 31
> +   default 33 if 64BIT                     # 4K   (2^12), = 45 - 12 = 33
> +   default ARCH_MMAP_RND_COMPAT_BITS_MAX
>>>
>>> I played with my values a bit and allowing 32T is a little bit nuts. It
>>> means you can actually end up with the
