[PATCH 04/86] powerpc/pci: use uapi/linux/pci_ids.h directly

2015-03-29 Thread Michael S. Tsirkin
Header moved from linux/pci_ids.h to uapi/linux/pci_ids.h,
use the new header directly so we can drop
the wrapper in include/linux/pci_ids.h.

Signed-off-by: Michael S. Tsirkin 
---
 arch/powerpc/platforms/embedded6xx/mpc10x.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/mpc10x.h 
b/arch/powerpc/platforms/embedded6xx/mpc10x.h
index b290b63..4175800 100644
--- a/arch/powerpc/platforms/embedded6xx/mpc10x.h
+++ b/arch/powerpc/platforms/embedded6xx/mpc10x.h
@@ -13,7 +13,7 @@
 #ifndef __PPC_KERNEL_MPC10X_H
 #define __PPC_KERNEL_MPC10X_H
 
-#include 
+#include 
 #include 
 
 /*
@@ -35,7 +35,7 @@
 
 /*
  * Define the vendor/device IDs for the various bridges--should be added to
- * 
+ * 
  */
 #defineMPC10X_BRIDGE_106   ((PCI_DEVICE_ID_MOTOROLA_MPC106 << 16) 
|  \
  PCI_VENDOR_ID_MOTOROLA)
-- 
MST

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 33/86] edac/pasemi: use uapi/linux/pci_ids.h directly

2015-03-29 Thread Michael S. Tsirkin
Header moved from linux/pci_ids.h to uapi/linux/pci_ids.h,
use the new header directly so we can drop
the wrapper in include/linux/pci_ids.h.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/edac/pasemi_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/pasemi_edac.c b/drivers/edac/pasemi_edac.c
index 9c971b5..cd0da08 100644
--- a/drivers/edac/pasemi_edac.c
+++ b/drivers/edac/pasemi_edac.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include "edac_core.h"
 
-- 
MST

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 43/86] macintosh: use uapi/linux/pci_ids.h directly

2015-03-29 Thread Michael S. Tsirkin
Header moved from linux/pci_ids.h to uapi/linux/pci_ids.h,
use the new header directly so we can drop
the wrapper in include/linux/pci_ids.h.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/macintosh/macio_asic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/macio_asic.c b/drivers/macintosh/macio_asic.c
index 4f12c6f..ccf9547 100644
--- a/drivers/macintosh/macio_asic.c
+++ b/drivers/macintosh/macio_asic.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
-- 
MST

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 51/86] cxl: use uapi/linux/pci_ids.h directly

2015-03-29 Thread Michael S. Tsirkin
Header moved from linux/pci_ids.h to uapi/linux/pci_ids.h,
use the new header directly so we can drop
the wrapper in include/linux/pci_ids.h.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/misc/cxl/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 1ef0164..8f2a6c8 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -8,7 +8,7 @@
  */
 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
-- 
MST

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [V4, 1/4] KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer

2015-03-29 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> On Thu, 2015-26-03 at 13:13:39 UTC, "Aneesh Kumar K.V" wrote:
>> pte can get updated from other CPUs as part of multiple activities
>> like THP split, huge page collapse, unmap. We need to make sure we
>> don't reload the pte value again and again for different checks.
>> 
>> ---
>>  arch/powerpc/include/asm/kvm_book3s_64.h |  5 -
>>  arch/powerpc/kvm/e500_mmu_host.c | 20 
>>  2 files changed, 16 insertions(+), 9 deletions(-)
>
> So this series is partly KVM but mostly powerpc.
>
> I assume you can't split it into two separate series easily?

Yes, will do. But before that I was looking for feedback from Ben or
Paul.

>
> You haven't sent it to the KVM lists or to Alex AFAICS. You'll need to do that
> for the KVM pieces at least. We can probably take it all via powerpc, but I'll
> need an ACK from Alex at least.
>

If I can get an ack from Ben, that makes it easy for the kvm list.

-aneesh

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

FW: Finally Working RadeonHD 6570 2Gb on Quad G5

2015-03-29 Thread luigi burdo


just need to share to developer the success we face today
the Tursk is running in full 2D and 3D on Linux PPC (lubuntu 14.04.2 kernel 
4.0.0-rc3) on Quad G5 
Need to say a big thankyou to Christian (Xeno74)  Zigotzky

I share a  video for the community 
https://www.youtube.com/watch?v=MYgFfRLYhLU&list=UUkb4bw4N19d-x_tn2FXLojQ



Great Job to all the devs!

Luigi

  ___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [V4,1/4] KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer

2015-03-29 Thread Benjamin Herrenschmidt
On Sun, 2015-03-29 at 22:16 +0530, Aneesh Kumar K.V wrote:
> Michael Ellerman  writes:
> 
> > On Thu, 2015-26-03 at 13:13:39 UTC, "Aneesh Kumar K.V" wrote:
> >> pte can get updated from other CPUs as part of multiple activities
> >> like THP split, huge page collapse, unmap. We need to make sure we
> >> don't reload the pte value again and again for different checks.
> >> 
> >> ---
> >>  arch/powerpc/include/asm/kvm_book3s_64.h |  5 -
> >>  arch/powerpc/kvm/e500_mmu_host.c | 20 
> >>  2 files changed, 16 insertions(+), 9 deletions(-)
> >
> > So this series is partly KVM but mostly powerpc.
> >
> > I assume you can't split it into two separate series easily?
> 
> Yes, will do. But before that I was looking for feedback from Ben or
> Paul.
> 
> >
> > You haven't sent it to the KVM lists or to Alex AFAICS. You'll need to do 
> > that
> > for the KVM pieces at least. We can probably take it all via powerpc, but 
> > I'll
> > need an ACK from Alex at least.
> >
> 
> If I can get an ack from Ben, that makes it easy for the kvm list.

Ack. Using ACCESS_ONCE in a lockless access of the PTE that does
multiple checks makes sense.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 3/3] selftests/powerpc: Add transactional syscall test

2015-03-29 Thread Sam Bobroff
On 24/03/15 13:02, Michael Ellerman wrote:
> On Tue, 2015-03-24 at 12:52 +1100, Sam Bobroff wrote:
>> On 20/03/15 20:25, Anshuman Khandual wrote:
>>> On 03/19/2015 10:13 AM, Sam Bobroff wrote:
 Check that a syscall made during an active transaction will fail with
 the correct failure code and that one made during a suspended
 transaction will succeed.

 Signed-off-by: Sam Bobroff 
>>>
>>> The test works.
>>
>> Great :-)
>>
 +
 +int tm_syscall(void)
 +{
 +  SKIP_IF(!((long)get_auxv_entry(AT_HWCAP2) & PPC_FEATURE2_HTM));
 +  setbuf(stdout, 0);
 +  FAIL_IF(!t_active_getppid_test());
 +  printf("%d active transactions correctly aborted.\n", TM_TEST_RUNS);
 +  FAIL_IF(!t_suspended_getppid_test());
 +  printf("%d suspended transactions succeeded.\n", TM_TEST_RUNS);
 +  return 0;
 +}
 +
 +int main(void)
 +{
 +  return test_harness(tm_syscall, "tm_syscall");
 +}
 +
>>>
>>> There is an extra blank line at the end of this file. Interchanging return
>>> codes of 0 and 1 for various functions make it very confusing along with
>>> negative FAIL_IF checks in the primary test function. Control flow 
>>> structures
>>> like these can use some in-code documentation for readability.
>>>
>>> +   for (i = 0; i < TM_RETRIES; i++) {
>>> +   if (__builtin_tbegin(0)) {
>>> +   getppid();
>>> +   __builtin_tend(0);
>>> +   return 1;
>>> +   }
>>> +   if (t_failure_persistent())
>>> +   return 0;
>>>
>>> or
>>>
>>> +   if (__builtin_tbegin(0)) {
>>> +   __builtin_tsuspend();
>>> +   getppid();
>>> +   __builtin_tresume();
>>> +   __builtin_tend(0);
>>> +   return 1;
>>> +   }
>>> +   if (t_failure_persistent())
>>> +   return 0;
>>>
>>
>> Good points. I'll remove the blank line and comment the code.
>>
>> I'm not sure I can do any better with the FAIL_IF() macro: I wanted it
>> to read "fail if the test failed", but I can see what you mean about a
>> double negative. Maybe it would be better to introduce a different
>> macro, more like a standard assert: TEST(XXX) which fails if XXX is
>> false. However, I think "TEST" would be too generic a name and I'm not
>> should what would be better. Any comments/suggestions?
> 
> FAIL_IF() is designed for things that return 0 for OK and !0 for failure. Like
> most things in C.
> 
> So I think it would be improved if you inverted your return codes in your test
> routines.
> 
> Even better to return ESOMETHING in the error cases, and zero otherwise.
> 
> cheers

Fair enough. I think the *_test() functions I added for "clarity" were
just making it more confusing, so I've dropped them.

Moving the code around, even a little, has also exposed the fact that
transactions are very sensitive to how the code is compiled so I'm going
to move the transaction blocks out into a separate assembly file where I
can control exactly what instructions get used. This will also mean that
it's no longer dependent on using linker magic (or some other trick) to
avoid lazy symbol loading.

I'll repost the series.

Thanks for the review!

Cheers,
Sam.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3] powerpc: Use PFN_PHYS() to avoid truncating the physical address

2015-03-29 Thread Michael Ellerman
On Fri, 2015-03-27 at 13:07 -0500, Scott Wood wrote:
> On Fri, 2015-03-27 at 10:45 +1100, Michael Ellerman wrote:
> > On Thu, 2015-03-26 at 10:31 -0500, Emil Medve wrote:
> > > Hello Kumar,
> > > 
> > > 
> > > On 03/26/2015 10:18 AM, Kumar Gala wrote:
> > > > Why no commit message with what issue this change was trying to fix?
> > > 
> > > A while back, when I attempted to remove bootmem (in favor of just plain
> > > memblock as in powerpc land bootmem was just a wrapper to memblock
> > > anyway) I run at some point into a problem with an intermediate address
> > > value because of this '<< PAGE_SHIFT' on the wrong width variable. Using
> > > PFN_PHYS() took care of it (it has a cast) so I decided to get this
> > > defensive patch applied. Since, I dropped my bootmem/memblock patches in
> > > favor to Anton's (Blanchard) work so my concrete issue example is
> > > somewhat gone
> > 
> > I'm not a big fan of it unless it's actually fixing an issue. It's a lot of
> > churn and the end result is less readable IMHO.
> 
> It is fixing an issue -- the issue is that there are overflow errors in
> the code.  Some of the places Emil fixed are only for platforms that
> don't have physical addresses larger than pointers, or have the needed
> casts, or are known to be dealing with lowmem, but others aren't.  E.g.
> page_is_ram() and devmem_is_allowed() are buggy on 32-bit with 64-bit
> physical.

Right. So obviously I'm fine with all the cases where it fixes an actual bug.
 
> flush_dcache_icache_page() is buggy on mpc86xx with more than 4 GiB RAM
> -- though that would still be buggy even with this change, due to
> __flush_dcache_icache_phys taking unsigned long.  The entire concept of
> that function doesn't work for sizeof(phys_addr_t) > sizeof(void *), so
> in this case 86xx should be using the booke code instead.
> 
> Even in the places where overflow can't happen due to the above
> circumstances (other than having the needed cast), it's setting a bad
> example that can be copied to places where it will break, or the
> circumstances of the code could change (e.g. currently 64-bit-only code
> being used on 32-bit).

I agree with that in principle, but it looks like in a bunch of places this
patch ends up assigning the result to unsigned long anyway.

So those cases are just churn IMHO. The end result doesn't work with 64-bit
phys if the code is ever used there, even though it looks like maybe it should,
and it's also not setting a good example for other code. Those cases should
either be left alone, or fixed properly to use phys_addr_t.

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/3] powerpc/powernv: convert codes returned by OPAL calls

2015-03-29 Thread Michael Ellerman
On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
> OPAL has its own list of return codes. The patch provides a translation
> of such codes in errnos for the opal_sensor_read call, and possibly 
> others if needed.
> 
> Index: linux.git/arch/powerpc/platforms/powernv/opal.c
> ===
> --- linux.git.orig/arch/powerpc/platforms/powernv/opal.c
> +++ linux.git/arch/powerpc/platforms/powernv/opal.c
> @@ -894,6 +894,23 @@ void opal_free_sg_list(struct opal_sg_li
>   }
>  }
>  
> +int opal_error_code(int rc)
> +{
> + switch (rc) {
> + case OPAL_SUCCESS:  return 0;

Obviously correct.

> + case OPAL_PARAMETER:return -EINVAL;

Yep.

> + case OPAL_UNSUPPORTED:  return -ENOSYS;

You shouldn't use ENOSYS here, that should only ever mean "no such syscall",
otherwise you get very confusing results like read() returning ENOSYS.

> + case OPAL_ASYNC_COMPLETION: return -EAGAIN;

EAGAIN means "try what you did again", I don't think that's what
ASYNC_COMPLETION means, is it? It looks like it means, "don't try again, but
you need to wait for the result to be ready".

I'm not sure it maps well to any of the Linux codes, maybe EINPROGRESS ?

> + case OPAL_BUSY_EVENT:   return -EBUSY;

Yep.

> + case OPAL_NO_MEM:   return -ENOMEM;

Yep.

> + case OPAL_HARDWARE: return -ENOENT;

This is another one which I think you shouldn't use as it can lead to confusing
results at user level. eg:

  $ cat /sysfs/some/file
  Error: No such file or directory

Huh?

Looking at the skiboot code this looks like EIO is a good match.

> + case OPAL_INTERNAL_ERROR:   return -EIO;

Yeah as good as anything I guess.

> + default:
> + pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
> + return -ERANGE;

I'm not sure about this one honestly, it means "Math result not representable".

I suspect the reason RTAS chose it was just that it's not EINVAL.

This should probably also just be EIO.


cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 3/3] powerpc/powernv: remove opal_sensor_mutex

2015-03-29 Thread Michael Ellerman
On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
> The opal sensor mutex protects the opal_sensor_read call which
> can return a OPAL_BUSY code on IBM Power systems if a previous 
> request is in progress.
> 
> This can be handled at user level with a retry.

It can, but how does it actually look in practice?

It looks like the only use of opal_get_sensor_data() is show_sensor() in
drivers/hwmon/ibmpowernv.c.

Because that's a sysfs attribute folks will be generally just dumping that with
cat, or reading it in a shell script, neither of which will cope nicely with
EBUSY I think?

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] kernel/reboot.c: Add orderly_reboot for graceful reboot

2015-03-29 Thread Joel Stanley
The kernel has orderly_poweroff which allows the kernel to initiate a
graceful shutdown of userspace, by running /sbin/poweroff. This adds
orderly_reboot that will cause userspace to shut itself down by calling
/sbin/reboot.

This will be used for shutdown initiated by a system controller on
platforms that do not use ACPI.

Signed-off-by: Joel Stanley 
---
 include/linux/reboot.h |  1 +
 kernel/reboot.c| 51 +++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/reboot.h b/include/linux/reboot.h
index 48bf152..a4ffcd9 100644
--- a/include/linux/reboot.h
+++ b/include/linux/reboot.h
@@ -68,6 +68,7 @@ void ctrl_alt_del(void);
 extern char poweroff_cmd[POWEROFF_CMD_PATH_LEN];
 
 extern int orderly_poweroff(bool force);
+extern int orderly_reboot(void);
 
 /*
  * Emergency restart, callable from an interrupt handler.
diff --git a/kernel/reboot.c b/kernel/reboot.c
index a3a9e24..d0aa1ec 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -306,8 +306,9 @@ void ctrl_alt_del(void)
 }
 
 char poweroff_cmd[POWEROFF_CMD_PATH_LEN] = "/sbin/poweroff";
+char reboot_cmd[POWEROFF_CMD_PATH_LEN] = "/sbin/reboot";
 
-static int __orderly_poweroff(bool force)
+static int run_cmd(const char *cmd)
 {
char **argv;
static char *envp[] = {
@@ -316,8 +317,7 @@ static int __orderly_poweroff(bool force)
NULL
};
int ret;
-
-   argv = argv_split(GFP_KERNEL, poweroff_cmd, NULL);
+   argv = argv_split(GFP_KERNEL, cmd, NULL);
if (argv) {
ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
argv_free(argv);
@@ -325,8 +325,33 @@ static int __orderly_poweroff(bool force)
ret = -ENOMEM;
}
 
+   return ret;
+}
+
+static int __orderly_reboot(void)
+{
+   int ret;
+
+   ret = run_cmd(reboot_cmd);
+
+   if (ret) {
+   pr_warn("Failed to start orderly reboot: forcing the issue\n");
+   emergency_sync();
+   kernel_restart(NULL);
+   }
+
+   return ret;
+}
+
+static int __orderly_poweroff(bool force)
+{
+   int ret;
+
+   ret = run_cmd(reboot_cmd);
+
if (ret && force) {
pr_warn("Failed to start orderly shutdown: forcing the 
issue\n");
+
/*
 * I guess this should try to kick off some daemon to sync and
 * poweroff asap.  Or not even bother syncing if we're doing an
@@ -364,6 +389,26 @@ int orderly_poweroff(bool force)
 }
 EXPORT_SYMBOL_GPL(orderly_poweroff);
 
+static void reboot_work_func(struct work_struct *work)
+{
+   __orderly_reboot();
+}
+
+static DECLARE_WORK(reboot_work, reboot_work_func);
+
+/**
+ * orderly_reboot - Trigger an orderly system reboot
+ *
+ * This may be called from any context to trigger a system reboot.
+ * If the orderly reboot fails, it will force an immediate reboot.
+ */
+int orderly_reboot(void)
+{
+   schedule_work(&reboot_work);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(orderly_reboot);
+
 static int __init reboot_setup(char *str)
 {
for (;;) {
-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] powerpc/powernv: Reboot when requested by firmware

2015-03-29 Thread Joel Stanley
Use orderly_reboot so userspace will to shut itself down via the reboot
path. This is required for graceful reboot initiated by the BMC, such
as when a user uses ipmitool to issue a 'chassis power cycle' command.

Signed-off-by: Joel Stanley 
---
 arch/powerpc/platforms/powernv/opal-power.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-power.c 
b/arch/powerpc/platforms/powernv/opal-power.c
index 48bf5b0..ac46c2c 100644
--- a/arch/powerpc/platforms/powernv/opal-power.c
+++ b/arch/powerpc/platforms/powernv/opal-power.c
@@ -29,8 +29,9 @@ static int opal_power_control_event(struct notifier_block *nb,
 
switch (type) {
case SOFT_REBOOT:
-   /* Fall through. The service processor is responsible for
-* bringing the machine back up */
+   pr_info("OPAL: reboot requested\n");
+   orderly_reboot();
+   break;
case SOFT_OFF:
pr_info("OPAL: poweroff requested\n");
orderly_poweroff(true);
-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc: Replace mem_init_done with slab_is_available()

2015-03-29 Thread Michael Ellerman
We have a powerpc specific global called mem_init_done which is "set on
boot once kmalloc can be called".

But that's not *quite* true. We set it at the bottom of mem_init(), and
rely on the fact that mm_init() calls kmem_cache_init() immediately
after that, and nothing is running in parallel.

So replace it with the generic and 100% correct slab_is_available().

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/setup.h | 1 -
 arch/powerpc/kernel/pci-common.c | 2 +-
 arch/powerpc/kernel/rtas.c   | 4 ++--
 arch/powerpc/lib/alloc.c | 2 +-
 arch/powerpc/mm/mem.c| 3 ---
 arch/powerpc/mm/pgtable_32.c | 9 -
 arch/powerpc/mm/pgtable_64.c | 4 ++--
 arch/powerpc/platforms/cell/celleb_pci.c | 2 +-
 8 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index fbdf18cf954c..e9d384cbd021 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -7,7 +7,6 @@
 extern void ppc_printk_progress(char *s, unsigned short hex);
 
 extern unsigned int rtas_data;
-extern int mem_init_done;  /* set on boot once kmalloc can be called */
 extern unsigned long long memory_limit;
 extern unsigned long klimit;
 extern void *zalloc_maybe_bootmem(size_t size, gfp_t mask);
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 2a525c938158..bcf618bfff1e 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -76,7 +76,7 @@ struct pci_controller *pcibios_alloc_controller(struct 
device_node *dev)
list_add_tail(&phb->list_node, &hose_list);
spin_unlock(&hose_spinlock);
phb->dn = dev;
-   phb->is_dynamic = mem_init_done;
+   phb->is_dynamic = slab_is_available();
 #ifdef CONFIG_PPC64
if (dev) {
int nid = of_node_to_nid(dev);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 21c45a2d0706..fffdef2962eb 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -401,7 +401,7 @@ static char *__fetch_rtas_last_error(char *altbuf)
buf = altbuf;
} else {
buf = rtas_err_buf;
-   if (mem_init_done)
+   if (slab_is_available())
buf = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
}
if (buf)
@@ -461,7 +461,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
 
if (buff_copy) {
log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
-   if (mem_init_done)
+   if (slab_is_available())
kfree(buff_copy);
}
return ret;
diff --git a/arch/powerpc/lib/alloc.c b/arch/powerpc/lib/alloc.c
index 4a6c2cf890d9..60b0b3fc8fc1 100644
--- a/arch/powerpc/lib/alloc.c
+++ b/arch/powerpc/lib/alloc.c
@@ -10,7 +10,7 @@ void * __init_refok zalloc_maybe_bootmem(size_t size, gfp_t 
mask)
 {
void *p;
 
-   if (mem_init_done)
+   if (slab_is_available())
p = kzalloc(size, mask);
else {
p = memblock_virt_alloc(size, 0);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index b7285a5870f8..45fda71feb27 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -61,7 +61,6 @@
 #define CPU_FTR_NOEXECUTE  0
 #endif
 
-int mem_init_done;
 unsigned long long memory_limit;
 
 #ifdef CONFIG_HIGHMEM
@@ -377,8 +376,6 @@ void __init mem_init(void)
pr_info("  * 0x%08lx..0x%08lx  : vmalloc & ioremap\n",
VMALLOC_START, VMALLOC_END);
 #endif /* CONFIG_PPC32 */
-
-   mem_init_done = 1;
 }
 
 void free_initmem(void)
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 03b1a3b0fbd5..57b41f8b7053 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -110,9 +110,8 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 __init_refok pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long 
address)
 {
pte_t *pte;
-   extern int mem_init_done;
 
-   if (mem_init_done) {
+   if (slab_is_available()) {
pte = (pte_t 
*)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
} else {
pte = __va(memblock_alloc(PAGE_SIZE, PAGE_SIZE));
@@ -219,7 +218,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
unsigned long flags,
 * Don't allow anybody to remap normal RAM that we're using.
 * mem_init() sets high_memory so only do the check after that.
 */
-   if (mem_init_done && (p < virt_to_phys(high_memory)) &&
+   if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
!(__allow_ioremap_reserved && memblock_is_region_reserved(p, 
size))) {
printk("__ioremap(): phys addr 0x%llx is RAM lr %pf\n",
  

[PATCH v2 3/4] selftests/powerpc: Add transactional syscall test

2015-03-29 Thread Sam Bobroff
Check that a syscall made during an active transaction will fail with
the correct failure code and that one made during a suspended
transaction will succeed.

Signed-off-by: Sam Bobroff 
---
v2:

Further testing has shown that the success or failure of the transactions was
affected by minor changes to the code, compiler optimisation and linker
settings.

To address this, I've moved the transactional part of the test to a separate
function, written in assembly. I've also extended the test to as many
transactions as it can fit into ten seconds, to better test for failures that
occur more rarely. This has stabilised the results, and it's no longer
necessary to use special compiler or linker flags.

 tools/testing/selftests/powerpc/tm/.gitignore  |1 +
 tools/testing/selftests/powerpc/tm/Makefile|4 +-
 .../testing/selftests/powerpc/tm/tm-syscall-asm.S  |   27 +
 .../testing/selftests/powerpc/tm/tm-syscall-asm.h  |2 +
 tools/testing/selftests/powerpc/tm/tm-syscall.c|  109 
 5 files changed, 142 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall-asm.h
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall.c

diff --git a/tools/testing/selftests/powerpc/tm/.gitignore 
b/tools/testing/selftests/powerpc/tm/.gitignore
index 33d02cc..2699635d 100644
--- a/tools/testing/selftests/powerpc/tm/.gitignore
+++ b/tools/testing/selftests/powerpc/tm/.gitignore
@@ -1 +1,2 @@
 tm-resched-dscr
+tm-syscall
diff --git a/tools/testing/selftests/powerpc/tm/Makefile 
b/tools/testing/selftests/powerpc/tm/Makefile
index 2cede23..93bbff3 100644
--- a/tools/testing/selftests/powerpc/tm/Makefile
+++ b/tools/testing/selftests/powerpc/tm/Makefile
@@ -1,8 +1,10 @@
-PROGS := tm-resched-dscr
+PROGS := tm-resched-dscr tm-syscall
+CFLAGS:=$(CFLAGS) -mhtm
 
 all: $(PROGS)
 
 $(PROGS): ../harness.c
+tm-syscall: tm-syscall-asm.S
 
 run_tests: all
@-for PROG in $(PROGS); do \
diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S 
b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
new file mode 100644
index 000..2b2daa7
--- /dev/null
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
@@ -0,0 +1,27 @@
+#include 
+#include 
+
+   .text
+FUNC_START(getppid_tm_active_impl)
+   tbegin.
+   beq 1f
+   li  r0, __NR_getppid
+   sc
+   tend.
+   blr
+1:
+   li  r3, -1
+   blr
+
+FUNC_START(getppid_tm_suspended_impl)
+   tbegin.
+   beq 1f
+   li  r0, __NR_getppid
+   tsuspend.
+   sc
+   tresume.
+   tend.
+   blr
+1:
+   li  r3, -1
+   blr
diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.h 
b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.h
new file mode 100644
index 000..6136328
--- /dev/null
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.h
@@ -0,0 +1,2 @@
+extern int getppid_tm_active_impl(void);
+extern int getppid_tm_suspended_impl(void);
diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall.c 
b/tools/testing/selftests/powerpc/tm/tm-syscall.c
new file mode 100644
index 000..ff3b15c
--- /dev/null
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall.c
@@ -0,0 +1,109 @@
+/* Test the kernel's system call code to ensure that a system call
+ * made from within an active HTM transaction is aborted with the
+ * correct failure code.
+ * Conversely, ensure that a system call made from within a
+ * suspended transaction can succeed.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tm-syscall-asm.h"
+
+unsigned retries = 0;
+
+#define TEST_DURATION 10 /* seconds */
+#define TM_RETRIES 100
+
+long failure_code(void)
+{
+   return __builtin_get_texasr() >> 56;
+}
+
+bool failure_is_persistent(void)
+{
+   return (failure_code() & TM_CAUSE_PERSISTENT) == TM_CAUSE_PERSISTENT;
+}
+
+bool failure_is_syscall(void)
+{
+   return (failure_code() & TM_CAUSE_SYSCALL) == TM_CAUSE_SYSCALL;
+}
+
+pid_t getppid_tm(bool suspend)
+{
+   int i;
+   pid_t pid;
+
+   for (i = 0; i < TM_RETRIES; i++) {
+   if (suspend)
+   pid = getppid_tm_suspended_impl();
+   else
+   pid = getppid_tm_active_impl();
+   if (pid >= 0)
+   return pid;
+   if (failure_is_persistent()) {
+   if (failure_is_syscall())
+   return -1;
+   printf("Unexpected persistent transaction failure.\n");
+   printf("TEXASR 0x%016lx, TFIAR 0x%016lx.\n",
+  __builtin_get_texasr(), __builtin_get_tfiar());
+   exit(-1);
+   }
+   retries++;
+   }
+   printf("Exceeded limit of %d temporary transaction failures.\n",

[PATCH v2 1/4] powerpc/tm: Abort syscalls in active transactions

2015-03-29 Thread Sam Bobroff
This patch changes the syscall handler to doom (tabort) active
transactions when a syscall is made and return immediately without
performing the syscall.

Currently, the system call instruction automatically suspends an
active transaction which causes side effects to persist when an active
transaction fails.

This does change the kernel's behaviour, but in a way that was
documented as unsupported. It doesn't reduce functionality because
syscalls will still be performed after tsuspend. It also provides a
consistent interface and makes the behaviour of user code
substantially the same across powerpc and platforms that do not
support suspended transactions (e.g. x86 and s390).

Performance measurements using
http://ozlabs.org/~anton/junkcode/null_syscall.c
indicate the cost of a system call increases by about 0.5%.

Signed-off-by: Sam Bobroff 
---
v2:

Also update the failure code table.

 Documentation/powerpc/transactional_memory.txt |   32 
 arch/powerpc/include/uapi/asm/tm.h |2 +-
 arch/powerpc/kernel/entry_64.S |   19 ++
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/Documentation/powerpc/transactional_memory.txt 
b/Documentation/powerpc/transactional_memory.txt
index 9791e98..98b39af 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.txt
@@ -74,22 +74,23 @@ Causes of transaction aborts
 Syscalls
 
 
-Performing syscalls from within transaction is not recommended, and can lead
-to unpredictable results.
+Syscalls made from within an active transaction will not be performed and the
+transaction will be doomed by the kernel with the failure code TM_CAUSE_SYSCALL
+| TM_CAUSE_PERSISTENT.
 
-Syscalls do not by design abort transactions, but beware: The kernel code will
-not be running in transactional state.  The effect of syscalls will always
-remain visible, but depending on the call they may abort your transaction as a
-side-effect, read soon-to-be-aborted transactional data that should not remain
-invisible, etc.  If you constantly retry a transaction that constantly aborts
-itself by calling a syscall, you'll have a livelock & make no progress.
+Syscalls made from within a suspended transaction are performed as normal and
+the transaction is not explicitly doomed by the kernel.  However, what the
+kernel does to perform the syscall may result in the transaction being doomed
+by the hardware.  The syscall is performed in suspended mode so any side
+effects will be persistent, independent of transaction success or failure.  No
+guarantees are provided by the kernel about which syscalls will affect
+transaction success.
 
-Simple syscalls (e.g. sigprocmask()) "could" be OK.  Even things like write()
-from, say, printf() should be OK as long as the kernel does not access any
-memory that was accessed transactionally.
-
-Consider any syscalls that happen to work as debug-only -- not recommended for
-production use.  Best to queue them up till after the transaction is over.
+Care must be taken when relying on syscalls to abort during active transactions
+if the calls are made via a library.  Libraries may cache values (which may
+give the appearance of success) or perform operations that cause transaction
+failure before entering the kernel (which may produce different failure codes).
+Examples are glibc's getpid() and lazy symbol resolution.
 
 
 Signals
@@ -176,8 +177,7 @@ kernel aborted a transaction:
  TM_CAUSE_RESCHED   Thread was rescheduled.
  TM_CAUSE_TLBI  Software TLB invalide.
  TM_CAUSE_FAC_UNAV  FP/VEC/VSX unavailable trap.
- TM_CAUSE_SYSCALL   Currently unused; future syscalls that must abort
-transactions for consistency will use this.
+ TM_CAUSE_SYSCALL   Syscall from active transaction.
  TM_CAUSE_SIGNALSignal delivered.
  TM_CAUSE_MISC  Currently unused.
  TM_CAUSE_ALIGNMENT Alignment fault.
diff --git a/arch/powerpc/include/uapi/asm/tm.h 
b/arch/powerpc/include/uapi/asm/tm.h
index 5d836b7..5047659 100644
--- a/arch/powerpc/include/uapi/asm/tm.h
+++ b/arch/powerpc/include/uapi/asm/tm.h
@@ -11,7 +11,7 @@
 #define TM_CAUSE_RESCHED   0xde
 #define TM_CAUSE_TLBI  0xdc
 #define TM_CAUSE_FAC_UNAV  0xda
-#define TM_CAUSE_SYSCALL   0xd8  /* future use */
+#define TM_CAUSE_SYSCALL   0xd8
 #define TM_CAUSE_MISC  0xd6  /* future use */
 #define TM_CAUSE_SIGNAL0xd4
 #define TM_CAUSE_ALIGNMENT 0xd2
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d180caf2..85bf81d 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * System calls.
@@ -145,6 +146,24 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne syscall_dotrace
 .Lsyscall_dotrace_cont:
+#ifdef CONF

[PATCH v2 2/4] selftests/powerpc: Move get_auxv_entry() to harness.c

2015-03-29 Thread Sam Bobroff
Move get_auxv_entry() from pmu/lib.c up to harness.c in order to make
it available to other tests.

Signed-off-by: Sam Bobroff 
---
 tools/testing/selftests/powerpc/harness.c |   47 +
 tools/testing/selftests/powerpc/pmu/lib.c |   47 -
 tools/testing/selftests/powerpc/pmu/lib.h |1 -
 tools/testing/selftests/powerpc/utils.h   |2 +-
 4 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/tools/testing/selftests/powerpc/harness.c 
b/tools/testing/selftests/powerpc/harness.c
index 8ebc58a..f7997af 100644
--- a/tools/testing/selftests/powerpc/harness.c
+++ b/tools/testing/selftests/powerpc/harness.c
@@ -11,6 +11,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include "subunit.h"
 #include "utils.h"
@@ -112,3 +116,46 @@ int test_harness(int (test_function)(void), char *name)
 
return rc;
 }
+
+static char auxv[4096];
+
+void *get_auxv_entry(int type)
+{
+   ElfW(auxv_t) *p;
+   void *result;
+   ssize_t num;
+   int fd;
+
+   fd = open("/proc/self/auxv", O_RDONLY);
+   if (fd == -1) {
+   perror("open");
+   return NULL;
+   }
+
+   result = NULL;
+
+   num = read(fd, auxv, sizeof(auxv));
+   if (num < 0) {
+   perror("read");
+   goto out;
+   }
+
+   if (num > sizeof(auxv)) {
+   printf("Overflowed auxv buffer\n");
+   goto out;
+   }
+
+   p = (ElfW(auxv_t) *)auxv;
+
+   while (p->a_type != AT_NULL) {
+   if (p->a_type == type) {
+   result = (void *)p->a_un.a_val;
+   break;
+   }
+
+   p++;
+   }
+out:
+   close(fd);
+   return result;
+}
diff --git a/tools/testing/selftests/powerpc/pmu/lib.c 
b/tools/testing/selftests/powerpc/pmu/lib.c
index 9768dea..a07104c 100644
--- a/tools/testing/selftests/powerpc/pmu/lib.c
+++ b/tools/testing/selftests/powerpc/pmu/lib.c
@@ -5,15 +5,10 @@
 
 #define _GNU_SOURCE/* For CPU_ZERO etc. */
 
-#include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 
 #include "utils.h"
@@ -256,45 +251,3 @@ out:
return rc;
 }
 
-static char auxv[4096];
-
-void *get_auxv_entry(int type)
-{
-   ElfW(auxv_t) *p;
-   void *result;
-   ssize_t num;
-   int fd;
-
-   fd = open("/proc/self/auxv", O_RDONLY);
-   if (fd == -1) {
-   perror("open");
-   return NULL;
-   }
-
-   result = NULL;
-
-   num = read(fd, auxv, sizeof(auxv));
-   if (num < 0) {
-   perror("read");
-   goto out;
-   }
-
-   if (num > sizeof(auxv)) {
-   printf("Overflowed auxv buffer\n");
-   goto out;
-   }
-
-   p = (ElfW(auxv_t) *)auxv;
-
-   while (p->a_type != AT_NULL) {
-   if (p->a_type == type) {
-   result = (void *)p->a_un.a_val;
-   break;
-   }
-
-   p++;
-   }
-out:
-   close(fd);
-   return result;
-}
diff --git a/tools/testing/selftests/powerpc/pmu/lib.h 
b/tools/testing/selftests/powerpc/pmu/lib.h
index 0f0339c..ca5d72a 100644
--- a/tools/testing/selftests/powerpc/pmu/lib.h
+++ b/tools/testing/selftests/powerpc/pmu/lib.h
@@ -29,7 +29,6 @@ extern int notify_parent(union pipe write_pipe);
 extern int notify_parent_of_error(union pipe write_pipe);
 extern pid_t eat_cpu(int (test_function)(void));
 extern bool require_paranoia_below(int level);
-extern void *get_auxv_entry(int type);
 
 struct addr_range {
uint64_t first, last;
diff --git a/tools/testing/selftests/powerpc/utils.h 
b/tools/testing/selftests/powerpc/utils.h
index a93777a..64f53cd 100644
--- a/tools/testing/selftests/powerpc/utils.h
+++ b/tools/testing/selftests/powerpc/utils.h
@@ -19,7 +19,7 @@ typedef uint8_t u8;
 
 
 int test_harness(int (test_function)(void), char *name);
-
+extern void *get_auxv_entry(int type);
 
 /* Yes, this is evil */
 #define FAIL_IF(x) \
-- 
1.7.10.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 4/4] powerpc/tm: Correct minor documentation typos

2015-03-29 Thread Sam Bobroff
Signed-off-by: Sam Bobroff 
---
v2:

Discovered some typos while updating the documentation.

 Documentation/powerpc/transactional_memory.txt |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/powerpc/transactional_memory.txt 
b/Documentation/powerpc/transactional_memory.txt
index 98b39af..ba0a2a4 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.txt
@@ -175,7 +175,7 @@ These are defined in , and distinguish different 
reasons why the
 kernel aborted a transaction:
 
  TM_CAUSE_RESCHED   Thread was rescheduled.
- TM_CAUSE_TLBI  Software TLB invalide.
+ TM_CAUSE_TLBI  Software TLB invalid.
  TM_CAUSE_FAC_UNAV  FP/VEC/VSX unavailable trap.
  TM_CAUSE_SYSCALL   Syscall from active transaction.
  TM_CAUSE_SIGNALSignal delivered.
@@ -185,7 +185,7 @@ kernel aborted a transaction:
 
 These can be checked by the user program's abort handler as TEXASR[0:7].  If
 bit 7 is set, it indicates that the error is consider persistent.  For example
-a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.q
+a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.
 
 GDB
 ===
-- 
1.7.10.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 0/4] powerpc/tm: Abort syscalls in active transactions

2015-03-29 Thread Sam Bobroff

See the first patch for a description of the reasoning behind this
change.

This set includes the change, a kernel selftest for it and
some slight refactoring of the selftest code.


v2:
Patch 1/4: powerpc/tm: Abort syscalls in active transactions

Also update the failure code table.

Patch 3/4: selftests/powerpc: Add transactional syscall test

Further testing has shown that the success or failure of the transactions was
affected by minor changes to the code, compiler optimisation and linker
settings.

To address this, I've moved the transactional part of the test to a separate
function, written in assembly. I've also extended the test to as many
transactions as it can fit into ten seconds, to better test for failures that
occur more rarely. This has stabilised the results, and it's no longer
necessary to use special compiler or linker flags.

Patch 4/4: powerpc/tm: Correct minor documentation typos

Discovered some typos while updating the documentation.


Sam Bobroff (4):
  powerpc/tm: Abort syscalls in active transactions
  selftests/powerpc: Move get_auxv_entry() to harness.c
  selftests/powerpc: Add transactional syscall test
  powerpc/tm: Correct minor documentation typos

 Documentation/powerpc/transactional_memory.txt |   36 +++
 arch/powerpc/include/uapi/asm/tm.h |2 +-
 arch/powerpc/kernel/entry_64.S |   19 
 tools/testing/selftests/powerpc/harness.c  |   47 +
 tools/testing/selftests/powerpc/pmu/lib.c  |   47 -
 tools/testing/selftests/powerpc/pmu/lib.h  |1 -
 tools/testing/selftests/powerpc/tm/.gitignore  |1 +
 tools/testing/selftests/powerpc/tm/Makefile|4 +-
 .../testing/selftests/powerpc/tm/tm-syscall-asm.S  |   27 +
 .../testing/selftests/powerpc/tm/tm-syscall-asm.h  |2 +
 tools/testing/selftests/powerpc/tm/tm-syscall.c|  109 
 tools/testing/selftests/powerpc/utils.h|2 +-
 12 files changed, 228 insertions(+), 69 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall-asm.h
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-syscall.c

-- 
1.7.10.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-29 Thread Benjamin Herrenschmidt
On Wed, 2015-03-25 at 13:34 -0400, Sowmini Varadhan wrote:
> Investigation of multithreaded iperf experiments on an ethernet
> interface show the iommu->lock as the hottest lock identified by
> lockstat, with something of the order of  21M contentions out of
> 27M acquisitions, and an average wait time of 26 us for the lock.
> This is not efficient. A more scalable design is to follow the ppc
> model, where the iommu_table has multiple pools, each stretching
> over a segment of the map, and with a separate lock for each pool.
> This model allows for better parallelization of the iommu map search.
> 
 .../...
> 
> diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
> new file mode 100644
> index 000..197111b
> --- /dev/null
> +++ b/include/linux/iommu-common.h
> @@ -0,0 +1,48 @@
> +#ifndef _LINUX_IOMMU_COMMON_H
> +#define _LINUX_IOMMU_COMMON_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define IOMMU_POOL_HASHBITS 4
> +#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)

I don't like those macros. You changed the value from what we had on
powerpc. It could be that the new values are as good for us but I'd
like to do a bit differently. Can you make the bits a variable ? Or at
least an arch-provided macro which we can change later if needed ?

> +struct iommu_pool {
> + unsigned long   start;
> + unsigned long   end;
> + unsigned long   hint;
> + spinlock_t  lock;
> +};
> +
> +struct iommu_table {

Let's make it clear that this is for allocation of DMA space only, it
would thus make my life easier when adapting powerpc to use a different
names, something like "struct iommu_area" works for me, or "iommu
alloc_region" .. whatever you fancy the most.

> + unsigned long   page_table_map_base;
> + unsigned long   page_table_shift;

Minor nit but I'm not fan of the naming here, maybe just use
entry_shift ? I don't like too long identifiers :) Same for
base, could just be "base" or "table_base".

> + unsigned long   nr_pools;
> + void(*flush_all)(struct iommu_table *);

Call this "lazy_flush", document that it is optional and when it is
called.

> + unsigned long   poolsize;
> + struct iommu_pool   arena_pool[IOMMU_NR_POOLS];

Why adding the 'arena' prefix ? What was wrong with "pools" in the
powerpc imlementation ?

> + u32 flags;
> +#define  IOMMU_HAS_LARGE_POOL0x0001
> +#define  IOMMU_NO_SPAN_BOUND 0x0002
> + struct iommu_pool   large_pool;
> + unsigned long   *map;
> +};
> +
> +extern void iommu_tbl_pool_init(struct iommu_table *iommu,
> + unsigned long num_entries,
> + u32 page_table_shift,
> + void (*flush_all)(struct iommu_table *),
> + bool large_pool, u32 npools,
> + bool skip_span_boundary_check);
> +
> +extern unsigned long iommu_tbl_range_alloc(struct device *dev,
> +struct iommu_table *iommu,
> +unsigned long npages,
> +unsigned long *handle);
> +
> +extern void iommu_tbl_range_free(struct iommu_table *iommu,
> +  u64 dma_addr, unsigned long npages,
> +  unsigned long entry);
> +
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 3c3b30b..0ea2ac6 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
>  obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
>  
>  obj-$(CONFIG_SWIOTLB) += swiotlb.o
> -obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
> +obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
>  obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
>  obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
>  obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
> diff --git a/lib/iommu-common.c b/lib/iommu-common.c
> new file mode 100644
> index 000..bb7e706
> --- /dev/null
> +++ b/lib/iommu-common.c
> @@ -0,0 +1,235 @@
> +/*
> + * IOMMU mmap management and range allocation functions.
> + * Based almost entirely upon the powerpc iommu allocator.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define IOMMU_LARGE_ALLOC15

Make that a variable, here too, the arch might want to tweak it.

I think 15 is actually not a good value for powerpc with 4k iommu pages
and 64k PAGE_SIZE, we should make the above some kind of factor of
PAGE_SHIFT - iommu_page_shift.

> +static   DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
> +
> +static void setup_iommu_pool_hash(void)
> +{
> + unsigned int i;
> + static bool do_once;
> +
> + if (do_once)
> + return;
> + do_once = true;
> + for_each_possible_cpu(i)

[PATCH 2/2] KVM: PPC: Remove page table walk helpers

2015-03-29 Thread Aneesh Kumar K.V
This patch remove helpers which we had used only once in the code.
Limiting page table walk variants help in ensuring that we won't
end up with code walking page table with wrong assumptions.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h  | 21 -
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 -
 arch/powerpc/kvm/e500_mmu_host.c|  2 +-
 3 files changed, 28 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 9835ac4173b7..92fe01c355a9 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -249,27 +249,6 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
 #endif
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 unsigned *shift);
-
-static inline pte_t *lookup_linux_ptep(pgd_t *pgdir, unsigned long hva,
-unsigned long *pte_sizep)
-{
-   pte_t *ptep;
-   unsigned long ps = *pte_sizep;
-   unsigned int shift;
-
-   ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
-   if (!ptep)
-   return NULL;
-   if (shift)
-   *pte_sizep = 1ul << shift;
-   else
-   *pte_sizep = PAGE_SIZE;
-
-   if (ps > *pte_sizep)
-   return NULL;
-
-   return ptep;
-}
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 625407e4d3b0..73e083cb9f7e 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -131,25 +131,6 @@ static void remove_revmap_chain(struct kvm *kvm, long 
pte_index,
unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte_and_update(pgd_t *pgdir, unsigned long hva,
- int writing, unsigned long *pte_sizep)
-{
-   pte_t *ptep;
-   unsigned long ps = *pte_sizep;
-   unsigned int hugepage_shift;
-
-   ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
-   if (!ptep)
-   return __pte(0);
-   if (hugepage_shift)
-   *pte_sizep = 1ul << hugepage_shift;
-   else
-   *pte_sizep = PAGE_SIZE;
-   if (ps > *pte_sizep)
-   return __pte(0);
-   return kvmppc_read_update_linux_pte(ptep, writing, hugepage_shift);
-}
-
 static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v)
 {
asm volatile(PPC_RELEASE_BARRIER "" : : : "memory");
@@ -166,10 +147,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
struct revmap_entry *rev;
unsigned long g_ptel;
struct kvm_memory_slot *memslot;
-   unsigned long pte_size;
+   unsigned hpage_shift;
unsigned long is_io;
unsigned long *rmap;
-   pte_t pte;
+   pte_t *ptep;
unsigned int writing;
unsigned long mmu_seq;
unsigned long rcbits;
@@ -208,22 +189,33 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
 
/* Translate to host virtual address */
hva = __gfn_to_hva_memslot(memslot, gfn);
+   ptep = find_linux_pte_or_hugepte(pgdir, hva, &hpage_shift);
+   if (ptep) {
+   pte_t pte;
+   unsigned int host_pte_size;
 
-   /* Look up the Linux PTE for the backing page */
-   pte_size = psize;
-   pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte) && !pte_protnone(pte)) {
-   if (writing && !pte_write(pte))
-   /* make the actual HPTE be read-only */
-   ptel = hpte_make_readonly(ptel);
-   is_io = hpte_cache_bits(pte_val(pte));
-   pa = pte_pfn(pte) << PAGE_SHIFT;
-   pa |= hva & (pte_size - 1);
-   pa |= gpa & ~PAGE_MASK;
-   }
+   if (hpage_shift)
+   host_pte_size = 1ul << hpage_shift;
+   else
+   host_pte_size = PAGE_SIZE;
+   /*
+* We should always find the guest page size
+* to <= host page size, if host is using hugepage
+*/
+   if (host_pte_size < psize)
+   return H_PARAMETER;
 
-   if (pte_size < psize)
-   return H_PARAMETER;
+   pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift);
+   if (pte_present(pte) && !pte_protnone(pte)) {
+   if (writing && !pte_write(pte))
+   /* make the actual HPTE be read-only */
+   ptel = hpte_make_readonly(ptel);
+   is_io = hpte_cache_bits(pte_val(pte));
+   pa = pte_pfn(pte) << PAGE_SHIFT;
+   pa |= hva & (host_pte_size - 1);
+   pa |= gpa & ~PA

[PATCH 1/2] KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer

2015-03-29 Thread Aneesh Kumar K.V
pte can get updated from other CPUs as part of multiple activities
like THP split, huge page collapse, unmap. We need to make sure we
don't reload the pte value again and again for different checks.

Signed-off-by: Aneesh Kumar K.V 
---
Note:
This is posted previously as part of
http://article.gmane.org/gmane.linux.ports.ppc.embedded/79278

 arch/powerpc/include/asm/kvm_book3s_64.h |  5 -
 arch/powerpc/kvm/e500_mmu_host.c | 20 
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index cc073a7ac2b7..f06820c67175 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -290,7 +290,10 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t 
*ptep, int writing,
pte_t old_pte, new_pte = __pte(0);
 
while (1) {
-   old_pte = *ptep;
+   /*
+* Make sure we don't reload from ptep
+*/
+   old_pte = READ_ONCE(*ptep);
/*
 * wait until _PAGE_BUSY is clear then set it atomically
 */
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index cc536d4a75ef..5840d546aa03 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -469,14 +469,18 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
 
pgdir = vcpu_e500->vcpu.arch.pgdir;
ptep = lookup_linux_ptep(pgdir, hva, &tsize_pages);
-   if (pte_present(*ptep))
-   wimg = (*ptep >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
-   else {
-   if (printk_ratelimit())
-   pr_err("%s: pte not present: gfn %lx, pfn %lx\n",
-   __func__, (long)gfn, pfn);
-   ret = -EINVAL;
-   goto out;
+   if (ptep) {
+   pte_t pte = READ_ONCE(*ptep);
+
+   if (pte_present(pte))
+   wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
+   MAS2_WIMGE_MASK;
+   else {
+   pr_err_ratelimited("%s: pte not present: gfn %lx,pfn 
%lx\n",
+  __func__, (long)gfn, pfn);
+   ret = -EINVAL;
+   goto out;
+   }
}
kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH V5 1/2] powerpc/mm/thp: Make page table walk safe against thp split/collapse

2015-03-29 Thread Aneesh Kumar K.V
We can disable a THP split or a hugepage collapse by disabling irq.
We do send IPI to all the cpus in the early part of split/collapse,
and disabling local irq ensure we don't make progress with
split/collapse. If the THP is getting split we return NULL from
find_linux_pte_or_hugepte(). For all the current callers it should be ok.
We need to be careful if we want to use returned pte_t pointer outside
the irq disabled region. W.r.t to THP split, the pfn remains the same,
but then a hugepage collapse will result in a pfn change. There are
few steps we can take to avoid a hugepage collapse.One way is to take page
reference inside the irq disable region. Other option is to take
mmap_sem so that a parallel collapse will not happen. We can also
disable collapse by taking pmd_lock. Another method used by kvm
subsystem is to check whether we had a mmu_notifer update in between
using mmu_notifier_retry().

Signed-off-by: Aneesh Kumar K.V 
---
Changes from V4:
* Update function comment for __find_linux_pte_or_hugepte

 arch/powerpc/include/asm/pgtable.h   | 11 ++-
 arch/powerpc/kernel/eeh.c|  6 --
 arch/powerpc/kernel/io-workarounds.c | 10 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  5 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 30 +++---
 arch/powerpc/kvm/e500_mmu_host.c | 14 --
 arch/powerpc/mm/hash_utils_64.c  |  2 +-
 arch/powerpc/mm/hugetlbpage.c| 22 --
 arch/powerpc/perf/callchain.c| 24 ++--
 9 files changed, 88 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 92fe01c355a9..11a38635dd65 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -247,8 +247,17 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
 #define pmd_large(pmd) 0
 #define has_transparent_hugepage() 0
 #endif
-pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 unsigned *shift);
+static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+  unsigned *shift)
+{
+   if (!arch_irqs_disabled()) {
+   pr_info("%s called with irq enabled\n", __func__);
+   dump_stack();
+   }
+   return __find_linux_pte_or_hugepte(pgdir, ea, shift);
+}
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3b2252e7731b..8424b232e598 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -330,9 +330,11 @@ static inline unsigned long eeh_token_to_phys(unsigned 
long token)
int hugepage_shift;
 
/*
-* We won't find hugepages here, iomem
+* We won't find hugepages here(this is iomem). Hence we are not
+* worried about _PAGE_SPLITTING/collapse. Also we will not hit
+* page table free, because of init_mm.
 */
-   ptep = find_linux_pte_or_hugepte(init_mm.pgd, token, &hugepage_shift);
+   ptep = __find_linux_pte_or_hugepte(init_mm.pgd, token, &hugepage_shift);
if (!ptep)
return token;
WARN_ON(hugepage_shift);
diff --git a/arch/powerpc/kernel/io-workarounds.c 
b/arch/powerpc/kernel/io-workarounds.c
index 24b968f8e4d8..63d9cc4d7366 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -71,15 +71,15 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
vaddr = (unsigned long)PCI_FIX_ADDR(addr);
if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
return NULL;
-
-   ptep = find_linux_pte_or_hugepte(init_mm.pgd, vaddr,
+   /*
+* We won't find huge pages here (iomem). Also can't hit
+* a page table free due to init_mm
+*/
+   ptep = __find_linux_pte_or_hugepte(init_mm.pgd, vaddr,
 &hugepage_shift);
if (ptep == NULL)
paddr = 0;
else {
-   /*
-* we don't have hugepages backing iomem
-*/
WARN_ON(hugepage_shift);
paddr = pte_pfn(*ptep) << PAGE_SHIFT;
}
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 534acb3c6c3d..26df3864d85a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -539,12 +539,13 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!writing && hpte_is_writable(r)) {
unsigned int hugepage_shift;
pte_t *ptep, pte

[PATCH V5 2/2] powerpc/mm/thp: Return pte address if we find trans_splitting.

2015-03-29 Thread Aneesh Kumar K.V
For THP that is marked trans splitting, we return the pte.
This require the callers to handle the pmd_trans_splitting scenario,
if they care. All the current callers are either looking at pfn or
write_ok, hence we don't need to update them.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 12 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  9 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  2 +-
 arch/powerpc/mm/hugetlbpage.c|  9 -
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index f06820c67175..5233a35d80e2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -281,11 +281,9 @@ static inline int hpte_cache_flags_ok(unsigned long ptel, 
unsigned long io_type)
 
 /*
  * If it's present and writable, atomically set dirty and referenced bits and
- * return the PTE, otherwise return 0. If we find a transparent hugepage
- * and if it is marked splitting we return 0;
+ * return the PTE, otherwise return 0.
  */
-static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing,
-unsigned int hugepage)
+static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing)
 {
pte_t old_pte, new_pte = __pte(0);
 
@@ -301,12 +299,6 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t 
*ptep, int writing,
cpu_relax();
continue;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   /* If hugepage and is trans splitting return None */
-   if (unlikely(hugepage &&
-pmd_trans_splitting(pte_pmd(old_pte
-   return __pte(0);
-#endif
/* If pte is not present return None */
if (unlikely(!(pte_val(old_pte) & _PAGE_PRESENT)))
return __pte(0);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 26df3864d85a..0fe9c92e78ed 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -537,20 +537,17 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
}
/* if the guest wants write access, see if that is OK */
if (!writing && hpte_is_writable(r)) {
-   unsigned int hugepage_shift;
pte_t *ptep, pte;
unsigned long flags;
-
/*
 * We need to protect against page table destruction
-* while looking up and updating the pte.
+* hugepage split and collapse.
 */
local_irq_save(flags);
ptep = find_linux_pte_or_hugepte(current->mm->pgd,
-hva, &hugepage_shift);
+hva, NULL);
if (ptep) {
-   pte = kvmppc_read_update_linux_pte(ptep, 1,
-  hugepage_shift);
+   pte = kvmppc_read_update_linux_pte(ptep, 1);
if (pte_write(pte))
write_ok = 1;
}
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index f559b25de173..d839f08cb903 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -219,7 +219,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
local_irq_restore(flags);
return H_PARAMETER;
}
-   pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift);
+   pte = kvmppc_read_update_linux_pte(ptep, writing);
if (pte_present(pte) && !pte_protnone(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f9a792829d4b..5349fa87e21d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1013,12 +1013,11 @@ pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, 
unsigned long ea,
 * A hugepage collapse is captured by pmd_none, because
 * it mark the pmd none and do a hpte invalidate.
 *
-* A hugepage split is captured by pmd_trans_splitting
-* because we mark the pmd trans splitting and do a
-* hpte invalidate
-*
+  

Re: [PATCH 2/2] KVM: PPC: Remove page table walk helpers

2015-03-29 Thread Benjamin Herrenschmidt
On Mon, 2015-03-30 at 10:39 +0530, Aneesh Kumar K.V wrote:
> This patch remove helpers which we had used only once in the code.
> Limiting page table walk variants help in ensuring that we won't
> end up with code walking page table with wrong assumptions.
> 
> Signed-off-by: Aneesh Kumar K.V 

Alex, it would be preferable to have this (and the previous one) in the
powerpc tree due to dependencies with further fixes to our page table
walking vs. THP, so if you're happy, just give us an ack.

Cheers,
Ben.

> ---
>  arch/powerpc/include/asm/pgtable.h  | 21 -
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 
> -
>  arch/powerpc/kvm/e500_mmu_host.c|  2 +-
>  3 files changed, 28 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 9835ac4173b7..92fe01c355a9 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -249,27 +249,6 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>  #endif
>  pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>unsigned *shift);
> -
> -static inline pte_t *lookup_linux_ptep(pgd_t *pgdir, unsigned long hva,
> -  unsigned long *pte_sizep)
> -{
> - pte_t *ptep;
> - unsigned long ps = *pte_sizep;
> - unsigned int shift;
> -
> - ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> - if (!ptep)
> - return NULL;
> - if (shift)
> - *pte_sizep = 1ul << shift;
> - else
> - *pte_sizep = PAGE_SIZE;
> -
> - if (ps > *pte_sizep)
> - return NULL;
> -
> - return ptep;
> -}
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
> b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 625407e4d3b0..73e083cb9f7e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -131,25 +131,6 @@ static void remove_revmap_chain(struct kvm *kvm, long 
> pte_index,
>   unlock_rmap(rmap);
>  }
>  
> -static pte_t lookup_linux_pte_and_update(pgd_t *pgdir, unsigned long hva,
> -   int writing, unsigned long *pte_sizep)
> -{
> - pte_t *ptep;
> - unsigned long ps = *pte_sizep;
> - unsigned int hugepage_shift;
> -
> - ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
> - if (!ptep)
> - return __pte(0);
> - if (hugepage_shift)
> - *pte_sizep = 1ul << hugepage_shift;
> - else
> - *pte_sizep = PAGE_SIZE;
> - if (ps > *pte_sizep)
> - return __pte(0);
> - return kvmppc_read_update_linux_pte(ptep, writing, hugepage_shift);
> -}
> -
>  static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v)
>  {
>   asm volatile(PPC_RELEASE_BARRIER "" : : : "memory");
> @@ -166,10 +147,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
> flags,
>   struct revmap_entry *rev;
>   unsigned long g_ptel;
>   struct kvm_memory_slot *memslot;
> - unsigned long pte_size;
> + unsigned hpage_shift;
>   unsigned long is_io;
>   unsigned long *rmap;
> - pte_t pte;
> + pte_t *ptep;
>   unsigned int writing;
>   unsigned long mmu_seq;
>   unsigned long rcbits;
> @@ -208,22 +189,33 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
> flags,
>  
>   /* Translate to host virtual address */
>   hva = __gfn_to_hva_memslot(memslot, gfn);
> + ptep = find_linux_pte_or_hugepte(pgdir, hva, &hpage_shift);
> + if (ptep) {
> + pte_t pte;
> + unsigned int host_pte_size;
>  
> - /* Look up the Linux PTE for the backing page */
> - pte_size = psize;
> - pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
> - if (pte_present(pte) && !pte_protnone(pte)) {
> - if (writing && !pte_write(pte))
> - /* make the actual HPTE be read-only */
> - ptel = hpte_make_readonly(ptel);
> - is_io = hpte_cache_bits(pte_val(pte));
> - pa = pte_pfn(pte) << PAGE_SHIFT;
> - pa |= hva & (pte_size - 1);
> - pa |= gpa & ~PAGE_MASK;
> - }
> + if (hpage_shift)
> + host_pte_size = 1ul << hpage_shift;
> + else
> + host_pte_size = PAGE_SIZE;
> + /*
> +  * We should always find the guest page size
> +  * to <= host page size, if host is using hugepage
> +  */
> + if (host_pte_size < psize)
> + return H_PARAMETER;
>  
> - if (pte_size < psize)
> - return H_PARAMETER;
> + pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift);
> + if (pte_present(pte) && !pte_protnone(pte)) {
> + i

Re: [PATCH 2/2] powerpc/powernv: Reboot when requested by firmware

2015-03-29 Thread Michael Ellerman
On Mon, 2015-03-30 at 12:45 +1030, Joel Stanley wrote:
> Use orderly_reboot so userspace will to shut itself down via the reboot
> path. This is required for graceful reboot initiated by the BMC, such
> as when a user uses ipmitool to issue a 'chassis power cycle' command.
> 
> Signed-off-by: Joel Stanley 
> ---
>  arch/powerpc/platforms/powernv/opal-power.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal-power.c 
> b/arch/powerpc/platforms/powernv/opal-power.c
> index 48bf5b0..ac46c2c 100644
> --- a/arch/powerpc/platforms/powernv/opal-power.c
> +++ b/arch/powerpc/platforms/powernv/opal-power.c
> @@ -29,8 +29,9 @@ static int opal_power_control_event(struct notifier_block 
> *nb,
>  
>   switch (type) {
>   case SOFT_REBOOT:
> - /* Fall through. The service processor is responsible for
> -  * bringing the machine back up */
> + pr_info("OPAL: reboot requested\n");
> + orderly_reboot();
> + break;

Acked-by: Michael Ellerman 


Andrew, do you want to take these 2 via your tree? Assuming folks are OK with
patch 1.

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/3] powerpc/powernv: convert codes returned by OPAL calls

2015-03-29 Thread Cedric Le Goater
On 03/30/2015 04:05 AM, Michael Ellerman wrote:
> On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
>> OPAL has its own list of return codes. The patch provides a translation
>> of such codes in errnos for the opal_sensor_read call, and possibly 
>> others if needed.
>>
>> Index: linux.git/arch/powerpc/platforms/powernv/opal.c
>> ===
>> --- linux.git.orig/arch/powerpc/platforms/powernv/opal.c
>> +++ linux.git/arch/powerpc/platforms/powernv/opal.c
>> @@ -894,6 +894,23 @@ void opal_free_sg_list(struct opal_sg_li
>>  }
>>  }
>>  
>> +int opal_error_code(int rc)
>> +{
>> +switch (rc) {
>> +case OPAL_SUCCESS:  return 0;
> 
> Obviously correct.

He. Initially, I didn't put a case for SUCCESS, but we have code doing :

ret = be64_to_cpu(msg.params[1]);


>> +case OPAL_PARAMETER:return -EINVAL;
> 
> Yep.
> 
>> +case OPAL_UNSUPPORTED:  return -ENOSYS;
> 
> You shouldn't use ENOSYS here, that should only ever mean "no such syscall",
> otherwise you get very confusing results like read() returning ENOSYS.

Indeed. How about ENODEV then ? 

>> +case OPAL_ASYNC_COMPLETION: return -EAGAIN;
> 
> EAGAIN means "try what you did again", I don't think that's what
> ASYNC_COMPLETION means, is it? It looks like it means, "don't try again, but
> you need to wait for the result to be ready".
> 
> I'm not sure it maps well to any of the Linux codes, maybe EINPROGRESS ?

Yes. This is better.

>> +case OPAL_BUSY_EVENT:   return -EBUSY;
> 
> Yep.
> 
>> +case OPAL_NO_MEM:   return -ENOMEM;
> 
> Yep.
> 
>> +case OPAL_HARDWARE: return -ENOENT;
> 
> This is another one which I think you shouldn't use as it can lead to 
> confusing
> results at user level. eg:
> 
>   $ cat /sysfs/some/file
>   Error: No such file or directory
> 
> Huh?
> 
> Looking at the skiboot code this looks like EIO is a good match.

ok. 

>> +case OPAL_INTERNAL_ERROR:   return -EIO;
> 
> Yeah as good as anything I guess.
> 
>> +default:
>> +pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
>> +return -ERANGE;
> 
> I'm not sure about this one honestly, it means "Math result not 
> representable".
> 
> I suspect the reason RTAS chose it was just that it's not EINVAL.
> 
> This should probably also just be EIO.

ok. I will change it. 

Thanks,

C.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc: Reword the "returning from prom_init" message

2015-03-29 Thread Michael Ellerman
We get way too many bug reports that say "the kernel is hung in
prom_init", which stems from the fact that the last piece of output
people see is "returning from prom_init".

The kernel is almost never hung in prom_init(), it's just that it's
crashed somewhere after prom_init() but prior to the console coming up.

The existing message should give a clue to that, ie. "returning from"
indicates that prom_init() has finished, but it doesn't seem to work.
Let's try something different.

This prints:

  Quiescing Open Firmware ...
  Booting Linux via __start() ...

Which hopefully makes it clear that prom_init() is not the problem, and
although __start() probably isn't either, it's at least the right place
to begin looking.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/prom_init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 1a85d8f96739..fd1fe4c37599 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2898,7 +2898,7 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
 * Call OF "quiesce" method to shut down pending DMA's from
 * devices etc...
 */
-   prom_printf("Calling quiesce...\n");
+   prom_printf("Quiescing Open Firmware ...\n");
call_prom("quiesce", 0, 0);
 
/*
@@ -2910,7 +2910,7 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
 
/* Don't print anything after quiesce under OPAL, it crashes OFW */
if (of_platform != PLATFORM_OPAL) {
-   prom_printf("returning from prom_init\n");
+   prom_printf("Booting Linux via __start() ...\n");
prom_debug("->dt_header_start=0x%x\n", hdr);
}
 
-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 3/3] powerpc/powernv: remove opal_sensor_mutex

2015-03-29 Thread Cedric Le Goater
On 03/30/2015 04:09 AM, Michael Ellerman wrote:
> On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
>> The opal sensor mutex protects the opal_sensor_read call which
>> can return a OPAL_BUSY code on IBM Power systems if a previous 
>> request is in progress.
>>
>> This can be handled at user level with a retry.
> 
> It can, but how does it actually look in practice?
> 
> It looks like the only use of opal_get_sensor_data() is show_sensor() in
> drivers/hwmon/ibmpowernv.c.
> 
> Because that's a sysfs attribute folks will be generally just dumping 
> that with cat, or reading it in a shell script, neither of which will 
> cope nicely with EBUSY I think?

It won't, I agree but it should only happen when running concurrent cat 
commands on the hwmon sysfs files. The event should be rare enough.

Anyhow, this is not a big issue. We can drop that patch. The real "issue"
is the time it takes to get some values back from the FSP. This is what
user space has been most surprised about.

Thanks,

C.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/3] powerpc/powernv: convert codes returned by OPAL calls

2015-03-29 Thread Michael Ellerman
On Mon, 2015-03-30 at 08:37 +0200, Cedric Le Goater wrote:
> On 03/30/2015 04:05 AM, Michael Ellerman wrote:
> > On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
> >> OPAL has its own list of return codes. The patch provides a translation
> >> of such codes in errnos for the opal_sensor_read call, and possibly 
> >> others if needed.
> >>
> >> +  case OPAL_UNSUPPORTED:  return -ENOSYS;
> > 
> > You shouldn't use ENOSYS here, that should only ever mean "no such syscall",
> > otherwise you get very confusing results like read() returning ENOSYS.
> 
> Indeed. How about ENODEV then ? 

That can also be confusing from userspace.

I think it's probably best just to use EIO, as far as userspace is concerned if
the kernel lets it call an unsupported OPAL routine that is more or less a
kernel bug.

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/3] powerpc/powernv: convert codes returned by OPAL calls

2015-03-29 Thread Cedric Le Goater
On 03/30/2015 08:54 AM, Michael Ellerman wrote:
> On Mon, 2015-03-30 at 08:37 +0200, Cedric Le Goater wrote:
>> On 03/30/2015 04:05 AM, Michael Ellerman wrote:
>>> On Fri, 2015-03-27 at 17:39 +0100, Cédric Le Goater wrote:
 OPAL has its own list of return codes. The patch provides a translation
 of such codes in errnos for the opal_sensor_read call, and possibly 
 others if needed.

 +  case OPAL_UNSUPPORTED:  return -ENOSYS;
>>>
>>> You shouldn't use ENOSYS here, that should only ever mean "no such syscall",
>>> otherwise you get very confusing results like read() returning ENOSYS.
>>
>> Indeed. How about ENODEV then ? 
> 
> That can also be confusing from userspace.
> 
> I think it's probably best just to use EIO, as far as userspace is concerned 
> if
> the kernel lets it call an unsupported OPAL routine that is more or less a
> kernel bug.

OK. Will do.

Thanks,

C.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev