Re: [PATCH v8 42/45] drivers/of: Rename unflatten_dt_node()

2016-03-20 Thread Rob Herring
On Mon, Mar 7, 2016 at 6:56 PM, Gavin Shan  wrote:
> On Tue, Mar 01, 2016 at 08:40:12PM -0600, Rob Herring wrote:
>>On Thu, Feb 18, 2016 at 9:16 PM, Gavin Shan  wrote:
>>> On Wed, Feb 17, 2016 at 08:59:53AM -0600, Rob Herring wrote:
On Tue, Feb 16, 2016 at 9:44 PM, Gavin Shan wrote:
> This renames unflatten_dt_node() to unflatten_dt_nodes() as it
> populates multiple device nodes from FDT blob. No logical changes
> introduced.
>
> Signed-off-by: Gavin Shan 
> ---
>  drivers/of/fdt.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)

Acked-by: Rob Herring 

I'm happy to take patches 40-42 for 4.6 if the rest of the series
doesn't go in, given that they fix a separate problem. I just need to
know soon (or at least they need to go into -next soon).

>>>
>>> Thanks for the quick response, Rob. It depends on how many comments I
>>> receive on the powerpc/powernv part. Apart from that, all parts, including
>>> this one, have been ack'ed. I can discuss it with Michael Ellerman.
>>> By the way, how soon do you need the decision to merge 40-42? If that's
>>> one or two weeks from now, I don't think the review of the whole series
>>> can be done.
>>
>>Well, it's been 2 weeks now. I need to know this week.
>>
>>> Also, I think you probably can merge 40-44 as they're all about
>>> fdt.c. If they can be merged at one time, I needn't bother (cc)
>>> you again if I need to send an updated revision. Thanks for your
>>> review.
>>
>>I did not include 43 and 44 as they are only needed for the rest of your series.
>>
>
> Rob, sorry for the late response. I really expected this series to be merged
> for 4.6, and I was checking reviewers' bandwidth to review it. Unfortunately,
> I didn't receive any comments except yours until now. That means this series
> has to miss 4.6. Please pick/merge 41 and 42 if nobody has an objection.
> Thanks again for your time on this.

Too late for 4.6.

Rob

Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices

2016-03-20 Thread Alex Williamson
On Thu, 17 Mar 2016 19:28:34 +0800
Yongji Xie  wrote:

> On 2016/3/17 0:30, Alex Williamson wrote:
> > On Mon,  7 Mar 2016 15:48:35 +0800
> > Yongji Xie  wrote:
> >  
> >> When vfio passes through a PCI device whose MMIO BARs
> >> are smaller than PAGE_SIZE, the guest will not handle
> >> MMIO accesses to those BARs, which leads to MMIO
> >> emulation in the host.
> >>
> >> This is because vfio will not allow passthrough of one
> >> BAR's MMIO page, which may be shared with other BARs.
> >>
> >> To solve this performance issue, this patch modifies
> >> resource_alignment to support a syntax where multiple
> >> devices get the same alignment. So we can use something
> >> like "pci=resource_alignment=*:*:*.*:noresize" to
> >> enforce the alignment of all MMIO BARs to be at least
> >> PAGE_SIZE, so that one BAR's MMIO page is not
> >> shared with other BARs.
> >>
> >> Signed-off-by: Yongji Xie 
> >> ---
> >>   Documentation/kernel-parameters.txt |2 +
> >>   drivers/pci/pci.c   |   90 ++-
> >>   include/linux/pci.h |4 ++
> >>   3 files changed, 85 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> >> index 8028631..74b38ab 100644
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >>aligned memory resources.
> >>   If <order of align> is not specified,
> >>PAGE_SIZE is used as alignment.
> >> +  <domain>, <bus>, <slot> and <func> can be set to
> >> +  "*" which means match all values.  
> > I don't see anywhere that you're automatically enabling this for your
> > platform, so presumably you're expecting users to determine on their
> > own that they have a performance problem and hoping that they'll figure
> > out that they need to use this option to resolve it.  The first irate
> > question you'll get back is why doesn't this happen automatically?  
> 
> Actually, I just wanted to keep the code simple. Maybe it is
> not a good idea to make users enable this option manually.
> I will try to fix it like this:

That's not entirely what I meant. I think having a way for a user to
enable it is a good thing, but maybe there need to be cases where it's
enabled automatically.  With 4k pages, this is often not an issue, since
the PCI spec recommends 4k or 8k alignment for resources, but that
doesn't preclude an unusual device where a user might want to enable it
anyway.  At 64k page size, problems become more common, so we need to
think about either enabling it automatically or somehow making it more
apparent to the user that the option is available for this purpose.
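
For reference, the pre-existing documented form and the wildcard form this
patch proposes would look like this on the kernel command line (the specific
device address below is a made-up example):

    # existing syntax: align the BARs of one device to at least 2^16 bytes
    pci=resource_alignment=16@0000:01:00.0

    # proposed wildcard syntax: align every BAR of every device to at least
    # PAGE_SIZE, without growing the resource size
    pci=resource_alignment=*:*:*.*:noresize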

> 
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index 6f8065a..6659752 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -30,6 +30,8 @@
>   #define PCIBIOS_MIN_IO 0x1000
>   #define PCIBIOS_MIN_MEM    0x1000
> 
> +#define PCIBIOS_MIN_ALIGNMENT  PAGE_SIZE
> +
>   struct pci_dev;
> 
>   /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index dadd28a..9f644e4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4593,6 +4593,8 @@ EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
>   static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
>   static DEFINE_SPINLOCK(resource_alignment_lock);
> 
> +#define DISABLE_ARCH_ALIGNMENT -1
> +#define DEFAULT_ALIGNMENT  -2
>   /**
>* pci_specified_resource_alignment - get resource alignment specified 
> by user.
>* @dev: the PCI device to get
> @@ -4609,6 +4611,9 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  char *p;
>  bool invalid = false;
> 
> +#ifdef PCIBIOS_MIN_ALIGNMENT
> +   align = PCIBIOS_MIN_ALIGNMENT;
> +#endif
>  spin_lock(&resource_alignment_lock);
>  p = resource_alignment_param;
>  while (*p) {
> @@ -4617,7 +4622,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  p[count] == '@') {
>  p += count + 1;
>  } else {
> -   align_order = -1;
> +   align_order = DEFAULT_ALIGNMENT;
>  }
>  if (p[0] == '*' && p[1] == ':') {
>  seg = -1;
> @@ -4673,8 +4678,10 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  (bus == dev->bus->number || bus == -1) &&
>  (slot == PCI_SLOT(dev->devfn) || slot == -1) &&
>  (func == PCI

[PATCH] powerpc/rtas: fix array overrun in ppc_rtas() syscall

2016-03-20 Thread Andrew Donnellan
If ppc_rtas() is called with args.nargs == 16 and args.nret == 0, args.rets
is set to point to &args.args[16], which is beyond the end of the args.args
array. This results in a minor read overrun of the array when we check the
first return code (which, per PAPR, is a required output of all RTAS calls)
to see if there's been a hardware error.

Change the nargs/nret check to ensure nargs is <= 15, allowing room for the
status code. Users shouldn't be calling with nret == 0, but there's no real
harm if they do, so we don't stop them.
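
For context, a sketch of the structure involved (field layout as in
arch/powerpc/include/asm/rtas.h; the comments are editorial):

    struct rtas_args {
            __be32 token;
            __be32 nargs;
            __be32 nret;
            __be32 args[16];
            __be32 *rets;           /* ppc_rtas() sets this to &args[nargs] */
    };

    /* With nargs == 16 and nret == 0, the old "nargs > ARRAY_SIZE(args.args)"
     * check passed, rets ended up pointing one element past args[], and the
     * later read of rets[0] (the RTAS status word) ran off the array. */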

Signed-off-by: Andrew Donnellan 

---

Found with the assistance of Coverity Scan.

The dodgy read doesn't currently leak anything at all to userspace, as
args.rets isn't copied back to userspace.
---
 arch/powerpc/kernel/rtas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 28736ff..8da209f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1070,7 +1070,7 @@ asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
nret  = be32_to_cpu(args.nret);
token = be32_to_cpu(args.token);
 
-   if (nargs > ARRAY_SIZE(args.args)
+   if (nargs >= ARRAY_SIZE(args.args)
|| nret > ARRAY_SIZE(args.args)
|| nargs + nret > ARRAY_SIZE(args.args))
return -EINVAL;
-- 
Andrew Donnellan  Software Engineer, OzLabs
andrew.donnel...@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)    IBM Australia Limited


Re: [PATCH kernel] KVM: PPC: Create virtual-mode only TCE table handlers

2016-03-20 Thread Paul Mackerras
On Fri, Mar 18, 2016 at 01:50:42PM +1100, Alexey Kardashevskiy wrote:
> Upcoming in-kernel VFIO acceleration needs different handling in real
> and virtual modes which makes it hard to support both modes in
> the same handler.
> 
> This creates a copy of kvmppc_rm_h_stuff_tce and kvmppc_rm_h_put_tce
> in addition to the existing kvmppc_rm_h_put_tce_indirect.
> 
> This also fixes linker breakage when only PR KVM was selected (leaving
> HV KVM off): the kvmppc_h_put_tce/kvmppc_h_stuff_tce functions
> would not compile at all and the link would fail.
> 
> Signed-off-by: Alexey Kardashevskiy 

Acked-by: Paul Mackerras 

Paolo, will you take this directly, or do you want me to generate
a pull request?

Paul.

[PATCH v4] powerpc/pci: Assign fixed PHB number based on device-tree properties

2016-03-20 Thread Guilherme G. Piccoli
The domain/PHB field of PCI addresses has its value obtained from a
global variable, incremented each time a new domain (represented by
struct pci_controller) is added to the system. The domain addition
process happens during boot or due to PCI device hotplug.

As recent kernels are using predictable naming for network interfaces,
the network stack is more tied to PCI naming. This can be a problem in
hotplug scenarios, because PCI addresses will change if devices are
removed and then re-added. This situation seems unusual, but it can
happen if a user wants to replace a NIC without rebooting the machine,
for example.

This patch changes the way PCI domain values are generated: now, we use
device-tree properties to assign fixed PHB numbers to PCI addresses
when available (meaning pSeries and PowerNV cases). We also use a bitmap
to allow dynamic PHB numbering when device-tree properties are not
used. This bitmap keeps track of used PHB numbers and, if a PHB is
released (by hotplug operations, for example), allows the reuse of that
PHB number, avoiding PCI address changes in case a device is removed
and re-added soon after. No functional changes were introduced.

Reviewed-by: Gavin Shan 
Signed-off-by: Guilherme G. Piccoli 
---
 arch/powerpc/kernel/pci-common.c | 40 +---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0f7a60f..bc31ac1 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -44,8 +44,11 @@
 static DEFINE_SPINLOCK(hose_spinlock);
 LIST_HEAD(hose_list);
 
-/* XXX kill that some day ... */
-static int global_phb_number;  /* Global phb counter */
+/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */
+#define MAX_PHBS       8192
+
+/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */
+static DECLARE_BITMAP(phb_bitmap, MAX_PHBS);
 
 /* ISA Memory physical address */
 resource_size_t isa_mem_base;
@@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void)
 }
 EXPORT_SYMBOL(get_pci_dma_ops);
 
+static int get_phb_number(struct device_node *dn)
+{
+   const __be64 *prop64;
+   const __be32 *regs;
+   int phb_id = 0;
+
+   /* try fixed PHB numbering first, by checking archs and reading
+* the respective device-tree property. */
+   if (machine_is(pseries)) {
+   regs = of_get_property(dn, "reg", NULL);
+   if (regs)
+   return (int)(be32_to_cpu(regs[1]) & 0x);
+   } else if (machine_is(powernv)) {
+   prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
+   if (prop64)
+   return (int)(be64_to_cpup(prop64) & 0x);
+   }
+
+   /* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */
+   phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS);
+   BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */
+   set_bit(phb_id, phb_bitmap);
+
+   return phb_id;
+}
+
 struct pci_controller *pcibios_alloc_controller(struct device_node *dev)
 {
struct pci_controller *phb;
@@ -72,7 +101,7 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev)
if (phb == NULL)
return NULL;
spin_lock(&hose_spinlock);
-   phb->global_number = global_phb_number++;
+   phb->global_number = get_phb_number(dev);
list_add_tail(&phb->list_node, &hose_list);
spin_unlock(&hose_spinlock);
phb->dn = dev;
@@ -94,6 +123,11 @@ EXPORT_SYMBOL_GPL(pcibios_alloc_controller);
 void pcibios_free_controller(struct pci_controller *phb)
 {
spin_lock(&hose_spinlock);
+
+   /* clear bit of phb_bitmap to allow reuse of this phb number */
+   if (phb->global_number < MAX_PHBS)
+   clear_bit(phb->global_number, phb_bitmap);
+
list_del(&phb->list_node);
spin_unlock(&hose_spinlock);
 
-- 
2.1.0


Re: [PATCH] cxl: fix setting of _PAGE_USER bit when handling page faults

2016-03-20 Thread Ian Munsie
Excerpts from andrew.donnellan's message of 2016-03-18 15:01:21 +1100:
> Fixes: f204e0b8cedd ("cxl: Driver code for powernv PCIe based cards for
> userspace access")

It doesn't fix that, since there was no cxl kernel API support at the
time, so this wasn't a regression - just something we missed when the
kernel API was added (I believe the broken test in the code was a
leftover from some early bringup work and would never have been
exercised on an upstream kernel until then).

> Currently, this should only affect cxlflash.

We haven't run into any problems because of this that I am aware of - do
we have a test case for this?

> -if ((!ctx->kernel) || ~(dar & (1ULL << 63)))
> +if ((!ctx->kernel) || !(dar & (1ULL << 63)))

Should it be the top two bits?

-Ian


Re: [PATCH next] cxl: Allow PSL timebase to not sync

2016-03-20 Thread Michael Neuling

> IMO, we should ditch the module parameter altogether and never treat
> timebase sync failure as fatal, and leave that up to any applications
> that actually need it to check.

I agree with this.

Mikey

Re: [PATCH v9 2/3] kernel.h: add to_user_ptr()

2016-03-20 Thread Daniel Vetter
On Thu, Mar 17, 2016 at 02:33:50PM -0700, Joe Perches wrote:
> On Thu, 2016-03-17 at 18:19 -0300, Gustavo Padovan wrote:
> > 2016-03-17 Joe Perches :
> > > On Thu, 2016-03-17 at 16:50 -0400, Rob Clark wrote:
> > > > On Thu, Mar 17, 2016 at 4:40 PM, Joe Perches  wrote:
> > > []
> > > > > It's a name that seems like it should be a straightforward
> > > > > cast of a kernel pointer to a __user pointer like:
> > > > > 
> > > > > static inline void __user *to_user_ptr(void *p)
> > > > > {
> > > > > return (void __user *)p;
> > > > > }
> > > > ahh, ok.  I guess I was used to using it in the context of ioctl
> > > > structs..  in that context u64 -> (void __user *) made more sense.
> > > > 
> > > > Maybe uapi_to_ptr()?  (ok, not super-creative.. maybe someone has a
> > > > better idea)
> > > Maybe u64_to_user_ptr?
> > That is a good name. If everyone agrees I can resend this patch
> > changing it to u64_to_user_ptr. Then should we still keep it in
> > kernel.h?
> 
> I've no particular opinion about location,
> but maybe compat.h might be appropriate.
> 
> Maybe add all variants:
> 
>   void __user *u32_to_user_ptr(u32 val)
>   void __user *u64_to_user_ptr(u64 val)
>   u32 user_ptr_to_u32(void __user *p)
>   u64 user_ptr_to_u64(void __user *p)
> 
> Maybe there's something about 32-bit userspace on a
> 64-bit OS that should be done too.

Tbh I really don't think we should add 32-bit variants and encourage the
bad practice of having 32-bit user pointers in ioctl structs and stuff. Anyway,
just my bikeshed on top ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
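
For reference, a minimal sketch of the helper under the name suggested in
this thread; the typecheck() guard here is an assumption of the sketch, not
something settled above:

    #include <linux/typecheck.h>

    #define u64_to_user_ptr(x)                      \
    ({                                              \
            typecheck(u64, (x));                    \
            (void __user *)(uintptr_t)(x);          \
    })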

Re: [PATCH] powerpc: rename to_user_ptr to __to_user_ptr

2016-03-20 Thread Gustavo Padovan
Hi,

2016-03-17 Gustavo Padovan :

> From: Gustavo Padovan 
> 
> to_user_ptr() is a local macro defined by signal_32.c; rename it to
> __to_user_ptr(), as we will now have a global to_user_ptr() defined by
> kernel.h that has a different meaning from this one.
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Signed-off-by: Gustavo Padovan 
> ---
>  arch/powerpc/kernel/signal_32.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)

We changed our mind about the names, so ignore this patch. Sorry for the
noise.

Gustavo

Re: [v6, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2016-03-20 Thread Arnd Bergmann
On Thursday 17 March 2016 12:01:01 Rob Herring wrote:
> On Mon, Mar 14, 2016 at 05:45:43PM +0000, Scott Wood wrote:

> > >> This makes the driver non-portable. Better identify the specific
> > >> workarounds based on the compatible string for this device, or add a
> > >> boolean DT property for the quirk.
> > >>
> > >>Arnd
> > > 
> > > [Lu Yangbo-B47093] Hi Arnd, we did have a discussion about using DTS in 
> > > v1 before.
> > > https://patchwork.kernel.org/patch/6834221/
> > > 
> > > We don't have a separate DTS file for each revision of an SoC, and if
> > > we did, we'd constantly have people using the wrong one.
> > > In addition, the device tree is a stable ABI, and errata are often
> > > discovered after device trees are deployed.
> > > See the link for details.
> > > 
> > > So we decided to read the SVR from the device-config/guts MMIO block
> > > rather than using the DTS.
> > > Thanks.
> > 
> > Also note that this driver is already only for fsl-specific hardware,
> > and it will still work even if fsl_guts doesn't find anything to bind to
> > -- it just wouldn't be able to detect errata based on SVR in that case.
> 
> IIRC, it is the same IP block as i.MX and Arnd's point is this won't 
> even compile on !PPC. It is things like this that prevent sharing the 
> driver.

I think the first four patches take care of building for ARM,
but the problem remains if you want to enable COMPILE_TEST as
we need for certain automated checking.

> Dealing with Si revs is a common problem. We should have a 
> common solution. There is soc_device for this purpose.

Exactly. The last time this came up, I think we agreed to implement a
helper using glob_match() on the soc_device strings. Unfortunately
this hasn't happened yet, but I'd still prefer that over yet another
vendor-specific way of dealing with the generic issue.

Arnd
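
A sketch of the kind of helper Arnd describes, matching an soc_device
revision string with glob_match(); the function name is hypothetical, since
no such helper existed at the time:

    #include <linux/glob.h>
    #include <linux/sys_soc.h>

    /* Hypothetical: true if the SoC's revision matches a glob pattern,
     * e.g. soc_revision_matches(attr, "[12].0") for rev 1.0 or 2.0 parts. */
    static bool soc_revision_matches(const struct soc_device_attribute *attr,
                                     const char *pattern)
    {
            return attr->revision && glob_match(pattern, attr->revision);
    }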

Re: [PATCH v9 2/3] kernel.h: add to_user_ptr()

2016-03-20 Thread Rob Clark
On Thu, Mar 17, 2016 at 4:40 PM, Joe Perches  wrote:
> On Thu, 2016-03-17 at 16:33 -0400, Rob Clark wrote:
>> On Thu, Mar 17, 2016 at 4:22 PM, Joe Perches  wrote:
>> > On Thu, 2016-03-17 at 15:43 -0300, Gustavo Padovan wrote:
>> > > 2016-03-17 Gustavo Padovan :
>> > > > 2016-03-17 Joe Perches :
>> > > > > On Thu, 2016-03-17 at 14:30 -0300, Gustavo Padovan wrote:
>> > > > > > This function had copies in 3 different files. Unify them in
>> > > > > > kernel.h.
>> > > > > This is only used by gpu/drm.
>> > > > >
>> > > > > I think this is a poor name for a generic function
>> > > > > that would be in kernel.h.
>> > > > >
>> > > > > Isn't there an include file in linux/drm that's
>> > > > > appropriate for this?  Maybe drmP.h
>> > > > >
>> > > > > Maybe prefix this function name with drm_ too.
>> > > > No, the next patch adds a user to drivers/staging (which will be moved
>> > > > to drivers/dma-buf) soon. Maybe move to a different header in
>> > > > include/linux/? not sure which one.
>> > > >
>> > > > >
>> > > > >
>> > > > > Also, there's this that might conflict:
>> > > > >
>> > > > > arch/powerpc/kernel/signal_32.c:#define to_user_ptr(p)  ptr_to_compat(p)
>> > > > > arch/powerpc/kernel/signal_32.c:#define to_user_ptr(p)  ((unsigned long)(p))
>> > > > Right, I'll figure out how to replace these two too.
>> > > The powerpc to_user_ptr has a different meaning from the one I'm adding
>> > > in this patch. I propose we just rename powerpc's to_user_ptr to
>> > > __to_user_ptr and leave the rest as is.
>> > I think that's not a good idea, and you should really check
>> > this concept with the powerpc folk (added to to:s and cc:ed)
>> >
>> > If it were really added, then the function meaning is incorrect.
>> >
>> > This is taking a u64, casting that to (unsigned long/uintptr_t),
>> > then converting that to a user pointer.
>> >
>> > Does that naming and use make sense on x86-32 or arm32?
>> >
>> fwiw Gustavo's version of to_user_ptr() is in use on arm32 and arm64..
>> Not entirely sure what doesn't make sense about it
>
> It's a name that seems like it should be a straightforward
> cast of a kernel pointer to a __user pointer like:
>
> static inline void __user *to_user_ptr(void *p)
> {
> return (void __user *)p;
> }

ahh, ok.  I guess I was used to using it in the context of ioctl
structs..  in that context u64 -> (void __user *) made more sense.

Maybe uapi_to_ptr()?  (ok, not super-creative.. maybe someone has a better idea)

BR,
-R

> As a static function in a single file, it's not
> great, but OK, fine, it's static.
>
> As a global function in kernel.h, it's misleading.
>
>

Re: [PATCH v9 2/3] kernel.h: add to_user_ptr()

2016-03-20 Thread Joe Perches
On Thu, 2016-03-17 at 15:43 -0300, Gustavo Padovan wrote:
> 2016-03-17 Gustavo Padovan :
> > 2016-03-17 Joe Perches :
> > > On Thu, 2016-03-17 at 14:30 -0300, Gustavo Padovan wrote:
> > > > 
> > > > This function had copies in 3 different files. Unify them in
> > > > kernel.h.
> > > This is only used by gpu/drm.
> > > 
> > > I think this is a poor name for a generic function
> > > that would be in kernel.h.
> > > 
> > > Isn't there an include file in linux/drm that's
> > > appropriate for this?  Maybe drmP.h
> > > 
> > > Maybe prefix this function name with drm_ too.
> > No, the next patch adds a user to drivers/staging (which will be moved
> > to drivers/dma-buf) soon. Maybe move to a different header in
> > include/linux/? not sure which one.
> > 
> > > 
> > > Also, there's this that might conflict:
> > > 
> > > arch/powerpc/kernel/signal_32.c:#define to_user_ptr(p)  ptr_to_compat(p)
> > > arch/powerpc/kernel/signal_32.c:#define to_user_ptr(p)  ((unsigned long)(p))
> > Right, I'll figure out how to replace these two too.
> The powerpc to_user_ptr has a different meaning from the one I'm adding
> in this patch. I propose we just rename powerpc's to_user_ptr to
> __to_user_ptr and leave the rest as is.

I think that's not a good idea, and you should really check
this concept with the powerpc folk (added to to:s and cc:ed)

If it were really added, then the function meaning is incorrect.

This is taking a u64, casting that to (unsigned long/uintptr_t),
then converting that to a user pointer.

Does that naming and use make sense on x86-32 or arm32?


Re: linux-next: build failure after merge of the aio tree

2016-03-20 Thread Benjamin LaHaise
On Wed, Mar 16, 2016 at 02:59:38PM +0100, Arnd Bergmann wrote:
> On Wednesday 16 March 2016 13:12:36 Andy Shevchenko wrote:
> > 
> > > I've also sent a patch that fixes the link error on ARM and that should
> > > work on all other architectures too.
> > 
> > In the case of avr32, signalfd_read() fails. Does your patch help with
> > it as well?
> > 
> > P.S. Bisecting shows same culprit: 150a0b4905f1 ("aio: add support for
> > async openat()")
> 
> I don't know. What is the symptom on avr32? My patch only removes the
> get_user() instances on 64-bit values and replaces them with a
> single copy_from_user() call.

Which is the wrong fix.  Arch code should be able to handle 64 bit values 
in all the get/put_user() variants.  We use 64 bit variables all over the 
place in interfaces to userspace.

-ben
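
A sketch of the pattern Ben is defending - a hypothetical ioctl payload, not
code from this thread:

    #include <linux/uaccess.h>
    #include <linux/types.h>

    struct demo_req {
            __u64 user_buf;         /* userspace pointer carried as a u64 */
            __u64 len;
    };

    /* get_user() on a 64-bit field is expected to work on every arch,
     * including 32-bit ones. */
    static int demo_fetch(struct demo_req __user *ureq, u64 *buf, u64 *len)
    {
            if (get_user(*buf, &ureq->user_buf) ||
                get_user(*len, &ureq->len))
                    return -EFAULT;
            return 0;
    }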

Re: [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup

2016-03-20 Thread Alistair Popple
On Wed, 9 Mar 2016 17:29:00 Alexey Kardashevskiy wrote:
> The NPU PHB TCE Kill register is exactly the same as in the rest of POWER8,
> so let's reuse the existing code for the NPU. The only bit missing is
> a helper to reset the entire TCE cache, so this moves such a helper
> out of the NPU code and renames it.
> 
> Since pnv_npu_tce_invalidate() really does invalidate the entire cache,
> this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
> This adds an explicit comment about the workaround for invalidating the
> NPU TCE cache.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Alistair Popple 

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 41 ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++
>  arch/powerpc/platforms/powernv/pci.h  |  7 +-
>  3 files changed, 25 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 7229acd..778570c 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -25,8 +25,6 @@
>   * Other types of TCE cache invalidation are not functional in the
>   * hardware.
>   */
> -#define TCE_KILL_INVAL_ALL PPC_BIT(0)
> -
>  static struct pci_dev *get_pci_dev(struct device_node *dn)
>  {
>   return PCI_DN(dn)->pcidev;
> @@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>   return pe;
>  }
>  
> -void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe)
> -{
> - struct pnv_phb *phb = npe->phb;
> -
> - if (WARN_ON(phb->type != PNV_PHB_NPU ||
> - !phb->ioda.tce_inval_reg ||
> - !(npe->flags & PNV_IODA_PE_DEV)))
> - return;
> -
> - mb(); /* Ensure previous TCE table stores are visible */
> - __raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL),
> - phb->ioda.tce_inval_reg);
> -}
> -
> -void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
> - struct iommu_table *tbl,
> - unsigned long index,
> - unsigned long npages,
> - bool rm)
> -{
> - struct pnv_phb *phb = npe->phb;
> -
> - /* We can only invalidate the whole cache on NPU */
> - unsigned long val = TCE_KILL_INVAL_ALL;
> -
> - if (WARN_ON(phb->type != PNV_PHB_NPU ||
> - !phb->ioda.tce_inval_reg ||
> - !(npe->flags & PNV_IODA_PE_DEV)))
> - return;
> -
> - mb(); /* Ensure previous TCE table stores are visible */
> - if (rm)
> - __raw_rm_writeq(cpu_to_be64(val),
> -   (__be64 __iomem *) phb->ioda.tce_inval_reg_phys);
> - else
> - __raw_writeq(cpu_to_be64(val),
> - phb->ioda.tce_inval_reg);
> -}
> -
>  void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
>  {
>   struct pnv_ioda_pe *gpe;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 33e9489..90cdf49 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1824,9 +1824,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>   .get = pnv_tce_get,
>  };
>  
> +#define TCE_KILL_INVAL_ALL  PPC_BIT(0)
>  #define TCE_KILL_INVAL_PE   PPC_BIT(1)
>  #define TCE_KILL_INVAL_TCE  PPC_BIT(2)
>  
> +void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm)
> +{
> + const unsigned long val = TCE_KILL_INVAL_ALL;
> +
> + mb(); /* Ensure previous TCE table stores are visible */
> + if (rm)
> + __raw_rm_writeq(cpu_to_be64(val),
> + (__be64 __iomem *)
> + phb->ioda.tce_inval_reg_phys);
> + else
> + __raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
> +}
> +
>  static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  {
>   /* 01xb - invalidate TCEs that match the specified PE# */
> @@ -1847,7 +1861,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>   if (!npe || npe->phb->type != PNV_PHB_NPU)
>   continue;
>  
> - pnv_npu_tce_invalidate_entire(npe);
> + pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
>   }
>  }
>  
> @@ -1896,14 +1910,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>   index, npages);
>  
>   if (pe->flags & PNV_IODA_PE_PEER)
> - /* Invalidate PEs using the same TCE table */
> + /*
> +  * The NVLink hardware does not support TCE kill
> +  * per TCE entry so we have to invalidate
> +  * the entire cache for it.
> +  */
>   for (i = 0; i <

Re: [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size

2016-03-20 Thread Alistair Popple
Thanks for fixing Alexey!

On Wed, 9 Mar 2016 17:29:01 Alexey Kardashevskiy wrote:
> This uses the page size from iommu_table instead of hard-coded 4K.
> This should cause no change in behavior.
> 
> While we are here, move bits around to prepare for further rework
> which will define and use iommu_table_group_ops.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Alistair Popple 

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 778570c..5bd5fee 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>   struct pnv_phb *phb = npe->phb;
>   struct pci_dev *gpdev;
>   struct pnv_ioda_pe *gpe;
> - void *addr;
> - unsigned int size;
> + struct iommu_table *tbl;
>   int64_t rc;
>  
>   /*
> @@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>   if (!gpe)
>   return;
>  
> - addr = (void *)gpe->table_group.tables[0]->it_base;
> - size = gpe->table_group.tables[0]->it_size << 3;
> + tbl = gpe->table_group.tables[0];
>   rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
> - npe->pe_number, 1, __pa(addr),
> - size, 0x1000);
> + npe->pe_number, 1, __pa(tbl->it_base),
> + tbl->it_size << 3,
> + IOMMU_PAGE_SIZE(tbl));
>   if (rc != OPAL_SUCCESS)
>   pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
>   __func__, rc, phb->hose->global_number, npe->pe_number);
> 


[RFCv3 02/17] pseries: Add support for hash table resizing

2016-03-20 Thread David Gibson
This adds support for using experimental hypercalls to change the size
of the main hash page table while running as a PAPR guest.  For now these
hypercalls are only in experimental qemu versions.

The interface has two parts: first, H_RESIZE_HPT_PREPARE is used to allocate
and prepare the new hash table.  This may be slow, but can be done
asynchronously.  Then, H_RESIZE_HPT_COMMIT is used to switch to the new
hash table.  This requires that no CPUs be concurrently updating the HPT,
and so must be run under stop_machine().

This also adds a debugfs file which can be used to manually control
HPT resizing, for testing purposes.
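
For example, with debugfs mounted in the usual place, the new file could be
driven like this (a usage sketch; the shift value is only an example):

    # read the current HPT size as a power-of-2 shift (ppc64_pft_size)
    cat /sys/kernel/debug/powerpc/pft-size
    # request a resize to a 2^26-byte (64MiB) hash table
    echo 26 > /sys/kernel/debug/powerpc/pft-size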

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/machdep.h|   1 +
 arch/powerpc/mm/hash_utils_64.c   |  28 +
 arch/powerpc/platforms/pseries/lpar.c | 110 ++
 3 files changed, 139 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index fd22442..52f8361 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -61,6 +61,7 @@ struct machdep_calls {
   unsigned long addr,
   unsigned char *hpte_slot_array,
   int psize, int ssize, int local);
+   int (*resize_hpt)(unsigned long shift);
/*
 * Special for kexec.
 * To be called in real mode with interrupts disabled. No locks are
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 7635b1c..f27347a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1589,3 +1590,30 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
 }
+
+static int ppc64_pft_size_get(void *data, u64 *val)
+{
+   *val = ppc64_pft_size;
+   return 0;
+}
+
+static int ppc64_pft_size_set(void *data, u64 val)
+{
+   if (!ppc_md.resize_hpt)
+   return -ENODEV;
+   return ppc_md.resize_hpt(val);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size,
+   ppc64_pft_size_get, ppc64_pft_size_set, "%llu\n");
+
+static int __init hash64_debugfs(void)
+{
+   if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root,
+NULL, &fops_ppc64_pft_size)) {
+   pr_err("lpar: unable to create ppc64_pft_size debugfs file\n");
+   }
+
+   return 0;
+}
+machine_device_initcall(pseries, hash64_debugfs);
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 2415a0d..ed9738d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -27,6 +27,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -603,6 +605,113 @@ static int __init disable_bulk_remove(char *str)
 
 __setup("bulk_remove=", disable_bulk_remove);
 
+#define HPT_RESIZE_TIMEOUT 10000 /* ms */
+
+struct hpt_resize_state {
+   unsigned long shift;
+   int commit_rc;
+};
+
+static int pseries_lpar_resize_hpt_commit(void *data)
+{
+   struct hpt_resize_state *state = data;
+
+   state->commit_rc = plpar_resize_hpt_commit(0, state->shift);
+   if (state->commit_rc != H_SUCCESS)
+   return -EIO;
+
+   /* Hypervisor has transitioned the HTAB, update our globals */
+   ppc64_pft_size = state->shift;
+   htab_size_bytes = 1UL << ppc64_pft_size;
+   htab_hash_mask = (htab_size_bytes >> 7) - 1;
+
+   return 0;
+}
+
+/* Must be called in user context */
+static int pseries_lpar_resize_hpt(unsigned long shift)
+{
+   struct hpt_resize_state state = {
+   .shift = shift,
+   .commit_rc = H_FUNCTION,
+   };
+   unsigned int delay, total_delay = 0;
+   int rc;
+   ktime_t t0, t1, t2;
+
+   might_sleep();
+
+   if (!firmware_has_feature(FW_FEATURE_HPT_RESIZE))
+   return -ENODEV;
+
+   printk(KERN_INFO "lpar: Attempting to resize HPT to shift %lu\n",
+  shift);
+
+   t0 = ktime_get();
+
+   rc = plpar_resize_hpt_prepare(0, shift);
+   while (H_IS_LONG_BUSY(rc)) {
+   delay = get_longbusy_msecs(rc);
+   total_delay += delay;
+   if (total_delay > HPT_RESIZE_TIMEOUT) {
+   /* prepare call with shift==0 cancels an
+* in-progress resize */
+   rc = plpar_resize_hpt_prepare(0, 0);
+   if (rc != H_SUCCESS)
+   printk(KERN_WARNING
+  "lpar: Unexpected error %d cancelling timed out HPT resize\n",
+   

[RFCv3 01/17] pseries: Add hypercall wrappers for hash page table resizing

2016-03-20 Thread David Gibson
This adds the hypercall numbers and wrapper functions for the hash page
table resizing hypercalls.

These are experimental "platform specific" values for now, until we have a
formal PAPR update.

It also adds a new firmware feature flag to track the presence of the
HPT resizing calls.

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/firmware.h   |  5 +++--
 arch/powerpc/include/asm/hvcall.h |  2 ++
 arch/powerpc/include/asm/plpar_wrappers.h | 12 
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index b062924..32435d2 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -42,7 +42,7 @@
#define FW_FEATURE_SPLPAR  ASM_CONST(0x0010)
#define FW_FEATURE_LPAR    ASM_CONST(0x0040)
#define FW_FEATURE_PS3_LV1 ASM_CONST(0x0080)
-/* Free                   ASM_CONST(0x0100) */
+#define FW_FEATURE_HPT_RESIZE  ASM_CONST(0x0100)
#define FW_FEATURE_CMO     ASM_CONST(0x0200)
#define FW_FEATURE_VPHN    ASM_CONST(0x0400)
#define FW_FEATURE_XCMO    ASM_CONST(0x0800)
@@ -66,7 +66,8 @@ enum {
FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN,
+   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_HPT_RESIZE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 0bc9c28..d9d0891 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -294,6 +294,8 @@
 
 /* Platform specific hcalls, used by KVM */
 #define H_RTAS 0xf000
+#define H_RESIZE_HPT_PREPARE   0xf003
+#define H_RESIZE_HPT_COMMIT    0xf004
 
 /* "Platform specific hcalls", provided by PHYP */
 #define H_GET_24X7_CATALOG_PAGE0xF078
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index 1b39424..b7ee6d9 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -242,6 +242,18 @@ static inline long plpar_pte_protect(unsigned long flags, unsigned long ptex,
return plpar_hcall_norets(H_PROTECT, flags, ptex, avpn);
 }
 
+static inline long plpar_resize_hpt_prepare(unsigned long flags,
+   unsigned long shift)
+{
+   return plpar_hcall_norets(H_RESIZE_HPT_PREPARE, flags, shift);
+}
+
+static inline long plpar_resize_hpt_commit(unsigned long flags,
+  unsigned long shift)
+{
+   return plpar_hcall_norets(H_RESIZE_HPT_COMMIT, flags, shift);
+}
+
 static inline long plpar_tce_get(unsigned long liobn, unsigned long ioba,
unsigned long *tce_ret)
 {
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 8c80588..7b287be 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -63,6 +63,7 @@ hypertas_fw_features_table[] = {
{FW_FEATURE_VPHN,   "hcall-vphn"},
{FW_FEATURE_SET_MODE,   "hcall-set-mode"},
{FW_FEATURE_BEST_ENERGY,"hcall-best-energy-1*"},
+   {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"},
 };
 
 /* Build up the firmware features bitmask using the contents of
-- 
2.5.0


[RFCv3 05/17] powerpc/kvm: Correctly report KVM_CAP_PPC_ALLOC_HTAB

2016-03-20 Thread David Gibson
At present KVM on powerpc always reports KVM_CAP_PPC_ALLOC_HTAB as enabled.
However, the ioctl() it advertises (KVM_PPC_ALLOCATE_HTAB) only actually
works on KVM HV.  On KVM PR it will fail with ENOTTY.

qemu already has a workaround for this, so it's not breaking things in
practice, but it would be better to advertise this correctly.

Signed-off-by: David Gibson 
---
 arch/powerpc/kvm/powerpc.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 19aa59b..1803c96 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -521,7 +521,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_SPAPR_TCE_64:
-   case KVM_CAP_PPC_ALLOC_HTAB:
case KVM_CAP_PPC_RTAS:
case KVM_CAP_PPC_FIXUP_HCALL:
case KVM_CAP_PPC_ENABLE_HCALL:
@@ -530,6 +529,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #endif
r = 1;
break;
+
+   case KVM_CAP_PPC_ALLOC_HTAB:
+   r = hv_enabled;
+   break;
 #endif /* CONFIG_PPC_BOOK3S_64 */
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_SMT:
-- 
2.5.0


[RFCv3 00/17] PAPR HPT resizing, guest & host side

2016-03-20 Thread David Gibson
This is an implementation of the kernel parts of the PAPR hashed page
table (HPT) resizing extension.

It contains a complete guest-side implementation - or as complete as
it can be until we have a final PAPR change.

It also contains a draft host side implementation for KVM HV (the KVM
PR and TCG host-side implementations live in qemu).  This works, but
is very slow in the critical section (where the guest must be
stopped).  It is significantly slower than the TCG/PR implementation;
unusably slow for large hash tables (~2.8s for a 1G HPT).

I'm still looking into the cause of the slowness, and I'm not
sure yet whether the current approach can be tweaked to be fast enough, or
whether it will require a new approach.

Changes since RFCv2:
  * Completely new approach to handling KVM HV implementation.  Much
simpler synchronization requirements, but also slower
  * Rebase to latest Linus' tree
  * Changed number for capability, so as not to collide
  * Host side now actually works

David Gibson (17):
  pseries: Add hypercall wrappers for hash page table resizing
  pseries: Add support for hash table resizing
  pseries: Advertise HPT resizing support via CAS
  pseries: Automatically resize HPT for memory hot add/remove
  powerpc/kvm: Correctly report KVM_CAP_PPC_ALLOC_HTAB
  powerpc/kvm: Add capability flag for hashed page table resizing
  powerpc/kvm: Rename kvm_alloc_hpt() for clarity
  powerpc/kvm: Gather HPT related variables into sub-structure
  powerpc/kvm: Don't store values derivable from HPT order
  powerpc/kvm: Split HPT allocation from activation
  powerpc/kvm: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size
  powerpc/kvm: Create kvmppc_unmap_hpte_helper()
  powerpc/kvm: KVM-HV HPT resizing stub implementation
  powerpc/kvm: Outline of KVM-HV HPT resizing implementation
  powerpc/kvm: KVM-HV HPT resizing, preparation path
  powerpc/kvm: KVM-HV HPT resizing, commit path
  powerpc/kvm: Advertise availability of HPT resizing on KVM HV

 arch/powerpc/include/asm/firmware.h   |   5 +-
 arch/powerpc/include/asm/hvcall.h |   2 +
 arch/powerpc/include/asm/kvm_book3s.h |  12 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  15 +
 arch/powerpc/include/asm/kvm_host.h   |  17 +-
 arch/powerpc/include/asm/kvm_ppc.h|  11 +-
 arch/powerpc/include/asm/machdep.h|   1 +
 arch/powerpc/include/asm/plpar_wrappers.h |  12 +
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/include/asm/sparsemem.h  |   1 +
 arch/powerpc/kernel/prom_init.c   |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 626 --
 arch/powerpc/kvm/book3s_hv.c  |  37 +-
 arch/powerpc/kvm/book3s_hv_builtin.c  |   8 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |  68 ++--
 arch/powerpc/kvm/powerpc.c|  17 +-
 arch/powerpc/mm/hash_utils_64.c   |  57 +++
 arch/powerpc/mm/mem.c |   4 +
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 arch/powerpc/platforms/pseries/lpar.c | 110 ++
 include/uapi/linux/kvm.h  |   1 +
 21 files changed, 825 insertions(+), 183 deletions(-)

-- 
2.5.0


[RFCv3 08/17] powerpc/kvm: Gather HPT related variables into sub-structure

2016-03-20 Thread David Gibson
Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_host.h | 16 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 90 ++---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 -
 4 files changed, 87 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index d7b3431..549e3ae 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -226,11 +226,19 @@ struct kvm_arch_memory_slot {
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 };
 
+struct kvm_hpt_info {
+   unsigned long virt;
+   struct revmap_entry *rev;
+   unsigned long npte;
+   unsigned long mask;
+   u32 order;
+   int cma;
+};
+
 struct kvm_arch {
unsigned int lpid;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   unsigned long hpt_virt;
-   struct revmap_entry *revmap;
+   struct kvm_hpt_info hpt;
unsigned int host_lpid;
unsigned long host_lpcr;
unsigned long sdr1;
@@ -239,14 +247,10 @@ struct kvm_arch {
unsigned long lpcr;
unsigned long vrma_slb_v;
int hpte_setup_done;
-   u32 hpt_order;
atomic_t vcpus_running;
u32 online_vcores;
-   unsigned long hpt_npte;
-   unsigned long hpt_mask;
atomic_t hpte_mod_interest;
cpumask_t need_tlb_flush;
-   int hpt_cma_alloc;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 1164ab6..152534c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -61,12 +61,12 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
order = PPC_MIN_HPT_ORDER;
}
 
-   kvm->arch.hpt_cma_alloc = 0;
+   kvm->arch.hpt.cma = 0;
page = kvm_alloc_hpt_cma(1ul << (order - PAGE_SHIFT));
if (page) {
hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
memset((void *)hpt, 0, (1ul << order));
-   kvm->arch.hpt_cma_alloc = 1;
+   kvm->arch.hpt.cma = 1;
}
 
/* Lastly try successively smaller sizes from the page allocator */
@@ -81,20 +81,20 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
if (!hpt)
return -ENOMEM;
 
-   kvm->arch.hpt_virt = hpt;
-   kvm->arch.hpt_order = order;
+   kvm->arch.hpt.virt = hpt;
+   kvm->arch.hpt.order = order;
/* HPTEs are 2**4 bytes long */
-   kvm->arch.hpt_npte = 1ul << (order - 4);
+   kvm->arch.hpt.npte = 1ul << (order - 4);
/* 128 (2**7) bytes in each HPTEG */
-   kvm->arch.hpt_mask = (1ul << (order - 7)) - 1;
+   kvm->arch.hpt.mask = (1ul << (order - 7)) - 1;
 
/* Allocate reverse map array */
-   rev = vmalloc(sizeof(struct revmap_entry) * kvm->arch.hpt_npte);
+   rev = vmalloc(sizeof(struct revmap_entry) * kvm->arch.hpt.npte);
if (!rev) {
pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
goto out_freehpt;
}
-   kvm->arch.revmap = rev;
+   kvm->arch.hpt.rev = rev;
kvm->arch.sdr1 = __pa(hpt) | (order - 18);
 
pr_info("KVM guest htab at %lx (order %ld), LPID %x\n",
@@ -105,7 +105,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
return 0;
 
  out_freehpt:
-   if (kvm->arch.hpt_cma_alloc)
+   if (kvm->arch.hpt.cma)
kvm_free_hpt_cma(page, 1 << (order - PAGE_SHIFT));
else
free_pages(hpt, order - PAGE_SHIFT);
@@ -127,10 +127,10 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
goto out;
}
}
-   if (kvm->arch.hpt_virt) {
-   order = kvm->arch.hpt_order;
+   if (kvm->arch.hpt.virt) {
+   order = kvm->arch.hpt.order;
/* Set the entire HPT to 0, i.e. invalid HPTEs */
-   memset((void *)kvm->arch.hpt_virt, 0, 1ul << order);
+   memset((void *)kvm->arch.hpt.virt, 0, 1ul << order);
/*
 * Reset all the reverse-mapping chains for all memslots
 */
@@ -151,13 +151,13 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
 void kvmppc_free_hpt(struct kvm *kvm)
 {
kvmppc_free_lpid(kvm->arch.lpid);
-   vfree(kvm->arch.revmap);
-   if (kvm->arch.hpt_cma_alloc)
-   kvm_free_hpt_cma(virt_to_page(kvm->arch.hpt_virt),
- 

[RFCv3 04/17] pseries: Automatically resize HPT for memory hot add/remove

2016-03-20 Thread David Gibson
We've now implemented code in the pseries platform to use the new PAPR
interface to allow resizing the hash page table (HPT) at runtime.

This patch uses that interface to automatically attempt to resize the HPT
when memory is hot added or removed.  This tries to always keep the HPT at
a reasonable size for our current memory size.

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/hash_utils_64.c  | 29 +
 arch/powerpc/mm/mem.c|  4 
 3 files changed, 34 insertions(+)

diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index f6fc0ee..737335c 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -16,6 +16,7 @@
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
 extern int create_section_mapping(unsigned long start, unsigned long end);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 #ifdef CONFIG_NUMA
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index f27347a..8ae9097 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -647,6 +647,35 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+void resize_hpt_for_hotplug(unsigned long new_mem_size)
+{
+   unsigned target_hpt_shift;
+
+   if (!ppc_md.resize_hpt)
+   return;
+
+   target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
+
+   /*
+* To avoid lots of HPT resizes if memory size is fluctuating
+* across a boundary, we deliberately have some hysteresis
+* here: we immediately increase the HPT size if the target
+* shift exceeds the current shift, but we won't attempt to
+* reduce unless the target shift is at least 2 below the
+* current shift
+*/
+   if ((target_hpt_shift > ppc64_pft_size)
+   || (target_hpt_shift < (ppc64_pft_size - 1))) {
+   int rc;
+
+   rc = ppc_md.resize_hpt(target_hpt_shift);
+   if (rc)
+   printk(KERN_WARNING
+  "Unable to resize hash page table to target order %d: %d\n",
+  target_hpt_shift, rc);
+   }
+}
+
 int create_section_mapping(unsigned long start, unsigned long end)
 {
int rc = htab_bolt_mapping(start, end, __pa(start),
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ac79dbd..be733be 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -121,6 +121,8 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
pgdata = NODE_DATA(nid);
 
start = (unsigned long)__va(start);
@@ -161,6 +163,8 @@ int arch_remove_memory(u64 start, u64 size)
 */
vm_unmap_aliases();
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
return ret;
 }
 #endif
-- 
2.5.0


[RFCv3 03/17] pseries: Advertise HPT resizing support via CAS

2016-03-20 Thread David Gibson
The hypervisor needs to know a guest is capable of using the HPT resizing
PAPR extension in order to take full advantage of it for memory hotplug.

If the hypervisor knows the guest is HPT resize aware, it can size the
initial HPT based on the initial guest RAM size, relying on the guest to
resize the HPT when more memory is hot-added.  Without this, the hypervisor
must size the HPT for the maximum possible guest RAM, which can lead to
a huge waste of space if the guest never actually expands to that maximum
size.

This patch advertises the guest's support for HPT resizing via the
ibm,client-architecture-support OF interface.  Obviously, the actual
encoding in the CAS vector is tentative until the extension is officially
incorporated into PAPR.  For now we use bit 0 of (previously unused) byte 8
of option vector 5.

Signed-off-by: David Gibson 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/prom.h | 1 +
 arch/powerpc/kernel/prom_init.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..ef08208 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -151,6 +151,7 @@ struct of_drconf_cell {
 #define OV5_XCMO   0x0440  /* Page Coalescing */
 #define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
+#define OV5_HPT_RESIZE 0x0880  /* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG 0x0E80  /* PFO Random Number Generator */
 #define OV5_PFO_HW_842 0x0E40  /* PFO Compression Accelerator */
#define OV5_PFO_HW_ENCR    0x0E20  /* PFO Encryption Accelerator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index da51925..c6feafb 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -713,7 +713,7 @@ unsigned char ibm_architecture_vec[] = {
OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
0,
0,
-   0,
+   OV5_FEAT(OV5_HPT_RESIZE),
/* WARNING: The offset of the "number of cores" field below
 * must match by the macro below. Update the definition if
 * the structure layout changes.
-- 
2.5.0


[RFCv3 06/17] powerpc/kvm: Add capability flag for hashed page table resizing

2016-03-20 Thread David Gibson
This adds a new powerpc-specific KVM_CAP_SPAPR_RESIZE_HPT capability to
advertise whether KVM is capable of handling the PAPR extensions for
resizing the hashed page table during guest runtime.

At present, HPT resizing is possible with KVM PR without kernel
modification, since the HPT is managed within qemu.  It's not possible yet
with KVM HV, because the HPT is managed by KVM.  At present, qemu has to
use other capabilities which (by accident) reveal whether PR or HV is in
use to know if it can advertise HPT resizing capability to the guest.

To avoid ambiguity with existing kernels, the encoding is a bit odd.
0 means "unknown", since that's what previous kernels will return.
1 means "HPT resize possible if and only if the HPT is allocated in
  userspace, rather than in the kernel".  In practice this is the same
  test as userspace already uses, but this makes it explicit.
2 will mean "HPT resize available and implemented in-kernel".

For now we always return 1, but the intention is to return 2 once HPT
resize is implemented for KVM HV.
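
For illustration, a sketch of how userspace could consume this encoding
(assuming an open VM fd; KVM_CHECK_EXTENSION returns the capability's value):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int hpt_resize_level(int vm_fd)
    {
            /* 0 = unknown (old kernel), 1 = resize works iff the HPT is
             * allocated in userspace (PR/TCG), 2 = in-kernel (HV) resize */
            int v = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_RESIZE_HPT);

            return v < 0 ? 0 : v;
    }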

Signed-off-by: David Gibson 
---
 arch/powerpc/kvm/powerpc.c | 3 +++
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1803c96..55ab059 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -587,6 +587,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SPAPR_MULTITCE:
r = 1;
break;
+   case KVM_CAP_SPAPR_RESIZE_HPT:
+   r = 1; /* resize allowed only if HPT is outside kernel */
+   break;
 #endif
default:
r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a7f1f80..5374bd8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -865,6 +865,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_SPAPR_TCE_64 125
 #define KVM_CAP_ARM_PMU_V3 126
 #define KVM_CAP_VCPU_ATTRIBUTES 127
+#define KVM_CAP_SPAPR_RESIZE_HPT 128
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.5.0


[RFCv3 09/17] powerpc/kvm: Don't store values derivable from HPT order

2016-03-20 Thread David Gibson
Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size.  The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 12 
 arch/powerpc/include/asm/kvm_host.h  |  2 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c  | 28 +---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 18 +-
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 7529aab..9f762aa 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -435,6 +435,18 @@ extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
 
 extern void kvmhv_rm_send_ipi(int cpu);
 
+static inline unsigned long kvmppc_hpt_npte(struct kvm_hpt_info *hpt)
+{
+   /* HPTEs are 2**4 bytes long */
+   return 1UL << (hpt->order - 4);
+}
+
+static inline unsigned long kvmppc_hpt_mask(struct kvm_hpt_info *hpt)
+{
+   /* 128 (2**7) bytes in each HPTEG */
+   return (1UL << (hpt->order - 7)) - 1;
+}
+
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 549e3ae..4c4f325 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -229,8 +229,6 @@ struct kvm_arch_memory_slot {
 struct kvm_hpt_info {
unsigned long virt;
struct revmap_entry *rev;
-   unsigned long npte;
-   unsigned long mask;
u32 order;
int cma;
 };
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 152534c..c057c81 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -83,13 +83,9 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 
kvm->arch.hpt.virt = hpt;
kvm->arch.hpt.order = order;
-   /* HPTEs are 2**4 bytes long */
-   kvm->arch.hpt.npte = 1ul << (order - 4);
-   /* 128 (2**7) bytes in each HPTEG */
-   kvm->arch.hpt.mask = (1ul << (order - 7)) - 1;
 
/* Allocate reverse map array */
-   rev = vmalloc(sizeof(struct revmap_entry) * kvm->arch.hpt.npte);
+   rev = vmalloc(sizeof(struct revmap_entry) * kvmppc_hpt_npte(&kvm->arch.hpt));
if (!rev) {
pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
goto out_freehpt;
@@ -192,8 +188,8 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
if (npages > 1ul << (40 - porder))
npages = 1ul << (40 - porder);
/* Can't use more than 1 HPTE per HPTEG */
-   if (npages > kvm->arch.hpt.mask + 1)
-   npages = kvm->arch.hpt.mask + 1;
+   if (npages > kvmppc_hpt_mask(&kvm->arch.hpt) + 1)
+   npages = kvmppc_hpt_mask(&kvm->arch.hpt) + 1;
 
hp0 = HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)) |
HPTE_V_BOLTED | hpte0_pgsize_encoding(psize);
@@ -203,7 +199,8 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
for (i = 0; i < npages; ++i) {
addr = i << porder;
/* can't use hpt_hash since va > 64 bits */
-   hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & kvm->arch.hpt.mask;
+   hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25)))
+   & kvmppc_hpt_mask(&kvm->arch.hpt);
/*
 * We assume that the hash table is empty and no
 * vcpus are using it at this stage.  Since we create
@@ -1268,7 +1265,7 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf,
 
/* Skip uninteresting entries, i.e. clean on not-first pass */
if (!first_pass) {
-   while (i < kvm->arch.hpt.npte &&
+   while (i < kvmppc_hpt_npte(&kvm->arch.hpt) &&
   !hpte_dirty(revp, hptp)) {
++i;
hptp += 2;
@@ -1278,7 +1275,7 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf,
hdr.index = i;
 
/* Grab a series of valid entries */
-   while (i < kvm->arch.hpt.npte &&
+   while (i < kvmppc_hpt_npte(&kvm->arch.hpt) &&
   hdr.n_valid < 0x &&
   nb + HPTE_SIZE < count &&
   record_hpte(flags, hptp, hpte, revp, 1, first_pass)) {
@@ -1294,7 +1291,7 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf,
++revp;
}
/* Now skip invalid entries while we can */
-   while 

[RFCv3 07/17] powerpc/kvm: Rename kvm_alloc_hpt() for clarity

2016-03-20 Thread David Gibson
The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
all obvious from the names.  In practice kvmppc_alloc_hpt() allocates an HPT
by whatever means, and calls kvm_alloc_hpt(), which will attempt to allocate
it with CMA only.

To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_ppc.h   | 4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c  | 8 
 arch/powerpc/kvm/book3s_hv_builtin.c | 8 
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2544eda..49cb8b4 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -186,8 +186,8 @@ extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long tce_value, unsigned long npages);
 extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba);
-extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
-extern void kvm_release_hpt(struct page *page, unsigned long nr_pages);
+extern struct page *kvm_alloc_hpt_cma(unsigned long nr_pages);
+extern void kvm_free_hpt_cma(struct page *page, unsigned long nr_pages);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern void kvmppc_core_free_memslot(struct kvm *kvm,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index c7b78d8..1164ab6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -62,7 +62,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
}
 
kvm->arch.hpt_cma_alloc = 0;
-   page = kvm_alloc_hpt(1ul << (order - PAGE_SHIFT));
+   page = kvm_alloc_hpt_cma(1ul << (order - PAGE_SHIFT));
if (page) {
hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
memset((void *)hpt, 0, (1ul << order));
@@ -106,7 +106,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 
  out_freehpt:
if (kvm->arch.hpt_cma_alloc)
-   kvm_release_hpt(page, 1 << (order - PAGE_SHIFT));
+   kvm_free_hpt_cma(page, 1 << (order - PAGE_SHIFT));
else
free_pages(hpt, order - PAGE_SHIFT);
return -ENOMEM;
@@ -153,8 +153,8 @@ void kvmppc_free_hpt(struct kvm *kvm)
kvmppc_free_lpid(kvm->arch.lpid);
vfree(kvm->arch.revmap);
if (kvm->arch.hpt_cma_alloc)
-   kvm_release_hpt(virt_to_page(kvm->arch.hpt_virt),
-   1 << (kvm->arch.hpt_order - PAGE_SHIFT));
+   kvm_free_hpt_cma(virt_to_page(kvm->arch.hpt_virt),
+1 << (kvm->arch.hpt_order - PAGE_SHIFT));
else
free_pages(kvm->arch.hpt_virt,
   kvm->arch.hpt_order - PAGE_SHIFT);
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 5f0380d..16cd00c 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -49,19 +49,19 @@ static int __init early_parse_kvm_cma_resv(char *p)
 }
 early_param("kvm_cma_resv_ratio", early_parse_kvm_cma_resv);
 
-struct page *kvm_alloc_hpt(unsigned long nr_pages)
+struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
 {
VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
 
return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES));
 }
-EXPORT_SYMBOL_GPL(kvm_alloc_hpt);
+EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
 
-void kvm_release_hpt(struct page *page, unsigned long nr_pages)
+void kvm_free_hpt_cma(struct page *page, unsigned long nr_pages)
 {
cma_release(kvm_cma, page, nr_pages);
 }
-EXPORT_SYMBOL_GPL(kvm_release_hpt);
+EXPORT_SYMBOL_GPL(kvm_free_hpt_cma);
 
 /**
  * kvm_cma_reserve() - reserve area for kvm hash pagetable
-- 
2.5.0


[RFCv3 10/17] powerpc/kvm: Split HPT allocation from activation

2016-03-20 Thread David Gibson
Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM.  For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.

So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function.  Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.

We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
that previously, if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator.  Now, we
try first CMA, then the page allocator at each size.  As far as I can tell
this change should be harmless.

To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
call to kvmppc_free_lpid() that was there, we move to the single caller.
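
As a sketch, the retry behaviour in the caller now looks roughly like
this (abbreviated; the actual hunk is in book3s_hv.c, and the kvm/info
setup around it is elided):

    struct kvm_hpt_info info;
    int order = KVM_DEFAULT_HPT_ORDER;
    int err;

    /* Each attempt tries CMA first, then the page allocator */
    while ((err = kvmppc_allocate_hpt(&info, order)) == -ENOMEM &&
           order > PPC_MIN_HPT_ORDER)
            --order;
    if (!err)
            kvmppc_set_hpt(kvm, &info);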

Signed-off-by: David Gibson 

---
 arch/powerpc/include/asm/kvm_book3s_64.h |  3 ++
 arch/powerpc/include/asm/kvm_ppc.h   |  5 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  | 89 
 arch/powerpc/kvm/book3s_hv.c | 18 +--
 4 files changed, 65 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9f762aa..17ea22f 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -20,6 +20,9 @@
 #ifndef __ASM_KVM_BOOK3S_64_H__
 #define __ASM_KVM_BOOK3S_64_H__
 
+/* Power architecture requires HPT is at least 256kB */
+#define PPC_MIN_HPT_ORDER  18
+
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
static inline struct kvmppc_book3s_shadow_vcpu *svcpu_get(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 49cb8b4..154dd63 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -155,9 +155,10 @@ extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
 extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
 extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);
 
-extern long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp);
+extern int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order);
+extern void kvmppc_set_hpt(struct kvm *kvm, struct kvm_hpt_info *info);
 extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp);
-extern void kvmppc_free_hpt(struct kvm *kvm);
+extern void kvmppc_free_hpt(struct kvm_hpt_info *info);
 extern long kvmppc_prepare_vrma(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index c057c81..518b573 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -40,74 +40,69 @@
 
 #include "trace_hv.h"
 
-/* Power architecture requires HPT is at least 256kB */
-#define PPC_MIN_HPT_ORDER  18
-
 static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
long pte_index, unsigned long pteh,
unsigned long ptel, unsigned long *pte_idx_ret);
 static void kvmppc_rmap_reset(struct kvm *kvm);
 
-long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
+int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
 {
-   unsigned long hpt = 0;
-   struct revmap_entry *rev;
+   unsigned long hpt;
+   int cma;
struct page *page = NULL;
-   long order = KVM_DEFAULT_HPT_ORDER;
-
-   if (htab_orderp) {
-   order = *htab_orderp;
-   if (order < PPC_MIN_HPT_ORDER)
-   order = PPC_MIN_HPT_ORDER;
-   }
+   struct revmap_entry *rev;
+   unsigned long npte;
 
-   kvm->arch.hpt.cma = 0;
+   hpt = 0;
+   cma = 0;
page = kvm_alloc_hpt_cma(1ul << (order - PAGE_SHIFT));
if (page) {
hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
memset((void *)hpt, 0, (1ul << order));
-   kvm->arch.hpt.cma = 1;
+   cma = 1;
}
 
-   /* Lastly try successively smaller sizes from the page allocator */
-   /* Only do this if userspace didn't specify a size via ioctl */
-   while (!hpt && order > PPC_MIN_HPT_ORDER && !htab_orderp) {
-   hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
-  __GFP_NOWARN, order - PAGE_SHIFT);
-   if (!hpt)
-   --order;
-   }
+   if (!hpt)
+   hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|

[RFCv3 11/17] powerpc/kvm: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size

2016-03-20 Thread David Gibson
The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
table (HPT) that userspace expects a guest VM to have, and is also used to
clear that HPT when necessary (e.g. guest reboot).

At present, once the ioctl() is called for the first time, the HPT size can
never be changed thereafter - it will be cleared but always sized as from
the first call.

With upcoming HPT resize implementation, we're going to need to allow
userspace to resize the HPT at reset (to change it back to the default size
if the guest changed it).

So, we need to allow this ioctl() to change the HPT size.
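
From userspace the call sequence is unchanged, only the semantics
differ.  A minimal sketch (vm_fd is an open KVM VM file descriptor;
error handling kept to a minimum):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void reset_hpt(int vm_fd)
    {
            /* At guest reset, request a different order and have the
             * HPT reallocated rather than merely cleared. */
            uint32_t htab_order = 24;       /* e.g. back to a 16 MiB HPT */

            if (ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, &htab_order) < 0)
                    perror("KVM_PPC_ALLOCATE_HTAB");
    }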

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 52 -
 arch/powerpc/kvm/book3s_hv.c|  5 +---
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 154dd63..5a1daa0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -157,7 +157,7 @@ extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);
 
 extern int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order);
 extern void kvmppc_set_hpt(struct kvm *kvm, struct kvm_hpt_info *info);
-extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp);
+extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, int order);
 extern void kvmppc_free_hpt(struct kvm_hpt_info *info);
 extern long kvmppc_prepare_vrma(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 518b573..e975c5a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -103,10 +103,22 @@ void kvmppc_set_hpt(struct kvm *kvm, struct kvm_hpt_info *info)
info->virt, (long)info->order, kvm->arch.lpid);
 }
 
-long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
+void kvmppc_free_hpt(struct kvm_hpt_info *info)
+{
+   vfree(info->rev);
+   if (info->cma)
+   kvm_free_hpt_cma(virt_to_page(info->virt),
+1 << (info->order - PAGE_SHIFT));
+   else
+   free_pages(info->virt, info->order - PAGE_SHIFT);
+   info->virt = 0;
+   info->order = 0;
+}
+
+long kvmppc_alloc_reset_hpt(struct kvm *kvm, int order)
 {
long err = -EBUSY;
-   long order;
+   struct kvm_hpt_info info;
 
mutex_lock(&kvm->lock);
if (kvm->arch.hpte_setup_done) {
@@ -118,8 +130,9 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
goto out;
}
}
-   if (kvm->arch.hpt.virt) {
-   order = kvm->arch.hpt.order;
+   if (kvm->arch.hpt.order == order) {
+   /* We already have a suitable HPT */
+
/* Set the entire HPT to 0, i.e. invalid HPTEs */
memset((void *)kvm->arch.hpt.virt, 0, 1ul << order);
/*
@@ -128,33 +141,24 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
kvmppc_rmap_reset(kvm);
/* Ensure that each vcpu will flush its TLB on next entry. */
cpumask_setall(&kvm->arch.need_tlb_flush);
-   *htab_orderp = order;
err = 0;
-   } else {
-   struct kvm_hpt_info info;
-
-   err = kvmppc_allocate_hpt(&info, *htab_orderp);
-   if (err < 0)
-   goto out;
-   kvmppc_set_hpt(kvm, &info);
+   goto out;
}
+
+   if (kvm->arch.hpt.virt)
+   kvmppc_free_hpt(&kvm->arch.hpt);
+
+   
+   err = kvmppc_allocate_hpt(&info, order);
+   if (err < 0)
+   goto out;
+   kvmppc_set_hpt(kvm, &info);
+   
  out:
mutex_unlock(&kvm->lock);
return err;
 }
 
-void kvmppc_free_hpt(struct kvm_hpt_info *info)
-{
-   vfree(info->rev);
-   if (info->cma)
-   kvm_free_hpt_cma(virt_to_page(info->virt),
-1 << (info->order - PAGE_SHIFT));
-   else
-   free_pages(info->virt, info->order - PAGE_SHIFT);
-   info->virt = 0;
-   info->order = 0;
-}
-
 /* Bits in first HPTE dword for pagesize 4k, 64k or 16M */
 static inline unsigned long hpte0_pgsize_encoding(unsigned long pgsize)
 {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 18eb106..2289ce3 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3301,12 +3301,9 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
r = -EFAULT;
if (get_user(htab_order, (u32 __user *)argp))
break;
-   r = kvmppc_alloc_reset_hpt(kvm, &htab_order);
+   r = kvmppc_alloc_reset_hpt(kvm, htab_order);
if (r)

[RFCv3 12/17] powerpc/kvm: Create kvmppc_unmap_hpte_helper()

2016-03-20 Thread David Gibson
The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used
to force all guest mappings of a particular host page to be set ABSENT, and
removed from the reverse mappings.

For HPT resizing, we will have some cases where we want to set just a
single guest HPTE ABSENT and remove its reverse mappings.  To prepare for
this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE,
moving it to a new helper function.
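
For reviewers, the calling convention mirrors the existing call site in
kvm_unmap_rmapp() (sketch only, locking shown in outline):

    /* Both the rmap chain and the HPTE must be locked by the caller */
    lock_rmap(rmapp);
    /* ... HPTE lock taken via try_lock_hpte(hptep, HPTE_V_HVLOCK) ... */
    kvmppc_unmap_hpte(kvm, i, rmapp, gfn);
    unlock_rmap(rmapp);
    __unlock_hpte(hptep, be64_to_cpu(hptep[0]));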

Signed-off-by: David Gibson 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 75 +
 1 file changed, 43 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index e975c5a..89878a4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -710,13 +710,52 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
return kvm_handle_hva_range(kvm, hva, hva + 1, handler);
 }
 
+/* Must be called with both HPTE and rmap locked */
+static void kvmppc_unmap_hpte(struct kvm *kvm, unsigned long idx,
+ unsigned long *rmapp, unsigned long gfn)
+{
+   __be64 *hptep = (__be64 *) (kvm->arch.hpt.virt + (idx << 4));
+   struct revmap_entry *rev = kvm->arch.hpt.rev;
+   unsigned long j, h;
+   unsigned long ptel, psize, rcbits;
+
+   j = rev[idx].forw;
+   if (j == idx) {
+   /* chain is now empty */
+   *rmapp &= ~(KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_INDEX);
+   } else {
+   /* remove idx from chain */
+   h = rev[idx].back;
+   rev[h].forw = j;
+   rev[j].back = h;
+   rev[idx].forw = rev[idx].back = idx;
+   *rmapp = (*rmapp & ~KVMPPC_RMAP_INDEX) | j;
+   }
+
+   /* Now check and modify the HPTE */
+   ptel = rev[idx].guest_rpte;
+   psize = hpte_page_size(be64_to_cpu(hptep[0]), ptel);
+   if ((be64_to_cpu(hptep[0]) & HPTE_V_VALID) &&
+   hpte_rpn(ptel, psize) == gfn) {
+   hptep[0] |= cpu_to_be64(HPTE_V_ABSENT);
+   kvmppc_invalidate_hpte(kvm, hptep, idx);
+   /* Harvest R and C */
+   rcbits = be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
+   *rmapp |= rcbits << KVMPPC_RMAP_RC_SHIFT;
+   if (rcbits & HPTE_R_C)
+   kvmppc_update_rmap_change(rmapp, psize);
+   if (rcbits & ~rev[idx].guest_rpte) {
+   rev[idx].guest_rpte = ptel | rcbits;
+   note_hpte_modification(kvm, &rev[idx]);
+   }
+   }
+}
+
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
   unsigned long gfn)
 {
-   struct revmap_entry *rev = kvm->arch.hpt.rev;
-   unsigned long h, i, j;
+   unsigned long i;
__be64 *hptep;
-   unsigned long ptel, psize, rcbits;
 
for (;;) {
lock_rmap(rmapp);
@@ -739,36 +778,8 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
cpu_relax();
continue;
}
-   j = rev[i].forw;
-   if (j == i) {
-   /* chain is now empty */
-   *rmapp &= ~(KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_INDEX);
-   } else {
-   /* remove i from chain */
-   h = rev[i].back;
-   rev[h].forw = j;
-   rev[j].back = h;
-   rev[i].forw = rev[i].back = i;
-   *rmapp = (*rmapp & ~KVMPPC_RMAP_INDEX) | j;
-   }
 
-   /* Now check and modify the HPTE */
-   ptel = rev[i].guest_rpte;
-   psize = hpte_page_size(be64_to_cpu(hptep[0]), ptel);
-   if ((be64_to_cpu(hptep[0]) & HPTE_V_VALID) &&
-   hpte_rpn(ptel, psize) == gfn) {
-   hptep[0] |= cpu_to_be64(HPTE_V_ABSENT);
-   kvmppc_invalidate_hpte(kvm, hptep, i);
-   /* Harvest R and C */
-   rcbits = be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
-   *rmapp |= rcbits << KVMPPC_RMAP_RC_SHIFT;
-   if (rcbits & HPTE_R_C)
-   kvmppc_update_rmap_change(rmapp, psize);
-   if (rcbits & ~rev[i].guest_rpte) {
-   rev[i].guest_rpte = ptel | rcbits;
-   note_hpte_modification(kvm, &rev[i]);
-   }
-   }
+   kvmppc_unmap_hpte(kvm, i, rmapp, gfn);
unlock_rmap(rmapp);
__unlock_hpte(hptep, be64_to_cpu(hptep[0]));
}
-- 
2.5.0


[RFCv3 13/17] powerpc/kvm: KVM-HV HPT resizing stub implementation

2016-03-20 Thread David Gibson
This patch adds a stub (always failing) implementation of the hypercalls
for the HPT resizing PAPR extension.

For now we include a hack which makes it safe for qemu to call ENABLE_HCALL
on these hypercalls, although it will have no effect.  That should go away
once the PAPR change is formalized and we can use "real" hcall numbers.
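
For reference, the userspace side this hack caters for would look
roughly like the sketch below (the hcall number is provisional, and
userspace would carry its own definition of it until the PAPR change
lands):

    struct kvm_enable_cap cap = {
            .cap = KVM_CAP_PPC_ENABLE_HCALL,
            .args = { H_RESIZE_HPT_PREPARE, 1 /* enable */ },
    };

    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
            perror("KVM_ENABLE_CAP");
    /* ... repeated with H_RESIZE_HPT_COMMIT ... */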

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_book3s.h |  6 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 19 +++
 arch/powerpc/kvm/book3s_hv.c  |  8 
 arch/powerpc/kvm/powerpc.c|  6 ++
 4 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 8f39796..81f2b77 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -191,6 +191,12 @@ extern void kvmppc_copy_to_svcpu(struct kvmppc_book3s_shadow_vcpu *svcpu,
 struct kvm_vcpu *vcpu);
 extern void kvmppc_copy_from_svcpu(struct kvm_vcpu *vcpu,
   struct kvmppc_book3s_shadow_vcpu *svcpu);
+extern unsigned long do_h_resize_hpt_prepare(struct kvm_vcpu *vcpu,
+unsigned long flags,
+unsigned long shift);
+extern unsigned long do_h_resize_hpt_commit(struct kvm_vcpu *vcpu,
+   unsigned long flags,
+   unsigned long shift);
 
 static inline struct kvmppc_vcpu_book3s *to_book3s(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 89878a4..0a69b64 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1129,6 +1129,25 @@ void kvmppc_unpin_guest_page(struct kvm *kvm, void *va, unsigned long gpa,
 }
 
 /*
+ * HPT resizing
+ */
+
+unsigned long do_h_resize_hpt_prepare(struct kvm_vcpu *vcpu,
+ unsigned long flags,
+ unsigned long shift)
+{
+   return H_HARDWARE;
+}
+
+unsigned long do_h_resize_hpt_commit(struct kvm_vcpu *vcpu,
+unsigned long flags,
+unsigned long shift)
+{
+   return H_HARDWARE;
+}
+
+
+/*
  * Functions for reading and writing the hash table via reads and
  * writes on a file descriptor.
  *
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2289ce3..878b4a7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -737,6 +737,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_get_gpr(vcpu, 5),
kvmppc_get_gpr(vcpu, 6));
break;
+   case H_RESIZE_HPT_PREPARE:
+   ret = do_h_resize_hpt_prepare(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5));
+   break;
+   case H_RESIZE_HPT_COMMIT:
+   ret = do_h_resize_hpt_commit(vcpu, kvmppc_get_gpr(vcpu, 4),
+kvmppc_get_gpr(vcpu, 5));
+   break;
case H_RTAS:
if (list_empty(&vcpu->kvm->arch.rtas_tokens))
return RESUME_HOST;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 55ab059..900393b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -1302,6 +1302,12 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
unsigned long hcall = cap->args[0];
 
r = -EINVAL;
+   /* Hack: until we have proper hcall numbers allocated */
+   if ((hcall == H_RESIZE_HPT_PREPARE)
+   || (hcall == H_RESIZE_HPT_COMMIT)) {
+   r = 0;
+   break;
+   }
if (hcall > MAX_HCALL_OPCODE || (hcall & 3) ||
cap->args[1] > 1)
break;
-- 
2.5.0


[RFCv3 15/17] powerpc/kvm: KVM-HV HPT resizing, preparation path

2016-03-20 Thread David Gibson
This adds code to initialize an HPT resize operation, and complete its
prepare phase, including allocating and clearing a tentative new HPT.  It
also includes corresponding code to free things afterwards.

Signed-off-by: David Gibson 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 05e8d52..acc6dd4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -54,6 +54,10 @@ struct kvm_resize_hpt {
/* These fields protected by kvm->lock */
int error;
bool prepare_done;
+
+   /* Private to the work thread, until prepare_done is true,
+* then protected by kvm->resize_hpt_sem */
+   struct kvm_hpt_info hpt;
 };
 
 #ifdef DEBUG_RESIZE_HPT
@@ -1157,6 +1161,17 @@ void kvmppc_unpin_guest_page(struct kvm *kvm, void *va, unsigned long gpa,
  */
 static int resize_hpt_allocate(struct kvm_resize_hpt *resize)
 {
+   int rc;
+
+   rc = kvmppc_allocate_hpt(&resize->hpt, resize->order);
+   if (rc == -ENOMEM)
+   return H_NO_MEM;
+   else if (rc < 0)
+   return H_HARDWARE;
+
+   resize_hpt_debug(resize, "resize_hpt_allocate(): HPT @ 0x%lx\n",
+resize->hpt.virt);
+
return H_SUCCESS;
 }
 
@@ -1172,6 +1187,10 @@ static void resize_hpt_pivot(struct kvm_resize_hpt *resize)
 static void resize_hpt_release(struct kvm *kvm, struct kvm_resize_hpt *resize)
 {
BUG_ON(kvm->arch.resize_hpt != resize);
+
+   if (resize->hpt.virt)
+   kvmppc_free_hpt(&resize->hpt);
+
kvm->arch.resize_hpt = NULL;
kfree(resize);
 }
-- 
2.5.0


[RFCv3 17/17] powerpc/kvm: Advertise availability of HPT resizing on KVM HV

2016-03-20 Thread David Gibson
This updates the KVM_CAP_SPAPR_RESIZE_HPT capability to advertise the
presence of in-kernel HPT resizing on KVM HV.  In fact the HPT resizing
isn't fully implemented, but this allows us to experiment with what's
there.

Signed-off-by: David Gibson 
---
 arch/powerpc/kvm/powerpc.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 900393b..1b59b23 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -588,7 +588,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = 1;
break;
case KVM_CAP_SPAPR_RESIZE_HPT:
-   r = 1; /* resize allowed only if HPT is outside kernel */
+   if (hv_enabled)
+   r = 2; /* In-kernel resize implementation */
+   else
+   r = 1; /* outside kernel resize allowed */
break;
 #endif
default:
-- 
2.5.0


[RFCv3 14/17] powerpc/kvm: Outline of KVM-HV HPT resizing implementation

2016-03-20 Thread David Gibson
This adds an outline (not yet working) of an implementation for the HPT
resizing PAPR extension.  Specifically it adds the work function which will
handle preparation for the resize, and synchronization between this, the
the HPT resizing hypercalls, the guest page fault path and guest HPT update
paths.

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_host.h |   3 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 177 +++-
 arch/powerpc/kvm/book3s_hv.c|   4 +
 3 files changed, 182 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4c4f325..6c41c07 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -233,6 +233,8 @@ struct kvm_hpt_info {
int cma;
 };
 
+struct kvm_resize_hpt;
+
 struct kvm_arch {
unsigned int lpid;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -251,6 +253,7 @@ struct kvm_arch {
cpumask_t need_tlb_flush;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
+   struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
struct mutex hpt_mutex;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 0a69b64..05e8d52 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -43,6 +43,30 @@
 static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
long pte_index, unsigned long pteh,
unsigned long ptel, unsigned long *pte_idx_ret);
+#define DEBUG_RESIZE_HPT   1
+
+struct kvm_resize_hpt {
+   /* These fields read-only after init */
+   struct kvm *kvm;
+   struct work_struct work;
+   u32 order;
+
+   /* These fields protected by kvm->lock */
+   int error;
+   bool prepare_done;
+};
+
+#ifdef DEBUG_RESIZE_HPT
+#define resize_hpt_debug(resize, ...)  \
+   do {\
+   printk(KERN_DEBUG "RESIZE HPT %p: ", resize);   \
+   printk(__VA_ARGS__);\
+   } while (0)
+#else
+#define resize_hpt_debug(resize, ...)  \
+   do { } while (0)
+#endif
+
 static void kvmppc_rmap_reset(struct kvm *kvm);
 
 int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
@@ -1131,19 +1155,168 @@ void kvmppc_unpin_guest_page(struct kvm *kvm, void *va, unsigned long gpa,
 /*
  * HPT resizing
  */
+static int resize_hpt_allocate(struct kvm_resize_hpt *resize)
+{
+   return H_SUCCESS;
+}
+
+static int resize_hpt_rehash(struct kvm_resize_hpt *resize)
+{
+   return H_HARDWARE;
+}
+
+static void resize_hpt_pivot(struct kvm_resize_hpt *resize)
+{
+}
+
+static void resize_hpt_release(struct kvm *kvm, struct kvm_resize_hpt *resize)
+{
+   BUG_ON(kvm->arch.resize_hpt != resize);
+   kvm->arch.resize_hpt = NULL;
+   kfree(resize);
+}
+
+static void resize_hpt_prepare_work(struct work_struct *work)
+{
+   struct kvm_resize_hpt *resize = container_of(work,
+struct kvm_resize_hpt,
+work);
+   struct kvm *kvm = resize->kvm;
+   int err;
+
+   resize_hpt_debug(resize, "resize_hpt_prepare_work(): order = %d\n",
+resize->order);
+
+   err = resize_hpt_allocate(resize);
+
+   mutex_lock(&kvm->lock);
+
+   resize->error = err;
+   resize->prepare_done = true;
+
+   mutex_unlock(&kvm->lock);
+}
 
 unsigned long do_h_resize_hpt_prepare(struct kvm_vcpu *vcpu,
  unsigned long flags,
  unsigned long shift)
 {
-   return H_HARDWARE;
+   struct kvm *kvm = vcpu->kvm;
+   struct kvm_resize_hpt *resize;
+   int ret;
+
+   if (flags != 0)
+   return H_PARAMETER;
+
+   if (shift && ((shift < 18) || (shift > 46)))
+   return H_PARAMETER;
+
+   mutex_lock(&kvm->lock);
+
+   resize = kvm->arch.resize_hpt;
+
+   if (resize) {
+   if (resize->order == shift) {
+   /* Suitable resize in progress */
+   if (resize->prepare_done) {
+   ret = resize->error;
+   if (ret != H_SUCCESS)
+   resize_hpt_release(kvm, resize);
+   } else {
+   ret = H_LONG_BUSY_ORDER_100_MSEC;
+   }
+
+   goto out;
+   }
+
+   /* not suitable, cancel it */
+   resize_hpt_release(kvm, resize);
+   }
+
+   ret = H_SUCCESS;
+   if (!shift)
+   goto out; /* 

[RFCv3 16/17] powerpc/kvm: KVM-HV HPT resizing, commit path

2016-03-20 Thread David Gibson
This adds code for the "guts" of an HPT resize operation: rehashing
HPTEs from the current HPT into the new resized HPT, and switching the
guest over to the new HPT.

This is performed by the H_RESIZE_HPT_COMMIT hypercall.  The guest is
prevented from running during this operation, to simplify
synchronization.  The guest is expected to prepare itself for a
potentially long pause before making the hcall; Linux guests use
stop_machine() for this.

To reduce the amount of stuff we need to do (and thus the latency of the
operation) we only rehash bolted entries, expecting the guest to refault
other HPTEs after the resize is complete.
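
As an outline, the guest-side sequence this expects looks something
like the sketch below (the wrapper names are illustrative, not the
actual guest implementation from earlier in the series):

    static long guest_resize_hpt(unsigned long new_shift)
    {
            long rc;

            /* Phase 1: the host allocates and prepares the new HPT;
             * prepare may complete asynchronously, so poll */
            do {
                    rc = h_resize_hpt_prepare(0, new_shift);
                    if (rc == H_LONG_BUSY_ORDER_100_MSEC)
                            msleep(100);
            } while (rc == H_LONG_BUSY_ORDER_100_MSEC);
            if (rc != H_SUCCESS)
                    return rc;

            /* Phase 2: pause all other CPUs (stop_machine() in Linux
             * guests), then switch; only bolted HPTEs are rehashed */
            return h_resize_hpt_commit(0, new_shift);
    }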

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_book3s.h |   6 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 167 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |  10 +-
 3 files changed, 174 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 81f2b77..935fbba 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -156,8 +156,10 @@ extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
bool writing, bool *writable);
-extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
-   unsigned long *rmap, long pte_index, int realmode);
+extern void kvmppc_add_revmap_chain(struct kvm_hpt_info *hpt,
+   struct revmap_entry *rev,
+   unsigned long *rmap,
+   long pte_index, int realmode);
 extern void kvmppc_update_rmap_change(unsigned long *rmap, unsigned long 
psize);
 extern void kvmppc_invalidate_hpte(struct kvm *kvm, __be64 *hptep,
unsigned long pte_index);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index acc6dd4..b6ec7f3 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -641,7 +641,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* don't lose previous R and C bits */
r |= be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
} else {
-   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
+   kvmppc_add_revmap_chain(&kvm->arch.hpt, rev, rmap, index, 0);
}
 
hptep[1] = cpu_to_be64(r);
@@ -1175,13 +1175,176 @@ static int resize_hpt_allocate(struct kvm_resize_hpt *resize)
return H_SUCCESS;
 }
 
+static unsigned long resize_hpt_rehash_hpte(struct kvm_resize_hpt *resize,
+   unsigned long idx)
+{
+   struct kvm *kvm = resize->kvm;
+   struct kvm_hpt_info *old = &kvm->arch.hpt;
+   struct kvm_hpt_info *new = &resize->hpt;
+   unsigned long old_hash_mask = (1ULL << (old->order - 7)) - 1;
+   unsigned long new_hash_mask = (1ULL << (new->order - 7)) - 1;
+   __be64 *hptep, *new_hptep;
+   unsigned long vpte, rpte, guest_rpte;
+   int ret;
+   struct revmap_entry *rev;
+   unsigned long apsize, psize, avpn, pteg, hash;
+   unsigned long new_idx, new_pteg, replace_vpte;
+
+   hptep = (__be64 *)(old->virt + (idx << 4));
+   while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
+   cpu_relax();
+
+   vpte = be64_to_cpu(hptep[0]);
+
+   ret = H_SUCCESS;
+   if (!(vpte & HPTE_V_VALID) && !(vpte & HPTE_V_ABSENT))
+   /* Nothing to do */
+   goto out;
+
+   /* Unmap */
+   rev = &old->rev[idx];
+   guest_rpte = rev->guest_rpte;
+
+   ret = H_HARDWARE;
+   apsize = hpte_page_size(vpte, guest_rpte);
+   if (!apsize)
+   goto out;
+
+   if (vpte & HPTE_V_VALID) {
+   unsigned long gfn = hpte_rpn(guest_rpte, apsize);
+   int srcu_idx = srcu_read_lock(&kvm->srcu);
+   struct kvm_memory_slot *memslot =
+   __gfn_to_memslot(kvm_memslots(kvm), gfn);
+
+   if (memslot) {
+   unsigned long *rmapp;
+   rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
+   
+   lock_rmap(rmapp);
+   kvmppc_unmap_hpte(kvm, idx, rmapp, gfn);
+   unlock_rmap(rmapp);
+   }
+
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   }
+
+   /* Reload PTE after unmap */
+   vpte = be64_to_cpu(hptep[0]);
+
+   BUG_ON(vpte & HPTE_V_VALID);
+   BUG_ON(!(vpte & HPTE_V_ABSENT));
+
+   ret = H_SUCCESS;
+   if (!(vpte & HPTE_V_BOLTED))
+   goto out;
+
+   rpte = be64_to_cpu(hptep[1]);
+   psize = hpte_base_page_size(vpte, 

Re: [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup

2016-03-20 Thread Alistair Popple
On Wed, 16 Mar 2016 16:55:50 David Gibson wrote:
> On Wed, Mar 09, 2016 at 05:29:02PM +1100, Alexey Kardashevskiy wrote:
> > NPU devices are quite specific, in fact they represent side DMA channel
> > of a GPU device. The GPU/NPU driver never actually configures DMA
> > for NPU devices, instead it relies on the platform code to propagate
> > DMA setup to NPU devices when a main GPU device is being configured.
> > When GPU is being set up, the same configuration - bypass or 32bit DMA -
> > is used for NPU. This makes DMA setup explicit.
> > 
> > pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
> > made static and prints warning as dma_set_mask() should never be called
> > on this function as in any case it will not configure GPU; so we make
> > this explicit.
> > 
> > Instead of using PNV_IODA_PE_PEER and peers[] (which next patch will
> > remove), we test every PCI device if there are corresponding NVLink
> > devices. If there are any, we propagate bypass mode to just found NPU
> > devices by calling the setup helper directly (which takes @bypass) and
> > avoid guessing (i.e. calculating from DMA mask) whether we need bypass
> > or not on NPU devices. Since DMA setup happens in very rare occasion,
> > this will not slow down booting or VFIO start/stop much.
> > 
> > This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
> > more clear what the function really does which is programming 32bit
> > table address to the TVT ("disabling bypass" means writing zeroes to
> > the TVT).
> > 
> > This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
> > the DMA configuration on NPU does not matter until dma_set_mask() is
> > called on GPU and that will do the NPU DMA configuration.
> > 
> > This removes phb->dma_dev_setup initialization for NPU as
> > pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> 
> I'm having trouble making sense of the commit message, but the actual
> changes look fine as best I can tell.

For background the NPU NVLink PCI "devices" are actually emulated in firmware 
and are mainly used for link training. Their DMA/TCE setup must match the GPU 
which is connected via PCIe and NVLink so any changes to the DMA/TCE setup on 
the GPU PCIe device need to be propagated to the NVLink device as this is what 
device drivers expect and it doesn't make much sense to do anything else.

Originally we were going to propagate DMA/TCE changes the other way (NVLink 
device to PCI device) as well, but it proved unnecessary and unused. This 
patch cleans up the last bit of that behaviour and looks good to me as well.

> Reviewed-by: David Gibson 

Reviewed-by: Alistair Popple 

> 
> 


Re: [PATCH] cxl: fix setting of _PAGE_USER bit when handling page faults

2016-03-20 Thread Andrew Donnellan

On 18/03/16 17:30, Ian Munsie wrote:

Excerpts from andrew.donnellan's message of 2016-03-18 15:01:21 +1100:

Fixes: f204e0b8cedd ("cxl: Driver code for powernv PCIe based cards for
userspace access")


It doesn't fix that since there was no cxl kernel API support at the
time, so this wasn't a regression - just something we missed when the
kernel api was added (I believe the broken test in the code was a left
over from some early bringup work and would never have been exercised on
an upstream kernel until then).


Ah, fair enough - I just looked at what git blame told me. You're right, 
it's not a fix to that commit per se. Happy to drop this tag.



We haven't run into any problems because of this that I am aware of - do
we have a test case for this?


I'd be surprised if it caused noticeable problems - the presence of the 
_PAGE_USER bit when it's not necessary shouldn't break anything, as 
opposed to the absence of _PAGE_USER when it is necessary. Not entirely 
sure what the test case would be.





-if ((!ctx->kernel) || ~(dar & (1ULL << 63)))
+if ((!ctx->kernel) || !(dar & (1ULL << 63)))


Should it be the top two bits?


benh told me that the top bit should be enough - anything above 0x8000* 
should be kernel space.
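
For anyone wanting to convince themselves, a standalone demonstration
of why the old test was always true (plain userspace C, purely
illustrative):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t user_dar   = 0x0000123456789000ULL;
            uint64_t kernel_dar = 0xc000000012345000ULL;

            /* Old test: bitwise NOT of the masked value is non-zero
             * (i.e. "true") for *any* address */
            printf("old: user %d kernel %d\n",
                   ~(user_dar & (1ULL << 63)) != 0,
                   ~(kernel_dar & (1ULL << 63)) != 0);

            /* New test: logical NOT correctly distinguishes the two */
            printf("new: user %d kernel %d\n",
                   !(user_dar & (1ULL << 63)),
                   !(kernel_dar & (1ULL << 63)));
            return 0;
    }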


--
Andrew Donnellan  Software Engineer, OzLabs
andrew.donnel...@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)IBM Australia Limited


Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group

2016-03-20 Thread David Gibson
On Wed, Mar 09, 2016 at 05:29:04PM +1100, Alexey Kardashevskiy wrote:
> NPU devices have their own TVT which means they are isolated and can be
> passed to the userspace via VFIO. The first step is to create an IOMMU
> group and attach devices to it, which is what this patch does.
> 
> This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
> then walks through all devices on the same bus to determine which NPUs
> belong to the same GPU.
> 
> This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
> loop skips NPU PEs as they do not have 32bit DMA segments.
> 
> This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
> pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.
> 
> Signed-off-by: Alexey Kardashevskiy 

I'm not entirely clear on how these devices are assigned to groups.
Do they each get their own groups, or is the NPU device in the same
group as its corresponding GPU (I would have thought the latter makes
sense).

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++
>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++
>  arch/powerpc/platforms/powernv/pci.h  |  1 +
>  3 files changed, 49 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 866d3d3..e5a5feb 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
>   }
>   }
>  }
> +
> +void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
> +{
> + struct iommu_table *tbl;
> + struct pnv_phb *phb = npe->phb;
> + struct pci_bus *pbus = phb->hose->bus;
> + struct pci_dev *npdev, *gpdev = NULL, *gptmp;
> + struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> +
> + if (!gpe || !gpdev)
> + return;
> +
> + iommu_register_group(&npe->table_group, phb->hose->global_number,
> + npe->pe_number);
> +
> + tbl = pnv_pci_table_alloc(phb->hose->node);
> +
> + list_for_each_entry(npdev, &pbus->devices, bus_list) {
> + gptmp = pnv_pci_get_gpu_dev(npdev);
> +
> + if (gptmp != gpdev)
> + continue;
> +
> + /*
> +  * The iommu_add_device() picks an IOMMU group from
> +  * the first IOMMU group attached to the iommu_table
> +  * so we need to pretend that there is a table so
> +  * iommu_add_device() can complete the job.
> +  * We unlink the temporary table from the group afterwards.
> +  */
> + pnv_pci_link_table_and_group(phb->hose->node, 0,
> + tbl, &npe->table_group);
> + set_iommu_table_base(&npdev->dev, tbl);
> + iommu_add_device(&npdev->dev);
> + set_iommu_table_base(&npdev->dev, NULL);
> + pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
> + }
> +
> + iommu_free_table(tbl, "");
> +}
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5a6cf2e..becd168 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>   remaining -= segs;
>   base += segs;
>   }
> + /*
> +  * Create an IOMMU group and add devices to it.
> +  * DMA setup is to be done via GPU's dma_set_mask().
> +  */
> + if (phb->type == PNV_PHB_NPU) {
> + list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
> + pnv_pci_npu_setup_iommu(pe);
> + }
>  }
>  
>  #ifdef CONFIG_PCI_MSI
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 06405fd..0c0083a 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  /* Nvlink functions */
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
> +extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>  
>  #endif /* __POWERNV_PCI_H */

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFCv3 00/17] PAPR HPT resizing, guest & host side

2016-03-20 Thread David Gibson
On Mon, Mar 21, 2016 at 02:53:07PM +1100, David Gibson wrote:
> This is an implementation of the kernel parts of the PAPR hashed page
> table (HPT) resizing extension.
> 
> It contains a complete guest-side implementation - or as complete as
> it can be until we have a final PAPR change.
> 
> It also contains a draft host side implementation for KVM HV (the KVM
> PR and TCG host-side implementations live in qemu).  This works, but
> is very slow in the critical section (where the guest must be
> stopped).  It is significantly slower than the TCG/PR implementation;
> unusably slow for large hash tables (~2.8s for a 1G HPT).

Since posting this, I've managed to bring this down to ~570ms for a 1G
HPT.  Still slow, but much better.  The optimization to do this was to
skip rehashing an HPTE if neither VALID|ABSENT are set *before*
locking the HPTE.  I believe this is safe, since nothing should be
able to add new VALID|ABSENT HPTEs while the guest is stopped.
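
Roughly, the change looks like this against resize_hpt_rehash_hpte()
from patch 16 (sketch):

    /* Peek before locking: safe because no new VALID or ABSENT
     * HPTEs can appear while the guest is stopped */
    vpte = be64_to_cpu(hptep[0]);
    if (!(vpte & (HPTE_V_VALID | HPTE_V_ABSENT)))
            return H_SUCCESS;       /* empty slot: skip the lock entirely */

    while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
            cpu_relax();
    /* ... re-read hptep[0] and rehash as before ... */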

> 
> I'm still looking into what's the cause of the slowness, and I'm not
> sure yet if the current approach can be tweaked to be fast enough, or
> if it will require a new approach.
> 
> Changes since RFCv2:
>   * Completely new approach to handling KVM HV implementation.  Much
> simpler synchronization requirements, but also slower
>   * Rebase to latest Linus' tree
>   * Changed number for capability, so as not to collide
>   * Host side now actually works
> 
> David Gibson (17):
>   pseries: Add hypercall wrappers for hash page table resizing
>   pseries: Add support for hash table resizing
>   pseries: Advertise HPT resizing support via CAS
>   pseries: Automatically resize HPT for memory hot add/remove
>   powerpc/kvm: Correctly report KVM_CAP_PPC_ALLOC_HTAB
>   powerpc/kvm: Add capability flag for hashed page table resizing
>   powerpc/kvm: Rename kvm_alloc_hpt() for clarity
>   powerpc/kvm: Gather HPT related variables into sub-structure
>   powerpc/kvm: Don't store values derivable from HPT order
>   powerpc/kvm: Split HPT allocation from activation
>   powerpc/kvm: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size
>   powerpc/kvm: Create kvmppc_unmap_hpte_helper()
>   powerpc/kvm: KVM-HV HPT resizing stub implementation
>   powerpc/kvm: Outline of KVM-HV HPT resizing implementation
>   powerpc/kvm: KVM-HV HPT resizing, preparation path
>   powerpc/kvm: KVM-HV HPT resizing, commit path
>   powerpc/kvm: Advertise availability of HPT resizing on KVM HV
> 
>  arch/powerpc/include/asm/firmware.h   |   5 +-
>  arch/powerpc/include/asm/hvcall.h |   2 +
>  arch/powerpc/include/asm/kvm_book3s.h |  12 +-
>  arch/powerpc/include/asm/kvm_book3s_64.h  |  15 +
>  arch/powerpc/include/asm/kvm_host.h   |  17 +-
>  arch/powerpc/include/asm/kvm_ppc.h|  11 +-
>  arch/powerpc/include/asm/machdep.h|   1 +
>  arch/powerpc/include/asm/plpar_wrappers.h |  12 +
>  arch/powerpc/include/asm/prom.h   |   1 +
>  arch/powerpc/include/asm/sparsemem.h  |   1 +
>  arch/powerpc/kernel/prom_init.c   |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 626 --
>  arch/powerpc/kvm/book3s_hv.c  |  37 +-
>  arch/powerpc/kvm/book3s_hv_builtin.c  |   8 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c   |  68 ++--
>  arch/powerpc/kvm/powerpc.c|  17 +-
>  arch/powerpc/mm/hash_utils_64.c   |  57 +++
>  arch/powerpc/mm/mem.c |   4 +
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  arch/powerpc/platforms/pseries/lpar.c | 110 ++
>  include/uapi/linux/kvm.h  |   1 +
>  21 files changed, 825 insertions(+), 183 deletions(-)
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling

2016-03-20 Thread Alistair Popple
On Wed, 9 Mar 2016 17:29:03 Alexey Kardashevskiy wrote:
> The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
> used to link GPU and NPU for 2 purposes:
> 
> 1. Access NPU _quickly_ when configuring DMA for GPU - this was addressed
> in the previos patch by removing use of it as DMA setup is not what
> the kernel would constantly do.

This was implemented using peers[] because we had peers[] anyway to deal with 
TCE cache invalidation. I agree there's no reason to keep it around solely for 
speed.

> 2. Invalidate TCE cache for NPU when it is invalidated for GPU.
> GPU and NPU are in different PE. There is already a mechanism to
> attach multiple iommu_table_group to the same iommu_table (used for VFIO),
> we can reuse it here so does this patch.

Ok, this makes sense. I wasn't aware of iommu_table_groups but it looks like a 
more elegant way of solving the problem. I'm not familiar with the way iommu 
groups work but the changes make sense to me as far as I can tell.

> This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
> not needed anymore.
>
> While we are here, add TCE cache invalidation after changing TVT.

Good idea, even though I guess we're unlikely to hit a problem in practice as 
I'm pretty sure on a normal system the links would get retrained between runs 
with different TVTs which implies the NPU gets reset too.

> Signed-off-by: Alexey Kardashevskiy 

Reviewed-By: Alistair Popple 

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 75 +--
>  arch/powerpc/platforms/powernv/pci-ioda.c | 57 +++
>  arch/powerpc/platforms/powernv/pci.h  |  6 ---
>  3 files changed, 29 insertions(+), 109 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8960e46..866d3d3 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -136,22 +136,17 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>   struct pnv_ioda_pe *pe;
>   struct pci_dn *pdn;
>  
> - if (npe->flags & PNV_IODA_PE_PEER) {
> - pe = npe->peers[0];
> - pdev = pe->pdev;
> - } else {
> - pdev = pnv_pci_get_gpu_dev(npe->pdev);
> - if (!pdev)
> - return NULL;
> + pdev = pnv_pci_get_gpu_dev(npe->pdev);
> + if (!pdev)
> + return NULL;
>  
> - pdn = pci_get_pdn(pdev);
> - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
> - return NULL;
> + pdn = pci_get_pdn(pdev);
> + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
> + return NULL;
>  
> - hose = pci_bus_to_host(pdev->bus);
> - phb = hose->private_data;
> - pe = &phb->ioda.pe_array[pdn->pe_number];
> - }
> + hose = pci_bus_to_host(pdev->bus);
> + phb = hose->private_data;
> + pe = &phb->ioda.pe_array[pdn->pe_number];
>  
>   if (gpdev)
>   *gpdev = pdev;
> @@ -159,42 +154,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>   return pe;
>  }
>  
> -void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
> -{
> - struct pnv_ioda_pe *gpe;
> - struct pci_dev *gpdev;
> - int i, avail = -1;
> -
> - if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
> - return;
> -
> - gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> - if (!gpe)
> - return;
> -
> - for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
> - /* Nothing to do if the PE is already connected. */
> - if (gpe->peers[i] == npe)
> - return;
> -
> - if (!gpe->peers[i])
> - avail = i;
> - }
> -
> - if (WARN_ON(avail < 0))
> - return;
> -
> - gpe->peers[avail] = npe;
> - gpe->flags |= PNV_IODA_PE_PEER;
> -
> - /*
> -  * We assume that the NPU devices only have a single peer PE
> -  * (the GPU PCIe device PE).
> -  */
> - npe->peers[0] = gpe;
> - npe->flags |= PNV_IODA_PE_PEER;
> -}
> -
>  /*
>   * Enables 32 bit DMA on NPU.
>   */
> @@ -225,6 +184,13 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
>   if (rc != OPAL_SUCCESS)
>   pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
>   __func__, rc, phb->hose->global_number, npe->pe_number);
> + else
> + pnv_pci_ioda2_tce_invalidate_entire(phb, false);
> +
> + /* Add the table to the list so its TCE cache will get invalidated */
> + npe->table_group.tables[0] = tbl;
> + pnv_pci_link_table_and_group(phb->hose->node, 0,
> + tbl, &npe->table_group);
>  
>   /*
>* We don't initialise npu_pe->tce32_table as we always use
> @@ -245,10 +211,10 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
>   int64_t