date:20190724

[PATCH] Documentation: move Documentation/virtual to Documentation/virt

2019-07-24 Thread Christoph Hellwig

Renaming docs seems to be en vogue at the moment, so fix one of the
grossly misnamed directories.  We usually never use "virtual" as
a shortcut for virtualization in the kernel, but always virt,
as seen in the virt/ top-level directory.  Fix up the documentation
to match that.

Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
Signed-off-by: Christoph Hellwig 
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 Documentation/{virtual => virt}/index.rst   | 0
 .../{virtual => virt}/kvm/amd-memory-encryption.rst | 0
 Documentation/{virtual => virt}/kvm/api.txt | 2 +-
 Documentation/{virtual => virt}/kvm/arm/hyp-abi.txt | 0
 Documentation/{virtual => virt}/kvm/arm/psci.txt| 0
 Documentation/{virtual => virt}/kvm/cpuid.rst   | 0
 Documentation/{virtual => virt}/kvm/devices/README  | 0
 .../{virtual => virt}/kvm/devices/arm-vgic-its.txt  | 0
 Documentation/{virtual => virt}/kvm/devices/arm-vgic-v3.txt | 0
 Documentation/{virtual => virt}/kvm/devices/arm-vgic.txt| 0
 Documentation/{virtual => virt}/kvm/devices/mpic.txt| 0
 Documentation/{virtual => virt}/kvm/devices/s390_flic.txt   | 0
 Documentation/{virtual => virt}/kvm/devices/vcpu.txt| 0
 Documentation/{virtual => virt}/kvm/devices/vfio.txt| 0
 Documentation/{virtual => virt}/kvm/devices/vm.txt  | 0
 Documentation/{virtual => virt}/kvm/devices/xics.txt| 0
 Documentation/{virtual => virt}/kvm/devices/xive.txt| 0
 Documentation/{virtual => virt}/kvm/halt-polling.txt| 0
 Documentation/{virtual => virt}/kvm/hypercalls.txt  | 4 ++--
 Documentation/{virtual => virt}/kvm/index.rst   | 0
 Documentation/{virtual => virt}/kvm/locking.txt | 0
 Documentation/{virtual => virt}/kvm/mmu.txt | 2 +-
 Documentation/{virtual => virt}/kvm/msr.txt | 0
 Documentation/{virtual => virt}/kvm/nested-vmx.txt  | 0
 Documentation/{virtual => virt}/kvm/ppc-pv.txt  | 0
 Documentation/{virtual => virt}/kvm/review-checklist.txt| 2 +-
 Documentation/{virtual => virt}/kvm/s390-diag.txt   | 0
 Documentation/{virtual => virt}/kvm/timekeeping.txt | 0
 Documentation/{virtual => virt}/kvm/vcpu-requests.rst   | 0
 Documentation/{virtual => virt}/paravirt_ops.rst| 0
 Documentation/{virtual => virt}/uml/UserModeLinux-HOWTO.txt | 0
 MAINTAINERS | 6 +++---
 arch/powerpc/include/uapi/asm/kvm_para.h| 2 +-
 arch/x86/kvm/mmu.c  | 2 +-
 include/uapi/linux/kvm.h| 4 ++--
 tools/include/uapi/linux/kvm.h  | 4 ++--
 virt/kvm/arm/arm.c  | 2 +-
 virt/kvm/arm/vgic/vgic-mmio-v3.c| 2 +-
 virt/kvm/arm/vgic/vgic.h| 4 ++--
 40 files changed, 19 insertions(+), 19 deletions(-)
 rename Documentation/{virtual => virt}/index.rst (100%)
 rename Documentation/{virtual => virt}/kvm/amd-memory-encryption.rst (100%)
 rename Documentation/{virtual => virt}/kvm/api.txt (99%)
 rename Documentation/{virtual => virt}/kvm/arm/hyp-abi.txt (100%)
 rename Documentation/{virtual => virt}/kvm/arm/psci.txt (100%)
 rename Documentation/{virtual => virt}/kvm/cpuid.rst (100%)
 rename Documentation/{virtual => virt}/kvm/devices/README (100%)
 rename Documentation/{virtual => virt}/kvm/devices/arm-vgic-its.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/arm-vgic-v3.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/arm-vgic.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/mpic.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/s390_flic.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/vcpu.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/vfio.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/vm.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/xics.txt (100%)
 rename Documentation/{virtual => virt}/kvm/devices/xive.txt (100%)
 rename Documentation/{virtual => virt}/kvm/halt-polling.txt (100%)
 rename Documentation/{virtual => virt}/kvm/hypercalls.txt (97%)
 rename Documentation/{virtual => virt}/kvm/index.rst (100%)
 rename Documentation/{virtual => virt}/kvm/locking.txt (100%)
 rename Documentation/{virtual => virt}/kvm/mmu.txt (99%)
 rename Documentation/{virtual => virt}/kvm/msr.txt (100%)
 rename Documentation/{virtual => virt}/kvm/nested-vmx.txt (100%)
 rename Documentation/{virtual => virt}/kvm/ppc-pv.txt (100%)
 rename Documentation/{virtual => virt}/kvm/review-checklist.txt (95%)
 rename Documentation/{virtual => virt}/kvm/s390-diag.txt (100%)
 rename Documentation/{virtual => virt}/kvm/timekeeping.txt (100%)
 rename Documen

Re: [PATCH] Documentation: move Documentation/virtual to Documentation/virt

2019-07-24 Thread Paolo Bonzini

On 24/07/19 09:24, Christoph Hellwig wrote:
> Renaming docs seems to be en vogue at the moment, so fix one of the
> grossly misnamed directories.  We usually never use "virtual" as
> a shortcut for virtualization in the kernel, but always virt,
> as seen in the virt/ top-level directory.  Fix up the documentation
> to match that.
> 
> Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
> Signed-off-by: Christoph Hellwig 

Queued, thanks.  I can't count how many times I said "I really should
rename that directory".

Paolo


Re: [PATCH v3 02/12] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.

2019-07-24 Thread Greg KH

On Tue, Jul 23, 2019 at 12:51:25PM +0800, Wu Hao wrote:
> +/**
> + * dfl_fpga_cdev_config_port - configure a port feature dev
> + * @cdev: parent container device.
> + * @port_id: id of the port feature device.
> + * @release: release port or assign port back.
> + *
> + * This function allows user to release port platform device or assign it back.
> + * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
> + * release port platform device is one necessary step.
> + */
> +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> +   bool release)
> +{
> + return release ? detach_port_dev(cdev, port_id) :
> +  attach_port_dev(cdev, port_id);
> +}
> +EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);

That's a horrible api.  Every time you see this call in code, you have
to go and look up what "bool" means here.  There's no reason for it.

Just have 2 different functions, one that attaches a port, and one that
detaches it.  That way when you read the code that calls this function,
you know what it does instantly without having to go look up some api
function somewhere else.

Write code for people to read first.  And you are saving nothing here by
trying to do two different things in the same exact function.

thanks,

greg k-h
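
A minimal userspace sketch of the split suggested above, not the driver's actual code. The names `struct cdev_mock`, `port_release()` and `port_assign()` are hypothetical, and the bitmask bookkeeping only stands in for whatever the real attach/detach helpers do. The sketch also assumes at most 32 ports.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the container device: bit N set means
 * port N is currently released. Assumes num_ports <= 32. */
struct cdev_mock {
	uint32_t released_mask;
	int num_ports;
};

/* Release one port; the call site says what happens without a flag. */
int port_release(struct cdev_mock *cdev, int port_id)
{
	if (port_id < 0 || port_id >= cdev->num_ports)
		return -EINVAL;
	if (cdev->released_mask & (1u << port_id))
		return -EBUSY;	/* already released */
	cdev->released_mask |= 1u << port_id;
	return 0;
}

/* Assign a previously released port back. */
int port_assign(struct cdev_mock *cdev, int port_id)
{
	if (port_id < 0 || port_id >= cdev->num_ports)
		return -EINVAL;
	if (!(cdev->released_mask & (1u << port_id)))
		return -EINVAL;	/* was never released */
	cdev->released_mask &= ~(1u << port_id);
	return 0;
}
```

A caller then reads as `port_release(cdev, id)` or `port_assign(cdev, id)`, so nobody has to look up what a bare `true` or `false` argument means.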

Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR

2019-07-24 Thread Greg KH

On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote:
> In early partial reconfiguration private feature, it only
> supports 32bit data width when writing data to hardware for
> PR. 512bit data width PR support is an important optimization
> for some specific solutions (e.g. XEON with FPGA integrated),
> it allows driver to use AVX512 instruction to improve the
> performance of partial reconfiguration. e.g. programming one
> 100MB bitstream image via this 512bit data width PR hardware
> only takes ~300ms, but 32bit revision requires ~3s per test
> result.
> 
> Please note now this optimization is only done on revision 2
> of this PR private feature which is only used in integrated
> solution that AVX512 is always supported. This revision 2
> hardware doesn't support 32bit PR.
> 
> Signed-off-by: Ananda Ravuri 
> Signed-off-by: Xu Yilun 
> Signed-off-by: Wu Hao 
> Acked-by: Alan Tull 
> Signed-off-by: Moritz Fischer 
> ---
> v2: remove DRV/MODULE_VERSION modifications
> ---
>  drivers/fpga/dfl-fme-mgr.c | 110 ++---
>  drivers/fpga/dfl-fme-pr.c  |  43 +++---
>  drivers/fpga/dfl-fme.h |   2 +
>  drivers/fpga/dfl.h |   5 +++
>  4 files changed, 129 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
> index b3f7eee..46e17f0 100644
> --- a/drivers/fpga/dfl-fme-mgr.c
> +++ b/drivers/fpga/dfl-fme-mgr.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  
> +#include "dfl.h"
>  #include "dfl-fme-pr.h"
>  
>  /* FME Partial Reconfiguration Sub Feature Register Set */
> @@ -30,6 +31,7 @@
>  #define FME_PR_STS   0x10
>  #define FME_PR_DATA  0x18
>  #define FME_PR_ERR   0x20
> +#define FME_PR_512_DATA  0x40 /* Data Register for 512bit datawidth PR */
>  #define FME_PR_INTFC_ID_L0xA8
>  #define FME_PR_INTFC_ID_H0xB0
>  
> @@ -67,8 +69,43 @@
>  #define PR_WAIT_TIMEOUT   800
>  #define PR_HOST_STATUS_IDLE  0
>  
> +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
> +
> +#include 
> +#include 
> +
> +static inline int is_cpu_avx512_enabled(void)
> +{
> + return cpu_feature_enabled(X86_FEATURE_AVX512F);
> +}

That's a very arch specific function, why would a driver ever care about
this?

> +
> +static inline void copy512(const void *src, void __iomem *dst)
> +{
> + kernel_fpu_begin();
> +
> + asm volatile("vmovdqu64 (%0), %%zmm0;"
> +  "vmovntdq %%zmm0, (%1);"
> +  :
> +  : "r"(src), "r"(dst)
> +  : "memory");
> +
> + kernel_fpu_end();
> +}

Shouldn't this be an arch-specific function somewhere?  Burying this in
a random driver is not ok.  Please make this generic for all systems.

> +#else
> +static inline int is_cpu_avx512_enabled(void)
> +{
> + return 0;
> +}
> +
> +static inline void copy512(const void *src, void __iomem *dst)
> +{
> + WARN_ON_ONCE(1);

Are you trying to get reports from syzbot?  :)

Please fix this all up.

greg k-h
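
A rough userspace sketch of what a generic, non-AVX fallback might look like, so that the non-x86 path does something well-defined instead of warning. The name `copy512_generic()` is hypothetical. The thread does not say whether the revision 2 hardware would accept a 512-bit chunk split into narrower writes, so this shows only the shape of a portable helper, not a drop-in replacement.

```c
#include <stdint.h>
#include <string.h>

#define PR_CHUNK_BYTES 64	/* one 512-bit chunk */

/* Portable fallback: copy one 512-bit chunk as eight 64-bit stores.
 * dst models a device data window; volatile keeps every store. */
void copy512_generic(const void *src, volatile uint64_t *dst)
{
	uint64_t words[PR_CHUNK_BYTES / sizeof(uint64_t)];
	size_t i;

	memcpy(words, src, sizeof(words));	/* src may be unaligned */
	for (i = 0; i < sizeof(words) / sizeof(words[0]); i++)
		dst[i] = words[i];
}
```

In a kernel tree, an architecture that has a wider store could override such a helper in arch code, leaving the driver free of inline assembly.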

Re: [PATCH v3 03/12] fpga: dfl: pci: enable SRIOV support.

2019-07-24 Thread Greg KH

On Tue, Jul 23, 2019 at 12:51:26PM +0800, Wu Hao wrote:
> This patch enables the standard sriov support. It allows user to
> enable SRIOV (and VFs), then user could pass through accelerators
> (VFs) into virtual machine or use VFs directly in host.
> 
> Signed-off-by: Zhang Yi Z 
> Signed-off-by: Xu Yilun 
> Signed-off-by: Wu Hao 
> Acked-by: Alan Tull 
> Acked-by: Moritz Fischer 
> Signed-off-by: Moritz Fischer 
> ---
> v2: remove DRV/MODULE_VERSION modifications.
> ---
>  drivers/fpga/dfl-pci.c | 39 +++
>  drivers/fpga/dfl.c | 41 +
>  drivers/fpga/dfl.h |  1 +
>  3 files changed, 81 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
> index 66b5720..0e65d81 100644
> --- a/drivers/fpga/dfl-pci.c
> +++ b/drivers/fpga/dfl-pci.c
> @@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
>   return ret;
>  }
>  
> +static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
> +{
> + struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
> + struct dfl_fpga_cdev *cdev = drvdata->cdev;
> + int ret = 0;
> +
> + mutex_lock(&cdev->lock);
> +
> + if (!num_vfs) {
> + /*
> +  * disable SRIOV and then put released ports back to default
> +  * PF access mode.
> +  */
> + pci_disable_sriov(pcidev);
> +
> + __dfl_fpga_cdev_config_port_vf(cdev, false);
> +
> + } else if (cdev->released_port_num == num_vfs) {
> + /*
> +  * only enable SRIOV if cdev has matched released ports, put
> +  * released ports into VF access mode firstly.
> +  */
> + __dfl_fpga_cdev_config_port_vf(cdev, true);
> +
> + ret = pci_enable_sriov(pcidev, num_vfs);
> + if (ret)
> + __dfl_fpga_cdev_config_port_vf(cdev, false);
> + } else {
> + ret = -EINVAL;
> + }
> +
> + mutex_unlock(&cdev->lock);
> + return ret;
> +}
> +
>  static void cci_pci_remove(struct pci_dev *pcidev)
>  {
> + if (dev_is_pf(&pcidev->dev))
> + cci_pci_sriov_configure(pcidev, 0);
> +
>   cci_remove_feature_devs(pcidev);
>   pci_disable_pcie_error_reporting(pcidev);
>  }
> @@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
>   .id_table = cci_pcie_id_tbl,
>   .probe = cci_pci_probe,
>   .remove = cci_pci_remove,
> + .sriov_configure = cci_pci_sriov_configure,
>  };
>  
>  module_pci_driver(cci_pci_driver);
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index e04ed45..c3a8e1d 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -1112,6 +1112,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
>  }
>  EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
>  
> +static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
> +{
> + void __iomem *base;
> + u64 v;
> +
> + base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
> +
> + v = readq(base + FME_HDR_PORT_OFST(port_id));
> +
> + v &= ~FME_PORT_OFST_ACC_CTRL;
> + v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
> + is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
> +
> + writeq(v, base + FME_HDR_PORT_OFST(port_id));
> +}
> +
> +/**
> + * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
> + *
> + * @cdev: parent container device.
> + * @if_vf: true for VF access mode, and false for PF access mode
> + *
> + * Return: 0 on success, negative error code otherwise.
> + *
> + * This function is needed in sriov configuration routine. It could be used to
> + * configure the released ports access mode to VF or PF.
> + * The caller needs to hold lock for protection.
> + */
> +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
> +{
> + struct dfl_feature_platform_data *pdata;
> +
> + list_for_each_entry(pdata, &cdev->port_dev_list, node) {
> + if (device_is_registered(&pdata->dev->dev))
> + continue;
> +
> + config_port_vf(cdev->fme_dev, pdata->id, is_vf);
> + }
> +}
> +EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);

Why are you exporting a function with a leading __?

You are expecting someone else, in who knows what code, to do locking
correctly?  If so, and the caller always has to have a local lock, then
it's not a big deal, just drop the '__', otherwise if you have to have a
specific lock for a specific device, then you have a really complex and
probably broken api here :(

thanks,

greg k-h
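
One common way to address this concern, sketched in userspace C with a pthread mutex. This is an assumption about a possible shape, not what the patch series ended up doing. `struct port_cdev`, `config_port_vf()` and `config_port_vf_locked()` are hypothetical names. The helper that needs the lock held stays `static`, and the only function visible to other code takes the lock itself.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the container device. */
struct port_cdev {
	pthread_mutex_t lock;
	bool vf_mode;
};

/* Internal helper: the caller must hold cdev->lock. It is static, so
 * the locking rule never leaks outside this file. */
static void config_port_vf_locked(struct port_cdev *cdev, bool is_vf)
{
	cdev->vf_mode = is_vf;
}

/* Public entry point: takes and drops the lock on its own. */
void config_port_vf(struct port_cdev *cdev, bool is_vf)
{
	pthread_mutex_lock(&cdev->lock);
	config_port_vf_locked(cdev, is_vf);
	pthread_mutex_unlock(&cdev->lock);
}
```

The quoted patch holds `cdev->lock` across `pci_enable_sriov()` as well, so a real fix would have to decide where that wider critical section lives. The sketch only illustrates keeping the lock requirement private.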

Re: [PATCH v3 04/12] fpga: dfl: afu: add AFU state related sysfs interfaces

2019-07-24 Thread Greg KH

On Tue, Jul 23, 2019 at 12:51:27PM +0800, Wu Hao wrote:
> This patch introduces more sysfs interfaces for Accelerated
> Function Unit (AFU). These interfaces allow users to read
> current AFU Power State (APx), read / clear AFU Power (APx)
> events which are sticky to identify transient APx state,
> and manage AFU's LTR (latency tolerance reporting).
> 
> Signed-off-by: Ananda Ravuri 
> Signed-off-by: Xu Yilun 
> Signed-off-by: Wu Hao 
> Acked-by: Alan Tull 
> Signed-off-by: Moritz Fischer 
> ---
> v2: rebased, and remove DRV/MODULE_VERSION modifications
> v3: update kernel version and date in sysfs doc
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  30 +
>  drivers/fpga/dfl-afu-main.c   | 137 ++
>  drivers/fpga/dfl.h|  11 ++
>  3 files changed, 178 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> index 6a92dda..5961fb2 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> @@ -14,3 +14,33 @@ Description:   Read-only. User can program different PR bitstreams to FPGA
>   Accelerator Function Unit (AFU) for different functions. It
>   returns uuid which could be used to identify which PR bitstream
>   is programmed in this AFU.
> +
> +What:/sys/bus/platform/devices/dfl-port.0/power_state
> +Date:July 2019
> +KernelVersion:   5.4
> +Contact: Wu Hao 
> +Description: Read-only. It reports the APx (AFU Power) state, different APx
> + means different throttling level. When reading this file, it
> + returns "0" - Normal / "1" - AP1 / "2" - AP2 / "6" - AP6.
> +
> +What:/sys/bus/platform/devices/dfl-port.0/ap1_event
> +Date:July 2019
> +KernelVersion:   5.4
> +Contact: Wu Hao 
> +Description: Read-write. Read or set 1 to clear AP1 (AFU Power State 1)
> + event. It's used to indicate transient AP1 state.

So reading the value changes the state of the system?  That's almost
never a good idea.

Force userspace to write the value to change something.  Otherwise all
libraries that use sysfs will be accidentally changing the state of your
system without you ever knowing it.



> +
> +What:/sys/bus/platform/devices/dfl-port.0/ap2_event
> +Date:July 2019
> +KernelVersion:   5.4
> +Contact: Wu Hao 
> +Description: Read-write. Read or set 1 to clear AP2 (AFU Power State 2)
> + event. It's used to indicate transient AP2 state.
> +
> +What:/sys/bus/platform/devices/dfl-port.0/ltr
> +Date:July 2019
> +KernelVersion:   5.4
> +Contact: Wu Hao 
> +Description: Read-write. Read and set AFU latency tolerance reporting value.
> + Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
> + to 0 if it is latency sensitive.
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index 68b4d08..cb3f73e 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -141,8 +141,145 @@ static int port_get_id(struct platform_device *pdev)
>  }
>  static DEVICE_ATTR_RO(id);
>  
> +static ssize_t
> +ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> + struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> + void __iomem *base;
> + u64 v;
> +
> + base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + PORT_HDR_CTRL);
> + mutex_unlock(&pdata->lock);

Why do you need a lock to call readq()?  What are you protecting here?


> +
> + return sprintf(buf, "%x\n", (u8)FIELD_GET(PORT_CTRL_LATENCY, v));
> +}
> +
> +static ssize_t
> +ltr_store(struct device *dev, struct device_attribute *attr,
> +   const char *buf, size_t count)
> +{
> + struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> + void __iomem *base;
> + u8 ltr;
> + u64 v;
> +
> + if (kstrtou8(buf, 0, &ltr) || ltr > 1)
> + return -EINVAL;

Are you doing anything with this value?  If not, how about just using
the sysfs boolean read function and if it is 1, then do the clearing?

Same for all other show/store functions in here.

thanks,

greg k-h
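
A small userspace sketch of the two review points above: a read that never changes state, and a store that parses a boolean and clears the event only on an explicit 1. All names here (`struct port_mock`, `ap1_event_show()`, `ap1_event_store()`, `parse_bool()`) are hypothetical. In the kernel, a helper such as `kstrtobool()` would take the place of the hand-written parser.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the sticky AP1 event bit in hardware. */
struct port_mock {
	bool ap1_event;
};

/* Read path: report the bit and touch nothing else. */
int ap1_event_show(const struct port_mock *port, char *buf, size_t len)
{
	return snprintf(buf, len, "%d\n", port->ap1_event ? 1 : 0);
}

/* Minimal boolean parser: accepts "0" or "1", optional newline. */
static int parse_bool(const char *buf, bool *res)
{
	if (!strcmp(buf, "1") || !strcmp(buf, "1\n")) {
		*res = true;
		return 0;
	}
	if (!strcmp(buf, "0") || !strcmp(buf, "0\n")) {
		*res = false;
		return 0;
	}
	return -EINVAL;
}

/* Write path: only an explicit "1" clears the event. */
int ap1_event_store(struct port_mock *port, const char *buf)
{
	bool clear;

	if (parse_bool(buf, &clear) || !clear)
		return -EINVAL;
	port->ap1_event = false;
	return 0;
}
```

With this split, a tool that merely reads every sysfs file cannot clear a pending event by accident.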

doc: mds: nitpicking and typo fix

2019-07-24 Thread Pavel Machek


Consistently end sentences, fix typo.

Signed-off-by: Pavel Machek 

commit 310cb17613f46db97cebbd9dc11187961e4b1c69
Author: Pavel 
Date:   Mon May 20 10:46:35 2019 +0200

doc: typo fix, consistency in mds.

diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
index 5d4330b..9983b50 100644
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -54,13 +54,13 @@ needed for exploiting MDS requires:
  - to control the load to trigger a fault or assist
 
  - to have a disclosure gadget which exposes the speculatively accessed
-   data for consumption through a side channel.
+   data for consumption through a side channel
 
  - to control the pointer through which the disclosure gadget exposes the
data
 
 The existence of such a construct in the kernel cannot be excluded with
-100% certainty, but the complexity involved makes it extremly unlikely.
+100% certainty, but the complexity involved makes it extremely unlikely.
 
 There is one exception, which is untrusted BPF. The functionality of
 untrusted BPF is limited, but it needs to be thoroughly investigated


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: [PATCH v3 09/12] fpga: dfl: afu: add STP (SignalTap) support

2019-07-24 Thread Greg KH

On Tue, Jul 23, 2019 at 12:51:32PM +0800, Wu Hao wrote:
> STP (SignalTap) is one of the private features under the port for
> debugging. This patch adds private feature driver support for it
> to allow userspace applications to mmap related mmio region and
> provide STP service.
> 
> Signed-off-by: Xu Yilun 
> Signed-off-by: Wu Hao 
> Acked-by: Moritz Fischer 
> Acked-by: Alan Tull 
> Signed-off-by: Moritz Fischer 
> ---
>  drivers/fpga/dfl-afu-main.c | 34 ++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index 15dd4cb..395f96e 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -514,6 +514,36 @@ static void port_afu_uinit(struct platform_device *pdev,
>   .uinit = port_afu_uinit,
>  };
>  
> +static int port_stp_init(struct platform_device *pdev,
> +  struct dfl_feature *feature)
> +{
> + struct resource *res = &pdev->resource[feature->resource_index];
> +
> + dev_dbg(&pdev->dev, "PORT STP Init.\n");

ftrace is your friend, no need to do a lot of "look I am here!"
messages.

> +
> + return afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> +DFL_PORT_REGION_INDEX_STP,
> +resource_size(res), res->start,
> +DFL_PORT_REGION_MMAP | DFL_PORT_REGION_READ |
> +DFL_PORT_REGION_WRITE);
> +}
> +
> +static void port_stp_uinit(struct platform_device *pdev,
> +struct dfl_feature *feature)
> +{
> + dev_dbg(&pdev->dev, "PORT STP UInit.\n");

Same here.

Why have this function at all if it does not do anything?


thanks,

greg k-h

Re: [PATCH v3 09/12] fpga: dfl: afu: add STP (SignalTap) support

2019-07-24 Thread Wu Hao

On Wed, Jul 24, 2019 at 12:11:09PM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:32PM +0800, Wu Hao wrote:
> > STP (SignalTap) is one of the private features under the port for
> > debugging. This patch adds private feature driver support for it
> > to allow userspace applications to mmap related mmio region and
> > provide STP service.
> > 
> > Signed-off-by: Xu Yilun 
> > Signed-off-by: Wu Hao 
> > Acked-by: Moritz Fischer 
> > Acked-by: Alan Tull 
> > Signed-off-by: Moritz Fischer 
> > ---
> >  drivers/fpga/dfl-afu-main.c | 34 ++
> >  1 file changed, 34 insertions(+)
> > 
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index 15dd4cb..395f96e 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -514,6 +514,36 @@ static void port_afu_uinit(struct platform_device *pdev,
> > .uinit = port_afu_uinit,
> >  };
> >  
> > +static int port_stp_init(struct platform_device *pdev,
> > +struct dfl_feature *feature)
> > +{
> > +   struct resource *res = &pdev->resource[feature->resource_index];
> > +
> > +   dev_dbg(&pdev->dev, "PORT STP Init.\n");
> 
> ftrace is your friend, no need to do a lot of "look I am here!"
> messages.

Hi Greg,

Thanks for the code review!

Sure, let me remove them.

> 
> > +
> > +   return afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> > +  DFL_PORT_REGION_INDEX_STP,
> > +  resource_size(res), res->start,
> > +  DFL_PORT_REGION_MMAP | DFL_PORT_REGION_READ |
> > +  DFL_PORT_REGION_WRITE);
> > +}
> > +
> > +static void port_stp_uinit(struct platform_device *pdev,
> > +  struct dfl_feature *feature)
> > +{
> > +   dev_dbg(&pdev->dev, "PORT STP UInit.\n");
> 
> Same here.
> 
> Why have this function at all if it does not do anything?

Let me remove them in the next version. Actually, the uinit callback is
always required by the current code, so I will add one more patch to make
it optional and remove all uinit functions that do nothing, which does
save code.

Thanks for the comments.
Hao

> 
> 
> thanks,
> 
> greg k-h
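
A minimal sketch of the follow-up change described in the reply above, written as userspace C. The types `struct feature_mock` and `struct feature_ops_mock` and the function `feature_uinit()` are hypothetical and only mirror the idea: the core checks whether a feature provides a uinit callback before calling it, so features with nothing to tear down can leave it NULL.

```c
#include <stddef.h>

struct feature_mock;

/* Hypothetical per-feature operations; uinit may be NULL. */
struct feature_ops_mock {
	int (*init)(struct feature_mock *f);
	void (*uinit)(struct feature_mock *f);
};

struct feature_mock {
	const struct feature_ops_mock *ops;
	int initialized;
};

/* Tear down a feature, calling uinit only when one is provided. */
void feature_uinit(struct feature_mock *f)
{
	if (f->ops && f->ops->uinit)
		f->ops->uinit(f);
	f->initialized = 0;
}
```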

Re: [PATCH v3 04/12] fpga: dfl: afu: add AFU state related sysfs interfaces

2019-07-24 Thread Wu Hao

On Wed, Jul 24, 2019 at 11:41:10AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:27PM +0800, Wu Hao wrote:
> > This patch introduces more sysfs interfaces for Accelerated
> > Function Unit (AFU). These interfaces allow users to read
> > current AFU Power State (APx), read / clear AFU Power (APx)
> > events which are sticky to identify transient APx state,
> > and manage AFU's LTR (latency tolerance reporting).
> > 
> > Signed-off-by: Ananda Ravuri 
> > Signed-off-by: Xu Yilun 
> > Signed-off-by: Wu Hao 
> > Acked-by: Alan Tull 
> > Signed-off-by: Moritz Fischer 
> > ---
> > v2: rebased, and remove DRV/MODULE_VERSION modifications
> > v3: update kernel version and date in sysfs doc
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-port |  30 +
> >  drivers/fpga/dfl-afu-main.c   | 137 ++
> >  drivers/fpga/dfl.h|  11 ++
> >  3 files changed, 178 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > index 6a92dda..5961fb2 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > @@ -14,3 +14,33 @@ Description: Read-only. User can program different PR bitstreams to FPGA
> > Accelerator Function Unit (AFU) for different functions. It
> > returns uuid which could be used to identify which PR bitstream
> > is programmed in this AFU.
> > +
> > +What:  /sys/bus/platform/devices/dfl-port.0/power_state
> > +Date:  July 2019
> > +KernelVersion: 5.4
> > +Contact:   Wu Hao 
> > +Description:   Read-only. It reports the APx (AFU Power) state, different APx
> > +   means different throttling level. When reading this file, it
> > +   returns "0" - Normal / "1" - AP1 / "2" - AP2 / "6" - AP6.
> > +
> > +What:  /sys/bus/platform/devices/dfl-port.0/ap1_event
> > +Date:  July 2019
> > +KernelVersion: 5.4
> > +Contact:   Wu Hao 
> > +Description:   Read-write. Read or set 1 to clear AP1 (AFU Power State 1)
> > +   event. It's used to indicate transient AP1 state.
> 
> > So reading the value changes the state of the system?  That's almost
> > never a good idea.
> 
> Force userspace to write the value to change something.  Otherwise all
> libraries that use sysfs will be accidentally changing the state of your
> system without you ever knowing it.

Oh, I think the description is misleading here; I will fix it in the next
version. The AP1/AP2 event is only cleared by writing 1 to it; a read does
not change the state.

> 
> > +
> > +What:  /sys/bus/platform/devices/dfl-port.0/ap2_event
> > +Date:  July 2019
> > +KernelVersion: 5.4
> > +Contact:   Wu Hao 
> > +Description:   Read-write. Read or set 1 to clear AP2 (AFU Power State 2)
> > +   event. It's used to indicate transient AP2 state.
> > +
> > +What:  /sys/bus/platform/devices/dfl-port.0/ltr
> > +Date:  July 2019
> > +KernelVersion: 5.4
> > +Contact:   Wu Hao 
> > +Description:   Read-write. Read and set AFU latency tolerance reporting value.
> > +   Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
> > +   to 0 if it is latency sensitive.
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index 68b4d08..cb3f73e 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -141,8 +141,145 @@ static int port_get_id(struct platform_device *pdev)
> >  }
> >  static DEVICE_ATTR_RO(id);
> >  
> > +static ssize_t
> > +ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +   struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +   void __iomem *base;
> > +   u64 v;
> > +
> > +   base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +   mutex_lock(&pdata->lock);
> > +   v = readq(base + PORT_HDR_CTRL);
> > +   mutex_unlock(&pdata->lock);
> 
> Why do you need a lock to call readq()?  What are you protecting here?

If this code is running on a 32-bit machine, readq will be replaced with two
readl operations. If that is the case, should we protect the code against
it?
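
To make the concern concrete, here is a userspace sketch of a 64-bit register read composed of two 32-bit reads, low word first. `read64_lo_hi()` is a hypothetical name and the register is modelled as a plain array. The kernel's `linux/io-64-nonatomic-lo-hi.h` header offers a helper of this general shape (`lo_hi_readq()`). Whether this driver would actually be built that way on 32-bit is an assumption.

```c
#include <stdint.h>

/* Read a 64-bit register as two 32-bit halves, low word first.
 * The two loads are separate accesses, so the register can change
 * between them; that window is the tearing the question is about. */
uint64_t read64_lo_hi(const volatile uint32_t *reg)
{
	uint32_t lo = reg[0];
	uint32_t hi = reg[1];

	return ((uint64_t)hi << 32) | lo;
}
```

A lock around such a read only helps if every writer of the register takes the same lock; it does nothing against the hardware itself updating the value between the two loads.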

> 
> 
> > +
> > +   return sprintf(buf, "%x\n", (u8)FIELD_GET(PORT_CTRL_LATENCY, v));
> > +}
> > +
> > +static ssize_t
> > +ltr_store(struct device *dev, struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > +   struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +   void __iomem *base;
> > +   u8 ltr;
> > +   u64 v;
> > +
> > +   if (kstrtou8(buf, 0, &ltr) || ltr > 1)
> > +   return -EINVAL;
> 
> Are you doing anything with this value?  If not, how about just using
> the sysfs boolean read function and if it is 1, then do the clearin

Re: [PATCH v3 03/12] fpga: dfl: pci: enable SRIOV support.

2019-07-24 Thread Wu Hao

On Wed, Jul 24, 2019 at 11:37:44AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:26PM +0800, Wu Hao wrote:
> > This patch enables the standard sriov support. It allows user to
> > enable SRIOV (and VFs), then user could pass through accelerators
> > (VFs) into virtual machine or use VFs directly in host.
> > 
> > Signed-off-by: Zhang Yi Z 
> > Signed-off-by: Xu Yilun 
> > Signed-off-by: Wu Hao 
> > Acked-by: Alan Tull 
> > Acked-by: Moritz Fischer 
> > Signed-off-by: Moritz Fischer 
> > ---
> > v2: remove DRV/MODULE_VERSION modifications.
> > ---
> >  drivers/fpga/dfl-pci.c | 39 +++
> >  drivers/fpga/dfl.c | 41 +
> >  drivers/fpga/dfl.h |  1 +
> >  3 files changed, 81 insertions(+)
> > 
> > diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
> > index 66b5720..0e65d81 100644
> > --- a/drivers/fpga/dfl-pci.c
> > +++ b/drivers/fpga/dfl-pci.c
> > @@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
> > return ret;
> >  }
> >  
> > +static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
> > +{
> > +   struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
> > +   struct dfl_fpga_cdev *cdev = drvdata->cdev;
> > +   int ret = 0;
> > +
> > +   mutex_lock(&cdev->lock);
> > +
> > +   if (!num_vfs) {
> > +   /*
> > +* disable SRIOV and then put released ports back to default
> > +* PF access mode.
> > +*/
> > +   pci_disable_sriov(pcidev);
> > +
> > +   __dfl_fpga_cdev_config_port_vf(cdev, false);
> > +
> > +   } else if (cdev->released_port_num == num_vfs) {
> > +   /*
> > +* only enable SRIOV if cdev has matched released ports, put
> > +* released ports into VF access mode firstly.
> > +*/
> > +   __dfl_fpga_cdev_config_port_vf(cdev, true);
> > +
> > +   ret = pci_enable_sriov(pcidev, num_vfs);
> > +   if (ret)
> > +   __dfl_fpga_cdev_config_port_vf(cdev, false);
> > +   } else {
> > +   ret = -EINVAL;
> > +   }
> > +
> > +   mutex_unlock(&cdev->lock);
> > +   return ret;
> > +}
> > +
> >  static void cci_pci_remove(struct pci_dev *pcidev)
> >  {
> > +   if (dev_is_pf(&pcidev->dev))
> > +   cci_pci_sriov_configure(pcidev, 0);
> > +
> > cci_remove_feature_devs(pcidev);
> > pci_disable_pcie_error_reporting(pcidev);
> >  }
> > @@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
> > .id_table = cci_pcie_id_tbl,
> > .probe = cci_pci_probe,
> > .remove = cci_pci_remove,
> > +   .sriov_configure = cci_pci_sriov_configure,
> >  };
> >  
> >  module_pci_driver(cci_pci_driver);
> > diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> > index e04ed45..c3a8e1d 100644
> > --- a/drivers/fpga/dfl.c
> > +++ b/drivers/fpga/dfl.c
> > @@ -1112,6 +1112,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> >  }
> >  EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
> >  
> > +static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
> > +{
> > +   void __iomem *base;
> > +   u64 v;
> > +
> > +   base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
> > +
> > +   v = readq(base + FME_HDR_PORT_OFST(port_id));
> > +
> > +   v &= ~FME_PORT_OFST_ACC_CTRL;
> > +   v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
> > +   is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
> > +
> > +   writeq(v, base + FME_HDR_PORT_OFST(port_id));
> > +}
> > +
> > +/**
> > + * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
> > + *
> > + * @cdev: parent container device.
> > + * @is_vf: true for VF access mode, and false for PF access mode
> > + *
> > + * Return: 0 on success, negative error code otherwise.
> > + *
> > + * This function is needed in sriov configuration routine. It could be used to
> > + * configure the released ports access mode to VF or PF.
> > + * The caller needs to hold lock for protection.
> > + */
> > +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
> > +{
> > +   struct dfl_feature_platform_data *pdata;
> > +
> > +   list_for_each_entry(pdata, &cdev->port_dev_list, node) {
> > +   if (device_is_registered(&pdata->dev->dev))
> > +   continue;
> > +
> > +   config_port_vf(cdev->fme_dev, pdata->id, is_vf);
> > +   }
> > +}
> > +EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);
> 
> Why are you exporting a function with a leading __?
> 
> You are expecting someone else, in who knows what code, to do locking
> correctly?  If so, and the caller always has to have a local lock, then
> it's not a big deal, just drop the '__', otherwise if you have to have a
> specific lock for a specific device, then you have a really complex and
> probably broken api here :(

Yes, I just want to remind the user of this API, caller needs to

Re: [PATCH v3 02/12] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.

2019-07-24 Thread Wu Hao

On Wed, Jul 24, 2019 at 11:33:57AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:25PM +0800, Wu Hao wrote:
> > +/**
> > + * dfl_fpga_cdev_config_port - configure a port feature dev
> > + * @cdev: parent container device.
> > + * @port_id: id of the port feature device.
> > + * @release: release port or assign port back.
> > + *
> > + * This function allows user to release port platform device or assign it back.
> > + * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
> > + * release port platform device is one necessary step.
> > + */
> > +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> > + bool release)
> > +{
> > +   return release ? detach_port_dev(cdev, port_id) :
> > +attach_port_dev(cdev, port_id);
> > +}
> > +EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
> 
> That's a horrible api.  Every time you see this call in code, you have
> to go and look up what "bool" means here.  There's no reason for it.
> 
> Just have 2 different functions, one that attaches a port, and one that
> detaches it.  That way when you read the code that calls this function,
> you know what it does instantly without having to go look up some api
> function somewhere else.
> 
> Write code for people to read first.  And you are saving nothing here by
> trying to do two different things in the same exact function.

I see, you're right, it saves everybody's time on reading, very important.
I will fix this and keep it in mind. Thank you.

Hao

> 
> thanks,
> 
> greg k-h
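
The review point above, two named functions instead of one function with a bool, can be sketched in plain C as follows. This is only an illustration: struct fake_cdev and both function names are hypothetical and do not come from the driver.

```c
#include <assert.h>

/* Hypothetical container state: bit N is set while port N is released. */
struct fake_cdev {
	unsigned int released_mask;
};

/* Release a port. Fails if the id is out of range or already released. */
static int fake_cdev_release_port(struct fake_cdev *cdev, int port_id)
{
	if (port_id < 0 || port_id >= 32)
		return -1;
	if (cdev->released_mask & (1u << port_id))
		return -1;
	cdev->released_mask |= 1u << port_id;
	return 0;
}

/* Assign a released port back. Fails if the port was not released. */
static int fake_cdev_assign_port(struct fake_cdev *cdev, int port_id)
{
	if (port_id < 0 || port_id >= 32)
		return -1;
	if (!(cdev->released_mask & (1u << port_id)))
		return -1;
	cdev->released_mask &= ~(1u << port_id);
	return 0;
}
```

A call site then reads fake_cdev_release_port(cdev, 2) rather than config_port(cdev, 2, true), so the intent is visible without looking up what the bool means.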

Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-24 Thread Joel Fernandes

On Wed, Jul 24, 2019 at 01:28:42PM +0900, Minchan Kim wrote:
> On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote:
> > On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> > > Hi Joel,
> > > 
> > > On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > > > The page_idle tracking feature currently requires looking up the pagemap
> > > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > > This is quite cumbersome and can be error-prone too. If between
> > > 
> > > cumbersome: That's the fair tradeoff between idle page tracking and
> > > clear_refs because idle page tracking could check even though the page
> > > is not mapped.
> > 
> > It is fair tradeoff, but could be made simpler. The userspace code got
> > reduced by a good amount as well.
> > 
> > > error-prone: What's the error?
> > 
> > We see in normal Android usage, that some of the times pages appear not to be
> > idle even when they really are idle. Reproducing this is a bit unpredictable
> > and happens at random occasions. With this new interface, we are seeing this
> > happen much less often.
> 
> I don't know how you did test. Maybe that could be contributed by
> swapping out or shared pages touched by other processes or some kernel
> behavior not to keep access bit of their operation.

My thinking is also that it could be something along these lines. So we know
it already has issues due to what you mentioned, and I am not sure what else
needs investigation?

> Please investigate more what's the root cause. That would be important
> point to justify for the patch motivation.

The motivation is security. I am dropping the 'accuracy' factor I mentioned
from the patch description since it created a lot of confusion.

> > > > More over looking up PFN from pagemap in Android devices is not
> > > > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > > > the PFN.
> > > > 
> > > > This patch adds support to directly interact with page_idle tracking at
> > > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > > eliminates the need for userspace to calculate the mapping of the page.
> > > > It follows the exact same semantics as the global
> > > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > > 
> > > Ah, so the primary goal is to provide a convenience interface and it would
> > > help accuracy, too. IOW, accuracy is not your main goal?
> > 
> > There are a couple of primary goals: Security, convenience and also solving
> > the accuracy/reliability problem we are seeing. Do keep in mind looking up
> > PFN has security implications. The PFN field in pagemap is zeroed if the user
> > does not have CAP_SYS_ADMIN.
> 
> Maybe you don't need the PFN, is that it?

With the traditional idle tracking, PFN is needed which has the mentioned
security issues. This patch solves it. And the interface is identical and
familiar to the existing page_idle bitmap interface.
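
For readers unfamiliar with the bitmap layout, here is a small sketch of the index arithmetic. The existing /sys/kernel/mm/page_idle/bitmap is an array of 8-byte words in which page #i maps to bit i % 64 of word i / 64. The sketch assumes the per-process file keeps that layout and simply indexes by virtual page number instead of PFN, and it assumes 4 KiB pages; all names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct idle_bit_pos {
	uint64_t file_offset;	/* byte offset of the 8-byte word in the file */
	unsigned int bit;	/* bit number inside that word */
};

/* Map a page index to its word offset and bit in the bitmap file. */
static struct idle_bit_pos idle_bit_for_index(uint64_t index)
{
	struct idle_bit_pos pos = {
		.file_offset = (index / 64) * 8,
		.bit = (unsigned int)(index % 64),
	};

	return pos;
}

/* Assumed per-process variant: the index is the virtual page number. */
static struct idle_bit_pos idle_bit_for_vaddr(uint64_t vaddr)
{
	return idle_bit_for_index(vaddr >> SKETCH_PAGE_SHIFT);
}
```

With virtual indexing, userspace can go from an address in its own mappings to a bit position without reading the PFN from pagemap.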

> > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > profiles and pin points code paths which allocates and leaves memory
> > > > idle for long periods of time.
> > > 
> > > So the goal is to detect idle pages with idle memory tracking?
> > 
> > Isn't that what idle memory tracking does?
> 
> To me, it's rather misleading. Please read motivation section in document.
> The feature would be good to detect workingset pages, not idle pages
> because workingset pages are never freed, swapped out and even we could
> count on newly allocated pages.
> 
> Motivation
> ==========
> 
> The idle page tracking feature allows to track which memory pages are being
> accessed by a workload and which are idle. This information can be useful for
> estimating the workload's working set size, which, in turn, can be taken into
> account when configuring the workload parameters, setting memory cgroup limits,
> or deciding where to place the workload within a compute cluster.

As we discussed by chat, we could collect additional metadata to check if
pages were swapped or freed ever since the time we marked them as idle.
However this can be incremental improvement.

> > > It couldn't work well because such idle pages could finally swap out and
> > > lose every flags of the page descriptor which is working mechanism of
> > > idle page tracking. It should have named "workingset page tracking",
> > > not "idle page tracking".
> > 
> > The heap profiler that uses page-idle tracking is not to measure working set,
> > but to look for pages that are idle for long periods of time.
> 
> It's an important part. Please include it in the description so that people
> understand what the usecase is. As I said above, if it aims for finding
> idle pages during the period, current idle page tracking feature is not
> good ironically.

Ok, I will mention.

> > Thanks for bringing up the swapping c

Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR

2019-07-24 Thread Wu Hao

On Wed, Jul 24, 2019 at 11:35:32AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote:
> > In early partial reconfiguration private feature, it only
> > supports 32bit data width when writing data to hardware for
> > PR. 512bit data width PR support is an important optimization
> > for some specific solutions (e.g. XEON with FPGA integrated),
> > it allows driver to use AVX512 instruction to improve the
> > performance of partial reconfiguration. e.g. programming one
> > 100MB bitstream image via this 512bit data width PR hardware
> > only takes ~300ms, but 32bit revision requires ~3s per test
> > result.
> > 
> > Please note now this optimization is only done on revision 2
> > of this PR private feature which is only used in integrated
> > solution that AVX512 is always supported. This revision 2
> > hardware doesn't support 32bit PR.
> > 
> > Signed-off-by: Ananda Ravuri 
> > Signed-off-by: Xu Yilun 
> > Signed-off-by: Wu Hao 
> > Acked-by: Alan Tull 
> > Signed-off-by: Moritz Fischer 
> > ---
> > v2: remove DRV/MODULE_VERSION modifications
> > ---
> >  drivers/fpga/dfl-fme-mgr.c | 110 ++---
> >  drivers/fpga/dfl-fme-pr.c  |  43 +++---
> >  drivers/fpga/dfl-fme.h |   2 +
> >  drivers/fpga/dfl.h |   5 +++
> >  4 files changed, 129 insertions(+), 31 deletions(-)
> > 
> > diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
> > index b3f7eee..46e17f0 100644
> > --- a/drivers/fpga/dfl-fme-mgr.c
> > +++ b/drivers/fpga/dfl-fme-mgr.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  
> > +#include "dfl.h"
> >  #include "dfl-fme-pr.h"
> >  
> >  /* FME Partial Reconfiguration Sub Feature Register Set */
> > @@ -30,6 +31,7 @@
> >  #define FME_PR_STS 0x10
> >  #define FME_PR_DATA 0x18
> >  #define FME_PR_ERR 0x20
> > +#define FME_PR_512_DATA 0x40 /* Data Register for 512bit datawidth PR */
> >  #define FME_PR_INTFC_ID_L  0xA8
> >  #define FME_PR_INTFC_ID_H  0xB0
> >  
> > @@ -67,8 +69,43 @@
> >  #define PR_WAIT_TIMEOUT   800
> >  #define PR_HOST_STATUS_IDLE 0
> >  
> > +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
> > +
> > +#include 
> > +#include 
> > +
> > +static inline int is_cpu_avx512_enabled(void)
> > +{
> > +   return cpu_feature_enabled(X86_FEATURE_AVX512F);
> > +}
> 
> That's a very arch specific function, why would a driver ever care about
> this?

Yes, this only applies to a specific FPGA solution, in which the FPGA
is integrated with a XEON CPU. The hardware indicates this to software
through a register. As it's a CPU-integrated solution, the CPU always has
the AVX512 capability. The only check we do is to make sure it has not
been manually disabled by the kernel.

With this hardware, software can use AVX512 to accelerate FPGA
partial reconfiguration, as mentioned in the patch commit message.
It brings performance benefits to people who use it. This is only one
optimization (512 vs 32bit data write to hw) for a specific hardware.

For other discrete solutions, e.g. FPGA PCIe Card, this is not used
at all, as the driver checks the hardware register to avoid any AVX512 code.
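
The selection logic described above can be sketched as a small pure function. This is an illustration under the assumptions stated in the commit message (revisions before 2 support only 32-bit writes, revision 2 supports only 512-bit writes); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Data width the driver would use when writing PR data to hardware. */
enum pr_write_width {
	PR_WIDTH_UNSUPPORTED = 0,
	PR_WIDTH_32 = 32,
	PR_WIDTH_512 = 512,
};

/*
 * Pick the write width from the PR feature revision reported by the
 * hardware and from whether AVX512 is usable on the running CPU.
 */
static enum pr_write_width pick_pr_width(unsigned int pr_revision,
					 bool cpu_has_avx512)
{
	if (pr_revision < 2)
		return PR_WIDTH_32;

	/* Revision 2 hardware has no 32-bit path to fall back to. */
	return cpu_has_avx512 ? PR_WIDTH_512 : PR_WIDTH_UNSUPPORTED;
}
```

On a discrete card reporting an earlier revision, the AVX512 state is never consulted, which matches the statement that such hardware avoids any AVX512 code.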

> 
> > +
> > +static inline void copy512(const void *src, void __iomem *dst)
> > +{
> > +   kernel_fpu_begin();
> > +
> > +   asm volatile("vmovdqu64 (%0), %%zmm0;"
> > +"vmovntdq %%zmm0, (%1);"
> > +:
> > +: "r"(src), "r"(dst)
> > +: "memory");
> > +
> > +   kernel_fpu_end();
> > +}
> 
> Shouldn't this be an arch-specific function somewhere?  Burying this in
> a random driver is not ok.  Please make this generic for all systems.

If more people need the same AVX operation in the kernel, then maybe
this can be moved to some arch-specific lib code as a common function
for everybody, but I am not sure that is the case. Let me
think about this more.

> 
> > +#else
> > +static inline int is_cpu_avx512_enabled(void)
> > +{
> > +   return 0;
> > +}
> > +
> > +static inline void copy512(const void *src, void __iomem *dst)
> > +{
> > +   WARN_ON_ONCE(1);
> 
> Are you trying to get reports from syzbot?  :)

Oh.. no.. I will remove it. :)

Thank you very much!

Hao

> 
> Please fix this all up.
> 
> greg k-h

[PATCH] hung_task: Allow printing warnings every check interval

2019-07-24 Thread Dmitry Safonov

The hung task detector has one timeout and two associated actions on it:
- issuing warnings with names and stacks of blocked tasks
- panic()

We want switches to panic (and reboot) if there's a task
in uninterruptible sleep for some minutes - at that moment something
ugly has happened and the box needs a reboot.
But we also want to detect conditions that are "out of range"
or approaching the point of failure. Under such conditions we want
to issue an "early warning" of an impending failure, minutes before
the switch is going to panic.

Those "early warnings" serve a purpose while monitoring the network
infrastructure. Those are also valuable on post-mortem analysis, when
the logs from userspace applications aren't enough.
Furthermore, we have a test pool of long-running DUTs that are
constantly under close to real-world load for weeks. Such early
warnings allowed us to figure out some bottlenecks without much
engineering intervention.

There are also not yet upstream patches for other kinds of "early
warnings" as prints whenever a mutex/semaphore is released after being
held for long time, but those patches are much more intricate and have
their runtime cost.

It seems rather easy to add printing tasks and their stacks for
notification and debugging purposes into hung task detector without
complicating the code or major cost (prints are with KERN_INFO loglevel
and so don't go on console, only into dmesg log).

Since commit a2e514453861 ("kernel/hung_task.c: allow to set checking
interval separately from timeout") it's possible to set checking
interval for hung task detector with `hung_task_check_interval_secs`.

Provide `hung_task_interval_warnings` sysctl that allows printing
hung tasks every detection interval. It's not ratelimited, so the root
should be cautious configuring it.
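
The counter semantics documented for these sysctls (decrement on each report, stop at 0, -1 meaning unlimited) can be sketched as below. This is a userspace illustration, not the code from kernel/hung_task.c; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether one more warning may be printed and update the budget.
 * A budget of 0 means no more warnings; a negative budget (such as -1)
 * is never decremented, so it allows an unlimited number of warnings.
 */
static bool consume_warning(int *budget)
{
	if (*budget == 0)
		return false;
	if (*budget > 0)
		(*budget)--;
	return true;
}
```

Keeping a separate budget for interval warnings and for timeout warnings means exhausting the early warnings does not silence the report issued at the timeout.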

Cc: Andrew Morton 
Cc: Dmitry Vyukov 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: "Peter Zijlstra (Intel)" 
Cc: Vasiliy Khoruzhick 
Cc: linux-doc@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: Dmitry Safonov 
---
 Documentation/admin-guide/sysctl/kernel.rst | 20 -
 include/linux/sched/sysctl.h|  1 +
 kernel/hung_task.c  | 50 ++---
 kernel/sysctl.c |  8 
 4 files changed, 62 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 032c7cd3cede..2e36620ec1e4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -45,6 +45,7 @@ show up in /proc/sys/kernel:
 - hung_task_timeout_secs
 - hung_task_check_interval_secs
 - hung_task_warnings
+- hung_task_interval_warnings
 - hyperv_record_panic_msg
 - kexec_load_disabled
 - kptr_restrict
@@ -383,14 +384,29 @@ Possible values to set are in range {0..LONG_MAX/HZ}.
 hung_task_warnings:
 ===================
 
-The maximum number of warnings to report. During a check interval
-if a hung task is detected, this value is decreased by 1.
+The maximum number of warnings to report. If after timeout a hung
+task is present, this value is decreased by 1 every check interval,
+producing a warning.
 When this value reaches 0, no more warnings will be reported.
 This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 -1: report an infinite number of warnings.
 
 
+hung_task_interval_warnings:
+============================
+
+The same as hung_task_warnings, but set the number of interval
+warnings to be issued about detected hung tasks during check
+interval. That will produce warnings *before* the timeout happens.
+If a hung task is detected during check interval, this value is
+decreased by 1. When this value reaches 0, only timeout warnings
+will be reported.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+-1: report an infinite number of check interval warnings.
+
+
 hyperv_record_panic_msg:
 
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..89f55e914673 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -12,6 +12,7 @@ extern unsigned int  sysctl_hung_task_panic;
 extern unsigned long sysctl_hung_task_timeout_secs;
 extern unsigned long sysctl_hung_task_check_interval_secs;
 extern int sysctl_hung_task_warnings;
+extern int sysctl_hung_task_interval_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 void __user *buffer,
 size_t *lenp, loff_t *ppos);
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 14a625c16cb3..cd971eef8226 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -49,6 +49,7 @@ unsigned long __read_mostly sysctl_hung_task_timeout_secs = 
CONFIG_DEFAULT_HUNG_
 unsigned long __read_mostly sysctl_hung_task_check_interval_secs;
 
 int __read_mostly sysctl_hung_ta

[PATCH v15 00/13] TCU patchset v15

2019-07-24 Thread Paul Cercueil

Hi,

This is the V15 of my Ingenic TCU patchset.

The big change since V14 is that the custom MFD driver
(ex patch 04/13) was dropped in favor of a small patch to syscon
and a "simple-mfd" compatible.

The patchset was based on mips/mips-next, but all of them minus
the last one will apply cleanly on v5.3-rc1.

Changelog:

* [02/13]: Remove info about MFD driver
* [03/13]: Add "simple-mfd" compatible string
* [04/13]: New patch
* [05/13]: - Use CLK_OF_DECLARE_DRIVER since we use "simple-mfd"
   - Use device_node_to_regmap()
* [06/13]: Use device_node_to_regmap()
* [07/13]: Use device_node_to_regmap()
* [09/13]: Add "simple-mfd" compatible string

Cheers,
-Paul

[PATCH v15 01/13] dt-bindings: ingenic: Add DT bindings for TCU clocks

2019-07-24 Thread Paul Cercueil

This header provides clock numbers for the ingenic,tcu
DT binding.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Reviewed-by: Rob Herring 
Acked-by: Stephen Boyd 
---

Notes:
v2: Use SPDX identifier for the license

v3/v4: No change

v5: s/JZ47*_/TCU_/ and dropped *_CLK_LAST defines

v6-v15: No change

 include/dt-bindings/clock/ingenic,tcu.h | 20 
 1 file changed, 20 insertions(+)
 create mode 100644 include/dt-bindings/clock/ingenic,tcu.h

diff --git a/include/dt-bindings/clock/ingenic,tcu.h b/include/dt-bindings/clock/ingenic,tcu.h
new file mode 100644
index 000000000000..d569650a7945
--- /dev/null
+++ b/include/dt-bindings/clock/ingenic,tcu.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This header provides clock numbers for the ingenic,tcu DT binding.
+ */
+
+#ifndef __DT_BINDINGS_CLOCK_INGENIC_TCU_H__
+#define __DT_BINDINGS_CLOCK_INGENIC_TCU_H__
+
+#define TCU_CLK_TIMER0 0
+#define TCU_CLK_TIMER1 1
+#define TCU_CLK_TIMER2 2
+#define TCU_CLK_TIMER3 3
+#define TCU_CLK_TIMER4 4
+#define TCU_CLK_TIMER5 5
+#define TCU_CLK_TIMER6 6
+#define TCU_CLK_TIMER7 7
+#define TCU_CLK_WDT 8
+#define TCU_CLK_OST 9
+
+#endif /* __DT_BINDINGS_CLOCK_INGENIC_TCU_H__ */
-- 
2.21.0.593.g511ec345e18

[PATCH v15 02/13] doc: Add doc for the Ingenic TCU hardware

2019-07-24 Thread Paul Cercueil

Add documentation about the Timer/Counter Unit (TCU) present in the
Ingenic JZ47xx SoCs.

The Timer/Counter Unit (TCU) in Ingenic JZ47xx SoCs is a multi-function
hardware block. It features up to eight channels that can be used as
counters, timers, or PWM.

- JZ4725B, JZ4750, JZ4755 only have six TCU channels. The other SoCs all
  have eight channels.

- JZ4725B introduced a separate channel, called Operating System Timer
  (OST). It is a 32-bit programmable timer. On JZ4770 and above, it is
  64-bit.

- Each one of the TCU channels has its own clock, which can be reparented
  to three different clocks (pclk, ext, rtc), gated, and reclocked, through
  their TCSR register.
  * The watchdog and OST hardware blocks also feature a TCSR register with
the same format in their register space.
  * The TCU registers used to gate/ungate can also gate/ungate the watchdog
and OST clocks.

- Each TCU channel works in one of two modes:
  * mode TCU1: channels cannot work in sleep mode, but are easier to
operate.
  * mode TCU2: channels can work in sleep mode, but the operation is a bit
more complicated than with TCU1 channels.

- The mode of each TCU channel depends on the SoC used:
  * On the oldest SoCs (up to JZ4740), all of the eight channels operate in
TCU1 mode.
  * On JZ4725B, channel 5 operates as TCU2, the others operate as TCU1.
  * On newest SoCs (JZ4750 and above), channels 1-2 operate as TCU2, the
others operate as TCU1.

- Each channel can generate an interrupt. Some channels share an interrupt
  line, some don't, and this changes between SoC versions:
  * on older SoCs (JZ4740 and below), channel 0 and channel 1 have their
own interrupt line; channels 2-7 share the last interrupt line.
  * On JZ4725B, channel 0 has its own interrupt; channels 1-5 share one
interrupt line; the OST uses the last interrupt line.
  * on newer SoCs (JZ4750 and above), channel 5 has its own interrupt;
channels 0-4 and (if eight channels) 6-7 all share one interrupt line;
the OST uses the last interrupt line.
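
The per-SoC channel mode rules listed above can be written down as a small lookup. This is a sketch derived only from that list; the enum and function names are hypothetical and do not come from the drivers in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* SoC groups as they are distinguished in the channel mode description. */
enum tcu_soc_gen {
	TCU_SOC_JZ4740_OR_OLDER,
	TCU_SOC_JZ4725B,
	TCU_SOC_JZ4750_OR_NEWER,
};

/* Bitmask of channels operating in TCU2 mode; all others are TCU1. */
static unsigned int tcu2_channel_mask(enum tcu_soc_gen gen)
{
	switch (gen) {
	case TCU_SOC_JZ4725B:
		return 1u << 5;			/* channel 5 */
	case TCU_SOC_JZ4750_OR_NEWER:
		return (1u << 1) | (1u << 2);	/* channels 1-2 */
	default:
		return 0;			/* every channel is TCU1 */
	}
}

/* True if the channel can keep working in sleep mode (TCU2). */
static bool tcu_channel_is_tcu2(enum tcu_soc_gen gen, unsigned int channel)
{
	return channel < 8 && (tcu2_channel_mask(gen) & (1u << channel));
}
```

A driver that needs a timer running across sleep could use such a mask to choose a suitable channel.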

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v4: New patch in this series

v5: Added information about number of channels, and improved
documentation about channel modes

v6: Add info about OST (can be 32-bit on older SoCs)

v7-v11: No change

v12: Add details about new implementation

v13: No change

v14: Convert to ReStructured Text

v15: Remove info about MFD driver

 Documentation/index.rst|  1 +
 Documentation/mips/index.rst   | 11 +
 Documentation/mips/ingenic-tcu.rst | 71 ++
 3 files changed, 83 insertions(+)
 create mode 100644 Documentation/mips/index.rst
 create mode 100644 Documentation/mips/ingenic-tcu.rst

diff --git a/Documentation/index.rst b/Documentation/index.rst
index 70ae148ec980..87214feda41f 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -143,6 +143,7 @@ implementation.
arm64/index
ia64/index
m68k/index
+   mips/index
riscv/index
s390/index
sh/index
diff --git a/Documentation/mips/index.rst b/Documentation/mips/index.rst
new file mode 100644
index 000000000000..321b4794f3b8
--- /dev/null
+++ b/Documentation/mips/index.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+MIPS-specific Documentation
+===========================
+
+.. toctree::
+   :maxdepth: 1
+   :numbered:
+
+   ingenic-tcu
diff --git a/Documentation/mips/ingenic-tcu.rst b/Documentation/mips/ingenic-tcu.rst
new file mode 100644
index 000000000000..c4ef4c45aade
--- /dev/null
+++ b/Documentation/mips/ingenic-tcu.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================
+Ingenic JZ47xx SoCs Timer/Counter Unit hardware
+===============================================
+
+The Timer/Counter Unit (TCU) in Ingenic JZ47xx SoCs is a multi-function
+hardware block. It features up to eight channels that can be used as
+counters, timers, or PWM.
+
+- JZ4725B, JZ4750, JZ4755 only have six TCU channels. The other SoCs all
+  have eight channels.
+
+- JZ4725B introduced a separate channel, called Operating System Timer
+  (OST). It is a 32-bit programmable timer. On JZ4760B and above, it is
+  64-bit.
+
+- Each one of the TCU channels has its own clock, which can be reparented to three
+  different clocks (pclk, ext, rtc), gated, and reclocked, through their TCSR register.
+
+- The watchdog and OST hardware blocks also feature a TCSR register with the same
+  format in their register space.
+- The TCU registers used to gate/ungate can also gate/ungate the watchdog and
+  OST clocks.
+
+- Each TCU channel works in one of two modes:
+
+- mode TCU1: channels cannot work in sleep mode, but are easier to
+  operate.
+- mode TCU2: channels can work in sleep mode, but the operation

[PATCH v15 03/13] dt-bindings: Add doc for the Ingenic TCU drivers

2019-07-24 Thread Paul Cercueil

Add documentation about how to properly use the Ingenic TCU
(Timer/Counter Unit) drivers from devicetree.

Signed-off-by: Paul Cercueil 
Reviewed-by: Rob Herring 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v4: New patch in this series. Corresponds to V2 patches 3-4-5 with
 added content.

v5:
 - Edited PWM/watchdog DT bindings documentation to point to the new
   document.
 - Moved main document to
   Documentation/devicetree/bindings/timer/ingenic,tcu.txt
 - Updated documentation to reflect the new devicetree bindings.

v6:
 - Removed PWM/watchdog documentation files as asked by upstream
 - Removed doc about properties that should be implicit
 - Removed doc about ingenic,timer-channel /
   ingenic,clocksource-channel as they are gone
 - Fix WDT clock name in the binding doc
 - Fix lengths of register areas in watchdog/pwm nodes

v7: No change

v8:
 - Fix address of the PWM node
 - Added doc about system timer and clocksource children nodes

v9:
 - Remove doc about system timer and clocksource children
   nodes...
 - Add doc about ingenic,pwm-channels-mask property

v10: No change

v11: Fix info about default value of ingenic,pwm-channels-mask

v12: Drop sub-nodes for now; they will be introduced in a follow-up
 patchset.

v13:
 - Revert back to v11. Turns out it was okay.
 - Remove 'interrupt-parent' of the list of required properties.

v14: No change

v15: Add "simple-mfd" compatible string

 .../bindings/pwm/ingenic,jz47xx-pwm.txt   |  22 ---
 .../devicetree/bindings/timer/ingenic,tcu.txt | 137 ++
 .../bindings/watchdog/ingenic,jz4740-wdt.txt  |  17 ---
 3 files changed, 137 insertions(+), 39 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
 create mode 100644 Documentation/devicetree/bindings/timer/ingenic,tcu.txt
 delete mode 100644 Documentation/devicetree/bindings/watchdog/ingenic,jz4740-wdt.txt

diff --git a/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt b/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
deleted file mode 100644
index 493bec80d59b..000000000000
--- a/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
+++ /dev/null
@@ -1,22 +0,0 @@
-Ingenic JZ47xx PWM Controller
-=
-
-Required properties:
-- compatible: Should be "ingenic,jz4740-pwm"
-- #pwm-cells: Should be 3. See pwm.txt in this directory for a description
-  of the cells format.
-- clocks : phandle to the external clock.
-- clock-names : Should be "ext".
-
-
-Example:
-
-   pwm: pwm@10002000 {
-   compatible = "ingenic,jz4740-pwm";
-   reg = <0x10002000 0x1000>;
-
-   #pwm-cells = <3>;
-
-   clocks = <&ext>;
-   clock-names = "ext";
-   };
diff --git a/Documentation/devicetree/bindings/timer/ingenic,tcu.txt b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt
new file mode 100644
index 000000000000..5a4b9ddd9470
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt
@@ -0,0 +1,137 @@
+Ingenic JZ47xx SoCs Timer/Counter Unit devicetree bindings
+==
+
+For a description of the TCU hardware and drivers, have a look at
+Documentation/mips/ingenic-tcu.rst.
+
+Required properties:
+
+- compatible: Must be one of:
+  * ingenic,jz4740-tcu
+  * ingenic,jz4725b-tcu
+  * ingenic,jz4770-tcu
+  followed by "simple-mfd".
+- reg: Should be the offset/length value corresponding to the TCU registers
+- clocks: List of phandle & clock specifiers for clocks external to the TCU.
+  The "pclk", "rtc" and "ext" clocks should be provided. The "tcu" clock
+  should be provided if the SoC has it.
+- clock-names: List of name strings for the external clocks.
+- #clock-cells: Should be <1>;
+  Clock consumers specify this argument to identify a clock. The valid values
+  may be found in .
+- interrupt-controller : Identifies the node as an interrupt controller
+- #interrupt-cells : Specifies the number of cells needed to encode an
+  interrupt source. The value should be 1.
+- interrupts : Specifies the interrupt the controller is connected to.
+
+Optional properties:
+
+- ingenic,pwm-channels-mask: Bitmask of TCU channels reserved for PWM use.
+  Default value is 0xfc.
+
+
+Children nodes
+==
+
+
+PWM node:
+-
+
+Required properties:
+
+- compatible: Must be one of:
+  * ingenic,jz4740-pwm
+  * ingenic,jz4725b-pwm
+- #pwm-cells: Should be 3. See ../pwm/pwm.txt for a description of the cell
+  format.
+- clocks: List of phandle & clock specifiers for the TCU clocks.
+- clock-names: List of name strings for the TCU clocks.
+
+
+Watchdog node:
+--
+
+Required properties:
+
+- compatible: Must be "ingenic,jz4740-w

[PATCH v15 04/13] mfd/syscon: Add device_node_to_regmap()

2019-07-24 Thread Paul Cercueil

device_node_to_regmap() is exactly like syscon_node_to_regmap(), but it
does not check that the node is compatible with "syscon", and won't
attach the first clock it finds to the regmap.

The rationale behind this is that one device node with a standard
compatible string "foo,bar" can be covered by multiple drivers sharing a
regmap, or by a single driver doing all the work without a regmap, but
these are implementation details which shouldn't be reflected in the
devicetree.

Signed-off-by: Paul Cercueil 
---

Notes:
v15: New patch

 drivers/mfd/syscon.c   | 46 +-
 include/linux/mfd/syscon.h |  6 +
 2 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/drivers/mfd/syscon.c b/drivers/mfd/syscon.c
index b65e585fc8c6..660723276481 100644
--- a/drivers/mfd/syscon.c
+++ b/drivers/mfd/syscon.c
@@ -40,7 +40,7 @@ static const struct regmap_config syscon_regmap_config = {
.reg_stride = 4,
 };
 
-static struct syscon *of_syscon_register(struct device_node *np)
+static struct syscon *of_syscon_register(struct device_node *np, bool 
check_clk)
 {
struct clk *clk;
struct syscon *syscon;
@@ -51,9 +51,6 @@ static struct syscon *of_syscon_register(struct device_node 
*np)
struct regmap_config syscon_config = syscon_regmap_config;
struct resource res;
 
-   if (!of_device_is_compatible(np, "syscon"))
-   return ERR_PTR(-EINVAL);
-
syscon = kzalloc(sizeof(*syscon), GFP_KERNEL);
if (!syscon)
return ERR_PTR(-ENOMEM);
@@ -117,16 +114,18 @@ static struct syscon *of_syscon_register(struct 
device_node *np)
goto err_regmap;
}
 
-   clk = of_clk_get(np, 0);
-   if (IS_ERR(clk)) {
-   ret = PTR_ERR(clk);
-   /* clock is optional */
-   if (ret != -ENOENT)
-   goto err_clk;
-   } else {
-   ret = regmap_mmio_attach_clk(regmap, clk);
-   if (ret)
-   goto err_attach;
+   if (check_clk) {
+   clk = of_clk_get(np, 0);
+   if (IS_ERR(clk)) {
+   ret = PTR_ERR(clk);
+   /* clock is optional */
+   if (ret != -ENOENT)
+   goto err_clk;
+   } else {
+   ret = regmap_mmio_attach_clk(regmap, clk);
+   if (ret)
+   goto err_attach;
+   }
}
 
syscon->regmap = regmap;
@@ -150,7 +149,8 @@ static struct syscon *of_syscon_register(struct device_node 
*np)
return ERR_PTR(ret);
 }
 
-struct regmap *syscon_node_to_regmap(struct device_node *np)
+static struct regmap *device_node_get_regmap(struct device_node *np,
+bool check_clk)
 {
struct syscon *entry, *syscon = NULL;
 
@@ -165,13 +165,27 @@ struct regmap *syscon_node_to_regmap(struct device_node 
*np)
spin_unlock(&syscon_list_slock);
 
if (!syscon)
-   syscon = of_syscon_register(np);
+   syscon = of_syscon_register(np, check_clk);
 
if (IS_ERR(syscon))
return ERR_CAST(syscon);
 
return syscon->regmap;
 }
+
+struct regmap *device_node_to_regmap(struct device_node *np)
+{
+   return device_node_get_regmap(np, false);
+}
+EXPORT_SYMBOL_GPL(device_node_to_regmap);
+
+struct regmap *syscon_node_to_regmap(struct device_node *np)
+{
+   if (!of_device_is_compatible(np, "syscon"))
+   return ERR_PTR(-EINVAL);
+
+   return device_node_get_regmap(np, true);
+}
 EXPORT_SYMBOL_GPL(syscon_node_to_regmap);
 
 struct regmap *syscon_regmap_lookup_by_compatible(const char *s)
diff --git a/include/linux/mfd/syscon.h b/include/linux/mfd/syscon.h
index 8cfda0554381..112dc66262cc 100644
--- a/include/linux/mfd/syscon.h
+++ b/include/linux/mfd/syscon.h
@@ -17,12 +17,18 @@
 struct device_node;
 
 #ifdef CONFIG_MFD_SYSCON
+extern struct regmap *device_node_to_regmap(struct device_node *np);
 extern struct regmap *syscon_node_to_regmap(struct device_node *np);
 extern struct regmap *syscon_regmap_lookup_by_compatible(const char *s);
 extern struct regmap *syscon_regmap_lookup_by_phandle(
struct device_node *np,
const char *property);
 #else
+static inline struct regmap *device_node_to_regmap(struct device_node *np)
+{
+   return ERR_PTR(-ENOTSUPP);
+}
+
 static inline struct regmap *syscon_node_to_regmap(struct device_node *np)
 {
return ERR_PTR(-ENOTSUPP);
-- 
2.21.0.593.g511ec345e18
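
The split made by this patch (one shared helper taking a check_clk flag,
with two thin public wrappers) can be modelled outside the kernel. The
following is only a userspace sketch of that control flow, not kernel code:
struct mock_node, struct mock_regmap and every mock_* function are
hypothetical stand-ins, and -22 stands in for -EINVAL.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for struct device_node: only what the sketch needs. */
struct mock_node {
	const char *compatible;	/* a single compatible string, for simplicity */
	bool has_clock;		/* whether the node lists a clock */
};

/* Hypothetical stand-in for the regmap handed back to the caller. */
struct mock_regmap {
	bool clk_attached;
};

/* Shared helper: a clock is looked up and attached only if check_clk is set. */
static int mock_get_regmap(const struct mock_node *np, bool check_clk,
			   struct mock_regmap *out)
{
	out->clk_attached = check_clk && np->has_clock;
	return 0;
}

/* Models device_node_to_regmap(): no "syscon" check, no clock attached. */
int mock_device_node_to_regmap(const struct mock_node *np,
			       struct mock_regmap *out)
{
	return mock_get_regmap(np, false, out);
}

/* Models syscon_node_to_regmap(): rejects nodes that are not "syscon". */
int mock_syscon_node_to_regmap(const struct mock_node *np,
			       struct mock_regmap *out)
{
	if (strcmp(np->compatible, "syscon") != 0)
		return -22;	/* stands in for -EINVAL */

	return mock_get_regmap(np, true, out);
}
```

In this model a node whose compatible string is "ingenic,jz4740-tcu" is
refused by the syscon wrapper but accepted by the device-node wrapper, and
its clock is left alone for the driver to manage.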

[PATCH v15 05/13] clk: ingenic: Add driver for the TCU clocks

2019-07-24 Thread Paul Cercueil

Add driver to support the clocks provided by the Timer/Counter Unit
(TCU) of the JZ47xx SoCs from Ingenic.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v12: New patch

v13:
 - Don't enable/disable the TCU clock on demand. Enable it in the probe
   and call it a day.
 - Register suspend callbacks to gate/ungate the TCU clock on
   suspend/resume.
 - Use pr_fmt and pr_crit instead of custom TCU_ERR() macro
 - Remove useless dependency on COMMON_CLK in Kconfig
 - Remove registration of clkdev

v14: Change %i to %d

v15:
 - Use CLK_OF_DECLARE_DRIVER macro since we use "simple-mfd"
 - Use device_node_to_regmap()

 drivers/clk/ingenic/Kconfig  |  10 +-
 drivers/clk/ingenic/Makefile |   1 +
 drivers/clk/ingenic/tcu.c| 474 +++
 3 files changed, 484 insertions(+), 1 deletion(-)
 create mode 100644 drivers/clk/ingenic/tcu.c

diff --git a/drivers/clk/ingenic/Kconfig b/drivers/clk/ingenic/Kconfig
index fe8db93cf21a..1cb489959a99 100644
--- a/drivers/clk/ingenic/Kconfig
+++ b/drivers/clk/ingenic/Kconfig
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
-menu "Ingenic JZ47xx CGU drivers"
+menu "Ingenic SoCs drivers"
depends on MIPS
 
 config INGENIC_CGU_COMMON
@@ -45,4 +45,12 @@ config INGENIC_CGU_JZ4780
 
  If building for a JZ4780 SoC, you want to say Y here.
 
+config INGENIC_TCU_CLK
+   bool "Ingenic JZ47xx TCU clocks driver"
+   default MACH_INGENIC
+   select MFD_SYSCON
+   help
+ Support the clocks of the Timer/Counter Unit (TCU) of the Ingenic
+ JZ47xx SoCs.
+
 endmenu
diff --git a/drivers/clk/ingenic/Makefile b/drivers/clk/ingenic/Makefile
index 250570a809d3..097220b05131 100644
--- a/drivers/clk/ingenic/Makefile
+++ b/drivers/clk/ingenic/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_INGENIC_CGU_JZ4740)+= jz4740-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4725B)  += jz4725b-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4770)   += jz4770-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4780)   += jz4780-cgu.o
+obj-$(CONFIG_INGENIC_TCU_CLK)  += tcu.o
diff --git a/drivers/clk/ingenic/tcu.c b/drivers/clk/ingenic/tcu.c
new file mode 100644
index ..a1a5f9cb439e
--- /dev/null
+++ b/drivers/clk/ingenic/tcu.c
@@ -0,0 +1,474 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * JZ47xx SoCs TCU clocks driver
+ * Copyright (C) 2019 Paul Cercueil 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/* 8 channels max + watchdog + OST */
+#define TCU_CLK_COUNT  10
+
+#undef pr_fmt
+#define pr_fmt(fmt) "ingenic-tcu-clk: " fmt
+
+enum tcu_clk_parent {
+   TCU_PARENT_PCLK,
+   TCU_PARENT_RTC,
+   TCU_PARENT_EXT,
+};
+
+struct ingenic_soc_info {
+   unsigned int num_channels;
+   bool has_ost;
+   bool has_tcu_clk;
+};
+
+struct ingenic_tcu_clk_info {
+   struct clk_init_data init_data;
+   u8 gate_bit;
+   u8 tcsr_reg;
+};
+
+struct ingenic_tcu_clk {
+   struct clk_hw hw;
+   unsigned int idx;
+   struct ingenic_tcu *tcu;
+   const struct ingenic_tcu_clk_info *info;
+};
+
+struct ingenic_tcu {
+   const struct ingenic_soc_info *soc_info;
+   struct regmap *map;
+   struct clk *clk;
+
+   struct clk_hw_onecell_data *clocks;
+};
+
+static struct ingenic_tcu *ingenic_tcu;
+
+static inline struct ingenic_tcu_clk *to_tcu_clk(struct clk_hw *hw)
+{
+   return container_of(hw, struct ingenic_tcu_clk, hw);
+}
+
+static int ingenic_tcu_enable(struct clk_hw *hw)
+{
+   struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+   const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+   struct ingenic_tcu *tcu = tcu_clk->tcu;
+
+   regmap_write(tcu->map, TCU_REG_TSCR, BIT(info->gate_bit));
+
+   return 0;
+}
+
+static void ingenic_tcu_disable(struct clk_hw *hw)
+{
+   struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+   const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+   struct ingenic_tcu *tcu = tcu_clk->tcu;
+
+   regmap_write(tcu->map, TCU_REG_TSSR, BIT(info->gate_bit));
+}
+
+static int ingenic_tcu_is_enabled(struct clk_hw *hw)
+{
+   struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+   const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+   unsigned int value;
+
+   regmap_read(tcu_clk->tcu->map, TCU_REG_TSR, &value);
+
+   return !(value & BIT(info->gate_bit));
+}
+
+static bool ingenic_tcu_enable_regs(struct clk_hw *hw)
+{
+   struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+   const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+   struct ingenic_tcu *tcu = tcu_clk->tcu;
+   bool enabled = false;
+
+   /*
+* If the SoC has no global TCU clock, we must ungate the channel's
+* clock to be able to access its registers.
+* If we have a TCU clock, it will be enabled automatically as it has
+*

[PATCH v15 06/13] irqchip: Add irq-ingenic-tcu driver

2019-07-24 Thread Paul Cercueil

This driver handles the interrupt controller built in the Timer/Counter
Unit (TCU) of the JZ47xx SoCs from Ingenic.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Reviewed-by: Thomas Gleixner 
---

Notes:
v12: New patch

v13: No change

v14: Remove empty lines in structure definitions

v15: Use device_node_to_regmap()

 drivers/irqchip/Kconfig   |  11 ++
 drivers/irqchip/Makefile  |   1 +
 drivers/irqchip/irq-ingenic-tcu.c | 182 ++
 3 files changed, 194 insertions(+)
 create mode 100644 drivers/irqchip/irq-ingenic-tcu.c

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index 80e10f4e213a..3c8308e6b3a7 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -315,6 +315,17 @@ config INGENIC_IRQ
depends on MACH_INGENIC
default y
 
+config INGENIC_TCU_IRQ
+   bool "Ingenic JZ47xx TCU interrupt controller"
+   default MACH_INGENIC
+   depends on MIPS || COMPILE_TEST
+   select MFD_SYSCON
+   help
+ Support for interrupts in the Timer/Counter Unit (TCU) of the Ingenic
+ JZ47xx SoCs.
+
+ If unsure, say N.
+
 config RENESAS_H8300H_INTC
 bool
select IRQ_DOMAIN
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index 8d0fcec6ab23..cc7c43932f16 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_RENESAS_H8300H_INTC) += irq-renesas-h8300h.o
 obj-$(CONFIG_RENESAS_H8S_INTC) += irq-renesas-h8s.o
 obj-$(CONFIG_ARCH_SA1100)  += irq-sa11x0.o
 obj-$(CONFIG_INGENIC_IRQ)  += irq-ingenic.o
+obj-$(CONFIG_INGENIC_TCU_IRQ)  += irq-ingenic-tcu.o
 obj-$(CONFIG_IMX_GPCV2)+= irq-imx-gpcv2.o
 obj-$(CONFIG_PIC32_EVIC)   += irq-pic32-evic.o
 obj-$(CONFIG_MSCC_OCELOT_IRQ)  += irq-mscc-ocelot.o
diff --git a/drivers/irqchip/irq-ingenic-tcu.c 
b/drivers/irqchip/irq-ingenic-tcu.c
new file mode 100644
index ..6d05cefe9d79
--- /dev/null
+++ b/drivers/irqchip/irq-ingenic-tcu.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * JZ47xx SoCs TCU IRQ driver
+ * Copyright (C) 2019 Paul Cercueil 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct ingenic_tcu {
+   struct regmap *map;
+   struct clk *clk;
+   struct irq_domain *domain;
+   unsigned int nb_parent_irqs;
+   u32 parent_irqs[3];
+};
+
+static void ingenic_tcu_intc_cascade(struct irq_desc *desc)
+{
+   struct irq_chip *irq_chip = irq_data_get_irq_chip(&desc->irq_data);
+   struct irq_domain *domain = irq_desc_get_handler_data(desc);
+   struct irq_chip_generic *gc = irq_get_domain_generic_chip(domain, 0);
+   struct regmap *map = gc->private;
+   uint32_t irq_reg, irq_mask;
+   unsigned int i;
+
+   regmap_read(map, TCU_REG_TFR, &irq_reg);
+   regmap_read(map, TCU_REG_TMR, &irq_mask);
+
+   chained_irq_enter(irq_chip, desc);
+
+   irq_reg &= ~irq_mask;
+
+   for_each_set_bit(i, (unsigned long *)&irq_reg, 32)
+   generic_handle_irq(irq_linear_revmap(domain, i));
+
+   chained_irq_exit(irq_chip, desc);
+}
+
+static void ingenic_tcu_gc_unmask_enable_reg(struct irq_data *d)
+{
+   struct irq_chip_generic *gc = irq_data_get_irq_chip_data(d);
+   struct irq_chip_type *ct = irq_data_get_chip_type(d);
+   struct regmap *map = gc->private;
+   u32 mask = d->mask;
+
+   irq_gc_lock(gc);
+   regmap_write(map, ct->regs.ack, mask);
+   regmap_write(map, ct->regs.enable, mask);
+   *ct->mask_cache |= mask;
+   irq_gc_unlock(gc);
+}
+
+static void ingenic_tcu_gc_mask_disable_reg(struct irq_data *d)
+{
+   struct irq_chip_generic *gc = irq_data_get_irq_chip_data(d);
+   struct irq_chip_type *ct = irq_data_get_chip_type(d);
+   struct regmap *map = gc->private;
+   u32 mask = d->mask;
+
+   irq_gc_lock(gc);
+   regmap_write(map, ct->regs.disable, mask);
+   *ct->mask_cache &= ~mask;
+   irq_gc_unlock(gc);
+}
+
+static void ingenic_tcu_gc_mask_disable_reg_and_ack(struct irq_data *d)
+{
+   struct irq_chip_generic *gc = irq_data_get_irq_chip_data(d);
+   struct irq_chip_type *ct = irq_data_get_chip_type(d);
+   struct regmap *map = gc->private;
+   u32 mask = d->mask;
+
+   irq_gc_lock(gc);
+   regmap_write(map, ct->regs.ack, mask);
+   regmap_write(map, ct->regs.disable, mask);
+   irq_gc_unlock(gc);
+}
+
+static int __init ingenic_tcu_irq_init(struct device_node *np,
+  struct device_node *parent)
+{
+   struct irq_chip_generic *gc;
+   struct irq_chip_type *ct;
+   struct ingenic_tcu *tcu;
+   struct regmap *map;
+   unsigned int i;
+   int ret, irqs;
+
+   map = device_node_to_regmap(np);
+   if (IS_ERR(map))
+   ret
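
The cascade handler in the patch above reads the flag and mask registers,
clears the masked bits, and dispatches every bit that is still set. The bit
logic alone can be sketched in userspace as follows; pending_sources() is a
hypothetical helper, and the register values are passed in as plain
integers instead of being read through a regmap.

```c
#include <stdint.h>

/*
 * Collect the pending, unmasked interrupt sources.
 * flags models the value read from the flag register, mask the value read
 * from the mask register. The bit numbers are stored in out, which must
 * have room for 32 entries; the return value is the number of entries.
 */
unsigned int pending_sources(uint32_t flags, uint32_t mask,
			     unsigned int out[32])
{
	uint32_t pending = flags & ~mask;	/* same step as irq_reg &= ~irq_mask */
	unsigned int n = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (pending & (UINT32_C(1) << bit))
			out[n++] = bit;
	}

	return n;
}
```

With flags 0x105 and mask 0x004, bit 2 is suppressed and only bits 0 and 8
are reported.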

[PATCH v15 08/13] clk: jz4740: Add TCU clock

2019-07-24 Thread Paul Cercueil

Add the missing TCU clock to the list of clocks supplied by the CGU for
the JZ4740 SoC.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Acked-by: Stephen Boyd 
Acked-by: Rob Herring 
---

Notes:
v5: New patch

v6-v15: No change

 drivers/clk/ingenic/jz4740-cgu.c   | 6 ++
 include/dt-bindings/clock/jz4740-cgu.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/clk/ingenic/jz4740-cgu.c b/drivers/clk/ingenic/jz4740-cgu.c
index 4c0a20949c2c..67f8a0e14284 100644
--- a/drivers/clk/ingenic/jz4740-cgu.c
+++ b/drivers/clk/ingenic/jz4740-cgu.c
@@ -222,6 +222,12 @@ static const struct ingenic_cgu_clk_info 
jz4740_cgu_clocks[] = {
.parents = { JZ4740_CLK_EXT, -1, -1, -1 },
.gate = { CGU_REG_CLKGR, 5 },
},
+
+   [JZ4740_CLK_TCU] = {
+   "tcu", CGU_CLK_GATE,
+   .parents = { JZ4740_CLK_EXT, -1, -1, -1 },
+   .gate = { CGU_REG_CLKGR, 1 },
+   },
 };
 
 static void __init jz4740_cgu_init(struct device_node *np)
diff --git a/include/dt-bindings/clock/jz4740-cgu.h 
b/include/dt-bindings/clock/jz4740-cgu.h
index 6ed83f926ae7..e82d77028581 100644
--- a/include/dt-bindings/clock/jz4740-cgu.h
+++ b/include/dt-bindings/clock/jz4740-cgu.h
@@ -34,5 +34,6 @@
 #define JZ4740_CLK_ADC 19
 #define JZ4740_CLK_I2C 20
 #define JZ4740_CLK_AIC 21
+#define JZ4740_CLK_TCU 22
 
 #endif /* __DT_BINDINGS_CLOCK_JZ4740_CGU_H__ */
-- 
2.21.0.593.g511ec345e18

[PATCH v15 07/13] clocksource: Add a new timer-ingenic driver

2019-07-24 Thread Paul Cercueil

This driver handles the TCU (Timer Counter Unit) present on the Ingenic
JZ47xx SoCs, and provides the kernel with a system timer, a clocksource
and a sched_clock.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Reviewed-by: Thomas Gleixner 
---

Notes:
v2: Use SPDX identifier for the license

v3: - Move documentation to its own patch
- Search the devicetree for PWM clients, and use all the TCU
  channels that won't be used for PWM

v4: - Add documentation about why we search for PWM clients
- Verify that the PWM clients are for the TCU PWM driver

v5: Major overhaul. Too many changes to list. Consider it a new
patch.

v6: - Add two API functions ingenic_tcu_request_channel and
  ingenic_tcu_release_channel. To be used by the PWM driver to
  request the use of a TCU channel. The driver will now dynamically
  move away the system timer or clocksource to a new TCU channel.
- The system timer now defaults to channel 0, the clocksource now
  defaults to channel 1 and is no longer optional. The
  ingenic,timer-channel and ingenic,clocksource-channel devicetree
  properties are now gone.
- Fix round_rate / set_rate not calculating the prescale divider
  the same way. This caused problems when (parent_rate / div) would
  give a non-integer result. The behaviour is correct now.
- The clocksource clock is turned off on suspend now.

v7: Fix section mismatch by using builtin_platform_driver_probe()

v8: - Removed ingenic_tcu_[request,release]_channel, and the mechanism
  to dynamically change the TCU channel of the system timer or
  the clocksource.
- The driver's devicetree node can now have two more children
  nodes, that correspond to the system timer and clocksource.
  For these two, the driver will use the TCU timer that
  corresponds to the memory resource supplied in their
  respective node.

v9: - Removed support for clocksource / timer children devicetree
  nodes. Now, we use a property "ingenic,pwm-channels-mask" to
  know which PWM channels are reserved for PWM use and should
  not be used as OS timers.

v10: - Use CLK_SET_RATE_UNGATE instead of CLK_SET_RATE_GATE + manually
   un-gating the clock before changing rate. Same for re-parenting.
 - Unconditionally create the clocksource and sched_clock even if
   the SoC possesses an OS Timer. That gives the choice back to the
   user which clocksource should be selected.
 - Use subsys_initcall() instead of builtin_platform_driver_probe().
   The OS Timer driver calls builtin_platform_driver_probe, which
   requires the device to be created before that.
 - Cosmetic cleanups

v11: - Change prototype of exported function
   ingenic_tcu_pwm_can_use_chn(), use a struct device * as first
   argument.
 - Read clocksource using the regmap instead of bypassing it.
   Bypassing the regmap makes sense only for the sched_clock where
   the read operation must be as fast as possible.
 - Fix incorrect format in pr_crit() macro

v12: - Clock handling and IRQ handling are gone, and are now handled
   in their own driver.
 - Obtain regmap from the ingenic-tcu MFD driver. As a result, we
   cannot bypass the regmap anymore for the sched_clock.

v13: No change

v14: Remove empty lines in structure definitions

v15: Use device_node_to_regmap()

 drivers/clocksource/Kconfig |  11 +
 drivers/clocksource/Makefile|   1 +
 drivers/clocksource/ingenic-timer.c | 356 
 3 files changed, 368 insertions(+)
 create mode 100644 drivers/clocksource/ingenic-timer.c

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 5e9317dc3d39..a9cdc2c4f8bd 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -685,4 +685,15 @@ config MILBEAUT_TIMER
help
  Enables the support for Milbeaut timer driver.
 
+config INGENIC_TIMER
+   bool "Clocksource/timer using the TCU in Ingenic JZ SoCs"
+   default MACH_INGENIC
+   depends on MIPS || COMPILE_TEST
+   depends on COMMON_CLK
+   select MFD_SYSCON
+   select TIMER_OF
+   select IRQ_DOMAIN
+   help
+ Support for the timer/counter unit of the Ingenic JZ SoCs.
+
 endmenu
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 2e7936e7833f..4dfe4225ece7 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -80,6 +80,7 @@ obj-$(CONFIG_ASM9260_TIMER)   += asm9260_timer.o
 obj-$(CONFIG_H8300_TMR8)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CON

[PATCH v15 12/13] MIPS: GCW0: Reduce system timer and clocksource to 750 kHz

2019-07-24 Thread Paul Cercueil

The default clock (12 MHz) is too fast for the system timer.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v8: New patch

v9: Don't configure clock timer1, as the OS Timer is used as
clocksource on this SoC

v10: Revert back to v8 behaviour. Let the user choose what
 clocksource should be used.

v11: No change

v12: Move clocksource to channel 2, as channel 1 is used as PWM
 for the backlight.

v13-v15: No change

 arch/mips/boot/dts/ingenic/gcw0.dts | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/mips/boot/dts/ingenic/gcw0.dts 
b/arch/mips/boot/dts/ingenic/gcw0.dts
index 35f0291e8d38..f58d239c2058 100644
--- a/arch/mips/boot/dts/ingenic/gcw0.dts
+++ b/arch/mips/boot/dts/ingenic/gcw0.dts
@@ -2,6 +2,7 @@
 /dts-v1/;
 
 #include "jz4770.dtsi"
+#include 
 
 / {
compatible = "gcw,zero", "ingenic,jz4770";
@@ -60,3 +61,12 @@
/* The WiFi module is connected to the UHC. */
status = "okay";
 };
+
+&tcu {
+   /* 750 kHz for the system timer and clocksource */
+   assigned-clocks = <&tcu TCU_CLK_TIMER0>, <&tcu TCU_CLK_TIMER2>;
+   assigned-clock-rates = <750000>, <750000>;
+
+   /* PWM1 is in use, so reserve channel #2 for the clocksource */
+   ingenic,pwm-channels-mask = <0xfa>;
+};
-- 
2.21.0.593.g511ec345e18
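
The "ingenic,pwm-channels-mask" property is a plain bitmask: bit N set
means TCU channel N is reserved for PWM, and the remaining channels are
left for the system timer and clocksource. A small userspace sketch of how
such a mask can be decoded follows; both helper names are hypothetical, and
the channel count of 8 is taken from the "8 channels max" comment in the
clock driver patch of this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the given channel is reserved for PWM by the mask. */
bool channel_reserved_for_pwm(uint32_t pwm_mask, unsigned int channel)
{
	return (pwm_mask >> channel) & 1u;
}

/* Bitmask of the channels (0 .. num_channels - 1) NOT reserved for PWM. */
uint32_t timer_channels(uint32_t pwm_mask, unsigned int num_channels)
{
	uint32_t all = (num_channels >= 32)
		? UINT32_MAX
		: (UINT32_C(1) << num_channels) - 1;

	return ~pwm_mask & all;
}
```

With the documented default of 0xfc, channels 0 and 1 stay free (result
0x03). With the 0xfa used for the GCW0, channel 1 is handed to PWM and
channels 0 and 2 stay free (result 0x05), which matches the TCU_CLK_TIMER0
and TCU_CLK_TIMER2 entries in assigned-clocks.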

[PATCH v15 10/13] MIPS: qi_lb60: Reduce system timer and clocksource to 750 kHz

2019-07-24 Thread Paul Cercueil

The default clock (12 MHz) is too fast for the system timer, which fails
to report time accurately.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v5: New patch

v6: Remove ingenic,clocksource-channel property

v7-v15: No change

 arch/mips/boot/dts/ingenic/qi_lb60.dts | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/mips/boot/dts/ingenic/qi_lb60.dts 
b/arch/mips/boot/dts/ingenic/qi_lb60.dts
index cc26650562c2..933d98ca8d93 100644
--- a/arch/mips/boot/dts/ingenic/qi_lb60.dts
+++ b/arch/mips/boot/dts/ingenic/qi_lb60.dts
@@ -2,6 +2,7 @@
 /dts-v1/;
 
 #include "jz4740.dtsi"
+#include 
 #include 
 
 / {
@@ -64,3 +65,9 @@
pinctrl-names = "default";
pinctrl-0 = <&pins_mmc>;
 };
+
+&tcu {
+   /* 750 kHz for the system timer and clocksource */
+   assigned-clocks = <&tcu TCU_CLK_TIMER0>, <&tcu TCU_CLK_TIMER1>;
+   assigned-clock-rates = <750000>, <750000>;
+};
-- 
2.21.0.593.g511ec345e18
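
The numbers behind these rate changes can be checked with a little
arithmetic. The sketch below assumes a divide-by-16 prescaler and a 16-bit
counter, as in the old platform code removed by patch 13 (clk_rate >> 4 and
CLOCKSOURCE_MASK(16)); whether the new driver ends up with the same divider
is not shown in this thread. Both function names are hypothetical.

```c
#include <stdint.h>

/* Rate after a power-of-two prescaler; shift = 4 means divide by 16. */
uint32_t prescaled_rate(uint32_t parent_hz, unsigned int shift)
{
	return parent_hz >> shift;
}

/* Time in microseconds (rounded down) before the counter wraps around. */
uint64_t wrap_period_us(uint32_t rate_hz, unsigned int counter_bits)
{
	return ((UINT64_C(1) << counter_bits) * UINT64_C(1000000)) / rate_hz;
}
```

Under these assumptions, 12 MHz divided by 16 gives 750 kHz and 48 MHz
divided by 16 gives 3 MHz. A 16-bit counter wraps after roughly 5.5 ms at
12 MHz, but only after roughly 87 ms at 750 kHz.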

[PATCH v15 13/13] MIPS: jz4740: Drop obsolete code

2019-07-24 Thread Paul Cercueil

The old clocksource/timer platform code is now obsoleted by the newly
introduced TCU drivers.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v5: New patch

v6-v11: No change

v12: Only remove clocksource code. The rest will eventually be
 removed in a future patchset when the PWM/watchdog drivers
 are updated.

v13-v15: No change

 arch/mips/jz4740/time.c | 151 +---
 1 file changed, 2 insertions(+), 149 deletions(-)

diff --git a/arch/mips/jz4740/time.c b/arch/mips/jz4740/time.c
index cb768e560d8b..5476899f0882 100644
--- a/arch/mips/jz4740/time.c
+++ b/arch/mips/jz4740/time.c
@@ -4,161 +4,14 @@
  *  JZ4740 platform time support
  */
 
-#include 
 #include 
-#include 
-#include 
-#include 
+#include 
 
-#include 
-#include 
-
-#include 
 #include 
-#include 
-
-#define TIMER_CLOCKEVENT 0
-#define TIMER_CLOCKSOURCE 1
-
-static uint16_t jz4740_jiffies_per_tick;
-
-static u64 jz4740_clocksource_read(struct clocksource *cs)
-{
-   return jz4740_timer_get_count(TIMER_CLOCKSOURCE);
-}
-
-static struct clocksource jz4740_clocksource = {
-   .name = "jz4740-timer",
-   .rating = 200,
-   .read = jz4740_clocksource_read,
-   .mask = CLOCKSOURCE_MASK(16),
-   .flags = CLOCK_SOURCE_IS_CONTINUOUS,
-};
-
-static u64 notrace jz4740_read_sched_clock(void)
-{
-   return jz4740_timer_get_count(TIMER_CLOCKSOURCE);
-}
-
-static irqreturn_t jz4740_clockevent_irq(int irq, void *devid)
-{
-   struct clock_event_device *cd = devid;
-
-   jz4740_timer_ack_full(TIMER_CLOCKEVENT);
-
-   if (!clockevent_state_periodic(cd))
-   jz4740_timer_disable(TIMER_CLOCKEVENT);
-
-   cd->event_handler(cd);
-
-   return IRQ_HANDLED;
-}
-
-static int jz4740_clockevent_set_periodic(struct clock_event_device *evt)
-{
-   jz4740_timer_set_count(TIMER_CLOCKEVENT, 0);
-   jz4740_timer_set_period(TIMER_CLOCKEVENT, jz4740_jiffies_per_tick);
-   jz4740_timer_irq_full_enable(TIMER_CLOCKEVENT);
-   jz4740_timer_enable(TIMER_CLOCKEVENT);
-
-   return 0;
-}
-
-static int jz4740_clockevent_resume(struct clock_event_device *evt)
-{
-   jz4740_timer_irq_full_enable(TIMER_CLOCKEVENT);
-   jz4740_timer_enable(TIMER_CLOCKEVENT);
-
-   return 0;
-}
-
-static int jz4740_clockevent_shutdown(struct clock_event_device *evt)
-{
-   jz4740_timer_disable(TIMER_CLOCKEVENT);
-
-   return 0;
-}
-
-static int jz4740_clockevent_set_next(unsigned long evt,
-   struct clock_event_device *cd)
-{
-   jz4740_timer_set_count(TIMER_CLOCKEVENT, 0);
-   jz4740_timer_set_period(TIMER_CLOCKEVENT, evt);
-   jz4740_timer_enable(TIMER_CLOCKEVENT);
-
-   return 0;
-}
-
-static struct clock_event_device jz4740_clockevent = {
-   .name = "jz4740-timer",
-   .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
-   .set_next_event = jz4740_clockevent_set_next,
-   .set_state_shutdown = jz4740_clockevent_shutdown,
-   .set_state_periodic = jz4740_clockevent_set_periodic,
-   .set_state_oneshot = jz4740_clockevent_shutdown,
-   .tick_resume = jz4740_clockevent_resume,
-   .rating = 200,
-#ifdef CONFIG_MACH_JZ4740
-   .irq = JZ4740_IRQ_TCU0,
-#endif
-#if defined(CONFIG_MACH_JZ4770) || defined(CONFIG_MACH_JZ4780)
-   .irq = JZ4780_IRQ_TCU2,
-#endif
-};
-
-static struct irqaction timer_irqaction = {
-   .handler= jz4740_clockevent_irq,
-   .flags  = IRQF_PERCPU | IRQF_TIMER,
-   .name   = "jz4740-timerirq",
-   .dev_id = &jz4740_clockevent,
-};
 
 void __init plat_time_init(void)
 {
-   int ret;
-   uint32_t clk_rate;
-   uint16_t ctrl;
-   struct clk *ext_clk;
-
of_clk_init(NULL);
jz4740_timer_init();
-
-   ext_clk = clk_get(NULL, "ext");
-   if (IS_ERR(ext_clk))
-   panic("unable to get ext clock");
-   clk_rate = clk_get_rate(ext_clk) >> 4;
-   clk_put(ext_clk);
-
-   jz4740_jiffies_per_tick = DIV_ROUND_CLOSEST(clk_rate, HZ);
-
-   clockevent_set_clock(&jz4740_clockevent, clk_rate);
-   jz4740_clockevent.min_delta_ns = clockevent_delta2ns(100, 
&jz4740_clockevent);
-   jz4740_clockevent.min_delta_ticks = 100;
-   jz4740_clockevent.max_delta_ns = clockevent_delta2ns(0x, 
&jz4740_clockevent);
-   jz4740_clockevent.max_delta_ticks = 0x;
-   jz4740_clockevent.cpumask = cpumask_of(0);
-
-   clockevents_register_device(&jz4740_clockevent);
-
-   ret = clocksource_register_hz(&jz4740_clocksource, clk_rate);
-
-   if (ret)
-   printk(KERN_ERR "Failed to register clocksource: %d\n", ret);
-
-   sched_clock_register(jz4740_read_sched_clock, 16, clk_rate);
-
-   setup_irq(jz4740_clockevent.irq, &timer_irqaction);
-
-   ctrl = JZ_TIMER_CTRL_PRESCALE_16 | JZ_TIMER_CTRL_SRC_EXT;
-
-   jz4740_timer_set_ctrl(TIMER_CLOCKEVENT, ctrl);
-

[PATCH v15 11/13] MIPS: CI20: Reduce system timer and clocksource to 3 MHz

2019-07-24 Thread Paul Cercueil

The default clock (48 MHz) is too fast for the system timer.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v5: New patch

v6: Set also the rate for the clocksource channel's clock

v7: No change

v8: No change

v9: Don't configure clock timer1, as the OS Timer is used as
clocksource on this SoC

v10: Revert back to v8 behaviour. Let the user choose what
 clocksource should be used.

v11-v15: No change

 arch/mips/boot/dts/ingenic/ci20.dts | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/mips/boot/dts/ingenic/ci20.dts 
b/arch/mips/boot/dts/ingenic/ci20.dts
index 4f7b1fa31cf5..2e9952311ecd 100644
--- a/arch/mips/boot/dts/ingenic/ci20.dts
+++ b/arch/mips/boot/dts/ingenic/ci20.dts
@@ -2,6 +2,7 @@
 /dts-v1/;
 
 #include "jz4780.dtsi"
+#include 
 #include 
 
 / {
@@ -238,3 +239,9 @@
bias-disable;
};
 };
+
+&tcu {
+   /* 3 MHz for the system timer and clocksource */
+   assigned-clocks = <&tcu TCU_CLK_TIMER0>, <&tcu TCU_CLK_TIMER1>;
+   assigned-clock-rates = <3000000>, <3000000>;
+};
-- 
2.21.0.593.g511ec345e18

[PATCH v15 09/13] MIPS: jz4740: Add DTS nodes for the TCU drivers

2019-07-24 Thread Paul Cercueil

Add DTS nodes for the JZ4780, JZ4770 and JZ4740 devicetree files.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
---

Notes:
v5: New patch

v6: Fix register lengths in watchdog/pwm nodes

v7: No change

v8: - Fix wrong start address for PWM node
- Add system timer and clocksource sub-nodes

v9: Drop timer and clocksource sub-nodes

v10-v11: No change

v12: Drop PWM/watchdog/OST sub-nodes, for now.

v13-v14: No change

v15: Add "simple-mfd" compatible string

 arch/mips/boot/dts/ingenic/jz4740.dtsi | 22 ++
 arch/mips/boot/dts/ingenic/jz4770.dtsi | 21 +
 arch/mips/boot/dts/ingenic/jz4780.dtsi | 23 +++
 3 files changed, 66 insertions(+)

diff --git a/arch/mips/boot/dts/ingenic/jz4740.dtsi 
b/arch/mips/boot/dts/ingenic/jz4740.dtsi
index 3ffaf63f22dd..058800bfc875 100644
--- a/arch/mips/boot/dts/ingenic/jz4740.dtsi
+++ b/arch/mips/boot/dts/ingenic/jz4740.dtsi
@@ -53,6 +53,28 @@
clock-names = "rtc";
};
 
+   tcu: timer@10002000 {
+   compatible = "ingenic,jz4740-tcu", "simple-mfd";
+   reg = <0x10002000 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0x0 0x10002000 0x1000>;
+
+   #clock-cells = <1>;
+
+   clocks = <&cgu JZ4740_CLK_RTC
+ &cgu JZ4740_CLK_EXT
+ &cgu JZ4740_CLK_PCLK
+ &cgu JZ4740_CLK_TCU>;
+   clock-names = "rtc", "ext", "pclk", "tcu";
+
+   interrupt-controller;
+   #interrupt-cells = <1>;
+
+   interrupt-parent = <&intc>;
+   interrupts = <23 22 21>;
+   };
+
rtc_dev: rtc@10003000 {
compatible = "ingenic,jz4740-rtc";
reg = <0x10003000 0x40>;
diff --git a/arch/mips/boot/dts/ingenic/jz4770.dtsi 
b/arch/mips/boot/dts/ingenic/jz4770.dtsi
index 49ede6c14ff3..0bfb9edff3d0 100644
--- a/arch/mips/boot/dts/ingenic/jz4770.dtsi
+++ b/arch/mips/boot/dts/ingenic/jz4770.dtsi
@@ -46,6 +46,27 @@
#clock-cells = <1>;
};
 
+   tcu: timer@10002000 {
+   compatible = "ingenic,jz4770-tcu", "simple-mfd";
+   reg = <0x10002000 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0x0 0x10002000 0x1000>;
+
+   #clock-cells = <1>;
+
+   clocks = <&cgu JZ4770_CLK_RTC
+ &cgu JZ4770_CLK_EXT
+ &cgu JZ4770_CLK_PCLK>;
+   clock-names = "rtc", "ext", "pclk";
+
+   interrupt-controller;
+   #interrupt-cells = <1>;
+
+   interrupt-parent = <&intc>;
+   interrupts = <27 26 25>;
+   };
+
pinctrl: pin-controller@1001 {
compatible = "ingenic,jz4770-pinctrl";
reg = <0x1001 0x600>;
diff --git a/arch/mips/boot/dts/ingenic/jz4780.dtsi 
b/arch/mips/boot/dts/ingenic/jz4780.dtsi
index b03cdec56de9..c54bd7cfec55 100644
--- a/arch/mips/boot/dts/ingenic/jz4780.dtsi
+++ b/arch/mips/boot/dts/ingenic/jz4780.dtsi
@@ -46,6 +46,29 @@
#clock-cells = <1>;
};
 
+   tcu: timer@10002000 {
+   compatible = "ingenic,jz4780-tcu",
+"ingenic,jz4770-tcu",
+"simple-mfd";
+   reg = <0x10002000 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0x0 0x10002000 0x1000>;
+
+   #clock-cells = <1>;
+
+   clocks = <&cgu JZ4780_CLK_RTCLK
+ &cgu JZ4780_CLK_EXCLK
+ &cgu JZ4780_CLK_PCLK>;
+   clock-names = "rtc", "ext", "pclk";
+
+   interrupt-controller;
+   #interrupt-cells = <1>;
+
+   interrupt-parent = <&intc>;
+   interrupts = <27 26 25>;
+   };
+
rtc_dev: rtc@10003000 {
compatible = "ingenic,jz4780-rtc";
reg = <0x10003000 0x4c>;
-- 
2.21.0.593.g511ec345e18

Re: [PATCH] Documentation: move Documentation/virtual to Documentation/virt

2019-07-24 Thread Jonathan Corbet

On Wed, 24 Jul 2019 10:51:36 +0200
Paolo Bonzini  wrote:

> On 24/07/19 09:24, Christoph Hellwig wrote:
> > Renaming docs seems to be en vogue at the moment, so fix on of the
> > grossly misnamed directories.  We usually never use "virtual" as
> > a shortcut for virtualization in the kernel, but always virt,
> > as seen in the virt/ top-level directory.  Fix up the documentation
> > to match that.
> > 
> > Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a 
> > common "virtual" directory, I.E:")
> > Signed-off-by: Christoph Hellwig   
> 
> Queued, thanks.  I can't count how many times I said "I really should
> rename that directory".

...and it's up to Linus before I even got a chance to look at it - one has
to be fast around here...:)

There's nothing wrong with this move, but it does miss the point of much
of the reorganization that has been going on in the docs tree.  It's not
just a matter of getting more pleasing names; the real idea is to create a
better, more reader-focused organization of kernel documentation as a
whole.  Documentation/virt still has the sort of confusion of audiences
that we're trying to fix:

 - kvm/api.txt pretty clearly belongs in the userspace-api book, rather
   than tossed in with:

 - kvm/review-checklist.txt, which belongs in the subsystem guide, if only
   we'd gotten around to creating it yet, or

 - kvm/mmu.txt, which is information for kernel developers, or

 - uml/UserModeLinux-HOWTO.txt, which belongs in the admin guide.

I suspect that organization is going to be one of the main issues to talk
about in Lisbon.  Meanwhile, I hope that this rename won't preclude
organizational work in the future.

Thanks,

jon

RE: [PATCH v5] Documentation/checkpatch: Prefer strscpy/strscpy_pad over strcpy/strlcpy/strncpy

2019-07-24 Thread Gote, Nitin R

Hi,

> -----Original Message-----
> From: Gote, Nitin R [mailto:nitin.r.g...@intel.com]
> Sent: Tuesday, July 23, 2019 2:56 PM
> To: Joe Perches ; Kees Cook 
> Cc: cor...@lwn.net; a...@linux-foundation.org; a...@canonical.com;
> linux-doc@vger.kernel.org; kernel-harden...@lists.openwall.com
> Subject: RE: [PATCH v5] Documentation/checkpatch: Prefer
> strscpy/strscpy_pad over strcpy/strlcpy/strncpy
> 
> 
> > -----Original Message-----
> > From: Joe Perches [mailto:j...@perches.com]
> > Sent: Monday, July 22, 2019 11:11 PM
> > To: Kees Cook ; Gote, Nitin R
> > 
> > Cc: cor...@lwn.net; a...@linux-foundation.org; a...@canonical.com;
> > linux-doc@vger.kernel.org; kernel-harden...@lists.openwall.com
> > Subject: Re: [PATCH v5] Documentation/checkpatch: Prefer
> > strscpy/strscpy_pad over strcpy/strlcpy/strncpy
> >
> > On Mon, 2019-07-22 at 10:30 -0700, Kees Cook wrote:
> > > On Wed, Jul 17, 2019 at 10:00:05AM +0530, NitinGote wrote:
> > > > From: Nitin Gote 
> > > >
> > > > Added check in checkpatch.pl to
> > > > 1. Deprecate strcpy() in favor of strscpy().
> > > > 2. Deprecate strlcpy() in favor of strscpy().
> > > > 3. Deprecate strncpy() in favor of strscpy() or strscpy_pad().
> > > >
> > > > Updated strncpy() section in Documentation/process/deprecated.rst
> > > > to cover strscpy_pad() case.
> > > >
> > > > Signed-off-by: Nitin Gote 
> > >
> > > Reviewed-by: Kees Cook 
> > >
> > > Joe, does this address your checkpatch concerns?
> >
> > Well, kinda.
> >
> > strscpy_pad isn't used anywhere in the kernel.
> >
> > And
> >
> > +"strncpy"  => "strscpy, strscpy_pad or for non-NUL-terminated
> > strings, strncpy() can still be used, but destinations should be
> > marked with __nonstring",
> >
> > is a bit verbose.  This could be simply:
> >
> > +"strncpy" => "strscpy - for non-NUL-terminated uses,
> > + strncpy() dst should be __nonstring",
> >
>

Could you please give your opinion on the comment below?
 
> But if the destination buffer needs extra NUL-padding for the remaining size
> of the destination, then the safe replacement is strscpy_pad().  Right?  If
> yes, then what is your opinion on the change below:
> 
> "strncpy" => "strscpy, strscpy_pad - for non-NUL-terminated uses,
> strncpy() dst should be __nonstring",
> 
> 

If you agree on this, then I will include this change in the next patch version.
 
-Nitin

Re: [PATCH v5] Documentation/checkpatch: Prefer strscpy/strscpy_pad over strcpy/strlcpy/strncpy

2019-07-24 Thread Joe Perches

On Wed, 2019-07-24 at 18:17 +, Gote, Nitin R wrote:
> Hi,

Hi again.

[]
> > > > > 3. Deprecate strncpy() in favor of strscpy() or strscpy_pad().

Please remember there does not exist a single actual use
of strscpy_pad in the kernel sources and no apparent real
need for it.  I don't find one anyway.

> Could you please give your opinion on the comment below?
>  
> > But if the destination buffer needs extra NUL-padding for the remaining
> > size of the destination, then the safe replacement is strscpy_pad().
> > Right?  If yes, then what is your opinion on the change below:
> > 
> > "strncpy" => "strscpy, strscpy_pad - for non-NUL-terminated uses,
> > strncpy() dst should be __nonstring",
> > 
> If you agree on this, then I will include this change in the next patch
> version.

Two things:

The kernel-doc documentation uses dest not dst.
I think stracpy should be preferred over strscpy.
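
For readers following the thread, the difference between the copy helpers
under discussion can be sketched in userspace C.  This is only an
illustration, assuming the semantics the kernel documents for strscpy()
(always NUL-terminate, return -E2BIG on truncation) and strscpy_pad()
(additionally zero-fill the rest of the destination).  demo_strscpy() and
demo_strscpy_pad() are hypothetical stand-ins, not the kernel
implementations.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical userspace stand-in modeled on the documented behavior of
 * the kernel's strscpy(): copy at most count - 1 bytes, always
 * NUL-terminate, and return the copied length or -E2BIG on truncation.
 */
ssize_t demo_strscpy(char *dest, const char *src, size_t count)
{
	size_t len = 0;

	if (count == 0)
		return -E2BIG;

	/* Length of src, but never look past what could fit in dest. */
	while (len < count && src[len] != '\0')
		len++;

	if (len == count) {
		/* src plus its NUL does not fit: truncate and report it. */
		memcpy(dest, src, count - 1);
		dest[count - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dest, src, len + 1);	/* includes the terminating NUL */
	return (ssize_t)len;
}

/*
 * Hypothetical stand-in for strscpy_pad(): same as above, but the
 * unused tail of dest is zero-filled, as strncpy() would do.
 */
ssize_t demo_strscpy_pad(char *dest, const char *src, size_t count)
{
	ssize_t written = demo_strscpy(dest, src, count);

	if (written >= 0 && (size_t)written + 1 < count)
		memset(dest + written + 1, 0, count - (size_t)written - 1);
	return written;
}
```

Unlike strncpy(), neither helper can leave dest without a terminating NUL,
and the negative return value makes truncation visible to the caller.  The
padded variant only matters when the whole buffer is later copied out or
compared as a fixed-size field.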

[PATCH] Correct documentation for /proc/schedstat

2019-07-24 Thread Phil Frost

Commit 425e0968a25fa3f111f9919964cac079738140b5 ("sched: move code into
kernel/sched_stats.h") appears to have inadvertently changed the unit of
time from jiffies to nanoseconds as part of the implementation of CFS.
Update the documentation to match.

Signed-off-by: Phil Frost 
---
 Documentation/scheduler/sched-stats.txt | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.txt
index 8259b34a66ae..b6c1807a01b3 100644
--- a/Documentation/scheduler/sched-stats.txt
+++ b/Documentation/scheduler/sched-stats.txt
@@ -19,6 +19,11 @@ are no architectures which need more than three domain levels. The first
 field in the domain stats is a bit map indicating which cpus are affected
 by that domain.
 
+2.6.23 introduced the CFS scheduler, and also an inadvertent
+backwards-incompatible change to the statistics. Although the schedstat version
+is 14 in either case, in 2.6.23 and later, counters accumulate time in
+nanoseconds. Prior to that, jiffies.
+
 These fields are counters, and only increment.  Programs which make use
 of these will need to start with a baseline observation and then calculate
 the change in the counters at each subsequent observation.  A perl script
@@ -48,9 +53,10 @@ Next two are try_to_wake_up() statistics:
  6) # of times try_to_wake_up() was called to wake up the local cpu
 
 Next three are statistics describing scheduling latency:
- 7) sum of all time spent running by tasks on this processor (in jiffies)
+ 7) sum of all time spent running by tasks on this processor (in
+nanoseconds, or jiffies prior to 2.6.23)
  8) sum of all time spent waiting to run by tasks on this processor (in
-jiffies)
+nanoseconds, or jiffies prior to 2.6.23)
  9) # of timeslices run on this cpu
 
 
-- 
2.20.1 (Apple Git-117)
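
The patched text says the counters only increment and that fields 7 and 8
are in nanoseconds on 2.6.23 and later.  A rough userspace sketch of what a
consumer does with that, assuming the nine-field "cpuN" line that the
numbered list in the document describes; the struct and function names are
made up for illustration and the sample values in any caller are not real
measurements.

```c
#include <stdio.h>

#define DEMO_CPU_FIELDS 9

/* One parsed "cpuN ..." line; field[6] is item 7, field[7] is item 8. */
struct demo_cpu_schedstat {
	int cpu;
	unsigned long long field[DEMO_CPU_FIELDS];
};

/* Returns 0 on success, -1 if the line is not a nine-field cpu line. */
int demo_parse_cpu_line(const char *line, struct demo_cpu_schedstat *st)
{
	unsigned long long *f = st->field;
	int n;

	n = sscanf(line,
		   "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &st->cpu, &f[0], &f[1], &f[2], &f[3], &f[4],
		   &f[5], &f[6], &f[7], &f[8]);
	return n == 1 + DEMO_CPU_FIELDS ? 0 : -1;
}

/*
 * Time spent running between two observations, in seconds.  Assumes
 * nanosecond counters (2.6.23 and later); on older kernels the same
 * field counts jiffies and would need dividing by HZ instead.
 */
double demo_run_seconds_delta(const struct demo_cpu_schedstat *before,
			      const struct demo_cpu_schedstat *after)
{
	return (double)(after->field[6] - before->field[6]) / 1e9;
}
```

A monitoring tool would read /proc/schedstat twice, parse the matching cpu
lines with demo_parse_cpu_line(), and report the difference.  Treating the
value as jiffies on a post-2.6.23 kernel would overstate the time by many
orders of magnitude, which is the confusion the patch addresses.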

[PATCH v10 0/2] overlayfs override_creds=off

2019-07-24 Thread Mark Salyzyn

Patch series:

overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
Add optional __get xattr method paired to __vfs_getxattr
overlayfs: add __get xattr method
overlayfs: internal getxattr operations without sepolicy checking
overlayfs: override_creds=off option bypass creator_cred

The first four patches address fundamental security issues that should
be solved regardless of the override_creds=off feature.

The fifth that adds the feature depends on these other fixes.

By default, all access to the upper, lower and work directories is
performed with the recorded mounter's MAC and DAC credentials.  The
incoming accesses are checked against the caller's credentials.

If the principle of least privilege is applied for sepolicy, the
mounter's credentials might not overlap the caller's credentials
when accessing the overlayfs filesystem.  For example, a file that a
caller with lower DAC privileges can execute may be MAC denied to the
generally higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations on the filesystem after mount will then use only
the caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature":
the existence of /sys/module/overlay/parameters/override_creds.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org

---
v10:
- Rebase
- Return NULL on CAP_DAC_READ_SEARCH
- Add __get xattr method to solve sepolicy logging issue
- Drop unnecessary sys_admin sepolicy checking for administrative
  driver internal xattr functions.

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation, drop rationalizations.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added pr_warn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.
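
The cover letter describes the module parameter as a presence check for
the feature.  A minimal userspace sketch of such a probe, assuming the
sysfs path named in patch 5/5 (/sys/module/overlay/parameters/override_creds);
the function name is hypothetical and the path is passed in so the probe
can be exercised without the module loaded.

```c
#include <unistd.h>

/*
 * Returns 1 if the given parameter file exists, 0 otherwise.  Only the
 * existence of the file is tested; its contents are not interpreted.
 */
int demo_overlay_param_present(const char *param_path)
{
	return access(param_path, F_OK) == 0;
}
```

A mount helper could call
demo_overlay_param_present("/sys/module/overlay/parameters/override_creds")
and only pass override_creds=off when it returns 1.  An absent file may
also simply mean that the overlay module is not loaded yet, so a caller
would probably want to load the module before probing.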

[PATCH v10 2/5] Add optional __get xattr method paired to __vfs_getxattr

2019-07-24 Thread Mark Salyzyn

Add an optional __get xattr method that would be called, if set, only
in __vfs_getxattr instead of the regular get xattr method.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-t...@android.com
---
v10 - added to patch series
---
 fs/xattr.c| 11 ++-
 include/linux/xattr.h |  7 +--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 90dd78f0eb27..b8f4734e222f 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -306,6 +306,9 @@ __vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
handler = xattr_resolve_name(inode, &name);
if (IS_ERR(handler))
return PTR_ERR(handler);
+   if (unlikely(handler->__get))
+   return handler->__get(handler, dentry, inode, name, value,
+ size);
if (!handler->get)
return -EOPNOTSUPP;
return handler->get(handler, dentry, inode, name, value, size);
@@ -317,6 +320,7 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
 {
struct inode *inode = dentry->d_inode;
int error;
+   const struct xattr_handler *handler;
 
error = xattr_permission(inode, name, MAY_READ);
if (error)
@@ -339,7 +343,12 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
return ret;
}
 nolsm:
-   return __vfs_getxattr(dentry, inode, name, value, size);
+   handler = xattr_resolve_name(inode, &name);
+   if (IS_ERR(handler))
+   return PTR_ERR(handler);
+   if (!handler->get)
+   return -EOPNOTSUPP;
+   return handler->get(handler, dentry, inode, name, value, size);
 }
 EXPORT_SYMBOL_GPL(vfs_getxattr);
 
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 6dad031be3c2..30f25e1ac571 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -30,10 +30,13 @@ struct xattr_handler {
const char *prefix;
int flags;  /* fs private flags */
bool (*list)(struct dentry *dentry);
-   int (*get)(const struct xattr_handler *, struct dentry *dentry,
+   int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
   struct inode *inode, const char *name, void *buffer,
   size_t size);
-   int (*set)(const struct xattr_handler *, struct dentry *dentry,
+   int (*__get)(const struct xattr_handler *handler, struct dentry *dentry,
+struct inode *inode, const char *name, void *buffer,
+size_t size);
+   int (*set)(const struct xattr_handler *handler, struct dentry *dentry,
   struct inode *inode, const char *name, const void *buffer,
   size_t size, int flags);
 };
-- 
2.22.0.657.g960e92d24f-goog
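
The dispatch rule this patch introduces (the internal path prefers the
optional method, the checked path always uses the regular one) can be
sketched outside the kernel as follows.  All names here are hypothetical;
the optional callback is called raw_get rather than __get because
identifiers starting with two underscores are reserved in userspace C, and
the two sample callbacks just return distinct values so the chosen path is
visible.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for struct xattr_handler. */
struct demo_handler {
	int (*get)(const char *name, void *buf, size_t size);
	int (*raw_get)(const char *name, void *buf, size_t size); /* optional */
};

/* Internal path: prefer raw_get when the handler provides it. */
int demo_getxattr_internal(const struct demo_handler *h, const char *name,
			   void *buf, size_t size)
{
	if (h->raw_get)
		return h->raw_get(name, buf, size);
	if (!h->get)
		return -EOPNOTSUPP;
	return h->get(name, buf, size);
}

/* Checked path: never uses raw_get, mirroring the vfs_getxattr() change. */
int demo_getxattr_checked(const struct demo_handler *h, const char *name,
			  void *buf, size_t size)
{
	if (!h->get)
		return -EOPNOTSUPP;
	return h->get(name, buf, size);
}

/* Sample callbacks that ignore their arguments. */
int demo_regular_get(const char *name, void *buf, size_t size)
{
	(void)name; (void)buf; (void)size;
	return 1;
}

int demo_raw_get(const char *name, void *buf, size_t size)
{
	(void)name; (void)buf; (void)size;
	return 2;
}
```

Handlers that leave raw_get unset behave identically on both paths, so
existing filesystems would be unaffected; only a handler that opts in, as
overlayfs does in patch 3/5, sees different behavior on the internal path.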

[PATCH v10 1/5] overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh

2019-07-24 Thread Mark Salyzyn

The assumption was never checked; the decode should fail if the mounter
creds are not sufficient.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-t...@android.com
---
v10:
- return NULL rather than ERR_PTR(-EPERM)
- did _not_ add it ovl_can_decode_fh() because of changes since last
  review, suspect needs to be added to ovl_lower_uuid_ok()?

v8 + v9:
- rebase

v7:
- This time for realz

v6:
- rebase

v5:
- dependency of "overlayfs: override_creds=off option bypass creator_cred"
---
 fs/overlayfs/namei.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index e9717c2f7d45..9702f0d5309d 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -161,6 +161,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
if (!uuid_equal(&fh->uuid, &mnt->mnt_sb->s_uuid))
return NULL;
 
+   if (!capable(CAP_DAC_READ_SEARCH))
+   return NULL;
+
bytes = (fh->len - offsetof(struct ovl_fh, fid));
real = exportfs_decode_fh(mnt, (struct fid *)fh->fid,
  bytes >> 2, (int)fh->type,
-- 
2.22.0.657.g960e92d24f-goog

[PATCH v10 4/5] overlayfs: internal getxattr operations without sepolicy checking

2019-07-24 Thread Mark Salyzyn

Check impure, opaque, origin & meta xattr with no sepolicy audit
(using __vfs_getxattr) since these operations are internal to
overlayfs operations and do not disclose any data.  This became
an issue with credential override off, since sys_admin would have
been required of the caller, whereas it would have been inherently
present for the creator since it performed the mount.

This is a change in operations since we do not check in the new
ovl_vfs_getxattr function whether the credential override is off or
not.  The reasoning is that the sepolicy check is unnecessary overhead,
especially since the check can be expensive.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-t...@android.com
---
v10 - added to patch series
---
 fs/overlayfs/namei.c | 12 +++-
 fs/overlayfs/overlayfs.h |  2 ++
 fs/overlayfs/util.c  | 24 +++-
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 9702f0d5309d..fb6c0cd7b65f 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -106,10 +106,11 @@ int ovl_check_fh_len(struct ovl_fh *fh, int fh_len)
 
 static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 {
-   int res, err;
+   ssize_t res;
+   int err;
struct ovl_fh *fh = NULL;
 
-   res = vfs_getxattr(dentry, name, NULL, 0);
+   res = ovl_vfs_getxattr(dentry, name, NULL, 0);
if (res < 0) {
if (res == -ENODATA || res == -EOPNOTSUPP)
return NULL;
@@ -123,7 +124,7 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
if (!fh)
return ERR_PTR(-ENOMEM);
 
-   res = vfs_getxattr(dentry, name, fh, res);
+   res = ovl_vfs_getxattr(dentry, name, fh, res);
if (res < 0)
goto fail;
 
@@ -141,10 +142,11 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
return NULL;
 
 fail:
-   pr_warn_ratelimited("overlayfs: failed to get origin (%i)\n", res);
+   pr_warn_ratelimited("overlayfs: failed to get origin (%zi)\n", res);
goto out;
 invalid:
-   pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n", res, fh);
+   pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n",
+   (int)res, fh);
goto out;
 }
 
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 73a02a263fbc..82574684a9b6 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -205,6 +205,8 @@ int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
 const struct cred *ovl_override_creds(struct super_block *sb);
+ssize_t ovl_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+size_t size);
 struct super_block *ovl_same_sb(struct super_block *sb);
 int ovl_can_decode_fh(struct super_block *sb);
 struct dentry *ovl_indexdir(struct super_block *sb);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index f5678a3f8350..672459c3cff7 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -40,6 +40,12 @@ const struct cred *ovl_override_creds(struct super_block *sb)
return override_creds(ofs->creator_cred);
 }
 
+ssize_t ovl_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+size_t size)
+{
+   return __vfs_getxattr(dentry, d_inode(dentry), name, buf, size);
+}
+
 struct super_block *ovl_same_sb(struct super_block *sb)
 {
struct ovl_fs *ofs = sb->s_fs_info;
@@ -537,9 +543,9 @@ void ovl_copy_up_end(struct dentry *dentry)
 
 bool ovl_check_origin_xattr(struct dentry *dentry)
 {
-   int res;
+   ssize_t res;
 
-   res = vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
+   res = ovl_vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
 
/* Zero size value means "copied up but origin unknown" */
if (res >= 0)
@@ -550,13 +556,13 @@ bool ovl_check_origin_xattr(struct dentry *dentry)
 
 bool ovl_check_dir_xattr(struct dentry *dentry, const char *name)
 {
-   int res;
+   ssize_t res;
char val;
 
if (!d_is_dir(dentry))
return false;
 
-   res = vfs_getxattr(dentry, name, &val, 1);
+   res = ovl_vfs_getxattr(dentry, name, &val, 1);
if (res == 1 && val == 'y')
return true;
 
@@ -837,13 +843,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
 /* err < 0, 0 if no metacopy xattr, 1 if metacopy xattr found */
 int ovl_check_metacopy_xattr(struct dentry *dentry)
 {
-   int res;
+   ssize_t res;
 
/* Only regular files can have metacopy xattr */
if (!S_ISREG

[PATCH v10 3/5] overlayfs: add __get xattr method

2019-07-24 Thread Mark Salyzyn

Because of the overlayfs getxattr recursion, the incoming inode fails
to update the selinux sid, resulting in avc denials being reported
against a target context of u:object_r:unlabeled:s0.

The solution is to add a __get xattr method that calls the __vfs_getxattr
handler so that the context can be read in, rather than being denied
with an -EACCES when the vfs_getxattr handler is called.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-t...@android.com
---
v10 - added to patch series
---
 fs/overlayfs/inode.c | 15 +++
 fs/overlayfs/overlayfs.h |  2 ++
 fs/overlayfs/super.c | 18 ++
 3 files changed, 35 insertions(+)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 7663aeb85fa3..d3b53849615c 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -362,6 +362,21 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
return err;
 }
 
+int __ovl_xattr_get(struct dentry *dentry, struct inode *inode,
+   const char *name, void *value, size_t size)
+{
+   ssize_t res;
+   const struct cred *old_cred;
+   struct dentry *realdentry =
+   ovl_i_dentry_upper(inode) ?: ovl_dentry_lower(dentry);
+
+   old_cred = ovl_override_creds(dentry->d_sb);
+   res = __vfs_getxattr(realdentry, d_inode(realdentry), name, value,
+size);
+   ovl_revert_creds(old_cred);
+   return res;
+}
+
 int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
  void *value, size_t size)
 {
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 6934bcf030f0..73a02a263fbc 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -357,6 +357,8 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
  const void *value, size_t size, int flags);
 int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
  void *value, size_t size);
+int __ovl_xattr_get(struct dentry *dentry, struct inode *inode,
+   const char *name, void *value, size_t size);
 ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
 struct posix_acl *ovl_get_acl(struct inode *inode, int type);
 int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b368e2e102fa..82e1130de206 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -859,6 +859,14 @@ ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
return ovl_xattr_get(dentry, inode, handler->name, buffer, size);
 }
 
+static int __maybe_unused
+__ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, void *buffer, size_t size)
+{
+   return __ovl_xattr_get(dentry, inode, handler->name, buffer, size);
+}
+
 static int __maybe_unused
 ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
struct dentry *dentry, struct inode *inode,
@@ -939,6 +947,13 @@ static int ovl_other_xattr_get(const struct xattr_handler *handler,
return ovl_xattr_get(dentry, inode, name, buffer, size);
 }
 
+static int __ovl_other_xattr_get(const struct xattr_handler *handler,
+struct dentry *dentry, struct inode *inode,
+const char *name, void *buffer, size_t size)
+{
+   return __ovl_xattr_get(dentry, inode, name, buffer, size);
+}
+
 static int ovl_other_xattr_set(const struct xattr_handler *handler,
   struct dentry *dentry, struct inode *inode,
   const char *name, const void *value,
@@ -952,6 +967,7 @@ ovl_posix_acl_access_xattr_handler = {
.name = XATTR_NAME_POSIX_ACL_ACCESS,
.flags = ACL_TYPE_ACCESS,
.get = ovl_posix_acl_xattr_get,
+   .__get = __ovl_posix_acl_xattr_get,
.set = ovl_posix_acl_xattr_set,
 };
 
@@ -960,6 +976,7 @@ ovl_posix_acl_default_xattr_handler = {
.name = XATTR_NAME_POSIX_ACL_DEFAULT,
.flags = ACL_TYPE_DEFAULT,
.get = ovl_posix_acl_xattr_get,
+   .__get = __ovl_posix_acl_xattr_get,
.set = ovl_posix_acl_xattr_set,
 };
 
@@ -972,6 +989,7 @@ static const struct xattr_handler ovl_own_xattr_handler = {
 static const struct xattr_handler ovl_other_xattr_handler = {
.prefix = "", /* catch all */
.get = ovl_other_xattr_get,
+   .__get = __ovl_other_xattr_get,
.set = ovl_other_xattr_set,
 };
 
-- 
2.22.0.657.g960e92d24f-goog

[PATCH v10 5/5] overlayfs: override_creds=off option bypass creator_cred

2019-07-24 Thread Mark Salyzyn

By default, all access to the upper, lower and work directories is
performed with the recorded mounter's MAC and DAC credentials.  The
incoming accesses are checked against the caller's credentials.

If the principle of least privilege is applied, the mounter's
credentials might not overlap the caller's credentials when
accessing the overlayfs filesystem.  For example, a file that a
caller with lower DAC privileges can execute may be MAC denied to the
generally higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations on the filesystem after mount will then use only
the caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature":
the existence of /sys/module/overlay/parameters/override_creds.

It was not always this way.  Circa 4.6 there were no recorded mounter's
credentials; instead, privileges for access to the upper or work
directories were temporarily increased to perform the operations.  The
MAC (selinux) policies were the caller's in all cases.  override_creds=off
partially returns us to this older access model, minus the insecure
temporary credential increases.  This is to permit use in a system
with non-overlapping security models for each executable, including
the agent that mounts the overlayfs filesystem.  In Android
this is the case since init, which performs the mount operations,
has a minimal MAC set of privileges to reduce any attack surface,
and services that use the content have a different set of MAC
privileges (e.g. read for vendor labelled configuration, execute for
vendor libraries and modules).  The caveats are not a problem in
the Android usage model; however, they should be fixed for
completeness and for general use in time.

Signed-off-by: Mark Salyzyn 
Cc: Miklos Szeredi 
Cc: Jonathan Corbet 
Cc: Vivek Goyal 
Cc: Eric W. Biederman 
Cc: Amir Goldstein 
Cc: Randy Dunlap 
Cc: Stephen Smalley 
Cc: linux-unio...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-t...@android.com
---
v10:
- Rebase (and expand because of increased revert_cred usage)

v9:
- Add to the caveats

v8:
- drop pr_warn message after straw poll to remove it.
- added a use case in the commit message

v7:
- change name of internal parameter to ovl_override_creds_def
- report override_creds only if different than default

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added pr_warn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.

---
 Documentation/filesystems/overlayfs.txt | 23 +++
 fs/overlayfs/copy_up.c  |  2 +-
 fs/overlayfs/dir.c  | 11 ++-
 fs/overlayfs/file.c | 20 ++--
 fs/overlayfs/inode.c| 18 +-
 fs/overlayfs/namei.c|  6 +++---
 fs/overlayfs/overlayfs.h|  1 +
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/readdir.c  |  4 ++--
 fs/overlayfs/super.c| 22 +-
 fs/overlayfs/util.c | 12 ++--
 11 files changed, 87 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 1da2f1668f08..d48125076602 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -102,6 +102,29 @@ Only the lists of names from directories are merged.  Other content
 such as metadata and extended attributes are reported for the upper
 directory only.  These attributes of the lower directory are hidden.
 
+credentials
+---
+
+By default, all access to the upper, lower and work directories is the
+recorded mounter's MAC and DAC credentials.  The incoming accesses are
+checked against the caller's credentials.
+
+In the case where caller MAC or DAC credentials do not overlap, a
+use case available in older versions of the driver, the
+override_creds mount flag can be turned off and help when the use
+pattern has caller with legitimate credentials where the mounter
+does not.  Several unintended side effects will occur though.  The
+caller without certain key capabilities or lower privilege will not
+always be able to delete files or directories, create nodes, or
+

Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-24 Thread Joel Fernandes

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
[snip] 
> > +   *end = *start + count * BITS_PER_BYTE;
> > +   if (*end > max_frame)
> > +   *end = max_frame;
> > +   return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > +  unsigned long addr, struct mm_walk *walk)
> > +{
> > +   struct page *page_get;
> > +   struct page_node *pn;
> > +   int bit;
> > +   unsigned long frames;
> > +   struct page_idle_proc_priv *priv = walk->private;
> > +   u64 *chunk = (u64 *)priv->buffer;
> > +
> > +   if (priv->write) {
> > +   /* Find whether this page was asked to be marked */
> > +   frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +   bit = frames % BITMAP_CHUNK_BITS;
> > +   chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +   if (((*chunk >> bit) & 1) == 0)
> > +   return;
> > +   }
> > +
> > +   page_get = page_idle_get_page(page);
> > +   if (!page_get)
> > +   return;
> > +
> > +   pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> 
> I'm not liking this GFP_ATOMIC.  If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read.  This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

For the revision, I will pre-allocate the page nodes in advance so it does
not need to do this. Diff on top of this patch is below. Let me know any
comments, thanks.

Btw, I also dropped the idle_page_list_lock by putting the idle_page_list
list_head on the stack instead of heap.
---8<---

From: "Joel Fernandes (Google)" 
Subject: [PATCH] mm/page_idle: Avoid need for GFP_ATOMIC

GFP_ATOMIC can harm allocations done by other users that are in need of
reserves and the like. Pre-allocate the nodes list so that the
spinlocked region can just use it.

Suggested-by: Andrew Morton 
Signed-off-by: Joel Fernandes (Google) 
---
 mm/page_idle.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 874a60c41fef..b9c790721f16 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -266,6 +266,10 @@ struct page_idle_proc_priv {
unsigned long start_addr;
char *buffer;
int write;
+
+   /* Pre-allocate and provide nodes to add_page_idle_list() */
+   struct page_node *page_nodes;
+   int cur_page_node;
 };
 
 static void add_page_idle_list(struct page *page,
@@ -291,10 +295,7 @@ static void add_page_idle_list(struct page *page,
if (!page_get)
return;
 
-   pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
-   if (!pn)
-   return;
-
+   pn = &(priv->page_nodes[priv->cur_page_node++]);
pn->page = page_get;
pn->addr = addr;
list_add(&pn->list, &idle_page_list);
@@ -379,6 +380,15 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
priv.buffer = buffer;
priv.start_addr = start_addr;
priv.write = write;
+
+   priv.cur_page_node = 0;
+   priv.page_nodes = kzalloc(sizeof(struct page_node) * (end_frame - start_frame),
+ GFP_KERNEL);
+   if (!priv.page_nodes) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
walk.private = &priv;
walk.mm = mm;
 
@@ -425,6 +435,7 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
ret = copy_to_user(ubuff, buffer, count);
 
up_read(&mm->mmap_sem);
+   kfree(priv.page_nodes);
 out:
kfree(buffer);
 out_mmput:
-- 
2.22.0.657.g960e92d24f-goog
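
The idea in the diff above (allocate every node up front where sleeping is
allowed, then hand them out with a cursor inside the locked region) can be
sketched in userspace C.  This is only an illustration with hypothetical
names, using calloc() in place of kzalloc().  Unlike the diff, the sketch
bounds-checks the cursor, which is an addition here and not something the
patch does.

```c
#include <stdlib.h>

struct demo_node {
	unsigned long addr;
};

/* Pool of nodes allocated in one go, handed out in order. */
struct demo_pool {
	struct demo_node *nodes;
	size_t capacity;
	size_t next;		/* index of the next unused node */
};

/* Allocate all nodes up front; returns 0 on success, -1 on failure. */
int demo_pool_init(struct demo_pool *p, size_t capacity)
{
	p->nodes = calloc(capacity, sizeof(*p->nodes));
	p->capacity = p->nodes ? capacity : 0;
	p->next = 0;
	return (p->nodes || capacity == 0) ? 0 : -1;
}

/* Take the next node without allocating; NULL once the pool is used up. */
struct demo_node *demo_pool_take(struct demo_pool *p)
{
	if (p->next >= p->capacity)
		return NULL;
	return &p->nodes[p->next++];
}

/* Release the whole pool at once. */
void demo_pool_destroy(struct demo_pool *p)
{
	free(p->nodes);
	p->nodes = NULL;
	p->capacity = 0;
	p->next = 0;
}
```

The trade-off is that the pool is sized for the worst case (one node per
frame in the requested range) even if few pages end up on the list, in
exchange for never allocating while the lock is held.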

Re: [PATCH v10 3/5] overlayfs: add __get xattr method

2019-07-24 Thread Amir Goldstein

On Wed, Jul 24, 2019 at 10:57 PM Mark Salyzyn  wrote:
>
> Because of the overlayfs getxattr recursion, the incoming inode fails
> to update the selinux sid resulting in avc denials being reported
> against a target context of u:object_r:unlabeled:s0.

This description is too brief for me to understand the root problem.
What's wrong with the overlayfs getxattr recursion w.r.t. the selinux
security model?

Please give an example of your unprivileged mounter use case
to explain.

CC Vivek because I could really never understand all this.

>
> Solution is to add a __get xattr method that calls the __vfs_getxattr
> handler so that the context can be read in, rather than being denied
> with an -EACCES when vfs_getxattr handler is called.
>
> Signed-off-by: Mark Salyzyn 
> Cc: Miklos Szeredi 
> Cc: Jonathan Corbet 
> Cc: Vivek Goyal 
> Cc: Eric W. Biederman 
> Cc: Amir Goldstein 
> Cc: Randy Dunlap 
> Cc: Stephen Smalley 
> Cc: linux-unio...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> ---
> v10 - added to patch series
> ---
>  fs/overlayfs/inode.c | 15 +++
>  fs/overlayfs/overlayfs.h |  2 ++
>  fs/overlayfs/super.c | 18 ++
>  3 files changed, 35 insertions(+)
>
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 7663aeb85fa3..d3b53849615c 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -362,6 +362,21 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
> return err;
>  }
>
> +int __ovl_xattr_get(struct dentry *dentry, struct inode *inode,
> +   const char *name, void *value, size_t size)
> +{
> +   ssize_t res;
> +   const struct cred *old_cred;
> +   struct dentry *realdentry =
> +   ovl_i_dentry_upper(inode) ?: ovl_dentry_lower(dentry);
> +
> +   old_cred = ovl_override_creds(dentry->d_sb);
> +   res = __vfs_getxattr(realdentry, d_inode(realdentry), name, value,
> +size);
> +   ovl_revert_creds(old_cred);
> +   return res;
> +}
> +
>  int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
>   void *value, size_t size)
>  {
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 6934bcf030f0..73a02a263fbc 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -357,6 +357,8 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
>   const void *value, size_t size, int flags);
>  int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
>   void *value, size_t size);
> +int __ovl_xattr_get(struct dentry *dentry, struct inode *inode,
> +   const char *name, void *value, size_t size);
>  ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
>  struct posix_acl *ovl_get_acl(struct inode *inode, int type);
>  int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index b368e2e102fa..82e1130de206 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -859,6 +859,14 @@ ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
> return ovl_xattr_get(dentry, inode, handler->name, buffer, size);
>  }
>
> +static int __maybe_unused
> +__ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
> + struct dentry *dentry, struct inode *inode,
> + const char *name, void *buffer, size_t size)
> +{
> +   return __ovl_xattr_get(dentry, inode, handler->name, buffer, size);
> +}
> +
>  static int __maybe_unused
>  ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
> struct dentry *dentry, struct inode *inode,
> @@ -939,6 +947,13 @@ static int ovl_other_xattr_get(const struct xattr_handler *handler,
> return ovl_xattr_get(dentry, inode, name, buffer, size);
>  }
>
> +static int __ovl_other_xattr_get(const struct xattr_handler *handler,
> +struct dentry *dentry, struct inode *inode,
> +const char *name, void *buffer, size_t size)
> +{
> +   return __ovl_xattr_get(dentry, inode, name, buffer, size);
> +}
> +
>  static int ovl_other_xattr_set(const struct xattr_handler *handler,
>struct dentry *dentry, struct inode *inode,
>const char *name, const void *value,
> @@ -952,6 +967,7 @@ ovl_posix_acl_access_xattr_handler = {
> .name = XATTR_NAME_POSIX_ACL_ACCESS,
> .flags = ACL_TYPE_ACCESS,
> .get = ovl_posix_acl_xattr_get,
> +   .__get = __ovl_posix_acl_xattr_get,
> .set = ovl_posix_acl_xattr_set,
>  };
>
> @@ -960,6 +976,7 @@ ovl_posix_acl_default_xattr_handler = {
> .name = XATTR_NAME_POSIX_ACL
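
The quoted patch adds a second getter slot, `.__get`, next to `.get` in the overlayfs xattr handler tables, so that a label can be read without going through the permission-checked path that returns -EACCES. As a rough userspace model only (this is not kernel code, and every name below is hypothetical), one table carrying both a checked and an unchecked getter could be sketched as:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a handler with a checked and an unchecked getter. */
struct demo_xattr_handler {
    const char *name;
    int (*get)(const char *name, char *buf, size_t size, int caller_allowed);
    int (*raw_get)(const char *name, char *buf, size_t size);
};

/* Unchecked read: copies a fixed label, returns its length or -ERANGE. */
static int demo_raw_get(const char *name, char *buf, size_t size)
{
    static const char label[] = "u:object_r:demo:s0";
    size_t len = sizeof(label) - 1;

    (void)name; /* the model serves the same label for every name */
    if (size < len)
        return -ERANGE;
    memcpy(buf, label, len);
    return (int)len;
}

/* Checked read: refuses with -EACCES unless the caller is allowed. */
static int demo_get(const char *name, char *buf, size_t size,
                    int caller_allowed)
{
    if (!caller_allowed)
        return -EACCES;
    return demo_raw_get(name, buf, size);
}

static const struct demo_xattr_handler demo_handler = {
    .name = "security.demo",
    .get = demo_get,
    .raw_get = demo_raw_get,
};
```

In this model a denied caller gets -EACCES from `demo_handler.get` while `demo_handler.raw_get` still returns the label. Which callers should be allowed to take the unchecked path is the security question the review above raises, and the sketch does not answer it.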

Re: [PATCH v10 5/5] overlayfs: override_creds=off option bypass creator_cred

2019-07-24 Thread Amir Goldstein

On Wed, Jul 24, 2019 at 10:57 PM Mark Salyzyn  wrote:
>
> By default, all access to the upper, lower and work directories is the
> recorded mounter's MAC and DAC credentials.  The incoming accesses are
> checked against the caller's credentials.
>
> If the principles of least privilege are applied, the mounter's
> credentials might not overlap the credentials of the caller's when
> accessing the overlayfs filesystem.  For example, a file that a lower
> DAC privileged caller can execute, is MAC denied to the generally
> higher DAC privileged mounter, to prevent an attack vector.
>
> We add the option to turn off override_creds in the mount options; all
> subsequent operations after mount on the filesystem will be only the
> caller's credentials.  The module boolean parameter and mount option
> override_creds is also added as a presence check for this "feature",
> existence of /sys/module/overlay/parameters/override_creds.
>
> It was not always this way.  Circa 4.6 there was no recorded mounter's
> credentials, instead privileged access to upper or work directories
> were temporarily increased to perform the operations.  The MAC
> (selinux) policies were caller's in all cases.  override_creds=off
> partially returns us to this older access model minus the insecure
> temporary credential increases.  This is to permit use in a system
> with non-overlapping security models for each executable including
> the agent that mounts the overlayfs filesystem.  In Android
> this is the case since init, which performs the mount operations,
> has a minimal MAC set of privileges to reduce any attack surface,
> and services that use the content have a different set of MAC
> privileges (eg: read, for vendor labelled configuration, execute for
> vendor libraries and modules).  The caveats are not a problem in
> the Android usage model, however they should be fixed for
> completeness and for general use in time.
>
> Signed-off-by: Mark Salyzyn 
> Cc: Miklos Szeredi 
> Cc: Jonathan Corbet 
> Cc: Vivek Goyal 
> Cc: Eric W. Biederman 
> Cc: Amir Goldstein 
> Cc: Randy Dunlap 
> Cc: Stephen Smalley 
> Cc: linux-unio...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> ---
> v10:
> - Rebase (and expand because of increased revert_cred usage)
>
> v9:
> - Add to the caveats
>
> v8:
> - drop pr_warn message after straw poll to remove it.
> - added a use case in the commit message
>
> v7:
> - change name of internal parameter to ovl_override_creds_def
> - report override_creds only if different than default
>
> v6:
> - Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
> - Do better with the documentation.
> - pr_warn message adjusted to report consequences.
>
> v5:
> - beefed up the caveats in the Documentation
> - Is dependent on
>   "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
>   "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
> - Added prwarn when override_creds=off
>
> v4:
> - spelling and grammar errors in text
>
> v3:
> - Change name from caller_credentials / creator_credentials to the
>   boolean override_creds.
> - Changed from creator to mounter credentials.
> - Updated and fortified the documentation.
> - Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS
>
> v2:
> - Forward port changed attr to stat, resulting in a build error.
> - altered commit message.
>
> a
> ---
>  Documentation/filesystems/overlayfs.txt | 23 +++
>  fs/overlayfs/copy_up.c  |  2 +-
>  fs/overlayfs/dir.c  | 11 ++-
>  fs/overlayfs/file.c | 20 ++--
>  fs/overlayfs/inode.c| 18 +-
>  fs/overlayfs/namei.c|  6 +++---
>  fs/overlayfs/overlayfs.h|  1 +
>  fs/overlayfs/ovl_entry.h|  1 +
>  fs/overlayfs/readdir.c  |  4 ++--
>  fs/overlayfs/super.c| 22 +-
>  fs/overlayfs/util.c | 12 ++--
>  11 files changed, 87 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
> index 1da2f1668f08..d48125076602 100644
> --- a/Documentation/filesystems/overlayfs.txt
> +++ b/Documentation/filesystems/overlayfs.txt
> @@ -102,6 +102,29 @@ Only the lists of names from directories are merged.  Other content
>  such as metadata and extended attributes are reported for the upper
>  directory only.  These attributes of the lower directory are hidden.
>
> +credentials
> +---
> +
> +By default, all access to the upper, lower and work directories is the
> +recorded mounter's MAC and DAC credentials.  The incoming accesses are
> +checked against the caller's credentials.
> +
> +In the case where caller MAC or DAC credentials do not overlap, a
> +use case available in older versions of the driver, the
> +override_creds mount flag can be turned off and help wh
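
The commit message quoted above says the existence of /sys/module/overlay/parameters/override_creds serves as a presence check for the feature. A minimal userspace sketch of such a check follows, assuming a plain existence test is enough; the helper names `path_present` and `ovl_has_override_creds` are hypothetical and only the path comes from the message.

```c
#include <assert.h>
#include <unistd.h>

/* Path named in the commit message as the feature's presence marker. */
#define OVL_OVERRIDE_CREDS_PARAM \
    "/sys/module/overlay/parameters/override_creds"

/* Returns 1 if the given path exists, 0 otherwise. */
static int path_present(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Returns 1 if the running kernel exposes the override_creds parameter. */
static int ovl_has_override_creds(void)
{
    return path_present(OVL_OVERRIDE_CREDS_PARAM);
}
```

A mount helper could call `ovl_has_override_creds()` before passing `override_creds=off`, and fall back to the default behaviour when it returns 0.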
