Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM

2017-07-19 Thread Christoffer Dall
Hi Jintack,

On Tue, Jul 18, 2017 at 10:23:05PM -0400, Jintack Lim wrote:
> On Tue, Jul 18, 2017 at 12:58 PM, Jintack Lim  wrote:
> > Nested virtualization is the ability to run a virtual machine inside another
> > virtual machine. In other words, it’s about running a hypervisor (the guest
> > hypervisor) on top of another hypervisor (the host hypervisor).
> >
> > Supporting nested virtualization on ARM means that the hypervisor provides 
> > not
> > only EL0/EL1 execution environment to VMs as it usually does but also the
> > virtualization extensions including EL2 execution environment. Once the host
> > hypervisor provides those execution environments to the VMs, then the guest
> > hypervisor can run its own VMs (nested VMs) naturally.
> >
> > This series supports nested virtualization on arm64. ARM recently announced 
> > an
> > extension (ARMv8.3) which has support for nested virtualization[1]. This 
> > patch
> > set is based on the ARMv8.3 specification and tested on the FastModel with
> > ARMv8.3 extension.
> >
> > The whole patch set to support nested virtualization is huge over 70
> > patches, so I categorized them into four parts: CPU, memory, VGIC, and timer
> > virtualization. This patch series is the first part.
> >
> > CPU virtualization patch series provides basic nested virtualization 
> > framework
> > and instruction emulations including v8.1 VHE feature and v8.3 nested
> > virtualization feature for VMs.
> >
> > This patch series again can be divided into four parts. Patch 1 to 5 
> > introduces
> > nested virtualization by discovering hardware feature, adding a kernel
> > parameter and allowing the userspace to set the initial CPU mode to EL2.
> >
> > Patch 6 to 25 are to support the EL2 execution environment, the virtual 
> > EL2, to
> > a VM on v8.0 architecture. We de-privilege the guest hypervisor and emulate 
> > the
> > virtual EL2 mode in EL1 using the hardware features provided by ARMv8.3; The
> > host hypervisor manages virtual EL2 register state for the guest hypervisor
> > and shadow EL1 register state that reflects the virtual EL2 register state 
> > to
> > run the guest hypervisor in EL1.
> >
> > Patch 26 to 33 add support for the virtual EL2 with Virtualization Host
> > Extensions. These patches emulate newly defined registers and bits in v8.1 
> > and
> > allow the virtual EL2 to access EL2 register states via EL1 register 
> > accesses
> > as in the real EL2.
> >
> > Patch 34 to 38 are to support for the virtual EL2 with nested 
> > virtualization.
> > These enable recursive nested virtualization.
> >
> > This patch set is tested on the FastModel with the v8.3 extension for arm64 
> > and
> > a cubietruck for arm32. On the FastModel, the host and the guest kernels are
> > compiled with and without VHE, so there are four combinations. I was able to
> > boot SMP Linux in the nested VM on all four configurations and able to run
> > hackbench. I also checked that regular VMs could boot when the nested
> > virtualization kernel parameter was not set. On the cubietruck, I also 
> > verified
> > that regular VMs could boot as well.
> >
> > I'll share my experiment setup shortly.
> 
> I summarized my experiment setup here.
> 
> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup
> 

Thanks for sharing this.

> >
> > Even though this work has some limitations and TODOs, I'd appreciate early
> > feedback on this RFC. Specifically, I'm interested in:
> >
> > - Overall design to manage vcpu context for the virtual EL2
> > - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2
> >   (Patch 30 and 32)
> > - Patch organization and coding style
> 
> I also wonder if the hardware and/or KVM do not support nested
> virtualization but the userspace uses nested virtualization option,
> which one is better: giving an error or launching a regular VM
> silently.
> 

I think KVM should complain to userspace if userspace tries to set a
feature it does not support, and I think userspace should give as
meaningful an error message as possible to the user when that happens.

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/6] Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver

2017-07-19 Thread Jonathan Cameron
On Tue, 18 Jul 2017 15:59:54 +0800
Shaokun Zhang  wrote:

> This patch adds documentation for the uncore PMUs on HiSilicon SoC.
> 
> Signed-off-by: Shaokun Zhang 
> Signed-off-by: Anurup M   
Hi Shaokun,

Sorry for the late reply on this (only recently joined Huawei)

This is a fairly generic review of the code rather than going into
the actual userspace ABI choices as this is an area I'm only just
starting to become familiar with.

Thanks,

Jonathan
> ---
>  Documentation/perf/hisi-pmu.txt | 51 
> +
>  1 file changed, 51 insertions(+)
>  create mode 100644 Documentation/perf/hisi-pmu.txt
> 
> diff --git a/Documentation/perf/hisi-pmu.txt b/Documentation/perf/hisi-pmu.txt
> new file mode 100644
> index 000..5fa0b1a
> --- /dev/null
> +++ b/Documentation/perf/hisi-pmu.txt
> @@ -0,0 +1,51 @@
> +HiSilicon SoC uncore Performance Monitoring Unit (PMU)
> +==
> +The HiSilicon SoC chip comprehends various independent system device PMUs
> +such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
> +independent and have hardware logic to gather statistics and performance
> +information.
> +
> +HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster (CC
> +L) is made up of 4 cpu cores sharing one L3 cache; Each CPU die is called  
nitpick, I'd not have a line break mid acronym. 

> +Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has two HHAs
> +(0 - 1) and four DDRCs (0 - 3), respectively.
> +
> +HiSilicon SoC uncore PMU driver
> +---
> +Each device PMU has separate registers for event counting, control and
> +interrupt, and the PMU driver shall register perf PMU drivers like L3C,
> +HHA and DDRC etc. The available events and configuration options shall
> +be described in the sysfs, see /sys/devices/hisi_*.  
Is there not a subsystem directory that would make more sense to
refer to than the full device list?

> +The "perf list" command shall list the available events from sysfs.
> +
> +Each L3C, HHA and DDRC in one SCCL are registered as an separate PMU with 
> perf.
> +The PMU name will appear in event listing as hisi_module 
> _.
> +where "index-id" is the index of module and "sccl-id" is the identifier of
> +the SCCL.
> +e.g. hisi_l3c0_1/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 and 
> SCCL
> +ID #1.
> +e.g. hisi_hha0_1/rx_operations is RX_OPERATIONS event of HHA index #0 and 
> SCCL
> +ID #1.
> +
> +The driver also provides a "cpumask" sysfs attribute, which shows the CPU 
> core
> +ID used to count the uncore PMU event.
> +
> +Example usage of perf:
> +$# perf list
> +hisi_l3c0_3/rd_hit_cpipe/ [kernel PMU event]
> +--
> +hisi_l3c0_3/wr_hit_cpipe/ [kernel PMU event]
> +--
> +hisi_l3c0_1/rd_hit_cpipe/ [kernel PMU event]
> +--
> +hisi_l3c0_1/wr_hit_cpipe/ [kernel PMU event]
> +--
> +
> +$# perf stat -a -e hisi_l3c0_1/rd_hit_cpipe/ sleep 5
> +$# perf stat -a -e hisi_l3c0_1/config=0x02/ sleep 5
> +
> +The current driver does not support sampling. So "perf record" is 
> unsupported.
> +Also attach to a task is unsupported as the events are all uncore.
> +
> +Note: Please contact the maintainer for a complete list of events supported 
> for
> +the PMU devices in the SoC and its information if needed.  


___
linuxarm mailing list

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver

2017-07-19 Thread Jonathan Cameron
On Tue, 18 Jul 2017 15:59:55 +0800
Shaokun Zhang  wrote:

> This patch adds support HiSilicon SoC uncore PMU driver framework and
> interfaces.
> 
> Signed-off-by: Shaokun Zhang 
> Signed-off-by: Anurup M   
A couple of minor things inline.

Jonathan
> ---
>  drivers/perf/Kconfig |   7 +
>  drivers/perf/Makefile|   1 +
>  drivers/perf/hisilicon/Makefile  |   1 +
>  drivers/perf/hisilicon/hisi_uncore_pmu.c | 411 
> +++
>  drivers/perf/hisilicon/hisi_uncore_pmu.h | 116 +
>  5 files changed, 536 insertions(+)
>  create mode 100644 drivers/perf/hisilicon/Makefile
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h
> 
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index e5197ff..78fc4bc 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -17,6 +17,13 @@ config ARM_PMU_ACPI
>   depends on ARM_PMU && ACPI
>   def_bool y
>  
> +config HISI_PMU
> + bool "HiSilicon SoC PMU"
> + depends on ARM64 && ACPI
> + help
> +   Support for HiSilicon SoC uncore performance monitoring
> +   unit (PMU), such as: L3C, HHA and DDRC.
> +
>  config QCOM_L2_PMU
>   bool "Qualcomm Technologies L2-cache PMU"
>   depends on ARCH_QCOM && ARM64 && ACPI
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 6420bd4..41d3342 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -1,5 +1,6 @@
>  obj-$(CONFIG_ARM_PMU) += arm_pmu.o arm_pmu_platform.o
>  obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
> +obj-$(CONFIG_HISI_PMU) += hisilicon/
>  obj-$(CONFIG_QCOM_L2_PMU)+= qcom_l2_pmu.o
>  obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
>  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
> new file mode 100644
> index 000..2783bb3
> --- /dev/null
> +++ b/drivers/perf/hisilicon/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o
> diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c 
> b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> new file mode 100644
> index 000..4eb04e1
> --- /dev/null
> +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> @@ -0,0 +1,411 @@
> +/*
> + * HiSilicon SoC Hardware event counters support
> + *
> + * Copyright (C) 2017 Hisilicon Limited
> + * Author: Anurup M 
> + * Shaokun Zhang 
> + *
> + * This code is based on the uncore PMUs like arm-cci and arm-ccn.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "hisi_uncore_pmu.h"
> +
> +/*
> + * PMU format attributes
> + */
> +ssize_t hisi_format_sysfs_show(struct device *dev,
> +struct device_attribute *attr, char *buf)
> +{
> + struct dev_ext_attribute *eattr;
> +
> + eattr = container_of(attr, struct dev_ext_attribute, attr);
> +
> + return sprintf(buf, "%s\n", (char *)eattr->var);
> +}
> +
> +/*
> + * PMU event attributes
> + */
> +ssize_t hisi_event_sysfs_show(struct device *dev,
> +   struct device_attribute *attr, char *page)
> +{
> + struct dev_ext_attribute *eattr;
> +
> + eattr = container_of(attr, struct dev_ext_attribute, attr);
> +
> + return sprintf(page, "config=0x%lx\n", (unsigned long)eattr->var);
> +}
> +
> +/*
> + * sysfs cpumask attributes
> + */
> +ssize_t hisi_cpumask_sysfs_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct pmu *pmu = dev_get_drvdata(dev);
> + struct hisi_pmu *hisi_pmu = to_hisi_pmu(pmu);
> +
> + return cpumap_print_to_pagebuf(true, buf, &hisi_pmu->cpus);
> +}
> +
> +/* Read Super CPU cluster and CPU cluster ID from MPIDR_EL1 */
> +void hisi_read_scl_and_ccl_id(u32 *scl_id, u32 *ccl_id)
> +{
> + u64 mpidr;
> +
> + mpidr = read_cpuid_mpidr();
> + if (mpidr & MPIDR_MT_BITMASK) {
> + if (scl_id)
> + *scl_id = MPIDR_AFFINITY_LEVEL(mpidr, 3);
> + if (ccl_id)
> + *ccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 2);
> + } else {
> + if (scl_id)
> + *scl_id = MPIDR_AFFINITY_LEVEL(mpidr, 2);
> + if (ccl_id)
> + *ccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> + }
> +}
> +
> +sta

Re: [PATCH v3 3/6] perf: hisi: Add support for HiSilicon SoC L3C PMU driver

2017-07-19 Thread Jonathan Cameron
On Tue, 18 Jul 2017 15:59:56 +0800
Shaokun Zhang  wrote:

> This patch adds support for L3C PMU driver in HiSilicon SoC chip, Each
> L3C has own control, counter and interrupt registers and is an separate
> PMU. For each L3C PMU, it has 8-programable counters and supports 0x60
> events, event code is 8-bits and every counter is free-running.
> Interrupt is supported to handle counter (48-bits) overflow.
> 
> Signed-off-by: Shaokun Zhang 
> Signed-off-by: Anurup M   
Hi Shaokun,

A few minor points inline.

Thanks,

Jonathan
> ---
>  drivers/perf/hisilicon/Makefile  |   2 +-
>  drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 556 
> +++
>  include/linux/cpuhotplug.h   |   1 +
>  3 files changed, 558 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
> 
> diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
> index 2783bb3..4a3d3e6 100644
> --- a/drivers/perf/hisilicon/Makefile
> +++ b/drivers/perf/hisilicon/Makefile
> @@ -1 +1 @@
> -obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o
> +obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o
> diff --git a/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c 
> b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
> new file mode 100644
> index 000..d430508
> --- /dev/null
> +++ b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
> @@ -0,0 +1,556 @@
> +/*
> + * HiSilicon SoC L3C uncore Hardware event counters support
> + *
> + * Copyright (C) 2017 Hisilicon Limited
> + * Author: Anurup M 
> + * Shaokun Zhang 
> + *
> + * This code is based on the uncore PMUs like arm-cci and arm-ccn.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
Missed this on previous but putting full gpl text in is not normally done any
more. 
> + */
> +#include 
> +#include   
Not immediately seeing any bitmaps in use in here.
(I may well have missed something though!)

> +#include 
> +#include 
> +#include 
> +#include 
> +#include "hisi_uncore_pmu.h"

> +
> +/* Check if the CPU belongs to the SCCL and CCL of PMU */
> +static bool hisi_l3c_is_cpu_in_ccl(struct hisi_pmu *l3c_pmu)
> +{
> + u32 ccl_id, scl_id;
> +
> + /* Read Super CPU cluster ID from CPU MPIDR_EL1 */
> + hisi_read_scl_and_ccl_id(&scl_id, &ccl_id);
> +
> + /* Check if the CPU match with the SCCL and CCL */
> + if (scl_id != l3c_pmu->scl_id)
> + return false;
> + if (ccl_id != l3c_pmu->ccl_id)
> + return false;
> +
> + /* Return true if matched */  
I think the name of the function makes this clear enough to
not need a comment here, but up to you.
> + return true;
> +}
> +
> +static int hisi_l3c_pmu_online_cpu(unsigned int cpu, struct hlist_node *node)
> +{
> + struct hisi_pmu *l3c_pmu;
> +
> + l3c_pmu = hlist_entry_safe(node, struct hisi_pmu, node);
> +
> + /* Proceed only when CPU belongs to the same SCCL and CCL */
> + if (!hisi_l3c_is_cpu_in_ccl(l3c_pmu))
> + return 0;
> +
> + /* If another CPU is already managing the CPU cluster, simply return */
> + if (!cpumask_empty(&l3c_pmu->cpus))
> + return 0;
> +
> + /* Use this CPU for event counting in the CCL */
> + cpumask_set_cpu(cpu, &l3c_pmu->cpus);
> +
> + return 0;
> +}
> +
> +static int hisi_l3c_pmu_offline_cpu(unsigned int cpu, struct hlist_node 
> *node)
> +{
> + struct hisi_pmu *l3c_pmu;
> + cpumask_t ccl_online_cpus;
> + unsigned int target;
> +
> + l3c_pmu = hlist_entry_safe(node, struct hisi_pmu, node);
> +
> + /* Proceed only when CPU belongs to the same SCCL and CCL */
> + if (!hisi_l3c_is_cpu_in_ccl(l3c_pmu))
> + return 0;
> +
> + /* Proceed if this CPU is used for event counting */
> + if (!cpumask_test_cpu(cpu, &l3c_pmu->cpus))
> + return 0;
> +
> + /* Give up ownership of CCL */
> + cpumask_test_and_clear_cpu(cpu, &l3c_pmu->cpus);
> +
> + /* Any other CPU for this CCL which is still online */
> + cpumask_and(&ccl_online_cpus, cpu_coregroup_mask(cpu),
> + cpu_online_mask);
> + target = cpumask_any_but(&ccl_online_cpus, cpu);
> + if (target >= nr_cpu_ids)
> + return 0;
> +
> + perf_pmu_migrate_context(&l3c_pmu->pmu, cpu, target);
> + /* Use this CPU for event counting in the CCL */
> + cpumask_set_cpu(target, &l3c_pmu->cpus);
> +
> + return 0;
> +}

[PATCH V3 0/9] cpufreq: transition-latency cleanups

2017-07-19 Thread Viresh Kumar
Hi Rafael,

This series tries to cleanup the code around transition-latency and its
users. Some of the old legacy code, which may not make much sense now,
is dropped as well. And some code consolidation is also done across
governors.

Based of: v4.13-rc1
Tested on: ARM64 Hikey board.

I have pushed it here as well (which gets tested by kbuild test bot):

git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git 
cpufreq/transition-latency

V2->V3:
- Rearranged patches to keep related stuff together
- Introduce CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING flag (Rafael)
- Minor optimization in cpufreq_policy_transition_delay_us() and moved
  it to cpufreq.c (Rafael)
- Allow dynamic switching for drivers which don't know their transition
  latency.

V1->V2:
- While we still get rid of the limitation of 10ms for using
  ondemand/conservative, but we preserve the earlier behavior where the
  transition latency set to CPUFREQ_ETERNAL would not allow use of
  ondemand/conservative governors. Thanks to Dominik for his feedback on
  that.

--
viresh

Viresh Kumar (9):
  cpufreq: governor: Drop min_sampling_rate
  cpufreq: Use transition_delay_us for legacy governors as well
  cpufreq: Cap the default transition delay value to 10 ms
  cpufreq: Don't set transition_latency for setpolicy drivers
  cpufreq: arm_big_little: Make ->get_transition_latency() mandatory
  cpufreq: Replace "max_transition_latency" with "dynamic_switching"
  cpufreq: schedutil: Set dynamic_switching to true
  cpufreq: Add CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING cpufreq driver flag
  cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency

 Documentation/admin-guide/pm/cpufreq.rst |  8 
 drivers/cpufreq/arm_big_little.c | 10 --
 drivers/cpufreq/cpufreq-nforce2.c|  2 +-
 drivers/cpufreq/cpufreq.c| 34 
 drivers/cpufreq/cpufreq_conservative.c   |  6 --
 drivers/cpufreq/cpufreq_governor.c   | 17 ++--
 drivers/cpufreq/cpufreq_governor.h   |  3 +--
 drivers/cpufreq/cpufreq_ondemand.c   | 12 ---
 drivers/cpufreq/elanfreq.c   |  4 +---
 drivers/cpufreq/gx-suspmod.c |  2 +-
 drivers/cpufreq/intel_pstate.c   |  1 -
 drivers/cpufreq/longrun.c|  1 -
 drivers/cpufreq/pmac32-cpufreq.c |  7 +--
 drivers/cpufreq/sa1100-cpufreq.c |  5 +++--
 drivers/cpufreq/sa1110-cpufreq.c |  5 +++--
 drivers/cpufreq/sh-cpufreq.c |  3 +--
 drivers/cpufreq/speedstep-smi.c  |  2 +-
 drivers/cpufreq/unicore2-cpufreq.c   |  3 +--
 include/linux/cpufreq.h  | 18 -
 kernel/sched/cpufreq_schedutil.c | 12 ++-
 20 files changed, 65 insertions(+), 90 deletions(-)

-- 
2.13.0.71.gd7076ec9c9cb

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3 1/9] cpufreq: governor: Drop min_sampling_rate

2017-07-19 Thread Viresh Kumar
The cpufreq core and governors aren't supposed to set a limit on how
fast we want to try changing the frequency. This is currently done for
the legacy governors with help of min_sampling_rate.

At worst, we may end up setting the sampling rate to a value lower than
the rate at which frequency can be changed and then one of the CPUs in
the policy will be only changing frequency for ever.

But that is something for the user to decide and there is no need to
have special handling for such cases in the core. Leave it for the user
to figure out.

Signed-off-by: Viresh Kumar 
---
 Documentation/admin-guide/pm/cpufreq.rst |  8 
 drivers/cpufreq/cpufreq_conservative.c   |  6 --
 drivers/cpufreq/cpufreq_governor.c   | 10 ++
 drivers/cpufreq/cpufreq_governor.h   |  1 -
 drivers/cpufreq/cpufreq_ondemand.c   | 12 
 include/linux/cpufreq.h  |  2 --
 6 files changed, 2 insertions(+), 37 deletions(-)

diff --git a/Documentation/admin-guide/pm/cpufreq.rst 
b/Documentation/admin-guide/pm/cpufreq.rst
index 463cf7e73db8..2eb3bf62393e 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -471,14 +471,6 @@ it is allowed to use (the ``scaling_max_freq`` policy 
limit).
 
# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > 
ondemand/sampling_rate
 
-
-``min_sampling_rate``
-   The minimum value of ``sampling_rate``.
-
-   Equal to 1 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and
-   :c:data:`tick_nohz_active` are both set or to 20 times the value of
-   :c:data:`jiffies` in microseconds otherwise.
-
 ``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
will set the frequency to the maximum value allowed for the policy.
diff --git a/drivers/cpufreq/cpufreq_conservative.c 
b/drivers/cpufreq/cpufreq_conservative.c
index 88220ff3e1c2..f20f20a77d4d 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -246,7 +246,6 @@ gov_show_one_common(sampling_rate);
 gov_show_one_common(sampling_down_factor);
 gov_show_one_common(up_threshold);
 gov_show_one_common(ignore_nice_load);
-gov_show_one_common(min_sampling_rate);
 gov_show_one(cs, down_threshold);
 gov_show_one(cs, freq_step);
 
@@ -254,12 +253,10 @@ gov_attr_rw(sampling_rate);
 gov_attr_rw(sampling_down_factor);
 gov_attr_rw(up_threshold);
 gov_attr_rw(ignore_nice_load);
-gov_attr_ro(min_sampling_rate);
 gov_attr_rw(down_threshold);
 gov_attr_rw(freq_step);
 
 static struct attribute *cs_attributes[] = {
-   &min_sampling_rate.attr,
&sampling_rate.attr,
&sampling_down_factor.attr,
&up_threshold.attr,
@@ -297,10 +294,7 @@ static int cs_init(struct dbs_data *dbs_data)
dbs_data->up_threshold = DEF_FREQUENCY_UP_THRESHOLD;
dbs_data->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR;
dbs_data->ignore_nice_load = 0;
-
dbs_data->tuners = tuners;
-   dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
-   jiffies_to_usecs(10);
 
return 0;
 }
diff --git a/drivers/cpufreq/cpufreq_governor.c 
b/drivers/cpufreq/cpufreq_governor.c
index 47e24b5384b3..858081f9c3d7 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -47,14 +47,11 @@ ssize_t store_sampling_rate(struct gov_attr_set *attr_set, 
const char *buf,
 {
struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
-   unsigned int rate;
int ret;
-   ret = sscanf(buf, "%u", &rate);
+   ret = sscanf(buf, "%u", &dbs_data->sampling_rate);
if (ret != 1)
return -EINVAL;
 
-   dbs_data->sampling_rate = max(rate, dbs_data->min_sampling_rate);
-
/*
 * We are operating under dbs_data->mutex and so the list and its
 * entries can't be freed concurrently.
@@ -437,10 +434,7 @@ int cpufreq_dbs_governor_init(struct cpufreq_policy 
*policy)
latency = 1;
 
/* Bring kernel and HW constraints together */
-   dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate,
- MIN_LATENCY_MULTIPLIER * latency);
-   dbs_data->sampling_rate = max(dbs_data->min_sampling_rate,
- LATENCY_MULTIPLIER * latency);
+   dbs_data->sampling_rate = LATENCY_MULTIPLIER * latency;
 
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
diff --git a/drivers/cpufreq/cpufreq_governor.h 
b/drivers/cpufreq/cpufreq_governor.h
index 0236ec2cd654..95f207eb820e 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -41,7 +41,6 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
 struct dbs_data {
struct gov_attr_set attr_set;
void *tuners;
-   unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
unsigned

Re: [PATCH 1/2] drivers: pwm: core: implement pwm mode

2017-07-19 Thread m18063
Hello Thierry,

I know you may be very busy but do you have any resolution regarding
this series and the one with title
"[PATCH v2 0/2] extends PWM framework to support PWM dead-times"

Thank you,
Claudiu

On 09.05.2017 15:15, Claudiu Beznea - M18063 wrote:
> Extends PWM framework to support PWM modes. The currently
> implemented PWM modes were called PWM complementary mode
> and PWM push-pull mode. For devices that have more than one
> output per PWM channel:
> - PWM complementary mode is standard working mode; in PWM
> complementary mode the rising and falling edges of the
> channels outputs have opposite levels, same duration and
> same starting time.
> - in PWM push-pull mode the channles outputs has same levels,
> same duration and the rising edges are delayed until the
> beginning of the next period.
> A new member was added in pwm_state structure in order to
> keep the new PWM argument.
> For PWM channels with inputs in DT, the mode could also be
> passed via DT. No new extra field need to be added in device
> tree; since the signal polarity is passed as a flag via DT,
> the field used to pass information for PWM channel polarity
> via DT is also used by PWM mode. Since, for the moment, only
> the complementary and push-pull modes are implemented, only
> one bit in flags used to pass polarity could also be used for
> PWM mode. The other drivers are not affected by this change.
> 
> Signed-off-by: Claudiu Beznea 
> ---
>  Documentation/pwm.txt | 24 +---
>  drivers/pwm/core.c| 13 +--
>  drivers/pwm/sysfs.c   | 52 
> +++
>  include/dt-bindings/pwm/pwm.h |  1 +
>  include/linux/pwm.h   | 37 --
>  5 files changed, 120 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/pwm.txt b/Documentation/pwm.txt
> index 789b27c..1b6fc59 100644
> --- a/Documentation/pwm.txt
> +++ b/Documentation/pwm.txt
> @@ -59,9 +59,9 @@ In addition to the PWM state, the PWM API also exposes PWM 
> arguments, which
>  are the reference PWM config one should use on this PWM.
>  PWM arguments are usually platform-specific and allows the PWM user to only
>  care about dutycycle relatively to the full period (like, duty = 50% of the
> -period). struct pwm_args contains 2 fields (period and polarity) and should
> -be used to set the initial PWM config (usually done in the probe function
> -of the PWM user). PWM arguments are retrieved with pwm_get_args().
> +period). struct pwm_args contains 3 fields (period, polarity and mode) and
> +should be used to set the initial PWM config (usually done in the probe
> +function of the PWM user). PWM arguments are retrieved with pwm_get_args().
>  
>  Using PWMs with the sysfs interface
>  ---
> @@ -100,6 +100,24 @@ enable - Enable/disable the PWM signal (read/write).
>   0 - disabled
>   1 - enabled
>  
> +mode - Set PWM signal type (for channels with more than one output
> +   per channel)
> + complementary - for an output signal as follow:
> + ________
> +PWMH1 __|  |__|  |__|  |__|  |__
> +  __________
> +PWML1   |__|  |__|  |__|  |__|
> +<--T-->
> + Where T is the signal period.
> +
> + push-pull - for an output signal as follows:
> + __  __
> +PWMH1 __|  ||  |
> +   __  __
> +PWML1 |  ||  |__
> +<--T-->
> + Where T is the signal period.
> +
>  Implementing a PWM driver
>  -
>  
> diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
> index a0860b3..4efc92d 100644
> --- a/drivers/pwm/core.c
> +++ b/drivers/pwm/core.c
> @@ -154,9 +154,14 @@ of_pwm_xlate_with_flags(struct pwm_chip *pc, const 
> struct of_phandle_args *args)
>  
>   pwm->args.period = args->args[1];
>   pwm->args.polarity = PWM_POLARITY_NORMAL;
> + pwm->args.mode = PWM_MODE_COMPLEMENTARY;
>  
> - if (args->args_count > 2 && args->args[2] & PWM_POLARITY_INVERTED)
> - pwm->args.polarity = PWM_POLARITY_INVERSED;
> + if (args->args_count > 2) {
> + if (args->args[2] & PWM_POLARITY_INVERTED)
> + pwm->args.polarity = PWM_POLARITY_INVERSED;
> + if (args->args[2] & PWM_MODE_PUSHPULL)
> + pwm->args.mode = PWM_MODE_PUSH_PULL;
> + }
>  
>   return pwm;
>  }
> @@ -579,6 +584,8 @@ int pwm_adjust_config(struct pwm_device *pwm)
>   pwm_get_args(pwm, &pargs);
>   pwm_get_state(pwm, &state);
>  
> + state.mode = pargs.mode;
> +
>   /*
>* If the current period is zero it means that either the PWM driver
>* does not support initial state retrieval or the PWM has not yet
> @@ -997,6 +1004,8 @@ static voi

[PATCH 0/3] simulated interrupts

2017-07-19 Thread Bartosz Golaszewski
Some frameworks (e.g. iio, gpiolib) use irq_work to implement simulated
interrupts that can be 'fired' from process context when needed and
requested just like normal interrupts. This is useful for testing and
development purposes.

Currently this code is reimplemented by every user. This series
proposes to add a new set of functions that can be used by drivers
that want to simulate interrupts without having to duplicate any
boilerplate code.

The first patch adds a simple irq simulator framework. The second
extends it with resource management. The third uses the new
functionality in the gpio-mockup testing driver.

NOTE: The next candidate for using this API would be iio-dummy-evgen.

Bartosz Golaszewski (3):
  irq/irq_sim: add a simple interrupt simulator framework
  irq/irq_sim: add a devres variant of irq_sim_init()
  gpio: mockup: use irq_sim

 Documentation/driver-model/devres.txt |   1 +
 drivers/gpio/Kconfig  |   2 +-
 drivers/gpio/gpio-mockup.c|  77 ++--
 include/linux/irq_sim.h   |  41 +
 init/Kconfig  |   4 +
 kernel/Makefile   |   1 +
 kernel/irq_sim.c  | 161 ++
 7 files changed, 216 insertions(+), 71 deletions(-)
 create mode 100644 include/linux/irq_sim.h
 create mode 100644 kernel/irq_sim.c

-- 
2.13.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] irq/irq_sim: add a devres variant of irq_sim_init()

2017-07-19 Thread Bartosz Golaszewski
Add a resource managed version of irq_sim_init(). This can be
conveniently used in device drivers.

Signed-off-by: Bartosz Golaszewski 
---
 Documentation/driver-model/devres.txt |  1 +
 include/linux/irq_sim.h   |  4 
 kernel/irq_sim.c  | 43 +++
 3 files changed, 48 insertions(+)

diff --git a/Documentation/driver-model/devres.txt 
b/Documentation/driver-model/devres.txt
index 30e04f7a690d..69f08c0f23a8 100644
--- a/Documentation/driver-model/devres.txt
+++ b/Documentation/driver-model/devres.txt
@@ -312,6 +312,7 @@ IRQ
   devm_irq_alloc_descs_from()
   devm_irq_alloc_generic_chip()
   devm_irq_setup_generic_chip()
+  devm_irq_sim_init()
 
 LED
   devm_led_classdev_register()
diff --git a/include/linux/irq_sim.h b/include/linux/irq_sim.h
index 0c1abf0e3244..94c4bfc9b7a9 100644
--- a/include/linux/irq_sim.h
+++ b/include/linux/irq_sim.h
@@ -9,6 +9,7 @@
 #define _LINUX_IRQ_SIM_H
 
 #include 
+#include 
 
 struct irq_sim_work_ctx {
struct irq_work work;
@@ -30,6 +31,9 @@ struct irq_sim {
 int irq_sim_init(struct irq_sim *sim, unsigned int num_irqs);
 void irq_sim_fini(struct irq_sim *sim);
 
+int devm_irq_sim_init(struct device *dev,
+ struct irq_sim *sim, unsigned int num_irqs);
+
 void irq_sim_fire(struct irq_sim *sim, unsigned int offset);
 
 int irq_sim_irqnum(struct irq_sim *sim, unsigned int offset);
diff --git a/kernel/irq_sim.c b/kernel/irq_sim.c
index 062110c90d2a..deb5a58eede5 100644
--- a/kernel/irq_sim.c
+++ b/kernel/irq_sim.c
@@ -89,6 +89,49 @@ void irq_sim_fini(struct irq_sim *sim)
 }
 EXPORT_SYMBOL_GPL(irq_sim_fini);
 
+struct irq_sim_devres {
+   struct irq_sim *sim;
+};
+
+static void devm_irq_sim_release(struct device *dev, void *res)
+{
+   struct irq_sim_devres *this = res;
+
+   irq_sim_fini(this->sim);
+}
+
+/**
+ * irq_sim_init - Initialize the interrupt simulator for a managed device.
+ *
+ * @dev:Device to initialize the simulator object for.
+ * @sim:The interrupt simulator object to initialize.
+ * @num_irqs:   Number of interrupts to allocate
+ *
+ * Returns 0 on success and a negative error number on failure.
+ */
+int devm_irq_sim_init(struct device *dev,
+ struct irq_sim *sim, unsigned int num_irqs)
+{
+   struct irq_sim_devres *dr;
+   int rv;
+
+   dr = devres_alloc(devm_irq_sim_release, sizeof(*dr), GFP_KERNEL);
+   if (!dr)
+   return -ENOMEM;
+
+   rv = irq_sim_init(sim, num_irqs);
+   if (rv) {
+   devres_free(dr);
+   return rv;
+   }
+
+   dr->sim = sim;
+   devres_add(dev, dr);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(devm_irq_sim_init);
+
 /**
  * irq_sim_fire - Enqueue an interrupt.
  *
-- 
2.13.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] gpio: mockup: use irq_sim

2017-07-19 Thread Bartosz Golaszewski
Shrink the driver by removing the code dealing with dummy interrupts
and replacing it with calls to the irq_sim API.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/gpio/Kconfig   |  2 +-
 drivers/gpio/gpio-mockup.c | 77 +-
 2 files changed, 8 insertions(+), 71 deletions(-)

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index f235eae04c16..e2264fb31f87 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -311,7 +311,7 @@ config GPIO_MOCKUP
depends on GPIOLIB && SYSFS
select GPIO_SYSFS
select GPIOLIB_IRQCHIP
-   select IRQ_WORK
+   select IRQ_SIM
help
  This enables GPIO Testing driver, which provides a way to test GPIO
  subsystem through sysfs(or char device) and debugfs. GPIO_SYSFS
diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c
index a6565e128f9e..6db7163e6d98 100644
--- a/drivers/gpio/gpio-mockup.c
+++ b/drivers/gpio/gpio-mockup.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 
@@ -47,18 +47,12 @@ enum {
 struct gpio_mockup_line_status {
int dir;
bool value;
-   bool irq_enabled;
-};
-
-struct gpio_mockup_irq_context {
-   struct irq_work work;
-   int irq;
 };
 
 struct gpio_mockup_chip {
struct gpio_chip gc;
struct gpio_mockup_line_status *lines;
-   struct gpio_mockup_irq_context irq_ctx;
+   struct irq_sim irqsim;
struct dentry *dbg_dir;
 };
 
@@ -144,65 +138,11 @@ static int gpio_mockup_name_lines(struct device *dev,
return 0;
 }
 
-static int gpio_mockup_to_irq(struct gpio_chip *chip, unsigned int offset)
-{
-   return chip->irq_base + offset;
-}
-
-static void gpio_mockup_irqmask(struct irq_data *data)
+static int gpio_mockup_to_irq(struct gpio_chip *gc, unsigned int offset)
 {
-   struct gpio_chip *gc = irq_data_get_irq_chip_data(data);
struct gpio_mockup_chip *chip = gpiochip_get_data(gc);
 
-   chip->lines[data->irq - gc->irq_base].irq_enabled = false;
-}
-
-static void gpio_mockup_irqunmask(struct irq_data *data)
-{
-   struct gpio_chip *gc = irq_data_get_irq_chip_data(data);
-   struct gpio_mockup_chip *chip = gpiochip_get_data(gc);
-
-   chip->lines[data->irq - gc->irq_base].irq_enabled = true;
-}
-
-static struct irq_chip gpio_mockup_irqchip = {
-   .name   = GPIO_MOCKUP_NAME,
-   .irq_mask   = gpio_mockup_irqmask,
-   .irq_unmask = gpio_mockup_irqunmask,
-};
-
-static void gpio_mockup_handle_irq(struct irq_work *work)
-{
-   struct gpio_mockup_irq_context *irq_ctx;
-
-   irq_ctx = container_of(work, struct gpio_mockup_irq_context, work);
-   handle_simple_irq(irq_to_desc(irq_ctx->irq));
-}
-
-static int gpio_mockup_irqchip_setup(struct device *dev,
-struct gpio_mockup_chip *chip)
-{
-   struct gpio_chip *gc = &chip->gc;
-   int irq_base, i;
-
-   irq_base = devm_irq_alloc_descs(dev, -1, 0, gc->ngpio, 0);
-   if (irq_base < 0)
-   return irq_base;
-
-   gc->irq_base = irq_base;
-   gc->irqchip = &gpio_mockup_irqchip;
-
-   for (i = 0; i < gc->ngpio; i++) {
-   irq_set_chip(irq_base + i, gc->irqchip);
-   irq_set_chip_data(irq_base + i, gc);
-   irq_set_handler(irq_base + i, &handle_simple_irq);
-   irq_modify_status(irq_base + i,
- IRQ_NOREQUEST | IRQ_NOAUTOEN, IRQ_NOPROBE);
-   }
-
-   init_irq_work(&chip->irq_ctx.work, gpio_mockup_handle_irq);
-
-   return 0;
+   return irq_sim_irqnum(&chip->irqsim, offset);
 }
 
 static ssize_t gpio_mockup_event_write(struct file *file,
@@ -228,11 +168,8 @@ static ssize_t gpio_mockup_event_write(struct file *file,
chip = priv->chip;
gc = &chip->gc;
 
-   if (chip->lines[priv->offset].irq_enabled) {
-   gpiod_set_value_cansleep(desc, val);
-   priv->chip->irq_ctx.irq = gc->irq_base + priv->offset;
-   irq_work_queue(&priv->chip->irq_ctx.work);
-   }
+   gpiod_set_value_cansleep(desc, val);
+   irq_sim_fire(&chip->irqsim, priv->offset);
 
return size;
 }
@@ -319,7 +256,7 @@ static int gpio_mockup_add(struct device *dev,
return ret;
}
 
-   ret = gpio_mockup_irqchip_setup(dev, chip);
+   ret = devm_irq_sim_init(dev, &chip->irqsim, gc->ngpio);
if (ret)
return ret;
 
-- 
2.13.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] irq/irq_sim: add a simple interrupt simulator framework

2017-07-19 Thread Bartosz Golaszewski
Implement a simple, irq_work-based framework for simulating
interrupts. Currently the API exposes routines for initializing and
deinitializing the simulator object, enqueueing the interrupts and
retrieving the allocated interrupt numbers based on the offset of the
dummy interrupt in the simulator struct.

Signed-off-by: Bartosz Golaszewski 
---
 include/linux/irq_sim.h |  37 +++
 init/Kconfig|   4 ++
 kernel/Makefile |   1 +
 kernel/irq_sim.c| 118 
 4 files changed, 160 insertions(+)
 create mode 100644 include/linux/irq_sim.h
 create mode 100644 kernel/irq_sim.c

diff --git a/include/linux/irq_sim.h b/include/linux/irq_sim.h
new file mode 100644
index ..0c1abf0e3244
--- /dev/null
+++ b/include/linux/irq_sim.h
@@ -0,0 +1,37 @@
+/*
+ * Copyright (C) 2017 Bartosz Golaszewski 
+ *
+ * Provides a framework for allocating simulated interrupts which can be
+ * requested like normal irqs and enqueued from process context.
+ */
+
+#ifndef _LINUX_IRQ_SIM_H
+#define _LINUX_IRQ_SIM_H
+
+#include 
+
+struct irq_sim_work_ctx {
+   struct irq_work work;
+   int irq;
+};
+
+struct irq_sim_irq_ctx {
+   int irqnum;
+   bool enabled;
+};
+
+struct irq_sim {
+   struct irq_sim_work_ctx work_ctx;
+   int irq_base;
+   unsigned int irq_count;
+   struct irq_sim_irq_ctx *irqs;
+};
+
+int irq_sim_init(struct irq_sim *sim, unsigned int num_irqs);
+void irq_sim_fini(struct irq_sim *sim);
+
+void irq_sim_fire(struct irq_sim *sim, unsigned int offset);
+
+int irq_sim_irqnum(struct irq_sim *sim, unsigned int offset);
+
+#endif /* _LINUX_IRQ_SIM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..220456599c3f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -23,6 +23,10 @@ config CONSTRUCTORS
 config IRQ_WORK
bool
 
+config IRQ_SIM
+   bool
+   select IRQ_WORK
+
 config BUILDTIME_EXTABLE_SORT
bool
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..4472567c5835 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_TRACE_CLOCK) += trace/
 obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
+obj-$(CONFIG_IRQ_SIM) += irq_sim.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
 
diff --git a/kernel/irq_sim.c b/kernel/irq_sim.c
new file mode 100644
index ..062110c90d2a
--- /dev/null
+++ b/kernel/irq_sim.c
@@ -0,0 +1,118 @@
+/*
+ * Copyright (C) 2017 Bartosz Golaszewski 
+ *
+ * Provides a framework for allocating simulated interrupts which can be
+ * requested like normal irqs and enqueued from process context.
+ */
+
+#include 
+#include 
+
+static void irq_sim_irqmask(struct irq_data *data)
+{
+   struct irq_sim_irq_ctx *irq_ctx = irq_data_get_irq_chip_data(data);
+
+   irq_ctx->enabled = false;
+}
+
+static void irq_sim_irqunmask(struct irq_data *data)
+{
+   struct irq_sim_irq_ctx *irq_ctx = irq_data_get_irq_chip_data(data);
+
+   irq_ctx->enabled = true;
+}
+
+static struct irq_chip irq_sim_irqchip = {
+   .name   = "irq_sim",
+   .irq_mask   = irq_sim_irqmask,
+   .irq_unmask = irq_sim_irqunmask,
+};
+
+static void irq_sim_handle_irq(struct irq_work *work)
+{
+   struct irq_sim_work_ctx *work_ctx;
+
+   work_ctx = container_of(work, struct irq_sim_work_ctx, work);
+   handle_simple_irq(irq_to_desc(work_ctx->irq));
+}
+
+/**
+ * irq_sim_init - Initialize the interrupt simulator: allocate a range of
+ *dummy interrupts.
+ *
+ * @sim:The interrupt simulator object to initialize.
+ * @num_irqs:   Number of interrupts to allocate
+ *
+ * Returns 0 on success and a negative error number on failure.
+ */
+int irq_sim_init(struct irq_sim *sim, unsigned int num_irqs)
+{
+   int i;
+
+   sim->irqs = kmalloc_array(num_irqs, sizeof(*sim->irqs), GFP_KERNEL);
+   if (!sim->irqs)
+   return -ENOMEM;
+
+   sim->irq_base = irq_alloc_descs(-1, 0, num_irqs, 0);
+   if (sim->irq_base < 0) {
+   kfree(sim->irqs);
+   return sim->irq_base;
+   }
+
+   for (i = 0; i < num_irqs; i++) {
+   sim->irqs[i].irqnum = sim->irq_base + i;
+   sim->irqs[i].enabled = false;
+   irq_set_chip(sim->irq_base + i, &irq_sim_irqchip);
+   irq_set_chip_data(sim->irq_base + i, &sim->irqs[i]);
+   irq_set_handler(sim->irq_base + i, &handle_simple_irq);
+   irq_modify_status(sim->irq_base + i,
+ IRQ_NOREQUEST | IRQ_NOAUTOEN, IRQ_NOPROBE);
+   }
+
+   init_irq_work(&sim->work_ctx.work, irq_sim_handle_irq);
+   sim->irq_count = num_irqs;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(irq_sim_init);
+
+/**
+ * irq_sim_fini - Deinitialize the interrupt simulator: free the interrupt
+ *descriptors and allocated memory.
+ *
+ *

Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Thomas Gleixner
On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:

> Some frameworks (e.g. iio, gpiolib) use irq_work to implement simulated
> interrupts that can be 'fired' from process context when needed and
> requested just like normal interrupts. This is useful for testing and
> development purposes.
> 
> Currently this code is reimplemented by every user. This series
> proposes to add a new set of functions that can be used by drivers
> that want to simulate interrupts without having to duplicate any
> boilerplate code.
> 
> The first patch adds a simple irq simulator framework. The second
> extends it with resource management. The third uses the new
> functionality in the gpio-mockup testing driver.
> 
> NOTE: The next candidate for using this API would be iio-dummy-evgen.

I like the general idea - have not looked at the code yet. Just a quick
question: How many copies/variants of this scheme do we have in tree?

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Bartosz Golaszewski
2017-07-19 14:25 GMT+02:00 Thomas Gleixner :
> On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:
>
>> Some frameworks (e.g. iio, gpiolib) use irq_work to implement simulated
>> interrupts that can be 'fired' from process context when needed and
>> requested just like normal interrupts. This is useful for testing and
>> development purposes.
>>
>> Currently this code is reimplemented by every user. This series
>> proposes to add a new set of functions that can be used by drivers
>> that want to simulate interrupts without having to duplicate any
>> boilerplate code.
>>
>> The first patch adds a simple irq simulator framework. The second
>> extends it with resource management. The third uses the new
>> functionality in the gpio-mockup testing driver.
>>
>> NOTE: The next candidate for using this API would be iio-dummy-evgen.
>
> I like the general idea - have not looked at the code yet. Just a quick
> question: How many copies/variants of this scheme do we have in tree?
>
> Thanks,
>
> tglx

Currently there are two: iio and gpiolib basically duplicate the same
code in their respective testing drivers. I only used irq_sim in
gpio-mockup in this series as an example and to see if there's any
interest in merging it before spending time on iio-dummy-evgen.

In the future this could be used in any framework that uses interrupts
and wants to test the irq code paths without touching any specific
hardware.

Thanks,
Bartosz
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 0/9] cpufreq: transition-latency cleanups

2017-07-19 Thread Rafael J. Wysocki
On Wednesday, July 19, 2017 03:42:40 PM Viresh Kumar wrote:
> Hi Rafael,
> 
> This series tries to cleanup the code around transition-latency and its
> users. Some of the old legacy code, which may not make much sense now,
> is dropped as well. And some code consolidation is also done across
> governors.
> 
> Based of: v4.13-rc1
> Tested on: ARM64 Hikey board.

>From the first quick look this version is fine by me.

Unless I find anything of concern later, this will be queued up for 4.14.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 3/4] Documentation: add drm and drm_kms_helper module param documentation

2017-07-19 Thread Jani Nikula
Obviously the MODULE_PARM_DESC documentation for
drm_kms_helper.edid_firmware needs some updating before we could
consider this.

Signed-off-by: Jani Nikula 
---
 Documentation/admin-guide/kernel-parameters.txt| 18 ---
 .../admin-guide/module-parameters/drm.rst  | 26 ++
 .../module-parameters/drm_kms_helper.rst   | 23 +++
 3 files changed, 49 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/admin-guide/module-parameters/drm.rst
 create mode 100644 
Documentation/admin-guide/module-parameters/drm_kms_helper.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 4e731b44db95..f0cee38beef2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -854,24 +854,6 @@
The filter can be disabled or changed to another
driver later using sysfs.
 
-   drm_kms_helper.edid_firmware=[:][,[:]]
-   Broken monitors, graphic adapters, KVMs and EDIDless
-   panels may send no or incorrect EDID data sets.
-   This parameter allows to specify an EDID data sets
-   in the /lib/firmware directory that are used instead.
-   Generic built-in EDID data sets are used, if one of
-   edid/1024x768.bin, edid/1280x1024.bin,
-   edid/1680x1050.bin, or edid/1920x1080.bin is given
-   and no file with the same name exists. Details and
-   instructions how to build your own EDID data are
-   available in Documentation/EDID/HOWTO.txt. An EDID
-   data set will only be used for a particular connector,
-   if its name and a colon are prepended to the EDID
-   name. Each connector may use a unique EDID data
-   set by separating the files with a comma.  An EDID
-   data set with no connector name will be used for
-   any connectors not explicitly specified.
-
dscc4.setup=[NET]
 
dt_cpu_ftrs=[PPC]
diff --git a/Documentation/admin-guide/module-parameters/drm.rst 
b/Documentation/admin-guide/module-parameters/drm.rst
new file mode 100644
index ..8b03a54f844c
--- /dev/null
+++ b/Documentation/admin-guide/module-parameters/drm.rst
@@ -0,0 +1,26 @@
+.. Autogenerated from drm.ko using modinfo(8) and modinfo-to-rst
+
+drm
+===
+
+debug (int)
+  Enable debug output, where each bit enables a debug category.
+   Bit 0 (0x01) will enable CORE messages (drm core code)
+   Bit 1 (0x02) will enable DRIVER messages (drm controller code)
+   Bit 2 (0x04) will enable KMS messages (modesetting code)
+   Bit 3 (0x08) will enable PRIME messages (prime code)
+   Bit 4 (0x10) will enable ATOMIC messages (atomic code)
+   Bit 5 (0x20) will enable VBL messages (vblank code)
+
+edid_fixup (int)
+  Minimum number of valid EDID header bytes (0-8, default 6)
+
+timestamp_monotonic (int)
+  Use monotonic timestamps
+
+timestamp_precision_usec (int)
+  Max. error on timestamps [usecs]
+
+vblankoffdelay (int)
+  Delay until vblank irq auto-disable [msecs] (0: never disable, <0: disable 
immediately)
+
diff --git a/Documentation/admin-guide/module-parameters/drm_kms_helper.rst 
b/Documentation/admin-guide/module-parameters/drm_kms_helper.rst
new file mode 100644
index ..8940951d75da
--- /dev/null
+++ b/Documentation/admin-guide/module-parameters/drm_kms_helper.rst
@@ -0,0 +1,23 @@
+.. Autogenerated from drm_kms_helper.ko using modinfo(8) and modinfo-to-rst
+
+drm_kms_helper
+==
+
+dp_aux_i2c_speed_khz (int)
+  Assumed speed of the i2c bus in kHz, (1-400, default 10)
+
+dp_aux_i2c_transfer_size (int)
+  Number of bytes to transfer in a single I2C over DP AUX CH message, (1-16, 
default 16)
+
+drm_fbdev_overalloc (int)
+  Overallocation of the fbdev buffer (%) [default=100]
+
+edid_firmware (string)
+  Do not probe monitor, use specified EDID blob from built-in data or 
/lib/firmware instead. 
+
+fbdev_emulation (bool)
+  Enable legacy fbdev emulation [default=true]
+
+poll (bool)
+  
+
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/4] Documentation: generated module param documentation

2017-07-19 Thread Jani Nikula
Hi Jon, all, here are some quick'n'dirty patches to semi-automatically
generate module param documentation from the source, via module build
and modinfo(8). No polish or proper design, just a hacked up
proof-of-concept to think about.

Do we want nice documentation for module parameters, somewhere that
search engines can find them? And do we want to reuse the documentation
already in place for parameters?


BR,
Jani.

Jani Nikula (4):
  scripts: add modinfo-to-rst script to extract module parameters
  Documentation: start module parameter documentation autogenerated from
modinfo
  Documentation: add drm and drm_kms_helper module param documentation
  scripts: add modparams-docs-update script to update all module
parameter docs

 Documentation/admin-guide/kernel-parameters.rst|   8 ++
 Documentation/admin-guide/kernel-parameters.txt|  32 --
 Documentation/admin-guide/module-parameters/README |  14 +++
 .../admin-guide/module-parameters/drm.rst  |  26 +
 .../module-parameters/drm_kms_helper.rst   |  23 
 .../admin-guide/module-parameters/i915.rst | 125 +
 scripts/modinfo-to-rst |  24 
 scripts/modparams-docs-update  |  20 
 8 files changed, 240 insertions(+), 32 deletions(-)
 create mode 100644 Documentation/admin-guide/module-parameters/README
 create mode 100644 Documentation/admin-guide/module-parameters/drm.rst
 create mode 100644 
Documentation/admin-guide/module-parameters/drm_kms_helper.rst
 create mode 100644 Documentation/admin-guide/module-parameters/i915.rst
 create mode 100755 scripts/modinfo-to-rst
 create mode 100755 scripts/modparams-docs-update

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/4] Documentation: start module parameter documentation autogenerated from modinfo

2017-07-19 Thread Jani Nikula
Module parameters are supposed to be documented using MODULE_PARM_DESC
in source. We should be able to use this information to autogenerate
module parameter documentation instead of duplicating that information
manually in kernel-parameters.rst.

Parsing that information directly from the source is hard. It's not
obvious what the module name is when you encounter MODULE_PARM_DESC. The
information may be entered in source using several layers of other
macros. Oh, and you'd need another C parser.

An alternative is to use modinfo(8) on the compiled kernel modules. The
trouble is, you have to actually build the module to do so, and that is
obviously too slow for the documentation build.

Here's an approach to commit the generated information in the repo. It's
not ideal to commit generated data, but at least the information can be
automatically updated as needed.

All the generated files are included as a toctree in
kernel-parameters.rst.

Signed-off-by: Jani Nikula 
---
 Documentation/admin-guide/kernel-parameters.rst|   8 ++
 Documentation/admin-guide/kernel-parameters.txt|  14 ---
 Documentation/admin-guide/module-parameters/README |  10 ++
 .../admin-guide/module-parameters/i915.rst | 125 +
 4 files changed, 143 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/admin-guide/module-parameters/README
 create mode 100644 Documentation/admin-guide/module-parameters/i915.rst

diff --git a/Documentation/admin-guide/kernel-parameters.rst 
b/Documentation/admin-guide/kernel-parameters.rst
index d76ab3907e2b..eed98e70b700 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -204,6 +204,14 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted:
 .. include:: kernel-parameters.txt
:literal:
 
+Module parameters
+-
+
+.. toctree::
+   :glob:
+
+   module-parameters/*
+
 Todo
 
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..4e731b44db95 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1382,20 +1382,6 @@
i8k.restricted  [HW] Allow controlling fans only if SYS_ADMIN
capability is set.
 
-   i915.invert_brightness=
-   [DRM] Invert the sense of the variable that is used to
-   set the brightness of the panel backlight. Normally a
-   brightness value of 0 indicates backlight switched off,
-   and the maximum of the brightness value sets the 
backlight
-   to maximum brightness. If this parameter is set to 0
-   (default) and the machine requires it, or this parameter
-   is set to 1, a brightness value of 0 sets the backlight
-   to maximum brightness, and the maximum of the brightness
-   value switches the backlight off.
-   -1 -- never invert brightness
-0 -- machine default
-1 -- force brightness inversion
-
icn=[HW,ISDN]
Format: [,[,[,]]]
 
diff --git a/Documentation/admin-guide/module-parameters/README 
b/Documentation/admin-guide/module-parameters/README
new file mode 100644
index ..38e34f83bdc7
--- /dev/null
+++ b/Documentation/admin-guide/module-parameters/README
@@ -0,0 +1,10 @@
+The files in this directory are automatically generated. Please don't update
+them manually.
+
+To add a new file, run:
+
+   $ scripts/modinfo-to-rst path/to/module.ko > \
+ Documentation/admin-guide/module-parameters/module.rst
+
+Make sure you do this with a recently built module.ko! Review and commit the
+changes afterwards.
diff --git a/Documentation/admin-guide/module-parameters/i915.rst 
b/Documentation/admin-guide/module-parameters/i915.rst
new file mode 100644
index ..7dd4a92fe7b0
--- /dev/null
+++ b/Documentation/admin-guide/module-parameters/i915.rst
@@ -0,0 +1,125 @@
+.. Autogenerated from i915.ko using modinfo(8) and modinfo-to-rst
+
+i915
+
+
+alpha_support (bool)
+  Enable alpha quality driver support for latest hardware. See also 
CONFIG_DRM_I915_ALPHA_SUPPORT.
+
+disable_display (bool)
+  Disable display (default: false)
+
+disable_power_well (int)
+  Disable display power wells when possible (-1=auto [default], 0=power wells 
always on, 1=power wells disabled when possible)
+
+edp_vswing (int)
+  Ignore/Override vswing pre-emph table selection from VBT (0=use value from 
vbt [default], 1=low power swing(200mV),2=default swing(400mV))
+
+enable_cmd_parser (bool)
+  Enable command parsing (true=enabled [default], false=disabled)
+
+enable_dbc (bool)
+  Enable support for dynamic backlight control (default:true)
+
+enable_dc (int)
+  Enable power-savin

[RFC PATCH 1/4] scripts: add modinfo-to-rst script to extract module parameters

2017-07-19 Thread Jani Nikula
Add a simple script to extract module param info from .ko using
modinfo(8), and convert the results to rst. There's no filtering of rst
special characters or anything.

Signed-off-by: Jani Nikula 
---
 scripts/modinfo-to-rst | 24 
 1 file changed, 24 insertions(+)
 create mode 100755 scripts/modinfo-to-rst

diff --git a/scripts/modinfo-to-rst b/scripts/modinfo-to-rst
new file mode 100755
index ..e415c68db72e
--- /dev/null
+++ b/scripts/modinfo-to-rst
@@ -0,0 +1,24 @@
+#!/bin/bash
+# usage: modinfo-to-rst /path/to/module.ko
+
+SELF=modinfo-to-rst
+MODINFO=/sbin/modinfo
+
+if [[ "$#" != "1" ]]; then
+   echo "$SELF: usage: $SELF (modulename|filename)" >&2
+   exit 1
+fi
+
+MODULE="$1"
+MODULENAME="$(basename -s .ko $MODULE)"
+
+cat 

[RFC PATCH 4/4] scripts: add modparams-docs-update script to update all module parameter docs

2017-07-19 Thread Jani Nikula
A simple helper to update module parameter docs. Probably too simple.

Signed-off-by: Jani Nikula 
---
 Documentation/admin-guide/module-parameters/README |  4 
 scripts/modparams-docs-update  | 20 
 2 files changed, 24 insertions(+)
 create mode 100755 scripts/modparams-docs-update

diff --git a/Documentation/admin-guide/module-parameters/README 
b/Documentation/admin-guide/module-parameters/README
index 38e34f83bdc7..99618f11300a 100644
--- a/Documentation/admin-guide/module-parameters/README
+++ b/Documentation/admin-guide/module-parameters/README
@@ -8,3 +8,7 @@ To add a new file, run:
 
 Make sure you do this with a recently built module.ko! Review and commit the
 changes afterwards.
+
+To update all .rst files for which a .ko exists, run:
+
+   $ scripts/modparams-docs-update
diff --git a/scripts/modparams-docs-update b/scripts/modparams-docs-update
new file mode 100755
index ..67a8437ab780
--- /dev/null
+++ b/scripts/modparams-docs-update
@@ -0,0 +1,20 @@
+#!/bin/bash
+
+MODULE_PARAM_DOCS=Documentation/admin-guide/module-parameters
+
+# FIXME: handle objtree
+OBJTREE=.
+
+for rst in $(find $MODULE_PARAM_DOCS -type f -name "*.rst"); do
+   MODULENAME=$(basename -s .rst $rst)
+
+   dupes=0
+   for ko in $(find $OBJTREE -type f -name "${MODULENAME}.ko"); do
+   if [ "$dupes" = "0" ]; then
+   scripts/modinfo-to-rst $ko > $rst
+   dupes=1
+   else
+   echo "$0: $rst has ambiguous source" >&2
+   fi
+   done
+done
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Thomas Gleixner
On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:

> 2017-07-19 14:25 GMT+02:00 Thomas Gleixner :
> > On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:
> >
> >> Some frameworks (e.g. iio, gpiolib) use irq_work to implement simulated
> >> interrupts that can be 'fired' from process context when needed and
> >> requested just like normal interrupts. This is useful for testing and
> >> development purposes.
> >>
> >> Currently this code is reimplemented by every user. This series
> >> proposes to add a new set of functions that can be used by drivers
> >> that want to simulate interrupts without having to duplicate any
> >> boilerplate code.
> >>
> >> The first patch adds a simple irq simulator framework. The second
> >> extends it with resource management. The third uses the new
> >> functionality in the gpio-mockup testing driver.
> >>
> >> NOTE: The next candidate for using this API would be iio-dummy-evgen.
> >
> > I like the general idea - have not looked at the code yet. Just a quick
> > question: How many copies/variants of this scheme do we have in tree?
> >
> > Thanks,
> >
> > tglx
> 
> Currently there are two: iio and gpiolib basically duplicate the same
> code in their respective testing drivers. I only used irq_sim in
> gpio-mockup in this series as an example and to see if there's any
> interest in merging it before spending time on iio-dummy-evgen.

Yes, I think so. Consolidation is always a good thing and simulation is
useful for developing or validating code.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-19 Thread Jan Kara
On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> To be able to use the common 4k zero page in DAX we need to have our PTE
> fault path look more like our PMD fault path where a PTE entry can be
> marked as dirty and writeable as it is first inserted, rather than waiting
> for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> 
> Right now we can rely on having a dax_pfn_mkwrite() call because we can
> distinguish between these two cases in do_wp_page():
> 
>   case 1: 4k zero page => writable DAX storage
>   case 2: read-only DAX storage => writeable DAX storage
> 
> This distinction is made by via vm_normal_page().  vm_normal_page() returns
> false for the common 4k zero page, though, just as it does for DAX ptes.
> Instead of special casing the DAX + 4k zero page case, we will simplify our
> DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> get rid of dax_pfn_mkwrite() completely.
> 
> This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> will do the work that was previously done by wp_page_reuse() as part of the
> dax_pfn_mkwrite() call path.
> 
> Signed-off-by: Ross Zwisler 

Just one small comment below.

> @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (!pte)
>   goto out;
>   retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (mkwrite) {
> + entry = *pte;
> + goto out_mkwrite;

Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
anything we don't expect and if I understand the code right, we need to
invalidate all zero page mappings at given file offset (via
unmap_mapping_range()) before mapping an allocated block there and thus the
case of filling the hole won't be affected by this?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Marc Zyngier
On 19/07/17 14:58, Thomas Gleixner wrote:
> On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:
> 
>> 2017-07-19 14:25 GMT+02:00 Thomas Gleixner :
>>> On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:
>>>
 Some frameworks (e.g. iio, gpiolib) use irq_work to implement simulated
 interrupts that can be 'fired' from process context when needed and
 requested just like normal interrupts. This is useful for testing and
 development purposes.

 Currently this code is reimplemented by every user. This series
 proposes to add a new set of functions that can be used by drivers
 that want to simulate interrupts without having to duplicate any
 boilerplate code.

 The first patch adds a simple irq simulator framework. The second
 extends it with resource management. The third uses the new
 functionality in the gpio-mockup testing driver.

 NOTE: The next candidate for using this API would be iio-dummy-evgen.
>>>
>>> I like the general idea - have not looked at the code yet. Just a quick
>>> question: How many copies/variants of this scheme do we have in tree?
>>>
>>> Thanks,
>>>
>>> tglx
>>
>> Currently there are two: iio and gpiolib basically duplicate the same
>> code in their respective testing drivers. I only used irq_sim in
>> gpio-mockup in this series as an example and to see if there's any
>> interest in merging it before spending time on iio-dummy-evgen.
> 
> Yes, I think so. Consolidation is always a good thing and simulation is
> useful for developing or validating code.

Indeed, that's pretty interesting.

On a slightly tangential subject, there is another aspect that I thought
of implementing for a while, but always ended up just relying on a quick
hack: forcing the injection of an actual interrupt. A number of
interrupt controllers have the ability to make an interrupt pending, for
it to be handled as if a device had actually triggered it.

In my case, it has proved to be incredibly useful when debugging the
interrupt controller itself, and also things that mess with interrupts
in a creative way (like KVM) while relying on a particular interrupt
controller.

What I had in mind was something like:

echo 1 >/proc/irq/9/trigger (or the corresponding
/sys/kernel/debug/irq/irqs/ interface if we want to make sure that this
is really not for production use...).

If there is any interest, I'll try to whip something up.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs/dcache: Limit numbers of negative dentries

2017-07-19 Thread Miklos Szeredi
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long  wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however,
> has no limit other than the total amount of memory available in
> the system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in the
> system impacting performance on other running applications.
>
> To prevent this from happening, the dcache code is now updated to limit
> the amount of the negative dentries in the LRU lists that can be kept
> as a percentage of total available system memory. The default is 5%
> and can be changed by specifying the "neg_dentry_pc=" kernel command
> line option.

AFAICS the implementation is counter to the concept of LRU since it
will get rid of the most recently used negative dentry after passing
the limit.  Which in itself is a source of DoS (keep rouge negative
dentries at just about the limit, so normal application are prevented
from getting their negatives cached).

Thanks,
Miklos

>
> Signed-off-by: Waiman Long 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   7 ++
>  fs/dcache.c | 154 
> +++-
>  include/linux/dcache.h  |   1 +
>  3 files changed, 161 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index f701430..fc3c937 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2372,6 +2372,13 @@
>
> n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
>
> +   neg_dentry_pc=  [KNL]
> +   Range: 1-50
> +   Default: 5
> +   This parameter specifies the amount of negative
> +   dentries allowed in the system as a percentage of
> +   total system memory.
> +
> netdev= [NET] Network devices parameters
> Format: 
> Note that mem_start is often overloaded to mean
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f901413..6a0a844 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
> .age_limit = 45,
>  };
>
> +/*
> + * Macros and variables to manage and count negative dentries.
> + */
> +#define NEG_DENTRY_BATCH   (1 << 8)
> +static long neg_dentry_percpu_limit __read_mostly;
> +static struct {
> +   raw_spinlock_t nfree_lock;
> +   long nfree; /* Negative dentry free pool */
> +} ndblk cacheline_aligned_in_smp;
> +
>  static DEFINE_PER_CPU(long, nr_dentry);
>  static DEFINE_PER_CPU(long, nr_dentry_unused);
> +static DEFINE_PER_CPU(long, nr_dentry_neg);
>
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>
> @@ -227,6 +238,86 @@ static inline int dentry_string_cmp(const unsigned char 
> *cs, const unsigned char
>
>  #endif
>
> +/*
> + * There is a system-wide limit to the amount of negative dentries allowed
> + * in the super blocks' LRU lists. The default limit is 5% of the total
> + * system memory. This limit can be changed by using the kernel command line
> + * option "neg_dentry_pc=" to specify the percentage of the total memory
> + * that can be used for negative dentries. That percentage must be in the
> + * 1-50% range.
> + *
> + * To avoid performance problem with a global counter on an SMP system,
> + * the tracking is done mostly on a per-cpu basis. The total limit is
> + * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
> + *
> + * If a per-cpu counter runs out of negative dentries, it can borrow extra
> + * ones from the global free pool. If it has more than its percpu limit,
> + * the extra ones will be returned back to the global pool.
> + */
> +
> +/*
> + * Decrement negative dentry count if applicable.
> + */
> +static void __neg_dentry_dec(struct dentry *dentry)
> +{
> +   if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
> +   long *pcnt = get_cpu_ptr(&nr_dentry_neg);
> +
> +   if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
> +   ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
> +   *pcnt += NEG_DENTRY_BATCH;
> +   raw_spin_unlock(&ndblk.nfree_lock);
> +   }
> +   put_cpu_ptr(&nr_dentry_neg);
> +   }
> +}
> +
> +static inline void neg_dentry_dec(struct dentry *dentry)
> +{
> +   if (unlikely(d_is_negative(dentry)))
> +   __neg_dentry_dec(dentry);
> +}
> +
> +/*
> + * Increment negative dentry count if applicable.
> + */
> +static void __neg_dentry_inc(struct dentry *dentry)
> +{
> +   long cnt, *pcnt;
> +
> +   if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
> + 

Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Andy Shevchenko
On Wed, Jul 19, 2017 at 5:19 PM, Marc Zyngier  wrote:
> On 19/07/17 14:58, Thomas Gleixner wrote:
>> On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:

> echo 1 >/proc/irq/9/trigger
> (or the corresponding /sys/kernel/debug/irq/irqs/ interface if we want to 
> make sure that this
> is really not for production use...).

or /sys/kernel/irq as a successor of /proc/irq


-- 
With Best Regards,
Andy Shevchenko
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM

2017-07-19 Thread Jintack Lim
On Wed, Jul 19, 2017 at 4:49 AM, Christoffer Dall  wrote:
> Hi Jintack,
>
> On Tue, Jul 18, 2017 at 10:23:05PM -0400, Jintack Lim wrote:
>> On Tue, Jul 18, 2017 at 12:58 PM, Jintack Lim  wrote:
>> > Nested virtualization is the ability to run a virtual machine inside 
>> > another
>> > virtual machine. In other words, it’s about running a hypervisor (the guest
>> > hypervisor) on top of another hypervisor (the host hypervisor).
>> >
>> > Supporting nested virtualization on ARM means that the hypervisor provides 
>> > not
>> > only EL0/EL1 execution environment to VMs as it usually does but also the
>> > virtualization extensions including EL2 execution environment. Once the 
>> > host
>> > hypervisor provides those execution environments to the VMs, then the guest
>> > hypervisor can run its own VMs (nested VMs) naturally.
>> >
>> > This series supports nested virtualization on arm64. ARM recently 
>> > announced an
>> > extension (ARMv8.3) which has support for nested virtualization[1]. This 
>> > patch
>> > set is based on the ARMv8.3 specification and tested on the FastModel with
>> > ARMv8.3 extension.
>> >
>> > The whole patch set to support nested virtualization is huge over 70
>> > patches, so I categorized them into four parts: CPU, memory, VGIC, and 
>> > timer
>> > virtualization. This patch series is the first part.
>> >
>> > CPU virtualization patch series provides basic nested virtualization 
>> > framework
>> > and instruction emulations including v8.1 VHE feature and v8.3 nested
>> > virtualization feature for VMs.
>> >
>> > This patch series again can be divided into four parts. Patch 1 to 5 
>> > introduces
>> > nested virtualization by discovering hardware feature, adding a kernel
>> > parameter and allowing the userspace to set the initial CPU mode to EL2.
>> >
>> > Patch 6 to 25 are to support the EL2 execution environment, the virtual 
>> > EL2, to
>> > a VM on v8.0 architecture. We de-privilege the guest hypervisor and 
>> > emulate the
>> > virtual EL2 mode in EL1 using the hardware features provided by ARMv8.3; 
>> > The
>> > host hypervisor manages virtual EL2 register state for the guest hypervisor
>> > and shadow EL1 register state that reflects the virtual EL2 register state 
>> > to
>> > run the guest hypervisor in EL1.
>> >
>> > Patch 26 to 33 add support for the virtual EL2 with Virtualization Host
>> > Extensions. These patches emulate newly defined registers and bits in v8.1 
>> > and
>> > allow the virtual EL2 to access EL2 register states via EL1 register 
>> > accesses
>> > as in the real EL2.
>> >
>> > Patch 34 to 38 are to support for the virtual EL2 with nested 
>> > virtualization.
>> > These enable recursive nested virtualization.
>> >
>> > This patch set is tested on the FastModel with the v8.3 extension for 
>> > arm64 and
>> > a cubietruck for arm32. On the FastModel, the host and the guest kernels 
>> > are
>> > compiled with and without VHE, so there are four combinations. I was able 
>> > to
>> > boot SMP Linux in the nested VM on all four configurations and able to run
>> > hackbench. I also checked that regular VMs could boot when the nested
>> > virtualization kernel parameter was not set. On the cubietruck, I also 
>> > verified
>> > that regular VMs could boot as well.
>> >
>> > I'll share my experiment setup shortly.
>>
>> I summarized my experiment setup here.
>>
>> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup
>>
>
> Thanks for sharing this.
>
>> >
>> > Even though this work has some limitations and TODOs, I'd appreciate early
>> > feedback on this RFC. Specifically, I'm interested in:
>> >
>> > - Overall design to manage vcpu context for the virtual EL2
>> > - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2
>> >   (Patch 30 and 32)
>> > - Patch organization and coding style
>>
>> I also wonder if the hardware and/or KVM do not support nested
>> virtualization but the userspace uses nested virtualization option,
>> which one is better: giving an error or launching a regular VM
>> silently.
>>
>
> I think KVM should complain to userspace if userspace tries to set a
> feature it does not support, and I think userspace should give as
> meaningful an error message as possible to the user when that happens.
>

Ok, thanks. I'll work this out.

> Thanks,
> -Christoffer
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Thomas Gleixner
On Wed, 19 Jul 2017, Andy Shevchenko wrote:

> On Wed, Jul 19, 2017 at 5:19 PM, Marc Zyngier  wrote:
> > On 19/07/17 14:58, Thomas Gleixner wrote:
> >> On Wed, 19 Jul 2017, Bartosz Golaszewski wrote:
> 
> > echo 1 >/proc/irq/9/trigger
> > (or the corresponding /sys/kernel/debug/irq/irqs/ interface if we want to 
> > make sure that this
> > is really not for production use...).
> 
> or /sys/kernel/irq as a successor of /proc/irq

/sys/kernel/debug/irq/irqs/ is already there and it's DEBUG and not
something which is in the regular sysfs maze.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs/dcache: Limit numbers of negative dentries

2017-07-19 Thread Waiman Long
On 07/19/2017 10:39 AM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long  wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however,
>> has no limit other than the total amount of memory available in
>> the system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in the
>> system impacting performance on other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to limit
>> the amount of the negative dentries in the LRU lists that can be kept
>> as a percentage of total available system memory. The default is 5%
>> and can be changed by specifying the "neg_dentry_pc=" kernel command
>> line option.
> AFAICS the implementation is counter to the concept of LRU since it
> will get rid of the most recently used negative dentry after passing
> the limit.  Which in itself is a source of DoS (keep rouge negative
> dentries at just about the limit, so normal application are prevented
> from getting their negatives cached).
>
> Thanks,
> Miklos

Yes, you are right. That is exactly the problem with patch 1 alone. That
is why I have patches 3 & 4 to enable automatic trimming to decrease the
number of negative dentries before the limit is reached assuming the
rate of increase of negative dentries isn't faster that the reduction
rate of the automatic trimming process.

Cheers,
Longman


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 101/102] Documentation: devres: add explicit exclusive/shared reset control request calls

2017-07-19 Thread Philipp Zabel
Commit a53e35db70d1 ("reset: Ensure drivers are explicit when requesting
reset lines") started to transition the reset control request API calls
to explicitly state whether the driver needs exclusive or shared reset
control behavior. Add the explicit API calls to the devres list.

Cc: Jonathan Corbet 
Cc: Lee Jones 
Cc: linux-doc@vger.kernel.org
Signed-off-by: Philipp Zabel 
---
 Documentation/driver-model/devres.txt | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/driver-model/devres.txt 
b/Documentation/driver-model/devres.txt
index 30e04f7a690dd..fd157078dd558 100644
--- a/Documentation/driver-model/devres.txt
+++ b/Documentation/driver-model/devres.txt
@@ -379,7 +379,12 @@ REGULATOR
   devm_regulator_register()
 
 RESET
-  devm_reset_control_get()
+  devm_reset_control_get_exclusive()
+  devm_reset_control_get_shared()
+  devm_reset_control_get_optional_exclusive()
+  devm_reset_control_get_optional_shared()
+  devm_reset_control_get_exclusive_by_index()
+  devm_reset_control_get_shared_by_index()
   devm_reset_controller_register()
 
 SLAVE DMA ENGINE
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 3/5] dax: use common 4k zero page for dax mmap reads

2017-07-19 Thread Jan Kara
On Wed 28-06-17 16:01:50, Ross Zwisler wrote:
> Another major change is that we remove dax_pfn_mkwrite() from our fault
> flow, and instead rely on the page fault itself to make the PTE dirty and
> writeable.  The following description from the patch adding the
> vm_insert_mixed_mkwrite() call explains this a little more:
> 
> ***
>   To be able to use the common 4k zero page in DAX we need to have our PTE
>   fault path look more like our PMD fault path where a PTE entry can be
>   marked as dirty and writeable as it is first inserted, rather than
>   waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> 
>   Right now we can rely on having a dax_pfn_mkwrite() call because we can
>   distinguish between these two cases in do_wp_page():
> 
>   case 1: 4k zero page => writable DAX storage
>   case 2: read-only DAX storage => writeable DAX storage
> 
>   This distinction is made by via vm_normal_page().  vm_normal_page()
>   returns false for the common 4k zero page, though, just as it does for
>   DAX ptes.  Instead of special casing the DAX + 4k zero page case, we will
>   simplify our DAX PTE page fault sequence so that it matches our DAX PMD
>   sequence, and get rid of dax_pfn_mkwrite() completely.
> 
>   This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
>   and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set
>   insert_pfn() will do the work that was previously done by wp_page_reuse()
>   as part of the dax_pfn_mkwrite() call path.
> ***

Hum, thinking about this in context of this patch... So what if we have
allocated storage, a process faults it read-only, we map it to page tables
writeprotected. Then the process writes through mmap to the area - the code
in handle_pte_fault() ends up in do_wp_page() if I'm reading it right.
Then, since we are missing ->pfn_mkwrite() handlers, the PTE will be marked
writeable but radix tree entry stays clean - bug. Am I missing something?

Honza

-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] simulated interrupts

2017-07-19 Thread Andy Shevchenko
On Wed, Jul 19, 2017 at 5:53 PM, Thomas Gleixner  wrote:
> On Wed, 19 Jul 2017, Andy Shevchenko wrote:
>> On Wed, Jul 19, 2017 at 5:19 PM, Marc Zyngier  wrote:

>> > echo 1 >/proc/irq/9/trigger
>> > (or the corresponding /sys/kernel/debug/irq/irqs/ interface if we want to 
>> > make sure that this
>> > is really not for production use...).
>>
>> or /sys/kernel/irq as a successor of /proc/irq
>
> /sys/kernel/debug/irq/irqs/ is already there and it's DEBUG and not
> something which is in the regular sysfs maze.

Yep. My point was "not to extend /proc/irq interface". From the
description of the series it indeed looks suitable rather for debug.

-- 
With Best Regards,
Andy Shevchenko
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 3/5] dax: use common 4k zero page for dax mmap reads

2017-07-19 Thread Ross Zwisler
On Wed, Jul 19, 2017 at 05:33:14PM +0200, Jan Kara wrote:
> On Wed 28-06-17 16:01:50, Ross Zwisler wrote:
> > Another major change is that we remove dax_pfn_mkwrite() from our fault
> > flow, and instead rely on the page fault itself to make the PTE dirty and
> > writeable.  The following description from the patch adding the
> > vm_insert_mixed_mkwrite() call explains this a little more:
> > 
> > ***
> >   To be able to use the common 4k zero page in DAX we need to have our PTE
> >   fault path look more like our PMD fault path where a PTE entry can be
> >   marked as dirty and writeable as it is first inserted, rather than
> >   waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > 
> >   Right now we can rely on having a dax_pfn_mkwrite() call because we can
> >   distinguish between these two cases in do_wp_page():
> > 
> > case 1: 4k zero page => writable DAX storage
> > case 2: read-only DAX storage => writeable DAX storage
> > 
> >   This distinction is made by via vm_normal_page().  vm_normal_page()
> >   returns false for the common 4k zero page, though, just as it does for
> >   DAX ptes.  Instead of special casing the DAX + 4k zero page case, we will
> >   simplify our DAX PTE page fault sequence so that it matches our DAX PMD
> >   sequence, and get rid of dax_pfn_mkwrite() completely.
> > 
> >   This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> >   and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set
> >   insert_pfn() will do the work that was previously done by wp_page_reuse()
> >   as part of the dax_pfn_mkwrite() call path.
> > ***
> 
> Hum, thinking about this in context of this patch... So what if we have
> allocated storage, a process faults it read-only, we map it to page tables
> writeprotected. Then the process writes through mmap to the area - the code
> in handle_pte_fault() ends up in do_wp_page() if I'm reading it right.

Yep.

> Then, since we are missing ->pfn_mkwrite() handlers, the PTE will be marked
> writeable but radix tree entry stays clean - bug. Am I missing something?

I don't think we ever end up with a writeable PTE but with a clean radix tree
entry.  When we get the write fault we do a full fault through
dax_iomap_pte_fault() and dax_insert_mapping().

dax_insert_mapping() sets up the dirty radix tree entry via
dax_insert_mapping_entry() before it does anything with the page tables via
vm_insert_mixed_mkwrite().

So, this mkwrite fault path is exactly the path we would have taken if the
initial read to real storage hadn't happened, and we end up in the same end
state - with a dirty DAX radix tree entry and a writeable PTE.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Paul E. McKenney
On Mon, Jul 17, 2017 at 05:24:42PM +0900, Akira Yokosawa wrote:
> >From b798b9b631e237d285aa8699da00bfb8ced33bea Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa 
> Date: Mon, 17 Jul 2017 16:25:33 +0900
> Subject: [PATCH] documentation: Fix two-CPU control-dependency example
> 
> In commit 5646f7acc95f ("memory-barriers: Fix control-ordering
> no-transitivity example"), the operator in "if" statement of
> the two-CPU example was modified from ">=" to ">".
> Now the example misses the point because there is no party
> who will modify "x" nor "y". So each CPU performs only the
> READ_ONCE().
> 
> The point of this example is to use control dependency for ordering,
> and the WRITE_ONCE() should always be executed.
> 
> So it was correct prior to the above mentioned commit. Partial
> revert of the commit (with context adjustments regarding other
> changes thereafter) restores the point.
> 
> Note that the three-CPU example demonstrating the lack of
> transitivity stands regardless of this partial revert.
> 
> Signed-off-by: Akira Yokosawa 

Hello, Akira,

You are quite right that many compilers would generate straightforward
code for the code fragment below, and in that case, the assertion could
never trigger due to either TSO or control dependencies, depending on
the architecture this was running on.

However, if the compiler was too smart for our good, it could figure
out that "x" and "y" only take on the values zero and one, so that
the "if" would always be taken.  At that point, the compiler could
simply ignore the "if" with the result shown below.

> ---
>  Documentation/memory-barriers.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index c4ddfcd..c1ebe99 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -851,7 +851,7 @@ demonstrated by two related examples, with the initial 
> values of
>   CPU 0 CPU 1
>   ===   ===
>   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> - if (r1 > 0)   if (r2 > 0)
> + if (r1 >= 0)  if (r2 >= 0)
> WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> 
>   assert(!(r1 == 1 && r2 == 1));

Original program:

CPU 0 CPU 1
===   ===
r1 = READ_ONCE(x);r2 = READ_ONCE(y);
if (r1 >= 0)  if (r2 >= 0)
  WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);

assert(!(r1 == 1 && r2 == 1));

Enthusiastically optimized program:

CPU 0 CPU 1
===   ===
r1 = READ_ONCE(x);r2 = READ_ONCE(y);
WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);

assert(!(r1 == 1 && r2 == 1));

This optimized version could trigger the assertion.

So we do need to stick with the ">" comparison.

That said, I very much welcome critical reviews of memory-barriers.txt,
so please do feel free to continue doing that!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-19 Thread Ross Zwisler
On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > To be able to use the common 4k zero page in DAX we need to have our PTE
> > fault path look more like our PMD fault path where a PTE entry can be
> > marked as dirty and writeable as it is first inserted, rather than waiting
> > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > 
> > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > distinguish between these two cases in do_wp_page():
> > 
> > case 1: 4k zero page => writable DAX storage
> > case 2: read-only DAX storage => writeable DAX storage
> > 
> > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > false for the common 4k zero page, though, just as it does for DAX ptes.
> > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > get rid of dax_pfn_mkwrite() completely.
> > 
> > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > will do the work that was previously done by wp_page_reuse() as part of the
> > dax_pfn_mkwrite() call path.
> > 
> > Signed-off-by: Ross Zwisler 
> 
> Just one small comment below.
> 
> > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, 
> > unsigned long addr,
> > if (!pte)
> > goto out;
> > retval = -EBUSY;
> > -   if (!pte_none(*pte))
> > -   goto out_unlock;
> > +   if (!pte_none(*pte)) {
> > +   if (mkwrite) {
> > +   entry = *pte;
> > +   goto out_mkwrite;
> 
> Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> anything we don't expect 

Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
failure.  If the pfns don't match I think we're insane (and would have been
insane prior to this patch series as well) because we are getting a page fault
and somehow have a different PFN already mapped at that location.

> and if I understand the code right, we need to
> invalidate all zero page mappings at given file offset (via
> unmap_mapping_range()) before mapping an allocated block there and thus the
> case of filling the hole won't be affected by this?

Correct.  Here's the call tree if we already have a zero page mapped and are
now faulting in an allocated block:

dax_iomap_pte_fault()
  dax_insert_mapping()
dax_insert_mapping_entry()
  unmap_mapping_range() for our zero page
vm_insert_mixed_mkwrite() installs the new PTE. We have pte_none(), so we
skip the new mkwrite goto in insert_pfn().
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs/dcache: Limit numbers of negative dentries

2017-07-19 Thread Miklos Szeredi
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long  wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however,
> has no limit other than the total amount of memory available in
> the system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in the
> system impacting performance on other running applications.
>
> To prevent this from happening, the dcache code is now updated to limit
> the amount of the negative dentries in the LRU lists that can be kept
> as a percentage of total available system memory. The default is 5%
> and can be changed by specifying the "neg_dentry_pc=" kernel command
> line option.
>
> Signed-off-by: Waiman Long 
> ---

[...]

> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>
> if (!IS_ROOT(dentry)) {
> parent = dentry->d_parent;
> -   if (unlikely(!spin_trylock(&parent->d_lock))) {
> +   /*
> +* Force the killing of this negative dentry when
> +* DCACHE_KILL_NEGATIVE flag is set.
> +*/
> +   if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
> +   spin_lock(&parent->d_lock);

This looks like d_lock ordering problem (should be parent first, child
second).  Why is this needed, anyway?

> +   } else if (unlikely(!spin_trylock(&parent->d_lock))) {
> if (inode)
> spin_unlock(&inode->i_lock);
> goto failed;

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs/dcache: Limit numbers of negative dentries

2017-07-19 Thread Waiman Long
On 07/19/2017 04:24 PM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long  wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however,
>> has no limit other than the total amount of memory available in
>> the system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in the
>> system impacting performance on other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to limit
>> the amount of the negative dentries in the LRU lists that can be kept
>> as a percentage of total available system memory. The default is 5%
>> and can be changed by specifying the "neg_dentry_pc=" kernel command
>> line option.
>>
>> Signed-off-by: Waiman Long 
>> ---
> [...]
>
>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>
>> if (!IS_ROOT(dentry)) {
>> parent = dentry->d_parent;
>> -   if (unlikely(!spin_trylock(&parent->d_lock))) {
>> +   /*
>> +* Force the killing of this negative dentry when
>> +* DCACHE_KILL_NEGATIVE flag is set.
>> +*/
>> +   if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>> +   spin_lock(&parent->d_lock);
> This looks like d_lock ordering problem (should be parent first, child
> second).  Why is this needed, anyway?
>

Yes, that is a bug. I should have used lock_parent() instead.

I have a test program that generate a lot of negative dentries
continuously. Using spin_trylock(), it failed most of the time when that
test program was running. So I need to actually acquire the parent's
d_lock to make sure that the offending negative dentry was really
killed. It was there to protect against the worst case situation. I will
update the patch to correct that.

Thanks for spotting this.

Cheers,
Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Akira Yokosawa
On 2017/07/20 2:43, Paul E. McKenney wrote:
> On Mon, Jul 17, 2017 at 05:24:42PM +0900, Akira Yokosawa wrote:
>> >From b798b9b631e237d285aa8699da00bfb8ced33bea Mon Sep 17 00:00:00 2001
>> From: Akira Yokosawa 
>> Date: Mon, 17 Jul 2017 16:25:33 +0900
>> Subject: [PATCH] documentation: Fix two-CPU control-dependency example
>>
>> In commit 5646f7acc95f ("memory-barriers: Fix control-ordering
>> no-transitivity example"), the operator in "if" statement of
>> the two-CPU example was modified from ">=" to ">".
>> Now the example misses the point because there is no party
>> who will modify "x" nor "y". So each CPU performs only the
>> READ_ONCE().
>>
>> The point of this example is to use control dependency for ordering,
>> and the WRITE_ONCE() should always be executed.
>>
>> So it was correct prior to the above mentioned commit. Partial
>> revert of the commit (with context adjustments regarding other
>> changes thereafter) restores the point.
>>
>> Note that the three-CPU example demonstrating the lack of
>> transitivity stands regardless of this partial revert.
>>
>> Signed-off-by: Akira Yokosawa 
> 
> Hello, Akira,
> 
> You are quite right that many compilers would generate straightforward
> code for the code fragment below, and in that case, the assertion could
> never trigger due to either TSO or control dependencies, depending on
> the architecture this was running on.
> 
> However, if the compiler was too smart for our good, it could figure
> out that "x" and "y" only take on the values zero and one, so that
> the "if" would always be taken.  At that point, the compiler could
> simply ignore the "if" with the result shown below.
> 
>> ---
>>  Documentation/memory-barriers.txt | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/Documentation/memory-barriers.txt 
>> b/Documentation/memory-barriers.txt
>> index c4ddfcd..c1ebe99 100644
>> --- a/Documentation/memory-barriers.txt
>> +++ b/Documentation/memory-barriers.txt
>> @@ -851,7 +851,7 @@ demonstrated by two related examples, with the initial 
>> values of
>>  CPU 0 CPU 1
>>  ===   ===
>>  r1 = READ_ONCE(x);r2 = READ_ONCE(y);
>> -if (r1 > 0)   if (r2 > 0)
>> +if (r1 >= 0)  if (r2 >= 0)
>>WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
>>
>>  assert(!(r1 == 1 && r2 == 1));
> 
> Original program:
> 
>   CPU 0 CPU 1
>   ===   ===
>   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
>   if (r1 >= 0)  if (r2 >= 0)
> WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> 
>   assert(!(r1 == 1 && r2 == 1));
> 
> Enthusiastically optimized program:
> 
>   CPU 0 CPU 1
>   ===   ===
>   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
>   WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> 
>   assert(!(r1 == 1 && r2 == 1));
> 
> This optimized version could trigger the assertion.
> 
> So we do need to stick with the ">" comparison.

Well but,

Current example:

CPU 0 CPU 1
===   ===
r1 = READ_ONCE(x);r2 = READ_ONCE(y);
if (r1 > 0)   if (r2 > 0)
  WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);

assert(!(r1 == 1 && r2 == 1));

Such a clever compiler might be able to prove that "x" and "y"
are never modified and end up in the following:

CPU 0 CPU 1
===   ===
r1 = READ_ONCE(x);r2 = READ_ONCE(y);

assert(!(r1 == 1 && r2 == 1));

This means it is impossible to describe this example in C,
doesn't it?

What am I missing here?

Thanks, Akira

> That said, I very much welcome critical reviews of memory-barriers.txt,
> so please do feel free to continue doing that!
> 
>   Thanx, Paul
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Paul E. McKenney
On Thu, Jul 20, 2017 at 06:33:26AM +0900, Akira Yokosawa wrote:
> On 2017/07/20 2:43, Paul E. McKenney wrote:
> > On Mon, Jul 17, 2017 at 05:24:42PM +0900, Akira Yokosawa wrote:
> >> >From b798b9b631e237d285aa8699da00bfb8ced33bea Mon Sep 17 00:00:00 2001
> >> From: Akira Yokosawa 
> >> Date: Mon, 17 Jul 2017 16:25:33 +0900
> >> Subject: [PATCH] documentation: Fix two-CPU control-dependency example
> >>
> >> In commit 5646f7acc95f ("memory-barriers: Fix control-ordering
> >> no-transitivity example"), the operator in "if" statement of
> >> the two-CPU example was modified from ">=" to ">".
> >> Now the example misses the point because there is no party
> >> who will modify "x" nor "y". So each CPU performs only the
> >> READ_ONCE().
> >>
> >> The point of this example is to use control dependency for ordering,
> >> and the WRITE_ONCE() should always be executed.
> >>
> >> So it was correct prior to the above mentioned commit. Partial
> >> revert of the commit (with context adjustments regarding other
> >> changes thereafter) restores the point.
> >>
> >> Note that the three-CPU example demonstrating the lack of
> >> transitivity stands regardless of this partial revert.
> >>
> >> Signed-off-by: Akira Yokosawa 
> > 
> > Hello, Akira,
> > 
> > You are quite right that many compilers would generate straightforward
> > code for the code fragment below, and in that case, the assertion could
> > never trigger due to either TSO or control dependencies, depending on
> > the architecture this was running on.
> > 
> > However, if the compiler was too smart for our good, it could figure
> > out that "x" and "y" only take on the values zero and one, so that
> > the "if" would always be taken.  At that point, the compiler could
> > simply ignore the "if" with the result shown below.
> > 
> >> ---
> >>  Documentation/memory-barriers.txt | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/memory-barriers.txt 
> >> b/Documentation/memory-barriers.txt
> >> index c4ddfcd..c1ebe99 100644
> >> --- a/Documentation/memory-barriers.txt
> >> +++ b/Documentation/memory-barriers.txt
> >> @@ -851,7 +851,7 @@ demonstrated by two related examples, with the initial 
> >> values of
> >>CPU 0 CPU 1
> >>===   ===
> >>r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> >> -  if (r1 > 0)   if (r2 > 0)
> >> +  if (r1 >= 0)  if (r2 >= 0)
> >>  WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> >>
> >>assert(!(r1 == 1 && r2 == 1));
> > 
> > Original program:
> > 
> > CPU 0 CPU 1
> > ===   ===
> > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > if (r1 >= 0)  if (r2 >= 0)
> >   WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > 
> > assert(!(r1 == 1 && r2 == 1));
> > 
> > Enthusiastically optimized program:
> > 
> > CPU 0 CPU 1
> > ===   ===
> > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > 
> > assert(!(r1 == 1 && r2 == 1));
> > 
> > This optimized version could trigger the assertion.
> > 
> > So we do need to stick with the ">" comparison.
> 
> Well but,
> 
> Current example:
> 
>   CPU 0 CPU 1
>   ===   ===
>   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
>   if (r1 > 0)   if (r2 > 0)
> WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> 
>   assert(!(r1 == 1 && r2 == 1));
> 
> Such a clever compiler might be able to prove that "x" and "y"
> are never modified and end up in the following:
> 
>   CPU 0 CPU 1
>   ===   ===
>   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> 
>   assert(!(r1 == 1 && r2 == 1));
> 
> This means it is impossible to describe this example in C,
> doesn't it?

That is a matter of some debate, but it has gotten increasingly more
difficult to get C to do one's bidding over the decades.

> What am I missing here?

The compiler has to work harder in your example case, so it is probably
just a matter of time.  In the original with ">=", all the compiler needed
to do was look at all the assignments and the initial value.  In the
original, it also had to do reasoning based on control dependencies
(which are not yet part of the C standard).

But yes, the degree to which a compiler can optimize atomics is a matter
of some debate.  Here is a proposal to let the developer choose:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0062r1.html

What are your thoughts on this?

Thanx, Paul

> Thanks, Akira
> 
> > That said, I very much welcome critical reviews of memory-barriers.txt,
> > so please do feel free to continue doing that!
> >

Re: [PATCH v3 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-19 Thread Ross Zwisler
On Wed, Jul 19, 2017 at 11:51:12AM -0600, Ross Zwisler wrote:
> On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> > On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > > To be able to use the common 4k zero page in DAX we need to have our PTE
> > > fault path look more like our PMD fault path where a PTE entry can be
> > > marked as dirty and writeable as it is first inserted, rather than waiting
> > > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > > 
> > > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > > distinguish between these two cases in do_wp_page():
> > > 
> > >   case 1: 4k zero page => writable DAX storage
> > >   case 2: read-only DAX storage => writeable DAX storage
> > > 
> > > This distinction is made by via vm_normal_page().  vm_normal_page() 
> > > returns
> > > false for the common 4k zero page, though, just as it does for DAX ptes.
> > > Instead of special casing the DAX + 4k zero page case, we will simplify 
> > > our
> > > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > > get rid of dax_pfn_mkwrite() completely.
> > > 
> > > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set 
> > > insert_pfn()
> > > will do the work that was previously done by wp_page_reuse() as part of 
> > > the
> > > dax_pfn_mkwrite() call path.
> > > 
> > > Signed-off-by: Ross Zwisler 
> > 
> > Just one small comment below.
> > 
> > > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, 
> > > unsigned long addr,
> > >   if (!pte)
> > >   goto out;
> > >   retval = -EBUSY;
> > > - if (!pte_none(*pte))
> > > - goto out_unlock;
> > > + if (!pte_none(*pte)) {
> > > + if (mkwrite) {
> > > + entry = *pte;
> > > + goto out_mkwrite;
> > 
> > Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> > return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> > anything we don't expect 
> 
> Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
> failure.  If the pfns don't match I think we're insane (and would have been
> insane prior to this patch series as well) because we are getting a page fault
> and somehow have a different PFN already mapped at that location.

Umm...well, I added the warning, and during my regression testing hit a case
where the PFNs didn't match.  (generic/437 with both ext4 & XFS)

I've verified that this behavior happens with vanilla v4.12, so it's not a new
condition introduced by my patch.

I'm off tracking that down - there's a bug lurking somewhere, I think.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Boqun Feng
On Wed, Jul 19, 2017 at 02:56:02PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 20, 2017 at 06:33:26AM +0900, Akira Yokosawa wrote:
> > On 2017/07/20 2:43, Paul E. McKenney wrote:
> > > On Mon, Jul 17, 2017 at 05:24:42PM +0900, Akira Yokosawa wrote:
> > >> >From b798b9b631e237d285aa8699da00bfb8ced33bea Mon Sep 17 00:00:00 2001
> > >> From: Akira Yokosawa 
> > >> Date: Mon, 17 Jul 2017 16:25:33 +0900
> > >> Subject: [PATCH] documentation: Fix two-CPU control-dependency example
> > >>
> > >> In commit 5646f7acc95f ("memory-barriers: Fix control-ordering
> > >> no-transitivity example"), the operator in "if" statement of
> > >> the two-CPU example was modified from ">=" to ">".
> > >> Now the example misses the point because there is no party
> > >> who will modify "x" nor "y". So each CPU performs only the
> > >> READ_ONCE().
> > >>
> > >> The point of this example is to use control dependency for ordering,
> > >> and the WRITE_ONCE() should always be executed.
> > >>
> > >> So it was correct prior to the above mentioned commit. Partial
> > >> revert of the commit (with context adjustments regarding other
> > >> changes thereafter) restores the point.
> > >>
> > >> Note that the three-CPU example demonstrating the lack of
> > >> transitivity stands regardless of this partial revert.
> > >>
> > >> Signed-off-by: Akira Yokosawa 
> > > 
> > > Hello, Akira,
> > > 
> > > You are quite right that many compilers would generate straightforward
> > > code for the code fragment below, and in that case, the assertion could
> > > never trigger due to either TSO or control dependencies, depending on
> > > the architecture this was running on.
> > > 
> > > However, if the compiler was too smart for our good, it could figure
> > > out that "x" and "y" only take on the values zero and one, so that
> > > the "if" would always be taken.  At that point, the compiler could
> > > simply ignore the "if" with the result shown below.
> > > 
> > >> ---
> > >>  Documentation/memory-barriers.txt | 2 +-
> > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/Documentation/memory-barriers.txt 
> > >> b/Documentation/memory-barriers.txt
> > >> index c4ddfcd..c1ebe99 100644
> > >> --- a/Documentation/memory-barriers.txt
> > >> +++ b/Documentation/memory-barriers.txt
> > >> @@ -851,7 +851,7 @@ demonstrated by two related examples, with the 
> > >> initial values of
> > >>  CPU 0 CPU 1
> > >>  ===   ===
> > >>  r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > >> -if (r1 > 0)   if (r2 > 0)
> > >> +if (r1 >= 0)  if (r2 >= 0)
> > >>WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > >>
> > >>  assert(!(r1 == 1 && r2 == 1));
> > > 
> > > Original program:
> > > 
> > >   CPU 0 CPU 1
> > >   ===   ===
> > >   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > >   if (r1 >= 0)  if (r2 >= 0)
> > > WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > 
> > >   assert(!(r1 == 1 && r2 == 1));
> > > 
> > > Enthusiastically optimized program:
> > > 
> > >   CPU 0 CPU 1
> > >   ===   ===
> > >   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > >   WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > 
> > >   assert(!(r1 == 1 && r2 == 1));
> > > 
> > > This optimized version could trigger the assertion.
> > > 
> > > So we do need to stick with the ">" comparison.
> > 
> > Well but,
> > 
> > Current example:
> > 
> > CPU 0 CPU 1
> > ===   ===
> > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > if (r1 > 0)   if (r2 > 0)
> >   WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > 
> > assert(!(r1 == 1 && r2 == 1));
> > 
> > Such a clever compiler might be able to prove that "x" and "y"
> > are never modified and end up in the following:
> > 

Hi Akira,

I wouldn't call that compiler a clever one, it's a broken one ;-)

So here is the thing: READ_ONCE() is a *volatile* load, which means the
compiler has to generate code that actually does a load, so the values
of r1 and r2 depend on the loads, therefore, a sane compiler will not 
optimize the if()s out because the volatile semantics of READ_ONCE().
Otherwise, I think we have a lot more to worry about than this case.

> > CPU 0 CPU 1
> > ===   ===
> > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > 
> > assert(!(r1 == 1 && r2 == 1));
> > 
> > This means it is impossible to describe this example in C,
> > doesn't it?
> 
> That is a matter of some debate, but it has gotten increasingly more
> difficult to get C to do one's bidding over the decades.
> 
> > What am I missing here?
> 
> The compiler has to work harder in your example case, so it is probably
> just

Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Paul E. McKenney
On Thu, Jul 20, 2017 at 09:31:41AM +0800, Boqun Feng wrote:
> On Wed, Jul 19, 2017 at 02:56:02PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 20, 2017 at 06:33:26AM +0900, Akira Yokosawa wrote:
> > > On 2017/07/20 2:43, Paul E. McKenney wrote:
> > > > On Mon, Jul 17, 2017 at 05:24:42PM +0900, Akira Yokosawa wrote:
> > > >> >From b798b9b631e237d285aa8699da00bfb8ced33bea Mon Sep 17 00:00:00 2001
> > > >> From: Akira Yokosawa 
> > > >> Date: Mon, 17 Jul 2017 16:25:33 +0900
> > > >> Subject: [PATCH] documentation: Fix two-CPU control-dependency example
> > > >>
> > > >> In commit 5646f7acc95f ("memory-barriers: Fix control-ordering
> > > >> no-transitivity example"), the operator in "if" statement of
> > > >> the two-CPU example was modified from ">=" to ">".
> > > >> Now the example misses the point because there is no party
> > > >> who will modify "x" nor "y". So each CPU performs only the
> > > >> READ_ONCE().
> > > >>
> > > >> The point of this example is to use control dependency for ordering,
> > > >> and the WRITE_ONCE() should always be executed.
> > > >>
> > > >> So it was correct prior to the above mentioned commit. Partial
> > > >> revert of the commit (with context adjustments regarding other
> > > >> changes thereafter) restores the point.
> > > >>
> > > >> Note that the three-CPU example demonstrating the lack of
> > > >> transitivity stands regardless of this partial revert.
> > > >>
> > > >> Signed-off-by: Akira Yokosawa 
> > > > 
> > > > Hello, Akira,
> > > > 
> > > > You are quite right that many compilers would generate straightforward
> > > > code for the code fragment below, and in that case, the assertion could
> > > > never trigger due to either TSO or control dependencies, depending on
> > > > the architecture this was running on.
> > > > 
> > > > However, if the compiler was too smart for our good, it could figure
> > > > out that "x" and "y" only take on the values zero and one, so that
> > > > the "if" would always be taken.  At that point, the compiler could
> > > > simply ignore the "if" with the result shown below.
> > > > 
> > > >> ---
> > > >>  Documentation/memory-barriers.txt | 2 +-
> > > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/Documentation/memory-barriers.txt 
> > > >> b/Documentation/memory-barriers.txt
> > > >> index c4ddfcd..c1ebe99 100644
> > > >> --- a/Documentation/memory-barriers.txt
> > > >> +++ b/Documentation/memory-barriers.txt
> > > >> @@ -851,7 +851,7 @@ demonstrated by two related examples, with the 
> > > >> initial values of
> > > >>CPU 0 CPU 1
> > > >>===   ===
> > > >>r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > > >> -  if (r1 > 0)   if (r2 > 0)
> > > >> +  if (r1 >= 0)  if (r2 >= 0)
> > > >>  WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > >>
> > > >>assert(!(r1 == 1 && r2 == 1));
> > > > 
> > > > Original program:
> > > > 
> > > > CPU 0 CPU 1
> > > > ===   ===
> > > > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > > > if (r1 >= 0)  if (r2 >= 0)
> > > >   WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > > 
> > > > assert(!(r1 == 1 && r2 == 1));
> > > > 
> > > > Enthusiastically optimized program:
> > > > 
> > > > CPU 0 CPU 1
> > > > ===   ===
> > > > r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > > > WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > > 
> > > > assert(!(r1 == 1 && r2 == 1));
> > > > 
> > > > This optimized version could trigger the assertion.
> > > > 
> > > > So we do need to stick with the ">" comparison.
> > > 
> > > Well but,
> > > 
> > > Current example:
> > > 
> > >   CPU 0 CPU 1
> > >   ===   ===
> > >   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > >   if (r1 > 0)   if (r2 > 0)
> > > WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
> > > 
> > >   assert(!(r1 == 1 && r2 == 1));
> > > 
> > > Such a clever compiler might be able to prove that "x" and "y"
> > > are never modified and end up in the following:
> > > 
> 
> Hi Akira,
> 
> I wouldn't call that compiler a clever one, it's a broken one ;-)
> 
> So here is the thing: READ_ONCE() is a *volatile* load, which means the
> compiler has to generate code that actually does a load, so the values
> of r1 and r2 depend on the loads, therefore, a sane compiler will not 
> optimize the if()s out because the volatile semantics of READ_ONCE().
> Otherwise, I think we have a lot more to worry about than this case.
> 
> > >   CPU 0 CPU 1
> > >   ===   ===
> > >   r1 = READ_ONCE(x);r2 = READ_ONCE(y);
> > > 
> > >   assert(!(r1 == 1 && r2 == 1));
> > > 
> > 

Re: [RFC v6 01/62] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-07-19 Thread Aneesh Kumar K.V

.

>   /*
> @@ -116,8 +104,8 @@ int __hash_page_4K(unsigned long ea, unsigned long 
> access, unsigned long vsid,
>* On hash insert failure we use old pte value and we don't
>* want slot information there if we have a insert failure.
>*/
> - old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> - new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> + old_pte &= ~(H_PAGE_HASHPTE);
> + new_pte &= ~(H_PAGE_HASHPTE);
>   goto htab_insert_hpte;
>   }

With the current path order and above hunk we will breaks the bisect I guess. 
With the above, when
we convert a 64k hpte to 4khpte, since this is the first patch, we
should clear that H_PAGE_F_GIX and H_PAGE_F_SECOND. We still use them
for 64k. I guess you should move this hunk to second patch.


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 02/62] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
> in the 64K backed HPTE pages. This along with the earlier
> patch will  entirely free  up the four bits from 64K PTE.
> The bit numbers are  big-endian as defined in the  ISA3.0
>
> This patch  does  the  following change to 64K PTE backed
> by 64K HPTE.
>
> H_PAGE_F_SECOND (S) which  occupied  bit  4  moves to the
>   second part of the pte to bit 60.
> H_PAGE_F_GIX (G,I,X) which  occupied  bit 5, 6 and 7 also
>   moves  to  the   second part of the pte to bit 61,
>   62, 63, 64 respectively
>
> since bit 7 is now freed up, we move H_PAGE_BUSY (B) from
> bit  9  to  bit  7.
>
> The second part of the PTE will hold
> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
> NOTE: None of the bits in the secondary PTE were not used
> by 64k-HPTE backed PTE.
>
> Before the patch, the 64K HPTE backed 64k PTE format was
> as follows
>
>  0 1 2 3 4  5  6  7  8 9 10...63
>  : : : : :  :  :  :  : : ::
>  v v v v v  v  v  v  v v vv
>
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> |x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> | | | | |  |  |  |  | | | | |..| | | | | <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
>
> After the patch, the 64k HPTE backed 64k PTE format is
> as follows
>
>  0 1 2 3 4  5  6  7  8 9 10...63
>  : : : : :  :  :  :  : : ::
>  v v v v v  v  v  v  v v vv
>
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> |x|x|x| |  |  |  |B |x| | |x|x||.|.|.|.| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> | | | | |  |  |  |  | | | | |..|S|G|I|X| <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
>
> The above PTE changes is applicable to hugetlbpages aswell.
>
> The patch does the following code changes:
>
> a) moves  the  H_PAGE_F_SECOND and  H_PAGE_F_GIX to 4k PTE
>   header   since it is no more needed b the 64k PTEs.
> b) abstracts  out __real_pte() and __rpte_to_hidx() so the
>   caller  need not know the bit location of the slot.
> c) moves the slot bits the secondary pte.
>

Reviewed-by: Aneesh Kumar K.V 
With changes suggested for the first patch.

> Signed-off-by: Ram Pai 
> ---


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 03/62] powerpc: introduce pte_set_hash_slot() helper

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> Introduce pte_set_hash_slot().It  sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX)
> bits at  the   appropriate   location   in   the   PTE  of  4K  PTE.  For
> 64K PTE, it  sets  the  bits  in  the  second  part  of  the  PTE. Though
> the implementation  for the former just needs the slot parameter, it does
> take some additional parameters to keep the prototype consistent.
>
> This function  will  be  handy  as  we   work   towards  re-arranging the
> bits in the later patches.
>

Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  |   15 +++
>  arch/powerpc/include/asm/book3s/64/hash-64k.h |   25 
> +
>  2 files changed, 40 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index d2cf949..dc153c6 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -53,6 +53,21 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
>  }
>  #endif
>
> +/*
> + * 4k pte format is  different  from  64k  pte  format.  Saving  the
> + * hash_slot is just a matter of returning the pte bits that need to
> + * be modified. On 64k pte, things are a  little  more  involved and
> + * hence  needs   many   more  parameters  to  accomplish  the  same.
> + * However we  want  to abstract this out from the caller by keeping
> + * the prototype consistent across the two formats.
> + */
> +static inline unsigned long pte_set_hash_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + return (slot << H_PAGE_F_GIX_SHIFT) &
> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
>  static inline char *get_hpte_slot_array(pmd_t *pmdp)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index c281f18..89ef5a9 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -67,6 +67,31 @@ static inline unsigned long __rpte_to_hidx(real_pte_t 
> rpte, unsigned long index)
>   return ((rpte.hidx >> (index<<2)) & 0xfUL);
>  }
>
> +/*
> + * Commit the hash slot and return pte bits that needs to be modified.
> + * The caller is expected to modify the pte bits accordingly and
> + * commit the pte to memory.
> + */
> +static inline unsigned long pte_set_hash_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +
> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> + *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
> + /*
> +  * Commit the hidx bits to memory before returning.
> +  * Anyone reading  pte  must  ensure hidx bits are
> +  * read  only  after  reading the pte by using the
> +  * read-side  barrier  smp_rmb(). __real_pte() can
> +  * help ensure that.
> +  */
> + smp_wmb();
> +
> + /* no pte bits to be modified, return 0x0UL */
> + return 0x0UL;
> +}
> +
>  #define __rpte_to_pte(r) ((r).pte)
>  extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
>  /*
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 04/62] powerpc: introduce pte_get_hash_gslot() helper

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> Introduce pte_get_hash_gslot()() which returns the slot number of the
> HPTE in the global hash table.
>
> This function will come in handy as we work towards re-arranging the
> PTE bits in the later patches.
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/hash.h |3 +++
>  arch/powerpc/mm/hash_utils_64.c   |   18 ++
>  2 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
> b/arch/powerpc/include/asm/book3s/64/hash.h
> index d27f885..277158c 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -156,6 +156,9 @@ static inline int hash__pte_none(pte_t pte)
>   return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
>  }
>
> +unsigned long pte_get_hash_gslot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
>  /* This low level function performs the actual PTE insertion
>   * Setting the PTE depends on the MMU type and other factors. It's
>   * an horrible mess that I'm not going to try to clean up now but
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 1b494d0..d3604da 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -1591,6 +1591,24 @@ static inline void tm_flush_hash_page(int local)
>  }
>  #endif
>
> +/*
> + * return the global hash slot, corresponding to the given
> + * pte, which contains the hpte.
> + */
> +unsigned long pte_get_hash_gslot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index)
> +{
> + unsigned long hash, slot, hidx;
> +
> + hash = hpt_hash(vpn, shift, ssize);
> + hidx = __rpte_to_hidx(rpte, subpg_index);
> + if (hidx & _PTEIDX_SECONDARY)
> + hash = ~hash;
> + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> + slot += hidx & _PTEIDX_GROUP_IX;
> + return slot;
> +}
> +
>  /* WARNING: This is called from hash_low_64.S, if you change this prototype,
>   *  do not forget to update the assembly call site !
>   */
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 05/62] powerpc: capture the PTE format changes in the dump pte report

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> The H_PAGE_F_SECOND,H_PAGE_F_GIX are not in the 64K main-PTE.
> capture these changes in the dump pte report.
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/mm/dump_linuxpagetables.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
> b/arch/powerpc/mm/dump_linuxpagetables.c
> index 44fe483..5627edd 100644
> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> @@ -213,7 +213,7 @@ struct flag_info {
>   .val= H_PAGE_4K_PFN,
>   .set= "4K_pfn",
>   }, {
> -#endif
> +#else /* CONFIG_PPC_64K_PAGES */
>   .mask   = H_PAGE_F_GIX,
>   .val= H_PAGE_F_GIX,
>   .set= "f_gix",
> @@ -224,6 +224,7 @@ struct flag_info {
>   .val= H_PAGE_F_SECOND,
>   .set= "f_second",
>   }, {
> +#endif /* CONFIG_PPC_64K_PAGES */
>  #endif
>   .mask   = _PAGE_SPECIAL,
>   .val= _PAGE_SPECIAL,
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 06/62] powerpc: use helper functions in __hash_page_64K() for 64K PTE

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> replace redundant code in __hash_page_64K() with helper
> functions pte_get_hash_gslot() and pte_set_hash_slot()
>

Reviewed-by: Aneesh Kumar K.V 


> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/mm/hash64_64k.c |   24 
>  1 files changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 0012618..645f621 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -244,7 +244,6 @@ int __hash_page_64K(unsigned long ea, unsigned long 
> access,
>   unsigned long flags, int ssize)
>  {
>   real_pte_t rpte;
> - unsigned long *hidxp;
>   unsigned long hpte_group;
>   unsigned long rflags, pa;
>   unsigned long old_pte, new_pte;
> @@ -289,18 +288,12 @@ int __hash_page_64K(unsigned long ea, unsigned long 
> access,
>
>   vpn  = hpt_vpn(ea, vsid, ssize);
>   if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> - unsigned long hash, slot, hidx;
> -
> - hash = hpt_hash(vpn, shift, ssize);
> - hidx = __rpte_to_hidx(rpte, 0);
> - if (hidx & _PTEIDX_SECONDARY)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += hidx & _PTEIDX_GROUP_IX;
> + unsigned long gslot;
>   /*
>* There MIGHT be an HPTE for this pte
>*/
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> + gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte, 0);
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_64K,
>  MMU_PAGE_64K, ssize,
>  flags) == -1)
>   old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -350,17 +343,8 @@ int __hash_page_64K(unsigned long ea, unsigned long 
> access,
>   return -1;
>   }
>
> - /*
> -  * Insert slot number & secondary bit in PTE second half.
> -  */
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx &= ~(0xfUL);
> - *hidxp = rpte.hidx  | (slot & 0xfUL);
> - /*
> -  * check __real_pte for details on matching smp_rmb()
> -  */
> - smp_wmb();
>   new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> + new_pte |= pte_set_hash_slot(ptep, rpte, 0, slot);
>   }
>   *ptep = __pte(new_pte & ~H_PAGE_BUSY);
>   return 0;
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 07/62] powerpc: use helper functions in __hash_page_huge() for 64K PTE

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> replace redundant code in __hash_page_huge() with helper
> functions pte_get_hash_gslot() and pte_set_hash_slot()
>

Can you fold all the helper function usage into one patch ?


> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/mm/hugetlbpage-hash64.c |   24 
>  1 files changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
> b/arch/powerpc/mm/hugetlbpage-hash64.c
> index 6f7aee3..e6dcd50 100644
> --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> @@ -23,7 +23,6 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> access, unsigned long vsid,
>int ssize, unsigned int shift, unsigned int mmu_psize)
>  {
>   real_pte_t rpte;
> - unsigned long *hidxp;
>   unsigned long vpn;
>   unsigned long old_pte, new_pte;
>   unsigned long rflags, pa, sz;
> @@ -74,16 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> access, unsigned long vsid,
>   /* Check if pte already has an hpte (case 2) */
>   if (unlikely(old_pte & H_PAGE_HASHPTE)) {
>   /* There MIGHT be an HPTE for this pte */
> - unsigned long hash, slot, hidx;
> + unsigned long gslot;
>
> - hash = hpt_hash(vpn, shift, ssize);
> - hidx = __rpte_to_hidx(rpte, 0);
> - if (hidx & _PTEIDX_SECONDARY)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += hidx & _PTEIDX_GROUP_IX;
> -
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> + gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte, 0);
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
>  mmu_psize, ssize, flags) == -1)
>   old_pte &= ~_PAGE_HPTEFLAGS;
>   }
> @@ -110,16 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> access, unsigned long vsid,
>   return -1;
>   }
>
> - /*
> -  * Insert slot number & secondary bit in PTE second half.
> -  */
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx &= ~(0xfUL);
> - *hidxp = rpte.hidx  | (slot & 0xfUL);
> - /*
> -  * check __real_pte for details on matching smp_rmb()
> -  */
> - smp_wmb();
> + new_pte |= pte_set_hash_slot(ptep, rpte, 0, slot);
>   }
>
>   /*
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 11/62] powerpc: initial pkey plumbing

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> basic setup to initialize the pkey system. Only 64K kernel in HPT
> mode, enables the pkey system.
>
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/Kconfig   |   16 ++
>  arch/powerpc/include/asm/mmu_context.h |5 +++
>  arch/powerpc/include/asm/pkeys.h   |   51 
> 
>  arch/powerpc/kernel/setup_64.c |4 ++
>  arch/powerpc/mm/Makefile   |1 +
>  arch/powerpc/mm/hash_utils_64.c|1 +
>  arch/powerpc/mm/pkeys.c|   18 +++
>  7 files changed, 96 insertions(+), 0 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/pkeys.h
>  create mode 100644 arch/powerpc/mm/pkeys.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index bf4391d..5c60fd6 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -855,6 +855,22 @@ config SECCOMP
>
> If unsure, say Y. Only embedded should say N here.
>
> +config PPC64_MEMORY_PROTECTION_KEYS
> + prompt "PowerPC Memory Protection Keys"
> + def_bool y
> + # Note: only available in 64-bit mode
> + depends on PPC64 && PPC_64K_PAGES
> + select ARCH_USES_HIGH_VMA_FLAGS
> + select ARCH_HAS_PKEYS
> + ---help---
> +   Memory Protection Keys provides a mechanism for enforcing
> +   page-based protections, but without requiring modification of the
> +   page tables when an application changes protection domains.
> +
> +   For details, see Documentation/vm/protection-keys.txt
> +
> +   If unsure, say y.
> +
>  endmenu


Shouldn't this come as the last patch ? Or can we enable this config by
this patch ? If so what does it do ? Did you test boot each of this
patch to make sure we don't break git bisect ?

>
>  config ISA_DMA_API
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index da7e943..4b93547 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -181,5 +181,10 @@ static inline bool arch_vma_access_permitted(struct 
> vm_area_struct *vma,
>   /* by default, allow everything */


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] documentation: Fix two-CPU control-dependency example

2017-07-19 Thread Boqun Feng
On Wed, Jul 19, 2017 at 10:47:04PM -0700, Paul E. McKenney wrote:
[...]
> > Hi Paul,
> > 
> > I know the compiler could optimize atomics in very interesting ways, but
> > this case is about volatile, so I guess our case is still fine? ;-)
> 
> Hello, Boqun,
> 
> When I asked that question, the answer I got was "the compiler must
> emit the load instruction, but is under no obligation to actually use the
> value loaded".
> 
> I don't happen to like that answer, by the way.  ;-)
> 

Me neither, seems to me the kernel happens to work well at
compiler-optimization's mercy ;-/

With claim like that, compiler could do optimization as turning:

struct task_struct *owner;

for (;;) {
owner = READ_ONCE(lock->owner);
if (owner && !mutex_spin_on_owner(lock, owner))
break;
/* ... */

into:

struct task_struct *owner;

owner = READ_ONCE(lock->owner);

for (;;READ_ONCE(lock->owner)) {
if (owner && !mutex_spin_on_owner(lock, owner))
break;
/* ... */

Because the load executed in every loop, and they just happen to choose
not to use the values. And this is within their rights!

Regards,
Boqun

>   Thanx, Paul
> 
> > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0062r1.html
> > > 
> > 
> > Great material to wake up mind in the morning! Thanks.
> > 
> > Regards,
> > Boqun
> > 
> > > What are your thoughts on this?
> > > 
> > >   Thanx, Paul
> > > 
> > > > Thanks, Akira
> > > > 
> > > > > That said, I very much welcome critical reviews of 
> > > > > memory-barriers.txt,
> > > > > so please do feel free to continue doing that!
> > > > > 
> > > > >   Thanx, Paul
> > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 26/62] powerpc: Program HPTE key protection bits

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> Map the PTE protection key bits to the HPTE key protection bits,
> while creating HPTE  entries.
>
Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 +
>  arch/powerpc/include/asm/pkeys.h  |   12 
>  arch/powerpc/mm/hash_utils_64.c   |4 
>  3 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 6981a52..f7a6ed3 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -90,6 +90,8 @@
>  #define HPTE_R_PP0   ASM_CONST(0x8000)
>  #define HPTE_R_TSASM_CONST(0x4000)
>  #define HPTE_R_KEY_HIASM_CONST(0x3000)
> +#define HPTE_R_KEY_BIT0  ASM_CONST(0x2000)
> +#define HPTE_R_KEY_BIT1  ASM_CONST(0x1000)
>  #define HPTE_R_RPN_SHIFT 12
>  #define HPTE_R_RPN   ASM_CONST(0x0000)
>  #define HPTE_R_RPN_3_0   ASM_CONST(0x01fff000)
> @@ -104,6 +106,9 @@
>  #define HPTE_R_C ASM_CONST(0x0080)
>  #define HPTE_R_R ASM_CONST(0x0100)
>  #define HPTE_R_KEY_LOASM_CONST(0x0e00)
> +#define HPTE_R_KEY_BIT2  ASM_CONST(0x0800)
> +#define HPTE_R_KEY_BIT3  ASM_CONST(0x0400)
> +#define HPTE_R_KEY_BIT4  ASM_CONST(0x0200)
>
>  #define HPTE_V_1TB_SEG   ASM_CONST(0x4000)
>  #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ff00)
> diff --git a/arch/powerpc/include/asm/pkeys.h 
> b/arch/powerpc/include/asm/pkeys.h
> index ad39db0..bbb5d85 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -41,6 +41,18 @@ static inline u64 vmflag_to_page_pkey_bits(u64 vm_flags)
>   ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL));
>  }
>
> +static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
> +{
> + if (!pkey_inited)
> + return 0x0UL;
> +
> + return (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL));
> +}
> +
>  static inline int vma_pkey(struct vm_area_struct *vma)
>  {
>   if (!pkey_inited)
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index f88423b..1e74529 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -231,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long 
> pteflags)
>*/
>   rflags |= HPTE_R_M;
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + rflags |= pte_to_hpte_pkey_bits(pteflags);
> +#endif
> +
>   return rflags;
>  }
>
> -- 
> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v6 27/62] powerpc: helper to validate key-access permissions of a pte

2017-07-19 Thread Aneesh Kumar K.V
Ram Pai  writes:

> helper function that checks if the read/write/execute is allowed
> on the pte.
>
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |4 +++
>  arch/powerpc/include/asm/pkeys.h |   12 +
>  arch/powerpc/mm/pkeys.c  |   33 
> ++
>  3 files changed, 49 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 30d7f55..0056e58 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -472,6 +472,10 @@ static inline void write_uamor(u64 value)
>   mtspr(SPRN_UAMOR, value);
>  }
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +extern bool arch_pte_access_permitted(u64 pte, bool write, bool execute);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  unsigned long addr, pte_t *ptep)
> diff --git a/arch/powerpc/include/asm/pkeys.h 
> b/arch/powerpc/include/asm/pkeys.h
> index bbb5d85..7a9aade 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -53,6 +53,18 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
>   ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL));
>  }
>
> +static inline u16 pte_to_pkey_bits(u64 pteflags)
> +{
> + if (!pkey_inited)
> + return 0x0UL;

Do we really need that above check ? We should always find it
peky_inited to be set. 

> +
> + return (((pteflags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0UL) |
> + ((pteflags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0UL));
> +}
> +


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html