date:20140528

Re: [PATCH v2] powerpc/powernv: hwmon driver for power values, fan rpm and temperature

2014-05-28 Thread Guenter Roeck


On 05/19/2014 07:26 AM, Neelesh Gupta wrote:

This patch adds basic kernel enablement for reading power values, fan
speed rpm and temperature values on powernv platforms which will
be exported to user space through sysfs interface.

Test results:
-
[root@tul163p1 ~]# sensors
ibmpowernv-isa-
Adapter: ISA adapter
fan1:5487 RPM  (min =0 RPM)
fan2:5152 RPM  (min =0 RPM)
fan3:5590 RPM  (min =0 RPM)
fan4:4963 RPM  (min =0 RPM)
fan5:   0 RPM  (min =0 RPM)
fan6:   0 RPM  (min =0 RPM)
fan7:7488 RPM  (min =0 RPM)
fan8:7944 RPM  (min =0 RPM)
temp1:+39.0°C  (high =  +0.0°C)
power1:  192.00 W

[root@tul163p1 ~]# ls /sys/devices/platform/
alarmtimer  ibmpowernv.0  rtc-generic  serial8250  uevent
[root@tul163p1 ~]# ls /sys/devices/platform/ibmpowernv.0/
driver/hwmon/ modalias   subsystem/ uevent
[root@tul163p1 ~]# ls /sys/devices/platform/ibmpowernv.0/hwmon/hwmon0/
device  fan2_minfan4_minfan6_minfan8_min   power1_input
fan1_fault  fan3_fault  fan5_fault  fan7_fault  in1_fault  subsystem
fan1_input  fan3_input  fan5_input  fan7_input  in2_fault  temp1_input
fan1_minfan3_minfan5_minfan7_minin3_fault  temp1_max
fan2_fault  fan4_fault  fan6_fault  fan8_fault  in4_fault  uevent
fan2_input  fan4_input  fan6_input  fan8_input  name
[root@tul163p1 ~]#
[root@tul163p1 ~]# ls /sys/class/hwmon/hwmon0/
device  fan2_minfan4_minfan6_minfan8_min   power1_input
fan1_fault  fan3_fault  fan5_fault  fan7_fault  in1_fault  subsystem
fan1_input  fan3_input  fan5_input  fan7_input  in2_fault  temp1_input
fan1_minfan3_minfan5_minfan7_minin3_fault  temp1_max
fan2_fault  fan4_fault  fan6_fault  fan8_fault  in4_fault  uevent
fan2_input  fan4_input  fan6_input  fan8_input  name
[root@tul163p1 ~]#




The inX_fault attributes don't really make much sense. _fault attributes without
_input attributes don't have any value and are, as you noticed, not displayed
with the sensors command. Is this a problem in teh devicetree data or do you
really have voltage faults without voltages ?


Signed-off-by: Neelesh Gupta 
---


Checkpatch says:

total: 8 errors, 11 warnings, 389 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
  scripts/cleanfile

which should really not happen at this point.

Please make sure you follow CodingStyle. checkpatch --strict points to a number
of additional violations.

More comments inline.


Changes in v2
=
- Generic use of devm_* functions in hwmon like using devm_kzalloc() for dynamic
   memory request, avoiding the need to explicit free of memory.
   Adding 'struct attribute_group' as member of platform data structure to be
   populated and then passed to devm_hwmon_device_register_with_groups().

   Note: Having an array of pointers of 'attribute_group' and each group
   corresponds to 'enum sensors' type. Not completely sure, if it's ideal or
   could have just one group populated with attributes of sensor types?


Your call, really; whatever is easier for you. I won't dictate one or the other.

Question though is what happens if one group is not populated. If I understand
the code correctly this will result in the remaining groups to be 'dropped',
ie not displayed at all.


- 'ibmpowernv' is not hot-pluggable device so moving 'platform_driver' callback
   function (probe) as part of __init code.
- Fixed issues related to coding style.
- Other general comments in v1.

  drivers/hwmon/Kconfig  |8 +
  drivers/hwmon/Makefile |1
  drivers/hwmon/ibmpowernv.c |  368 


You'll also need Documentation/hwmon/ibmpowernv and
Documentation/devicetree/bindings/hwmon/ibmpowernv.

The latter will need to get an Ack from the devicetree maintainers.


  3 files changed, 377 insertions(+)
  create mode 100644 drivers/hwmon/ibmpowernv.c

diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index bc196f4..3e308fa 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -554,6 +554,14 @@ config SENSORS_IBMPEX
  This driver can also be built as a module.  If so, the module
  will be called ibmpex.

+config SENSORS_IBMPOWERNV
+   tristate "IBM POWERNV platform sensors"
+   depends on PPC_POWERNV
+   default y
+   help
+ If you say yes here you get support for the temperature/fan/power
+ sensors on your platform.
+
  config SENSORS_IIO_HWMON
tristate "Hwmon driver that uses channels specified via iio maps"
depends on IIO
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index c48f987..199c401 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_SENSORS_ULTRA45) += ultra45_env.o
  obj-$(CONFIG_SENSORS_I5K_AMB) += i5k_amb.o
  obj-$(CONFIG_SENSORS_IBMAEM)  += ibmaem.o
  obj-$(CONFIG_SENSORS_IBMPEX)  += ibmpex.o
+

Re: [PATCH v2] powerpc/powernv: hwmon driver for power values, fan rpm and temperature

2014-05-28 Thread Benjamin Herrenschmidt

On Wed, 2014-05-28 at 00:23 -0700, Guenter Roeck wrote:
> Consider using of_property_read_u32().
> 
> > + sdata[count].id = *sensor_id;
> > + sdata[count].type = type;

Especially since this is broken for Little Endian !

Neelesh, please make sure you test your patch on LE.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [V6 00/11] perf: New conditional branch filter

2014-05-28 Thread Anshuman Khandual

On 05/27/2014 05:39 PM, Stephane Eranian wrote:
> I have been looking at those patches and ran some tests.
> And I found a few issues so far.
> 
> I am running:
> $ perf record -j any_ret -e cycles:u test_program
> $ perf report -D
> 
> Most entries are okay and match the filter, however some do not make sense:
> 
> 3642586996762 0x15d0 [0x108]: PERF_RECORD_SAMPLE(IP, 2): 17921/17921:
> 0x10001170 period: 613678 addr: 0
>  branch stack: nr:9
> .  0: 100011cc -> 1e38
> .  1: 10001150 -> 100011bc
> .  2: 10001208 -> 1e38
> .  3: 10001160 -> 100011f8
> .  4: 100011cc -> 1e38
> .  5: 10001150 -> 100011bc
> .  6: 10001208 -> 1e38
> .  7: 10001160 -> 100011f8
> .  8:  -> 10001160
> ^^
> Entry 8 does not make sense, unless 0x0 is a valid return branch
> instruction address.
> If an address is invalid, the whole entry needs to be eliminated. It
> is okay to have
> less than the max number of entries supported by HW.

Hey Stephane,

Okay. The same behaviour is also reflected in the test results what I have
shared in the patchset. Here is that section.

(3) perf record -j any_ret -e branch-misses:u ./cprog

# Overhead  Command  Source Shared Object  Source Symbol  Target Shared 
Object  Target Symbol
#   ...    .  
  .
#
15.61%cprog  [unknown] [.]    cprog 
[.] sw_3_1   
 6.28%cprog  cprog [.] symbol2cprog 
[.] hw_1_2   
 6.28%cprog  cprog [.] ctr_addr   cprog 
[.] sw_4_1   
 6.26%cprog  cprog [.] success_3_1_3  cprog 
[.] sw_3_1   
 6.24%cprog  cprog [.] symbol1cprog 
[.] hw_1_1   
 6.24%cprog  cprog [.] sw_4_2 cprog 
[.] callme   
 6.21%cprog  [unknown] [.]    cprog 
[.] callme   
 6.19%cprog  cprog [.] lr_addrcprog 
[.] sw_4_2   
 3.16%cprog  cprog [.] hw_1_2 cprog 
[.] callme   
 3.15%cprog  cprog [.] success_3_1_1  cprog 
[.] sw_3_1   
 3.15%cprog  cprog [.] sw_4_1 cprog 
[.] callme   
 3.14%cprog  cprog [.] callme cprog 
[.] main 
 3.13%cprog  cprog [.] hw_1_1 cprog 
[.] callme

So a lot of samples above have 0x0 as the "from" address. This originates from 
the code
section here inside the function "power_pmu_bhrb_read", where we hit two back 
to back
target addresses. So we zero out the from address for the first target address 
and re-read
the second address over again. So thats how we get zero as the from address. 
This is how the
HW capture the samples. I was reluctant to drop these samples but I agree that 
these kind of
samples can be dropped if we need to.

if (val & BHRB_TARGET) {
/* Shouldn't have two targets in a
   row.. Reset index and try again */
r_index--;
addr = 0;
}

> I also had cases where monitoring only at the user level, got me
> branch addresses in the
> 0xc000.. range. My test program is linked statically.
> 

Thats weird. I would need more information and details on this. BTW
what is the system you are running on ? Could you please share the
/proc/cpuinfo details of the same ?

> when eliminating the bogus entries, my tests yielded only return
> branch instruction addresses
> which is good. Will run more tests.

Sure. Thanks for the tests and comments.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] selftests/powerpc: Test the THP bug we fixed in the previous commit

2014-05-28 Thread Michael Ellerman

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/Makefile   |  2 +-
 tools/testing/selftests/powerpc/mm/Makefile| 18 ++
 .../selftests/powerpc/mm/hugetlb_vs_thp_test.c | 72 ++
 3 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/Makefile
 create mode 100644 tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c

diff --git a/tools/testing/selftests/powerpc/Makefile 
b/tools/testing/selftests/powerpc/Makefile
index 316194f..b3dbe9e 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -13,7 +13,7 @@ CFLAGS := -Wall -O2 -flto -Wall -Werror 
-DGIT_VERSION='"$(GIT_VERSION)"' -I$(CUR
 
 export CC CFLAGS
 
-TARGETS = pmu copyloops
+TARGETS = pmu copyloops mm
 
 endif
 
diff --git a/tools/testing/selftests/powerpc/mm/Makefile 
b/tools/testing/selftests/powerpc/mm/Makefile
new file mode 100644
index 000..357ccbd
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/Makefile
@@ -0,0 +1,18 @@
+noarg:
+   $(MAKE) -C ../
+
+PROGS := hugetlb_vs_thp_test
+
+all: $(PROGS)
+
+$(PROGS): ../harness.c
+
+run_tests: all
+   @-for PROG in $(PROGS); do \
+   ./$$PROG; \
+   done;
+
+clean:
+   rm -f $(PROGS)
+
+.PHONY: all run_tests clean
diff --git a/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c 
b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c
new file mode 100644
index 000..3d8e5b0
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c
@@ -0,0 +1,72 @@
+#include 
+#include 
+#include 
+
+#include "utils.h"
+
+/* This must match the huge page & THP size */
+#define SIZE   (16 * 1024 * 1024)
+
+static int test_body(void)
+{
+   void *addr;
+   char *p;
+
+   addr = (void *)0xa000;
+
+   p = mmap(addr, SIZE, PROT_READ | PROT_WRITE,
+MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+   if (p != MAP_FAILED) {
+   /*
+* Typically the mmap will fail because no huge pages are
+* allocated on the system. But if there are huge pages
+* allocated the mmap will succeed. That's fine too, we just
+* munmap here before continuing.
+*/
+   munmap(addr, SIZE);
+   }
+
+   p = mmap(addr, SIZE, PROT_READ | PROT_WRITE,
+MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+   if (p == MAP_FAILED) {
+   printf("Mapping failed @ %p\n", addr);
+   perror("mmap");
+   return 1;
+   }
+
+   /*
+* Either a user or kernel access is sufficient to trigger the bug.
+* A kernel access is easier to spot & debug, as it will trigger the
+* softlockup or RCU stall detectors, and when the system is kicked
+* into xmon we get a backtrace in the kernel.
+*
+* A good option is:
+*  getcwd(p, SIZE);
+*
+* For the purposes of this testcase it's preferable to spin in
+* userspace, so the harness can kill us if we get stuck. That way we
+* see a test failure rather than a dead system.
+*/
+   *p = 0xf;
+
+   munmap(addr, SIZE);
+
+   return 0;
+}
+
+static int test_main(void)
+{
+   int i;
+
+   /* 10,000 because it's a "bunch", and completes reasonably quickly */
+   for (i = 0; i < 1; i++)
+   if (test_body())
+   return 1;
+
+   return 0;
+}
+
+int main(void)
+{
+   return test_harness(test_main, "hugetlb_vs_thp");
+}
-- 
1.9.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] powerpc/mm: Check paca psize is up to date for huge mappings

2014-05-28 Thread Michael Ellerman

We have a bug in our hugepage handling which exhibits as an infinite
loop of hash faults. If the fault is being taken in the kernel it will
typically trigger the softlockup detector, or the RCU stall detector.

The bug is as follows:

 1. mmap(0xa000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
 2. Slice code converts the slice psize to 16M.
 3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
synchronises the mm->context with the paca->context. So the paca slice
mask is updated to include the 16M slice.
 3. Either:
* mmap() fails because there are no huge pages available.
* mmap() succeeds and the mapping is then munmapped.
In both cases the slice psize remains at 16M in both the paca & mm.
 4. mmap(0xa000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
 5. The slice psize is converted back to 64K. Because of the check on line 539
of slice.c we DO NOT update the paca->context. The paca slice mask is now
out of sync with the mm slice mask.
 6. User/kernel accesses 0xa000.
 7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
to create an SLB entry and inserts it in the SLB.
18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
is found so a data access exception is generated.
19. The data access handler calls do_page_fault() -> handle_mm_fault().
10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
11. The hardware retries the access, there is still nothing in the hash table
so once again a data access exception is generated.
12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
hash. Although the THP mapping maps 16M the hashing is done using 64K
as the segment page size.
13. hash_page() returns immediately after calling __hash_page_thp(), skipping
over the code at line 1125. Resulting in the mismatch between the
paca->context and mm->context not being detected.
14. The hardware retries the access, the hash it generates using the 16M
SLB entry does NOT match the hash we inserted.
15. We take another data access and go into __hash_page_thp().
16. We see a valid entry in the hpte_slot_array and so we call updatepp()
which succeeds.
17. Goto 14.

We could fix this in two ways. The first would be to remove or modify
the check on line 539 of slice.c.

The second option is to cause the check of paca psize in hash_page() on
line 1125 to also be done for THP pages.

We prefer the latter, because the check & update of the paca psize is
not done until we know it's necessary. It's also done only on the
current cpu, so we don't need to IPI all other cpus.

Without further rearranging the code, the simplest fix is to pull out
the code that checks paca psize and call it in two places. Firstly for
THP/hugetlb, and secondly for other mappings as before.

Thanks to Dave Jones for trinity, which originally found this bug.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/hash_utils_64.c | 31 ---
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d766d6e..6650699 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -960,6 +960,22 @@ void hash_failure_debug(unsigned long ea, unsigned long 
access,
trap, vsid, ssize, psize, lpsize, pte);
 }
 
+static void check_paca_psize(unsigned long ea, struct mm_struct *mm,
+int psize, bool user_region)
+{
+   if (user_region) {
+   if (psize != get_paca_psize(ea)) {
+   get_paca()->context = mm->context;
+   slb_flush_and_rebolt();
+   }
+   } else if (get_paca()->vmalloc_sllp !=
+  mmu_psize_defs[mmu_vmalloc_psize].sllp) {
+   get_paca()->vmalloc_sllp =
+   mmu_psize_defs[mmu_vmalloc_psize].sllp;
+   slb_vmalloc_update();
+   }
+}
+
 /* Result code is:
  *  0 - handled
  *  1 - normal page fault
@@ -1081,6 +1097,8 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
WARN_ON(1);
}
 #endif
+   check_paca_psize(ea, mm, psize, user_region);
+
goto bail;
}
 
@@ -1121,17 +1139,8 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
 #endif
}
}
-   if (user_region) {
-   if (psize != get_paca_psize(ea)) {
-   get_paca()->context = mm->context;
-   slb_flush_and_rebolt();
-   }
-   } else if (get_paca()->vmalloc_sllp !=
-  mmu_psize_defs[mmu_vmalloc_psize].sllp) {
-   get_paca()->vmalloc_sllp =
-   mmu_psize_defs[mmu_vmalloc_psize].sllp;
-   slb_vmalloc_update();
-   }
+
+   check_paca_psize(ea, mm, psize,

[PATCH] powerpc/85xx: Add T4240RDB board support

2014-05-28 Thread Chunhe Lan

T4240RDB board Specification

Memory subsystem:
 6GB DDR3
 128MB NOR flash
 2GB NAND flash
Ethernet:
 Eight 1G SGMII ports
 Four 10Gbps SFP+ ports
PCIe:
 Two PCIe slots
USB:
 Two USB2.0 Type A ports
SDHC:
 One SD-card port
SATA:
 One SATA port
UART:
 Dual RJ45 ports

Signed-off-by: Chunhe Lan 
Cc: Scott Wood 
---
 arch/powerpc/boot/dts/t4240rdb.dts|  186 +
 arch/powerpc/configs/corenet64_smp_defconfig  |5 +
 arch/powerpc/platforms/85xx/Kconfig   |2 +-
 arch/powerpc/platforms/85xx/corenet_generic.c |1 +
 4 files changed, 193 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/t4240rdb.dts

diff --git a/arch/powerpc/boot/dts/t4240rdb.dts 
b/arch/powerpc/boot/dts/t4240rdb.dts
new file mode 100644
index 000..53761d4
--- /dev/null
+++ b/arch/powerpc/boot/dts/t4240rdb.dts
@@ -0,0 +1,186 @@
+/*
+ * T4240RDB Device Tree Source
+ *
+ * Copyright 2014 Freescale Semiconductor Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *  notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *  notice, this list of conditions and the following disclaimer in the
+ *  documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *  names of its contributors may be used to endorse or promote products
+ *  derived from this software without specific prior written permission.
+ *
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor "AS IS" AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF 
THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/include/ "fsl/t4240si-pre.dtsi"
+
+/ {
+   model = "fsl,T4240RDB";
+   compatible = "fsl,T4240RDB";
+   #address-cells = <2>;
+   #size-cells = <2>;
+   interrupt-parent = <&mpic>;
+
+   ifc: localbus@ffe124000 {
+   reg = <0xf 0xfe124000 0 0x2000>;
+   ranges = <0 0 0xf 0xe800 0x0800
+ 2 0 0xf 0xff80 0x0001
+ 3 0 0xf 0xffdf 0x8000>;
+
+   nor@0,0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "cfi-flash";
+   reg = <0x0 0x0 0x800>;
+
+   bank-width = <2>;
+   device-width = <1>;
+   };
+
+   nand@2,0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "fsl,ifc-nand";
+   reg = <0x2 0x0 0x1>;
+   };
+   };
+
+   memory {
+   device_type = "memory";
+   };
+
+   dcsr: dcsr@f {
+   ranges = <0x 0xf 0x 0x01072000>;
+   };
+
+   soc: soc@ffe00 {
+   ranges = <0x 0xf 0xfe00 0x100>;
+   reg = <0xf 0xfe00 0 0x1000>;
+   spi@11 {
+   flash@0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "sst,sst25wf040";
+   reg = <0>;
+   spi-max-frequency = <4000>; /* input clock 
*/
+   };
+   };
+
+   i2c@118000 {
+   eeprom@52 {
+   compatible = "at24,24c256";
+   reg = <0x52>;
+   };
+   eeprom@54 {
+   compatible = "at24,24c256";
+   reg = <0x54>;
+   };
+   eeprom@56 {
+

[PATCH] t4240/dts: Enable third elo3 DMA engine support

2014-05-28 Thread Chunhe Lan

T4240QDS and T4240RDB have the third DMA engine
controller. So add corresponding DMA node into
dts file.

Signed-off-by: Chunhe Lan 
Cc: Scott Wood 
---
 arch/powerpc/boot/dts/fsl/t4240si-post.dtsi |1 +
 arch/powerpc/boot/dts/fsl/t4240si-pre.dtsi  |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t4240si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/t4240si-post.dtsi
index f99d74f..001183c 100644
--- a/arch/powerpc/boot/dts/fsl/t4240si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t4240si-post.dtsi
@@ -475,6 +475,7 @@
 
 /include/ "elo3-dma-0.dtsi"
 /include/ "elo3-dma-1.dtsi"
+/include/ "elo3-dma-2.dtsi"
 
 /include/ "qoriq-espi-0.dtsi"
spi@11 {
diff --git a/arch/powerpc/boot/dts/fsl/t4240si-pre.dtsi 
b/arch/powerpc/boot/dts/fsl/t4240si-pre.dtsi
index 0b8ccc5..05fb23b 100644
--- a/arch/powerpc/boot/dts/fsl/t4240si-pre.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t4240si-pre.dtsi
@@ -57,6 +57,7 @@
pci3 = &pci3;
dma0 = &dma0;
dma1 = &dma1;
+   dma2 = &dma2;
sdhc = &sdhc;
};
 
-- 
1.7.6.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

RE: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be, le}_to_cpu() and cpu_to_{be, le}() macros

2014-05-28 Thread David Laight

From: Cody P Schafer
> Rather manually specifying the size of the integer to be converted, key
> off of the type size. Reduces duplicate size info and the occurance of
> certain types of bugs (using the wrong sized conversion).
...
> +#define be_to_cpu(v) \
> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), be16_to_cpu(v), \
> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), be32_to_cpu(v), \
> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), be64_to_cpu(v), \
> + (void)0
...

I'm not at all sure that using the 'size' of the constant will reduce
the number of bugs - it just introduces a whole new category of bugs.

Using the size of the destination might help, but that makes the code
ugly.

Getting one of the static analysers to find the obvious errors is
probably more appropriate.

David

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Kernel 3.15: Boot problems with a PA6T board

2014-05-28 Thread Christian Zigotzky


Hi Michael,

Thank you for your answer and thank you for your help. :-)

On 28.05.2014 06:23, Michael Ellerman wrote:

On Wed, 2014-05-28 at 01:08 +0200, Christian Zigotzky wrote:

Hi Michael,

Thanks a lot for your answer.

...

18a1a7a1d862ae0794a0179473d08a414dd49234 <- It doesn't boot. Error messages: 
Oops: Machine check, sig: 7 [#1] CPU: 1 PID: 1 Comm: swapper/0 not tainted
d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b <- It boots. :-)
0f5a869600141a0d5575e3190af01a050c081b07 <- It boots. :-)
c7e64b9ce04aa2e3fad7396d92b5cb92056d16ac <- It boots. :-)
d3e144532703fe2454b56eddb56f30d2d620187b <- It boots. :-)

I think the machine check is the problem.

Yes I think it is. Do you get any more info, or just that one line?

So I think the latest working commit we have is 
d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b.

I'm going to guess that cd427485357c0c4b99f69719251baacf25946e11 is BAD. Can
you please confirm or deny that?

It's not BAD. It boots.

Rgds,

Christian


Assuming cd42748 is bad, you should do a git bisect between it and 18a1a7a.
That should be a fairly quick bisect. That would be:

$ git bisect start
$ git bisect good d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b
$ git bisect bad  cd427485357c0c4b99f69719251baacf25946e11

If cd42748 is *good*, then you'll need to do a bigger bisect from d8ff9cd to
18a1a7a.


cheers





___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] powerpc/mm: Check paca psize is up to date for huge mappings

2014-05-28 Thread Aneesh Kumar K.V

Michael Ellerman  writes:

> We have a bug in our hugepage handling which exhibits as an infinite
> loop of hash faults. If the fault is being taken in the kernel it will
> typically trigger the softlockup detector, or the RCU stall detector.
>
> The bug is as follows:
>
>  1. mmap(0xa000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
>  2. Slice code converts the slice psize to 16M.
>  3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
> synchronises the mm->context with the paca->context. So the paca slice
> mask is updated to include the 16M slice.
>  3. Either:
> * mmap() fails because there are no huge pages available.
> * mmap() succeeds and the mapping is then munmapped.
> In both cases the slice psize remains at 16M in both the paca & mm.
>  4. mmap(0xa000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
>  5. The slice psize is converted back to 64K. Because of the check on line 539
> of slice.c we DO NOT update the paca->context. The paca slice mask is now
> out of sync with the mm slice mask.
>  6. User/kernel accesses 0xa000.
>  7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
> to create an SLB entry and inserts it in the SLB.
> 18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
> is found so a data access exception is generated.
> 19. The data access handler calls do_page_fault() -> handle_mm_fault().
> 10. __handle_mm_fault() creates a THP mapping with 
> do_huge_pmd_anonymous_page().
> 11. The hardware retries the access, there is still nothing in the hash table
> so once again a data access exception is generated.
> 12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
> hash. Although the THP mapping maps 16M the hashing is done using 64K
> as the segment page size.
> 13. hash_page() returns immediately after calling __hash_page_thp(), skipping
> over the code at line 1125. Resulting in the mismatch between the
> paca->context and mm->context not being detected.
> 14. The hardware retries the access, the hash it generates using the 16M
> SLB entry does NOT match the hash we inserted.
> 15. We take another data access and go into __hash_page_thp().
> 16. We see a valid entry in the hpte_slot_array and so we call updatepp()
> which succeeds.
> 17. Goto 14.
>
> We could fix this in two ways. The first would be to remove or modify
> the check on line 539 of slice.c.
>
> The second option is to cause the check of paca psize in hash_page() on
> line 1125 to also be done for THP pages.
>
> We prefer the latter, because the check & update of the paca psize is
> not done until we know it's necessary. It's also done only on the
> current cpu, so we don't need to IPI all other cpus.
>
> Without further rearranging the code, the simplest fix is to pull out
> the code that checks paca psize and call it in two places. Firstly for
> THP/hugetlb, and secondly for other mappings as before.
>
> Thanks to Dave Jones for trinity, which originally found this bug.
>
> Signed-off-by: Michael Ellerman 

Reviewed-by: Aneesh Kumar K.V 

> ---
>  arch/powerpc/mm/hash_utils_64.c | 31 ---
>  1 file changed, 20 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index d766d6e..6650699 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -960,6 +960,22 @@ void hash_failure_debug(unsigned long ea, unsigned long 
> access,
>   trap, vsid, ssize, psize, lpsize, pte);
>  }
>  
> +static void check_paca_psize(unsigned long ea, struct mm_struct *mm,
> +  int psize, bool user_region)
> +{
> + if (user_region) {
> + if (psize != get_paca_psize(ea)) {
> + get_paca()->context = mm->context;
> + slb_flush_and_rebolt();
> + }
> + } else if (get_paca()->vmalloc_sllp !=
> +mmu_psize_defs[mmu_vmalloc_psize].sllp) {
> + get_paca()->vmalloc_sllp =
> + mmu_psize_defs[mmu_vmalloc_psize].sllp;
> + slb_vmalloc_update();
> + }
> +}
> +
>  /* Result code is:
>   *  0 - handled
>   *  1 - normal page fault
> @@ -1081,6 +1097,8 @@ int hash_page(unsigned long ea, unsigned long access, 
> unsigned long trap)
>   WARN_ON(1);
>   }
>  #endif
> + check_paca_psize(ea, mm, psize, user_region);
> +
>   goto bail;
>   }
>  
> @@ -1121,17 +1139,8 @@ int hash_page(unsigned long ea, unsigned long access, 
> unsigned long trap)
>  #endif
>   }
>   }
> - if (user_region) {
> - if (psize != get_paca_psize(ea)) {
> - get_paca()->context = mm->context;
> - slb_flush_and_rebolt();
> - }
> - } else if (get_paca()->vmalloc_sllp !=
> -mmu_psize_defs

Re: [PATCH 2/2] selftests/powerpc: Test the THP bug we fixed in the previous commit

2014-05-28 Thread Aneesh Kumar K.V

Michael Ellerman  writes:

> Signed-off-by: Michael Ellerman 

Reviewed-by: Aneesh Kumar K.V 

> ---
>  tools/testing/selftests/powerpc/Makefile   |  2 +-
>  tools/testing/selftests/powerpc/mm/Makefile| 18 ++
>  .../selftests/powerpc/mm/hugetlb_vs_thp_test.c | 72 
> ++
>  3 files changed, 91 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/powerpc/mm/Makefile
>  create mode 100644 tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c
>
> diff --git a/tools/testing/selftests/powerpc/Makefile 
> b/tools/testing/selftests/powerpc/Makefile
> index 316194f..b3dbe9e 100644
> --- a/tools/testing/selftests/powerpc/Makefile
> +++ b/tools/testing/selftests/powerpc/Makefile
> @@ -13,7 +13,7 @@ CFLAGS := -Wall -O2 -flto -Wall -Werror 
> -DGIT_VERSION='"$(GIT_VERSION)"' -I$(CUR
>  
>  export CC CFLAGS
>  
> -TARGETS = pmu copyloops
> +TARGETS = pmu copyloops mm
>  
>  endif
>  
> diff --git a/tools/testing/selftests/powerpc/mm/Makefile 
> b/tools/testing/selftests/powerpc/mm/Makefile
> new file mode 100644
> index 000..357ccbd
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/mm/Makefile
> @@ -0,0 +1,18 @@
> +noarg:
> + $(MAKE) -C ../
> +
> +PROGS := hugetlb_vs_thp_test
> +
> +all: $(PROGS)
> +
> +$(PROGS): ../harness.c
> +
> +run_tests: all
> + @-for PROG in $(PROGS); do \
> + ./$$PROG; \
> + done;
> +
> +clean:
> + rm -f $(PROGS)
> +
> +.PHONY: all run_tests clean
> diff --git a/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c 
> b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c
> new file mode 100644
> index 000..3d8e5b0
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c
> @@ -0,0 +1,72 @@
> +#include 
> +#include 
> +#include 
> +
> +#include "utils.h"
> +
> +/* This must match the huge page & THP size */
> +#define SIZE (16 * 1024 * 1024)
> +
> +static int test_body(void)
> +{
> + void *addr;
> + char *p;
> +
> + addr = (void *)0xa000;
> +
> + p = mmap(addr, SIZE, PROT_READ | PROT_WRITE,
> +  MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + if (p != MAP_FAILED) {
> + /*
> +  * Typically the mmap will fail because no huge pages are
> +  * allocated on the system. But if there are huge pages
> +  * allocated the mmap will succeed. That's fine too, we just
> +  * munmap here before continuing.
> +  */
> + munmap(addr, SIZE);
> + }
> +
> + p = mmap(addr, SIZE, PROT_READ | PROT_WRITE,
> +  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + if (p == MAP_FAILED) {
> + printf("Mapping failed @ %p\n", addr);
> + perror("mmap");
> + return 1;
> + }
> +
> + /*
> +  * Either a user or kernel access is sufficient to trigger the bug.
> +  * A kernel access is easier to spot & debug, as it will trigger the
> +  * softlockup or RCU stall detectors, and when the system is kicked
> +  * into xmon we get a backtrace in the kernel.
> +  *
> +  * A good option is:
> +  *  getcwd(p, SIZE);
> +  *
> +  * For the purposes of this testcase it's preferable to spin in
> +  * userspace, so the harness can kill us if we get stuck. That way we
> +  * see a test failure rather than a dead system.
> +  */
> + *p = 0xf;
> +
> + munmap(addr, SIZE);
> +
> + return 0;
> +}
> +
> +static int test_main(void)
> +{
> + int i;
> +
> + /* 10,000 because it's a "bunch", and completes reasonably quickly */
> + for (i = 0; i < 1; i++)
> + if (test_body())
> + return 1;
> +
> + return 0;
> +}
> +
> +int main(void)
> +{
> + return test_harness(test_main, "hugetlb_vs_thp");
> +}
> -- 
> 1.9.1
>
> ___
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
Documentation/vfio.txt  | 92 
-
drivers/vfio/pci/Makefile   |  1 +
drivers/vfio/pci/vfio_pci.c | 20 +---
drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
drivers/vfio/pci/vfio_pci_private.h |  5 ++
drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
include/uapi/linux/vfio.h   | 66 ++
7 files changed, 308 insertions(+), 7 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
{
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,


Sure, that's what patch review is about. I'm really more concerned about 
whose court the number space is in - you or Gavin. If we're talking 
about top level ioctls you will care a lot more.


But I'm not religious about this. You're the VFIO maintainer, so it's 
your call. I just personally cringe when I see an ioctl that gets an 
"opcode" and a "parameter" argument where the "parameter" argument is a 
union with one struct for each opcode.



Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Kernel 3.15: Boot problems with a PA6T board

2014-05-28 Thread Christian Zigotzky


On 28.05.2014 10:53, Christian Zigotzky wrote:

Hi Michael,

Thank you for your answer and thank you for your help. :-)

On 28.05.2014 06:23, Michael Ellerman wrote:

On Wed, 2014-05-28 at 01:08 +0200, Christian Zigotzky wrote:

Hi Michael,

Thanks a lot for your answer.

...

18a1a7a1d862ae0794a0179473d08a414dd49234 <- It doesn't boot. Error 
messages: Oops: Machine check, sig: 7 [#1] CPU: 1 PID: 1 Comm: 
swapper/0 not tainted

d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b <- It boots. :-)
0f5a869600141a0d5575e3190af01a050c081b07 <- It boots. :-)
c7e64b9ce04aa2e3fad7396d92b5cb92056d16ac <- It boots. :-)
d3e144532703fe2454b56eddb56f30d2d620187b <- It boots. :-)

I think the machine check is the problem.

Yes I think it is. Do you get any more info, or just that one line?

So I think the latest working commit we have is 
d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b.


I'm going to guess that cd427485357c0c4b99f69719251baacf25946e11 is 
BAD. Can

you please confirm or deny that?

It's not BAD. It boots.

Rgds,

Christian


Assuming cd42748 is bad, you should do a git bisect between it and 
18a1a7a.

That should be a fairly quick bisect. That would be:

$ git bisect start
$ git bisect good d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b
$ git bisect bad  cd427485357c0c4b99f69719251baacf25946e11

If cd42748 is *good*, then you'll need to do a bigger bisect from 
d8ff9cd to

18a1a7a.

OK :-)

-> git clone 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux2-git

-> git bisect start
-> git bisect good d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b
-> git bisect bad 18a1a7a1d862ae0794a0179473d08a414dd49234

Output:
Bisecting: 5900 revisions left to test after this (roughly 13 steps)
[cb1595563880a81dab6eab9a5ecb4520d2e76077] Merge tag 'tty-3.15-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty


Unfortunately it doesn't boot. :-(



cheers







___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 28.05.14 02:55, Gavin Shan wrote:

On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
  Documentation/vfio.txt  | 92 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c


[...]


diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
  
+/*

+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+   __u32 argsz;
+   __u32 flags;
+   __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE 1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO 2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA3   /* Enable DMA   */

This is more of a "command" than an "option" isn't it?  Each of these
probably needs a more significant description.


Yeah, it would be regarded as "opcode" and I'll add more description about
them in next revision.


Please just call them commands.




+};
+
+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE   1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE  0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+   __u32 argsz;
+   __u32 flags;
+   __u32 state;
+};

Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?


Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The "address"
here means PE address and user has to figure it out according to PE
segmentation.


Why would the user ever need the address?


Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan

On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
>
>On 28.05.14 02:55, Gavin Shan wrote:
>>On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
  Documentation/vfio.txt  | 92 
 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>
>[...]
>
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+/*
+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+   __u32 argsz;
+   __u32 flags;
+   __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE 1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO 2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA3   /* Enable DMA   */
>>>This is more of a "command" than an "option" isn't it?  Each of these
>>>probably needs a more significant description.
>>>
>>Yeah, it would be regarded as "opcode" and I'll add more description about
>>them in next revision.
>
>Please just call them commands.
>

Ok. I guess you want me to change the macro names like this ?

#define VFIO_EEH_CMD_DISABLE0   /* Disable EEH functionality*/
#define VFIO_EEH_CMD_ENABLE 1   /* Enable EEH functionality */
#define VFIO_EEH_CMD_ENABLE_IO  2   /* Enable IO for frozen PE  */
#define VFIO_EEH_CMD_ENABLE_DMA 3   /* Enable DMA for frozen PE */

>>
+};
+
+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE   1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE  0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+   __u32 argsz;
+   __u32 flags;
+   __u32 state;
+};
>>>Should state be a union to better describe the value returned?  What
>>>exactly is the address and why does the user need to know it?  Does this
>>>need user input or could we just return the address and mode regardless?
>>>
>>Ok. I think you want enum (not union) for state. I'll have macros for the
>>state in next revision as I did that for other cases.
>>
>>Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
>>There isn't corresponding ioctl command for host to get address any more
>>because QEMU (user) will have to figure it out by himself. The "address"
>>here means PE address and user has to figure it out according to PE
>>segmentation.
>
>Why would the user ever need the address?
>

I will remove those "address" related macros in next revision because it's
user-level bussiness, not related to host kernel any more.

If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and 
returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without 
supporting
RTAS calls. We don't need care about PE address. 

Thanks,
Gavin

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 28.05.14 14:49, Gavin Shan wrote:

On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:

On 28.05.14 02:55, Gavin Shan wrote:

On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
  Documentation/vfio.txt  | 92 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+/*
+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+   __u32 argsz;
+   __u32 flags;
+   __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE 1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO 2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA3   /* Enable DMA   */

This is more of a "command" than an "option" isn't it?  Each of these
probably needs a more significant description.


Yeah, it would be regarded as "opcode" and I'll add more description about
them in next revision.

Please just call them commands.


Ok. I guess you want me to change the macro names like this ?

#define VFIO_EEH_CMD_DISABLE0   /* Disable EEH functionality*/
#define VFIO_EEH_CMD_ENABLE 1   /* Enable EEH functionality */
#define VFIO_EEH_CMD_ENABLE_IO  2   /* Enable IO for frozen PE  */
#define VFIO_EEH_CMD_ENABLE_DMA 3   /* Enable DMA for frozen PE */


Yes, the ioctl name too.




+};
+
+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE   1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE  0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+   __u32 argsz;
+   __u32 flags;
+   __u32 state;
+};

Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?


Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The "address"
here means PE address and user has to figure it out according to PE
segmentation.

Why would the user ever need the address?


I will remove those "address" related macros in next revision because it's
user-level bussiness, not related to host kernel any more.


Ok, so the next question is whether there will be any state outside of 
GET_MODE left in the future.



Alex


If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and 
returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without 
supporting
RTAS calls. We don't need care about PE address.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___

Re: [PATCH] powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode

2014-05-28 Thread Vivek Goyal

On Tue, May 27, 2014 at 04:25:34PM +0530, Srivatsa S. Bhat wrote:
> If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
> (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
> get the following messages during boot:
> 
> [0.089866] POWER8 performance monitor hardware support registered
> [0.089985] power8-pmu: PMAO restore workaround active.
> [5.095419] Processor 1 is stuck.
> [   10.097933] Processor 2 is stuck.
> [   15.100480] Processor 3 is stuck.
> [   20.102982] Processor 4 is stuck.
> [   25.105489] Processor 5 is stuck.
> [   30.108005] Processor 6 is stuck.
> [   35.110518] Processor 7 is stuck.
> [   40.113369] Processor 9 is stuck.
> [   45.115879] Processor 10 is stuck.
> [   50.118389] Processor 11 is stuck.
> [   55.120904] Processor 12 is stuck.
> [   60.123425] Processor 13 is stuck.
> [   65.125970] Processor 14 is stuck.
> [   70.128495] Processor 15 is stuck.
> [   75.131316] Processor 17 is stuck.
> 
> Note that only the sibling threads are stuck, while the primary threads (0, 8,
> 16 etc) boot just fine. Looking closer at the previous step of kexec, we 
> observe
> that kexec tries to wakeup (bring online) the sibling threads of all the 
> cores,
> before performing kexec:
> 
> [ 9464.131231] Starting new kernel
> [ 9464.148507] kexec: Waking offline cpu 1.
> [ 9464.148552] kexec: Waking offline cpu 2.
> [ 9464.148600] kexec: Waking offline cpu 3.
> [ 9464.148636] kexec: Waking offline cpu 4.
> [ 9464.148671] kexec: Waking offline cpu 5.
> [ 9464.148708] kexec: Waking offline cpu 6.
> [ 9464.148743] kexec: Waking offline cpu 7.
> [ 9464.148779] kexec: Waking offline cpu 9.
> [ 9464.148815] kexec: Waking offline cpu 10.
> [ 9464.148851] kexec: Waking offline cpu 11.
> [ 9464.148887] kexec: Waking offline cpu 12.
> [ 9464.148922] kexec: Waking offline cpu 13.
> [ 9464.148958] kexec: Waking offline cpu 14.
> [ 9464.148994] kexec: Waking offline cpu 15.
> [ 9464.149030] kexec: Waking offline cpu 17.
> 
> Instrumenting this piece of code revealed that the cpu_up() operation actually
> fails with -EBUSY. Thus, only the primary threads of all the cores are online
> during kexec, and hence this is a sure-shot receipe for disaster, as explained
> in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec),
> as well as in the comment above wake_offline_cpus().
> 
> It turns out that cpu_up() was returning -EBUSY because the variable
> 'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was 
> done
> by migrate_to_reboot_cpu() inside kernel_kexec().
> 
> Now, migrate_to_reboot_cpu() was originally written with the assumption that
> any further code will not need to perform CPU hotplug, since we are anyway in
> the reboot path. However, kexec is clearly not such a case, since we depend on
> onlining CPUs, atleast on powerpc.
> 
> So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
> kexec path, to fix this regression in kexec on powerpc.
> 
> Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
> can catch such issues more easily in the future.
> 
> Fixes: c97102ba963 (kexec: migrate to reboot cpu)
> Cc: sta...@vger.kernel.org
> Signed-off-by: Srivatsa S. Bhat 
> ---
> 
>  arch/powerpc/kernel/machine_kexec_64.c |2 +-
>  kernel/kexec.c |8 
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
> b/arch/powerpc/kernel/machine_kexec_64.c
> index 59d229a..879b3aa 100644
> --- a/arch/powerpc/kernel/machine_kexec_64.c
> +++ b/arch/powerpc/kernel/machine_kexec_64.c
> @@ -237,7 +237,7 @@ static void wake_offline_cpus(void)
>   if (!cpu_online(cpu)) {
>   printk(KERN_INFO "kexec: Waking offline cpu %d.\n",
>  cpu);
> - cpu_up(cpu);
> + WARN_ON(cpu_up(cpu));
>   }
>   }
>  }
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c8380ad..28c5706 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1683,6 +1683,14 @@ int kernel_kexec(void)
>   kexec_in_progress = true;
>   kernel_restart_prepare(NULL);
>   migrate_to_reboot_cpu();
> +
> + /*
> +  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> +  * no further code needs to use CPU hotplug (which is true in
> +  * the reboot case). However, the kexec path depends on using
> +  * CPU hotplug again; so re-enable it here.
> +  */
> + cpu_hotplug_enable();
>   printk(KERN_EMERG "Starting new kernel\n");
>   machine_shutdown();

After migrate_to_reboot_cpu(), we are calling machine_shutdown() which
calls disable_nonboot_cpus() and which in turn calls _cpu_down().

So it is kind of odd that we first migrate to boot cpu, and then disable
all non-boot c

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alex Williamson

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
> On 28.05.14 02:57, Alex Williamson wrote:
> > On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
> >> On 28.05.14 02:39, Alex Williamson wrote:
> >>> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>  On 27.05.14 20:15, Alex Williamson wrote:
> > On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >> The patch adds new IOCTL commands for sPAPR VFIO container device
> >> to support EEH functionality for PCI devices, which have been passed
> >> through from host to somebody else via VFIO.
> >>
> >> Signed-off-by: Gavin Shan 
> >> ---
> >> Documentation/vfio.txt  | 92 
> >> -
> >> drivers/vfio/pci/Makefile   |  1 +
> >> drivers/vfio/pci/vfio_pci.c | 20 +---
> >> drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
> >> drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 85 
> >> ++
> >> include/uapi/linux/vfio.h   | 66 ++
> >> 7 files changed, 308 insertions(+), 7 deletions(-)
> >> create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>  [...]
> 
> >> +
> >> +  return ret;
> >> +}
> >> +
> >> static long tce_iommu_ioctl(void *iommu_data,
> >> unsigned int cmd, unsigned long arg)
> >> {
> >> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>tce_iommu_disable(container);
> >>mutex_unlock(&container->lock);
> >>return 0;
> >> +  case VFIO_EEH_PE_SET_OPTION:
> >> +  case VFIO_EEH_PE_GET_STATE:
> >> +  case VFIO_EEH_PE_RESET:
> >> +  case VFIO_EEH_PE_CONFIGURE:
> >> +  return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> > This is where it would have really made sense to have a single
> > VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> > AlexG, are you really attached to splitting these out into separate
> > ioctls?
>  I don't see the problem. We need to forward 4 ioctls to a separate piece
>  of code, so we forward 4 ioctls to a separate piece of code :). Putting
>  them into one ioctl just moves the switch() into another function.
> >>> And uses an extra 3 ioctl numbers and gives us extra things to update if
> >>> we ever need to add more ioctls, etc.  ioctl numbers are an address
> >>> space, how much address space do we really want to give to EEH?  It's
> >>> not a big difference, but I don't think it's completely even either.
> >>> Thanks,
> >> Yes, that's the point. I by far prefer to have you push back on anyone
> >> who introduces useless ioctls rather than have a separate EEH number
> >> space that people can just throw anything in they like ;).
> > Well, I appreciate that, but having them as separate ioctls doesn't
> > really prevent that either.  Any one of these 4 could be set to take a
> > sub-option to extend and contort the EEH interface.  The only way to
> > prevent that would be to avoid the argsz+flags hack that make the ioctl
> > extendable.  Thanks,
> 
> Sure, that's what patch review is about. I'm really more concerned about 
> whose court the number space is in - you or Gavin. If we're talking 
> about top level ioctls you will care a lot more.
> 
> But I'm not religious about this. You're the VFIO maintainer, so it's 
> your call. I just personally cringe when I see an ioctl that gets an 
> "opcode" and a "parameter" argument where the "parameter" argument is a 
> union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate "op"?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alex Williamson

On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
> >On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >> The patch adds new IOCTL commands for sPAPR VFIO container device
> >> to support EEH functionality for PCI devices, which have been passed
> >> through from host to somebody else via VFIO.
> >> 
> >> Signed-off-by: Gavin Shan 
> >> ---
> >>  Documentation/vfio.txt  | 92 
> >> -
> >>  drivers/vfio/pci/Makefile   |  1 +
> >>  drivers/vfio/pci/vfio_pci.c | 20 +---
> >>  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
> >>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
> >> ++
> >>  include/uapi/linux/vfio.h   | 66 ++
> >>  7 files changed, 308 insertions(+), 7 deletions(-)
> >>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> >> 
> >> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >> index b9ca023..d890fed 100644
> >> --- a/Documentation/vfio.txt
> >> +++ b/Documentation/vfio.txt
> >> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented 
> >> in real mode which provides
> >>  an excellent performance which has limitations such as inability to do
> >>  locked pages accounting in real time.
> >>  
> >> -So 3 additional ioctls have been added:
> >> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
> >> I/O
> >> +subtree that can be treated as a unit for the purposes of partitioning and
> >> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
> >> +function of a multi-function IOA, or multiple IOAs (possibly including 
> >> switch
> >> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI 
> >> errors
> >> +and recover from them via EEH RTAS services, which works on the basis of
> >> +additional ioctl commands.
> >> +
> >> +So 7 additional ioctls have been added:
> >>  
> >>VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
> >>of the DMA window on the PCI bus.
> >> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
> >>  
> >>VFIO_IOMMU_DISABLE - disables the container.
> >>  
> >> +  VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
> >> +  specified device. Also, it can be used to remove IO or DMA
> >> +  stopped state on the frozen PE.
> >> +
> >> +  VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
> >> +
> >> +  VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
> >> +  error recovering.
> >> +
> >> +  VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
> >> +  one of the major steps for error recoverying.
> >>  
> >>  The code flow from the example above should be slightly changed:
> >>  
> >> @@ -346,6 +365,77 @@ The code flow from the example above should be 
> >> slightly changed:
> >>ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> >>.
> >>  
> >> +Based on the initial example we have, the following piece of code could be
> >> +reference for EEH setup and error handling:
> >> +
> >> +  struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
> >> +  struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
> >> +  struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
> >> +  struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
> >> +
> >> +  
> >> +
> >> +  /* Get a file descriptor for the device */
> >> +  device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, ":06:0d.0");
> >> +
> >> +  /* Enable the EEH functionality on the device */
> >> +  option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
> >> +  ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
> >> +
> >> +  /* You're suggested to create additional data struct to represent
> >> +   * PE, and put child devices belonging to same IOMMU group to the
> >> +   * PE instance for later reference.
> >> +   */
> >> +
> >> +  /* Check the PE's state and make sure it's in functional state */
> >> +  ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> >> +
> >> +  /* Save device's state. pci_save_state() would be good enough
> >> +   * as an example.
> >> +   */
> >> +
> >> +  /* Test and setup the device */
> >> +  ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> >> +
> >> +  
> >> +
> >> +  /* When 0xFF's returned from reading PCI config space or IO BARs
> >> +   * of the PCI device. Check the PE state to see if that has been
> >> +   * frozen.
> >> +   */
> >> +  ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> >> +
> >> +  /* Waiting for pending PCI transactions to be completed and don't
> >> +   * produce any more PCI traffic from/to the affected PE until
> >> +   * recovery is finished.
> >> +   */
> >> +
> >> +  /* Enable IO for the affected PE and collect logs. Usually

Re: [PATCH] tty/hvc/hvc_console: Fix wakeup of HVC thread on hvc_kick()

2014-05-28 Thread Greg Kroah-Hartman

On Wed, May 28, 2014 at 10:01:30AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2014-05-23 at 19:41 +1000, Benjamin Herrenschmidt wrote:
> > Some backends call hvc_kick() to wakeup the HVC thread from its
> > slumber upon incoming characters. This however doesn't work
> > properly because it uses msleep_interruptible() which is mostly
> > immune to wake_up_process(). It will basically go back to sleep
> > until the timeout is expired (only signals can really wake it).
> >
> > Replace it with a simple shedule_timeout_interruptible() instead,
> > which may wakeup earlier every now and then but we really don't
> > care in this case.
> 
> Nobody commented ? :-)
> 
> Greg, do you want to take this in the tty tree or can I stick it in
> powerpc ?

I'll take it, thanks.

greg k-h
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: NUMA topology question wrt. d4edc5b6

2014-05-28 Thread Nishanth Aravamudan

On 23.05.2014 [02:18:05 +0530], Srivatsa S. Bhat wrote:
> 
> [ Adding a few more CC's ]
> 
> On 05/22/2014 01:34 AM, Nishanth Aravamudan wrote:
> > Hi Srivatsa,
> > 
> > After d4edc5b6 ("powerpc: Fix the setup of CPU-to-Node mappings during
> > CPU online"), cpu_to_node() looks like:
> > 
> > static inline int cpu_to_node(int cpu)
> > {
> > int nid;
> > 
> > nid = numa_cpu_lookup_table[cpu];
> > 
> > /*
> >  * During early boot, the numa-cpu lookup table might not have been
> >  * setup for all CPUs yet. In such cases, default to node 0.
> >  */
> > return (nid < 0) ? 0 : nid;
> > }
> > 
> > However, I'm curious if this is correct in all cases. I have seen
> > several LPARs that do not have any CPUs on node 0. In fact, because node
> > 0 is statically set online in the initialization of the N_ONLINE
> > nodemask, 0 is always present to Linux, whether it is present on the
> > system. I'm not sure what the best thing to do here is, but I'm curious
> > if you have any ideas? I would like to remove the static initialization
> > of node 0, as it's confusing to users to see an empty node (particularly
> > when it's completely separate in the numbering from other nodes), but
> > we trip a panic (refer to:
> > http://www.spinics.net/lists/linux-mm/msg73321.html).
> > 
> 
> Ah, I see. I didn't have any particular reason to default it to zero.
> I just did that because the existing code before this patch did the same
> thing. (numa_cpu_lookup_table[] is a global array, so it will be initialized
> with zeros. So if we access it before populating it via numa_setup_cpu(),
> it would return 0. So I retained that behaviour with the above conditional).

Ok, that seems reasonable to me (keeping the behavior the same as it was
before).

> Will something like the below [totally untested] patch solve the boot-panic?
> I understand that as of today first_online_node will still pick 0 since
> N_ONLINE is initialized statically, but with your proposed change to that
> init code, I guess the following patch should avoid the boot panic.
> 
> [ But note that first_online_node is hard-coded to 0, if MAX_NUMNODES is = 1.
> So we'll have to fix that if powerpc can have a single node system whose node
> is numbered something other than 0. Can that happen as well? ]

I think all single-node systems are only Node 0, but I'm not 100% on
that.

> And regarding your question about what is the best way to fix this
> whole Linux MM's assumption about node0, I'm not really sure.. since I
> am not really aware of the extent to which the MM subsystem is
> intertwined with this assumption and what it would take to cure that
> :-(

Well, at this point, it might be fine to just leave it alone, as it
seems to be more trouble than it's worth -- and really the only
confusion is on those LPARs where there really isn't a Node 0. I'll take
another look later this week.

Thanks,
Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt

On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
> 
> I will remove those "address" related macros in next revision because it's
> user-level bussiness, not related to host kernel any more.
> 
> If the user is QEMU + guest, we need the address to identify the PE though PHB
> BUID could be used as same purpose to get PHB, which is one-to-one mapping 
> with
> IOMMU group on sPAPR platform. However, once the PE address is built and 
> returned
> to guest, guest will use the PE address as input parameter in subsequent RTAS
> calls.
> 
> If the user is some kind of application who just uses the ioctl() without 
> supporting
> RTAS calls. We don't need care about PE address. 

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?

It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be, le}_to_cpu() and cpu_to_{be, le}() macros

2014-05-28 Thread Cody P Schafer

On Wed, May 28, 2014 at 3:45 AM, David Laight  wrote:
> From: Cody P Schafer
>> Rather manually specifying the size of the integer to be converted, key
>> off of the type size. Reduces duplicate size info and the occurance of
>> certain types of bugs (using the wrong sized conversion).
> ...
>> +#define be_to_cpu(v) \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), be16_to_cpu(v), \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), be32_to_cpu(v), \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), be64_to_cpu(v), \
>> + (void)0
> ...
>
> I'm not at all sure that using the 'size' of the constant will reduce
> the number of bugs - it just introduces a whole new category of bugs.

Certainly, if you mis-size the argument (and thus have missized one of
the variables containing the be value, probably a bug anyhow), there
will be problems.

I put this interface together because of an actual bug I wrote into
the initial code of the hv_24x7 driver (resized a struct member
without adjusting the be*_to_cpu() sizing).
Having this "auto sizing" macro means I can avoid encoding the size of
a struct field in multiple places.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be,le}_to_cpu() and cpu_to_{be,le}() macros

2014-05-28 Thread Cody P Schafer

On Tue, May 27, 2014 at 7:44 PM, Joe Perches  wrote:
> On Tue, 2014-05-27 at 17:22 -0700, Cody P Schafer wrote:
>> Rather manually specifying the size of the integer to be converted, key
>> off of the type size. Reduces duplicate size info and the occurance of
>> certain types of bugs (using the wrong sized conversion).
> []
>> diff --git a/include/linux/byteorder.h b/include/linux/byteorder.h
> []
>> @@ -0,0 +1,34 @@
>> +#ifndef LINUX_BYTEORDER_H_
>> +#define LINUX_BYTEORDER_H_
>> +
>> +#include 
>> +
>> +#define be_to_cpu(v) \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), be16_to_cpu(v), \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), be32_to_cpu(v), \
>> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), be64_to_cpu(v), \
>> + (void)0
>
> probably better to use BUILD_BUG instead of these 0 returns
>

They aren't 0 returns.

$ echo "int main(void) { int x = (void)0; return x; }" | gcc -x c -
: In function ‘main’:
:1:26: error: void value not ignored as it ought to be
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be, le}_to_cpu() and cpu_to_{be, le}() macros

2014-05-28 Thread Cody P Schafer

On Wed, May 28, 2014 at 5:05 PM, Cody P Schafer  wrote:
> On Wed, May 28, 2014 at 3:45 AM, David Laight  wrote:
>> From: Cody P Schafer
>>> Rather manually specifying the size of the integer to be converted, key
>>> off of the type size. Reduces duplicate size info and the occurance of
>>> certain types of bugs (using the wrong sized conversion).
>> ...
>>> +#define be_to_cpu(v) \
>>> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
>>> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), be16_to_cpu(v), \
>>> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), be32_to_cpu(v), \
>>> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), be64_to_cpu(v), \
>>> + (void)0
>> ...
>>
>> I'm not at all sure that using the 'size' of the constant will reduce
>> the number of bugs - it just introduces a whole new category of bugs.
>
> Certainly, if you mis-size the argument (and thus have missized one of
> the variables containing the be value, probably a bug anyhow), there
> will be problems.
>
> I put this interface together because of an actual bug I wrote into
> the initial code of the hv_24x7 driver (resized a struct member
> without adjusting the be*_to_cpu() sizing).
> Having this "auto sizing" macro means I can avoid encoding the size of
> a struct field in multiple places.

To clarify, the point I'm making here is that this simply cuts out 1
more place we can screw up endianness conversion sizing.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc/perf: Never program book3s PMCs with values >= 0x80000000

2014-05-28 Thread Anton Blanchard

We are seeing a lot of PMU warnings on POWER8:

Can't find PMC that caused IRQ

Looking closer, the active PMC is 0 at this point and we took a PMU
exception on the transition from negative to 0. Some versions of POWER8
have an issue where they edge detect and not level detect PMC overflows.

A number of places program the PMC with (0x8000 - period_left),
where period_left can be negative. We can either fix all of these or
just ensure that period_left is always >= 1.

This patch takes the second option.

Cc: 
Signed-off-by: Anton Blanchard 
---

Attached is a graph that shows how this bug also causes problems in the
initial period calculation for samples. Once fixed we converge on
the correct period very quickly.

Index: b/arch/powerpc/perf/core-book3s.c
===
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -996,7 +996,22 @@ static void power_pmu_read(struct perf_e
} while (local64_cmpxchg(&event->hw.prev_count, prev, val) != prev);
 
local64_add(delta, &event->count);
-   local64_sub(delta, &event->hw.period_left);
+
+   /*
+* A number of places program the PMC with (0x8000 - period_left).
+* We never want period_left to be less than 1 because we will program
+* the PMC with a value >= 0x8 and an edge detected PMC will
+* roll around to 0 before taking an exception. We have seen this
+* on POWER8.
+*
+* To fix this, clamp the minimum value of period_left to 1.
+*/
+   do {
+   prev = local64_read(&event->hw.period_left);
+   val = prev - delta;
+   if (val < 1)
+   val = 1;
+   } while (local64_cmpxchg(&event->hw.period_left, prev, val) != prev);
 }
 
 /*
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 28.05.14 18:17, Alex Williamson wrote:

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:

On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
 Documentation/vfio.txt  | 92 
-
 drivers/vfio/pci/Makefile   |  1 +
 drivers/vfio/pci/vfio_pci.c | 20 +---
 drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
 include/uapi/linux/vfio.h   | 66 ++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,

Sure, that's what patch review is about. I'm really more concerned about
whose court the number space is in - you or Gavin. If we're talking
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's
your call. I just personally cringe when I see an ioctl that gets an
"opcode" and a "parameter" argument where the "parameter" argument is a
union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate "op"?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,


Yup, that looks well to me as well :)


Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 28.05.14 23:58, Benjamin Herrenschmidt wrote:

On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:

I will remove those "address" related macros in next revision because it's
user-level bussiness, not related to host kernel any more.

If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and 
returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without 
supporting
RTAS calls. We don't need care about PE address.

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?


Right, but that's pure QEMU internals. From the VFIO kernel interface's 
point of view, a VFIO group is a PE context, as that's what gets IOMMU 
controlled together too.



It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.


I don't think so :). In QEMU the PHB emulation would have to notify the 
"container" (IOMMU emulation layer -> PE) that a PE operation happened. 
It's that emulation code's responsibility to broadcast operations across 
its own emulated operations (recover config space access, reconfigure 
BARs, etc) and the VFIO PE operations.


So from a kernel interface point of view, I think leaving out any 
address information is the right way to go. Whether we managed to get 
all QEMU internal interfaces modeled correctly yet has to be seen on the 
next patch set revision :).



Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be, le}_to_cpu() and cpu_to_{be, le}() macros

2014-05-28 Thread Joe Perches

On Wed, 2014-05-28 at 17:11 -0500, Cody P Schafer wrote:
> On Wed, May 28, 2014 at 5:05 PM, Cody P Schafer  wrote:
> > On Wed, May 28, 2014 at 3:45 AM, David Laight  
> > wrote:
> >> From: Cody P Schafer
> >>> Rather manually specifying the size of the integer to be converted, key
> >>> off of the type size. Reduces duplicate size info and the occurance of
> >>> certain types of bugs (using the wrong sized conversion).
> >> ...
> >>> +#define be_to_cpu(v) \
> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), 
> >>> be16_to_cpu(v), \
> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), 
> >>> be32_to_cpu(v), \
> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), 
> >>> be64_to_cpu(v), \
> >>> + (void)0
> >> ...
> >>
> >> I'm not at all sure that using the 'size' of the constant will reduce
> >> the number of bugs - it just introduces a whole new category of bugs.
> >
> > Certainly, if you mis-size the argument (and thus have missized one of
> > the variables containing the be value, probably a bug anyhow), there
> > will be problems.
> >
> > I put this interface together because of an actual bug I wrote into
> > the initial code of the hv_24x7 driver (resized a struct member
> > without adjusting the be*_to_cpu() sizing).
> > Having this "auto sizing" macro means I can avoid encoding the size of
> > a struct field in multiple places.
> 
> To clarify, the point I'm making here is that this simply cuts out 1
> more place we can screw up endianness conversion sizing.

It does screw up other types when you do things like:

u8 foo = some_function();

cpu_to_be(foo + 1);

the return value is sizeof(int) not u8


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan

On Wed, May 28, 2014 at 03:12:35PM +0200, Alexander Graf wrote:
>
>On 28.05.14 14:49, Gavin Shan wrote:
>>On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
>>>On 28.05.14 02:55, Gavin Shan wrote:
On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>to support EEH functionality for PCI devices, which have been passed
>>through from host to somebody else via VFIO.
>>
>>Signed-off-by: Gavin Shan 
>>---
>>  Documentation/vfio.txt  | 92 
>> -
>>  drivers/vfio/pci/Makefile   |  1 +
>>  drivers/vfio/pci/vfio_pci.c | 20 +---
>>  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
>>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
>> ++
>>  include/uapi/linux/vfio.h   | 66 ++
>>  7 files changed, 308 insertions(+), 7 deletions(-)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>[...]
>>>
>>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>index cb9023d..c5fac36 100644
>>--- a/include/uapi/linux/vfio.h
>>+++ b/include/uapi/linux/vfio.h
>>@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO   _IO(VFIO_TYPE, VFIO_BASE + 12)
>>+/*
>>+ * EEH functionality can be enabled or disabled on one specific device.
>>+ * Also, the DMA or IO frozen state can be removed from the frozen PE
>>+ * if required.
>>+ */
>>+struct vfio_eeh_pe_set_option {
>>+ __u32 argsz;
>>+ __u32 flags;
>>+ __u32 option;
>>+#define VFIO_EEH_PE_SET_OPT_DISABLE  0   /* Disable EEH  */
>>+#define VFIO_EEH_PE_SET_OPT_ENABLE   1   /* Enable EEH   */
>>+#define VFIO_EEH_PE_SET_OPT_IO   2   /* Enable IO*/
>>+#define VFIO_EEH_PE_SET_OPT_DMA  3   /* Enable DMA   */
>This is more of a "command" than an "option" isn't it?  Each of these
>probably needs a more significant description.
>
Yeah, it would be regarded as "opcode" and I'll add more description about
them in next revision.
>>>Please just call them commands.
>>>
>>Ok. I guess you want me to change the macro names like this ?
>>
>>#define VFIO_EEH_CMD_DISABLE  0   /* Disable EEH functionality*/
>>#define VFIO_EEH_CMD_ENABLE   1   /* Enable EEH functionality */
>>#define VFIO_EEH_CMD_ENABLE_IO2   /* Enable IO for frozen PE  
>>*/
>>#define VFIO_EEH_CMD_ENABLE_DMA   3   /* Enable DMA for frozen PE 
>>*/
>
>Yes, the ioctl name too.
>

Ok. Thanks. I will also to rename those "option" / "command" related macros
to "VFIO_EEH_CMD_" in next revision.

>>
>>+};
>>+
>>+#define VFIO_EEH_PE_SET_OPTION   _IO(VFIO_TYPE, VFIO_BASE + 21)
>>+
>>+/*
>>+ * Each EEH PE should have unique address to be identified. PE's
>>+ * sharing mode is also useful information as well.
>>+ */
>>+#define VFIO_EEH_PE_GET_ADDRESS  0   /* Get address  */
>>+#define VFIO_EEH_PE_GET_MODE 1   /* Query mode   */
>>+#define VFIO_EEH_PE_MODE_NONE0   /* Not a PE */
>>+#define VFIO_EEH_PE_MODE_NOT_SHARED  1   /* Exclusive*/
>>+#define VFIO_EEH_PE_MODE_SHARED  2   /* Shared mode  */
>>+
>>+/*
>>+ * EEH PE might have been frozen because of PCI errors. Also, it might
>>+ * be experiencing reset for error revoery. The following command helps
>>+ * to get the state.
>>+ */
>>+struct vfio_eeh_pe_get_state {
>>+ __u32 argsz;
>>+ __u32 flags;
>>+ __u32 state;
>>+};
>Should state be a union to better describe the value returned?  What
>exactly is the address and why does the user need to know it?  Does this
>need user input or could we just return the address and mode regardless?
>
Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The "address"
here means PE address and user has to figure it out according to PE
segmentation.
>>>Why would the user ever need the address?
>>>
>>I will remove those "address" related macros in next revision because it's
>>user-level bussiness, not related to host kernel any more.
>
>Ok, so the next question is whether there will be any state outside
>of GET_MODE left in the future.
>

That's also user-level business and those macros should be removed as well :-)

Thanks,
Gav

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt

On Thu, 2014-05-29 at 00:46 +0200, Alexander Graf wrote:
> I don't think so :). In QEMU the PHB emulation would have to notify the 
> "container" (IOMMU emulation layer -> PE) that a PE operation happened. 
> It's that emulation code's responsibility to broadcast operations across 
> its own emulated operations (recover config space access, reconfigure 
> BARs, etc) and the VFIO PE operations.
> 
> So from a kernel interface point of view, I think leaving out any 
> address information is the right way to go. Whether we managed to get 
> all QEMU internal interfaces modeled correctly yet has to be seen on the 
> next patch set revision :).

For a kernel interface I absolutely agree, all we care about is the group fd.

From a qemu internal, I suppose it makes things easier to operate at the PHB
level for now and we can revisit later if the need arises.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 11/16] byteorder: provide a linux/byteorder.h with {be, le}_to_cpu() and cpu_to_{be, le}() macros

2014-05-28 Thread Cody P Schafer

On Wed, May 28, 2014 at 6:00 PM, Joe Perches  wrote:
> On Wed, 2014-05-28 at 17:11 -0500, Cody P Schafer wrote:
>> On Wed, May 28, 2014 at 5:05 PM, Cody P Schafer  wrote:
>> > On Wed, May 28, 2014 at 3:45 AM, David Laight  
>> > wrote:
>> >> From: Cody P Schafer
>> >>> Rather manually specifying the size of the integer to be converted, key
>> >>> off of the type size. Reduces duplicate size info and the occurance of
>> >>> certain types of bugs (using the wrong sized conversion).
>> >> ...
>> >>> +#define be_to_cpu(v) \
>> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint8_t) , v, \
>> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint16_t), 
>> >>> be16_to_cpu(v), \
>> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint32_t), 
>> >>> be32_to_cpu(v), \
>> >>> + __builtin_choose_expr(sizeof(v) == sizeof(uint64_t), 
>> >>> be64_to_cpu(v), \
>> >>> + (void)0
>> >> ...
>> >>
>> >> I'm not at all sure that using the 'size' of the constant will reduce
>> >> the number of bugs - it just introduces a whole new category of bugs.
>> >
>> > Certainly, if you mis-size the argument (and thus have missized one of
>> > the variables containing the be value, probably a bug anyhow), there
>> > will be problems.
>> >
>> > I put this interface together because of an actual bug I wrote into
>> > the initial code of the hv_24x7 driver (resized a struct member
>> > without adjusting the be*_to_cpu() sizing).
>> > Having this "auto sizing" macro means I can avoid encoding the size of
>> > a struct field in multiple places.
>>
>> To clarify, the point I'm making here is that this simply cuts out 1
>> more place we can screw up endianness conversion sizing.
>
> It does screw up other types when you do things like:
>
> u8 foo = some_function();
>
> cpu_to_be(foo + 1);
>
> the return value is sizeof(int) not u8

Yep, that is a very good argument against the cpu_to_{be,le}()
variants. It might make sense to remove them and just have the
{be,le}_to_cpu() ones.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 1/2] powerpc: numa: enable USE_PERCPU_NUMA_NODE_ID

2014-05-28 Thread Andrew Morton

On Tue, 27 May 2014 17:09:58 -0700 Nishanth Aravamudan 
 wrote:

> On 28.05.2014 [09:56:14 +1000], Benjamin Herrenschmidt wrote:
> > On Tue, 2014-05-27 at 16:44 -0700, Nishanth Aravamudan wrote:
> > > > Signed-off-by: Nishanth Aravamudan 
> > > 
> > > Ping on this and patch 2/2. Ben, would you be willing to pull these
> > > into
> > > your -next branch so they'd get some testing?
> > > 
> > > http://patchwork.ozlabs.org/patch/350368/
> > > http://patchwork.ozlabs.org/patch/349838/
> > > 
> > > Without any further changes, these two help quite a bit with the slab
> > > consumption on CONFIG_SLUB kernels when memoryless nodes are present.
> > 
> > I don't mind at all :-) I haven't really been following that story
> > so I was waiting for the dust to settle and maybe acks from MM people
> > but if you tell me they are good I'm prepared to trust you.
> 
> The patches themselves are pretty minimal and similar to the ia64
> changes (and the affected code seems like it hasn't changed in some
> time, beyond in the common code). I'd mostly like to get some
> broad-range build & boot testing.
> 
> Also, is NUMA a sufficient symbol to depend, you think? I figure most of
> the PPC NUMA systems are the pSeries/IBM variety, which is where I've
> run into memoryless nodes in the first place.

It's a powerpc-only patchset so I didn't do anything.

Nits:

- When referring to git commits, use 12 digits of hash and include
  the name of the patch as well (because patches can have different
  hashes in different trees).  So

   Based on 3bccd996276b ("numa: ia64: use generic percpu
   var numa_node_id() implementation") for ia64.

- It's nice to include a diffstat.  If there's a [0/n] patch then
  that's a great place for it, so people can see the overall impact at
  a glance.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan

On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
>On 28.05.14 18:17, Alex Williamson wrote:
>>On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>>>On 28.05.14 02:57, Alex Williamson wrote:
On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>On 28.05.14 02:39, Alex Williamson wrote:
>>On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>On 27.05.14 20:15, Alex Williamson wrote:
On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>The patch adds new IOCTL commands for sPAPR VFIO container device
>to support EEH functionality for PCI devices, which have been passed
>through from host to somebody else via VFIO.
>
>Signed-off-by: Gavin Shan 
>---
> Documentation/vfio.txt  | 92 
> -
> drivers/vfio/pci/Makefile   |  1 +
> drivers/vfio/pci/vfio_pci.c | 20 +---
> drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
> drivers/vfio/pci/vfio_pci_private.h |  5 ++
> drivers/vfio/vfio_iommu_spapr_tce.c | 85 
> ++
> include/uapi/linux/vfio.h   | 66 
> ++
> 7 files changed, 308 insertions(+), 7 deletions(-)
> create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>[...]
>>>
>+
>+  return ret;
>+}
>+
> static long tce_iommu_ioctl(void *iommu_data,
>unsigned int cmd, unsigned long arg)
> {
>@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>   tce_iommu_disable(container);
>   mutex_unlock(&container->lock);
>   return 0;
>+  case VFIO_EEH_PE_SET_OPTION:
>+  case VFIO_EEH_PE_GET_STATE:
>+  case VFIO_EEH_PE_RESET:
>+  case VFIO_EEH_PE_CONFIGURE:
>+  return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?
>>>I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>them into one ioctl just moves the switch() into another function.
>>And uses an extra 3 ioctl numbers and gives us extra things to update if
>>we ever need to add more ioctls, etc.  ioctl numbers are an address
>>space, how much address space do we really want to give to EEH?  It's
>>not a big difference, but I don't think it's completely even either.
>>Thanks,
>Yes, that's the point. I by far prefer to have you push back on anyone
>who introduces useless ioctls rather than have a separate EEH number
>space that people can just throw anything in they like ;).
Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,
>>>Sure, that's what patch review is about. I'm really more concerned about
>>>whose court the number space is in - you or Gavin. If we're talking
>>>about top level ioctls you will care a lot more.
>>>
>>>But I'm not religious about this. You're the VFIO maintainer, so it's
>>>your call. I just personally cringe when I see an ioctl that gets an
>>>"opcode" and a "parameter" argument where the "parameter" argument is a
>>>union with one struct for each opcode.
>>Well, what would it look like...
>>
>>struct vfio_eeh_pe_op {
>>  __u32 argsz;
>>  __u32 flags;
>>  __u32 op;
>>};
>>
>>Couldn't every single one of these be a separate "op"?  Are there any
>>cases where we can't use the ioctl return value?
>>
>>VFIO_EEH_PE_DISABLE
>>VFIO_EEH_PE_ENABLE
>>VFIO_EEH_PE_UNFREEZE_IO
>>VFIO_EEH_PE_UNFREEZE_DMA
>>VFIO_EEH_PE_GET_MODE
>>VFIO_EEH_PE_RESET_DEACTIVATE
>>VFIO_EEH_PE_RESET_HOT
>>VFIO_EEH_PE_RESET_FUNDAMENTAL
>>VFIO_EEH_PE_CONFIGURE
>>
>>It doesn't look that bad to me, what am I missing?  Thanks,
>
>Yup, that looks well to me as well :)
>

s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.

I'll include this in next revision. Thanks, Alex.

Thanks,
Gavin

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf



On 29.05.14 01:37, Gavin Shan wrote:

On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:

On 28.05.14 18:17, Alex Williamson wrote:

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:

On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan 
---
 Documentation/vfio.txt  | 92 
-
 drivers/vfio/pci/Makefile   |  1 +
 drivers/vfio/pci/vfio_pci.c | 20 +---
 drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
 include/uapi/linux/vfio.h   | 66 ++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,

Sure, that's what patch review is about. I'm really more concerned about
whose court the number space is in - you or Gavin. If we're talking
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's
your call. I just personally cringe when I see an ioctl that gets an
"opcode" and a "parameter" argument where the "parameter" argument is a
union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate "op"?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Yup, that looks well to me as well :)


s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.

I'll include this in next revision. Thanks, Alex.


Yup, no need for CMD anymore then either :)


Alex

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan

On Thu, May 29, 2014 at 01:38:46AM +0200, Alexander Graf wrote:
>On 29.05.14 01:37, Gavin Shan wrote:
>>On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
>>>On 28.05.14 18:17, Alex Williamson wrote:
On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>On 28.05.14 02:57, Alex Williamson wrote:
>>On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>>>On 28.05.14 02:39, Alex Williamson wrote:
On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>On 27.05.14 20:15, Alex Williamson wrote:
>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>>to support EEH functionality for PCI devices, which have been passed
>>>through from host to somebody else via VFIO.
>>>
>>>Signed-off-by: Gavin Shan 
>>>---
>>> Documentation/vfio.txt  | 92 
>>> -
>>> drivers/vfio/pci/Makefile   |  1 +
>>> drivers/vfio/pci/vfio_pci.c | 20 +---
>>> drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
>>> drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>> drivers/vfio/vfio_iommu_spapr_tce.c | 85 
>>> ++
>>> include/uapi/linux/vfio.h   | 66 
>>> ++
>>> 7 files changed, 308 insertions(+), 7 deletions(-)
>>> create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>[...]
>
>>>+
>>>+return ret;
>>>+}
>>>+
>>> static long tce_iommu_ioctl(void *iommu_data,
>>>  unsigned int cmd, unsigned 
>>> long arg)
>>> {
>>>@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>> tce_iommu_disable(container);
>>> mutex_unlock(&container->lock);
>>> return 0;
>>>+case VFIO_EEH_PE_SET_OPTION:
>>>+case VFIO_EEH_PE_GET_STATE:
>>>+case VFIO_EEH_PE_RESET:
>>>+case VFIO_EEH_PE_CONFIGURE:
>>>+return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>This is where it would have really made sense to have a single
>>VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>AlexG, are you really attached to splitting these out into separate
>>ioctls?
>I don't see the problem. We need to forward 4 ioctls to a separate 
>piece
>of code, so we forward 4 ioctls to a separate piece of code :). Putting
>them into one ioctl just moves the switch() into another function.
And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,
>>>Yes, that's the point. I by far prefer to have you push back on anyone
>>>who introduces useless ioctls rather than have a separate EEH number
>>>space that people can just throw anything in they like ;).
>>Well, I appreciate that, but having them as separate ioctls doesn't
>>really prevent that either.  Any one of these 4 could be set to take a
>>sub-option to extend and contort the EEH interface.  The only way to
>>prevent that would be to avoid the argsz+flags hack that make the ioctl
>>extendable.  Thanks,
>Sure, that's what patch review is about. I'm really more concerned about
>whose court the number space is in - you or Gavin. If we're talking
>about top level ioctls you will care a lot more.
>
>But I'm not religious about this. You're the VFIO maintainer, so it's
>your call. I just personally cringe when I see an ioctl that gets an
>"opcode" and a "parameter" argument where the "parameter" argument is a
>union with one struct for each opcode.
Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate "op"?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,
>>>Yup, that looks well to me as well :)
>>>
>>s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.
>>
>>I'll include this in next revision. Thanks, Alex.
>
>Yup, no need for CMD anymore then eit

Re: [RFC PATCH v2 1/2] powerpc: numa: enable USE_PERCPU_NUMA_NODE_ID

2014-05-28 Thread Nishanth Aravamudan

On 28.05.2014 [16:36:07 -0700], Andrew Morton wrote:
> On Tue, 27 May 2014 17:09:58 -0700 Nishanth Aravamudan 
>  wrote:
> 
> > On 28.05.2014 [09:56:14 +1000], Benjamin Herrenschmidt wrote:
> > > On Tue, 2014-05-27 at 16:44 -0700, Nishanth Aravamudan wrote:
> > > > > Signed-off-by: Nishanth Aravamudan 
> > > > 
> > > > Ping on this and patch 2/2. Ben, would you be willing to pull these
> > > > into
> > > > your -next branch so they'd get some testing?
> > > > 
> > > > http://patchwork.ozlabs.org/patch/350368/
> > > > http://patchwork.ozlabs.org/patch/349838/
> > > > 
> > > > Without any further changes, these two help quite a bit with the slab
> > > > consumption on CONFIG_SLUB kernels when memoryless nodes are present.
> > > 
> > > I don't mind at all :-) I haven't really been following that story
> > > so I was waiting for the dust to settle and maybe acks from MM people
> > > but if you tell me they are good I'm prepared to trust you.
> > 
> > The patches themselves are pretty minimal and similar to the ia64
> > changes (and the affected code seems like it hasn't changed in some
> > time, beyond in the common code). I'd mostly like to get some
> > broad-range build & boot testing.
> > 
> > Also, is NUMA a sufficient symbol to depend, you think? I figure most of
> > the PPC NUMA systems are the pSeries/IBM variety, which is where I've
> > run into memoryless nodes in the first place.
> 
> It's a powerpc-only patchset so I didn't do anything.
> 
> Nits:
> 
> - When referring to git commits, use 12 digits of hash and include
>   the name of the patch as well (because patches can have different
>   hashes in different trees).  So
> 
>Based on 3bccd996276b ("numa: ia64: use generic percpu
>var numa_node_id() implementation") for ia64.
> 
> - It's nice to include a diffstat.  If there's a [0/n] patch then
>   that's a great place for it, so people can see the overall impact at
>   a glance.

Thanks for the notes, I'll include those in any updated patches.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan

On Wed, May 28, 2014 at 10:32:11AM -0600, Alex Williamson wrote:
>On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
>> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>> >On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>> >> The patch adds new IOCTL commands for sPAPR VFIO container device
>> >> to support EEH functionality for PCI devices, which have been passed
>> >> through from host to somebody else via VFIO.
>> >> 
>> >> Signed-off-by: Gavin Shan 
>> >> ---
>> >>  Documentation/vfio.txt  | 92 
>> >> -
>> >>  drivers/vfio/pci/Makefile   |  1 +
>> >>  drivers/vfio/pci/vfio_pci.c | 20 +---
>> >>  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
>> >>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
>> >> ++
>> >>  include/uapi/linux/vfio.h   | 66 ++
>> >>  7 files changed, 308 insertions(+), 7 deletions(-)
>> >>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>> >> 
>> >> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> >> index b9ca023..d890fed 100644
>> >> --- a/Documentation/vfio.txt
>> >> +++ b/Documentation/vfio.txt
>> >> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented 
>> >> in real mode which provides
>> >>  an excellent performance which has limitations such as inability to do
>> >>  locked pages accounting in real time.
>> >>  
>> >> -So 3 additional ioctls have been added:
>> >> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
>> >> I/O
>> >> +subtree that can be treated as a unit for the purposes of partitioning 
>> >> and
>> >> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), 
>> >> a
>> >> +function of a multi-function IOA, or multiple IOAs (possibly including 
>> >> switch
>> >> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI 
>> >> errors
>> >> +and recover from them via EEH RTAS services, which works on the basis of
>> >> +additional ioctl commands.
>> >> +
>> >> +So 7 additional ioctls have been added:
>> >>  
>> >>   VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>> >>   of the DMA window on the PCI bus.
>> >> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
>> >>  
>> >>   VFIO_IOMMU_DISABLE - disables the container.
>> >>  
>> >> + VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
>> >> + specified device. Also, it can be used to remove IO or DMA
>> >> + stopped state on the frozen PE.
>> >> +
>> >> + VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
>> >> +
>> >> + VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
>> >> + error recovering.
>> >> +
>> >> + VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
>> >> + one of the major steps for error recoverying.
>> >>  
>> >>  The code flow from the example above should be slightly changed:
>> >>  
>> >> @@ -346,6 +365,77 @@ The code flow from the example above should be 
>> >> slightly changed:
>> >>   ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>> >>   .
>> >>  
>> >> +Based on the initial example we have, the following piece of code could 
>> >> be
>> >> +reference for EEH setup and error handling:
>> >> +
>> >> + struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
>> >> + struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
>> >> + struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
>> >> + struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
>> >> +
>> >> + 
>> >> +
>> >> + /* Get a file descriptor for the device */
>> >> + device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, ":06:0d.0");
>> >> +
>> >> + /* Enable the EEH functionality on the device */
>> >> + option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
>> >> + ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
>> >> +
>> >> + /* You're suggested to create additional data struct to represent
>> >> +  * PE, and put child devices belonging to same IOMMU group to the
>> >> +  * PE instance for later reference.
>> >> +  */
>> >> +
>> >> + /* Check the PE's state and make sure it's in functional state */
>> >> + ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> >> +
>> >> + /* Save device's state. pci_save_state() would be good enough
>> >> +  * as an example.
>> >> +  */
>> >> +
>> >> + /* Test and setup the device */
>> >> + ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
>> >> +
>> >> + 
>> >> +
>> >> + /* When 0xFF's returned from reading PCI config space or IO BARs
>> >> +  * of the PCI device. Check the PE state to see if that has been
>> >> +  * frozen.
>> >> +  */
>> >> + ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> >> +
>> >> + /* Waiting for pending PCI transactions to be completed and don't
>> >> +  * produce any more PCI t

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt

On Thu, 2014-05-29 at 10:05 +1000, Gavin Shan wrote:
> The log stuff is TBD and I'll figure it out later.
> 
> About to what are the errors, there are a lot. Most of them are related
> to hardware level, for example unstable PCI link. Usually, those error
> bits defined in AER fatal error state register contribute to EEH errors.
> It could be software related, e.g. violating IOMMU protection (read/write
> permission etc), or even one PCI device isn't capable of DMAing. Hopefully,
> it's the explaination you're looking for? :-)

Note to Alex('s) ...

The log we get from FW at the moment in the host is:

  - In the case of pHyp / RTAS host, opaque. Basically it's a blob that we store
and that can be sent to IBM service people :-) Not terribly practical.

  - On PowerNV, it's a IODA specific data structure (basically a dump of a 
bunch of register state and tables). IODA is our IO architecture (sadly the
document itself isn't public at this point) and we have two versions, IODA1
and IODA2. You can consider the structure as chipset specific basically.

What I want to do in the long run is:

  - In the case of pHyp/RTAS host, there's not much we can do, so basically
forward that log as-is.

  - In the case of PowerNV, forward the log *and* add a well-defined blob to
it that does some basic interpretation of it. In fact I want to do the latter
more generally in the host kernel for host kernel errors as well, but we
can forward it to the guest via VFIO too. What I mean by interpretation is
something along the lines of an error type "DMA IOMMU fault, MMIO error,
Link loss, PCIe UE, ..." among a list of well known error types that
represent the most common ones, with a little bit of added info when
available (for most DMA errors we can provide the DMA address that faulted
for example).

So Gavin and I need to work a bit on that, both in the context of the host
kernel to improve the reporting there, and in the context of what we pass to
user space.

However, no driver today cares about that information. The PCIe error recovery
system doesn't carry it and it has no impact on the EEH recovery procedures,
so EEH in that sense is useful without that reporting. It's useful for the
programmer (or user/admin) to identify what went wrong but it's not used by
the automated recovery process.

One last thing to look at is in the case of a VFIO device, we might want to
silence the host kernel printf's once we support guest EEH since otherwise
the guest has a path to flood the host kernel log by triggering a lot of
EEH errors purposefully.

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

linux-next: manual merge of the powerpc tree with Linus' tree

2014-05-28 Thread Stephen Rothwell

Hi all,

Today's linux-next merge of the powerpc tree got a conflict in
arch/powerpc/include/asm/sections.h between commit b18db0b80867 ("KVM
guest: Make pv trampoline code executable") from the  tree and commit
07de8377f748 ("powerpc: Fix ABIv2 issue with
dereference_function_descriptor") from the powerpc tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc arch/powerpc/include/asm/sections.h
index 521790330672,d1bb96d5a298..
--- a/arch/powerpc/include/asm/sections.h
+++ b/arch/powerpc/include/asm/sections.h
@@@ -39,17 -39,7 +39,18 @@@ static inline int overlaps_kernel_text(
(unsigned long)_stext < end;
  }
  
 +static inline int overlaps_kvm_tmp(unsigned long start, unsigned long end)
 +{
 +#ifdef CONFIG_KVM_GUEST
 +  extern char kvm_tmp[];
 +  return start < (unsigned long)kvm_tmp &&
 +  (unsigned long)&kvm_tmp[1024 * 1024] < end;
 +#else
 +  return 0;
 +#endif
 +}
 +
+ #if !defined(_CALL_ELF) || _CALL_ELF != 2
  #undef dereference_function_descriptor
  static inline void *dereference_function_descriptor(void *ptr)
  {


signature.asc
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Kernel 3.15: Boot problems with a PA6T board

2014-05-28 Thread Michael Ellerman

On Wed, 2014-05-28 at 13:25 +0200, Christian Zigotzky wrote:
> On 28.05.2014 10:53, Christian Zigotzky wrote:
> > Hi Michael,
> >
> > Thank you for your answer and thank you for your help. :-)
> >
> > On 28.05.2014 06:23, Michael Ellerman wrote:
> >> On Wed, 2014-05-28 at 01:08 +0200, Christian Zigotzky wrote:
> >>
> >> I'm going to guess that cd427485357c0c4b99f69719251baacf25946e11 is 
> >> BAD. Can
> >> you please confirm or deny that?
>
> > It's not BAD. It boots.

Hmm, interesting.

> >> If cd42748 is *good*, then you'll need to do a bigger bisect from 
> >> d8ff9cd to
> >> 18a1a7a.

> OK :-)
> 
> -> git clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux2-git
> -> git bisect start
> -> git bisect good d8ff9cdf68fd119d491f3de90e1a612afc2f3b2b
> -> git bisect bad 18a1a7a1d862ae0794a0179473d08a414dd49234
> 
> Output:
> Bisecting: 5900 revisions left to test after this (roughly 13 steps)
> [cb1595563880a81dab6eab9a5ecb4520d2e76077] Merge tag 'tty-3.15-rc1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> Unfortunately it doesn't boot. :-(

OK. So you do:

$ git bisect bad

And it will pick a new commit for you to test. Repeat that ~13 times and you
should have identified the bad commit.

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

41 matches

Mail list logo