Re: [PATCH 2/3] swiotlb: prefix dma_to_phys and phys_to_dma functions

2016-03-19 Thread Sinan Kaya
On 3/18/2016 8:12 AM, Robin Murphy wrote:
> Since we know for sure that swiotlb_to_phys is a no-op on arm64, it might be 
> cleaner to simply not reference it at all. I suppose we could have some 
> private local wrappers, e.g.:
> 
> #define swiotlb_to_virt(addr) phys_to_virt((phys_addr_t)(addr))
> 
> to keep the intent of the code clear (and just in case anyone ever builds a 
> system mad enough to warrant switching out that definition, but I'd hope that 
> never happens).
> 
> Otherwise, looks good - thanks for doing this!

OK. I added this. Reviewed-by?

I'm not happy to submit such a big patch for all different ARCHs. I couldn't
find a cleaner solution. I'm willing to split this patch into multiple if there
is a better way.

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index ada00c3..8c0f66b 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -29,6 +29,14 @@

 #include 

+/*
+ * If you are building a system without IOMMU, then you are using SWIOTLB
+ * library. The ARM64 adaptation of this library does not support address
+ * translation and it assumes that physical address = dma address for such
+ * a use case. Please don't build a platform that violates this.
+ */
+#define swiotlb_to_virt(addr) phys_to_virt((phys_addr_t)(addr))
+
 static pgprot_t __get_dma_pgprot(struct dma_attrs *attrs, pgprot_t prot,
 bool coherent)
 {
@@ -188,7 +196,7 @@ static void __dma_free(struct device *dev, size_t size,
   void *vaddr, dma_addr_t dma_handle,
   struct dma_attrs *attrs)
 {
-   void *swiotlb_addr = phys_to_virt(swiotlb_dma_to_phys(dev, dma_handle));
+   void *swiotlb_addr = swiotlb_to_virt(dma_handle);

size = PAGE_ALIGN(size);

@@ -209,8 +217,7 @@ static dma_addr_t __swiotlb_map_page(struct device *dev, 
struct page *page,

dev_addr = swiotlb_map_page(dev, page, offset, size, dir, attrs);
if (!is_device_dma_coherent(dev))
-   __dma_map_area(phys_to_virt(swiotlb_dma_to_phys(dev, dev_addr)),
-  size, dir);
+   __dma_map_area(swiotlb_to_virt(dev_addr), size, dir);

return dev_addr;
 }
@@ -283,8 +290,7 @@ static void __swiotlb_sync_single_for_device(struct device 
*dev,
 {
swiotlb_sync_single_for_device(dev, dev_addr, size, dir);
if (!is_device_dma_coherent(dev))
-   __dma_map_area(phys_to_virt(swiotlb_dma_to_phys(dev, dev_addr)),
-  size, dir);
+   __dma_map_area(swiotlb_to_virt(dev_addr), size, dir);
 }




-- 
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/3] swiotlb: prefix dma_to_phys and phys_to_dma functions

2016-03-19 Thread Sinan Kaya
Prefixing dma_to_phys and phys_to_dma API with swiotlb so that
they are no longer part of the DMA API. These APIs do not exist
on all architectures and breaks compatibility.

In preparation for the clean up, also make the ARCH implementation
known by defining swiotlb_phys_do_dma and swiotlb_dma_to_phys.

Signed-off-by: Sinan Kaya 
---
 arch/arm/include/asm/dma-mapping.h |  8 ++-
 arch/arm64/include/asm/dma-mapping.h   |  9 ++-
 arch/arm64/mm/dma-mapping.c| 75 ++
 arch/ia64/include/asm/dma-mapping.h|  8 ++-
 arch/mips/cavium-octeon/dma-octeon.c   |  8 +--
 .../include/asm/mach-cavium-octeon/dma-coherence.h |  7 +-
 arch/mips/include/asm/mach-generic/dma-coherence.h |  8 ++-
 .../include/asm/mach-loongson64/dma-coherence.h| 14 ++--
 arch/mips/loongson64/common/dma-swiotlb.c  |  4 +-
 arch/powerpc/include/asm/dma-mapping.h |  8 ++-
 arch/tile/include/asm/dma-mapping.h|  8 ++-
 arch/unicore32/include/asm/dma-mapping.h   |  8 ++-
 arch/x86/include/asm/dma-mapping.h | 15 +++--
 arch/x86/kernel/pci-swiotlb.c  |  2 +-
 arch/x86/pci/sta2x11-fixup.c   | 10 +--
 arch/xtensa/include/asm/dma-mapping.h  |  8 ++-
 lib/swiotlb.c  | 28 
 17 files changed, 150 insertions(+), 78 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h 
b/arch/arm/include/asm/dma-mapping.h
index 6ad1ced..e176689 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -129,17 +129,21 @@ static inline bool is_device_dma_coherent(struct device 
*dev)
return dev->archdata.dma_coherent;
 }
 
-static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
+static inline dma_addr_t swiotlb_phys_to_dma(struct device *dev,
+phys_addr_t paddr)
 {
unsigned int offset = paddr & ~PAGE_MASK;
return pfn_to_dma(dev, __phys_to_pfn(paddr)) + offset;
 }
+#define swiotlb_phys_to_dma swiotlb_phys_to_dma
 
-static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t dev_addr)
+static inline phys_addr_t swiotlb_dma_to_phys(struct device *dev,
+ dma_addr_t dev_addr)
 {
unsigned int offset = dev_addr & ~PAGE_MASK;
return __pfn_to_phys(dma_to_pfn(dev, dev_addr)) + offset;
 }
+#define swiotlb_dma_to_phys swiotlb_dma_to_phys
 
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
 {
diff --git a/arch/arm64/include/asm/dma-mapping.h 
b/arch/arm64/include/asm/dma-mapping.h
index ba437f0..39f21e8 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -64,15 +64,19 @@ static inline bool is_device_dma_coherent(struct device 
*dev)
return dev->archdata.dma_coherent;
 }
 
-static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
+static inline dma_addr_t swiotlb_phys_to_dma(struct device *dev,
+phys_addr_t paddr)
 {
return (dma_addr_t)paddr;
 }
+#define swiotlb_phys_to_dma swiotlb_phys_to_dma
 
-static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t dev_addr)
+static inline phys_addr_t swiotlb_dma_to_phys(struct device *dev,
+ dma_addr_t dev_addr)
 {
return (phys_addr_t)dev_addr;
 }
+#define swiotlb_dma_to_phys swiotlb_dma_to_phys
 
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
 {
@@ -88,3 +92,4 @@ static inline void dma_mark_clean(void *addr, size_t size)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_DMA_MAPPING_H */
+
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index a6e757c..ada00c3 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -107,7 +107,7 @@ static void *__dma_alloc_coherent(struct device *dev, 
size_t size,
if (!page)
return NULL;
 
-   *dma_handle = phys_to_dma(dev, page_to_phys(page));
+   *dma_handle = swiotlb_phys_to_dma(dev, page_to_phys(page));
addr = page_address(page);
memset(addr, 0, size);
return addr;
@@ -121,7 +121,7 @@ static void __dma_free_coherent(struct device *dev, size_t 
size,
struct dma_attrs *attrs)
 {
bool freed;
-   phys_addr_t paddr = dma_to_phys(dev, dma_handle);
+   phys_addr_t paddr = swiotlb_dma_to_phys(dev, dma_handle);
 
if (dev == NULL) {
WARN_ONCE(1, "Use an actual device structure for DMA 
allocation\n");
@@ -151,7 +151,8 @@ static void *__dma_alloc(struct device *dev, size_t size,
void *addr = __alloc_from_pool(size, &page, flags)

Re: [PATCH 2/3] swiotlb: prefix dma_to_phys and phys_to_dma functions

2016-03-29 Thread Sinan Kaya
On 3/29/2016 8:44 AM, Stefano Stabellini wrote:
> Could you please explain what is the problem that you are trying to
> solve? In other words, what is the issue with assuming that physical
> address = dma address (and the current dma_to_phys and phys_to_dma
> static inlines) if no arm64 platforms violate it? That's pretty much
> what is done on x86 too (without X86_DMA_REMAP).
> 
> If you want to make sure that the assumption is not violated, you can
> introduce a boot time check or a BUG_ON somewhere.
> 
> If there is an arm64 platform with phys_addr != dma_addr, we need proper
> support for it. In fact even if there is an IOMMU on that platform, when
> running Xen on it, the IOMMU would be used by the hypervisor and Linux
> would still end up without it, using the swiotlb.


The problem is that device drivers are trying to use dma_to_phys and phys_to_dma
APIs to do address translation even though these two API do not exist in DMA 
mapping
layer. (see patch #1 and I made the same mistake in my HIDMA code)

Especially, when a device is behind an IOMMU; the DMA addresses are not equal
to physical addresses. Usage of dma_to_phys causes a crash on the system.

I'm trying to prefix the dma_to_phys and phys_to_dma API with swiotlb so that
we know what it is intended for and usage of these API in drivers need to be
discouraged in the future during code reviews.

Since I renamed the API, Robin Murphy made a suggestion to convert this 

phys_to_virt(swiotlb_dma_to_phys(dev, dma_handle))

to this

#define swiotlb_to_virt(addr) phys_to_virt((phys_addr_t)(addr))

swiotlb_to_virt(dma_handle)

just to reduce code clutter since we know swiotlb_dma_to_phys returns phys 
already
for ARM64 architecture.

I think we can do this exercise later. I'll undo this change for the moment. 
Let's focus on the API rename. 

I don't want this general purpose dma_to_phys API returning phys=dma value. 
This is
not correct.

-- 
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v4 2/5] powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC

2019-04-11 Thread Sinan Kaya
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Signed-off-by: Sinan Kaya 
---
 arch/powerpc/kernel/sysfs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e8e93c2c7d03..7a1708875d27 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -610,7 +610,7 @@ SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#ifdef CONFIG_DEBUG_MISC
 SYSFS_SPRSETUP(hid0, SPRN_HID0);
 SYSFS_SPRSETUP(hid1, SPRN_HID1);
 SYSFS_SPRSETUP(hid4, SPRN_HID4);
@@ -639,7 +639,7 @@ SYSFS_SPRSETUP(tsr0, SPRN_PA6T_TSR0);
 SYSFS_SPRSETUP(tsr1, SPRN_PA6T_TSR1);
 SYSFS_SPRSETUP(tsr2, SPRN_PA6T_TSR2);
 SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3);
-#endif /* CONFIG_DEBUG_KERNEL */
+#endif /* CONFIG_DEBUG_MISC */
 #endif /* HAS_PPC_PMC_PA6T */
 
 #ifdef HAS_PPC_PMC_IBM
@@ -680,7 +680,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
__ATTR(hid4, 0600, show_hid4, store_hid4),
@@ -709,7 +709,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(tsr1, 0600, show_tsr1, store_tsr1),
__ATTR(tsr2, 0600, show_tsr2, store_tsr2),
__ATTR(tsr3, 0600, show_tsr3, store_tsr3),
-#endif /* CONFIG_DEBUG_KERNEL */
+#endif /* CONFIG_DEBUG_MISC */
 };
 #endif /* HAS_PPC_PMC_PA6T */
 #endif /* HAS_PPC_PMC_CLASSIC */
-- 
2.21.0



[PATCH v5 2/5] powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC

2019-04-13 Thread Sinan Kaya
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Signed-off-by: Sinan Kaya 
Reviewed-by: Josh Triplett 
---
 arch/powerpc/kernel/sysfs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e8e93c2c7d03..7a1708875d27 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -610,7 +610,7 @@ SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#ifdef CONFIG_DEBUG_MISC
 SYSFS_SPRSETUP(hid0, SPRN_HID0);
 SYSFS_SPRSETUP(hid1, SPRN_HID1);
 SYSFS_SPRSETUP(hid4, SPRN_HID4);
@@ -639,7 +639,7 @@ SYSFS_SPRSETUP(tsr0, SPRN_PA6T_TSR0);
 SYSFS_SPRSETUP(tsr1, SPRN_PA6T_TSR1);
 SYSFS_SPRSETUP(tsr2, SPRN_PA6T_TSR2);
 SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3);
-#endif /* CONFIG_DEBUG_KERNEL */
+#endif /* CONFIG_DEBUG_MISC */
 #endif /* HAS_PPC_PMC_PA6T */
 
 #ifdef HAS_PPC_PMC_IBM
@@ -680,7 +680,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
__ATTR(hid4, 0600, show_hid4, store_hid4),
@@ -709,7 +709,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(tsr1, 0600, show_tsr1, store_tsr1),
__ATTR(tsr2, 0600, show_tsr2, store_tsr2),
__ATTR(tsr3, 0600, show_tsr3, store_tsr3),
-#endif /* CONFIG_DEBUG_KERNEL */
+#endif /* CONFIG_DEBUG_MISC */
 };
 #endif /* HAS_PPC_PMC_PA6T */
 #endif /* HAS_PPC_PMC_CLASSIC */
-- 
2.21.0



Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-15 Thread Sinan Kaya

On 11/15/2018 3:16 PM, Alexandru Gagniuc wrote:

I've asked around a few people at Dell and they unanimously agree that
_OSC is the correct way to determine ownership of AER. In linux, we
use the result of _OSC to enable AER services, but we use HEST to
determine AER ownership. That's inconsistent. This series drops the
use of HEST in favor of _OSC.

[1]https://lkml.org/lkml/2018/11/15/62


This change breaks the existing systems that rely on the HEST table
telling the operating system about firmware first presence.

Besides, HEST table has much more granularity about which PCI component
needs firmware such as global/device/switch.

You should probably circulate these ideas for wider consumption in UEFI
forum as UEFI owns the HEST table definition.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 11:53 AM, Keith Busch wrote:

On Mon, Nov 19, 2018 at 11:53:05AM -0500, Tyler Baicar wrote:

On Thu, Nov 15, 2018 at 8:49 PM Sinan Kaya  wrote:


On 11/15/2018 3:16 PM, Alexandru Gagniuc wrote:

I've asked around a few people at Dell and they unanimously agree that
_OSC is the correct way to determine ownership of AER. In linux, we
use the result of _OSC to enable AER services, but we use HEST to
determine AER ownership. That's inconsistent. This series drops the
use of HEST in favor of _OSC.

[1]https://lkml.org/lkml/2018/11/15/62


This change breaks the existing systems that rely on the HEST table
telling the operating system about firmware first presence.

Besides, HEST table has much more granularity about which PCI component
needs firmware such as global/device/switch.

You should probably circulate these ideas for wider consumption in UEFI
forum as UEFI owns the HEST table definition.


I agree with Sinan, this will break existing systems, and the granularity of the
HEST definition is more useful than the single bit in _OSC.


But we're not using HEST as a fine grain control. We disable native AER
handling if *any* device has FF set in HEST, and that just forces people
to use pcie_ports=native to get around that.



I don't see *any* in the code.  aer_hest_parse() does the HEST table parsing.
It switches to firmware first mode if global flag in HEST is set. Otherwise
for each BDF in device, hest_match_pci() is used to do a cross-matching against
HEST table contents.

Am I missing something?


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 12:32 PM, Sinan Kaya wrote:


But we're not using HEST as a fine grain control. We disable native AER
handling if *any* device has FF set in HEST, and that just forces people
to use pcie_ports=native to get around that.



I don't see *any* in the code.  aer_hest_parse() does the HEST table parsing.
It switches to firmware first mode if global flag in HEST is set. Otherwise
for each BDF in device, hest_match_pci() is used to do a cross-matching against
HEST table contents.

Am I missing something?


I see. I think you are talking about aer_firmware_first, right?

aer_set_firmware_first() and  pcie_aer_get_firmware_first() seem to do the right
thing.

aer_firmware_first is probably getting set because events are all routed to a
single root port and aer_acpi_firmware_first() is used to decide whether AER
should be initialized or not.

I think I understand what is going on now.

Still, breaking existing systems that rely on HEST table is not cool.
I'd rather have users specify "pcie_ports=native" to skip FF rather than
having broken systems by default to be honest.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 12:41 PM, Keith Busch wrote:

Still, breaking existing systems that rely on HEST table is not cool.
I'd rather have users specify "pcie_ports=native" to skip FF rather than
having broken systems by default to be honest.

The pcie_ports=native work-around ignores FF to potentially unknown
results, though.



How about being able to enable/disable FF in BIOS?

We can't really turn off firmware first in the kernel without asking help
from the firmware. Like you said, it causes unpredictable results.

There will be two competing software trying to touch the same registers.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 1:10 PM, Keith Busch wrote:

We can't really turn off firmware first in the kernel without asking help
from the firmware.

The _OSC method this patch utilizes is the ACPI spec defined way for
the kernel to wrest control from firmware. BIOS specific menu settings
shouldn't be our only recourse when we have a spec authority defining
generic OS interfaces to accomplish the same thing.

Unless there is a disagreement on the _OSC interpreation, we'd have to
accept that platforms breaking from this patch are non-compliant.



It depends on which spec you look :)

UEFI HEST table specification also claims that it should be the ultimate
table for when PCI firmware-first should be disabled/enabled.

I think somebody needs to fix these. I saw an email from Harb Abdulhamid
sent to aswg this morning.

That's why I suggested circulating this idea in UEFI forums first.
Let's see what everybody thinks. We can go from there.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

UEFI HEST table specification also claims that it should be the ultimate
table for when PCI firmware-first should be disabled/enabled.


IIRC, EFI absorbed ACPI before FFS was a thing. Could you point me to the UEFI 
chapter that says HEST is authoritative?
(not being a smartie, just that my free software PDF readers can't search within 
a file that large)




ACPI 6.2:

18.3.2.4 PCI Express Root Port AER Structure

Flags:

Bit [0] - FIRMWARE_FIRST: If set, this bit indicates to the OSPM that system 
firmware will handle errors from this source first.
Bit [1] - GLOBAL: If set, indicates that the settings contained in this 
structure apply globally to all PCI Express Devices.

All other bits must be set to zero.

It doesn't say shall, may or might. It says will.




I think somebody needs to fix these. I saw an email from Harb Abdulhamid
sent to aswg this morning.

That's why I suggested circulating this idea in UEFI forums first.
Let's see what everybody thinks. We can go from there.


However you look at it, we have a glaring inconsistency in how we handle AER 
control in linux. I'm surprised we didn't see huge issues because of mixing 
HEST/_OSC.


What systems rely on the HEST definition as opposed to _OSC? It doesn't make 
sense to me that you could have a system with mixed FFS and native AER on the 
same root port. The granularity of HEST shouldn't matter here because of how AER 
works.


I think It depends on your PCI topology.

For other topologies with multiple PCI root complexes, I can see this being
used per root complex flag to indicate which root complex needs firmware first
and which one doesn't.



I'd like see how exactly we break one of those elusive systems with _OSC. I 
suspect _OSC and HEST end up having the same information, and that's why we 
didn't see any real-life issue with mixing the approaches.


I'm already aware of two systems that rely on HEST table to pass information to
the OS that firmware first is enabled. Both of the systems do not change their
_OSC bits during this assuming HEST table has priority over _OSC for firmware
first.

If we add this patch, OS will try to claim the AER address space while firmware
wants exclusive access.

As I said in my previous email, the right place to talk about this is UEFI
forum.



Alex


P.S.
(SARCASM ALERT) Isn't UEFI is a pile of stuff that didn't stick to the wall?

P.P.S
I remember someone asking why we don't disable FFS in the BIOS. I think it would 
be next to impossible to get certain platform vendors to relinquish FFS control, 
unless the specs required things to be that way -- and had a "standard" way to 
do so.


Then getting the specs to change in that direction is also a battle.





Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 3:16 PM, alex_gagn...@dellteam.com wrote:

On 11/19/2018 01:32 PM, Sinan Kaya wrote:

ACPI 6.2:

18.3.2.4 PCI Express Root Port AER Structure

Flags:

Bit [0] - FIRMWARE_FIRST: If set, this bit indicates to the OSPM that system
firmware will handle errors from this source first.
Bit [1] - GLOBAL: If set, indicates that the settings contained in this
structure apply globally to all PCI Express Devices.
All other bits must be set to zero.

It doesn't say shall, may or might. It says will.


It says "system firmware will handle errors". It does not say "system
firmware owns AER registers". In absence on any descriptor text on the
meaning of these tables, this really looks to me like it should be
interpreted as a descriptor of APEI error sources, not a mutex on who
writes to certain bits-- AER in this case.


True. I was trying to get it out in a rush. I omitted words.

However; table assumes governance about for which entities firmware first
should be enabled. There is no cross reference to _OSC or permission
negotiation like _OST.



I don't think that is contradictory or inconsistent.
I also wasn't able to find any reference to HEST in UEFI 2.7, only in
ACPI spec.


You are right. It was a confusion on my side. The right place to look is
ACPI specification. I was involved in this a couple of years ago. Some pieces
were in UEFI spec. Other pieces were in ACPI. I guess they got unified
now.




I think It depends on your PCI topology.

For other topologies with multiple PCI root complexes, I can see this being
used per root complex flag to indicate which root complex needs firmware first
and which one doesn't.


_OSC is per root bus, so it's already granular enough, right? Why would
it depend on PCI topology?



I was speculating. I don't have the full background on this. Need to consult
the spec developers.


As I said in my previous email, the right place to talk about this is UEFI
forum.


The way I would present the problem to he spec writers is that, although
the spec appears to be consistent, we've seen firmware vendors that made
the wrong assumptions about HEST/_OSC. Instead of describing AER
ownership with _OSC, they attempted to do it with HEST. So we should add
an implementation note, or clarification about this.


I agree.



Alex





Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Sinan Kaya

On 11/19/2018 6:49 PM, alex_gagn...@dellteam.com wrote:

On 11/19/2018 02:33 PM, Sinan Kaya wrote:

However; table assumes governance about for which entities firmware first
should be enabled. There is no cross reference to _OSC or permission
negotiation like _OST.


Well, from an OSPM perspective, is FFS something that can be enabled or
disabled? FFS seems to be static to OSPM, which would change the sort of
assumptions we can reasonably make here.


IMO, it can be enabled/disabled in BIOS. I have seen this implementation before.
If the trigger is the presence of a statically compiled ACPI HEST table (as the
current code does); presence of FFS would be static from OSPM perspective.
BIOS could patch this table or hide it during boot.

If FFS were to be negotiated via _OSC as indirectly implied in this series, then
same BIOS could patch the ACPI table to return different values for the _OSC
return.





As I said in my previous email, the right place to talk about this is UEFI
forum.


The way I would present the problem to he spec writers is that, although
the spec appears to be consistent, we've seen firmware vendors that made
the wrong assumptions about HEST/_OSC. Instead of describing AER
ownership with _OSC, they attempted to do it with HEST. So we should add
an implementation note, or clarification about this.


I agree.


Cool. While the UEFI Secret Society debates, can we figure out if/how
[patch 1/2] breaks those systems, or is it only [patch 2/2] of this
series that we suspect?


I went back and looked at both patches. Both of them are removing references to
HEST table. I think both patches are impacted by this discussion.



Alex





Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-20 Thread Sinan Kaya

On 11/20/2018 3:44 PM, alex_gagn...@dellteam.com wrote:

I'd prefer "sure" instead of "think". "I think it breaks some system I'm
not telling you about" doesn't help much in figuring out how not to
break said system(s).:)


Sorry, I thought I mentioned why it would break but let me repeat.

The systems I have seen rely on the HEST table presence as an indicator
to the OS that firmware first is enabled. If you go look at the _OSC bits
on such systems, it still says OS owns the AER service.

The assumption here is that HEST table has precedence over the _OSC bits.
That's what needs to be clarified in the UEFI forum.

If this code is to go in and ignore the HEST table presence, then firmware
will think that it owns AER service and OS will think that it owns AER
service too.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-20 Thread Sinan Kaya

On 11/20/2018 4:46 PM, alex_gagn...@dellteam.com wrote:

Now, let's assume, for the sake of argument, that the firmware on those
system's is broken, and it didn't intend to give the OS control of AER.
OSPM checking HEST instead of _OSC is still wrong, according to the
spec. Two wrongs don't make a right, they just don't crash.

I think the correct way is to identify those broken systems, and add
quirks for them. Continuing to have inconsistent and over-complicated
logic that is not spec compliant is not any better.


Remember that both _OSC and HEST are in the ACPI specification. I don't
think there is a consensus on what is "wrong".

There is certainly a need for spec clarification.

One version is:

"if HEST table is present, ignore _OSC"

or

Another version is:

"if HEST table is present, make sure that FW sets _OSC bit for AER to
false. Otherwise, warn like crazy that this BIOS is broken and needs
an update and can cause all sorts of trouble"

I can see both points of view. The second one can also be worked around
by an SMBIOS quirk too as you suggested. Counting the number of quirks
and random bug reports will be an interesting exercise / regression.

I followed the ASWG thread yesterday. There will be a meeting next week to
discuss this.

My preference is not to introduce new behavior/regression to the kernel.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-20 Thread Sinan Kaya

On 11/20/2018 4:42 PM, Keith Busch wrote:

How does that work? If the OS takes control, it sets up MSIs that FW don't
react to, and disables system errors through PCIe Root Control. Aren't
those sys errs the mechanism FW knows it has something to do, which
means the OS can effectively fence it off?


I think this is all implementation detail and doesn't necessarily apply
to all firmware-first implementation flavors.

Assumptions are:
1. both FW and OS are listening to MSI interrupts
2. FW monitors the system errors

Some FF implementation could route the AER interrupt to a higher privilege
level. Some other implementation could use INTx or a side-band channel interrupt
for firmware-interrupt too.

I have seen all 3 except MSI :) and also firmware never monitored the system
error bits. I was curious if anybody ever used those legacy bits. Now, I know
someone is using it.




Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-27 Thread Sinan Kaya

On 11/27/2018 1:22 PM, alex_gagn...@dellteam.com wrote:

On 11/20/2018 04:08 PM, Sinan Kaya wrote:

I followed the ASWG thread yesterday. There will be a meeting next week to
discuss this.


Any updates on the meeting?




Tyler should be able to get an update.


Re: [PATCH] pci: pcie: AER: Fix logging of Correctable errors

2020-06-19 Thread Sinan Kaya
On 6/18/2020 11:55 AM, Matt Jolly wrote:

> + pci_warn(dev, "  device [%04x:%04x] error 
> status/mask=%08x/%08x\n",
> + dev->vendor, dev->device,
> + info->status, info->mask);
> + } else {



> + pci_err(dev, "  device [%04x:%04x] error 
> status/mask=%08x/%08x\n",
> + dev->vendor, dev->device,
> + info->status, info->mask);


Function pointers for pci_warn vs. pci_err ?

This looks like a lot of copy/paste.


[PATCH 02/30] powerpc/PCI: deprecate pci_get_bus_and_slot()

2017-11-21 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Use pci_get_domain_bus_and_slot() with a domain number of 0 where we can't
extract the domain number. Other places, use the actual domain number from
the device.

Signed-off-by: Sinan Kaya 
---
 arch/powerpc/kernel/pci_32.c  | 3 ++-
 arch/powerpc/platforms/powermac/feature.c | 2 +-
 arch/powerpc/sysdev/mv64x60_pci.c | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 1d817f4..85ad2f7 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -96,7 +96,8 @@
reg = of_get_property(node, "reg", NULL);
if (!reg)
continue;
-   dev = pci_get_bus_and_slot(pci_bus, ((reg[0] >> 8) & 0xff));
+   dev = pci_get_domain_bus_and_slot(0, pci_bus,
+ ((reg[0] >> 8) & 0xff));
if (!dev || !dev->subordinate) {
pci_dev_put(dev);
continue;
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 9e3f39d..ed8b166 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -829,7 +829,7 @@ static long core99_scc_enable(struct device_node *node, 
long param, long value)
 
if (value) {
if (pci_device_from_OF_node(node, &pbus, &pid) == 0)
-   pdev = pci_get_bus_and_slot(pbus, pid);
+   pdev = pci_get_domain_bus_and_slot(0, pbus, pid);
if (pdev == NULL)
return 0;
rc = pci_enable_device(pdev);
diff --git a/arch/powerpc/sysdev/mv64x60_pci.c 
b/arch/powerpc/sysdev/mv64x60_pci.c
index d52b3b8..6fe9104 100644
--- a/arch/powerpc/sysdev/mv64x60_pci.c
+++ b/arch/powerpc/sysdev/mv64x60_pci.c
@@ -37,7 +37,7 @@ static ssize_t mv64x60_hs_reg_read(struct file *filp, struct 
kobject *kobj,
if (count < MV64X60_VAL_LEN_MAX)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_read_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, &v);
@@ -61,7 +61,7 @@ static ssize_t mv64x60_hs_reg_write(struct file *filp, struct 
kobject *kobj,
if (sscanf(buf, "%i", &v) != 1)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_write_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, v);
-- 
1.9.1



[PATCH 14/30] powerpc/powermac: deprecate pci_get_bus_and_slot()

2017-11-21 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Use pci_get_domain_bus_and_slot() with a domain number of 0 where we can't
extract the domain number. Other places, use the actual domain number from
the device.

Signed-off-by: Sinan Kaya 
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c4c2b3b..3e8b3b6 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1799,7 +1799,7 @@ static int powerbook_sleep_grackle(void)
struct adb_request req;
struct pci_dev *grackle;
 
-   grackle = pci_get_bus_and_slot(0, 0);
+   grackle = pci_get_domain_bus_and_slot(0, 0, 0);
if (!grackle)
return -ENODEV;
 
-- 
1.9.1



[PATCH V2 02/29] powerpc/PCI: deprecate pci_get_bus_and_slot()

2017-11-22 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
is not ready to consume multiple domains and existing code used domain
number 0.

Signed-off-by: Sinan Kaya 
---
 arch/powerpc/kernel/pci_32.c  | 3 ++-
 arch/powerpc/platforms/powermac/feature.c | 2 +-
 arch/powerpc/sysdev/mv64x60_pci.c | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 1d817f4..85ad2f7 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -96,7 +96,8 @@
reg = of_get_property(node, "reg", NULL);
if (!reg)
continue;
-   dev = pci_get_bus_and_slot(pci_bus, ((reg[0] >> 8) & 0xff));
+   dev = pci_get_domain_bus_and_slot(0, pci_bus,
+ ((reg[0] >> 8) & 0xff));
if (!dev || !dev->subordinate) {
pci_dev_put(dev);
continue;
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 9e3f39d..ed8b166 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -829,7 +829,7 @@ static long core99_scc_enable(struct device_node *node, 
long param, long value)
 
if (value) {
if (pci_device_from_OF_node(node, &pbus, &pid) == 0)
-   pdev = pci_get_bus_and_slot(pbus, pid);
+   pdev = pci_get_domain_bus_and_slot(0, pbus, pid);
if (pdev == NULL)
return 0;
rc = pci_enable_device(pdev);
diff --git a/arch/powerpc/sysdev/mv64x60_pci.c 
b/arch/powerpc/sysdev/mv64x60_pci.c
index d52b3b8..6fe9104 100644
--- a/arch/powerpc/sysdev/mv64x60_pci.c
+++ b/arch/powerpc/sysdev/mv64x60_pci.c
@@ -37,7 +37,7 @@ static ssize_t mv64x60_hs_reg_read(struct file *filp, struct 
kobject *kobj,
if (count < MV64X60_VAL_LEN_MAX)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_read_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, &v);
@@ -61,7 +61,7 @@ static ssize_t mv64x60_hs_reg_write(struct file *filp, struct 
kobject *kobj,
if (sscanf(buf, "%i", &v) != 1)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_write_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, v);
-- 
1.9.1



[PATCH V2 13/29] powerpc/powermac: deprecate pci_get_bus_and_slot()

2017-11-22 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Hard-code the domain number as 0 to match the previous behavior.

Signed-off-by: Sinan Kaya 
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c4c2b3b..3e8b3b6 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1799,7 +1799,7 @@ static int powerbook_sleep_grackle(void)
struct adb_request req;
struct pci_dev *grackle;
 
-   grackle = pci_get_bus_and_slot(0, 0);
+   grackle = pci_get_domain_bus_and_slot(0, 0, 0);
if (!grackle)
return -ENODEV;
 
-- 
1.9.1



[PATCH V3 02/29] powerpc/PCI: deprecate pci_get_bus_and_slot()

2017-11-27 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
is not ready to consume multiple domains and existing code used domain
number 0.

Signed-off-by: Sinan Kaya 
---
 arch/powerpc/kernel/pci_32.c  | 3 ++-
 arch/powerpc/platforms/powermac/feature.c | 2 +-
 arch/powerpc/sysdev/mv64x60_pci.c | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 1d817f4..85ad2f7 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -96,7 +96,8 @@
reg = of_get_property(node, "reg", NULL);
if (!reg)
continue;
-   dev = pci_get_bus_and_slot(pci_bus, ((reg[0] >> 8) & 0xff));
+   dev = pci_get_domain_bus_and_slot(0, pci_bus,
+ ((reg[0] >> 8) & 0xff));
if (!dev || !dev->subordinate) {
pci_dev_put(dev);
continue;
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 9e3f39d..ed8b166 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -829,7 +829,7 @@ static long core99_scc_enable(struct device_node *node, 
long param, long value)
 
if (value) {
if (pci_device_from_OF_node(node, &pbus, &pid) == 0)
-   pdev = pci_get_bus_and_slot(pbus, pid);
+   pdev = pci_get_domain_bus_and_slot(0, pbus, pid);
if (pdev == NULL)
return 0;
rc = pci_enable_device(pdev);
diff --git a/arch/powerpc/sysdev/mv64x60_pci.c 
b/arch/powerpc/sysdev/mv64x60_pci.c
index d52b3b8..6fe9104 100644
--- a/arch/powerpc/sysdev/mv64x60_pci.c
+++ b/arch/powerpc/sysdev/mv64x60_pci.c
@@ -37,7 +37,7 @@ static ssize_t mv64x60_hs_reg_read(struct file *filp, struct 
kobject *kobj,
if (count < MV64X60_VAL_LEN_MAX)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_read_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, &v);
@@ -61,7 +61,7 @@ static ssize_t mv64x60_hs_reg_write(struct file *filp, struct 
kobject *kobj,
if (sscanf(buf, "%i", &v) != 1)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_write_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, v);
-- 
1.9.1



[PATCH V3 13/29] powerpc/powermac: deprecate pci_get_bus_and_slot()

2017-11-27 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Hard-code the domain number as 0 to match the previous behavior.

Signed-off-by: Sinan Kaya 
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c4c2b3b..3e8b3b6 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1799,7 +1799,7 @@ static int powerbook_sleep_grackle(void)
struct adb_request req;
struct pci_dev *grackle;
 
-   grackle = pci_get_bus_and_slot(0, 0);
+   grackle = pci_get_domain_bus_and_slot(0, 0, 0);
if (!grackle)
return -ENODEV;
 
-- 
1.9.1



[PATCH V4 02/26] powerpc/PCI: deprecate pci_get_bus_and_slot()

2017-12-18 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
is not ready to consume multiple domains and existing code used domain
number 0.

Signed-off-by: Sinan Kaya 
---
 arch/powerpc/kernel/pci_32.c  | 3 ++-
 arch/powerpc/platforms/powermac/feature.c | 2 +-
 arch/powerpc/sysdev/mv64x60_pci.c | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 1d817f4..85ad2f7 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -96,7 +96,8 @@
reg = of_get_property(node, "reg", NULL);
if (!reg)
continue;
-   dev = pci_get_bus_and_slot(pci_bus, ((reg[0] >> 8) & 0xff));
+   dev = pci_get_domain_bus_and_slot(0, pci_bus,
+ ((reg[0] >> 8) & 0xff));
if (!dev || !dev->subordinate) {
pci_dev_put(dev);
continue;
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 466b842..3f82cb2 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -829,7 +829,7 @@ static long core99_scc_enable(struct device_node *node, 
long param, long value)
 
if (value) {
if (pci_device_from_OF_node(node, &pbus, &pid) == 0)
-   pdev = pci_get_bus_and_slot(pbus, pid);
+   pdev = pci_get_domain_bus_and_slot(0, pbus, pid);
if (pdev == NULL)
return 0;
rc = pci_enable_device(pdev);
diff --git a/arch/powerpc/sysdev/mv64x60_pci.c 
b/arch/powerpc/sysdev/mv64x60_pci.c
index d52b3b8..6fe9104 100644
--- a/arch/powerpc/sysdev/mv64x60_pci.c
+++ b/arch/powerpc/sysdev/mv64x60_pci.c
@@ -37,7 +37,7 @@ static ssize_t mv64x60_hs_reg_read(struct file *filp, struct 
kobject *kobj,
if (count < MV64X60_VAL_LEN_MAX)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_read_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, &v);
@@ -61,7 +61,7 @@ static ssize_t mv64x60_hs_reg_write(struct file *filp, struct 
kobject *kobj,
if (sscanf(buf, "%i", &v) != 1)
return -EINVAL;
 
-   phb = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
+   phb = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0));
if (!phb)
return -ENODEV;
pci_write_config_dword(phb, MV64X60_PCICFG_CPCI_HOTSWAP, v);
-- 
1.9.1



[PATCH V4 12/26] powerpc/powermac: deprecate pci_get_bus_and_slot()

2017-12-18 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Hard-code the domain number as 0 to match the previous behavior.

Signed-off-by: Sinan Kaya 
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index e8b29fc..08849e3 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1799,7 +1799,7 @@ static int powerbook_sleep_grackle(void)
struct adb_request req;
struct pci_dev *grackle;
 
-   grackle = pci_get_bus_and_slot(0, 0);
+   grackle = pci_get_domain_bus_and_slot(0, 0, 0);
if (!grackle)
return -ENODEV;
 
-- 
1.9.1



Re: [PATCH V4 02/26] powerpc/PCI: deprecate pci_get_bus_and_slot()

2017-12-19 Thread Sinan Kaya
On 12/19/2017 5:29 AM, Michael Ellerman wrote:
> Sinan Kaya  writes:
> 
>> pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
>> where a PCI device is present. This restricts the device drivers to be
>> reused for other domain numbers.
>>
>> Getting ready to remove pci_get_bus_and_slot() function in favor of
>> pci_get_domain_bus_and_slot().
>>
>> Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
>> is not ready to consume multiple domains and existing code used domain
>> number 0.
>>
>> Signed-off-by: Sinan Kaya 
>> ---
>>  arch/powerpc/kernel/pci_32.c  | 3 ++-
>>  arch/powerpc/platforms/powermac/feature.c | 2 +-
>>  arch/powerpc/sysdev/mv64x60_pci.c | 4 ++--
>>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> Looks to do what it says. I don't know if any of these call sites could
> be improved to provide the domain, but that's not really your problem.
> 

I tried to fix the places where I can at the rest of the series. I even took
a stab at fixing powerpc. There were places where the domain information was
not available and I couldn't locate where to extract it. I decided to leave it
as it is and hope that someone someday may expand the powerpc PCI adaptation. 

> Acked-by: Michael Ellerman  (powerpc)
> 
> cheers
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

2018-03-17 Thread Sinan Kaya
+linuxppc-dev@lists.ozlabs.org

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you 
>>>> suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for 
>>> correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or 
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. 
>>> PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of 
>>> affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this 
>> is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>> int count = 8;
>>
>> while (count) {
>> -   writeq(*src, dst);
>> +   __raw_writeq(*src, dst);
>> src++;
>> dst++;
>> count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 
>> inc, union t4_wr *wqe)
>>  (u64 *)wqe);
>> } else {
>> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> -   writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> -  wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> +   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> +wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> -   writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +   __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +   mmiowmb()
>>  }
>>
>>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 
>> inc,
>>  (void *)wqe);
>> } else {
>> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> -   writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> -  wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> +   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> +wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> -   writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +   __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +   mmiowmb();
> 
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
> 
> Your first patch was right, replacing
>   wmb()
>   writel()
> 
> With wmb(); writel_relaxed() is fine, and helps ARM.
> 
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
> 
> In userspace we settled on something like this as a pattern:
> 
>  mmio_wc_spin_lock()
>  writel_wc_mmio()
>  mmio_wc_spin_unlock()

> 
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
&

RFC on writel and writel_relaxed

2018-03-20 Thread Sinan Kaya
Hi PPC Maintainers,

We are seeking feedback on the status of relaxed write API implementation.
What is the motivation for not implementing the relaxed API? 

I see that network drivers are working around the issue by calling
__raw_write() API directly but this also breaks other architectures
like SPARC since the semantics of __raw_writel() seems to be system dependent.

This is putting drivers into a tight position and they cannot achieve true
multi-arch enablement and are forced into calling __raw APIs flavors
directly with #ifdef BIG_ENDIAN ugliness.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-21 Thread Sinan Kaya
On 3/20/2018 10:40 PM, Oliver wrote:
>> I see that network drivers are working around the issue by calling
>> __raw_write() API directly but this also breaks other architectures
>> like SPARC since the semantics of __raw_writel() seems to be system 
>> dependent.
> Yeah that's pretty gross. Which drivers are doing this?
> 

Searching for __raw_writel() and BIG_ENDIAN in drivers/net directory
should give you the idea.

In a nutshell, drivers are doing this today.

wmb()
__raw_writel();
mmiowb()

I'm in the process of posting patches ([1], [2], [3]) to various subsystems to
eliminate double barriers by replacing sequences like 

wmb()
writel()
mmiowb()

with

wmb()
writel_relaxed()
mmiowb()

Reviewers pointed out that writel_relaxed() is not __raw_writel() on PPC
and cannot take advantage of the optimization. 

Replacing writel_relaxed() with raw_writel() or vice versa is not correct.

writel_relaxed() needs to have ordering guarantees with respect to the order
device observes writes. 

x86 has compiler barrier inside the relaxed() API so that code does not
get reordered. ARM64 architecturally guarantees device writes to be observed
in order.

I was hoping that PPC could follow x86 and inject compiler barrier into the
relaxed functions. 

BTW, I have no idea what compiler barrier does on PPC and if

wrltel() == compiler barrier() + wrltel_relaxed()

can be said.

Need guidance from the PPC maintainers. 

[1] https://www.spinics.net/lists/netdev/msg490480.html
[2] https://www.spinics.net/lists/arm-kernel/msg642341.html
[3] https://www.spinics.net/lists/arm-kernel/msg642336.html

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-21 Thread Sinan Kaya
On 3/21/2018 8:53 AM, Sinan Kaya wrote:
> BTW, I have no idea what compiler barrier does on PPC and if
> 
> wrltel() == compiler barrier() + wrltel_relaxed()
> 
> can be said.

this should have been

writel_relaxed() == compiler barrier() + __raw_writel()



-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-21 Thread Sinan Kaya
On 3/21/2018 9:35 AM, David Laight wrote:
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
> 
> There are places where you don't even need a compile barrier between
> every write.
> 
> I had horrid problems getting some ppc code (for a specific embedded SoC)
> optimised to have no extra barriers.
> I ended up just writing through 'pointer to volatile' and adding an
> explicit 'eieio' between the block of writes and status read.
> 
> No less painful was doing a byteswapping write to normal memory.

If the architecture is reordering writes to the peripheral, then removing
the compiler barrier can break the multi-arch drivers. barriers document
clearly states that device need to observe writes in order. 

Though for special cases like you mentioned, you can certainly do this:

wmb()
__raw_write/pointer access
__raw_write/pointer access
__raw_write/pointer access

/* flush everything */
mmiowb()
__raw_write/pointer access

There would be no ordering guarantee between the wmb() and mmiowb().
This can only be done for known code and known hardware. I don't believe
this applies to multi-arch drivers.


> 
>   David
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-22 Thread Sinan Kaya
On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> No, it's not sufficient.
> Just to clarify ... barrier() is just a compiler barrier, it means the
> compiler will generate things in the order they are written. This isn't
> sufficient on archs with an OO memory model, where an actual memory
> barrier instruction needs to be emited.

Surprisingly, ARM64 GCC compiler generates a write barrier as
opposed to preventing code reordering.

I was curious if this is an ARM only thing or not. 

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-23 Thread Sinan Kaya
On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>>>> No, it's not sufficient.
>>>
>>> Just to clarify ... barrier() is just a compiler barrier, it means the
>>> compiler will generate things in the order they are written. This isn't
>>> sufficient on archs with an OO memory model, where an actual memory
>>> barrier instruction needs to be emited.
>>
>> Surprisingly, ARM64 GCC compiler generates a write barrier as
>> opposed to preventing code reordering.
>>
>> I was curious if this is an ARM only thing or not. 
> 
> Are you sure of that ? I thought it's the ARM implementation of writel
> that had an explicit write barrier in it:

Yes, I'm %100 sure. The answer is both writel() and barrier() generates
a write barrier instruction. I found this by searching the kernel disassembly
for back to back "dsb st" instruction.

> 
> #define writel(v,c)   ({ __iowmb(); writel_relaxed((v),(c)); })
> 
> And __iowmb() is 
> 
> #define __iowmb() wmb()
> 
> Note, I'm a bit dubious about this in ARM:
> 
> #define readl(c)  ({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> 
> Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> 
> The reasoning was there could be some DMA setup followed by a side
> effect readl rather than a side effect writel to trigger a DMA. Granted
> I wouldn't expect modern devices to be that stupid, but I have vague
> memory of some devices back in the day having that sort of read ops.
> 
> In general, I though the model offerred by x86 and thus by Linux
> readl/writel was full synchronization both before and after the MMIO,
> vs either other MMIO or all other forms of ops (cachable memory, locks
> etc...).
> 
> Also, can't the above readl_relaxed leak out of a lock ?

I think you are asking about PPC, correct? 

I read somewhere that PPC implementation keeps track of MMIO accesses and
has an implicit barrier inside the spin_unlock() code for such accesses.
Isn't this true?

> 
> Cheers,
> Ben.
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-24 Thread Sinan Kaya
On 3/23/2018 9:22 PM, Benjamin Herrenschmidt wrote:
>> Yes, I'm %100 sure. The answer is both writel() and barrier() generates
>> a write barrier instruction. I found this by searching the kernel disassembly
>> for back to back "dsb st" instruction.
> I'm not sure you are correct here. As I wrote below, the implementatoin
> of writel() contains an *explicit" memory barrier which is completely
> different to a barrier() instruction:

OK. I did some directed tests and I'm taking it back. 
barrier() is a compiler reordering statement only.

What got me confused was this sequence:

wmb()
barrier()
writel()

I thought that the second barrier instruction was coming from barrier() but
it was actually coming from writel().

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 8:11 AM, ok...@codeaurora.org wrote:
> On 2018-03-26 07:44, Will Deacon wrote:
>> Hi Ben,
>>
>> I don't seem to have the beginning of this thread, so please bounce it over
>> if you'd like me to look at it!
>>
> 
> https://www.spinics.net/lists/linux-rdma/msg62570.html
> 
> https://www.spinics.net/lists/linux-rdma/index.html#62666
> 

To add some more details on why we are looking at this now:

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

The conversation on this thread is interesting.

https://patchwork.kernel.org/patch/10288987/

1. I tried to replace wmb()+writel() with wmb()+writel_relaxed().
2. writel_relaxed() is equal to writel() at this moment for PPC.
3. Chelsio developers wanted to pull it into wmb()+__raw_writel() direction
to take advantage of the same optimization for PPC.
4. Dave informed us that behavior of __raw_write() is not identical on all
architectures.
5. We decided to go back to PPC and ask to implement writel_relaxed()
instead of coming up with writel_realy_relaxed() API.

> 
>> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
>>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> > > > > No, it's not sufficient.
>>> > >
>>> > > Just to clarify ... barrier() is just a compiler barrier, it means the
>>> > > compiler will generate things in the order they are written. This isn't
>>> > > sufficient on archs with an OO memory model, where an actual memory
>>> > > barrier instruction needs to be emited.
>>> >
>>> > Surprisingly, ARM64 GCC compiler generates a write barrier as
>>> > opposed to preventing code reordering.
>>
>> In context, this looks like a misunderstanding somewhere. barrier() is a
>> compiler barrier for us just like everybody else and we use the generic
>> implementation with the empty asm + memory clobber.
>>
> 
> True, I clarified it this weekend
> 
> https://www.spinics.net/lists/linux-rdma/msg62788.html
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 9:43 AM, Arnd Bergmann wrote:
> On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya  wrote:
>> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>>> BTW, I have no idea what compiler barrier does on PPC and if
>>>
>>> wrltel() == compiler barrier() + wrltel_relaxed()
>>>
>>> can be said.
>>
>> this should have been
>>
>> writel_relaxed() == compiler barrier() + __raw_writel()
> 
> I don't think anyone clarified this so far, but there are additional 
> differences
> between the two, writel_relaxed() assumes we are talking to a 32-bit
> little-endian
> MMIO register, while __raw_writel() is primarily used for writing into
> memory-type
> regions with no particular byte order. This means:
> 
> - writel_relaxed() must perform a byte swap when running on big-endian kernels
> - when used with __packed MMIO pointers, __raw_writel() may turn into a series
>   of byte writes, while writel_relaxed() must result in a single 32-bit 
> access.
> - A set if consecutive writel_relaxed() on the same device is issued in 
> program
>   order, while __raw_writel() is not ordered. This typically requires
> only a compiler
>   barrier, but may also need a CPU barrier (in addition to the
> barriers we use to
>   serialize with spinlocks and DMA in writel() but not writel_relaxed()).

Thanks for the great summary. I didn't know that __raw_writel() could get 
converted
to byte writes.

> 
> Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>> But that was never a requirement of writel(),
>> Documentation/memory-barriers.txt gives an explicit example demanding
>> the wmb() before writel() for ordering system memory against writel.
> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.
> 

ARM developers can get away with not including wmb() in their code and use
writel() to observe memory writes due to implicit barriers.

However, same code will not work on Intel.

writel() has a compiler barrier in it for x86.
wmb() has a sync operation in it for x86. 

Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>>>> But that was never a requirement of writel(),
>>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>>> the wmb() before writel() for ordering system memory against writel.
>>>
>>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>>> Adding Alexander Duyck to Cc, he added that section as part of
>>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>>> dma_wmb()"). Also adding the other people that were involved with that.
>>>
>>
>> ARM developers can get away with not including wmb() in their code and use
>> writel() to observe memory writes due to implicit barriers.
>>
>> However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.

If writel() provides such a guarantee, why do I see code sequences like

wmb()
writel() 

all over the place.

> 
>> writel() has a compiler barrier in it for x86.
>> wmb() has a sync operation in it for x86. 
>>
>> Unless wmb() is called, PCIe device won't observe memory updates from the 
>> CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 

I'm still reserving my doubts here. I was told about an explicit
wmb() requirement last week.

> Otherwise almost all drivers out there are broken which I very much
> doubt :-)
> 
> Cheers,
> Ben.
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 7:02 AM, Will Deacon wrote:
> - See Documentation/DMA-API.txt for more information on consistent memory.
> + can see it now has ownership.  Note that, when using writel(), a prior
> + wmb() is not needed to guarantee that the cache coherent memory writes
> + have completed before writing to the MMIO region.  The cheaper
> + writel_relaxed() does not provide this guarantee and must not be used
> + here.

Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
in front of these.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
+netdev, +Alex

On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>>  Most of the drivers have a unwound loop with writeq() or something to
>>> do it.
>>
>> But isn't the writeq() barrier much more expensive than anything you'd
>> do in function calls?
> 
> It is for us, and will break any write combining.
> 
>>>>> The same document says that _relaxed() does not give that guarentee.
>>>>>
>>>>> The lwn articule on this went into some depth on the interaction with
>>>>> spinlocks.
>>>>>
>>>>> As far as I can see, containment in a spinlock seems to be the only
>>>>> different between writel and writel_relaxed..
>>>>
>>>> I was always puzzled by this: The intention of _relaxed() on ARM
>>>> (where it originates) was to skip the barrier that serializes DMA
>>>> with MMIO, not to skip the serialization between MMIO and locks.
>>>
>>> But that was never a requirement of writel(),
>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>> the wmb() before writel() for ordering system memory against writel.
> 
> This is a bug in the documentation.
> 
>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> Adding Alexander Duyck to Cc, he added that section as part of
>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> dma_wmb()"). Also adding the other people that were involved with that.
> 
> Linus himself made it very clear years ago. readl and writel have to
> order vs memory accesses.
> 
>>> I actually have no idea why ARM had that barrier, I always assumed it
>>> was to give program ordering to the accesses and that _relaxed allowed
>>> re-ordering (the usual meaning of relaxed)..
>>>
>>> But the barrier document makes it pretty clear that the only
>>> difference between the two is spinlock containment, and WillD wrote
>>> this text, so I belive it is accurate for ARM.
>>>
>>> Very confusing.
>>
>> It does mention serialization with both DMA and locks in the
>> section about  readX_relaxed()/writeX_relaxed(). The part
>> about DMA is very clear here, and I must have just forgotten
>> the exact semantics with regards to spinlocks. I'm still not
>> sure what prevents a writel() from leaking out the end of a
>> spinlock section that doesn't happen with writel_relaxed(), since
>> the barrier in writel() comes before the access, and the
>> spin_unlock() shouldn't affect the external buses.
> 
> So...
> 
> Historically, what happened is that we (we means whoever participated
> in the discussion on the list with Linus calling the shots really)
> decided that there was no sane way for drivers to understand a world
> where readl/writel didn't fully order things vs. memory accesses (ie,
> DMA).
> 
> So it should always be correct to do:
> 
>   - Write to some in-memory buffer
>   - writel() to kick the DMA read of that buffer
> 
> without any extra barrier.
> 
> The spinlock situation however got murky. Mostly that came up because
> on architecture (I forgot who, might have been ia64) has a hard time
> providing that consistency without making writel insanely expensive.
> 
> Thus they created mmiowb whose main purpose was precisely to order
> writel with a following spin_unlock.
> 
> I decided not to go down that path on power because getting all drivers
> "fixed" to do the right thing was going to be a losing battle, and
> instead added per-cpu tracking of writel in order to "escalate" to a
> heavier barrier in spin_unlock itself when necessary.
> 
> Now, all this happened more than a decade ago and it's possible that
> the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan
"We are being told that if you use writel(), then you don't need a wmb() on
all architectures."

Alex:
"I'm not sure who told you that but that is incorrect, at least for
x86. If you attempt to use writel() without the wmb() we will have to
NAK the patches. We will accept the wmb() with writel_releaxed() since
that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex:
"Don't bother. I can tell you right now that for x86 you have to have a
wmb() before the writel().

Based on the comment in
(https://www.spinics.net/lists/linux-rdma/msg62666.html):
Replacing wmb() + writel() with wmb() + writel_rela

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 5:54 PM, Alexander Duyck wrote:
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.

Thanks for the reminder. 

I forgot about the double barrier optimization. wmb() + writel_relaxed()
seems to be the best option for Intel network drivers at this moment. 
Otherwise, we'll have to remove wmb() and throw in smp barriers there
like you mentioned.

I'll leave the changes in the Intel drivers alone.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>> dma_buffer->foo = 1;/* WB */
>> writel(KICK, DMA_KICK_REGISTER);/* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?
We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

 dma_buffer->foo = 1;/* WB */
 wmb()
 writel(KICK, DMA_KICK_REGISTER);/* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()

This is a small step in the performance direction until we remove all 
exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/28/2018 11:55 AM, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>> the archs are unused or dead anyway.
> 
> Agreed.
> 

I pinged most of the maintainers yesterday.
Which arches do we care about these days?
I have not been paying attention any other architecture besides arm64.

archstatus  detail
--  -   
alpha   question sent
arc question sent   ys...@users.sourceforge.jp will fix it.
arm no issues
arm64   no issues
blackfinquestion sent   about to be removed
c6x question sent
crisquestion sent
frv
h8300   question sent
hexagon question sent
ia64no issues   confirmed by Tony Luck
m32r
m68kquestion sent
metag
microblaze  question sent
mipsquestion sent
mn10300 question sent
nios2   question sent
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem but still looking
powerpc no issues
riscv   question sent
s390question sent
score   question sent
sh  question sent
sparc   question sent
tilequestion sent
unicore32   question sent
x86 no issues
xtensa  question sent


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/29/2018 12:29 PM, Arnd Bergmann wrote:
> On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya  wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
>>
>> archstatus  detail
>> --  -   
>> alpha   question sent

Thanks for the detailed analysis.

> 
> I'm guessing alpha has problems
> 
> extern inline u32 readl(const volatile void __iomem *addr)
> {
> u32 ret = __raw_readl(addr);
> mb();
> return ret;
> }
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> __raw_writel(b, addr);
> mb();
> }

Looks like a problem to me too. I'll start a thread with the alpha
people and CC you.


> 
> There is a barrier in writel /after/ the acess but not before.
> 

This is the consolidated list. I also heart back from m68k and corrected
contacts for arc and h8300.

archstatus  detail
--  -   
alpha   question sent   Arnd: alpha has problems
arc question sent   vineet.gup...@synopsys.com says he'll 
get to this
in the next few days
arm no issues
arm64   no issues
c6x no issues   no PCI
h8300   no issues   no PCI: ys...@users.sourceforge.jp will 
fix it.
hexagon no issues   no PCI
ia64no issues   confirmed by Tony Luck
m68kno issues   ge...@linux-m68k.org says no problem
metag   no issues   arnd: removed
microblaze  question sent   arnd: some mips platforms have problems
mipsquestion sent   arnd: some mips platforms have problems
nds32   question sent
nios2   no issues   no PCI
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem
but still looking
powerpc no issues
riscv   no issues   arnd: riscv should be fine
s390no issues   arnd: Pretty sure this is also fine
sh  question sent
sparc   no issues       da...@davemloft.net says always 
strongly ordered
unicore32   question sent   resent to g...@pku.edu.cn
x86 no issues
x86_64  no issues



> 
>  Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-04-02 Thread Sinan Kaya
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() 
implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() 
implementation

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to mips and alpha mailing lists. I'll follow 
up with
an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.