Re: [PCI] BUG: unable to handle kernel

Guenter Roeck Mon, 09 Mar 2015 09:10:00 -0700

On 03/09/2015 08:53 AM, Murali Karicheri wrote:

On 03/09/2015 10:44 AM, Bjorn Helgaas wrote:

On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<[email protected]>  wrote:

On 03/06/2015 12:58 PM, Murali Karicheri wrote:


On 03/06/2015 11:55 AM, Guenter Roeck wrote:


On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote:
[ ... ]

 From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 00:00:00 2001

From: Murali Karicheri<[email protected]>
Date: Fri, 6 Mar 2015 10:23:08 -0500
Subject: [PATCH] pci: of : fix kernel crash

This is a debug patch to root cause the kernel crash

commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6

PCI: Update DMA configuration from DT

Signed-off-by: Murali Karicheri<[email protected]>
---
drivers/of/of_pci.c | 8 ++++++++
drivers/pci/host-bridge.c | 5 +++++
2 files changed, 13 insertions(+)

diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index 86d3c38..5a59fb8 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev *pci_dev)
struct device *dev =&pci_dev->dev;
struct device *bridge = pci_get_host_bridge_device(pci_dev);

+ if (!bridge || !bridge->parent) {
+ if (!bridge)
+ pr_err("PCI bridge not found\n");
+ if (!bridge->parent)
+ pr_err("PCI bridge parent not found\n");



You'll see a crash here if bridge is NULL. Maybe add an else before
the second
if statement ? Also, dev_err might be a bit more useful and would be
available.

Fixed and attached.

Murali


Thanks,
Guenter

+ return;
+ }
+
of_dma_configure(dev, bridge->parent->of_node);
pci_put_host_bridge_device(bridge);
}
diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index 3e5bbf9..ef2ab51 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct
pci_dev *dev)
struct pci_bus *root_bus = find_pci_root_bus(dev->bus);
struct device *bridge = root_bus->bridge;

+ if (!bridge) {
+ pr_err("PCI: bridge not found\n");
+ return NULL;
+ }
+
kobject_get(&bridge->kobj);
return bridge;
}
--
1.7.9.5

BJorn,

Any chance of applying the attached debug patch to see if this fixes and
provide some additional information on this BUG? Not sure who will pick this
one and apply.


The change that caused the oops (0b2af171520e ("PCI: Update DMA
configuration from DT")) only exists on my pci/iommu branch, so I'm
the one to apply it.

It's much easier for me to deal with plain text patches (not attachments).

I'm hesitating because I don't want to encourage use of the 0-day
testing robot as a tool at which we can just throw debug patches.  The
robot is a service that costs somebody real money, and I want to be a
good neighbor when using it.


Thanks for the clarification as I don't have much information on the testing 
robot. At the same time the question is how similar incidence in the past have 
been handled. I am a newbie w.r.t to this. This is first time I have introduced 
a patch that impacts multiple arch/machines.


Was the information in the robot's report enough to reproduce the
oops?  If not, is there additional information we could add to the
report that would enable you to reproduce it?  Even if we can't
reproduce the oops, the report seems detailed enough that we should be
able to deduce the problem and produce a fix in which we have high
confidence.


The BUG report essentially indicates the crash happened in 
of_pci_dma_configure(). The machine specific log make sense to a person 
familiar with this arch and I am not familiar with the same. So anyone can help 
narrow down the root cause of this?

Looking at the code, there are two ptr variables that are accessed without 
checking for NULL as initial thinking was that these can never be NULL. So the 
debug patch is just adding addition check before accessing the ptr. I can send 
this patch without debug prints if that make sense. I was thinking to get 
confirmation that this is indeed the case before adding the check. What do you 
think the right approach here? Send a patch for this to the ML for adding the 
check as a potential fix? Or someone can help me investigate the crash dump and 
root cause it? or if we can use test robot to confirm this, I can re-send the 
patch ASIS to the list. Please clarify.

If the assumption is that the pointers can never be NULL,
wouldn't it be important to see a call trace and to find out
if the NULL pointers can actually be seen by design,
or if there is some other bug ?

I am a bit concerned that adding those NULL pointer checks
might end up hiding some other bug, ie that they just hide
the real bug without fixing it.

Thanks,
Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PCI] BUG: unable to handle kernel

Reply via email to