** Changed in: linux-lts-utopic (Ubuntu)
       Status: New => Invalid

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-lts-utopic in Ubuntu.
https://bugs.launchpad.net/bugs/1533351

Title: DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS

Status in linux-lts-utopic package in Ubuntu: Invalid

Bug description:

== Comment: #0 - HARSHA THYAGARAJA <hathy...@in.ibm.com> - 2015-11-06 04:10:32 ==

---Problem Description---
DLPAR operation fails on Bell adapter

Contact Information = hathy...@in.ibm.com, iranna.an...@in.ibm.com

---uname output---
Linux tuletapio1-lp5 3.13.0-67-generic #110-Ubuntu SMP Fri Oct 23 13:24:51 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = 8286-41A

---Steps to Reproduce---
Necessary packages installed are:
devices.chrp.base.servicerm_2.5.0.1-15111_ppc64el.deb
dynamicrm_2.0.1-3_ppc64el.deb
rsct.core_3.2.0.6-15111_ppc64el.deb
rsct.core.utils_3.2.0.6-15111_ppc64el.deb
src_3.2.0.6-15111_ppc64el.deb

On the OS:

root@tuletapio1-lp5:~# startsrc -g rsct
0513-059 The ctcas Subsystem has been started. Subsystem PID is 1382.
0513-029 The ctrmc Subsystem is already active. Multiple instances are not supported.

root@tuletapio1-lp5:~# startsrc -g rsct_rm
0513-029 The IBM.MgmtDomainRM Subsystem is already active. Multiple instances are not supported.
0513-059 The IBM.ERRM Subsystem has been started. Subsystem PID is 1389.
0513-029 The IBM.HostRM Subsystem is already active. Multiple instances are not supported.
0513-059 The IBM.AuditRM Subsystem has been started. Subsystem PID is 1390.
0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1393.
0513-029 The IBM.DRM Subsystem is already active. Multiple instances are not supported.
0513-029 The IBM.ServiceRM Subsystem is already active. Multiple instances are not supported.

root@tuletapio1-lp5:~# lssrc -a
Subsystem         Group      PID    Status
 ctrmc            rsct       921    active
 IBM.DRM          rsct_rm    1025   active
 IBM.MgmtDomainRM rsct_rm    1130   active
 IBM.HostRM       rsct_rm    1143   active
 IBM.ServiceRM    rsct_rm    1183   active
 ctcas            rsct       1382   active
 IBM.ERRM         rsct_rm    1389   active
 IBM.AuditRM      rsct_rm    1390   active
 IBM.SensorRM     rsct_rm    1393   active

In the HMC:

Run the command:

hscroot@pwrio-hmc:~> lshwres -r io -m tuletapio1-fsp --rsubtype slot --filter "lpar_names=tuletapio1-lp5-iranna"
unit_phys_loc=U78C9.001.WZS00CH,bus_id=24,phys_loc=C6,drc_index=21010018,lpar_name=tuletapio1-lp5-iranna,lpar_id=5,slot_io_pool_id=none,description=Quad Async EIA-232 PCI-Express Adapter,feature_codes=none,pci_vendor_id=114F,pci_device_id=00B6,pci_subs_vendor_id=114F,pci_subs_device_id=00B6,pci_class=0000,pci_revision_id=AA,bus_grouping=0,iop=0,parent_slot_drc_index=none,drc_name=U78C9.001.WZS00CH-P1-C6,interposer_present=0,interposer_pcie=0,lpar_assignment_capable=1,dynamic_lpar_assignment_capable=1

hscroot@pwrio-hmc:~> chhwres -r io -m tuletapio1-fsp -o r --id 5 -l 21010018
HSCL2929 The dynamic removal of I/O resources failed: The I/O slot dynamic partitioning operation failed. Here are the I/O slot IDs that failed and the reasons for failure:

Validating PHB DLPAR capability...yes.
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
kernel remove failed for PHB 24, rc = -1

Observed in the terminal:

Nov 4 05:26:43 tuletapio1-lp5 kernel: [ 553.125671] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:44 tuletapio1-lp5 kernel: [ 554.125766] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:45 tuletapio1-lp5 kernel: [ 555.125862] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:46 tuletapio1-lp5 kernel: [ 556.125957] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:47 tuletapio1-lp5 kernel: [ 557.126052] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:48 tuletapio1-lp5 kernel: [ 558.126148] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:49 tuletapio1-lp5 kernel: [ 559.126243] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:50 tuletapio1-lp5 kernel: [ 560.126338] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:51 tuletapio1-lp5 kernel: [ 561.126432] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:52 tuletapio1-lp5 kernel: [ 562.126527] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:53 tuletapio1-lp5 kernel: [ 563.126622] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:54 tuletapio1-lp5 kernel: [ 564.126717] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:55 tuletapio1-lp5 kernel: [ 565.126813] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:56 tuletapio1-lp5 kernel: [ 566.126908] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:57 tuletapio1-lp5 kernel: [ 567.127004] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:58 tuletapio1-lp5 kernel: [ 568.127099] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:59 tuletapio1-lp5 kernel: [ 569.127193] rpadlpar_io: slot PHB 24 removed

The terminal prints the above message continuously, reporting that the
adapter has been removed, but lspci -nn still shows the entry for the
adapter:

root@tuletapio1-lp5:~# lspci -nn
01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge [10b5:8112] (rev ff)
02:00.0 Serial controller [0700]: Digi International Digi Neo 4 (IBM version) [114f:00f4] (rev ff)

Details of the system:
IP: 9.40.192.64
creds: root/ltcnetdd

*Additional Instructions for hathy...@in.ibm.com, iranna.an...@in.ibm.com:
-Post a private note with access information to the machine that the bug is occurring on.

== Comment: #1 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2015-11-20 11:10:57 ==

Interesting. Looking at the drmgr logs, the DLPAR remove of the PHB is
not failing because of an error but because drmgr is timing out before
it is able to complete the request.

I am building the latest upstream code and will take a look at why the
request is timing out; it should be able to complete within the
five-minute timeout given.

########## Nov 04 05:09:32 2015 ##########
drmgr: -r -c phb -s PHB 24 -w 5 -d 1
Validating PHB DLPAR capability...yes.
Getting node types 0x00000010
DR nodes list
==============
/proc/device-tree/pci@800000020000018:
    drc index: 0x20000018 description: Unknown slot type
    drc name: PHB 24 loc code: U78C9.001.WZS00CH-P1
/proc/device-tree/pci@800000020000018:
    drc index: 0x22010018 description: PCI-E capable, Rev 3, 16x lanes with 16x lanes connected
    drc name: U78C9.001.WZS00CH-P1-C6 loc code: U78C9.001.WZS00CH-P1
Retrieving hotplug nodes
Could not find DRC property group in path: /proc/device-tree/pci@800000020000018/pci@0.
setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS00CH-P1-C6
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0/serial@0
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:01:00.0 /pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:02:00.0 /pci@800000020000018/pci@0/serial@0
performing kernel op for PHB 24, file is /sys/bus/pci/slots/control/remove_slot
Drmgr has exceeded its specified wait time and will not continue
kernel remove failed for PHB 24, rc = -1
########## Nov 04 05:14:32 2015 ##########

== Comment: #3 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2015-12-01 11:08:57 ==

When looking at this bz prior to the Thanksgiving break, I noticed that
the hotplug slots under this PHB are not getting registered by the
rpadlpar_io kernel module (this is where we handle PCI hotplug on
Power). As a result, they are not removed when we go to remove the PHB,
which produces the scenario we are seeing: the continuous output of the
"slot PHB 24 removed" message.

Can anyone comment on whether this issue is seen on any other systems or
on any other distros?

== Comment: #4 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2015-12-08 12:09:13 ==

Updates from further investigation into this issue.

This does not appear to be a drmgr issue. I was able to boot a 4.2
kernel on the system and then add and remove the adapter without any
problems.

It appears the DLPAR add of the adapter is failing because the device
tree gets set up wrong. In the process of adding the adapter, the first
update to the device tree adds the interrupt controller for the PHB;
afterwards we add the PHB itself. When the PHB is added, the kernel puts
it under the interrupt controller instead of in the root of the device
tree, where it belongs.
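
For illustration, here is a toy Python model of the misplacement described above. It is not kernel code and is not taken from the patch: the class and function names are hypothetical, and it assumes only that a dynamic attach can leave a non-root node at the head of the global node list, so that a lookup which treats the list head as the root returns the wrong parent. The node name "interrupt-controller" is a placeholder; the PHB name is the one seen in the drmgr log.

```python
# Toy model only, NOT kernel code. All names here are hypothetical.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None

    def path(self):
        # Build the full path by walking up to the root.
        if self.parent is None:
            return "/"
        return self.parent.path().rstrip("/") + "/" + self.name


class DeviceTree:
    def __init__(self):
        self.root = Node("")
        # Head of the global list of all nodes. Initially it is the root.
        self.allnodes = self.root

    def attach(self, node, parent):
        # Assumption: a dynamic attach leaves the new node at the head
        # of the global list, so the head is no longer the root.
        node.parent = parent
        self.allnodes = node

    def find_root(self, assume_head_is_root):
        # Buggy variant trusts the list head; fixed variant uses the real root.
        return self.allnodes if assume_head_is_root else self.root


def dlpar_add_phb(assume_head_is_root):
    tree = DeviceTree()
    # Step 1: add the interrupt controller for the PHB.
    ic = Node("interrupt-controller")
    tree.attach(ic, tree.find_root(assume_head_is_root))
    # Step 2: add the PHB itself; it belongs directly under the root.
    phb = Node("pci@800000020000018")
    tree.attach(phb, tree.find_root(assume_head_is_root))
    return phb.path()


print(dlpar_add_phb(assume_head_is_root=True))   # /interrupt-controller/pci@800000020000018
print(dlpar_add_phb(assume_head_is_root=False))  # /pci@800000020000018
```

In the first call the second lookup returns the freshly attached interrupt controller, so the PHB lands beneath it; in the second call the lookup always starts from the real root and the PHB lands where a tool scanning the top level of the tree would expect it.
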
Because the PHB ends up under the interrupt controller, the drmgr
command reports a failure: it cannot find the PHB after adding it to the
device tree, since the PHB should not be under the interrupt controller
and we do not look for it there.

As mentioned above, the same drmgr command fails on the stock kernel and
works on a 4.2 kernel.

Next step is to determine why the PHB is being put under the interrupt
controller instead of the root node.

== Comment: #5 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2016-01-12 11:58:18 ==

The fix for this issue is already upstream in commit
99de64984c3a7c9bf56a50e6dcc51006c9485620

    OF: fix of_find_node_by_path() assumption that of_allnodes is root

    of_find_node_by_path() is borked because of_allnodes is not
    guaranteed to contain the root of the tree after using any of the
    dynamic update functions because some other nodes ends up as
    of_allnodes.

    Fixes: c22e650e66b8 of: Make of_find_node_by_path() handle /aliases
    Reported-by: pantelis.anton...@konsulko.com
    Signed-off-by: Frank Rowand <frank.row...@sonymobile.com>
    Signed-off-by: Rob Herring <r...@kernel.org>

Attached is a backport of the patch.

== Comment: #7 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2016-01-12 12:13:10 ==

This patch is needed to avoid breaking DLPAR capabilities on the Power
platforms. Without this patch, the ability of Power platforms to add
devices through DLPAR is broken.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1533351/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp