Public bug reported:

== Comment: #33 - Mauricio Faria De Oliveira - 2016-12-09 06:49:57 ==

Hi Canonical,

Can you please apply this patch [1] to 16.10 and 16.04.x HWE (4.8)?
It fixes a regression introduced in 4.8.

As you can see, it's in the SCSI maintainer (James Bottomley)'s 'fixes'
branch, but didn't make 4.9-rc8 (maybe he considered it late for this
one).

We have installer, boot, and post-boot problems due to this one.
It'd be good if the netboot images for 16.04.x HWE kernel can get it too.

Thank you,

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
    http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

Historical context:

== Comment: #0 - HARSHA THYAGARAJA - 2016-11-21 02:39:35 ==
---Problem Description---
Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode." (kernel: 4.8.0-27)
 
Machine Type = Power8 baremetal  
 
---boot type---
QEMU direct boot kernel/initrd
 
---Kernel cmdline used to launch install---
On a Power8 server, using kernel and initrd images:

netcfg/disable_dhcp=true netcfg/confirm_static=true 
netcfg/choose_interface=98:BE:94:00:4C:68 netcfg/get_ipaddress=9.47.67.159/20 
netcfg/get_gateway=9.47.79.254 netcfg/get_nameservers=
 
---Install repository type---
Internet repository
 
---Install repository Location---
http://ports.ubuntu.com/ubuntu-ports/dists/yakkety/main/installer-ppc64el/current/images/netboot/ubuntu-installer/ppc64el/
 
---Point of failure---
Other failure during installation (stage 1)

== Comment: #1 - HARSHA THYAGARAJA - 2016-11-21 02:41:54 ==
The netboot install fails, and call traces are seen at the disk detection step.

== Comment: #8 - Mauricio Faria De Oliveira - 2016-11-21 15:58:25 ==
Finally got it.

The faulting offset/instruction and the trap signal (sig: 5) both point to
the BUG_ON() below; its second condition is the one that triggered the trap.

Now checking why piocb is not NULL but piocb->vport is NULL.
This might have happened in the lpfc_linkdown_port() path, which appears in
the stack trace.

A more readable console log (i.e., dmesg, as requested in comments 3 and 5)
would help in understanding it.

--

static int
lpfc_sli_ringtxcmpl_put(struct lpfc_hba *phba, struct lpfc_sli_ring *pring,
                        struct lpfc_iocbq *piocb)
{
        lockdep_assert_held(&phba->hbalock);

        BUG_ON(!piocb || !piocb->vport);
<...>
}
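
The short-circuit order of that check can be sketched in user space. This is only an illustration: struct vport_stub, struct iocbq_stub and classify() are hypothetical stand-ins, not the lpfc driver's types or API, and only the one field used by the check is modelled.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver types (assumption, not lpfc code). */
struct vport_stub { int unused; };
struct iocbq_stub { struct vport_stub *vport; };

enum bug_cause { BUG_NONE, BUG_NULL_IOCB, BUG_NULL_VPORT };

/* Mirrors the evaluation order of BUG_ON(!piocb || !piocb->vport):
 * the pointer itself is tested first, and ->vport is only read
 * when the pointer is non-NULL. */
static enum bug_cause classify(const struct iocbq_stub *piocb)
{
        if (!piocb)
                return BUG_NULL_IOCB;   /* former part of the OR */
        if (!piocb->vport)
                return BUG_NULL_VPORT;  /* latter part of the OR */
        return BUG_NONE;
}
```

The case reported here corresponds to BUG_NULL_VPORT: a valid piocb whose vport field is NULL.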

[  226.147886] NIP [d00000000b7324c0] lpfc_sli_ringtxcmpl_put+0x48/0x120 [lpfc]

0x2478 + 0x48 = 0x24c0 (tdnei; trap doubleword not equal immediate)
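
As a quick check of that arithmetic, a small C sketch follows. The helper names faulting_offset() and low16() are made up for illustration. Comparing the low 16 bits of the NIP with the object-file offset is only an observation about this particular trace (it would hold if the module text was loaded at a 64 KiB-aligned address), not a general rule.

```c
#include <stdint.h>

/* Offset of the faulting instruction within the object file:
 * symbol start (from the disassembly) plus the offset within
 * the function (from the oops line). */
static uint64_t faulting_offset(uint64_t sym_start, uint64_t off_in_func)
{
        return sym_start + off_in_func;
}

/* Low 16 bits of an address, for comparing a runtime NIP with
 * an object-file offset. */
static uint64_t low16(uint64_t addr)
{
        return addr & 0xffff;
}
```

With the values above, faulting_offset(0x2478, 0x48) gives 0x24c0, and low16(0xd00000000b7324c0) is also 0x24c0, which matches the address of the tdnei instruction in the listing.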

0000000000002478 <lpfc_sli_ringtxcmpl_put>:
<...>
    2498:       78 2b bf 7c     mr      r31,r5
                // r31 is piocb (r5 holds the 3rd function parameter)
    249c:       78 1b 7d 7c     mr      r29,r3
    24a0:       78 23 9e 7c     mr      r30,r4
    24a4:       01 00 00 48     bl      24a4 <lpfc_sli_ringtxcmpl_put+0x2c>
                // probably converted at module load time to the call for
                // lockdep_assert_held()
    24a8:       00 00 00 60     nop
    24ac:       00 00 bf 2f     cmpdi   cr7,r31,0
                // compare piocb with 0, i.e., check for NULL.
    24b0:       70 00 9e 41     beq     cr7,2520 <lpfc_sli_ringtxcmpl_put+0xa8>
                // if equal to zero, branch out. done w/ the former part of
                // the OR check.
    24b4:       e8 00 3f e9     ld      r9,232(r31)
                // load from an offset of piocb; probably piocb->vport in
                // the BUG_ON()
    24b8:       74 00 29 7d     cntlzd  r9,r9
                // count leading zeroes. if r9 is NULL (0), the count is 64
                // (binary: 0100_0000; bit 6 is 1, and the 6 LSbs
                // [bits 5-0] are 0)
    24bc:       82 d1 29 79     rldicl  r9,r9,58,6
                // rotate left by 58 (i.e., right by 6): the 6 LSbs of the
                // count become the 6 MSbs, and bit 6 becomes bit 0 (the
                // LSb). then clear the 6 MSbs and keep the lower bits, so
                // only the old bit 6 survives, now as the LSb.
    24c0:       00 00 09 0b     tdnei   r9,0
                // trap if r9 is not equal to zero. if the loaded value was
                // zero (64 leading zeroes), bit 6 of the count was 1 and is
                // now bit 0, so r9 is non-zero and the trap triggers.
                // this checks the latter part of the OR.
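
The cntlzd/rldicl pair above is a branch-free "is this value zero?" test. A minimal user-space emulation, assuming the instruction semantics as described in the comments (the C helpers cntlzd(), rldicl() and trap_operand() are illustrative and are not kernel or compiler code):

```c
#include <assert.h>
#include <stdint.h>

/* cntlzd: count leading zero bits of a 64-bit value; yields 64 for zero. */
static uint64_t cntlzd(uint64_t x)
{
        uint64_t n = 0;

        for (uint64_t mask = UINT64_C(1) << 63;
             mask != 0 && (x & mask) == 0; mask >>= 1)
                n++;
        return n;
}

/* rldicl rA,rS,SH,MB: rotate left by SH bits, then clear the MB
 * most-significant bits. */
static uint64_t rldicl(uint64_t x, unsigned sh, unsigned mb)
{
        assert(sh < 64 && mb < 64);

        uint64_t rot = sh ? (x << sh) | (x >> (64 - sh)) : x;

        return rot & (UINT64_MAX >> mb);
}

/* Value left in r9 right before "tdnei r9,0" for a given loaded value. */
static uint64_t trap_operand(uint64_t loaded)
{
        return rldicl(cntlzd(loaded), 58, 6);
}
```

Under these assumptions trap_operand() returns 1 only when the loaded value is 0 (a leading-zero count of 64) and 0 for any non-zero value, so tdnei traps exactly when piocb->vport is NULL.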

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-148978 severity-critical 
targetmilestone-inin1610

** Tags added: architecture-ppc64le bugnameltc-148978 severity-critical
targetmilestone-inin1610

** Changed in: ubuntu
     Assignee: (unassigned) => Taco Screen team (taco-screen-team)

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1648873

Title:
  Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel
  mode, sig: 5 [#1] " (lpfc)

Status in linux package in Ubuntu:
  New

