[Kernel-packages] [Bug 1813018] Re: Kernel Oops - unable to handle kernel paging request; RIP is at wait_migrate_huge_page+0x51/0x70

Mauricio Faria de Oliveira Tue, 29 Oct 2019 08:22:58 -0700

** Description changed:

- Kernel oops occurs randomly every now and then, seemingly when running
- memory-intensive processes (so far, it happened to me when using bowtie2
- or STAR).
+ [Impact]
+ 
+  * Users on NUMA systems (mostly servers) with
+    NUMA balancing enabled (which is the default)
+    might hit a crash/BUG() on a race condition
+    if two simultaneous page faults of the same
+    transparent hugepage go into the path for
+    migration to another NUMA node.
+ 
+  * The symptom is BUG() for 0xffffeaffffffffc0,
+    which happens if the PMD is set to zero/NULL.
+ 
+      BUG: unable to handle kernel paging request at ffffeaffffffffc0
+      IP: [<ffffffff811b3d31>] wait_migrate_huge_page+0x51/0x70
+ 
+  * NUMA balancing periodically unmaps pages so as to
+    force page faults to occur, and later finds out
+    from those page faults where the NUMA memory access
+    is coming from - if it's often from another NUMA
+    node, it attempts to migrate the page contents
+    to that NUMA node (for more local access).
+ 
+  * The race condition is related to these 3 functions
+    in the pagefault handling of transparent hugepages:
+ 
+    do_huge_pmd_numa_page() -> wait_migrate_huge_page()
+    do_huge_pmd_numa_page() -> migrate_misplaced_transhuge_page()
+ 
+    The first task to hit the pagefault / migration path
+    calls migrate_misplaced_transhuge_page(), which does:
+ 
+    - ptl = pmd_lock(mm, pmd)     // calls spin_lock(ptl)
+    - pmdp_clear_flush(..., pmd); // set PMD to zero
+    - set_pmd_at(..., pmd, ...);  // set PMD to non-zero
+    - spin_unlock(ptl);
+ 
+    The second task to hit that path finds that the page
+    is already being migrated (page is locked) and waits
+    for that to finish (i.e., until page is unlocked), doing:
+ 
+    - spin_unlock(ptl)
+    - page = pmd_page(*pmd)
+    - wait_on_page_locked(page)
+ 
+    *BUT* it reads the PMD value *after* releasing the PMD lock.
+ 
+     So, if the tasks/CPUs manage to run in the sequence
+     below the PMD can be set to zero/NULL by first task
+     and read by second task before it's set to non-NULL.
+     Thus the second task miscalculates the page pointer
+     from the PMD and hits BUG for address 0xffffeaffffffffc0.
+ 
+     Task 1 / CPU 1                            Task 2 / CPU 2
+ 
+     do_huge_pmd_numa_page()                   do_huge_pmd_numa_page()
+     - pmd_lock()                                .  
+     - trylock_page() // PageLocked = true       .  
+       .                                         .  
+     - spin_unlock()                             .  
+       .                                       - pmd_lock()
+       .                                       - pmd_trans_migrating() // PageLocked == true
+       .                                       - spin_unlock()
+     - migrate_misplaced_transhuge_page()        .
+       - pmd_lock()                              .
+       - pmdp_clear_flush() // PMD = NULL        .
+         .                                     - wait_migrate_huge_page()
+         .                                       - page = pmd_page() // PMD == NULL ... page = <bogus>
+         .                                       - wait_on_page_locked(page) // BUG()
+         .                                         < pagefault handler in bad state >
+         .                                         < that userspace process is hung >
+       - set_pmd_at() // PMD = non-NULL
+       - spin_unlock()
+ 
+  * The fix just moves pmd_page() before spin_unlock(),
+    and now the change performed in the other function
+    (done within the spin_lock()/spin_unlock() region)
+    can no longer run concurrently with this PMD read.
+ 
+  * So, when the other function releases the spin lock,
+    the PMD has already been set to non-NULL/valid PMD,
+    and wait_on_page_locked() receives a valid address.
+ 
+  * Fix commit 5d833062139d ("mm: numa: do not dereference
+    pmd outside of the lock during NUMA hinting fault") [1]
+ 
+  * Applied in v4.0 upstream; only Trusty/3.13 needs it.
+    
+    $ git describe --contains 5d833062139d
+    v4.0-rc1~98^2~103
+ 
+ <PENDING>
+     
+ [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d833062139d
+ 
+ 
+ Kernel oops occurs randomly every now and then, seemingly when running memory-intensive processes (so far, it happened to me when using bowtie2 or STAR).
  
  Running Ubuntu 14.04 LTS on AWS EC2 instances (m4.* and c4.* family
  classes). After the error occurs, the server stays accessible through
  SSH, but the commands w, htop, ps (and maybe others) seem to hang, while
  commands like ls, cd, top and others keep working. Whatever process was
  running and (probably) caused the crash seems to go into a sleeping
  mode.
  
  Rebooting (sudo reboot) makes the instance refuse all connections (more
  than an hour after initiating the reboot). Stopping the (AWS EC2)
  instance and starting again makes the instance function normally again.
  
  Restarting the task that was running when the instance crashed on the newly (re)started instance usually works with no more problems.
- --- 
+ ---
  AlsaDevices:
-  total 0
-  crw-rw---- 1 root audio 116,  1 Jan 23 12:49 seq
-  crw-rw---- 1 root audio 116, 33 Jan 23 12:49 timer
+  total 0
+  crw-rw---- 1 root audio 116,  1 Jan 23 12:49 seq
+  crw-rw---- 1 root audio 116, 33 Jan 23 12:49 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3.29
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: Error: [Errno 2] No such file or directory
  DistroRelease: Ubuntu 14.04
  Ec2AMI: ami-4473183b
  Ec2AMIManifest: (unknown)
  Ec2AvailabilityZone: us-east-1c
  Ec2InstanceType: m4.16xlarge
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  IwConfig: Error: [Errno 2] No such file or directory
  Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
  MachineType: Xen HVM domU
  Package: linux (not installed)
  PciMultimedia:
-  
+ 
  ProcEnviron:
-  TERM=xterm
-  PATH=(custom, no user)
-  XDG_RUNTIME_DIR=<set>
-  LANG=en_US.UTF-8
-  SHELL=/bin/bash
+  TERM=xterm
+  PATH=(custom, no user)
+  XDG_RUNTIME_DIR=<set>
+  LANG=en_US.UTF-8
+  SHELL=/bin/bash
  ProcFB:
-  
+ 
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-164-generic root=UUID=d4f2aafc-946a-4514-930d-4c45e676f198 ro console=tty1 console=ttyS0
  ProcVersionSignature: Ubuntu 3.13.0-164.214-generic 3.13.11-ckt39
  RelatedPackageVersions:
-  linux-restricted-modules-3.13.0-164-generic N/A
-  linux-backports-modules-3.13.0-164-generic  N/A
-  linux-firmware                              N/A
+  linux-restricted-modules-3.13.0-164-generic N/A
+  linux-backports-modules-3.13.0-164-generic  N/A
+  linux-firmware                              N/A
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  trusty ec2-images
  Uname: Linux 3.13.0-164-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: sudo
  WifiSyslog:
-  
+ 
  _MarkForUpload: True
  dmi.bios.date: 08/24/2006
  dmi.bios.vendor: Xen
  dmi.bios.version: 4.2.amazon
  dmi.chassis.type: 1
  dmi.chassis.vendor: Xen
  dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd08/24/2006:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
  dmi.product.name: HVM domU
  dmi.product.version: 4.2.amazon
  dmi.sys.vendor: Xen


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1813018

Title:
  Kernel Oops - unable to handle kernel paging request; RIP is at
  wait_migrate_huge_page+0x51/0x70

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Trusty:
  In Progress

Bug description:
  [Impact]

   * Users on NUMA systems (mostly servers) with
     NUMA balancing enabled (which is the default)
     might hit a crash/BUG() on a race condition
     if two simultaneous page faults of the same
     transparent hugepage go into the path for
     migration to another NUMA node.

   * The symptom is BUG() for 0xffffeaffffffffc0,
     which happens if the PMD is set to zero/NULL.

       BUG: unable to handle kernel paging request at ffffeaffffffffc0
       IP: [<ffffffff811b3d31>] wait_migrate_huge_page+0x51/0x70

   * NUMA balancing periodically unmaps pages so as to
     force page faults to occur, and later finds out
     from those page faults where the NUMA memory access
     is coming from - if it's often from another NUMA
     node, it attempts to migrate the page contents
     to that NUMA node (for more local access).

   * The race condition is related to these 3 functions
     in the pagefault handling of transparent hugepages:

     do_huge_pmd_numa_page() -> wait_migrate_huge_page()
     do_huge_pmd_numa_page() -> migrate_misplaced_transhuge_page()

     The first task to hit the pagefault / migration path
     calls migrate_misplaced_transhuge_page(), which does:

     - ptl = pmd_lock(mm, pmd)     // calls spin_lock(ptl)
     - pmdp_clear_flush(..., pmd); // set PMD to zero
     - set_pmd_at(..., pmd, ...);  // set PMD to non-zero
     - spin_unlock(ptl);

     The second task to hit that path finds that the page
     is already being migrated (page is locked) and waits
     for that to finish (i.e., until page is unlocked), doing:

     - spin_unlock(ptl)
     - page = pmd_page(*pmd)
     - wait_on_page_locked(page)

     *BUT* it reads the PMD value *after* releasing the PMD lock.

      So, if the tasks/CPUs manage to run in the sequence
      below the PMD can be set to zero/NULL by first task
      and read by second task before it's set to non-NULL.
      Thus the second task miscalculates the page pointer
      from the PMD and hits BUG for address 0xffffeaffffffffc0.

      Task 1 / CPU 1                            Task 2 / CPU 2

      do_huge_pmd_numa_page()                   do_huge_pmd_numa_page()
      - pmd_lock()                                .  
      - trylock_page() // PageLocked = true       .  
        .                                         .  
      - spin_unlock()                             .  
        .                                       - pmd_lock()
        .                                       - pmd_trans_migrating() // PageLocked == true
        .                                       - spin_unlock()
      - migrate_misplaced_transhuge_page()        .
        - pmd_lock()                              .
        - pmdp_clear_flush() // PMD = NULL        .
          .                                     - wait_migrate_huge_page()
          .                                       - page = pmd_page() // PMD == NULL ... page = <bogus>
          .                                       - wait_on_page_locked(page) // BUG()
          .                                         < pagefault handler in bad state >
          .                                         < that userspace process is hung >
        - set_pmd_at() // PMD = non-NULL
        - spin_unlock()

   * The fix just moves pmd_page() before spin_unlock(),
     and now the change performed in the other function
     (done within the spin_lock()/spin_unlock() region)
     can no longer run concurrently with this PMD read.

   * So, when the other function releases the spin lock,
     the PMD has already been set to non-NULL/valid PMD,
     and wait_on_page_locked() receives a valid address.

   * Fix commit 5d833062139d ("mm: numa: do not dereference
     pmd outside of the lock during NUMA hinting fault") [1]

   * Applied in v4.0 upstream; only Trusty/3.13 needs it.
     
     $ git describe --contains 5d833062139d
     v4.0-rc1~98^2~103
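
The interleaving and the fix described in the bullets above can be replayed in a small, single-threaded userspace model. This is only a sketch, not kernel code: the struct, the lock-owner field, the VALID_PMD value and the replay_*() helpers are all hypothetical names made up for illustration. It shows that reading the PMD after the unlock can observe the transient zero value, while reading it before the unlock cannot, because the first task cannot enter its critical section while the lock is held.

```c
#include <stdbool.h>
#include <stdint.h>

#define VALID_PMD 0x1000u /* arbitrary non-zero stand-in for a valid PMD */

/* Toy model (hypothetical): one PMD slot plus the owner of its lock. */
struct model {
    uint32_t pmd;
    int owner; /* 0 = lock free, otherwise the id of the holding task */
};

static bool try_lock(struct model *m, int task)
{
    if (m->owner != 0)
        return false; /* lock already held by someone else */
    m->owner = task;
    return true;
}

static void unlock(struct model *m)
{
    m->owner = 0;
}

/* Old order in task 2: unlock first, read the PMD afterwards. */
static uint32_t replay_buggy(void)
{
    struct model m = { VALID_PMD, 0 };
    uint32_t seen;

    try_lock(&m, 2);   /* task 2: pmd_lock(), sees the page is locked */
    unlock(&m);        /* task 2: spin_unlock() */
    try_lock(&m, 1);   /* task 1: pmd_lock() in the migration path */
    m.pmd = 0;         /* task 1: pmdp_clear_flush() */
    seen = m.pmd;      /* task 2: unlocked read lands in the window */
    m.pmd = VALID_PMD; /* task 1: set_pmd_at() */
    unlock(&m);
    return seen;
}

/* Fixed order in task 2: read the PMD while still holding the lock. */
static uint32_t replay_fixed(void)
{
    struct model m = { VALID_PMD, 0 };
    uint32_t seen;

    try_lock(&m, 2);   /* task 2: pmd_lock() */
    if (try_lock(&m, 1)) /* task 1 cannot get in while task 2 holds it */
        m.pmd = 0;
    seen = m.pmd;      /* task 2: read under the lock */
    unlock(&m);        /* task 2: spin_unlock() */
    try_lock(&m, 1);   /* task 1 runs its whole critical section now */
    m.pmd = 0;
    m.pmd = VALID_PMD;
    unlock(&m);
    return seen;
}
```

In this model replay_buggy() returns 0 (the transient value) and replay_fixed() returns VALID_PMD. The model replays the bad schedule in a fixed order rather than racing real threads, so it shows the ordering argument only; it says nothing about how often the window is hit in practice.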

  <PENDING>
      
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d833062139d
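
The description gives the faulting address 0xffffeaffffffffc0 but not the arithmetic that produces it from a zero PMD. The sketch below is one plausible reconstruction, and every constant in it is an assumption that the report does not state: an x86_64 vmemmap base of 0xffffea0000000000, a 64-byte struct page, a PFN field in bits 12..45, and non-present entries being bit-inverted before the PFN is extracted (as the L1TF mitigation does). The function name is hypothetical and merely stands in for pmd_page().

```c
#include <stdint.h>

/* All constants are assumptions for illustration, not taken from the report. */
#define VMEMMAP_START    0xffffea0000000000ULL /* assumed base of the struct page array */
#define STRUCT_PAGE_SIZE 64ULL                 /* assumed sizeof(struct page) */
#define PAGE_SHIFT       12
#define PFN_FIELD_MASK   0x00003ffffffff000ULL /* assumed PFN bits 12..45 */
#define PAGE_PRESENT     0x1ULL

/* Hypothetical stand-in for pmd_page(): PMD value -> struct page address. */
static uint64_t sketch_pmd_to_page_addr(uint64_t pmd_val)
{
    uint64_t pfn;

    /* Assumption: a non-present entry is inverted before use. */
    if (!(pmd_val & PAGE_PRESENT))
        pmd_val = ~pmd_val;

    pfn = (pmd_val & PFN_FIELD_MASK) >> PAGE_SHIFT;
    return VMEMMAP_START + pfn * STRUCT_PAGE_SIZE;
}
```

Under these assumptions, a zero PMD becomes all ones after the inversion, the PFN is 0x3ffffffff, and the resulting address is 0xffffea0000000000 + 0x3ffffffff * 64 = 0xffffeaffffffffc0, which matches the address in the oops. A present entry with PFN 0x1234 maps to 0xffffea0000048d00 instead.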

  
  Kernel oops occurs randomly every now and then, seemingly when running
  memory-intensive processes (so far, it happened to me when using
  bowtie2 or STAR).

  Running Ubuntu 14.04 LTS on AWS EC2 instances (m4.* and c4.* family
  classes). After the error occurs, the server stays accessible through
  SSH, but the commands w, htop, ps (and maybe others) seem to hang,
  while commands like ls, cd, top and others keep working. Whatever
  process was running and (probably) caused the crash seems to go into a
  sleeping mode.

  Rebooting (sudo reboot) makes the instance refuse all connections
  (more than an hour after initiating the reboot). Stopping the (AWS
  EC2) instance and starting again makes the instance function normally
  again.

  Restarting the task that was running when the instance crashed on the
  newly (re)started instance usually works with no more problems.
  ---
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Jan 23 12:49 seq
   crw-rw---- 1 root audio 116, 33 Jan 23 12:49 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3.29
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: Error: [Errno 2] No such file or directory
  DistroRelease: Ubuntu 14.04
  Ec2AMI: ami-4473183b
  Ec2AMIManifest: (unknown)
  Ec2AvailabilityZone: us-east-1c
  Ec2InstanceType: m4.16xlarge
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  IwConfig: Error: [Errno 2] No such file or directory
  Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
  MachineType: Xen HVM domU
  Package: linux (not installed)
  PciMultimedia:

  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB:

  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-164-generic root=UUID=d4f2aafc-946a-4514-930d-4c45e676f198 ro console=tty1 console=ttyS0
  ProcVersionSignature: Ubuntu 3.13.0-164.214-generic 3.13.11-ckt39
  RelatedPackageVersions:
   linux-restricted-modules-3.13.0-164-generic N/A
   linux-backports-modules-3.13.0-164-generic  N/A
   linux-firmware                              N/A
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  trusty ec2-images
  Uname: Linux 3.13.0-164-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: sudo
  WifiSyslog:

  _MarkForUpload: True
  dmi.bios.date: 08/24/2006
  dmi.bios.vendor: Xen
  dmi.bios.version: 4.2.amazon
  dmi.chassis.type: 1
  dmi.chassis.vendor: Xen
  dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd08/24/2006:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
  dmi.product.name: HVM domU
  dmi.product.version: 4.2.amazon
  dmi.sys.vendor: Xen

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1813018/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
