Hi Guillaume, thanks for your report! Do you have more details on how did you reproduce it? It was "organic", or can you trigger that using some workload or test pattern? Also, based on the log output that you pasted, seems it took about 30 hours to reproduce, correct? Let me know if possible the topology of the cifs volumes in your system (how many mount points you have, how many targets - and where are they -, etc).
Do you think you could run a debug kernel in order to collect more data? I'll take a look in the code and see if I can instrument the cifs code to narrow how this race might be happening, let me know if you are able to test that. Thanks, Guilherme -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-azure in Ubuntu. https://bugs.launchpad.net/bugs/1795659 Title: kernel panic using CIFS share in smb2_push_mandatory_locks() Status in linux package in Ubuntu: Confirmed Status in linux-azure package in Ubuntu: Fix Released Status in linux source package in Xenial: Invalid Status in linux-azure source package in Xenial: Fix Released Status in linux source package in Bionic: Fix Released Status in linux-azure source package in Bionic: Invalid Status in linux source package in Cosmic: Won't Fix Status in linux-azure source package in Cosmic: Invalid Status in linux source package in Disco: Fix Released Status in linux-azure source package in Disco: Invalid Bug description: [Impact] * We got reports of a kernel crash in cifs module with the following signature: BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 IP: smb2_push_mandatory_locks+0x10e/0x3b0 [cifs] PGD 0 P4D 0 RIP: 0010:smb2_push_mandatory_locks+0x10e/0x3b0 [cifs] Call Trace: cifs_oplock_break+0x12f/0x3d0 [cifs] process_one_work+0x14d/0x410 worker_thread+0x4b/0x460 kthread+0x105/0x140 [...] * Low-level analysis (decodecode script output and the objdump of the function) revealed that we are crashing in a NULL ptr dereference when trying to access "cfile->tlink"; below, a snippet of the objdump at function smb2_push_mandatory_locks(): [...] mov 0x10(%r14),%r15 # %r15 = cifsFileInfo *cfile mov 0x18(%r14),%rbx # %rbx = cifsLockInfo *li = (fdlocks->locks) lea 0x18(%r14),%r12 mov 0x90(%r15),%rax # %rax = struct tcon_link *tlink (cfile->tlink) cmp %r12,%rbx mov 0x38(%rax),%rax # <--- TRAP [trying to get cifs_tcon *tl_tcon] [...] * After discussing the issue with CIFS maintainers (Steve French and Pavel Shilovsky) they suggested commit b98749cac4a6 ("CIFS: keep FileInfo handle live during oplock break") [http://git.kernel.org/linus/b98749cac4a6] as a fix for multiple reports of this kind of crash. * The fix was sent to stable kernels and is present in Ubuntu kernels 5.0 and newer. We are requesting the SRU for this patch here in order to fix the crashes, after reports of successful testing with the patch (see below section) and since the patch is restricted to the cifs module scope and accepted on linux stable. * Alternatively the issue is known to be avoided when oplocks are disabled using "cifs.enable_oplocks=N" module parameter. [Test case] * Unfortunately we cannot reproduce the issue. The patch proposed here was validated by us with xfstests (instructions followed from https://wiki.samba.org/index.php/Xfstesting-cifs) and fio. Also, we have a user report of test validation using LISA (https://github.com/LIS/LISAv2). * Using xfstest with the exclusions proposed in the link above we managed to get the same results as a non-patched kernel, i.e., the same tests failed in both kernels, we didn't get worse results with the patch. Fio also didn't show noticeable performance regression with the patch. [Regression potential] * The patch was validated by the cifs filesystem maintainers (in fact they suggested its inclusion in Ubuntu) and by the aforementioned tests; also, the scope is restricted to cifs only so the likelihood of regressions is considered low. * Due to the nature of the code modification (add a new reference of a file handler and manipulate it in different places), I consider that if we have a regression it'll manifest as deadlock/blocked tasks, not something more serious like crashes or data corruption. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1795659/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp