Hi Eric,

I found the code below (suballoc.c lines 2410-2438) at archive.ubuntu.com/ubuntu/pool/main/l/linux/fs/ocfs2/suballoc.c:
2410        if (status < 0) {
2411                mlog_errno(status);
2412                goto bail;
2413        }
2414
2415        if (undo_fn) {
2416                jbd_lock_bh_state(group_bh);
2417                undo_bg = (struct ocfs2_group_desc *)
2418                                        bh2jh(group_bh)->b_committed_data;
2419                BUG_ON(!undo_bg);
2420        }
2421
2422        tmp = num_bits;
2423        while(tmp--) {
2424                ocfs2_clear_bit((bit_off + tmp),
2425                                (unsigned long *) bg->bg_bitmap);
2426                if (undo_fn)
2427                        undo_fn(bit_off + tmp,
2428                                (unsigned long *) undo_bg->bg_bitmap);
2429        }
2430        le16_add_cpu(&bg->bg_free_bits_count, num_bits);
2431        if (le16_to_cpu(bg->bg_free_bits_count) > le16_to_cpu(bg->bg_bits)) {
2432                ocfs2_error(alloc_inode->i_sb, "Group descriptor # %llu has bit"
2433                            " count %u but claims %u are freed. num_bits %d",
2434                            (unsigned long long)le64_to_cpu(bg->bg_blkno),
2435                            le16_to_cpu(bg->bg_bits),
2436                            le16_to_cpu(bg->bg_free_bits_count), num_bits);
2437                return -EROFS;
2438        }
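In case it helps frame the question: as I read the block above, when an undo_fn is supplied the same bit positions are also passed to undo_fn against the journal's committed copy of the group bitmap (bh2jh(group_bh)->b_committed_data), and line 2419 BUG()s if that committed copy is missing. Below is a minimal userspace sketch of that pattern; it is NOT the kernel code, all names (demo_group, demo_clear_bits, ...) are made up for illustration, and it skips the little-endian handling the kernel does with le16_add_cpu()/le16_to_cpu().

/*
 * Minimal userspace sketch (not the kernel code) of the pattern around
 * suballoc.c:2419.  All names here are invented for illustration only.
 */
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_BITS 64

struct demo_group {
        unsigned int  free_bits_count;  /* plays the role of bg_free_bits_count */
        unsigned int  bits;             /* plays the role of bg_bits            */
        unsigned long bitmap[DEMO_BITS / (sizeof(unsigned long) * CHAR_BIT) + 1];
};

static void demo_clear_bit(unsigned int bit, unsigned long *bitmap)
{
        bitmap[bit / (sizeof(unsigned long) * CHAR_BIT)] &=
                ~(1UL << (bit % (sizeof(unsigned long) * CHAR_BIT)));
}

/*
 * Clear num_bits bits starting at bit_off in the live bitmap.  If undo_fn
 * is non-NULL, the caller also wants the same positions updated in an
 * "undo" copy of the bitmap (in the kernel that copy lives in the journal
 * head's b_committed_data); if the copy is missing we cannot keep the undo
 * state consistent, so the kernel BUG()s where this sketch asserts.
 */
static int demo_clear_bits(struct demo_group *bg, struct demo_group *undo_bg,
                           unsigned int bit_off, unsigned int num_bits,
                           void (*undo_fn)(unsigned int bit, unsigned long *bitmap))
{
        unsigned int tmp = num_bits;

        if (undo_fn)
                assert(undo_bg != NULL);   /* analog of BUG_ON(!undo_bg) at :2419 */

        while (tmp--) {
                demo_clear_bit(bit_off + tmp, bg->bitmap);
                if (undo_fn)
                        undo_fn(bit_off + tmp, undo_bg->bitmap);
        }

        bg->free_bits_count += num_bits;
        if (bg->free_bits_count > bg->bits) {
                /* analog of the ocfs2_error()/-EROFS path at :2431-2437 */
                fprintf(stderr, "group claims %u free of %u bits\n",
                        bg->free_bits_count, bg->bits);
                return -1;
        }
        return 0;
}

int main(void)
{
        struct demo_group bg = { .free_bits_count = 10, .bits = DEMO_BITS,
                                 .bitmap = { ~0UL } };
        struct demo_group undo = bg;

        return demo_clear_bits(&bg, &undo, 3, 4, demo_clear_bit) ? EXIT_FAILURE
                                                                 : EXIT_SUCCESS;
}

Compiled as plain C this runs cleanly; if demo_clear_bits() were called with a non-NULL undo_fn but a NULL undo bitmap, the assert would trip, which is the situation the BUG_ON() at 2419 guards against in the kernel.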
On Wed, Sep 14, 2016 at 10:13 AM, Eric Ren <z...@suse.com> wrote:
> Hi,
>
> On 09/14/2016 02:30 PM, Ishmael Tsoaela wrote:
>>
>> Hi Eric,
>>
>> Could you paste the code context around this line?
>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
>> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>
> This message is very important because it shows exactly which line of the
> source code directly results in this BUG() output. What I want you to do is
> paste the code around line 2419 of suballoc.c, so that I can locate the
> BUG() locally, because line 2419 differs between code versions.
>>
>> Apologies, but I tried to understand this and failed.
>>
>> root@nodeB:~# echo w > /proc/sysrq-trigger
>> root@nodeB:~#
>>
>> The node rebooted and the mount points are accessible from all 3 nodes. I am
>> not sure why, but it seems it will be difficult to figure out what went
>> wrong with ocfs2 without proper knowledge, so let me not waste any of your
>> time. Let me figure out `crash` [1][2] or gdb; hopefully when it happens
>> next time I will have a much better understanding.
>
> OK, good luck!
>
> Eric
>>
>> On Tue, Sep 13, 2016 at 11:44 AM, Eric Ren <z...@suse.com> wrote:
>>>
>>> On 09/13/2016 05:01 PM, Ishmael Tsoaela wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> Sorry, here are the other 2 syslogs if you need them, and the debug output.
>>>
>>> According to the logs, nodeB should be the first one that got the problem.
>>>
>>> Could you paste the code context around this line?
>>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
>>> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>>>
>>>> The request in the snip attached just hangs.
>>>
>>> NodeB should have taken this exclusive cluster lock, so any commands trying
>>> to access that file will hang.
>>>
>>> Could you provide the output of `echo w > /proc/sysrq-trigger`? An OCFS2
>>> issue is not easy to debug if the developer cannot reproduce it locally,
>>> and this is the case. BTW, you can narrow it down with `crash` [1][2] or
>>> gdb if you have some knowledge of kernel internals.
>>>
>>> [1] http://www.dedoimedo.com/computers/crash-analyze.html
>>> [2] https://people.redhat.com/anderson/crash_whitepaper/
>>>
>>> Eric
>>>>
>>>> On Tue, Sep 13, 2016 at 10:37 AM, Ishmael Tsoaela <ishmae...@gmail.com> wrote:
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> 1. The disk is a shared ceph rbd device:
>>>>>
>>>>> #rbd showmapped
>>>>> id pool     image          snap device
>>>>> 1  vmimages block_vmimages -    /dev/rbd1
>>>>>
>>>>> 2. ocfs2 has been working well for 2 months now, with a reboot 12 days ago.
>>>>>
>>>>> 3. All 3 ceph nodes have the rbd image mapped and ocfs2 mounted.
>>>>>
>>>>> Commands used:
>>>>>
>>>>> #sudo rbd map block_vmimages --pool vmimages --name
>>>>>
>>>>> #sudo mount /dev/rbd/vmimages/block_vmimages /mnt/vmimages/
>>>>> /dev/rbd1
>>>>>
>>>>> 4.
>>>>> root@nodeC:~# sudo debugfs.ocfs2 -R stats /dev/rbd1
>>>>> Revision: 0.90
>>>>> Mount Count: 0   Max Mount Count: 20
>>>>> State: 0   Errors: 0
>>>>> Check Interval: 0   Last Check: Tue Aug 2 15:41:12 2016
>>>>> Creator OS: 0
>>>>> Feature Compat: 3 backup-super strict-journal-super
>>>>> Feature Incompat: 592 sparse inline-data xattr
>>>>> Tunefs Incomplete: 0
>>>>> Feature RO compat: 1 unwritten
>>>>> Root Blknum: 5   System Dir Blknum: 6
>>>>> First Cluster Group Blknum: 3
>>>>> Block Size Bits: 12   Cluster Size Bits: 12
>>>>> Max Node Slots: 16
>>>>> Extended Attributes Inline Size: 256
>>>>> Label:
>>>>> UUID: 238F878003E7455FA5B01CC884D1047F
>>>>> Hash: 919897149 (0x36d4843d)
>>>>> DX Seed[0]: 0x00000000
>>>>> DX Seed[1]: 0x00000000
>>>>> DX Seed[2]: 0x00000000
>>>>> Cluster stack: classic o2cb
>>>>> Inode: 2   Mode: 00   Generation: 1754092981 (0x688d55b5)
>>>>> FS Generation: 1754092981 (0x688d55b5)
>>>>> CRC32: 00000000   ECC: 0000
>>>>> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
>>>>> Dynamic Features: (0x0)
>>>>> User: 0 (root)   Group: 0 (root)   Size: 0
>>>>> Links: 0   Clusters: 640000000
>>>>> ctime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>>>> atime: 0x0 -- Thu Jan 1 02:00:00 1970
>>>>> mtime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>>>> dtime: 0x0 -- Thu Jan 1 02:00:00 1970
>>>>> ctime_nsec: 0x00000000 -- 0
>>>>> atime_nsec: 0x00000000 -- 0
>>>>> mtime_nsec: 0x00000000 -- 0
>>>>> Refcount Block: 0
>>>>> Last Extblk: 0   Orphan Slot: 0
>>>>> Sub Alloc Slot: Global   Sub Alloc Bit: 65535
>>>>>
>>>>> Thanks for the assistance.
>>>>>
>>>>> On Tue, Sep 13, 2016 at 10:23 AM, Eric Ren <z...@suse.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 09/13/2016 03:16 PM, Ishmael Tsoaela wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have an ocfs2 mount point on 3 ceph cluster nodes, and suddenly I
>>>>>>> cannot read from or write to the mount point although the cluster is
>>>>>>> clean and showing no errors.
>>>>>>
>>>>>> 1. What is your ocfs2 shared disk? I mean, is it a shared disk exported
>>>>>>    by an iscsi target, or a ceph rbd device?
>>>>>> 2. Did you check whether ocfs2 worked well before any read/write? And how?
>>>>>> 3. Could you elaborate in more detail on how the ceph nodes use ocfs2?
>>>>>> 4. Please provide the output of:
>>>>>>    #sudo debugfs.ocfs2 -R stats /dev/sda
>>>>>>>
>>>>>>> Are there any other logs I can check?
>>>>>>
>>>>>> All log messages should go to /var/log/messages; could you attach the
>>>>>> whole log file?
>>>>>>
>>>>>> Eric
>>>>>>>
>>>>>>> There are some logs in kern.log about:
>>>>>>>
>>>>>>> kern.log
>>>>>>>
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.345504] invalid opcode: 0000 [#1] SMP
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.370081] Modules linked in: vhost_net vhost macvtap macvlan ocfs2 quota_tree rbd libceph ipmi_si mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables dell_rbu ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc binfmt_misc ipmi_devintf kvm_amd dcdbas kvm input_leds joydev amd64_edac_mod crct10dif_pclmul edac_core shpchp i2c_piix4 fam15h_power crc32_pclmul edac_mce_amd ipmi_ssif k10temp aesni_intel aes_x86_64 lrw gf128mul 8250_fintek glue_helper acpi_power_meter mac_hid serio_raw ablk_helper cryptd ipmi_msghandler xfs libcrc32c lp parport ixgbe dca hid_generic uas usbhid vxlan usb_storage ip6_udp_tunnel hid udp_tunnel ptp psmouse bnx2 pps_core megaraid_sas mdio [last unloaded: ipmi_si]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.898986] CPU: 10 PID: 65016 Comm: cp Not tainted 4.2.0-27-generic #32~14.04.1-Ubuntu
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.012469] Hardware name: Dell Inc. PowerEdge R515/0RMRF7, BIOS 2.0.2 10/22/2012
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.134659] task: ffff880a61dca940 ti: ffff88084a5ac000 task.ti: ffff88084a5ac000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.265260] RIP: 0010:[<ffffffffc062026b>]  [<ffffffffc062026b>] _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.406559] RSP: 0018:ffff88084a5af798  EFLAGS: 00010246
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.479958] RAX: 0000000000000000 RBX: ffff881acebcb000 RCX: ffff881fcd372e00
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.630768] RDX: ffff881fd0d4dc30 RSI: ffff88197e351bc8 RDI: ffff880fd127b2b0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.789688] RBP: ffff88084a5af818 R08: 0000000000000002 R09: 0000000000007e00
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.950053] R10: ffff880d39a21020 R11: ffff88084a5af550 R12: 00000000000000fa
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.113014] R13: 0000000000005ab1 R14: 0000000000000000 R15: ffff880fb2d43000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.276484] FS:  00007fcc68373840(0000) GS:ffff881fdde80000(0000) knlGS:0000000000000000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.440016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.521496] CR2: 00005647b2ee6d80 CR3: 0000000198b93000 CR4: 00000000000406e0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.681357] Stack:
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.758498]  0000000000000000 ffff880fd127b2e8 ffff881fc6655f08 00005bab00000000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.913655]  ffff881fd0c51d80 ffff88197e351bc8 ffff880fd127b330 ffff880e9eaa6000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.068609]  ffff88197e351bc8 ffffffff817ba6d6 0000000000000001 000000001ac592b1
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.223347] Call Trace:
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.298560]  [<ffffffff817ba6d6>] ? mutex_lock+0x16/0x37
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.374183]  [<ffffffffc0621bca>] _ocfs2_free_clusters+0xea/0x200 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.449628]  [<ffffffffc061ecb0>] ? ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.523971]  [<ffffffffc061ecb0>] ? ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.595803]  [<ffffffffc06234e5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.666614]  [<ffffffffc05d6037>] __ocfs2_flush_truncate_log+0x247/0x560 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.806017]  [<ffffffffc05d25a6>] ? ocfs2_num_free_extents+0x56/0x120 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.946141]  [<ffffffffc05db258>] ocfs2_remove_btree_range+0x4e8/0x760 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.086490]  [<ffffffffc05dc720>] ocfs2_commit_truncate+0x180/0x590 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.158189]  [<ffffffffc06022b0>] ? ocfs2_allocate_extend_trans+0x130/0x130 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.297235]  [<ffffffffc05f7e2c>] ocfs2_truncate_file+0x39c/0x610 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.368060]  [<ffffffffc05fe650>] ? ocfs2_read_inode_block+0x10/0x20 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.505117]  [<ffffffffc05fa2d7>] ocfs2_setattr+0x4b7/0xa50 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.574617]  [<ffffffffc064c4fd>] ? ocfs2_xattr_get+0x9d/0x130 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.643722]  [<ffffffff8120705e>] notify_change+0x1ae/0x380
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.712037]  [<ffffffff811e8436>] do_truncate+0x66/0xa0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.778685]  [<ffffffff811f8527>] path_openat+0x277/0x1330
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.845776]  [<ffffffffc05f2bed>] ? __ocfs2_cluster_unlock.isra.36+0x7d/0xb0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.977677]  [<ffffffff811fae8a>] do_filp_open+0x7a/0xd0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.043693]  [<ffffffff811f9f8f>] ? getname_flags+0x4f/0x1f0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.108385]  [<ffffffff81208006>] ? __alloc_fd+0x46/0x110
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.171504]  [<ffffffff811ea509>] do_sys_open+0x129/0x260
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.232889]  [<ffffffff811ea65e>] SyS_open+0x1e/0x20
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.294292]  [<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.356257] Code: 65 c0 48 c7 c6 e0 44 65 c0 41 b6 e2 48 8d 5d c8 48 8b 78 28 44 89 24 24 31 c0 49 c7 c4 e2 ff ff ff e8 9a 8d 01 00 e9 c4 fd ff ff <0f> 0b 0f 0b 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 89 cf b9 01
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.549534] RIP  [<ffffffffc062026b>] _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.681076]  RSP <ffff88084a5af798>
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.834529] ---[ end trace 5f4b84ac539ed56c ]---
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Ocfs2-users mailing list
>>>>>>> Ocfs2-users@oss.oracle.com
>>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users