** Description changed:

- This issue is a continuation of
- https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515
+ [ Impact ]
  
- On Ubuntu 24.04 lts we did upgrade Ceph to 19.2.0-0ubuntu0.24.04.1
- 
- Previous release is : 19.2.0~git20240301.4c76c50-0ubuntu6
- 
- whenever upgrading (tested on 2 different clusters) the ceph-mon ends
- up crashing repeatedly with the below stack error
+ - In Ubuntu 24.04 LTS, upgrading Ceph from `19.2.0~git20240301.4c76c50-0ubuntu6` to `19.2.0-0ubuntu0.24.04.1` may cause the `ceph-mon` daemon to crash when it encounters a mismatched serialization order (see stack trace below).
+ - This can lead to monitor outages, preventing the entire Ceph cluster from reaching a healthy state.
+ - Mitigation is only possible by downgrading to the previous version.
+ - Monitor stacktrace:
  
  ```
  ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
   1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
   2: pthread_kill()
   3: gsignal()
   4: abort()
   5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
   6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
   7: (std::unexpected()+0) [0x7884096a5a55]
   8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
   9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
   10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
   11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
   12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
   13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
   14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
   15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
   16: main()
   17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
   18: __libc_start_main()
   19: _start()
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- ```
+ - The root cause of this bug is that the on-wire representation changed between the git snapshot 19.2.0~git20240301.4c76c50-0ubuntu6 and the squid release 19.2.0-0ubuntu0.24.04.1: the `bal_rank_mask` and `max_xattr_size` fields were swapped.
+ - The proposed fix tries to detect the on-wire representation and adapt the decode order accordingly.
  
- mitigation:
- a rollback to the previous release 19.2.0~git20240301.4c76c50-0ubuntu6 is still possible to restore service
+ 
+ [ Test Plan ]
+ 
+ To validate that the proposed fix addresses the crash and introduces no
+ regressions, perform the following steps:
+ 
+ 1. Setup
+ 
+ - Deploy a Juju model and add seven Ubuntu 24.04 VMs.
+ - On each VM, add apt pinning to pin Ceph to the snapshot version.
+ - Deploy 3x ceph-mon, 3x ceph-osd and 1x ceph-fs units to these machines.
+ - Configure a mount point, mount cephfs via ceph-fuse and write some test data.
+ 
+ 
+ 2. Baseline Check
+ - Verify the cluster is healthy (`ceph -s` shows `HEALTH_OK` or similar).
+ - Verify that the installed Ceph packages are the snapshot packages.
+ 
+ 3. Upgrade Ceph packages
+ - Remove the apt pin.
+ - Upgrade Ceph packages to the fixed version (which includes the new decode logic).
+ 
+ 4. Verification
+ 
+ - Verify package versions.
+ - Restart the Ceph services, including the MONs, and verify that they start correctly.
+ - Verify that cephfs can be mounted and that data can be read and written.
+ 
+ 
+ [ Where problems could occur ]
+ 
+ The fix changes how the `bal_rank_mask` and `max_xattr_size` fields are
+ decoded, based on on-wire detection. The patch assumes that the maximum
+ xattr size is less than 64GiB; larger extended attrs are highly
+ unlikely. An xattr size larger than 64GiB would probably cause the
+ protocol to be mis-decoded and could result in a crash.
+ 
+ The Reef release should have the same field order as the snapshot. Older
+ releases should not be affected.
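
To illustrate why a 64GiB bound can distinguish the two field orders, here is a minimal Python sketch of this kind of detection. It is not the actual patch: the function names and the two layout labels are hypothetical, and it assumes that a string is encoded as a little-endian 32-bit length followed by its bytes and that `max_xattr_size` is a little-endian 64-bit integer. It deliberately does not say which layout belongs to which package version.

```python
import struct

SIXTY_FOUR_GIB = 64 * 1024**3  # bound named in the description (2**36)


def encode_string(value: str) -> bytes:
    """Assumed string encoding: u32 little-endian length, then the bytes."""
    data = value.encode("ascii")
    return struct.pack("<I", len(data)) + data


def encode_u64(value: int) -> bytes:
    """Assumed integer encoding: u64 little-endian."""
    return struct.pack("<Q", value)


def decode_pair(buf: bytes):
    """Hypothetical helper: return (bal_rank_mask, max_xattr_size, layout).

    Peek at the first 8 bytes as a u64. If the string comes first, bytes
    4..7 hold the first characters of a non-empty mask; a printable ASCII
    character (>= 0x20) in byte 4 already makes the peeked value at least
    2**37, so it cannot be mistaken for an xattr size below 64GiB.
    """
    if len(buf) < 8:
        raise ValueError("buffer too short to peek 8 bytes")
    (peek,) = struct.unpack_from("<Q", buf, 0)
    if peek < SIXTY_FOUR_GIB:
        # Plausible xattr size: treat the u64 as the first field.
        (length,) = struct.unpack_from("<I", buf, 8)
        mask = buf[12:12 + length].decode("ascii")
        return mask, peek, "u64-first"
    # Otherwise treat the string as the first field.
    (length,) = struct.unpack_from("<I", buf, 0)
    mask = buf[4:4 + length].decode("ascii")
    (size,) = struct.unpack_from("<Q", buf, 4 + length)
    return mask, size, "string-first"
```

Calling `decode_pair(encode_u64(65536) + encode_string("0x3"))` and `decode_pair(encode_string("0x3") + encode_u64(65536))` recovers the same mask and size from either order. The sketch also shows the failure mode described above: an xattr size of 64GiB or more in the u64-first layout would be taken for a string length and mis-decoded.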
+ 
+ [ Other Info / Original Description ]
+ 
+ Also see bug https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515
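
The baseline and verification checks in the test plan could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the choice of `ceph-mon` as the package to inspect is an assumption, and treating `~git` in the version string as the marker for the snapshot build is inferred from the two version strings quoted in this bug rather than from any packaging rule.

```python
import subprocess


def installed_version(package: str = "ceph-mon") -> str:
    """Ask dpkg for the installed version of a package."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def cluster_health() -> str:
    """Return the raw output of `ceph health`."""
    result = subprocess.run(
        ["ceph", "health"], check=True, capture_output=True, text=True,
    )
    return result.stdout


def is_snapshot_build(version: str) -> bool:
    """Assumption: snapshot builds carry a `~git` marker in the version."""
    return "~git" in version


def cluster_is_healthy(health_output: str) -> bool:
    """True only when the health summary starts with HEALTH_OK."""
    return health_output.strip().startswith("HEALTH_OK")
```

Before the upgrade, `is_snapshot_build(installed_version())` should be true on every unit; after the upgrade and service restart it should be false, with `cluster_is_healthy(cluster_health())` true in both cases.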
-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2089565

Title:
  MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs