This would work If all nodes have a single function only (mon, mgr, old). I tried everything to update the monitors first, but due to the dependencies between the Ceph packages the monitors and mgr daemons can not simply be updated separately from the OSDs What I don't get, though, is that once all three monitors and mgrs are updated the OSDs do not fall back in line after a reboot. I will try to force the install of ceph-base, ceph-common and mon/mgr and then force upgrade the OSDs to test whether that will work. If not at least a workflow should be considered that allows upgrade of hyper converged clusters, which are becoming more and more important for edge sites.
On Fri, Apr 24, 2020 at 5:50 PM Dan Hill <1874...@bugs.launchpad.net> wrote: > Eoan packages Nautilus, while Focal packages Octopus: > ceph | 14.2.2-0ubuntu3 | eoan > ceph | 14.2.4-0ubuntu0.19.10.2 | eoan-security > ceph | 14.2.8-0ubuntu0.19.10.1 | eoan-updates > ceph | 15.2.1-0ubuntu1 | focal > ceph | 15.2.1-0ubuntu2 | focal-proposed > > When upgrading your cluster, make sure to follow the Octopus upgrade > guidelines [0]. Specifically, the Mon and Mgr nodes must be upgraded and > their services restarted before upgrading OSD nodes. > > [0] https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from- > mimic-or-nautilus > <https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus> > > -- > You received this bug notification because you are subscribed to the bug > report. > https://bugs.launchpad.net/bugs/1874939 > > Title: > ceph-osd can't connect after upgrade to focal > > Status in ceph package in Ubuntu: > New > > Bug description: > Upon upgrading a Ceph node with do-release-upgrade from eoan to focal, > the OSD doesn't connect after the upgrade. I rolled back the change > (VBox snapshot) and tried again, same result. I also tried to hold > back the Ceph packages and upgrade after the fact, but again same > result. > > Epected behavior: OSD connects to cluster after upgrade. > > Actual behavior: OSD log shows endlessly repeated > 'tick_without_osd_lock' messages. OSD will stay down from perspective > of the cluster. > > Extract from debug log of OSD: > > 2020-04-24T16:25:35.811-0700 7fd70e83d700 5 osd.0 16499 heartbeat > osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data > 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta > 0x3fffd441), peers [] op hist []) > 2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 > check_full_status cur ratio 0.769796, physical ratio 0.769796, new state > none > 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick > 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > start > 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > finish > 2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick > last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next > 2020-04-25T15:54:43.601161-0700 > 2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 > tick_without_osd_lock > 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick > 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > start > 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > finish > 2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick > last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next > 2020-04-25T15:54:43.601161-0700 > 2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 > tick_without_osd_lock > 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick > 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > start > 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > finish > 2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick > last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next > 2020-04-25T15:54:43.601161-0700 > 2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 > queries > 2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 > tick_without_osd_lock > 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick > 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > start > 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- > finish > 2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick > last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next > 2020-04-25T15:54:43.601161-0700 > > This repeats over and over again. > > strace of the process yields lots of unfinished futex access attempts: > > [pid 2130] futex(0x55b17b8e216c, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79 > [pid 2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL > <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, > tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2163] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2134] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> > [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2200] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1 > [pid 2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2200] futex(0x7ffc4aa8b770, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, > tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2100] <... futex resumed> ) = 0 > [pid 2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL > <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2127] <... clock_nanosleep resumed> NULL) = 0 > [pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, > tv_nsec=256000000}, <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2163] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> > [pid 2134] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2127] <... clock_nanosleep resumed> NULL) = 0 > [pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, > tv_nsec=256000000}, <unfinished ...> > [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed > out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection > timed out) > [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 > [pid 2190] futex(0x55b17b775ab8, > FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, > tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...> > [pid 2163] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2134] <... io_getevents resumed> [], {tv_sec=0, > tv_nsec=250000000}) = 0 > [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> > [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 > detached > > > > Suspected cause: OSD can not connect to monitor. > > Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the > result was the same. > > Research done: I checked Launchpad and the Ceph bug tracker, couldn't > find something similar. Tried restart of process, reboot of node, > revert change and re-upgrade, hold Ceph packages and manually upgrade > after do-release-upgrade, strace of the process. > > Impact: Right now upgrade of Ceph to 20.04 LTS appears to be broken > > ProblemType: Bug > DistroRelease: Ubuntu 20.04 > Package: ceph-osd 15.2.1-0ubuntu1 > ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30 > Uname: Linux 5.4.0-26-generic x86_64 > ApportVersion: 2.20.11-0ubuntu27 > Architecture: amd64 > CasperMD5CheckResult: skip > Date: Fri Apr 24 16:30:53 2020 > ProcEnviron: > SHELL=/bin/bash > LANG=en_US.UTF-8 > TERM=xterm-256color > PATH=(custom, no user) > SourcePackage: ceph > UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago) > > To manage notifications about this bug go to: > https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions > -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1874939 Title: ceph-osd can't connect after upgrade to focal To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs