Public bug reported:

How to reproduce:

On the initial installation, the Z cluster had 1 monitor node, 3 OSDs, 1 MDS and 1
MGR. In order to form a quorum, 2 more monitor nodes were added, both of which
were already OSDs.
The Z cluster then had 3 monitor nodes, of which 2 served as both OSDs and monitors.

However, at some point during the stress-ng run, the monitor daemon
crashed repeatedly on the cluster. The crashes stopped only after
removing both dual-role monitor/OSD nodes from the quorum, after which
the cluster remained stable.

Topology:

root@m8330013:~# ceph node ls all
{
    "mon": {
        "m8330013": [
            "m8330013"
        ],
        "m8330014": [
            "m8330014"
        ],
        "m8330015": [
            "m8330015"
        ]
    },
    "osd": {
        "m8330014": [
            0
        ],
        "m8330015": [
            1
        ],
        "m8330016": [
            2
        ]
    },
    "mds": {
        "m8330013": [
            "m8330013"
        ]
    },
    "mgr": {
        "m8330013": [
            "m8330013"
        ],
        "m8330015": [
            "m8330015"
        ]
    }
}
root@m8330013:~#

The job file below runs each filesystem stressor sequentially, one per
CPU, for 5 minutes, and then shows the cumulative user and system time
of all the processes at the end of the stress run.

Stress-ng job file:

run sequential
metrics
verbose
timeout 5m
times
timestamp

#0 means 1 stressor per CPU
access 0
bind-mount 0
chdir 0
chmod 0
chown 0
copy-file 0
dentry 0
dir 0
dirdeep 0
dnotify 0
dup 0
eventfd 0
fallocate 0
fanotify 0
fcntl 0
fiemap 0
file-ioctl 0
filename 0
flock 0
fstat 0
getdent 0
handle 0
inode-flags 0
inotify 0
io 0
iomix 0
ioprio 0
lease 0
link 0
locka 0
lockf 0
lockofd 0
mknod 0
open 0
procfs 0
rename 0
symlink 0
sync-file 0
utime 0
xattr 0

Command for Execution:

stress-ng --job <job_file> --temp-path <cephfs_mountpoint> --log-file
<log_file>

The fix for this issue landed upstream in PR:

https://github.com/ceph/ceph/pull/36697

which was backported to the Octopus (15.2.x) release in PR:

https://github.com/ceph/ceph/pull/36813


This backported patch seems to apply cleanly to ceph 15.2.3 in the
focal-updates git tree at:

https://git.launchpad.net/ubuntu/+source/ceph/log/?h=applied/ubuntu/focal-updates

Please apply the backported patch to this tree. Thanks.

Please be aware that upstream's backport PR
https://github.com/ceph/ceph/pull/36813 combines 2 patches from the
master branch:

https://github.com/ceph/ceph/pull/35920
https://github.com/ceph/ceph/pull/36697

both of which we need.

** Affects: ceph (Ubuntu)
     Importance: Undecided
     Assignee: Skipper Bug Screeners (skipper-screen-team)
         Status: New


** Tags: architecture-s39064 bugnameltc-188070 severity-high 
targetmilestone-inin2004


** Changed in: ubuntu
     Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Package changed: ubuntu => ceph (Ubuntu)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1900690

Title:
  [Ubuntu 20.04] ceph: messages,mds: Fix decoding of enum types on big-
  endian systems

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1900690/+subscriptions
