Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Jelle de Jong
Hello everybody, I think I fixed the issues after weeks of looking. Question 1: does anyone know how to prevent iptables, nftables or conntrack from being loaded in the first place? Adding them to /etc/modprobe.d/blacklist.local.conf does not seem to work. What is recommended? Question 2: what systemd
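
For context, a minimal sketch of the usual way to block such modules, assuming a Debian-style layout (the file and module names here are illustrative, not taken from the thread): a plain "blacklist" entry only stops alias-based autoloading, so an "install ... /bin/false" override is normally needed to make explicit loads fail as well.

    # /etc/modprobe.d/disable-netfilter.conf (illustrative file name)
    # "blacklist" only prevents autoloading by alias; these "install" lines
    # make an explicit "modprobe <module>" fail instead of loading it.
    install nf_conntrack /bin/false
    install ip_tables /bin/false
    install nf_tables /bin/false
    # then rebuild the initramfs so early boot honours it, e.g. on Debian:
    #   update-initramfs -u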

Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Jelle de Jong (jelledej...@powercraft.nl): > question 2: what systemd target can I use to run a service after all > ceph-osds are loaded? I tried ceph.target and ceph-osd.target; both do not work > reliably. ceph-osd.target works for us (every time). Have you enabled all the individual OSD ser
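
A minimal sketch of ordering a custom unit after the OSD target, assuming a hypothetical service and script name; only ceph-osd.target itself comes from the thread. Note that After=ceph-osd.target orders against the OSD services being started, not against the OSDs being up/in in the cluster, and the individual ceph-osd@<id>.service units must be enabled for the target to pull them in.

    # /etc/systemd/system/run-after-osds.service (hypothetical unit name)
    [Unit]
    Description=Run after the local ceph-osd services have started
    After=ceph-osd.target
    Wants=ceph-osd.target

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/post-osd-start.sh   # placeholder path

    [Install]
    WantedBy=multi-user.target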

Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Paul Emmerich (paul.emmer...@croit.io): > We've also seen some problems with FileStore on newer kernels; 4.9 is the > last kernel that worked reliably with FileStore in my experience. > > But I haven't seen problems with BlueStore related to the kernel version > (well, except for that scru

[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
Hi, after a node failure Ceph is unable to recover, i.e. unable to reintegrate the failed node back into the cluster. What happened? 1. A node with 11 OSDs crashed; the remaining 4 nodes (also with 11 OSDs each) re-balanced, although reporting the following error condition: too many PGs per OSD
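
For background, a hedged sketch of where that warning comes from: the monitors compare the average number of PG copies per OSD against a warning threshold, and losing 11 of 55 OSDs pushes the ratio up even though the data itself is unchanged. The figures and the jewel-era option name below are illustrative, not taken from this cluster.

    # PGs per OSD ~= (sum of pg_num over all pools * replica size) / number of up OSDs
    # e.g. 4096 PG copies / 44 OSDs ~= 93 per OSD (illustrative numbers)
    #
    # jewel-era warning threshold (raising it only hides the warning):
    # [mon]
    # mon_pg_warn_max_per_osd = 300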

[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
here is the output of ceph health detail: HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 134 pgs backfill_wait; 11 pgs backfilling; 69 pgs degraded; 14 pgs down; 2 pgs incomplete; 14 pgs peering; 6 pgs recovery_wait; 69 pgs stuck degraded; 16 pgs stuck inactive; 167 pgs stuck
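
A hedged sketch of the usual commands for digging into output like this (generic jewel-era CLI; the PG id is a placeholder):

    ceph pg dump_stuck inactive      # list the stuck-inactive PGs
    ceph pg dump_stuck unclean
    ceph pg 1.2f query               # placeholder PG id; shows why a PG is down or incomplete
    ceph osd tree                    # confirm which OSDs are down/out after the node failure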

Re: [ceph-users] Infiniband backend OSD communication

2020-01-07 Thread Nathan Stratton
Ok, so IPoIB is required... ><> nathan stratton On Mon, Jan 6, 2020 at 4:45 AM Wei Zhao wrote: > From my understanding, the basic idea is that Ceph exchanges RDMA > information (QP, GID and so on) through the IP address on the RDMA device, and the peers then > communicate with each other through RDMA. But in my tests,
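
For reference, a hedged sketch of the kind of ceph.conf settings involved with the async+rdma messenger (option names as documented upstream; the device name is an assumption for this host). The data path runs over RDMA, but peers are still addressed by IP, which is why IPoIB (or RoCE with ordinary IP addressing) remains necessary.

    [global]
    ms_type = async+rdma
    ms_async_rdma_device_name = mlx5_0   # assumption: set to the local HCA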

[ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-07 Thread Sean Matheny
We’re adding in a CRUSH hierarchy retrospectively in preparation for a big expansion. Previously we only had host and osd buckets, and now we’ve added in rack buckets. I’ve got sensible settings in place to limit rebalancing, at least settings that have worked in the past: osd_max_backfills = 1 osd_recovery_t
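
For readers following along, a hedged sketch of the usual commands for adding rack buckets and applying the backfill throttle at runtime; the bucket and host names are placeholders, not taken from the post. Moving a host under a new rack bucket is what triggers the remapping, so doing it host-by-host spreads the rebalance out over time.

    ceph osd crush add-bucket rack1 rack             # placeholder rack name
    ceph osd crush move rack1 root=default
    ceph osd crush move node01 rack=rack1            # moving a host re-maps its PGs
    ceph tell osd.* injectargs '--osd-max-backfills 1'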