Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Yan, Zheng
On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini wrote: > > Hi Zheng, > the cluster is running ceph mimic. This warning about the network only appears > when using nautilus' cephfs-journal-tool. > > "cephfs-data-scan scan_links" does not report any issue. > > How could the variable "newparent" be NULL at

[ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Hi all, has anyone succeeded in using the collectd/ceph plugin to collect ceph cluster data? I'm using collectd (5.8.1) and Ceph 15.0.0. collectd fails to get cluster data with the error below: "collectd.service holdoff time over, scheduling restart" Regards, Changcheng
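
For context, the "holdoff time over, scheduling restart" message is systemd restarting a crashed collectd rather than a Ceph error; a hedged debugging sketch (none of these commands come from the thread) is to read the real failure from the journal and test-parse the configuration by hand:

    # Show why collectd keeps dying, then parse the config without starting the daemon.
    journalctl -u collectd -b --no-pager | tail -n 50
    collectd -t -C /etc/collectd.conf    # -t tests the configuration and exits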

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Sun, Oct 20, 2019 at 1:53 PM Stefan Kooman wrote: > > Dear list, > > Quoting Stefan Kooman (ste...@bit.nl): > > > I wonder if this situation is more likely to be hit on Mimic 13.2.6 than > > on any other system. > > > > Any hints / help to prevent this from happening? > > We have had this happe

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
I am; I ran collectd with luminous and upgraded to nautilus and collectd 5.8.1-1.el7 this weekend. Maybe increase logging or so. I had to wait a long time before collectd supported the luminous release; maybe it is the same with octopus (=15?) -Original Message- From: Liu, Changch

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > of the crashed mds) Just to make sure I understand correctly. Current status is that the MDS is active (no standby for now) and not in a "crashed" state (although it has been crashin
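
For reference, Zheng's suggestion amounts to a single rados call; the pool name and rank below are placeholders (check `ceph fs ls` for the actual metadata pool name), and as the rest of the thread makes clear the MDS should be stopped first:

    # 'cephfs_metadata' and rank 0 are placeholders for the real metadata pool and the crashed rank.
    rados -p cephfs_metadata rm mds0_openfiles.0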

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 09:50 Mon 21 Oct, Marc Roos wrote: > > I am, collectd with luminous, and upgraded to nautilus and collectd > 5.8.1-1.el7 this weekend. Maybe increase logging or so. > I had to wait a long time before collectd was supporting the luminous > release, maybe it is the same with octopus (=15?) >

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
I have the same. I do not think ConvertSpecialMetricTypes is necessary. Globals true LongRunAvgLatency false ConvertSpecialMetricTypes true SocketPath "/var/run/ceph/ceph-osd.1.asok" -Original Message- Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] collectd
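
The archive appears to have stripped the angle-bracket blocks from the quoted snippet. A minimal sketch of a collectd ceph-plugin configuration of that shape, assuming a single OSD daemon (only the directive names and socket path are taken from the quote; the rest is an assumption):

    # /etc/collectd.d/ceph.conf (path is an assumption)
    LoadPlugin ceph
    <Plugin ceph>
      LongRunAvgLatency false
      ConvertSpecialMetricTypes true
      <Daemon "osd.1">
        SocketPath "/var/run/ceph/ceph-osd.1.asok"
      </Daemon>
    </Plugin>

Restart collectd after changing the file so the plugin re-reads the admin socket.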

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 10:16 Mon 21 Oct, Marc Roos wrote: > I have the same. I do not think ConvertSpecialMetricTypes is necessary. > > > Globals true > > > > LongRunAvgLatency false > ConvertSpecialMetricTypes true > > SocketPath "/var/run/ceph/ceph-osd.1.asok" > > Same configuration, but there

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > of the crashed mds) OK, MDS crashed again, restarted. I stopped it, deleted the object and restarted the MDS. It became active right away. Any idea on why the openfiles list (object

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
Does your collectd start OK without the ceph plugin? I also have your error "didn't register a configuration callback", because I configured debug logging but did not enable it by loading the plugin 'logfile'. Maybe it is the order in which your configuration files are read (I think this used to

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Are there any instructions for installing the plugin configuration? I've attached my RHEL collectd configuration file from the /etc/ directory. On RHEL: [rdma@rdmarhel0 collectd.d]$ pwd /etc/collectd.d [rdma@rdmarhel0 collectd.d]$ tree . . 0 directories, 0 files [rdma@rdmarhel0 collectd.d

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
The 'xx-.conf' files are mine, custom, so I do not have to merge changes with newer /etc/collectd.conf rpm updates. I would suggest getting a small configuration that works, setting debug logging[0], and growing the configuration in small steps until it fails. Load plugin ceph empty, confi

Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Kenneth Waegeman
I've made a ticket for this issue: https://tracker.ceph.com/issues/42338 Thanks again! K On 15/10/2019 18:00, Kenneth Waegeman wrote: Hi Robert, all, On 23/09/2019 17:37, Robert LeBlanc wrote: On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman wrote: Hi all, When syncing data with rsync,

Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Marc Roos
I think I am having this issue also (at least I had it with luminous). I had to remove the hidden temp files rsync had left when the cephfs mount 'stalled'; otherwise I would never be able to complete the rsync. -Original Message- Cc: ceph-users Subject: Re: [ceph-users] hanging slow req

Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Ilya Dryomov
On Sat, Oct 19, 2019 at 2:00 PM Lei Liu wrote: > > Hello llya, > > After updated client kernel version to 3.10.0-862 , ceph features shows: > > "client": { > "group": { > "features": "0x7010fb86aa42ada", > "release": "jewel", > "num": 5 > }, >

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 4:33 PM Stefan Kooman wrote: > > Quoting Yan, Zheng (uker...@gmail.com): > > > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > > of the crashed mds) > > OK, MDS crashed again, restarted. I stopped it, deleted the object and > restarted the MDS. It b

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > I double checked the code, but didn't find any clue. Can you compile > mds with a debug patch? Sure, I'll try to do my best to get a properly packaged Ceph Mimic 13.2.6 with the debug patch in it (and/or get help to get it built). Do you already have th

[ceph-users] ceph balancer do not start

2019-10-21 Thread Jan Peters
Hello, I use ceph 12.2.12 and would like to activate the ceph balancer. Unfortunately, no redistribution of the PGs is started: ceph balancer status { "active": true, "plans": [], "mode": "crush-compat" } ceph balancer eval current cluster score 0.023776 (lower is better) ceph conf
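
In case it helps anyone hitting the same symptom: even with the balancer "active", it can be useful to build and evaluate a plan by hand to see whether the optimizer actually finds anything to move. A sketch of the luminous-era sequence, with "myplan" as a placeholder name:

    ceph balancer eval                # current cluster score, as in the post
    ceph balancer optimize myplan     # prints an error if it cannot improve the distribution
    ceph balancer show myplan
    ceph balancer eval myplan         # predicted score if the plan were executed
    ceph balancer execute myplan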

Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Lei Liu
Hello Ilya and Paul, thanks for your reply. Yes, you are right, 0x7fddff8ee8cbffb comes from the kernel upgrade; it's reported by a docker container (digitalocean/ceph_exporter) used for ceph monitoring. Now upmap mode is enabled, client features: "client": { "group": { "featur
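
For anyone following along, the generic commands for checking feature gating and switching to upmap are sketched below (not taken from Lei's cluster); note that set-require-min-compat-client refuses to proceed while pre-luminous clients are still connected:

    ceph features                                      # feature/release summary per daemon and client group
    ceph osd set-require-min-compat-client luminous    # prerequisite for upmap
    ceph balancer mode upmap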

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 7:58 PM Stefan Kooman wrote: > > Quoting Yan, Zheng (uker...@gmail.com): > > > I double checked the code, but didn't find any clue. Can you compile > > mds with a debug patch? > > Sure, I'll try to do my best to get a properly packaged Ceph Mimic > 13.2.6 with the debug pat

[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
Hello, This Wednesday we'll have a ceph science user group call. This is an informal conversation focused on using ceph in htc/hpc and scientific research environments. Call details copied from the event: Wednesday October 23rd 14:00 UTC 4:00PM Central European 10:00AM Eastern American Main p

[ceph-users] Getting rid of prometheus messages in /var/log/messages

2019-10-21 Thread Vladimir Brik
Hello, /var/log/messages on machines in our ceph cluster is inundated with entries from Prometheus scraping ("GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.11.1"). Is it possible to configure ceph to not send those to syslog? If not, can I configure something so that none of ceph-mgr messages
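
The preview contains no Ceph-side answer; one syslog-side workaround is to drop the scrape lines before they reach /var/log/messages with an rsyslog filter. A sketch, where the file name and match string are assumptions based on the quoted log line:

    # /etc/rsyslog.d/30-drop-mgr-prometheus.conf (hypothetical file)
    :msg, contains, "GET /metrics HTTP/1.1" stop

Restart rsyslog afterwards for the filter to take effect.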

Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-21 Thread Salsa
Just to clarify my situation: we have 2 datacenters with 3 hosts each, and 12 4TB disks per host (2 are in a RAID with the OS installed and the remaining 10 are used for Ceph). Right now I'm trying a single-DC installation and intend to migrate to multi-site, mirroring DC1 to DC2, so if we lose DC1 we can
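
On the subject line itself: with a host failure domain and only 3 hosts per DC, CRUSH cannot place more than 3 shards, so profiles with k+m above the host count only work with a smaller failure domain (at a real cost in safety, since one host then holds several shards). A sketch with placeholder names:

    ceph osd erasure-code-profile set ec42-osd k=4 m=2 crush-failure-domain=osd
    ceph osd pool create ecpool 64 64 erasure ec42-osd    # 'ecpool' and the PG counts are examples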

[ceph-users] Nautilus - inconsistent PGs - stat mismatch

2019-10-21 Thread Andras Pataki
We have a new ceph Nautilus setup (Nautilus from scratch - not upgraded): # ceph versions { "mon": { "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3 }, "mgr": { "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nau
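
For a stat mismatch the usual triage (generic commands, not taken from the original mail) is to identify the inconsistent PG, list what the scrub found, and let a repair rebuild the stats:

    ceph health detail                                        # names the inconsistent PG(s)
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>                                     # follow up with a deep-scrub to confirm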

Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Gustavo Tonini
Is there a possibility to lose data if I use "cephfs-data-scan init --force-init"? On Mon, Oct 21, 2019 at 4:36 AM Yan, Zheng wrote: > On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini > wrote: > > > > Hi Zheng, > > the cluster is running ceph mimic. This warning about network only > appears when

[ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!

2019-10-21 Thread Void Star Nill
Apparently the graph is too big, so my last post is stuck. Resending without the graph. Thanks -- Forwarded message - From: Void Star Nill Date: Mon, Oct 21, 2019 at 4:41 PM Subject: large concurrent rbd operations block for over 15 mins! To: ceph-users Hello, I have been ru

[ceph-users] Decreasing the impact of reweighting osds

2019-10-21 Thread Mark Kirkwood
We recently needed to reweight a couple of OSDs on one of our clusters (luminous on Ubuntu, 8 hosts, 8 OSDs/host). I think we reweighted by approx 0.2. This was perhaps too much, as IO latency on RBD drives spiked to several seconds at times. We'd like to lessen this effect as much as we can
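
The knobs usually reached for here (luminous-era option names; the values are examples only) throttle backfill so client IO keeps priority, combined with reweighting in smaller increments:

    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_sleep 0.1'
    ceph osd reweight 12 0.95    # the osd id and the 0.05-sized step are placeholders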

[ceph-users] clust recovery stuck

2019-10-21 Thread Philipp Schwaha
Hi, I have a problem with a cluster being stuck in recovery after an osd failure. At first recovery was doing quite well, but now it just sits there without any progress. It currently looks like this: health HEALTH_ERR 36 pgs are stuck inactive for more than 300 seconds 5

[ceph-users] Replace ceph osd in a container

2019-10-21 Thread Alex Litvak
Hello cephers, I am having trouble with a new hardware system showing strange OSD behavior, and I want to replace a disk with a brand new one to test the theory. I run all daemons in containers, and on one of the nodes I have a mon, a mgr, and 6 osds. So, following https://docs.ceph.com/docs/maste
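
The generic (non-container) flow behind the docs page referenced above looks roughly like the sketch below; osd.5 and /dev/sdX are placeholders, and a containerized deployment wraps the ceph-volume step in its own tooling:

    ceph osd out osd.5
    ceph osd safe-to-destroy osd.5                  # repeat until it reports the OSD is safe to destroy
    ceph osd destroy osd.5 --yes-i-really-mean-it   # keeps the OSD id for reuse
    ceph-volume lvm create --osd-id 5 --data /dev/sdX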

Re: [ceph-users] clust recovery stuck

2019-10-21 Thread Eugen Block
Hi, can you share `ceph osd tree`? What crush rules are in use in your cluster? I assume that the two failed OSDs prevent the remapping because the rules can't be applied. Regards, Eugen Quoting Philipp Schwaha: hi, I have a problem with a cluster being stuck in recovery after osd
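
For anyone gathering the same details, the information Eugen asks for (plus a pool-to-rule mapping) comes from:

    ceph osd tree
    ceph osd crush rule dump
    ceph osd pool ls detail    # shows the crush rule, size and min_size of each pool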