Re: [ceph-users] hanging/stopped recovery/rebalance in Nautilus

2019-10-02 Thread Konstantin Shalygin
Hi, I have often observed that the recovery/rebalance in Nautilus starts quite fast but gets extremely slow (2-3 objects/s) even when there are around 20 OSDs involved. Right now I am moving (reweighted to 0) 16x8TB disks; it has been running for 4 days and for the last 12h it has been kind of stuck at cluster:
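A minimal sketch of commands one might use to inspect and nudge recovery throughput on a Nautilus cluster (the throttle values below are examples, not recommendations, and osd.0 is a placeholder):

# Check how far recovery/backfill has progressed and whether any PGs look stuck
$ ceph status
$ ceph pg dump pgs_brief | grep -E 'backfill|recover'

# Recovery speed is throttled per OSD; these can be raised temporarily
$ ceph config set osd osd_max_backfills 4
$ ceph config set osd osd_recovery_max_active 8

# Verify the settings took effect on a given OSD
$ ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'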

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-02 Thread Brad Hubbard
On Wed, Oct 2, 2019 at 9:00 PM Marc Roos wrote: > > > > Hi Brad, > > I was following the thread where you advised on this pg repair. > > I ran these rados 'list-inconsistent-obj'/'rados > list-inconsistent-snapset' and have output on the snapset. I tried to > extrapolate your comment on the data/om

Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Robin H. Johnson
On Wed, Oct 02, 2019 at 01:48:40PM +0200, Christian Pedersen wrote: > Hi Martin, > > Even before adding cold storage on HDD, I had the cluster with SSD only. That > also could not keep up with deleting the files. > I am nowhere near I/O exhaustion on the SSDs or even the HDDs. Please see my pres
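For reference, a hedged sketch of how lifecycle progress can be inspected and triggered manually, and where the processing window is configured (the rgw section name in ceph.conf is a placeholder):

# Per-bucket lifecycle status (UNINITIAL / PROCESSING / COMPLETE)
$ radosgw-admin lc list

# Force a lifecycle pass outside the normal work window
$ radosgw-admin lc process

# in ceph.conf on the rgw hosts; widen the window in which lc is allowed to run
[client.rgw.gateway1]
    rgw lifecycle work time = 00:00-23:59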

Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Mike Christie
On 10/02/2019 02:15 PM, Kilian Ries wrote: > Ok, I just compared my local python files and the git commit you sent me > - it really looks like I have the old files installed. All the changes > are missing in my local files. > > > > Where can I get a new ceph-iscsi-config package that has the fixe

[ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-02 Thread Vladimir Brik
Hello, I am running a Ceph 14.2.2 cluster, and a few days ago the memory consumption of our OSDs started to grow unexpectedly on all 5 nodes, after being stable for about 6 months. Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png Average OSD resident size: https://icecube.wisc.ed
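A hedged sketch of how one might narrow down where the memory is going (osd.0 is a placeholder; the daemon commands are run on the node hosting that OSD):

# Per-pool memory accounting inside the OSD (bluestore caches, pglog, etc.)
$ ceph daemon osd.0 dump_mempools

# The cache autotuner aims at this target (bytes)
$ ceph daemon osd.0 config get osd_memory_target

# Heap stats as seen by the allocator; releasing freed pages back to the OS
# can rule out allocator bloat
$ ceph tell osd.0 heap stats
$ ceph tell osd.0 heap release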

Re: [ceph-users] Local Device Health PG inconsistent

2019-10-02 Thread Reed Dier
And now to come full circle. Sadly, my solution was to run > $ ceph pg repair 33.0 which returned > 2019-10-02 15:38:54.499318 osd.12 (osd.12) 181 : cluster [DBG] 33.0 repair > starts > 2019-10-02 15:38:55.502606 osd.12 (osd.12) 182 : cluster [ERR] 33.0 repair : > stat mismatch, got 264/26
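For context, a minimal sketch of the usual inspect-then-repair sequence for a single inconsistent PG (33.0 is the PG from the post above; output will differ per cluster):

# Identify the inconsistent PG and the nature of the inconsistency
$ ceph health detail | grep -i inconsist
$ rados list-inconsistent-obj 33.0 --format=json-pretty

# Trigger the repair and watch the cluster log for "repair starts" / "repair ok"
$ ceph pg repair 33.0
$ ceph -w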

Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
Ok, I just compared my local python files and the git commit you sent me - it really looks like I have the old files installed. All the changes are missing in my local files. Where can I get a new ceph-iscsi-config package that has the fix included? I have installed version: ceph-iscsi-confi
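A hedged sketch of checking what is installed and pulling in a newer build (package and repo handling below assume an EL7 host; adjust for your distribution and repository):

# What is currently installed on the gateway node
$ rpm -q ceph-iscsi-config ceph-iscsi-cli tcmu-runner

# See whether the configured repos offer something newer
$ yum list available ceph-iscsi-config --showduplicates

# Update just that package
$ yum update ceph-iscsi-config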

Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
Yes, I created all four LUNs with these sizes: lun0 - 5120G lun1 - 5121G lun2 - 5122G lun3 - 5123G It's always one GB more per LUN... Is there any newer ceph-iscsi-config package than the one I have installed? ceph-iscsi-config-2.6-2.6.el7.noarch Then I could try to update the package and see if
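A minimal sketch of comparing the size the RBD layer reports with what the gateway configuration exposes (pool and image names are placeholders; run gwcli on a node where it still works):

# Size as recorded in RBD; repeat for each backing image
$ rbd info rbd/lun0 | grep size

# Size as reported by the gateway configuration
$ gwcli ls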

[ceph-users] MDS Stability with lots of CAPS

2019-10-02 Thread Stefan Kooman
Hi, According to [1] there are new parameters in place to make the MDS behave more stably. Quoting that blog post: "One of the more recent issues we've discovered is that an MDS with a very large cache (64+GB) will hang during certain recovery events." For all of us that are not (yet) running Nauti
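A hedged sketch of inspecting the current MDS cache and caps state before touching any of those parameters (mds.a is a placeholder daemon name; the specific recall options and their defaults vary by release):

# Configured cache size limit and current cache usage
$ ceph daemon mds.a config get mds_cache_memory_limit
$ ceph daemon mds.a cache status

# Caps held per client session
$ ceph daemon mds.a session ls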

Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Jason Dillaman
On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries wrote: > > Hi, > > > I'm running a Ceph Mimic cluster with 4x iSCSI gateway nodes. The cluster was > set up via ceph-ansible v3.2-stable. I just checked my nodes and saw that only > two of the four configured iSCSI gw nodes are working correctly. I first > no

[ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
Hi, I'm running a Ceph Mimic cluster with 4x iSCSI gateway nodes. The cluster was set up via ceph-ansible v3.2-stable. I just checked my nodes and saw that only two of the four configured iSCSI gw nodes are working correctly. I first noticed via gwcli: ### $gwcli -d ls Traceback (most recent cal
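For reference, a rough sketch of the checks that usually follow a gwcli traceback on a gateway node (service names assume the standard ceph-iscsi systemd units):

# Are the gateway daemons actually running on this node?
$ systemctl status rbd-target-api rbd-target-gw tcmu-runner

# Recent errors from the API service that gwcli talks to
$ journalctl -u rbd-target-api --since "1 hour ago"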

Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Christian Pedersen
Hi Martin, Even before adding cold storage on HDD, I had the cluster with SSD only. That also could not keep up with deleting the files. I am nowhere near I/O exhaustion on the SSDs or even the HDDs. Cheers, Christian On Oct 2 2019, at 1:23 pm, Martin Verges wrote: > Hello Christian, > > the

Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Martin Verges
Hello Christian, the problem is that HDDs are not capable of providing the large number of IOPS required for "~4 million small files". -- Martin Verges Managing director Mobile: +49 174 9335695 E-Mail: martin.ver...@croit.io Chat: https://t.me/MartinVerges croit GmbH, Freseniusstr. 31h, 81247 Munich CEO: Mar

[ceph-users] Ceph pg repair clone_missing?

2019-10-02 Thread Marc Roos
Hi Brad, I was following the thread where you advised on this pg repair. I ran these rados 'list-inconsistent-obj'/'rados list-inconsistent-snapset' and have output on the snapset. I tried to extrapolate your comment on the data/omap_digest_mismatch_info onto my situation. But I don't know
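A minimal sketch of the commands referenced above, using a hypothetical PG 17.36 (the actual PG id comes from ceph health detail):

# Which PGs are flagged inconsistent
$ ceph health detail | grep -i inconsistent

# Object-level and snapshot-level inconsistency reports
$ rados list-inconsistent-obj 17.36 --format=json-pretty
$ rados list-inconsistent-snapset 17.36 --format=json-pretty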

[ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Christian Pedersen
Hi, Using the S3 gateway I store ~4 million small files in my cluster every day. I have a lifecycle set up to move these files to cold storage after a day and delete them after two days. The default storage is SSD based and the cold storage is HDD. However, the rgw lifecycle process cannot keep up
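A hedged sketch of what such a lifecycle policy could look like when applied through the S3 API; the bucket name, the "COLD" storage class and the use of the aws CLI are assumptions for illustration, not taken from the post:

$ cat lifecycle.json
{
  "Rules": [
    {
      "ID": "tier-then-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 1, "StorageClass": "COLD"}],
      "Expiration": {"Days": 2}
    }
  ]
}

$ aws --endpoint-url http://rgw.example.com s3api \
    put-bucket-lifecycle-configuration \
    --bucket mybucket --lifecycle-configuration file://lifecycle.json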

Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Stefan Kooman
> > I created this issue: https://tracker.ceph.com/issues/42116 > > It seems to be related to the 'crash' module not being enabled. > > If you enable the module the problem should be gone. Now I need to check > why this message is popping up. Yup, crash module enabled and error message is gone. Either w
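For anyone hitting the same message, a minimal sketch of enabling the crash module and confirming it responds (standard mgr module, no extra configuration assumed):

$ ceph mgr module enable crash
$ ceph crash ls    # should now return an (empty) list instead of an error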

Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Wido den Hollander
On 10/1/19 4:38 PM, Stefan Kooman wrote: > Quoting Wido den Hollander (w...@42on.com): >> Hi, >> >> The Telemetry [0] module has been in Ceph since the Mimic release and >> when enabled it sends an anonymized JSON report to >> https://telemetry.ceph.com/ every 72 hours with information about t
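A brief sketch of turning the module on and seeing what would be sent (commands as they exist in Nautilus; review the output before opting in):

# Enable the mgr module, inspect the report, then opt in
$ ceph mgr module enable telemetry
$ ceph telemetry show
$ ceph telemetry on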