[ceph-users] "ceph pg scrub" does not start

2018-06-21 Thread Jake Grimmett
Dear All, A bad disk controller appears to have damaged our cluster... # ceph health HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent probing to find bad pg... # ceph health detail HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent OSD_SCRUB_ERRORS 10 scr
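For reference, one common way to locate and inspect the inconsistent PGs is sketched below; <pgid> is a placeholder for the IDs reported by health detail, and repair should only be run once the underlying controller problem has been dealt with:

  # ceph health detail | grep inconsistent
  # rados list-inconsistent-obj <pgid> --format=json-pretty
  # ceph pg repair <pgid>

list-inconsistent-obj shows which shard(s) failed their checksums, which helps confirm the damage is confined to the OSDs behind the bad controller.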

Re: [ceph-users] "ceph pg scrub" does not start

2018-06-21 Thread Wido den Hollander
On 06/21/2018 11:11 AM, Jake Grimmett wrote: > Dear All, > > A bad disk controller appears to have damaged our cluster... > > # ceph health > HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent > > probing to find bad pg... > > # ceph health detail > HEALTH_ERR 10 scrub err

[ceph-users] MDS reports metadata damage

2018-06-21 Thread Hennen, Christian
Dear Community, here at ZIMK at the University of Trier we operate a Ceph Luminous cluster as a filer for an HPC environment via CephFS (Bluestore backend). During setup last year we made the mistake of not configuring the RAID as JBOD, so initially the 3 nodes only housed 1 OSD each. Currently, w

[ceph-users] init mon fail since use service rather than systemctl

2018-06-21 Thread xiang . dai
I ran into the issue below: INFO: initialize ceph mon ... [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.25): /usr/bin/ceph-deploy --overwrite-conf mon create-initial [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts dx-st
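On a systemd-based distribution the monitor is normally managed through systemctl rather than the legacy service wrapper; a minimal sketch, assuming the mon id matches the short hostname:

  # systemctl start ceph-mon@$(hostname -s)
  # systemctl enable ceph-mon.target
  # systemctl status ceph-mon@$(hostname -s)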

Re: [ceph-users] CentOS Dojo at CERN

2018-06-21 Thread Kai Wagner
On 20.06.2018 17:39, Dan van der Ster wrote: > And BTW, if you can't make it to this event we're in the early days of > planning a dedicated Ceph + OpenStack Days at CERN around May/June > 2019. > More news on that later... Will that be during a CERN maintenance window? *that would raise my intere

Re: [ceph-users] init mon fail since use service rather than systemctl

2018-06-21 Thread Alfredo Deza
On Thu, Jun 21, 2018 at 8:41 AM, wrote: > I met below issue: > > INFO: initialize ceph mon ... > [ceph_deploy.conf][DEBUG ] found configuration file at: > /root/.cephdeploy.conf > [ceph_deploy.cli][INFO ] Invoked (1.5.25): /usr/bin/ceph-deploy > --overwrite-conf mon create-initial > [ceph_deploy

Re: [ceph-users] CentOS Dojo at CERN

2018-06-21 Thread Dan van der Ster
On Thu, Jun 21, 2018 at 2:41 PM Kai Wagner wrote: > > On 20.06.2018 17:39, Dan van der Ster wrote: > > And BTW, if you can't make it to this event we're in the early days of > > planning a dedicated Ceph + OpenStack Days at CERN around May/June > > 2019. > > More news on that later... > Will that

[ceph-users] Designating an OSD as a spare

2018-06-21 Thread Drew Weaver
Does anyone know if it is possible to designate an OSD as a spare, so that if a disk dies in a host no administrative action needs to be taken immediately to remedy the situation? Thanks, -Drew

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread John Spray
On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof wrote: > > Thanks for the response. I was also hoping to be able to debug better > once we got onto Mimic. We just finished that upgrade yesterday and > cephfs-journal-tool does find a corruption in the purge queue though > our MDS continues to st
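For context, a sketch of inspecting the purge queue journal, assuming the Mimic-era tool syntax with --rank and --journal (the filesystem name is a placeholder):

  # cephfs-journal-tool --rank=<fsname>:0 --journal=purge_queue journal inspect
  # cephfs-journal-tool --rank=<fsname>:0 --journal=purge_queue header get

'journal inspect' reports whether the purge queue is readable end to end, and 'header get' shows the current expire_pos and write_pos.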

Re: [ceph-users] "ceph pg scrub" does not start

2018-06-21 Thread Jake Grimmett
On 21/06/18 10:14, Wido den Hollander wrote: Hi Wido, >> Note the date stamps; the scrub command appears to be ignored >> >> Any ideas on why this is happening, and what we can do to fix the error? > > Are any of the OSDs involved with that PG currently doing recovery? If > so, they will ignore
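A quick way to check whether recovery is what is blocking the scrub (the pg id is a placeholder):

  # ceph -s
  # ceph pg ls recovering
  # ceph pg <pgid> query | grep -i recover

If the PG's OSDs are still recovering, the scrub request is held off until recovery finishes; alternatively osd_scrub_during_recovery can be enabled, at the cost of extra load while recovery is running.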

Re: [ceph-users] Designating an OSD as a spare

2018-06-21 Thread Paul Emmerich
Spare disks are bad design. There is no point in having a disk that is not being used. Ceph will automatically remove a dead disk from the cluster after 15 minutes, backfilling its data onto other disks. Paul 2018-06-21 14:54 GMT+02:00 Drew Weaver : > Does anyone know if it is possible to desi
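The relevant knob is the mon's down-out interval; a ceph.conf sketch, with 900 s shown as an example value only:

  [mon]
  mon osd down out interval = 900

After that interval a down OSD is marked out and its PGs are backfilled onto the remaining disks, which is what makes a dedicated spare unnecessary as long as the cluster has free capacity.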

Re: [ceph-users] Designating an OSD as a spare

2018-06-21 Thread Drew Weaver
Yes. Eventually, however, you would probably want to replace the physical disk that has died, and with remote deployments it is sometimes nice not to have to do that instantly, which is how enterprise arrays and support contracts have worked for decades. I understand your point from a purely tech

Re: [ceph-users] Designating an OSD as a spare

2018-06-21 Thread Wido den Hollander
On 06/21/2018 03:35 PM, Drew Weaver wrote: > Yes, > >   > > Eventually however you would probably want to replace that physical disk > that has died and sometimes with remote deployments it is nice to not > have to do that instantly which is how enterprise arrays and support > contracts have wo

Re: [ceph-users] Backfill stops after a while after OSD reweight

2018-06-21 Thread Paul Emmerich
Your CRUSH rules will not change automatically. Check out the documentation for changing tunables: http://docs.ceph.com/docs/mimic/rados/operations/crush-map/#tunables 2018-06-20 18:27 GMT+02:00 Oliver Schulz : > Thanks, Paul - I could probably activate the Jewel tunables > profile without losin
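A sketch of inspecting and switching tunables; note that changing the profile can trigger significant data movement, so read the linked documentation first:

  # ceph osd crush show-tunables
  # ceph osd crush tunables jewel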

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread Benjeman Meekhof
Thanks very much John! Skipping over the corrupt entry by setting a new expire_pos seems to have worked. The journal expire_pos is now advancing and pools are being purged. It has a little while to go to catch up to current write_pos but the journal inspect command gives an 'OK' for overall inte
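For anyone hitting the same problem, the skip amounts to advancing the purge-queue header; a sketch of the corresponding commands, assuming the Mimic tool syntax, with <offset> standing for the first valid entry after the corruption:

  # cephfs-journal-tool --rank=<fsname>:0 --journal=purge_queue header get
  # cephfs-journal-tool --rank=<fsname>:0 --journal=purge_queue header set expire_pos <offset>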

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread Benjeman Meekhof
I do have one related follow-up question: while doing this I took all the standby MDS offline, and max_mds on our cluster is at 1. Were I to enable multiple MDS, would they all actively split up processing the purge queue? We have not yet allowed multi-active MDS but plan to en

Re: [ceph-users] MDS: journaler.pq decode error

2018-06-21 Thread John Spray
On Thu, Jun 21, 2018 at 4:39 PM Benjeman Meekhof wrote: > > I do have one follow-up related question: While doing this I took > offline all the standby MDS, and max_mds on our cluster is at 1. Were > I to enable multiple MDS would they all actively split up processing > the purge queue? When yo
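For reference, enabling a second active MDS on Mimic should just be (the filesystem name is a placeholder):

  # ceph fs set <fsname> max_mds 2

Each active rank keeps its own metadata journal and purge queue, so adding ranks spreads future purge work across them rather than splitting the existing rank-0 queue.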

[ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread mj
Hi, I'm trying out bonding to improve Ceph performance on our cluster (currently in a test setup, using 1G NICs instead of 10G). The setup is like this on the ProCurve 5412 chassis: Procurve chassis(config)# show trunk Load Balancing Method: L4-based Port | Name Type
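For comparison, the Linux side of the bond; a minimal sketch assuming Debian-style ifupdown (interface names and address are placeholders), with the hash policy chosen to match the switch's L4-based trunk balancing:

  auto bond0
  iface bond0 inet static
      address 10.0.0.10/24
      bond-slaves eno1 eno2
      bond-mode 802.3ad
      bond-miimon 100
      bond-xmit-hash-policy layer3+4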

Re: [ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread mj
Pff, the layout for the switch configuration was messed up. Sorry. Here it is again, hopefully better this time: Procurve chassis(config)# show trunk Load Balancing Method: L4-based Port | Name Type | Group Type + --

Re: [ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread Jacob DeGlopper
Consider trying some variation in source and destination IP addresses and port numbers - unless you force it, iperf3 at least tends to pick only even port numbers for the ephemeral source port, which leads to all traffic being balanced to one link. In your example, where you see one link being

[ceph-users] Centos kernel

2018-06-21 Thread Steven Vacaroaia
Hi, just wondering if you would recommend using the newest kernel on CentOS (i.e. after installing regular CentOS (3.10.0-862), enable elrepo-kernel and install 4.17) or simply staying with the stock one. Many thanks, Steven
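Assuming the elrepo-release repo package is already installed, the mainline kernel install is roughly:

  # yum --enablerepo=elrepo-kernel install kernel-ml
  # grub2-set-default 0
  # reboot

Whether it is worth it mostly depends on the kernel client features you need (krbd, kernel CephFS); for OSD/mon nodes the daemons run in userspace, so the stock el7 kernel is generally fine.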

Re: [ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread mj
Hi Jacob, Thanks for your reply. But I'm not sure I completely understand it. :-) On 06/21/2018 09:09 PM, Jacob DeGlopper wrote: In your example, where you see one link being used, I see an even source IP paired with an odd destination port number for both transfers, or is that a search and re

Re: [ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread Jacob DeGlopper
OK, that was a search-and-replace error in the original quote. This is still something with your layer 3/4 load balancing. iperf2 does not support setting the source port, but iperf3 does - that might be worth a try.     -- jacob On 06/21/2018 03:37 PM, mj wrote: Hi Jacob, Thanks for you
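A sketch of forcing distinct source ports with iperf3 so the layer 3/4 hash sees different flows (addresses and ports are arbitrary examples):

  # on the server
  iperf3 -s -p 5201
  # on the client, two runs with different client-side ports
  iperf3 -c 10.0.0.20 -p 5201 --cport 50001
  iperf3 -c 10.0.0.20 -p 5201 --cport 50002

With several such flows in parallel you should see traffic spread across the trunk links.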

Re: [ceph-users] lacp bonding | working as expected..?

2018-06-21 Thread Paul Emmerich
It's load balanced via a hash over the specified fields. So if you only have two flows, there's a 50% chance that they will end up on the same link. Your real traffic will have a lot more flows, so it will even out. Paul 2018-06-21 21:48 GMT+02:00 Jacob DeGlopper : > OK, that was a search-and-repla

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-21 Thread Brad Hubbard
That seems like an authentication issue? Try running it like so... $ ceph --debug_monc 20 --debug_auth 20 pg 18.2 query On Thu, Jun 21, 2018 at 12:18 AM, Andrei Mikhailovsky wrote: > Hi Brad, > > Yes, but it doesn't show much: > > ceph pg 18.2 query > Error EPERM: problem getting command descri

Re: [ceph-users] init mon fail since use service rather than systemctl

2018-06-21 Thread xiang....@sky-data.cn
Thanks very much - Original Message - From: "Alfredo Deza" To: "xiang dai" Cc: "ceph-users" Sent: Thursday, June 21, 2018 8:42:34 PM Subject: Re: [ceph-users] init mon fail since use service rather than systemctl On Thu, Jun 21, 2018 at 8:41 AM, wrote: > I met below issue: > > INFO:

Re: [ceph-users] PG status is "active+undersized+degraded"

2018-06-21 Thread Dave.Chen
Hi Burkhard, Thanks for your explanation. I created a new 2 TB OSD on another node, and it indeed solved the issue; the status of the Ceph cluster is "health HEALTH_OK" now. Another question: if three homogeneous OSDs are spread across 2 nodes, I still get the warning message, and the status i

Re: [ceph-users] How to throttle operations like "rbd rm"

2018-06-21 Thread ceph
Hi Paul, On 14 June 2018 00:33:09 CEST, Paul Emmerich wrote: >2018-06-13 23:53 GMT+02:00 : > >> Hi yao, >> >> IIRC there is a *sleep* option which is useful when a delete operation >is >> being done from ceph, sleep_trim or something like that. >> > >you are thinking of "osd_snap_trim_sleep"
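For reference, the option named above is set per OSD (seconds to pause between snapshot-trim operations; the value is an example only) - note that it throttles snapshot trimming on the OSDs rather than the object deletion done by an "rbd rm" itself:

  [osd]
  osd snap trim sleep = 0.1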

Re: [ceph-users] PG status is "active+undersized+degraded"

2018-06-21 Thread Dave.Chen
I saw this statement at this link ( http://docs.ceph.com/docs/master/rados/operations/crush-map/ ); is that the reason which leads to the warning? " This, combined with the default CRUSH failure domain, ensures that replicas or erasure code shards are separated across hosts and a single ho
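If staying on 2 nodes, one way to let CRUSH place all three replicas is to relax the failure domain from host to osd (giving up host-level redundancy in the process); a sketch with placeholder names:

  # ceph osd crush rule create-replicated replicated_osd default osd
  # ceph osd pool set <pool> crush_rule replicated_osd

Otherwise the usual answer is a third node, so that the default host failure domain can be honoured.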