Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread Damian Dabrowski
OK, now I understand, thanks for all these helpful answers! On Sat, Apr 7, 2018, 15:26 David Turner wrote: > I'm seconding what Greg is saying. There is no reason to set nobackfill > and norecover just for restarting OSDs. That will only cause the problems > you're seeing without giving you any be

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread David Turner
I'm seconding what Greg is saying. There is no reason to set nobackfill and norecover just for restarting OSDs. That will only cause the problems you're seeing without giving you any benefit. There are reasons to use norecover and nobackfill, but unless you're manually editing the crush map, having
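
A minimal sketch of the restart procedure this advice implies, assuming systemd-managed OSDs (the OSD id is illustrative):

    # noout keeps CRUSH from marking the OSD out and rebalancing while it is briefly down
    ceph osd set noout
    systemctl restart ceph-osd@12
    # wait for the OSD to rejoin and PGs to return to active+clean, then:
    ceph osd unset noout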

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-04 Thread Gregory Farnum
On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski wrote: > Greg, thanks for your reply! > > I think your idea makes sense; I've done tests and it's quite hard for me to > understand. I'll try to explain my situation in a few steps > below. > I think that ceph shows progress in recovery, but it can on

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Damian Dabrowski
Greg, thanks for your reply! I think your idea makes sense; I've done tests and it's quite hard for me to understand. I'll try to explain my situation in a few steps below. I think that ceph shows progress in recovery, but it can only handle objects which didn't really change. It won't try to repair

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Gregory Farnum
On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski wrote: > Hello, > > A few days ago I had a very strange situation. > > I had to turn off a few OSDs for a while. So I set the flags noout, > nobackfill, norecover and then turned off the selected OSDs. > All was OK, but when I started these OSDs again all VMs

[ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Damian Dabrowski
Hello, A few days ago I had a very strange situation. I had to turn off a few OSDs for a while. So I set the flags noout, nobackfill, norecover and then turned off the selected OSDs. All was OK, but when I started these OSDs again all VMs went down due to the recovery process (even when recovery priority was ver
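
For reference, the flag sequence described above maps to these commands (a sketch; the unset order simply mirrors the set order):

    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norecover
    # ... stop the selected OSDs, do the maintenance, start them again ...
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset noout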

Re: [ceph-users] ceph recovery incomplete PGs on Luminous RC

2017-07-24 Thread Daniel K
I was able to export the PGs using ceph-objectstore-tool and import them to the new OSDs. I moved some other OSDs from the bare metal on a node into a virtual machine on the same node and was surprised at how easy it was. Install ceph in the VM (using ceph-deploy) -- stop the OSD and dismount
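
A sketch of the export/import step described, assuming the source and destination OSDs are stopped first; the data paths and pgid (1.2a) are illustrative:

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --op export --pgid 1.2a --file /tmp/pg-1.2a.export
    # filestore OSDs may also need --journal-path
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 \
        --op import --file /tmp/pg-1.2a.export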

Re: [ceph-users] ceph recovery incomplete PGs on Luminous RC

2017-07-24 Thread Gregory Farnum
On Fri, Jul 21, 2017 at 10:23 PM Daniel K wrote: > Luminous 12.1.0 (RC) > > I replaced two OSD drives (old ones were still good, just too small), using: > > ceph osd out osd.12 > ceph osd crush remove osd.12 > ceph auth del osd.12 > systemctl stop ceph-osd@osd.12 > ceph osd rm osd.12 > > I later fo

[ceph-users] ceph recovery incomplete PGs on Luminous RC

2017-07-21 Thread Daniel K
Luminous 12.1.0 (RC) I replaced two OSD drives (old ones were still good, just too small), using:

    ceph osd out osd.12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    systemctl stop ceph-osd@osd.12
    ceph osd rm osd.12

I later found that I also should have unmounted it from /var/lib/ceph/osd-12 (r
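
For comparison, a sketch of the same removal with the stop and the missed unmount pulled ahead of the CRUSH/auth/map removals (the mount point follows the poster's own path):

    ceph osd out osd.12
    systemctl stop ceph-osd@osd.12
    umount /var/lib/ceph/osd-12        # the step the poster says was missed
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm osd.12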

Re: [ceph-users] Ceph recovery

2017-05-30 Thread David Turner
I just responded to this on the thread "Strange remap on host failure". I think that response covers your question. On Mon, May 29, 2017, 4:10 PM Laszlo Budai wrote: > Hello, > > Can someone give me some directions on how ceph recovery works? > Let's suppose we have a ceph cluster with sever

[ceph-users] Ceph recovery

2017-05-29 Thread Laszlo Budai
Hello, Can someone give me some directions on how ceph recovery works? Let's suppose we have a ceph cluster with several nodes grouped in 3 racks (2 nodes/rack). The crush map is configured to distribute PGs on OSDs from different racks. What happens if a node fails? Where can I read a des
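
An illustrative CRUSH rule of the kind described, spreading replicas across racks (a sketch; the rule name and numbers are made up):

    rule replicated_across_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }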

[ceph-users] Ceph recovery stuck

2016-12-06 Thread Ben Erridge
We are using ceph 0.80.9 and we recently recovered from a power outage which caused some data loss. We had the replica count set to 1. Since then we have installed another node with the idea that we would change the replica count to 3. We tried to change 1 of the pools to replica 3 but it always gets stuck. It's be
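
For reference, the replica change being attempted is set per pool (a sketch; the pool name is illustrative, and min_size should usually be raised along with size):

    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2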

Re: [ceph-users] Ceph Recovery

2016-05-18 Thread Gaurav Bafna
You mean that you never see recovery without crush map removal? That is strange. I see quick recovery in our two small clusters, and even in our production cluster when a daemon is killed. It's only when an osd crashes that I don't see recovery in production. Let me talk to the ceph-devel community to find wheth

Re: [ceph-users] Ceph Recovery

2016-05-18 Thread Lazuardi Nasution
Hi Gaurav, It could be an issue. But I never see crush map removal without recovery. Best regards, On Wed, May 18, 2016 at 1:41 PM, Gaurav Bafna wrote: > Is it a known issue and is it expected? > > When an osd is marked out, the reweight becomes 0 and the PGs should > get remapped, right?

Re: [ceph-users] Ceph Recovery

2016-05-17 Thread Gaurav Bafna
Is it a known issue and is it expected? When an osd is marked out, the reweight becomes 0 and the PGs should get remapped, right? I do see recovery after removing it from the crush map. Thanks Gaurav On Wed, May 18, 2016 at 12:08 PM, Lazuardi Nasution wrote: > Hi Gaurav, > > Not only marked out,

Re: [ceph-users] Ceph Recovery

2016-05-17 Thread Lazuardi Nasution
Hi Gaurav, Not only marked out; you need to remove it from the crush map to make sure the cluster does auto recovery. It seems that the marked-out OSD still appears in the crush map calculation, so it must be removed manually. You will see that there is a recovery process after you remove the OSD from the crush map.
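
The two operations being contrasted in this sub-thread, side by side (a sketch; osd.12 is illustrative):

    ceph osd out 12                  # mark out: reweight drops to 0, PGs remap
    ceph osd crush remove osd.12     # remove from the CRUSH map entirely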

Re: [ceph-users] Ceph Recovery

2016-05-16 Thread Gaurav Bafna
Hi Lazuardi, No, there are no unfound or incomplete PGs. Replacing the osds surely makes the cluster healthy. But the problem should not have occurred in the first place. The cluster should have automatically healed after the OSDs were marked out of the cluster. Otherwise this will be a manual process

Re: [ceph-users] Ceph Recovery

2016-05-16 Thread Lazuardi Nasution
Gaurav, Are there any unfound or incomplete PGs? If not, you can remove the OSD (while monitoring ceph -w and ceph -s output) and then replace it with a good one, one OSD at a time. I have done that successfully. Best regards, On Tue, May 17, 2016 at 12:30 PM, Gaurav Bafna wrote: > Even I faced the

Re: [ceph-users] Ceph Recovery

2016-05-16 Thread Lazuardi Nasution
Hi Wido, The 75% happened on 4 nodes of 24 OSDs each, with a pool size of two and a minimum size of one. Any relation between this configuration and the 75%? Best regards, On Tue, May 17, 2016 at 3:38 AM, Wido den Hollander wrote: > > > On 14 May 2016 at 12:36, Lazuardi Nasution < > mrxlazuar...@g

Re: [ceph-users] Ceph Recovery

2016-05-16 Thread Gaurav Bafna
Even I faced the same issue with our production cluster.

    cluster fac04d85-db48-4564-b821-deebda046261
     health HEALTH_WARN
            658 pgs degraded
            658 pgs stuck degraded
            688 pgs stuck unclean
            658 pgs stuck undersized
            658 pgs undersized

Re: [ceph-users] Ceph Recovery

2016-05-16 Thread Wido den Hollander
> On 14 May 2016 at 12:36, Lazuardi Nasution wrote: > > > Hi Wido, > > Yes, you are right. After removing the down OSDs, reformatting, and bringing > them up again, at least until 75% of total OSDs, my Ceph Cluster is healthy > again. It seems there is a high probability of data safety if the total a

Re: [ceph-users] Ceph Recovery

2016-05-14 Thread Lazuardi Nasution
Hi Wido, Yes, you are right. After removing the down OSDs, reformatting, and bringing them up again, at least until 75% of total OSDs, my Ceph Cluster is healthy again. It seems there is a high probability of data safety if the total active PGs equal the total PGs and the total degraded PGs equal the total un

Re: [ceph-users] Ceph Recovery

2016-05-13 Thread Wido den Hollander
> On 13 May 2016 at 11:55, Lazuardi Nasution wrote: > > > Hi Wido, > > The status is the same after 24 hours of running. It seems that the status will not > go to fully active+clean until all down OSDs are back again. The only way to > get down OSDs back again is reformatting, or replacing them if the HDDs have

Re: [ceph-users] Ceph Recovery

2016-05-13 Thread Lazuardi Nasution
Hi Wido, The status is the same after 24 hours of running. It seems that the status will not go to fully active+clean until all down OSDs are back again. The only way to get down OSDs back again is reformatting, or replacing them if the HDDs have hardware issues. Do you think that is a safe way to do it? Best regards,

Re: [ceph-users] Ceph Recovery

2016-05-13 Thread Wido den Hollander
> On 13 May 2016 at 11:34, Lazuardi Nasution wrote: > > > Hi, > > After a disaster and restarting for automatic recovery, I found the following > ceph status. Some OSDs cannot be restarted due to file system corruption > (it seems that xfs is fragile). > > [root@management-b ~]# ceph status > c

[ceph-users] Ceph Recovery

2016-05-13 Thread Lazuardi Nasution
Hi, After a disaster and restarting for automatic recovery, I found the following ceph status. Some OSDs cannot be restarted due to file system corruption (it seems that xfs is fragile).

    [root@management-b ~]# ceph status
        cluster 3810e9eb-9ece-4804-8c56-b986e7bb5627
         health HEALTH_WARN

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman
I expected it to return to osd.36. Oh, if you set "noout" during this process then the pg won't move around when you down osd.36. I expected osd.36 to go down and come back up quickly. Also, pg 10.4f is in the same situation, so try the same thing on osd.6. David On 3/8/16 1:05 PM, Ben Hines

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread Ben Hines
After making that setting, the pg appeared to start peering, but then it actually changed the primary OSD to osd.100, then went incomplete again. Perhaps it did that because another OSD had more data? I presume I need to set that value on each OSD the pg hops to. -Ben On Tue, Mar 8, 2016 at

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman
Ben, I haven't looked at everything in your message, but pg 12.7a1 has lost data because of writes that went only to osd.73. The way to recover this is to force recovery to ignore this fact and go with whatever data you have on the remaining OSDs. I assume that having min_size 1, having multip
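
"That setting" in the follow-up above is not named in these previews; a plausible sketch, assuming the knob usually cited for this situation (osd_find_best_info_ignore_history_les) is the one meant, set in ceph.conf for the acting OSD, followed by an OSD restart, and reverted once the pg recovers:

    # Assumption: this is the setting the thread refers to; osd.73 is the OSD named above
    [osd.73]
    osd find best info ignore history les = true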

[ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-07 Thread Ben Hines
Howdy, I was hoping someone could help me recover a couple pgs which are causing problems in my cluster. If we aren't able to resolve this soon, we may have to just destroy them and lose some data. Recovery has so far been unsuccessful. Data loss would probably cause some here to reconsider Ceph a

Re: [ceph-users] Ceph recovery network?

2015-04-27 Thread Sebastien Han
Well yes, “pretty much” the same thing :). I think some people would like to distinguish recovery from replication and maybe perform some QoS around these two. We have to replicate while recovering, so one can impact the other. In the end, I just think it’s a doc issue; still waiting for a dev to ans

Re: [ceph-users] Ceph recovery network?

2015-04-26 Thread Robert LeBlanc
My understanding is that Monitors monitor the public address of the OSDs and other OSDs monitor the cluster address of the OSDs. Replication, recovery and backfill traffic all use the same network when you specify 'cluster network = ' in your ceph.conf. It is useful to remember that replication, re

[ceph-users] Ceph recovery network?

2015-04-26 Thread Sebastien Han
Hi list, While reading this http://ceph.com/docs/master/rados/configuration/network-config-ref/#ceph-networks, I came across the following sentence: “You can also establish a separate cluster network to handle OSD heartbeat, object replication and recovery traffic”. I didn’t know it was possib
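
A minimal ceph.conf sketch of the two-network split the docs describe (the subnets are illustrative):

    [global]
    public network  = 192.168.1.0/24
    cluster network = 10.10.1.0/24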

Re: [ceph-users] ceph recovery killing vms

2013-11-05 Thread Kevin Weiler
Kurt Bauer <kurt.ba...@univie.ac.at> Date: Tuesday, November 5, 2013 2:52 PM To: Kevin Weiler <kevin.wei...@imc-chicago.com> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] ce

Re: [ceph-users] ceph recovery killing vms

2013-11-05 Thread Kurt Bauer
Kevin Weiler wrote: > Thanks Kyle, > > What's the unit for osd recovery max chunk? Have a look at http://ceph.com/docs/master/rados/configuration/osd-config-ref/ where all the possible OSD config options are described; in particular, have a look at the backfilling and recovery sections. > > Also, h

Re: [ceph-users] ceph recovery killing vms

2013-11-05 Thread Kevin Weiler
Thanks Kyle, What's the unit for osd recovery max chunk? Also, how do I find out what my current values are for these osd options? -- Kevin Weiler IT IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/ Phone: +1 312-204-7439 | Fax: +1 312-24
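
Neither question is answered in this preview, but for reference: osd recovery max chunk is a byte count (8388608 is 8 MB), and a running OSD's current values can be read over its admin socket. A sketch (the socket path and OSD id are illustrative):

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep recovery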

Re: [ceph-users] ceph recovery killing vms

2013-10-30 Thread Karan Singh
Hello Rzk, Would you like to share your experience with this problem and your way of solving it? This sounds interesting. Regards, Karan Singh - Original Message - From: "Rzk" To: ceph-users@lists.ceph.com Sent: Wednesday, 30 October, 2013 4:17:32 AM Subject: Re: [

Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Rzk
Thanks guys, after testing it on a dev server, I have implemented the new config in the prod system. Next I will upgrade the hard drives. :) Thanks again, all. On Tue, Oct 29, 2013 at 11:32 PM, Kyle Bader wrote: > Recovering from a degraded state by copying existing replicas to other > OSDs is going t

Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Kyle Bader
Recovering from a degraded state by copying existing replicas to other OSDs is going to cause reads on existing replicas and writes to the new locations. If you have slow media then this is going to be felt more acutely. Tuning the backfill options I posted is one way to lessen the impact, another

Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Kurt Bauer
Hi, maybe you want to have a look at the following thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/005368.html It could be that you are suffering from the same problems. Best regards, Kurt Rzk wrote: > Hi all, > > I have the same problem, just curious: > could it be caused by po

Re: [ceph-users] ceph recovery killing vms

2013-10-28 Thread Rzk
Hi all, I have the same problem, just curious: could it be caused by poor HDD performance? The read/write speed doesn't match the network speed? Currently I'm using desktop HDDs in my cluster. Rgrds, Rzk On Tue, Oct 29, 2013 at 6:22 AM, Kyle Bader wrote: > You can change some OSD tunables to

Re: [ceph-users] ceph recovery killing vms

2013-10-28 Thread Kyle Bader
You can change some OSD tunables to lower the priority of backfills:

    osd recovery max chunk: 8388608
    osd recovery op priority: 2

In general a lower op priority means it will take longer for your placement groups to go from degraded to active+clean; the idea is to balance recover
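
A sketch of applying these values, either persistently in ceph.conf or injected into running OSDs (the values are the ones quoted above):

    [osd]
    osd recovery max chunk = 8388608
    osd recovery op priority = 2

    # or at runtime:
    ceph tell osd.* injectargs '--osd_recovery_max_chunk 8388608 --osd_recovery_op_priority 2'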

[ceph-users] ceph recovery killing vms

2013-10-28 Thread Kevin Weiler
Hi all, We have a ceph cluster that is being used as a backing store for several VMs (Windows and Linux). We notice that when we reboot a node, the cluster enters a degraded state (which is expected), but when it begins to recover, it starts backfilling and it kills the performance of our VMs. The