That seems odd.  So you have 3 nodes, with 3 OSDs each.  You should've been
able to mark osd.0 down and out, then stop the daemon without having those
issues.

It's generally best to mark an osd down, then out, and wait until the
cluster has recovered completely before stopping the daemon and removing it
from the cluster.  That guarantees that you always have 3+ copies of the
data.
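
For reference, the sequence I follow is roughly the standard one from the
docs, something like this (using osd.0 as the example; the daemon stop
command depends on your init system, so adjust for Proxmox if it wraps
these in its own tooling):

  ceph osd out 0               # start draining data off the OSD
  ceph -s                      # repeat until the cluster is HEALTH_OK again
  service ceph stop osd.0      # stop the daemon, however your distro does it
  ceph osd crush remove osd.0  # remove it from the CRUSH map
  ceph auth del osd.0          # delete its auth key
  ceph osd rm 0                # finally remove it from the osdmap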

Disks don't always fail gracefully though.  If you have a sudden and
complete failure, you can't do it the nice way.  At that point, just mark
the osd down and out.  If your cluster was healthy before this event, you
shouldn't have any data problems.  If the cluster wasn't HEALTH_OK before
the event, you will likely have some problems.
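
Concretely, that's just something like:

  ceph osd down 0
  ceph osd out 0

and then let recovery do its thing before touching anything else.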

Is your cluster HEALTH_OK now?  If not, can you give me the following?

   - ceph -s
   - ceph osd tree
   - ceph osd dump | grep ^pool
   - ceph pg dump_stuck
   - ceph pg <pgid> query  # For just one of the stuck PGs


I'm a bit confused why your cluster has a bunch of PGs in the remapped
state, but none actually backfilling or recovering.  It should be
recovering, and something is blocking that.



As to the hung VMs: during any recovery or backfill, you'll probably have
IO problems.  The ceph.conf defaults are intended for large clusters,
probably with SSD journals.  In my 3-node, 24-OSD cluster with no SSD
journals, recovery was starving my clients of IO.  I de-prioritized
recovery with:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

It was still painful, but those values kept my cluster usable.  Since I've
grown to 5 nodes and added SSD journals, I've been able to increase the
backfills and recovery max active to 3.  I found those values through trial
and error, watching my RadosGW latency and playing with ceph tell osd.\*
injectargs ...
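
For reference, injecting all three at runtime looks something like:

  ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

Settings injected that way only last until the daemons restart, so keep the
values in ceph.conf as well.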

I've found that I have problems if more than 20% of my OSDs are involved in
a backfilling operation.  With your 9 OSDs, any single event is guaranteed
to hit at least 22% of your OSDs (backfill touches at least two of the
nine: the source and the destination), and probably more.
If you're unable to add more disks, I would highly recommend adding SSD
journals.



On Fri, Dec 19, 2014 at 8:08 AM, Chris Murray <chrismurra...@gmail.com>
wrote:
>
> Hello,
>
> I'm a newbie to CEPH, gaining some familiarity by hosting some virtual
> machines on a test cluster. I'm using a virtualisation product called
> Proxmox Virtual Environment, which conveniently handles cluster setup,
> pool setup, OSD creation etc.
>
> During the attempted removal of an OSD, my pool appeared to cease
> serving IO to virtual machines, and I'm wondering if I did something
> wrong or if there's something more to the process of removing an OSD.
>
> The CEPH cluster is small; 9 OSDs in total across 3 nodes. There's a
> pool called 'vmpool', with size=3 and min_size=1. It's a bit slow, but I
> see plenty of information on how to troubleshoot that, and understand I
> should be separating cluster communication onto a separate network
> segment to improve performance. CEPH version is Firefly - 0.80.7
>
> So, the issue was: I marked osd.0 as down & out (or possibly out & down,
> if order matters), and virtual machines hung. Almost immediately, 78 pgs
> were 'stuck inactive', and after some activity overnight, they remained
> that way:
>
>
>     cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
>      health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs
> stuck unclean; 4 requests are blocked > 32 sec; recovery 69696/685356
> objects degraded (10.169%)
>      monmap e3: 3 mons at
> {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
> election epoch 50, quorum 0,1,2 0,1,2
>      osdmap e669: 9 osds: 8 up, 8 in
>       pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
>             2408 GB used, 7327 GB / 9736 GB avail
>             69696/685356 objects degraded (10.169%)
>                   78 inactive
>                  720 active+clean
>                  290 active+degraded
>                  128 active+remapped
>
>
> I started the OSD to bring it back 'up'. It was still 'out'.
>
>
>     cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
>      health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery
> 30513/688554 objects degraded (4.431%)
>      monmap e3: 3 mons at
> {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
> election epoch 50, quorum 0,1,2 0,1,2
>      osdmap e671: 9 osds: 9 up, 8 in
>       pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
>             2408 GB used, 7327 GB / 9736 GB avail
>             30513/688554 objects degraded (4.431%)
>                  720 active+clean
>                   59 active+degraded
>                  437 active+remapped
>   client io 2303 kB/s rd, 153 kB/s wr, 85 op/s
>
>
> The inactive pgs had disappeared.
> I stopped the OSD again, making it 'down' and 'out', as it was previously.
> At this point, I started my virtual machines again, which functioned
> correctly.
>
>
>     cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
>      health HEALTH_WARN 368 pgs degraded; 496 pgs stuck unclean;
> recovery 83332/688554 objects degraded (12.102%)
>      monmap e3: 3 mons at
> {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
> election epoch 50, quorum 0,1,2 0,1,2
>      osdmap e673: 9 osds: 8 up, 8 in
>       pgmap v103248: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
>             2408 GB used, 7327 GB / 9736 GB avail
>             83332/688554 objects degraded (12.102%)
>                  720 active+clean
>                  368 active+degraded
>                  128 active+remapped
>   client io 19845 B/s wr, 6 op/s
>
>
> At this point, removing the OSD was successful, without any IO hanging.
>
>
> --------
>
> Have I tried to remove an OSD in an incorrect manner? I'm wondering what
> would happen in a legitimate failure scenario; what if a disk failure
> were followed by a host failure? Apologies if this is something that's
> been observed already; I've seen mentions of the same symptom, but
> seemingly for causes other than OSD removal.
>
> Thank you in advance,
> Chris
