Peter Eisch
Senior Site Reliability Engineer
On 3/13/20, 11:38 AM, "Wido den Hollander" <w...@42on.com> wrote:

    On 3/13/20 4:09 PM, Peter Eisch wrote:
    > Full cluster is 14.2.8.
    >
    > I had some OSDs drop overnight, which now leaves 4 inactive PGs. The
    > pools had three participating OSDs (2 ssd, 1 sas). In each pool at least
    > 1 ssd and 1 sas OSD is still working without issue. I've run 'ceph pg
    > repair <pg>' but it doesn't seem to make any changes.
    >
    > PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
    > pg 10.2e is incomplete, acting [59,67]
    > pg 10.c3 is incomplete, acting [62,105]
    > pg 10.f3 is incomplete, acting [62,59]
    > pg 10.1d5 is incomplete, acting [87,106]
    >
    > Using `ceph pg <pg> query` I can see, in each case, the OSDs involved,
    > including the ones which failed. Respectively they are:
    > pg 10.2e participants: 59, 68, 77, 143
    > pg 10.c3 participants: 60, 62, 85, 102, 105, 106
    > pg 10.f3 participants: 59, 64, 75, 107
    > pg 10.1d5 participants: 64, 77, 87, 106
    >
    > The OSDs which are now down/out, and which have been removed from the
    > crush map and had their auth removed, are:
    > 62, 64, 68
    >
    > Of course I now have lots of slow-request warnings from the OSDs
    > involved with the inactive PGs.
    >
    > How do I properly kick these PGs to have them drop their usage of the
    > OSDs which no longer exist?

    You don't, because those OSDs hold the data you need.

    Why did you remove them from the CRUSH map, OSD map, and auth? You need
    these to rebuild the PGs.

    Wido

The drives failed at a hardware level.  I've replaced OSDs this way in previous 
instances, both for planned migrations and after failures, without issue.  I 
didn't realize all the replicated copies were on just one drive in each pool.
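
If the failed devices had still been readable, I gather the missing PG copies 
could have been exported from the removed OSDs and imported into a surviving 
one with ceph-objectstore-tool before anything was purged.  Roughly like this, 
where the OSD ids, the PG and the file path are only examples:

  # with the old OSD stopped, export one PG from its data directory
  systemctl stop ceph-osd@62
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
    --pgid 10.c3 --op export --file /root/10.c3.export

  # with a surviving OSD that does not hold this PG stopped, import the copy
  systemctl stop ceph-osd@85
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-85 \
    --op import --file /root/10.c3.export
  systemctl start ceph-osd@85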

What should my actions have been in this case?
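
I assume the safer version would have been the standard out / wait / purge 
sequence, with the safe-to-destroy check being the step that matters here 
(osd.62 below is just an example id):

  # mark the OSD out and let the cluster move its data elsewhere
  ceph osd out 62
  # wait until Ceph confirms no PG data would be lost by removing it
  while ! ceph osd safe-to-destroy osd.62 ; do sleep 60 ; done
  # only then drop it from the crush map, OSD map and auth in one step
  ceph osd purge 62 --yes-i-really-mean-it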

  pool 10 'volumes' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570 lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
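
Given that, I take it the longer-term fix is to stop running this pool at 
size 2 / min_size 1, assuming there is capacity for a third replica:

  ceph osd pool set volumes size 3
  ceph osd pool set volumes min_size 2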

Crush rule 1:
rule ssd_by_host {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}
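
To double-check what the rule and the current map do with these PGs, I'm 
looking at something like:

  # the rule as stored in the cluster
  ceph osd crush rule dump ssd_by_host
  # up/acting set for one of the stuck PGs
  ceph pg map 10.2e
  # hosts and OSDs under the ssd device class
  ceph osd crush tree --show-shadow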

peter
