I'm sorry, I wouldn't know; I'm on Jewel. Is your cluster HEALTH_OK now?

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Sun, May 13, 2018 at 6:29 AM Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>
> In luminous
> osd_recovery_threads = osd_disk_threads ?
> osd_recovery_sleep = osd_recovery_sleep_hdd ?
>
> Or is speeding up recovery a lot different in luminous?
>
> [@~]# ceph daemon osd.0 config show | grep osd | grep thread
>     "osd_command_thread_suicide_timeout": "900",
>     "osd_command_thread_timeout": "600",
>     "osd_disk_thread_ioprio_class": "",
>     "osd_disk_thread_ioprio_priority": "-1",
>     "osd_disk_threads": "1",
>     "osd_op_num_threads_per_shard": "0",
>     "osd_op_num_threads_per_shard_hdd": "1",
>     "osd_op_num_threads_per_shard_ssd": "2",
>     "osd_op_thread_suicide_timeout": "150",
>     "osd_op_thread_timeout": "15",
>     "osd_peering_wq_threads": "2",
>     "osd_recovery_thread_suicide_timeout": "300",
>     "osd_recovery_thread_timeout": "30",
>     "osd_remove_thread_suicide_timeout": "36000",
>     "osd_remove_thread_timeout": "3600",
>
> -----Original Message-----
> From: Webert de Souza Lima [mailto:webert.b...@gmail.com]
> Sent: vrijdag 11 mei 2018 20:34
> To: ceph-users
> Subject: Re: [ceph-users] Node crash, filesytem not usable
>
> This message seems to be very concerning:
>
>     mds0: Metadata damage detected
>
> But for the rest, the cluster still seems to be recovering. You could
> try to speed things up with ceph tell, like:
>
>     ceph tell osd.* injectargs --osd_max_backfills=10
>     ceph tell osd.* injectargs --osd_recovery_sleep=0.0
>     ceph tell osd.* injectargs --osd_recovery_threads=2
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
> On Fri, May 11, 2018 at 3:06 PM Daniel Davidson <dani...@igb.illinois.edu> wrote:
>
> > Below is the information you were asking for. I think they are
> > size=2, min_size=1.
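For what it's worth, a quick way to see what a running Luminous OSD actually honours for the recovery options discussed above is to query them directly over the admin socket instead of grepping for "thread". This is only a sketch and assumes it is run on the host carrying osd.0 (any OSD id works):

  ceph daemon osd.0 config get osd_recovery_sleep
  ceph daemon osd.0 config get osd_recovery_sleep_hdd
  ceph daemon osd.0 config get osd_recovery_sleep_ssd
  ceph daemon osd.0 config show | grep -E 'recovery_sleep|recovery_max_active|max_backfills'

The _hdd/_ssd variants are the per-device-class defaults in Luminous, so checking all of them gives a fuller picture than the plain osd_recovery_sleep value alone.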
> > Dan
> >
> > # ceph status
> >     cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
> >      health HEALTH_ERR
> >             140 pgs are stuck inactive for more than 300 seconds
> >             64 pgs backfill_wait
> >             76 pgs backfilling
> >             140 pgs degraded
> >             140 pgs stuck degraded
> >             140 pgs stuck inactive
> >             140 pgs stuck unclean
> >             140 pgs stuck undersized
> >             140 pgs undersized
> >             210 requests are blocked > 32 sec
> >             recovery 38725029/695508092 objects degraded (5.568%)
> >             recovery 10844554/695508092 objects misplaced (1.559%)
> >             mds0: Metadata damage detected
> >             mds0: Behind on trimming (71/30)
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
> >             election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
> >       fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
> >      osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> >             flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
> >       pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> >             1444 TB used, 1011 TB / 2455 TB avail
> >             38725029/695508092 objects degraded (5.568%)
> >             10844554/695508092 objects misplaced (1.559%)
> >                 1396 active+clean
> >                   76 undersized+degraded+remapped+backfilling+peered
> >                   64 undersized+degraded+remapped+wait_backfill+peered
> >   recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
> >
> > ID  WEIGHT     TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >  -1 2619.54541 root default
> >  -2  163.72159     host ceph-0
> >   0   81.86079         osd.0         up  1.00000          1.00000
> >   1   81.86079         osd.1         up  1.00000          1.00000
> >  -3  163.72159     host ceph-1
> >   2   81.86079         osd.2         up  1.00000          1.00000
> >   3   81.86079         osd.3         up  1.00000          1.00000
> >  -4  163.72159     host ceph-2
> >   8   81.86079         osd.8         up  1.00000          1.00000
> >   9   81.86079         osd.9         up  1.00000          1.00000
> >  -5  163.72159     host ceph-3
> >  10   81.86079         osd.10        up  1.00000          1.00000
> >  11   81.86079         osd.11        up  1.00000          1.00000
> >  -6  163.72159     host ceph-4
> >   4   81.86079         osd.4         up  1.00000          1.00000
> >   5   81.86079         osd.5         up  1.00000          1.00000
> >  -7  163.72159     host ceph-5
> >   6   81.86079         osd.6         up  1.00000          1.00000
> >   7   81.86079         osd.7         up  1.00000          1.00000
> >  -8  163.72159     host ceph-6
> >  12   81.86079         osd.12        up  0.79999          1.00000
> >  13   81.86079         osd.13        up  1.00000          1.00000
> >  -9  163.72159     host ceph-7
> >  14   81.86079         osd.14        up  1.00000          1.00000
> >  15   81.86079         osd.15        up  1.00000          1.00000
> > -10  163.72159     host ceph-8
> >  16   81.86079         osd.16        up  1.00000          1.00000
> >  17   81.86079         osd.17        up  1.00000          1.00000
> > -11  163.72159     host ceph-9
> >  18   81.86079         osd.18        up  1.00000          1.00000
> >  19   81.86079         osd.19        up  1.00000          1.00000
> > -12  163.72159     host ceph-10
> >  20   81.86079         osd.20        up  1.00000          1.00000
> >  21   81.86079         osd.21        up  1.00000          1.00000
> > -13  163.72159     host ceph-11
> >  22   81.86079         osd.22        up  1.00000          1.00000
> >  23   81.86079         osd.23        up  1.00000          1.00000
> > -14  163.72159     host ceph-12
> >  24   81.86079         osd.24        up  1.00000          1.00000
> >  25   81.86079         osd.25        up  1.00000          1.00000
> > -15  163.72159     host ceph-13
> >  26   81.86079         osd.26      down        0          1.00000
> >  27   81.86079         osd.27      down        0          1.00000
> > -16  163.72159     host ceph-14
> >  28   81.86079         osd.28        up  1.00000          1.00000
> >  29   81.86079         osd.29        up  1.00000          1.00000
> > -17  163.72159     host ceph-15
> >  30   81.86079         osd.30        up  1.00000          1.00000
> >  31   81.86079         osd.31        up  1.00000          1.00000
> >
> > On 05/11/2018 11:56 AM, David Turner wrote:
> > > What are some outputs of commands to show us the state of your
> > > cluster? Most notable is `ceph status`, but `ceph osd tree` would be
> > > helpful. What are the sizes of the pools in your cluster? Are they all
> > > size=3, min_size=2?
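As a side note, the size/min_size question can be answered straight from the cluster rather than from memory. A minimal check, with <pool-name> as a placeholder for the CephFS data and metadata pools:

  ceph osd pool ls detail              # one line per pool, including "replicated size" and "min_size"
  ceph osd pool get <pool-name> size
  ceph osd pool get <pool-name> min_size

PGs that are peered but not active, as in the status output above, generally mean the PG cannot meet min_size with the OSDs currently up, which is why the exact pool settings matter here.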
> > > On Fri, May 11, 2018 at 12:05 PM Daniel Davidson <dani...@igb.illinois.edu> wrote:
> > >
> > > > Hello,
> > > >
> > > > Today we had a node crash, and looking at it, it seems there is a
> > > > problem with the RAID controller, so it is not coming back up, maybe
> > > > ever. It corrupted the local filesystem for the ceph storage there.
> > > >
> > > > The remainder of our storage (10.2.10) cluster is running, and it
> > > > looks to be repairing, and our min_size is set to 2. I would normally
> > > > expect the system to keep running from an end-user perspective when
> > > > this happens, but the system is down. All mounts that were up when
> > > > this started look to be stale, and new mounts give the following
> > > > error:
> > > >
> > > > # mount -t ceph ceph-0:/ /test/ -o name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes
> > > > mount error 5 = Input/output error
> > > >
> > > > Any suggestions?
> > > >
> > > > Dan
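In case it helps anyone hitting the same symptoms, a rough triage sketch; the MDS daemon name ceph-0 is taken from the fsmap above, and the damage listing has to be run on the node hosting the active MDS:

  ceph health detail                # spells out which PGs are stuck inactive/unclean and why
  ceph pg dump_stuck inactive       # inactive PGs are what block client I/O, hence the stale mounts
  ceph daemon mds.ceph-0 damage ls  # details behind the "mds0: Metadata damage detected" warning

None of these change cluster state; they only report, so they should be safe to run while recovery is still in progress.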