On Sat, May 12, 2018 at 3:22 PM Bryan Henderson <bry...@giraffe-data.com>
wrote:

> I recently had some requests blocked indefinitely; I eventually cleared it
> up by recycling the OSDs, but I'd like some help interpreting the log
> messages
> that supposedly give clue as to what caused the blockage:
>
> (I reformatted for easy email reading)
>
> 2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
>   cluster [WRN] 7 slow requests, 2 included below;
>   oldest blocked for > 961.596517 secs
>
> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>   cluster [WRN] slow request 961.557151 seconds old,
>   received at 2018-05-03 01:40:33.689191:
>     pg_query(4.f epoch 490) currently wait for new map
>
> 2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
>   cluster [WRN] slow request 961.556655 seconds old,
>   received at 2018-05-03 01:40:33.689686:
>     pg_query(1.d epoch 490) currently wait for new map
>
> 2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
>   cluster [WRN] 2 slow requests, 2 included below;
>   oldest blocked for > 960.677480 secs
>
> 2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
>   cluster [WRN] slow request 960.677480 seconds old,
>   received at 2018-05-03 01:40:31.238642:
>     osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
>       RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> 2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
>   cluster [WRN] slow request 960.663817 seconds old,
>   received at 2018-05-03 01:40:31.252305:
>     osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
>       RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> "wait for new map": what map would that be, and where is the OSD expecting
> it
> to come from?
>

The OSD is waiting for a new OSD map, which it will get from one of its
peers or from the monitor (by request). This tends to happen when a client
has seen a newer map epoch than the OSD has.
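
If you want to see whether an OSD really is behind on maps, something along
these lines should show it (run the daemon command on the OSD's own host,
since it goes through the admin socket; the exact fields vary a bit by
release):

    # current cluster osdmap epoch
    ceph osd dump | head -1

    # oldest/newest map epochs this particular OSD daemon has
    ceph daemon osd.0 status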


>
> "reached_pg"?
>

The request has been delivered into a queue for the PG to process, but it
hasn't been picked up and worked on yet. Nothing about the request itself is
blocking it here; some other kind of back pressure is at work. Either the PG
is busy with, or waiting on, another request that prevents it from picking
up new ones, or a throttler is keeping it from taking on new work, or there
is simply no CPU time available for some reason.
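
If it happens again, dumping the ops on the affected OSD (again via the
admin socket on its host) usually shows which stage each request is stuck in
and for how long; dump_blocked_ops may not exist on very old releases:

    # every op currently in flight on osd.1, with its event history
    ceph daemon osd.1 dump_ops_in_flight

    # only the ops that have been flagged as slow
    ceph daemon osd.1 dump_blocked_ops

    # recently completed ops with per-stage timings
    ceph daemon osd.1 dump_historic_ops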

With what you've shown here, it looks like either your cluster is
dramatically overloaded, or else something is going on with the MDS tables
(the mds0_inotable and mds_snaptable objects in the slow osd_ops above) that
is killing the OSD whenever it tries to access them. I think there were some
past issues with those tables if they grew too large in older releases?
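
If you want to rule out the table-size theory, you can check how big those
objects actually are. This is just a sketch; I'm assuming they live in your
CephFS metadata pool (pool id 2 in your log), so substitute the real pool
name:

    # find the name of pool id 2
    ceph osd lspools

    # size and mtime of the tables the blocked osd_ops were touching
    rados -p <metadata-pool> stat mds0_inotable
    rados -p <metadata-pool> stat mds_snaptable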
-Greg


>
> You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
> pair.
>
> Thanks.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
