Thanks for your feedback !

I increased debug_ms to 1/5.

This is another slow request (full output from 'ceph daemon osd.5
dump_historic_ops' for this event is attached):


        {
            "description": "osd_op(client.171725953.0:404377591 8.9b
8:d90adab6:
::rbd_data.c47f3c390c8495.000000000001934a:head [set-alloc-hint object_size
4194
304 write_size 4194304,write 1413120~122880] snapc 0=[]
ondisk+write+known_if_re
directed e1205833)",
            "initiated_at": "2019-02-09 07:29:34.404655",
            "age": 387.914193,
            "duration": 340.224154,
            "type_data": {
                "flag_point": "commit sent; apply or cleanup",
                "client_info": {
                    "client": "client.171725953",
                    "client_addr": "192.168.61.66:0/4056439540",
                    "tid": 404377591
                },
                "events": [
                    {
                        "time": "2019-02-09 07:29:34.404655",
                        "event": "initiated"
                    },
....
....
                   {
                        "time": "2019-02-09 07:29:34.416752",
                        "event": "op_applied"
                    },
                    {
                        "time": "2019-02-09 07:29:34.417200",
                        "event": "sub_op_commit_rec from 14"
                    },
                    {
                        "time": "2019-02-09 07:35:14.628724",
                        "event": "sub_op_commit_rec from 29"
                    },

osd.5 has IP 192.168.222.202
osd.14 has IP 192.168.222.203
osd.29 has IP 192.168.222.204


Grepping using that client id from osd.5 log as far as I can understand
(please correct me if my debugging is completely wrong) the request to 5
and 14 is sent at  07:29:34:

2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436
--> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591
8.9b e1205833/1205735 8:d90ada\
b6:::rbd_data.c47f3c390c8495.000000000001934a:head v 1205833'4767322) v2 --
0x56307bb61e00 con 0
2019-02-09 07:29:34.415868 7f5faf4b4700  1 -- 192.168.222.202:6816/157436
--> 192.168.222.204:6804/4159520 -- osd_repop(client.171725953.0:404377591
8.9b e1205833/1205735 8:d90ad\
ab6:::rbd_data.c47f3c390c8495.000000000001934a:head v 1205833'4767322) v2
-- 0x56308ca42400 con 0
2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436
<== osd.14 192.168.222.203:6811/158495 11242 ====
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2 ==== 111+0+0 (634943494 0 0) 0x563090642780 con
0x56308bbd0000

The answer from 14 arrives immediately:

2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436
<== osd.14 192.168.222.203:6811/158495 11242 ====
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2 ==== 111+0+0 (634943494 0 0) 0x563090642780 con
0x56308bbd0000

while the one from 29 arrives only at 7.35:

2019-02-09 07:35:14.628614 7f5fcee0e700  1 -- 192.168.222.202:6816/157436
<== osd.29 192.168.222.204:6804/4159520 1952 ====
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2 ==== 111+0+0 (3804866849 0 0) 0x56308f3f2a00 con
0x56308bf61000


In osd.29 log it looks like the request only arrives at 07.35 (and it
promptly replies):

2019-02-09 07:35:14.627462 7f99972cc700  1 -- 192.168.222.204:6804/4159520
<== osd.5 192.168.222.202:6816/157436 2527 ====
osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735) v2 ====
1050+0+123635 (1225076790 0 171428115) 0x5610f5128a00 con 0x5610fc5bf000
2019-02-09 07:35:14.628343 7f998d6d4700  1 -- 192.168.222.204:6804/4159520
--> 192.168.222.202:6816/157436 --
osd_repop_reply(client.171725953.0:404377591 8.9b e1205833/1205735 ondisk,
result = 0) v2 -- 0x5610f4a51180 con 0


Network problems ?


Full logs for the 3 relevant OSDs (just for that time period) is at:
https://drive.google.com/drive/folders/1TG5MomMJsqVbsuFosvYokNptLufxOnPY?usp=sharing



Thanks again !
Cheers, Massimo



On Fri, Feb 8, 2019 at 11:50 PM Brad Hubbard <bhubb...@redhat.com> wrote:

> Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
> to start with.
>
> On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
> <massimo.sgarava...@gmail.com> wrote:
> >
> > Our Luminous ceph cluster have been worked without problems for a while,
> but in the last days we have been suffering from continuous slow requests.
> >
> > We have indeed done some changes in the infrastructure recently:
> >
> > - Moved OSD nodes to a new switch
> > - Increased pg nums for a pool, to have about ~ 100 PGs/OSD (also
> because  we have to install new OSDs in the cluster). The output of 'ceph
> osd df' is attached.
> >
> > The problem could also be due to some ''bad' client, but in the log I
> don't see a clear "correlation" with specific clients or images for such
> blocked requests.
> >
> > I also tried to update to latest luminous release and latest CentOS7,
> but this didn't help.
> >
> >
> >
> > Attached you can find the detail of one of such slow operations which
> took about 266 secs (output from 'ceph daemon osd.11 dump_historic_ops').
> > So as far as I can understand from these events:
> >                     {
> >                         "time": "2019-02-08 10:26:25.651728",
> >                         "event": "op_commit"
> >                     },
> >                     {
> >                         "time": "2019-02-08 10:26:25.651965",
> >                         "event": "op_applied"
> >                     },
> >
> >                   {
> >                         "time": "2019-02-08 10:26:25.653236",
> >                         "event": "sub_op_commit_rec from 33"
> >                     },
> >                     {
> >                         "time": "2019-02-08 10:30:51.890404",
> >                         "event": "sub_op_commit_rec from 23"
> >                     },
> >
> > the problem seems with the  "sub_op_commit_rec from 23" event which took
> too much.
> > So the problem is that the answer from OSD 23 took to much ?
> >
> >
> > In the logs of the 2 OSD (11 and 23)in that time frame (attached) I
> can't find anything useful.
> > When the problem happened the load and the usage of memory was not high
> in the relevant nodes.
> >
> >
> > Any help to debug the issue is really appreciated ! :-)
> >
> > Thanks, Massimo
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
>
        {
            "description": "osd_op(client.171725953.0:404377591 8.9b 
8:d90adab6:::rbd_data.c47f3c390c8495.000000000001934a:head [set-alloc-hint 
object_size 4194304 write_size 4194304,write 1413120~122880] snapc 0=[] 
ondisk+write+known_if_redirected e1205833)",
            "initiated_at": "2019-02-09 07:29:34.404655",
            "age": 387.914193,
            "duration": 340.224154,
            "type_data": {
                "flag_point": "commit sent; apply or cleanup",
                "client_info": {
                    "client": "client.171725953",
                    "client_addr": "192.168.61.66:0/4056439540",
                    "tid": 404377591
                },
                "events": [
                    {
                        "time": "2019-02-09 07:29:34.404655",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415559",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415581",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415665",
                        "event": "started"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415770",
                        "event": "waiting for subops from 14,29"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415984",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415984",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2019-02-09 07:29:34.415997",
                        "event": "write_thread_in_journal_buffer"
                    },
                    {
                        "time": "2019-02-09 07:29:34.416374",
                        "event": "journaled_completion_queued"
                    },
                    {
                        "time": "2019-02-09 07:29:34.416405",
                        "event": "op_commit"
                    },
                    {
                        "time": "2019-02-09 07:29:34.416752",
                        "event": "op_applied"
                    },
                    {
                        "time": "2019-02-09 07:29:34.417200",
                        "event": "sub_op_commit_rec from 14"
                    },
                    {
                        "time": "2019-02-09 07:35:14.628724",
                        "event": "sub_op_commit_rec from 29"
                    },
                    {
                        "time": "2019-02-09 07:35:14.628786",
                        "event": "commit_sent"
                    },
                    {
                        "time": "2019-02-09 07:35:14.628809",
                        "event": "done"
                    }
                ]
            }
        },

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to