Thanks for your feedback! I increased debug_ms to 1/5.
This is another slow request (full output from 'ceph daemon osd.5 dump_historic_ops' for this event is attached):

{
    "description": "osd_op(client.171725953.0:404377591 8.9b 8:d90adab6:::rbd_data.c47f3c390c8495.000000000001934a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1413120~122880] snapc 0=[] ondisk+write+known_if_redirected e1205833)",
    "initiated_at": "2019-02-09 07:29:34.404655",
    "age": 387.914193,
    "duration": 340.224154,
    "type_data": {
        "flag_point": "commit sent; apply or cleanup",
        "client_info": {
            "client": "client.171725953",
            "client_addr": "192.168.61.66:0/4056439540",
            "tid": 404377591
        },
        "events": [
            { "time": "2019-02-09 07:29:34.404655", "event": "initiated" },
            ....
            ....
            { "time": "2019-02-09 07:29:34.416752", "event": "op_applied" },
            { "time": "2019-02-09 07:29:34.417200", "event": "sub_op_commit_rec from 14" },
            { "time": "2019-02-09 07:35:14.628724", "event": "sub_op_commit_rec from 29" },

osd.5 has IP 192.168.222.202
osd.14 has IP 192.168.222.203
osd.29 has IP 192.168.222.204

Grepping for that client id in the osd.5 log, as far as I can understand (please correct me if my debugging is completely wrong), the requests to 14 and 29 are sent at 07:29:34:

2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 --> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735 8:d90adab6:::rbd_data.c47f3c390c8495.000000000001934a:head v 1205833'4767322) v2 -- 0x56307bb61e00 con 0
2019-02-09 07:29:34.415868 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 -- osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735 8:d90adab6:::rbd_data.c47f3c390c8495.000000000001934a:head v 1205833'4767322) v2 -- 0x56308ca42400 con 0

The answer from 14 arrives immediately:

2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436 <== osd.14 192.168.222.203:6811/158495 11242 ==== osd_repop_reply(client.171725953.0:404377591 8.9b e1205833/1205735) v2 ==== 111+0+0 (634943494 0 0) 0x563090642780 con 0x56308bbd0000

while the one from 29 arrives only at 07:35:

2019-02-09 07:35:14.628614 7f5fcee0e700  1 -- 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520 1952 ==== osd_repop_reply(client.171725953.0:404377591 8.9b e1205833/1205735) v2 ==== 111+0+0 (3804866849 0 0) 0x56308f3f2a00 con 0x56308bf61000

In the osd.29 log it looks like the request only arrives at 07:35 (and osd.29 promptly replies):

2019-02-09 07:35:14.627462 7f99972cc700  1 -- 192.168.222.204:6804/4159520 <== osd.5 192.168.222.202:6816/157436 2527 ==== osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735) v2 ==== 1050+0+123635 (1225076790 0 171428115) 0x5610f5128a00 con 0x5610fc5bf000
2019-02-09 07:35:14.628343 7f998d6d4700  1 -- 192.168.222.204:6804/4159520 --> 192.168.222.202:6816/157436 -- osd_repop_reply(client.171725953.0:404377591 8.9b e1205833/1205735 ondisk, result = 0) v2 -- 0x5610f4a51180 con 0

Network problems?

Full logs for the 3 relevant OSDs (just for that time period) are at:
https://drive.google.com/drive/folders/1TG5MomMJsqVbsuFosvYokNptLufxOnPY?usp=sharing

Thanks again!

Cheers, Massimo

On Fri, Feb 8, 2019 at 11:50 PM Brad Hubbard <bhubb...@redhat.com> wrote:
> Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
> to start with.
>
> On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
> <massimo.sgarava...@gmail.com> wrote:
> >
> > Our Luminous ceph cluster had been working without problems for a while,
> > but in the last days we have been suffering from continuous slow requests.
> >
> > We have indeed done some changes in the infrastructure recently:
> >
> > - Moved OSD nodes to a new switch
> > - Increased pg nums for a pool, to have about ~100 PGs/OSD (also
> >   because we have to install new OSDs in the cluster). The output of
> >   'ceph osd df' is attached.
> >
> > The problem could also be due to some 'bad' client, but in the log I
> > don't see a clear "correlation" with specific clients or images for such
> > blocked requests.
> >
> > I also tried to update to the latest Luminous release and the latest CentOS 7,
> > but this didn't help.
> >
> > Attached you can find the detail of one of such slow operations, which
> > took about 266 secs (output from 'ceph daemon osd.11 dump_historic_ops').
> > As far as I can understand from these events:
> >
> > {
> >     "time": "2019-02-08 10:26:25.651728",
> >     "event": "op_commit"
> > },
> > {
> >     "time": "2019-02-08 10:26:25.651965",
> >     "event": "op_applied"
> > },
> > {
> >     "time": "2019-02-08 10:26:25.653236",
> >     "event": "sub_op_commit_rec from 33"
> > },
> > {
> >     "time": "2019-02-08 10:30:51.890404",
> >     "event": "sub_op_commit_rec from 23"
> > },
> >
> > the problem seems to be with the "sub_op_commit_rec from 23" event, which
> > took too long. So the problem is that the answer from OSD 23 took too much time?
> >
> > In the logs of the 2 OSDs (11 and 23) in that time frame (attached) I
> > can't find anything useful.
> > When the problem happened, the load and memory usage were not high
> > on the relevant nodes.
> >
> > Any help to debug the issue is really appreciated! :-)
> >
> > Thanks, Massimo
>
> --
> Cheers,
> Brad
>
{
    "description": "osd_op(client.171725953.0:404377591 8.9b 8:d90adab6:::rbd_data.c47f3c390c8495.000000000001934a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1413120~122880] snapc 0=[] ondisk+write+known_if_redirected e1205833)",
    "initiated_at": "2019-02-09 07:29:34.404655",
    "age": 387.914193,
    "duration": 340.224154,
    "type_data": {
        "flag_point": "commit sent; apply or cleanup",
        "client_info": {
            "client": "client.171725953",
            "client_addr": "192.168.61.66:0/4056439540",
            "tid": 404377591
        },
        "events": [
            { "time": "2019-02-09 07:29:34.404655", "event": "initiated" },
            { "time": "2019-02-09 07:29:34.415559", "event": "queued_for_pg" },
            { "time": "2019-02-09 07:29:34.415581", "event": "reached_pg" },
            { "time": "2019-02-09 07:29:34.415665", "event": "started" },
            { "time": "2019-02-09 07:29:34.415770", "event": "waiting for subops from 14,29" },
            { "time": "2019-02-09 07:29:34.415984", "event": "commit_queued_for_journal_write" },
            { "time": "2019-02-09 07:29:34.415984", "event": "commit_queued_for_journal_write" },
            { "time": "2019-02-09 07:29:34.415997", "event": "write_thread_in_journal_buffer" },
            { "time": "2019-02-09 07:29:34.416374", "event": "journaled_completion_queued" },
            { "time": "2019-02-09 07:29:34.416405", "event": "op_commit" },
            { "time": "2019-02-09 07:29:34.416752", "event": "op_applied" },
            { "time": "2019-02-09 07:29:34.417200", "event": "sub_op_commit_rec from 14" },
            { "time": "2019-02-09 07:35:14.628724", "event": "sub_op_commit_rec from 29" },
            { "time": "2019-02-09 07:35:14.628786", "event": "commit_sent" },
            { "time": "2019-02-09 07:35:14.628809", "event": "done" }
        ]
    }
},
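For anyone chasing similar events: rather than eyeballing timestamps, a small script along these lines (a sketch, assuming only the event layout shown above) can flag the largest gap between consecutive events of an op:

```python
import json
from datetime import datetime

def largest_event_gap(op):
    """Return (seconds, from_event, to_event) for the biggest gap between
    consecutive events of one op from 'dump_historic_ops' output."""
    events = op["type_data"]["events"]
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    times = [datetime.strptime(e["time"], fmt) for e in events]
    # Pair each event with its successor and measure the elapsed time.
    gaps = [
        ((t2 - t1).total_seconds(), e1["event"], e2["event"])
        for t1, t2, e1, e2 in zip(times, times[1:], events, events[1:])
    ]
    return max(gaps)

# Minimal example using the two relevant timestamps from the dump above:
op = {"type_data": {"events": [
    {"time": "2019-02-09 07:29:34.417200", "event": "sub_op_commit_rec from 14"},
    {"time": "2019-02-09 07:35:14.628724", "event": "sub_op_commit_rec from 29"},
]}}
secs, frm, to = largest_event_gap(op)
print(f"{secs:.1f}s between '{frm}' and '{to}'")  # ~340.2s
```

Run over every entry in the ops list of the full dump, this immediately singles out the 340s stall between the two sub_op_commit_rec events, i.e. the wait for osd.29.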
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com