I’ve seen that only once, and I noticed there’s a bug fixed in 10.2.10 (http://tracker.ceph.com/issues/20041). Yes, I use snapshots.

As far as I can see, in my case the PG had been scrubbing for 20 days, but I only keep 7 days of logs, so I’m not able to identify the affected PG.
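For the next time it happens, a quick check like the one below should point at the blocked OSD straight away. It’s only a rough sketch I haven’t run against this cluster: it assumes the default cluster name and the default admin socket location (/var/run/ceph) on each OSD node, and it simply greps for the "waiting for scrub" event visible in the dump_blocked_ops output Peter posted below:

    #!/bin/bash
    # On each OSD node: flag OSDs that have blocked ops stuck "waiting for scrub".
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        osd=$(basename "$sock" .asok | sed 's/^ceph-//')    # e.g. osd.12
        n=$(ceph daemon "$osd" dump_blocked_ops 2>/dev/null | grep -c '"waiting for scrub"')
        [ "$n" -gt 0 ] && echo "$osd: $n blocked op(s) waiting for scrub"
    done

    # From a monitor/admin node: list PGs currently scrubbing or deep-scrubbing.
    ceph pg dump 2>/dev/null | grep scrubbing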
> On 10 Nov 2017, at 14:05, Peter Maloney <peter.malo...@brockmann-consult.de> wrote:
>
> I have often seen a problem where a single OSD in an eternal deep scrub
> will hang any client trying to connect. Stopping or restarting that
> single OSD fixes the problem.
>
> Do you use snapshots?
>
> Here's what the scrub bug looks like (where that many seconds is 14 hours):
>
>> ceph daemon "osd.$osd_number" dump_blocked_ops
>>
>> {
>>     "description": "osd_op(client.6480719.0:2000419292 4.a27969ae rbd_data.46820b238e1f29.000000000000aa70 [set-alloc-hint object_size 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0] ack+ondisk+write+known_if_redirected e148441)",
>>     "initiated_at": "2017-09-12 20:04:27.987814",
>>     "age": 49315.666393,
>>     "duration": 49315.668515,
>>     "type_data": [
>>         "delayed",
>>         {
>>             "client": "client.6480719",
>>             "tid": 2000419292
>>         },
>>         [
>>             {
>>                 "time": "2017-09-12 20:04:27.987814",
>>                 "event": "initiated"
>>             },
>>             {
>>                 "time": "2017-09-12 20:04:27.987862",
>>                 "event": "queued_for_pg"
>>             },
>>             {
>>                 "time": "2017-09-12 20:04:28.004142",
>>                 "event": "reached_pg"
>>             },
>>             {
>>                 "time": "2017-09-12 20:04:28.004219",
>>                 "event": "waiting for scrub"
>>             }
>>         ]
>>     ]
>> }
>
> On 11/09/17 17:20, Matteo Dacrema wrote:
>> Update: I noticed that there was a PG that remained scrubbing from the
>> first day I found the issue until I rebooted the node and the problem
>> disappeared.
>> Can this cause the behaviour I described before?
>>
>>> On 9 Nov 2017, at 15:55, Matteo Dacrema <mdacr...@enter.eu> wrote:
>>>
>>> Hi all,
>>>
>>> I’ve experienced a strange issue with my cluster.
>>> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journals each,
>>> plus 4 SSD nodes with 5 SSDs each.
>>> All the nodes are behind 3 monitors and 2 different crush maps.
>>> The whole cluster is on 10.2.7.
>>>
>>> About 20 days ago I started to notice that long backups hang with "task
>>> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
>>> A few days ago another VM started to show high iowait without doing any
>>> iops, also on the HDD crush map.
>>>
>>> Today about a hundred VMs weren’t able to read/write from many volumes,
>>> all of them on the HDD crush map. Ceph health was OK and no significant
>>> log entries were found.
>>> Not all the VMs experienced this problem, and in the meantime the iops on
>>> the journals and HDDs were very low, even though I was able to do
>>> significant iops on the working VMs.
>>>
>>> After two hours of debugging I decided to reboot one of the OSD nodes and
>>> the cluster started to respond again. Now the OSD node is back in the
>>> cluster and the problem has disappeared.
>>>
>>> Can someone help me understand what happened?
>>> I see strange entries in the log files like:
>>>
>>> accept replacing existing (lossy) channel (new one lossy=1)
>>> fault with nothing to send, going to standby
>>> leveldb manual compact
>>>
>>> I can share all the logs that can help to identify the issue.
>>>
>>> Thank you.
>>> Regards,
>>>
>>> Matteo
>
> --
>
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.malo...@brockmann-consult.de
> Internet: http://www.brockmann-consult.de
> --------------------------------------------
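P.S.: once the stuck OSD is found, restarting only that daemon, as Peter suggests, rather than rebooting the whole node should be enough. A minimal sketch of what I’d try, assuming a systemd-based install (the OSD id 12 below is just a placeholder):

    # Restart only the OSD stuck in the eternal (deep-)scrub
    systemctl restart ceph-osd@12

    # On upstart-based nodes (e.g. Ubuntu 14.04) the equivalent should be:
    restart ceph-osd id=12

    # Optionally pause further scrubbing while investigating...
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ...and re-enable it once things look healthy again:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub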
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com