I've seen that only once, and I noticed that there's a bug fixed in 10.2.10
(http://tracker.ceph.com/issues/20041).
Yes, I use snapshots.

As far as I can see, in my case the PG had been scrubbing for 20 days, but I only
keep 7 days of logs, so I'm not able to identify the affected PG.
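
For the next time it happens, something along these lines should show any PG stuck
in a scrubbing state together with its acting primary, without digging through the
logs (just a rough sketch; the column layout of "ceph pg dump pgs_brief" can differ
between releases):

    # list PGs whose state contains "scrubbing" and the OSD acting as primary
    ceph pg dump pgs_brief 2>/dev/null \
        | awk '$2 ~ /scrubbing/ {print "pg="$1, "state="$2, "primary=osd."$NF}'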



> On 10 Nov 2017, at 14:05, Peter Maloney
> <peter.malo...@brockmann-consult.de> wrote:
> 
> I have often seen a problem where a single OSD stuck in an eternal deep scrub
> will hang any client trying to connect. Stopping or restarting that
> single OSD fixes the problem.
> 
> Do you use snapshots?
> 
> Here's what the scrub bug looks like (the ~49,000-second "age" below works out
> to about 14 hours):
> 
>> ceph daemon "osd.$osd_number" dump_blocked_ops
> 
>>      {
>>          "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
>> rbd_data.46820b238e1f29.000000000000aa70 [set-alloc-hint object_size
>> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
>> ack+ondisk+write+known_if_redirected e148441)",
>>          "initiated_at": "2017-09-12 20:04:27.987814",
>>          "age": 49315.666393,
>>          "duration": 49315.668515,
>>          "type_data": [
>>              "delayed",
>>              {
>>                  "client": "client.6480719",
>>                  "tid": 2000419292
>>              },
>>              [
>>                  {
>>                      "time": "2017-09-12 20:04:27.987814",
>>                      "event": "initiated"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:27.987862",
>>                      "event": "queued_for_pg"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:28.004142",
>>                      "event": "reached_pg"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:28.004219",
>>                      "event": "waiting for scrub"
>>                  }
>>              ]
>>          ]
>>      }
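
(For the record, a quick loop over one node's admin sockets, along the lines of the
sketch below, should reveal which OSD is holding blocked ops; it assumes the default
socket path under /var/run/ceph.)

    # check every OSD admin socket on this host for blocked ops
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "== $sock =="
        ceph --admin-daemon "$sock" dump_blocked_ops
    done

If one of them shows ops stuck in "waiting for scrub", restarting just that OSD
(e.g. "systemctl restart ceph-osd@<id>" on systemd hosts) is what Peter describes
as clearing the hang.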
> 
> On 11/09/17 17:20, Matteo Dacrema wrote:
>> Update: I noticed that there was a PG that remained in scrubbing from the
>> first day I found the issue until I rebooted the node and the problem
>> disappeared.
>> Could this cause the behaviour I described before?
>> 
>> 
>>> On 9 Nov 2017, at 15:55, Matteo Dacrema <mdacr...@enter.eu>
>>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I’ve experienced a strange issue with my cluster.
>>> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives each,
>>> plus 4 SSD nodes with 5 SSDs each.
>>> All the nodes are behind 3 monitors and split across 2 different CRUSH maps.
>>> The whole cluster is on 10.2.7.
>>> 
>>> About 20 days ago I started to notice that long backups hang with "task
>>> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD CRUSH map.
>>> A few days ago another VM started to show high iowait without doing any iops,
>>> also on the HDD CRUSH map.
>>> 
>>> Today about a hundred VMs weren't able to read/write from many volumes, all
>>> of them on the HDD CRUSH map. Ceph health was OK and no significant log
>>> entries were found.
>>> Not all the VMs experienced this problem, and in the meantime the iops on the
>>> journals and HDDs were very low even though I was able to do significant iops
>>> on the working VMs.
>>> 
>>> After two hours of debugging I decided to reboot one of the OSD nodes and the
>>> cluster started to respond again. Now the OSD node is back in the cluster and
>>> the problem has disappeared.
>>> 
>>> Can someone help me to understand what happened?
>>> I see strange entries in the log files like:
>>> 
>>> accept replacing existing (lossy) channel (new one lossy=1)
>>> fault with nothing to send, going to standby
>>> leveldb manual compact 
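
(Side note: a quick way to check whether those messenger entries cluster on a
single daemon is something like the sketch below; the log paths are the defaults.)

    # count the "lossy channel" messages per OSD log, highest first
    grep -c 'replacing existing (lossy) channel' /var/log/ceph/ceph-osd.*.log \
        | sort -t: -k2 -rn | head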
>>> 
>>> I can share any logs that might help to identify the issue.
>>> 
>>> Thank you.
>>> Regards,
>>> 
>>> Matteo
>>> 
>>> 
> 
> 
> -- 
> 
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.malo...@brockmann-consult.de
> Internet: http://www.brockmann-consult.de
> --------------------------------------------
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
