Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

Chris Murray Fri, 20 Mar 2015 16:04:31 -0700

Ah, I was wondering myself if compression could be causing an issue, but I'm 
reconsidering now. My latest experiment should hopefully help troubleshoot.


So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try 
that:

find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
btrfs filesystem defragment -v -czlib -- {} +

After much, much waiting, all files have been rewritten, but the OSD still gets 
stuck at the same point.

I've now unset the compress attribute on all files and started the defragment 
process again, but I'm not too hopeful since the files must be 
readable/writeable if I didn't get some failure during the defrag process.

find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
chattr -c -- {} +
find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
btrfs filesystem defragment -v -- {} +

(latter command still running)

Any other ideas at all? In the absence of the problem being spelled out to me 
with an error of some sort, I'm not sure how to troubleshoot further.
Is it safe to upgrade a problematic cluster, when the time comes, in case this 
ultimately is a CEPH bug which is fixed in something later than 0.80.9?

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 18 March 2015 14:01
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray <chrismurra...@gmail.com> wrote:
> Hi again Greg :-)
>
> No, it doesn't seem to progress past that point. I started the OSD again a 
> couple of nights ago:
>
> 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 
> 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 
> 0, op_seq now 13288339
> 2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 
> 3951706112 : seq 13288340 1755 bytes
> 2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: 
> applying op seq 13288340
> 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 
> 13288340 open_ops 0 -> 1
> 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 
> filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
> 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
> meta/16ef7597/infos/head//-1
> 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 
> 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-16 21:34:48.096002 7fe49a5ac700 20 
> filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
> 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start 
> max_applied_seq 13288339, open_ops 1
> 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start 
> waiting for 1 open ops to drain
>
> There's the success line for 13288339, like you mentioned. But not one for 
> 13288340.
>
> Intriguing. So, those same 1755 bytes seem problematic every time the journal 
> is replayed? Interestingly, there is a lot (in time, not exactly data mass or 
> IOPs, but still more than 1755 bytes!) of activity while the log is at this 
> line:
>
> 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start 
> waiting for 1 open ops to drain
>
> ... but then the IO ceases and the log still doesn't go any further. I wonder 
> why 13288339 doesn't have that same  'waiting for ... open ops to drain' 
> line. Or the 'woke after' one for that matter.
>
> While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:
>
> sdb              20.00         0.00      3404.00          0       3404
> sdb              16.00         0.00      2100.00          0       2100
> sdb              10.00         0.00      1148.00          0       1148
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               1.00         0.00       496.00          0        496
> sdb              32.00         0.00      4940.00          0       4940
> sdb               8.00         0.00      1144.00          0       1144
> sdb               1.00         0.00         4.00          0          4
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb              17.00         0.00      3340.00          0       3340
> sdb              23.00         0.00      3368.00          0       3368
> sdb               1.00         0.00         4.00          0          4
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb              13.00         0.00      3332.00          0       3332
> sdb              18.00         0.00      2360.00          0       2360
> sdb              59.00         0.00      7464.00          0       7464
> sdb               0.00         0.00         0.00          0          0
>
> I was hoping Google may have held some clues, but it seems I'm the 
> only one :-)
>
> https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting
> +for%22+%22open+ops+to+drain%22
>
> I tried removing "compress-force=lzo" from "osd mount options btrfs" in 
> ceph.conf, in case it was the compression algorithm which was 'blocking' the 
> writes, but the same seems to be observed. Last entry in log, activity, 
> activity stops, log doesn't progress further.

Ooooohhhhhh.....I'm not terribly familiar with the code that's stuck, but last 
time it came up btrfs compression was known to be broken under the Ceph 
workload. :( I think it tended to crash rather than hang like this so I'm a bit 
surprised, but if this op is touching a "broken" file or something that could 
explain it.

> On another note, I'm not getting my own e-mails back to myself from the list 
> anymore. In case I should be troubleshooting this side, is/isn't this 
> happening to others?

Uh, no idea. I think that's a config option you can set on mailman?
But I'm using Gmail so this has always been hidden from me.
-Greg

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2015.0.5751 / Virus Database: 4306/9314 - Release Date: 03/16/15
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

Reply via email to