Thanks, Dan.

On the first MON, the command doesn’t even return, but I was able to get a dump 
from the one I restarted most recently. The oldest ops look like this:

        {
            "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
            "initiated_at": "2021-02-25T20:40:34.698932+0100",
            "age": 183.762551121,
            "duration": 183.762599201,
            "type_data": {
                "events": [
                    {
                        "time": "2021-02-25T20:40:34.698932+0100",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-02-25T20:40:34.698636+0100",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-02-25T20:40:34.698932+0100",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701407+0100",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701455+0100",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701458+0100",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701459+0100",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701459+0100",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701490+0100",
                        "event": "logm:wait_for_readable"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701491+0100",
                        "event": "logm:wait_for_readable/paxos"
                    },
                    {
                        "time": "2021-02-25T20:40:34.701496+0100",
                        "event": "paxos:wait_for_readable"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989198+0100",
                        "event": "callback finished"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989199+0100",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989208+0100",
                        "event": "logm:preprocess_query"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989208+0100",
                        "event": "logm:preprocess_log"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989278+0100",
                        "event": "forward_request_leader"
                    },
                    {
                        "time": "2021-02-25T20:40:34.989344+0100",
                        "event": "forwarded"
                    },
                    {
                        "time": "2021-02-25T20:41:58.658022+0100",
                        "event": "resend forwarded message to leader"
                    },
                    {
                        "time": "2021-02-25T20:42:27.735449+0100",
                        "event": "resend forwarded message to leader"
                    }
                ],
                "info": {
                    "seq": 41550,
                    "src_is_mon": false,
                    "source": "osd.104 v2:XXX:6864/16579",
                    "forwarded_to_leader": true
                }
            }
        }
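For anyone following along, here is a small Python sketch (the helper name is mine, not from any Ceph tooling) that takes an op from a `ceph daemon mon.<id> ops` dump like the one above and prints the gap between consecutive events. That makes it easy to spot where the op stalls, in this case the roughly 84 seconds between "forwarded" and the first resend:

```python
import json
from datetime import datetime

# Hypothetical helper: show where an op spends its time by printing the
# delta between each pair of consecutive events in its event list.
def event_gaps(op):
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
    events = op["type_data"]["events"]
    times = [datetime.strptime(e["time"], fmt) for e in events]
    for prev, cur, ev in zip(times, times[1:], events[1:]):
        yield ev["event"], (cur - prev).total_seconds()

# Trimmed sample taken from the dump above.
op = json.loads("""
{
  "type_data": {
    "events": [
      {"time": "2021-02-25T20:40:34.989278+0100", "event": "forward_request_leader"},
      {"time": "2021-02-25T20:40:34.989344+0100", "event": "forwarded"},
      {"time": "2021-02-25T20:41:58.658022+0100", "event": "resend forwarded message to leader"},
      {"time": "2021-02-25T20:42:27.735449+0100", "event": "resend forwarded message to leader"}
    ]
  }
}
""")

for event, gap in event_gaps(op):
    print(f"{gap:10.3f}s  {event}")
```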

Any idea what that might be about? It looks a lot like this: 
https://tracker.ceph.com/issues/24180
I set debug_mon to 0, but I keep getting a lot of log spill in the journal: 
about 1-2 messages per second, mostly RocksDB output, nothing that actually 
looks serious or even log-worthy. I had also noticed before that, despite 
logging being set to warning level, the cluster log keeps being written to the 
MON log. But that shouldn't cause such massive stability issues, should it? 
The date on the log op is also odd: 15:13+0100 was hours before the op was 
initiated.
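To put a number on that discrepancy: the offset between the log batch's sequence timestamp and the op's initiated_at is just under five and a half hours. Plain datetime arithmetic on the two values from the dump shows it:

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
# Timestamps copied from the op's description and initiated_at fields above.
logged = datetime.strptime("2021-02-25T15:13:20.306487+0100", fmt)
initiated = datetime.strptime("2021-02-25T20:40:34.698932+0100", fmt)

print(initiated - logged)  # 5:27:14.392445
```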

Here’s my log config:

global  advanced  clog_to_syslog_level             warning
global  basic     err_to_syslog                    true
global  basic     log_to_file                      false
global  basic     log_to_stderr                    false
global  basic     log_to_syslog                    true
global  advanced  mon_cluster_log_file_level       error
global  advanced  mon_cluster_log_to_file          false
global  advanced  mon_cluster_log_to_stderr        false
global  advanced  mon_cluster_log_to_syslog        false
global  advanced  mon_cluster_log_to_syslog_level  warning



Ceph version is 15.2.8.

Janek


> On 25. Feb 2021, at 20:33, Dan van der Ster <d...@vanderster.com> wrote:
> 
> ceph daemon mon.`hostname -s` ops
> 
> That should show you the accumulating ops.
> 
> .. dan
> 
> 
> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff 
> <janek.bevendo...@uni-weimar.de <mailto:janek.bevendo...@uni-weimar.de>> 
> wrote:
> Hi,
> 
> All of a sudden, we are experiencing very concerning MON behaviour. We have 
> five MONs and all of them have thousands up to tens of thousands of slow ops, 
> the oldest one blocking basically indefinitely (at least the timer keeps 
> creeping up). Additionally, the MON stores keep inflating heavily. Under 
> normal circumstances we have about 450-550MB there. Right now it's 27GB and 
> growing (rapidly).
> 
> I tried restarting all MONs, I disabled auto-scaling (just in case) and 
> checked the system load and hardware. I also restarted the MGR and MDS 
> daemons, but to no avail.
> 
> Is there any way I can debug this properly? I can’t seem to find how I can 
> actually view what ops are causing this and what client (if any) may be 
> responsible for it.
> 
> Thanks
> Janek
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
> To unsubscribe send an email to ceph-users-le...@ceph.io 
> <mailto:ceph-users-le...@ceph.io>
