[ceph-users] Log format in Ceph

2020-01-08 Thread Sinan Polat
Hi,


I couldn't find any documentation or information regarding the log format in
Ceph. For example, I have 2 log lines (see below). For each 'word' I would
like to know what it means.

As far as I know, I can break the log lines into:
[date] [timestamp] [unknown] [unknown] [unknown] [pthread] [colon char]
[unknown] [PRIORITY] [message]

Can anyone fill in the [unknown] fields, or redirect me to some
documentation/information?

2020-01-07 15:45:15.593092 osd.3 osd.3 10.36.212.72:6800/5645 2117 : cluster
[WRN] slow request 30.762632 seconds old, received at 2020-01-07
15:44:44.830356: osd_op(client.2127384.0:772793 1.25
1:a71849c8:::rbd_data.20760c15c9284.0014:head [stat,write
8323072~65536] snapc 0=[] ondisk+write+known_if_redirected e191) currently
waiting for rw locks

2020-01-08 03:23:48.297619 mgr.bms-cephmon03-lab client.1199560
10.36.212.93:0/2512770604 2398154 : cluster [DBG] pgmap v2398247: 320 pgs: 320
active+clean; 96.4GiB data, 292GiB used, 2.38TiB / 2.67TiB avail; 0B/s rd,
161KiB/s wr, 19op/s
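
If it helps, here is a quick awk sketch (assuming the cluster log at
/var/log/ceph/ceph.log) that splits such a line into whitespace-separated
fields; $3 to $6 are exactly the fields I am asking about:

  # $1/$2 = date/time, $3-$6 = the fields I am asking about, $7 = the ':'
  # separator, $8 = 'cluster' (a channel?), $9 = [PRIORITY], $10.. = message
  awk '{ msg = ""; for (i = 10; i <= NF; i++) msg = msg $i " ";
         printf "f3=%s f4=%s f5=%s f6=%s chan=%s prio=%s msg=%s\n",
                $3, $4, $5, $6, $8, $9, msg }' /var/log/ceph/ceph.log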

Thanks!

Sinan


Re: [ceph-users] Log format in Ceph

2020-01-08 Thread Stefan Kooman
Quoting Sinan Polat (si...@turka.nl):
> Hi,
> 
> 
> I couldn't find any documentation or information regarding the log format in
> Ceph. For example, I have 2 log lines (see below). For each 'word' I would
> like to know what it means.
> 
> As far as I know, I can break the log lines into:
> [date] [timestamp] [unknown] [unknown] [unknown] [pthread] [colon char]
> [unknown] [PRIORITY] [message]
> 
> Can anyone fill in the [unknown] fields, or redirect me to some
> documentation/information?

Issue "ceph daemon osd.3 dump_historic_slow_ops" on the storage node
hosting this OSD and you will get JSON output with the reason
(flag_point) of the slow op and the series of events.
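
For example, something like this should print just the flag_point and timing
per op (the key names are from memory, so verify against your actual output):

  ceph daemon osd.3 dump_historic_slow_ops | \
    jq '.ops[] | {initiated_at, duration, flag_point: .type_data.flag_point}'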

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Log format in Ceph

2020-01-08 Thread Sinan Polat
Hi Stefan,

I do not want to know the reason. I want to parse Ceph logs (and use them in
Elastic). But without knowing the log format I can't parse them. I know that the
first and second 'words' are the date and timestamp, but what about the 3rd to
5th words of a log line?
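
In the meantime I am experimenting with something like this to turn ceph.log
into one JSON document per line for Elasticsearch (the names after the
timestamp are just placeholders until I know what those fields really are):

  # One JSON object per cluster-log line (NDJSON); non-matching lines are skipped.
  jq -R -c 'try capture("^(?<date>\\S+) (?<time>\\S+) (?<f3>\\S+) (?<f4>\\S+) (?<f5>\\S+) (?<f6>\\d+) : (?<chan>\\S+) \\[(?<prio>\\w+)\\] (?<msg>.*)$")' \
    /var/log/ceph/ceph.log > ceph-log.ndjson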

Sinan 

> On 8 Jan 2020, at 09:48, Stefan Kooman wrote:
> 
> Quoting Sinan Polat (si...@turka.nl):
>> Hi,
>> 
>> 
>> I couldn't find any documentation or information regarding the log format in
>> Ceph. For example, I have 2 log lines (see below). For each 'word' I would
>> like to know what it means.
>> 
>> As far as I know, I can break the log lines into:
>> [date] [timestamp] [unknown] [unknown] [unknown] [pthread] [colon char]
>> [unknown] [PRIORITY] [message]
>> 
>> Can anyone fill in the [unknown] fields, or redirect me to some
>> documentation/information?
> 
> Issue "ceph daemon osd.3 dump_historic_slow_ops" on the storage node
> hosting this OSD and you will get JSON output with the reason
> (flag_point) of the slow op and the series of events.
> 
> Gr. Stefan
> 
> 
> -- 
> | BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl



[ceph-users] monitor ghosted

2020-01-08 Thread Peter Eisch
Hi,

This morning one of my three monitor hosts got booted from the Nautilus 14.2.4
cluster and it won't rejoin.  There haven't been any changes or events at this
site at all.  The conf file is unchanged and the same as on the other two
monitors.  The host is also running the MDS and MGR daemons without any issue.
The ceph-mon log shows this repeating:

2020-01-08 13:33:29.403 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.433 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.541 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
...

There is nothing in the logs of the two remaining/healthy monitors.  What is the
best practice to get this host back into the cluster?

peter


Peter Eisch
Senior Site Reliability Engineer
T1.612.659.3228
virginpulse.com


Re: [ceph-users] monitor ghosted

2020-01-08 Thread sascha a.
What does "ceph mon dump" say?

If I run into problems like this, I reprovision the monitor and re-add it from
scratch. That works, but whether it is best practice I don't know.
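
Roughly something like this (from memory, so check the add/remove-mons docs for
your release first, and only do it while the other two mons are healthy and in
quorum):

  systemctl stop ceph-mon@cephmon02
  ceph mon remove cephmon02                      # drop it from the monmap
  mv /var/lib/ceph/mon/ceph-cephmon02{,.old}     # keep the old store around, just in case
  ceph mon getmap -o /tmp/monmap                 # current map from the surviving quorum
  ceph auth get mon. -o /tmp/mon.keyring
  ceph-mon -i cephmon02 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  chown -R ceph:ceph /var/lib/ceph/mon/ceph-cephmon02
  systemctl start ceph-mon@cephmon02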

Peter Eisch wrote on Wed, 8 Jan 2020, 20:48:

> Hi,
>
> This morning one of my three monitor hosts got booted from the Nautilus
> 14.2.4 cluster and it won't rejoin. There haven't been any changes or
> events at this site at all. The conf file is unchanged and the same as on
> the other two monitors. The host is also running the MDS and MGR daemons
> without any issue. The ceph-mon log shows this repeating:
>
> 2020-01-08 13:33:29.403 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> 2020-01-08 13:33:29.433 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> 2020-01-08 13:33:29.541 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> ...
>
> There is nothing in the logs of the two remaining/healthy monitors. What
> is the best practice to get this host back into the cluster?
>
> peter


Re: [ceph-users] monitor ghosted

2020-01-08 Thread Brad Hubbard
On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch 
wrote:

> Hi,
>
> This morning one of my three monitor hosts got booted from the Nautilus
> 14.2.4 cluster and it won't rejoin. There haven't been any changes or
> events at this site at all. The conf file is unchanged and the same as on
> the other two monitors. The host is also running the MDS and MGR daemons
> without any issue. The ceph-mon log shows this repeating:
>
> 2020-01-08 13:33:29.403 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> 2020-01-08 13:33:29.433 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> 2020-01-08 13:33:29.541 7fec1a736700 1 mon.cephmon02@1(probing) e7
> handle_auth_request failed to assign global_id
> ...
>

Try gathering a log with debug_mon 20. That should provide more detail
about why AuthMonitor::_assign_global_id() didn't return an ID.
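
Something along these lines on cephmon02 itself should do it (via the admin
socket, since that mon is out of quorum):

  ceph daemon mon.cephmon02 config set debug_mon 20/20   # crank it up
  # ...wait for a few of the handle_auth_request/probing messages...
  ceph daemon mon.cephmon02 config set debug_mon 1/5     # back to the default
  # then look at /var/log/ceph/ceph-mon.cephmon02.log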


> There is nothing in the logs of the two remaining/healthy monitors. What
> is the best practice to get this host back into the cluster?
>
> peter


-- 
Cheers,
Brad


Re: [ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-08 Thread Sean Matheny
I tested this out by setting norebalance and norecover, moving the host buckets
under the rack buckets (all of them), and then unsetting the flags. Ceph started
melting down with escalating slow requests, even with the backfill and recovery
parameters set to throttle. I moved the host buckets back under the default root
bucket and things mostly came right, but I still had some inactive/unknown PGs
and had to restart some OSDs to get back to HEALTH_OK.
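
For anyone following along, the all-at-once move was essentially this, with
placeholder rack/host names:

  ceph osd set norebalance
  ceph osd set norecover
  ceph osd crush add-bucket rack1 rack        # one bucket per rack
  ceph osd crush move rack1 root=default
  ceph osd crush move node01 rack=rack1       # repeated for every host bucket
  # ...then, once all hosts are moved:
  ceph osd unset norecover
  ceph osd unset norebalance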

I’m sure there’s a way you can tune things or fade in crush weights or 
something, but I’m happy just moving one at a time.

Our environment has 224 OSDs on 14 hosts, btw.

Cheers,
Sean M


On 8/01/2020, at 1:32 PM, Sean Matheny wrote:

We’re adding in a CRUSH hierarchy retrospectively in preparation for a big 
expansion. Previously we only had host and osd buckets, and now we’ve added in 
rack buckets.

I've set what I think are sensible settings to limit rebalancing, at least ones
that have worked in the past:
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_priority = 5
osd_client_op_priority = 63
osd_recovery_max_active = 3

I thought it would save a lot of unnecessary data movement if I moved the
existing host buckets to the new rack buckets all at once, rather than
host-by-host. As long as recovery is throttled correctly, it shouldn't matter
how many objects are misplaced, the thinking goes.

1) Is doing all at once advisable, or am I putting myself at a much greater 
risk if I do have failures during the rebalance (which could take quite a 
while)?
2) My failure domain is currently set at the host level. If I want to change the
failure domain to 'rack', when is the best time to do so (e.g. after the
rebalancing from moving the hosts into the racks finishes)?

v12.2.2 if it makes a difference.

Cheers,
Sean M





