[ceph-users] HEALTH_WARN, 3 daemons have recently crashed

2020-01-10 Thread Simon Oosthoek
Hi,

last week I upgraded our Ceph to 14.2.5 (from 14.2.4) and either during
the procedure or shortly after it, some OSDs crashed. I re-initialised
them and thought that would be enough to fix everything.

I looked a bit further and I do see a lot of lines like this (which are
worrying, I suppose):

ceph.log:2020-01-10 10:06:41.049879 mon.cephmon3 (mon.0) 234423 :
cluster [DBG] osd.97 reported immediately failed by osd.67

osd.109
osd.133
osd.139
osd.111
osd.38
osd.65
osd.38
osd.65
osd.97

Now everything seems to be OK, but the WARN status remains. Is this a
"feature" of 14.2.5 or am I missing something?

Below the output of `ceph -s`

Cheers

/Simon

10:13 [root@cephmon1 ~]# ceph -s
  cluster:
id: b489547c-ba50-4745-a914-23eb78e0e5dc
health: HEALTH_WARN
3 daemons have recently crashed

  services:
mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 27h)
mgr: cephmon3(active, since 27h), standbys: cephmon1, cephmon2
mds: cephfs:1 {0=cephmds1=up:active} 1 up:standby
osd: 168 osds: 168 up (since 6m), 168 in (since 3d); 11 remapped pgs

  data:
pools:   10 pools, 5216 pgs
objects: 167.61M objects, 134 TiB
usage:   245 TiB used, 1.5 PiB / 1.8 PiB avail
pgs: 1018213/1354096231 objects misplaced (0.075%)
 5203 active+clean
 10   active+remapped+backfill_wait
 2    active+clean+scrubbing+deep
 1    active+remapped+backfilling

  io:
client:   149 MiB/s wr, 0 op/s rd, 55 op/s wr
recovery: 0 B/s, 30 objects/s



Re: [ceph-users] HEALTH_WARN, 3 daemons have recently crashed

2020-01-10 Thread Ashley Merrick
Once you have fixed the issue, you need to mark / archive the crash entries as
seen here: https://docs.ceph.com/docs/master/mgr/crash/
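
For reference, the crash module workflow looks roughly like this (the crash ID is a
placeholder):

    ceph crash ls                   # list new and archived crash reports
    ceph crash info <crash-id>      # show the full report, including the backtrace
    ceph crash archive <crash-id>   # acknowledge a single crash
    ceph crash archive-all          # acknowledge all of them at once

Once the relevant entries are archived, the "recently crashed" health warning clears.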





Re: [ceph-users] Looking for experience

2020-01-10 Thread Stefan Priebe - Profihost AG


> On 10.01.2020 at 07:10, Mainor Daly wrote:
> 
> 
> Hi Stefan, 
> 
> before I give some suggestions, can you first describe your use case for which 
> you want to use that setup? Also, which aspects are important for you. 

It's just the backup target of another Ceph cluster, to sync snapshots once a
day.

Important are pricing, performance for this task, and expansion. We would like
to start with something around just 50 TB of storage.

Greets,
Stefan
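
As a rough sketch of the EC-plus-compression option discussed further down in this
thread (pool names, PG counts and the k/m values below are only placeholders, not a
recommendation):

    # EC data pool with bluestore compression
    ceph osd erasure-code-profile set backup-profile k=4 m=2 crush-failure-domain=host
    ceph osd pool create backup-data 1024 1024 erasure backup-profile
    ceph osd pool set backup-data allow_ec_overwrites true
    ceph osd pool set backup-data compression_mode aggressive
    ceph osd pool set backup-data compression_algorithm zlib
    # RBD needs a small replicated pool for metadata; image data goes to the EC pool
    ceph osd pool create backup-meta 128 128 replicated
    rbd pool init backup-meta
    rbd create backup-meta/some-image --size 1T --data-pool backup-data

Note that allow_ec_overwrites requires bluestore OSDs, which matches the bluestore
plan quoted below.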

> 
> 
>> Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on 9 January 2020
>> at 22:52:
>> 
>> 
>> As a starting point the current idea is to use something like:
>> 
>> 4-6 nodes with 12x 12 TB disks each
>> AMD EPYC 7302P 3GHz, 16C/32T
>> 128 GB RAM
>> 
>> Something to discuss is
>> 
>> - EC or go with 3 replicas. We'll use bluestore with compression.
>> - Do we need something like Intel Optane for WAL / DB or not?
>> 
>> Since we started using Ceph we've mostly used SSDs - so no
>> knowledge about HDDs in place.
>> 
>> Greets,
>> Stefan
>>> On 09.01.20 at 16:49, Stefan Priebe - Profihost AG wrote:
>>> 
 On 09.01.2020 at 16:10, Wido den Hollander <w...@42on.com> wrote:
 
 
 
>> On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Wido,
>> On 09.01.20 at 14:18, Wido den Hollander wrote:
>> 
>> 
>>> On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
>>> >
 On 09.01.20 at 13:39, Janne Johansson wrote:
 >
 I'm currently trying to workout a concept for a ceph cluster which can
 be used as a target for backups which satisfies the following
 requirements:
 
 - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
 
 
 You might need to have a large (at least non-1) number of writers to get
 to that sum of operations, as opposed to trying to reach it with one
 single stream written from one single client.
>>> 
>>> 
>>> We are aiming for about 100 writers.
>> 
>> So if I read it correctly the writes will be 64k each.
> 
> may be ;-) see below
> 
>> That should be doable, but you probably want something like NVMe for 
>> DB+WAL.
>> 
>> You might want to tune that larger writes also go into the WAL to speed
>> up the ingress writes. But you mainly want more spindles rather than fewer.
> 
> I would like to give a little bit more insight about this and most
> probably some overhead we currently have in those numbers. Those values
> come from our old classic raid storage boxes. Those use btrfs + zlib
> compression + subvolumes for those backups and we've collected those
> numbers from all of them.
> 
> The new system should just replicate snapshots from the live ceph.
> Hopefully being able to use erasure coding and compression? ;-)
> 
 
 Compression might work, but only if the data is compressible.
 
 EC usually writes very fast, so that's good. I would recommend a lot of
 spindles though. More spindles == more OSDs == more performance.
 
 So instead of using 12TB drives you can consider 6TB or 8TB drives.
>>> 
>>> Currently we have a lot of 5TB 2.5" drives in place, so we could use them. We
>>> would like to start with around 4000 IOPS and 250 MB per second while using
>>> 24-drive boxes. We could place one or two NVMe PCIe cards in them.
>>> 
>>> 
>>> Stefan
>>> 
>>> >
 Wido
 
> Greets,
> Stefan
> 


Re: [ceph-users] HEALTH_WARN, 3 daemons have recently crashed

2020-01-10 Thread Simon Oosthoek
On 10/01/2020 10:41, Ashley Merrick wrote:
> Once you have fixed the issue, you need to mark / archive the crash
> entries as seen here: https://docs.ceph.com/docs/master/mgr/crash/

Hi Ashley,

thanks, I didn't know this before...

It turned out there were quite a few old crashes (since I never archived
them) and of the three most recent ones, two were like this:

"assert_msg": "/build/ceph-14.2.5/src/common/ceph_time.h: In function
'ceph::time_detail::timespan
ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fbda425a700
time 2020-01-02
17:37:56.885082\n/build/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED
ceph_assert(z >= signedspan::zero())\n",

And another one was too big to paste here ;-)

I did a `ceph crash archive-all` and now ceph is OK again :-)
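
For future reference, a quick way to verify afterwards:

    ceph crash ls-new   # should now return an empty list
    ceph status         # the "recently crashed" warning should be gone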

Cheers

/Simon



Re: [ceph-users] ceph (jewel) unable to recover after node failure

2020-01-10 Thread Eugen Block

Hi,


A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.


if all OSDs come back (stable) the recovery should eventually finish.


B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?


Yes, this is a reasonable assumption. Just some weeks ago we saw this
in a customer cluster with EC pools. The OSDs were fully saturated,
causing failing heartbeats from the peers, coming back up and so on
(flapping OSDs). At the beginning the MON notices that the OSD
processes are up although the peers report them as down, but after 5 of
these "down" reports by peers (config option osd_max_markdown_count)
within 10 minutes (config option osd_max_markdown_period) the OSD is
marked out, causing more rebalancing and thus an even higher load.


If there are no other hints at a different root cause you could set
the 'nodown' flag ('ceph osd set nodown') to prevent that flapping. This
should help the cluster to recover; it helped in the customer environment,
although there was also another issue there.
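
For completeness, the flag and options mentioned above can be handled roughly like
this (osd.48 is just an example; the 'daemon' commands must run on the host that
carries that OSD):

    ceph daemon osd.48 config get osd_max_markdown_count
    ceph daemon osd.48 config get osd_max_markdown_period
    ceph osd set nodown       # stop the flapping while the cluster recovers
    ceph osd unset nodown     # remove the flag again once recovery has finished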


Regards,
Eugen


Quoting Hanspeter Kunz:


Hi,

after a node failure ceph is unable to recover, i.e. unable to
reintegrate the failed node back into the cluster.

what happened?
1. a node with 11 osds crashed, the remaining 4 nodes (also with 11
osds each) re-balanced, although reporting the following error
condition:

too many PGs per OSD (314 > max 300)

2. after we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the OSD nodes (not only on the one that had failed).
We saw that the CPU load on the nodes was very high (average load 120).

3. the situation seemed to get worse over time (more and more OSDs
going down, fewer coming back up), so we switched the node that had
failed off again.

4. after that, the cluster "calmed down" and the CPU load became normal
(average load ~4-5). We manually restarted the OSD daemons of the OSDs
that were still down, and one after the other these OSDs came back up.
Recovery processes are still running now, but it seems to me that 14
PGs are not recoverable:

output of ceph -s:

 health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)

I have a few questions now:

A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?

C. If indeed all this was caused by such an overload is there a way to
make the recovery process less CPU intensive?

D. What would you advise me to do/try to recover to a healthy state?

In what follows I try to give some more background information
(configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie
[yes I know this is old]

cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node; each OSD
daemon controls a 2 TB hard drive. The journals are written to an SSD.

ceph.conf:
-
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
--

Log Messages (examples):

we see a lot of:

Jan  7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1
osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48
since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852
(cutoff 2020-01-07 18:52:02.411330)

however, all the networks were up (the machines could ping each other).

I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan  7 16:47:43 bruce ceph-osd[5689]: 2020-01-07

[ceph-users] Hardware selection for ceph backup on ceph

2020-01-10 Thread Stefan Priebe - Profihost AG
Hi,

we're currently in the process of building a new Ceph cluster to back up RBD
images from multiple Ceph clusters.

We would like to start by backing up just a single Ceph cluster, which is about
50 TB. The compression ratio of the data is around 30% when using zlib. We need
to be able to scale the backup cluster up to 1 PB.

The workload on the original rbd images is mostly 4K writes so I expect rbd 
export-diff to do a lot of small writes.
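
For context, the per-image transfer would look roughly like this (pool, image and
snapshot names are placeholders):

    # on the source cluster: snapshot, then ship only the delta to the backup cluster
    rbd snap create rbd/vm-disk@backup-2020-01-10
    rbd export-diff --from-snap backup-2020-01-09 rbd/vm-disk@backup-2020-01-10 - \
      | ssh backup-host rbd import-diff - backup/vm-disk

so the backup cluster largely receives the same small random writes that hit the
source images.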

The current idea is to use the following hardware as a start:
6 servers with:
1x AMD EPYC 7302P 3GHz, 16C/32T
128 GB memory
14x 12 TB Toshiba Enterprise MG07ACA HDD drives, 4K native
dual 25 Gb network

Does it fit? Does anybody have experience with the drives? Can we use EC or do
we need to use normal replication?

Greets,
Stefan


Re: [ceph-users] Dashboard RBD Image listing takes forever

2020-01-10 Thread Ernesto Puerta
Hi Lenz,

That PR will need a lot of rebasing, as there have been later changes to the
rbd controller.

Nevertheless, while working on that I found a few quick wins that could be
easily implemented (I'll try to come back to this in the next weeks):

   - Caching object instances and using flyweight objects for ioctx,
   rbd.Images, stat, etc.
   - Removing redundant (heavyweight) call to RBDConfiguration.
   - Moving the actual disk usage calculation out of the 60-second loop.
   IMHO that info should be provided by RBD, perhaps calculated and cached in
   the rbd_support mgr module (@Jason)?

However that endpoint, if used with multiple RBD pools, namespaces, clones
and snapshots, is gonna have a hard time (O(N^4)-like) as it's fully serial.

Any other ideas?

@Matt: just curious, apart from the number of images, what's the amount of
rbd pools/clones/snapshots/... on your deployment?

Kind regards,

Ernesto Puerta

He / Him / His

Senior Software Engineer, Ceph

Red Hat 



On Mon, Jan 6, 2020 at 6:08 PM Lenz Grimmer  wrote:

> Hi Matt,
>
> On 1/6/20 4:33 PM, Matt Dunavant wrote:
>
> > I was hoping there was some update on this bug:
> > https://tracker.ceph.com/issues/39140
> >
> > In all recent versions of the dashboard, the RBD image page takes
> > forever to populate due to this bug. All our images have fast-diff
> > enabled, so it can take 15-20 min to populate this page with about
> > 20-30 images.
>
> Thanks for bringing this up and the reminder. I've just updated the
> tracker issue by pointing it to the current pull request that intends to
> address this: https://github.com/ceph/ceph/pull/28387 - looks like this
> approach needs further testing/review before we can merge it, it
> currently is still marked as "Draft".
>
> @Ernesto - any news/thoughts about this from your POV?
>
> Thanks,
>
> Lenz
>
> --
> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
> GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)
>
>


[ceph-users] where does 100% RBD utilization come from?

2020-01-10 Thread Philip Brown
Surprisingly, a Google search didn't seem to find the answer to this, so I guess
I should ask here:

what determines if an RBD is "100% busy"?

I have some backend OSDs, and an iSCSI gateway, serving out some RBDs.

iostat on the gateway says rbd is 100% utilized

iostat on individual OSDs only goes as high as about 60% on a per-device basis.
CPU is idle.
It doesn't seem like the network interface is capped either.

So.. how do I improve RBD throughput?
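
In case anyone wants to reproduce the comparison, the checks described above boil
down to something like this (device names are examples; 'ceph osd perf' is an extra
data point not mentioned above):

    iostat -x 1 /dev/rbd0    # %util of the mapped RBD on the iSCSI gateway
    iostat -x 1 /dev/sdb     # per-device %util on an OSD host
    ceph osd perf            # commit/apply latency per OSD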



--
Philip Brown | Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310 | Fax 714.918.1325
pbr...@medata.com | www.medata.com


Re: [ceph-users] OSD Marked down unable to restart continuously failing

2020-01-10 Thread Radhakrishnan2 S
Can someone please help respond to the query below?

Regards
Radha Krishnan S
TCS Enterprise Cloud Practice
Tata Consultancy Services
Cell:- +1 848 466 4870
Mailto: radhakrishnan...@tcs.com
Website: http://www.tcs.com

Experience certainty. IT Services | Business Solutions | Consulting



-Radhakrishnan2 S/CHN/TCS wrote: -
To: "Ceph Users" 
From: Radhakrishnan2 S/CHN/TCS
Date: 01/09/2020 08:34AM
Subject: OSD Marked down unable to restart continuously failing

Hello Everyone, 

One OSD node out of 16 has 12 OSDs with bcache on NVMe. Locally those OSD
daemons seem to be up and running, while `ceph osd tree` shows them as down.
Logs show that the OSDs have had stuck IO for over 4096 sec.

I tried checking iostat, netstat and `ceph -w` along with the logs. Is there a
way to identify why this is happening? In addition, when I restart the OSD
daemons on the respective OSD node, the restart fails. Any quick help please.
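
A few commands that are commonly used to narrow down stuck/slow requests like these
(osd.12 is just an example; the 'daemon' commands must run on that OSD's host):

    ceph health detail                      # which OSDs/PGs the slow requests point at
    ceph daemon osd.12 dump_ops_in_flight   # ops currently stuck in this OSD
    ceph daemon osd.12 dump_historic_ops    # recently completed (slow) ops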

Regards
Radha Krishnan S
TCS Enterprise Cloud Practice
Tata Consultancy Services
Cell:- +1 848 466 4870
Mailto: radhakrishnan...@tcs.com
Website: http://www.tcs.com

Experience certainty. IT Services | Business Solutions | Consulting




