[ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
Hello, for some months now all our BlueStore OSDs have been crashing from time to time, currently about 5 OSDs per day. All of them show the following trace: Trace: 2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transacti

[ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Dear All, We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes. Unfortunately "ceph iostat" spends most of its time frozen, with occasional periods of working normally for less than a minute; it then freezes again for a couple of minutes, then comes back to life, and so on..

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov
Hi Stefan, this looks like a duplicate of https://tracker.ceph.com/issues/37282 Actually the range of possible root causes is quite wide, from HW issues to broken logic in RocksDB/BlueStore/BlueFS etc. As far as I understand you have different OSDs which are failing, right? Is the set of thes

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Curious which distro you're running, as I've been having similar issues with instability in the mgr and wonder if there are any common threads to pull at. While the iostat command is running, is the active mgr using 100% CPU in top? Reed > On Aug 27, 2019, at 6:41 AM, Jake Grimmett wrote: > > De

[ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-27 Thread Peter Eisch
Hi, What is the correct/best way to address this? It seems like a Python issue, maybe it's time I learn how to "restart" modules? The cluster seems to be working beyond this. health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Hi Reed, That exactly matches what I'm seeing: when iostat is working OK, I see ~5% CPU use by ceph-mgr, and when iostat freezes, ceph-mgr CPU increases to 100%. Regarding OS, I'm using Scientific Linux 7.7, kernel 3.10.0-957.21.3.el7.x86_64. I'm not sure if the mgr initiates scrubbing, but if so,

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon... thanks Jake On 8/27/19 2:22 PM, Jake Grimmett wrote: > Hi Reed, > > That exactly matches what I'm seeing: > > when iostat is working OK, I see ~5% CPU use by ceph-mgr > and when iostat freezes, ceph-mgr CPU increases t

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Lenz Grimmer
Hi Jake, On 8/27/19 3:22 PM, Jake Grimmett wrote: > That exactly matches what I'm seeing: > > when iostat is working OK, I see ~5% CPU use by ceph-mgr > and when iostat freezes, ceph-mgr CPU increases to 100% Does this also occur if the dashboard module is disabled? Just wondering if this is is

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
Hi Igor, On 27.08.19 at 14:11, Igor Fedotov wrote: > Hi Stefan, > > this looks like a duplicate for > > https://tracker.ceph.com/issues/37282 > > Actually the root cause selection might be quite wide. > > From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc. > > As far as I underst

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov
see inline On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote: Hi Igor, On 27.08.19 at 14:11, Igor Fedotov wrote: Hi Stefan, this looks like a duplicate for https://tracker.ceph.com/issues/37282 Actually the root cause selection might be quite wide. From HW issues to broken logic i

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread John Hearns
Try running gstack on the ceph mgr process when it is frozen? This could be a name resolution problem, as you suspect. Maybe gstack will show where the process is 'stuck', and this might be a call to your name resolution service. On Tue, 27 Aug 2019 at 14:25, Jake Grimmett wrote: > Whoops, I'm r

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
see inline On 27.08.19 at 15:43, Igor Fedotov wrote: > see inline > > On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote: >> Hi Igor, >> >> On 27.08.19 at 14:11, Igor Fedotov wrote: >>> Hi Stefan, >>> >>> this looks like a duplicate for >>> >>> https://tracker.ceph.com/issues/37282 >>> >>

Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-27 Thread thoralf schulze
hi Zheng, On 8/26/19 3:31 PM, Yan, Zheng wrote: […] > change code to : […] we can happily confirm that this resolves the issue. thank you _very_ much & with kind regards, t.

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
I'm currently seeing this with the dashboard disabled. My instability decreases, but isn't wholly cured, by disabling prometheus and rbd_support, which I use in tandem, as the only thing I'm using the prom-exporter for is the per-rbd metrics. > ceph mgr module ls > { > "enabled_modules": [

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov
It sounds like the OSD is "recovering" after the checksum error. I.e. the OSD that just failed shows no errors in fsck and is able to restart and process new write requests for a long enough period (longer than just a couple of minutes). Are these statements true? If so I can suppose this is accidental/volatile

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
On 27.08.19 at 16:20, Igor Fedotov wrote: > It sounds like OSD is "recovering" after checksum error. Maybe; I have no idea how this works. systemd starts the OSD again after the crash and then it runs for days or weeks again. > I.e. just failed OSD shows no errors in fsck and is able to restart an

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Yes, the problem still occurs with the dashboard disabled... Possibly relevant: when both the dashboard and iostat plugins are disabled, I occasionally see ceph-mgr rise to 100% CPU. As suggested by John Hearns, the output of gstack ceph-mgr when at 100% is here: http://p.ip.fi/52sV many thank

Re: [ceph-users] MON DNS Lookup & Version 2 Protocol

2019-08-27 Thread Jason Dillaman
On Wed, Jul 17, 2019 at 3:07 PM wrote: > > All; > > I'm trying to firm up my understanding of how Ceph works, and ease of > management tools and capabilities. > > I stumbled upon this: > http://docs.ceph.com/docs/nautilus/rados/configuration/mon-lookup-dns/ > > It got me wondering; how do you co

[ceph-users] Recovery from "FAILED assert(omap_num_objs <= MAX_OBJECTS)"

2019-08-27 Thread Zoë O'Connell
We have run into what looks like bug 36094 (https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster and unfortunately now one of our ranks (Rank 1) won't start - it comes up for a few seconds before the assigned MDS crashes again with the below log entries. It would appear that OpenFileTa

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Hi Reed, Lenz, John I've just tried disabling the balancer; so far ceph-mgr is keeping its CPU mostly under 20%, even with both the iostat and dashboard back on. # ceph balancer off was [root@ceph-s1 backup]# ceph balancer status { "active": true, "plans": [], "mode": "upmap" } now
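The "ceph balancer status" output quoted in the message above is JSON, so the before/after check could be scripted. The following is only a sketch: the sample string is the "was" state copied from the message, and the helper name balancer_is_active is hypothetical, not part of any Ceph tooling.

```python
import json

# "was" state as quoted in the message (the "now" state is cut off in the archive).
STATUS_WAS = '{"active": true, "plans": [], "mode": "upmap"}'


def balancer_is_active(status_json):
    """Return True if the status JSON reports the balancer as active.

    Hypothetical helper: it only reads the "active" field shown in the
    quoted output and treats a missing field as inactive.
    """
    status = json.loads(status_json)
    return bool(status.get("active", False))


if __name__ == "__main__":
    print(balancer_is_active(STATUS_WAS))
```

In practice the string would come from running "ceph balancer status" on an admin node, for example via subprocess, rather than from a constant.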

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Just to further piggyback: the hardest the mgr seems to get pushed is when the balancer is engaged. When trying to eval a pool or cluster, it takes upwards of 30-120 seconds for it to score it, and then another 30-120 seconds to execute the plan, and it never seems to engage automa

Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
The first Ceph + HTC/HPC/science virtual user group meeting is tomorrow, Wednesday August 28th, at 10:30am US Eastern / 4:30pm EU Central time. Duration will be kept to <= 1 hour. I'd like this to be conducted as a user group and not only one person talking/presenting. For this first meeting I'd li

[ceph-users] Ceph + SAMBA (vfs_ceph)

2019-08-27 Thread Salsa
I'm running a Ceph installation in a lab to evaluate it for production and I have a cluster running, but I need to mount it on different Windows servers and desktops. I created an NFS share and was able to mount it on my Linux desktop, but not on a Win 10 desktop. Since it seems that Windows Server 2016

Re: [ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-27 Thread Konstantin Shalygin
What is the correct/best way to address this? It seems like a Python issue, maybe it's time I learn how to "restart" modules? The cluster seems to be working beyond this. To restart a single module: `ceph mgr module disable devicehealth ; ceph mgr module enable devicehealth`. k
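The disable/enable pair from the reply above could be wrapped in a small script. This is a minimal sketch, assuming the ceph CLI is on the PATH of an admin node; the function names are hypothetical, and only the two "ceph mgr module" commands come from the message itself.

```python
import shlex
import subprocess


def restart_module_commands(module):
    """Return the two ceph CLI invocations from the message as argv lists."""
    return [
        ["ceph", "mgr", "module", "disable", module],
        ["ceph", "mgr", "module", "enable", module],
    ]


def restart_module(module, dry_run=True):
    """Disable, then enable, one mgr module.

    With dry_run=True the commands are only printed. With dry_run=False
    they are executed, and check=True stops after a failed disable; the
    one-liner in the message uses ';', which would run enable regardless.
    """
    for argv in restart_module_commands(module):
        if dry_run:
            print(" ".join(shlex.quote(arg) for arg in argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    restart_module("devicehealth")
```

Running it as-is only prints the two commands; pass dry_run=False to execute them against a live cluster.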

Re: [ceph-users] Ceph + SAMBA (vfs_ceph)

2019-08-27 Thread Konstantin Shalygin
I'm running a Ceph installation in a lab to evaluate it for production and I have a cluster running, but I need to mount it on different Windows servers and desktops. I created an NFS share and was able to mount it on my Linux desktop, but not on a Win 10 desktop. Since it seems that Windows Server 2016