Hello,
for some months now all our bluestore OSDs have been crashing from time to time.
Currently about 5 OSDs per day.
All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transacti
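If it helps to rule out persistent on-disk corruption after such a crash, one
option is to run a BlueStore fsck on the failed OSD while it is stopped. A
minimal sketch, assuming OSD id 12 and the default data path (adjust both for
your setup):
# systemctl stop ceph-osd@12
# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
# ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-12
# systemctl start ceph-osd@12
The --deep variant also reads object data and verifies checksums, so it takes
considerably longer.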
Dear All,
We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
Unfortunately "ceph iostat" spends most of it's time frozen, with
occasional periods of working normally for less than a minute, then
freeze again for a couple of minutes, then come back to life, and so so
on..
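A quick way to see which mgr is active (and which modules it has loaded) while
reproducing the freeze, as a minimal sketch run from any admin node:
# ceph -s | grep mgr
# ceph mgr module ls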
Hi Stefan,
this looks like a duplicate for
https://tracker.ceph.com/issues/37282
Actually the range of possible root causes might be quite wide.
From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.
As far as I understand you have different OSDs which are failing, right?
Is the set of thes
Curious what distro you're running on, as I've been having similar issues with
instability in the mgr as well; curious if there are any similar threads to pull at.
While the iostat command is running, is the active mgr using 100% CPU in top?
Reed
> On Aug 27, 2019, at 6:41 AM, Jake Grimmett wrote:
>
> De
Hi,
What is the correct/best way to address this? It seems like a Python issue;
maybe it's time I learn how to "restart" modules? The cluster seems to be
working beyond this.
health: HEALTH_ERR
Module 'devicehealth' has failed: Failed to import _strptime
because the import
Hi Reed,
That exactly matches what I'm seeing:
when iostat is working OK, I see ~5% CPU use by ceph-mgr
and when iostat freezes, ceph-mgr CPU increases to 100%
regarding OS, I'm using Scientific Linux 7.7
Kernel 3.10.0-957.21.3.el7.x86_64
I'm not sure if the mgr initiates scrubbing, but if so,
Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon...
thanks
Jake
On 8/27/19 2:22 PM, Jake Grimmett wrote:
> Hi Reed,
>
> That exactly matches what I'm seeing:
>
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases t
Hi Jake,
On 8/27/19 3:22 PM, Jake Grimmett wrote:
> That exactly matches what I'm seeing:
>
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases to 100%
Does this also occur if the dashboard module is disabled? Just wondering
if this is is
Hi Igor,
Am 27.08.19 um 14:11 schrieb Igor Fedotov:
> Hi Stefan,
>
> this looks like a duplicate for
>
> https://tracker.ceph.com/issues/37282
>
> Actually the root cause selection might be quite wide.
>
> From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.
>
> As far as I underst
see inline
On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,
Am 27.08.19 um 14:11 schrieb Igor Fedotov:
Hi Stefan,
this looks like a duplicate for
https://tracker.ceph.com/issues/37282
Actually the root cause selection might be quite wide.
From HW issues to broken logic i
Try running gstack on the ceph mgr process when it is frozen?
This could be a name resolution problem, as you suspect. Maybe gstack will
show where the process is 'stuck' and this might be a call to your name
resolution service.
On Tue, 27 Aug 2019 at 14:25, Jake Grimmett wrote:
> Whoops, I'm r
see inline
Am 27.08.19 um 15:43 schrieb Igor Fedotov:
> see inline
>
> On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> Am 27.08.19 um 14:11 schrieb Igor Fedotov:
>>> Hi Stefan,
>>>
>>> this looks like a duplicate for
>>>
>>> https://tracker.ceph.com/issues/37282
>>>
>>
hi Zheng,
On 8/26/19 3:31 PM, Yan, Zheng wrote:
[…]
> change code to :
[…]
we can happily confirm that this resolves the issue.
thank you _very_ much & with kind regards,
t.
I'm currently seeing this with the dashboard disabled.
My instability decreases, but isn't wholly cured, by disabling prometheus and
rbd_support, which I use in tandem, as the only thing I'm using the
prom-exporter for is the per-rbd metrics.
> ceph mgr module ls
> {
> "enabled_modules": [
It sounds like the OSD is "recovering" after the checksum error.
I.e. the just-failed OSD shows no errors in fsck and is able to restart and
process new write requests for a long enough period (longer than just a
couple of minutes). Are these statements true? If so I can suppose this
is accidental/volatile
Am 27.08.19 um 16:20 schrieb Igor Fedotov:
> It sounds like OSD is "recovering" after checksum error.
Maybe, I have no idea how this works. systemd starts the OSD again after
the crash and then it runs for weeks or days again.
> I.e. just failed OSD shows no errors in fsck and is able to restart an
Yes, the problem still occurs with the dashboard disabled...
Possibly relevant, when both the dashboard and iostat plugins are
disabled, I occasionally see ceph-mgr rise to 100% CPU.
as suggested by John Hearns, the output of gstack ceph-mgr when at 100%
is here:
http://p.ip.fi/52sV
many thank
On Wed, Jul 17, 2019 at 3:07 PM wrote:
>
> All;
>
> I'm trying to firm up my understanding of how Ceph works, and ease of
> management tools and capabilities.
>
> I stumbled upon this:
> http://docs.ceph.com/docs/nautilus/rados/configuration/mon-lookup-dns/
>
> It got me wondering; how do you co
We have run in to what looks like bug 36094
(https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster and
unfortunately now one of our ranks (Rank 1) won't start - it comes up
for a few seconds before the assigned MDS crashes again with the below
log entries. It would appear that OpenFileTa
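Not a fix, but for anyone triaging the same crash, a minimal sketch of
inspecting the open file table objects without modifying anything; the pool
name cephfs_metadata and the mds<rank>_openfiles.<n> object naming are
assumptions, so check yours first:
# ceph fs status
# rados -p cephfs_metadata ls | grep openfiles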
Hi Reed, Lenz, John
I've just tried disabling the balancer; so far ceph-mgr is keeping its
CPU mostly under 20%, even with both the iostat and dashboard back on.
# ceph balancer off
was
[root@ceph-s1 backup]# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
now
Just to further piggyback,
Probably the hardest the mgr seems to get pushed is when the balancer is
engaged.
When trying to eval a pool or cluster, it takes upwards of 30-120 seconds for
it to score it, and then another 30-120 seconds to execute the plan, and it
never seems to engage automa
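For context, a minimal sketch of the eval/execute cycle being described, with
myplan as an example plan name:
# ceph balancer eval
# ceph balancer optimize myplan
# ceph balancer eval myplan
# ceph balancer execute myplan
# ceph balancer status
The eval and execute steps are where the 30-120 second stalls show up.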
The first Ceph + HTC/HPC/science virtual user group meeting is tomorrow,
Wednesday August 28th, at 10:30am US Eastern / 4:30pm EU Central time. Duration
will be kept to <= 1 hour.
I'd like this to be conducted as a user group and not only one person
talking/presenting. For this first meeting I'd li
I'm running a Ceph installation in a lab to evaluate it for production and I have
a cluster running, but I need to mount it on different Windows servers and
desktops. I created an NFS share and was able to mount it on my Linux desktop,
but not a Win 10 desktop. Since it seems that Windows Server 2016
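In case it is only the client side, a minimal sketch of mounting an NFS export
from a Windows box once the "Client for NFS" feature is enabled; the server
name and export path are placeholders:
C:\> mount -o anon,nolock \\nfs-gateway\cephfs Z:
Note that the NFS client feature is not available in every Windows 10 edition,
which may be why the Win 10 mount fails.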
What is the correct/best way to address this? It seems like a Python issue; maybe it's
time I learn how to "restart" modules? The cluster seems to be working beyond
this.
Restarting a single module is done with: `ceph mgr module disable devicehealth ;
ceph mgr module enable devicehealth`.
k
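One way to confirm the module came back after the disable/enable cycle, as a
quick sketch:
# ceph mgr module ls
# ceph health detail
The devicehealth failure should drop out of the health output once the module
imports cleanly again.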