Thanks for your swift reply. Below is the requested information.
I understand the bit about not being able to reduce the PG count, as we've come
across this issue once before. This is the reason I've been hesitant to make
any changes there without being 100% certain of getting it right and the i
The MGR is stopped by me because it took too much memory.
For the PG status, I added some OSDs to this cluster, and it
Frank Schilder wrote on Thursday, 29 October 2020 at 3:27 PM:
> Your problem is the overall cluster health. The MONs store cluster history
> information that will be trimmed once it reaches HEALTH_OK. Restar
After adding OSDs to the cluster, the recovery and backfill progress has not
finished yet.
Zhenshi Zhou wrote on Thursday, 29 October 2020 at 3:29 PM:
> The MGR is stopped by me because it took too much memory.
> For the PG status, I added some OSDs to this cluster, and it
>
> Frank Schilder wrote on Thursday, 29 October 2020 at 3:27 PM:
>
>> Your
On 10/28 17:26, Ki Wong wrote:
> Hello,
>
> I am at my wit's end.
>
> So I made a mistake in the configuration of my router and one
> of the monitors (out of 3) dropped out of the quorum and nothing
> I’ve done allows it to rejoin. That includes reinstalling the
> monitor with ceph-ansible.
>
> T
Thanks again Frank. That gives me something to digest (and try to understand).
One question regarding maintenance mode: these are production systems that are
required to be available all the time. What, exactly, will happen if I issue
this command for maintenance mode?
Thanks,
Mark
On Thu,
Really? This is the first time I've read this here; AFAIK you can get a split brain
like this.
-----Original Message-----
Sent: Thursday, October 29, 2020 12:16 AM
To: Eugen Block
Cc: ceph-users
Subject: [ceph-users] Re: frequent Monitor down
Eugen, I've got four physical servers and I've installed mon on a
I reset the pg_num after adding the OSDs, and it made some PGs inactive (in the
activating state).
Frank Schilder wrote on Thursday, 29 October 2020 at 3:56 PM:
> This does not explain incomplete and inactive PGs. Are you hitting
> https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not
> recover from OSD restart")? In
Hi Mark,
it looks like you have some very large PGs. You also run with quite a low PG
count, in particular for the large pool. Please post the output of "ceph df"
and "ceph osd pool ls detail" to see how much data is in each pool and some
pool info. I guess you need to increase the PG count o
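As a rough sketch of the kind of change being suggested here (the pool name and
target pg_num are placeholders, not values from this thread):

ceph df                                    # data per pool
ceph osd pool ls detail                    # current pg_num / pgp_num per pool
ceph osd pool set <pool-name> pg_num 256   # pick a power of two sized for the pool's data
ceph osd pool set <pool-name> pgp_num 256  # Nautilus and later adjust pgp_num automatically

Increasing pg_num triggers a lot of peering and backfill, so it is best done
once the cluster is otherwise healthy.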
Your problem is the overall cluster health. The MONs store cluster history
information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs
only makes things worse right now. The health status is a mess, no MGR, a bunch
of PGs inactive, etc. This is what you need to resolve. How d
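For reference, a minimal set of standard commands to survey the items listed
above (nothing here is specific to this cluster):

ceph status          # overall health, mon/mgr/osd/pg summary
ceph health detail   # which PGs are inactive or incomplete, and why
ceph osd tree        # which OSDs are down or out
ceph mgr stat        # whether an active mgr exists at all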
CephFS pools are uncritical, because CephFS splits very large files into
chunks of object size. The RGW pool is the problem, because RGW does not, as far
as I know. A few 1 TB uploads and you have a problem.
The calculation is confusing, because the term PG is used in two different
meanings, unfo
This does not explain incomplete and inactive PGs. Are you hitting
https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover
from OSD restart")? In that case, temporarily stopping and restarting all new
OSDs might help.
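A hedged sketch of such a restart, assuming systemd-managed (non-containerized)
OSDs and that the newly added OSDs are ids 10-19; the ids are placeholders:

ceph osd set noout                                   # avoid marking OSDs out during the restart
for id in {10..19}; do systemctl stop ceph-osd@$id; done
sleep 60
for id in {10..19}; do systemctl start ceph-osd@$id; done
ceph osd unset noout
ceph pg stat                                         # check whether the inactive PGs recover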
Best regards,
=
Frank Schilder
AIT Risø
It will prevent OSDs from being marked out if you shut them down or the .
Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get
overloaded by peering traffic and the MONs can lose contact for a while.
Setting noout will prevent flapping and also reduce the administrati
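For completeness, setting and clearing the flag is just (standard commands):

ceph osd set noout     # before the peering-heavy change
ceph osd unset noout   # once the cluster has settled again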
He he.
> It will prevent OSDs from being marked out if you shut them down or the .
... down or the MONs lose heartbeats due to high network load during peering.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Hi,
I was so anxious a few hours ago because the SST files were growing so fast
and I didn't think
the space on the mon servers could accommodate it.
Let me tell it from the beginning. I have a cluster with OSDs deployed on
SATA (7200 rpm) drives.
10 TB each OSD, and I used an EC pool for more space. I added new OSDs into th
I then followed someone's guidance, added 'mon compact on start = true' to the
config, and restarted one mon.
That mon did not join the cluster until I added two mons deployed on
virtual machines with SSDs into
the cluster.
And now the cluster is fine except for the PG status.
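As an aside, compaction can also be triggered online and the store size watched
while it runs; the mon id and path below are examples, not taken from this
cluster:

ceph tell mon.ceph03 compact                     # trigger RocksDB compaction without a restart
du -sh /var/lib/ceph/mon/ceph-ceph03/store.db    # watch the mon store size shrink (or not)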
On 2020-10-29 01:26, Ki Wong wrote:
> Hello,
>
> I am at my wit's end.
>
> So I made a mistake in the configuration of my router and one
> of the monitors (out of 3) dropped out of the quorum and nothing
> I’ve done allows it to rejoin. That includes reinstalling the
> monitor with ceph-ansible.
On the machines with the RADOS Gateways, there is also an HAProxy running
(and it does HTTPS->HTTP conversion).
I have tried it both ways already:
on port 443 (resolves to the external IP)
and
on the internal port (with a hosts entry to the internal IP; on
the machine, where the ceph-mgr
I am trying to configure the cloud sync module in my Ceph cluster to implement
backups to AWS S3. I could not figure out how to configure it using the
available documentation. Can someone help me implement this?
Thanks,
Sailaja
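Not an authoritative recipe, but per the cloud sync module documentation the
outline is a zone with tier type 'cloud' plus tier-config pointing at the S3
endpoint; the zone name, endpoint and keys below are placeholders and should be
checked against the docs for your release:

radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=cloud-sync --tier-type=cloud
radosgw-admin zone modify --rgw-zone=cloud-sync \
    --tier-config=connection.endpoint=https://s3.amazonaws.com,connection.access_key=<key>,connection.secret=<secret>
radosgw-admin period update --commit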
We hit this issue over the weekend on our HDD-backed EC Nautilus cluster while
removing a single OSD. We also did not have any luck using compaction. The
mon-logs filled up our entire root disk on the mon servers and we were running
on a single monitor for hours while we tried to finish recovery
I think you really need to sit down and explain the full story. Dropping
one-liners with new information will not work via e-mail.
I have never heard of the problem you are facing, so you did something that
possibly no-one else has done before. Unless we know the full history from the
last time
Hi,
you could lower the recovery settings to the default and see if that helps:
osd_max_backfills = 1
osd_recovery_max_active = 3
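One way to apply those values, as a sketch (either form should work on recent
releases):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
# or, to change them at runtime only:
ceph tell osd.* injectargs '--osd_max_backfills=1 --osd_recovery_max_active=3'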
Regards,
Eugen
Quoting Kamil Szczygieł:
Hi,
We're running Octopus and we have 3 control plane nodes (12 core, 64
GB memory each) that are running mon, mds an
On Thu, Oct 29, 2020 at 9:26 AM Ml Ml wrote:
>
> Hello,
> I played around with some log level I can't remember and my logs are
> now getting bigger than my DVD-Movie collection.
> E.g.: journalctl -b -u
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service >
> out.file is 1.1 GB big.
>
> I
Hi,
I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.
Situation:
- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack
We performed a test where we turned off the Top-of-Rack network for each
rack. This worked as expected with
Hi Wido,
Could it be one of these?
mon osd min up ratio
mon osd min in ratio
36/120 is 0.3 so it might be one of those magic ratios at play.
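A quick way to check what they are set to, assuming the mon id matches the
short hostname (an assumption, adjust as needed); run on a mon host:

ceph daemon mon.$(hostname -s) config get mon_osd_min_up_ratio
ceph daemon mon.$(hostname -s) config get mon_osd_min_in_ratio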
Cheers,
Dan
On Thu, 29 Oct 2020, 18:05 Wido den Hollander, wrote:
> Hi,
>
> I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
>
Thanks, David.
I just double checked and they can all connect to one another,
on both v1 and v2 ports.
-kc
> On Oct 29, 2020, at 12:41 AM, David Caro wrote:
>
> On 10/28 17:26, Ki Wong wrote:
>> Hello,
>>
>> I am at my wit's end.
>>
>> So I made a mistake in the configuration of my router an
Hi Alex,
We found that there were a huge number of keys in the "logm" and "osdmap"
tables
while using ceph-monstore-tool. I think that could be the root cause.
Well, some pages also say that disabling the 'insight' module can resolve this
issue, but
I checked our cluster and we didn't enable this module
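For anyone who wants to reproduce that check, a hedged sketch (the store path
is an example; run ceph-monstore-tool against a stopped mon or a copy of the
store, and note the mgr module is spelled 'insights' on recent releases):

ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys | awk '{print $1}' | sort | uniq -c
ceph mgr module ls | grep -i insight     # check whether the module is enabled at all
ceph mgr module disable insights         # only if it actually is enabled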
Hi,
We're running Octopus and we have 3 control plane nodes (12 core, 64 GB memory
each) that are running mon, mds and mgr, and also 4 data nodes (12 core, 256 GB
memory, 13x10TB HDDs each). We've increased the number of PGs in our pool,
which resulted in all OSDs going crazy and reading the avera
Thanks for the response...
I don't have the old OSDs (and no backups, because this cluster is not so
important; it is the development cluster), so I need to delete the unknown PGs
(how can I do that?). But I don't want to wipe the whole Ceph
cluster; if I can delete the unknown and incomplete PGs, well so
Uff... now two of the OSDs are crashing with...
https://pastebin.ubuntu.com/p/qd6Tc2rpfm/
On 2020-10-29 13:11, Frank Schilder wrote:
... I will now use only one site, but first need to stabilize the
cluster to remove the EC erasure coding and use replication ...
If you change to one site only, th
Typically, the number of nodes is 2n+1 to cover n failures.
It's OK to have 4 nodes; from a failure-coverage POV, it's the same
as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the
cluster is down. It works, it just doesn't make much sense.
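To spell out the arithmetic behind that (a general note, not specific to any
cluster here): a monitor quorum needs a strict majority, i.e. floor(n/2) + 1
monitors, so the number of failures tolerated is n - (floor(n/2) + 1):

n = 3 -> quorum 2 -> tolerates 1 failure
n = 4 -> quorum 3 -> tolerates 1 failure
n = 5 -> quorum 3 -> tolerates 2 failures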
Thanks!
Tony
> -----Original Message-----
> From: Marc R
Hello,
I played around with some log level I can't remember and my logs are
now getting bigger than my DVD-Movie collection.
E.g.: journalctl -b -u
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service >
out.file is 1.1 GB big.
I already tried:
ceph tell mon.ceph03 config set debug_mon 0/1
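A hedged sketch of what to try next; the daemon name matches the one above, the
admin socket command has to be run where that mon runs (inside the container if
containerized), and the journald step assumes you can discard archived journal
data:

ceph daemon mon.ceph03 config diff | grep -A3 debug     # which debug_* options differ from default
ceph tell mon.ceph03 config set debug_ms 0/0            # messenger logging is a common source of huge logs
journalctl --vacuum-size=500M                           # shrink the existing journal on disk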
On Thu, 29 Oct 2020 at 20:16, Tony Liu wrote:
> Typically, the number of nodes is 2n+1 to cover n failures.
> It's OK to have 4 nodes; from a failure-coverage POV, it's the same
> as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the
> cluster is down. It works, it just doesn't make much sense.
> ... I will now use only one site, but first need to stabilize the
> cluster to remove the EC erasure coding and use replication ...
If you change to one site only, there is no point in getting rid of the EC
pool. Your main problem will be restoring the lost data. Do you have backup of
everything? D
On 2020-10-29 06:55, Mark Johnson wrote:
> I've been struggling with this one for a few days now. We had an OSD report
> as near full a few days ago. Had this happen a couple of times before and a
> reweight-by-utilization has sorted it out in the past. Tried the same again
> but this time we
Can anyone help? BlueFS mount failed after a long time.
The error message:
2020-10-30 05:33:54.906725 7f1ad73f5e00 1 bluefs add_block_device bdev 1
path /var/lib/ceph/osd/ceph-30/block size 7.28TiB
2020-10-30 05:33:54.906758 7f1ad73f5e00 1 bluefs mount
2020-10-30 06:00:32.881850 7f1ad73f5e00 -1 ***
Hi:
I have this ceph status:
-
cluster:
id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
health: HEALTH_WARN
noout flag(s) set
1 osds down
Reduced data availability: 191 pgs inactiv
Hi,
I have not tried it, but maybe this will help with the unknown PGs, if you don’t
care about data loss.
ceph osd force-create-pg
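If it helps, a usage sketch (the PG id is a placeholder; this recreates the PG
empty, so whatever data it held is gone for good):

ceph health detail | grep unknown                      # list the PG ids stuck unknown
ceph osd force-create-pg 2.1f --yes-i-really-mean-it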
On 30 October 2020, at 10:46, Ing. Luis Felipe Domínguez Vega wrote:
Hi:
I have this ceph status:
---
Great, and thanks. I fixed all the unknown PGs with the command; now the
incomplete, down, etc. remain.
On 2020-10-29 23:57, 胡 玮文 wrote:
Hi,
I have not tried it, but maybe this will help with the unknown PGs, if
you don’t care about data loss.
ceph osd force-create-pg
On 30 October 2020, at 10:46, Ing. Luis Fel
Hi List,
After a successful upgrade from Mimic 13.2.8 to Nautilus 14.2.12 we
enabled msgr2. Soon after that both of the MDS servers (active /
active-standby) restarted.
We did not hit any ASSERTS this time, so that's good :>.
However, I have not seen this happening on four different test cluster
Hi:
I tried to get info from an RBD image but:
-
root@fond-beagle:/# rbd list --pool cinder-ceph | grep
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
root@fond-beagle:/# rbd info --p