Thanks for your swift reply. Below is the requested information.
I understand the bit about not being able to reduce the PG count, as we've come
across this issue once before. This is the reason I've been hesitant to make
any changes there without being 100% certain of getting it right and the i
The MGR is stopped by me because it took too much memory.
For the PG status, I added some OSDs to this cluster, and it
Frank Schilder wrote on Thursday, 29 October 2020 at 3:27 PM:
> Your problem is the overall cluster health. The MONs store cluster history
> information that will be trimmed once it reaches HEALTH_OK. Restar
After adding OSDs to the cluster, the recovery and backfill progress has not
finished yet.
Zhenshi Zhou wrote on Thursday, 29 October 2020 at 3:29 PM:
> The MGR is stopped by me because it took too much memory.
> For the PG status, I added some OSDs to this cluster, and it
>
> Frank Schilder wrote on Thursday, 29 October 2020 at 3:27 PM:
>
>> Your
On 10/28 17:26, Ki Wong wrote:
> Hello,
>
> I am at my wit's end.
>
> So I made a mistake in the configuration of my router and one
> of the monitors (out of 3) dropped out of the quorum and nothing
> I’ve done allows it to rejoin. That includes reinstalling the
> monitor with ceph-ansible.
>
> T
Thanks again Frank. That gives me something to digest (and try to understand).
One question regarding maintenance mode: these are production systems that are
required to be available all the time. What, exactly, will happen if I issue
this command for maintenance mode?
Thanks,
Mark
On Thu,
Really? This is the first time I've read this here; AFAIK you can get a split brain
like this.
-----Original Message-----
Sent: Thursday, October 29, 2020 12:16 AM
To: Eugen Block
Cc: ceph-users
Subject: [ceph-users] Re: frequent Monitor down
Eugen, I've got four physical servers and I've installed mon on a
I reset the pg_num after adding the OSDs, and it made some PGs inactive (in the
activating state).
Frank Schilder wrote on Thursday, 29 October 2020 at 3:56 PM:
> This does not explain incomplete and inactive PGs. Are you hitting
> https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not
> recover from OSD restart")? In
Hi Mark,
it looks like you have some very large PGs. You also run with quite a low PG
count, in particular for the large pool. Please post the output of "ceph df"
and "ceph osd pool ls detail" to see how much data is in each pool and some
pool info. I guess you need to increase the PG count o
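As a rough sketch of the kind of change being suggested here (the pool name and
target pg_num are placeholders, not values from this thread):

ceph df                                    # data per pool
ceph osd pool ls detail                    # current pg_num / pgp_num per pool
ceph osd pool set <pool-name> pg_num 256   # pick a power of two sized for the pool's data
ceph osd pool set <pool-name> pgp_num 256  # Nautilus and later adjust pgp_num automatically

Increasing pg_num triggers a lot of peering and backfill, so it is best done
once the cluster is otherwise healthy.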
Your problem is the overall cluster health. The MONs store cluster history
information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs
only makes things worse right now. The health status is a mess, no MGR, a bunch
of PGs inactive, etc. This is what you need to resolve. How d
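For reference, a minimal set of standard commands to survey the items listed
above (nothing here is specific to this cluster):

ceph status          # overall health, mon/mgr/osd/pg summary
ceph health detail   # which PGs are inactive or incomplete, and why
ceph osd tree        # which OSDs are down or out
ceph mgr stat        # whether an active mgr exists at all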
CephFS pools are uncritical, because CephFS splits very large files into
chunks of object size. The RGW pool is the problem, because RGW does not, as far
as I know. A few 1 TB uploads and you have a problem.
The calculation is confusing, because the term PG is used in two different
meanings, unfo
This does not explain incomplete and inactive PGs. Are you hitting
https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover
from OSD restart")? In that case, temporarily stopping and restarting all new
OSDs might help.
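A hedged sketch of such a restart, assuming systemd-managed (non-containerized)
OSDs and that the newly added OSDs are ids 10-19; the ids are placeholders:

ceph osd set noout                                   # avoid marking OSDs out during the restart
for id in {10..19}; do systemctl stop ceph-osd@$id; done
sleep 60
for id in {10..19}; do systemctl start ceph-osd@$id; done
ceph osd unset noout
ceph pg stat                                         # check whether the inactive PGs recover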
Best regards,
=
Frank Schilder
AIT Risø
It will prevent OSDs from being marked out if you shut them down or the .
Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get
overloaded by peering traffic and the MONs can lose contact for a while.
Setting noout will prevent flapping and also reduce the administrati
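For completeness, setting and clearing the flag is just (standard commands):

ceph osd set noout     # before the peering-heavy change
ceph osd unset noout   # once the cluster has settled again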
He he.
> It will prevent OSDs from being marked out if you shut them down or the .
... down or the MONs lose heartbeats due to high network load during peering.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Hi,
I was so anxious a few hours ago because the SST files were growing so fast
and I didn't think
the space on the mon servers could accommodate it.
Let me tell it from the beginning. I have a cluster with OSDs deployed on
SATA (7200 rpm) drives.
10 TB each OSD, and I used an EC pool for more space. I added new OSDs into th
I then followed someone's guidance, added 'mon compact on start = true' to the
config, and restarted one mon.
That mon did not join the cluster until I added two mons deployed on
virtual machines with SSDs into
the cluster.
And now the cluster is fine except for the PG status.
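As an aside, compaction can also be triggered online and the store size watched
while it runs; the mon id and path below are examples, not taken from this
cluster:

ceph tell mon.ceph03 compact                     # trigger RocksDB compaction without a restart
du -sh /var/lib/ceph/mon/ceph-ceph03/store.db    # watch the mon store size shrink (or not)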
On 2020-10-29 01:26, Ki Wong wrote:
> Hello,
>
> I am at my wit's end.
>
> So I made a mistake in the configuration of my router and one
> of the monitors (out of 3) dropped out of the quorum and nothing
> I’ve done allows it to rejoin. That includes reinstalling the
> monitor with ceph-ansible.
On the machines with the RADOS Gateways, there is also an HAProxy running
(and it does HTTPS->HTTP conversion).
I have tried it both ways already:
on port 443 (resolves to the external IP)
and
on the internal port (with a hosts entry to the internal IP; on
the machine, where the ceph-mgr
I am trying to configure the cloud sync module in my Ceph cluster to implement
backups to AWS S3. I could not figure out how to configure it using the
available documentation. Can someone help me implement this?
Thanks,
Sailaja
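Not an authoritative recipe, but per the cloud sync module documentation the
outline is a zone with tier type 'cloud' plus tier-config pointing at the S3
endpoint; the zone name, endpoint and keys below are placeholders and should be
checked against the docs for your release:

radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=cloud-sync --tier-type=cloud
radosgw-admin zone modify --rgw-zone=cloud-sync \
    --tier-config=connection.endpoint=https://s3.amazonaws.com,connection.access_key=<key>,connection.secret=<secret>
radosgw-admin period update --commit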
We hit this issue over the weekend on our HDD-backed EC Nautilus cluster while
removing a single OSD. We also did not have any luck using compaction. The
mon-logs filled up our entire root disk on the mon servers and we were running
on a single monitor for hours while we tried to finish recovery
I think you really need to sit down and explain the full story. Dropping
one-liners with new information will not work via e-mail.
I have never heard of the problem you are facing, so you did something that
possibly no-one else has done before. Unless we know the full history from the
last time
Hi,
you could lower the recovery settings to the default and see if that helps:
osd_max_backfills = 1
osd_recovery_max_active = 3
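One way to apply those values, as a sketch (either form should work on recent
releases):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
# or, to change them at runtime only:
ceph tell osd.* injectargs '--osd_max_backfills=1 --osd_recovery_max_active=3'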
Regards,
Eugen
Quoting Kamil Szczygieł:
Hi,
We're running Octopus and we have 3 control plane nodes (12 core, 64
GB memory each) that are running mon, mds an
On Thu, Oct 29, 2020 at 9:26 AM Ml Ml wrote:
>
> Hello,
> I played around with some log level I can't remember and my logs are
> now getting bigger than my DVD-Movie collection.
> E.g.: journalctl -b -u
> ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service >
> out.file is 1.1 GB big.
>
> I
Hi,
I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.
Situation:
- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack
We performed a test where we turned off the Top-of-Rack network for each
rack. This worked as expected with
Hi Wido,
Could it be one of these?
mon osd min up ratio
mon osd min in ratio
36/120 is 0.3 so it might be one of those magic ratios at play.
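A quick way to check what they are set to, assuming the mon id matches the
short hostname (an assumption, adjust as needed); run on a mon host:

ceph daemon mon.$(hostname -s) config get mon_osd_min_up_ratio
ceph daemon mon.$(hostname -s) config get mon_osd_min_in_ratio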
Cheers,
Dan
On Thu, 29 Oct 2020, 18:05 Wido den Hollander, wrote:
> Hi,
>
> I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
>
Thanks, David.
I just double checked and they can all connect to one another,
on both v1 and v2 ports.
-kc
> On Oct 29, 2020, at 12:41 AM, David Caro wrote:
>
> On 10/28 17:26, Ki Wong wrote:
>> Hello,
>>
>> I am at my wit's end.
>>
>> So I made a mistake in the configuration of my router an
Hi Alex,
We found that there were a huge number of keys in the "logm" and "osdmap"
tables
while using ceph-monstore-tool. I think that could be the root cause.
Well, some pages also say that disabling the 'insight' module can resolve this
issue, but
I checked our cluster and we didn't enable this module
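For anyone who wants to reproduce that check, a hedged sketch (the store path
is an example; run ceph-monstore-tool against a stopped mon or a copy of the
store, and note the mgr module is spelled 'insights' on recent releases):

ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys | awk '{print $1}' | sort | uniq -c
ceph mgr module ls | grep -i insight     # check whether the module is enabled at all
ceph mgr module disable insights         # only if it actually is enabled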
Hi,
We're running Octopus and we have 3 control plane nodes (12 core, 64 GB memory
each) that are running mon, mds and mgr, and also 4 data nodes (12 core, 256 GB
memory, 13x10TB HDDs each). We've increased the number of PGs in our pool,
which resulted in all OSDs going crazy and reading the avera
Thanks for the response...
I don't have the old OSDs (and no backups, because this cluster is not so
important; it is the development cluster), so I need to delete the unknown PGs
(how can I do that?). But I don't want to wipe the whole Ceph
cluster; if I can delete the unknown and incomplete PGs, well so
Uff... now two of the OSDs are crashing with...
https://pastebin.ubuntu.com/p/qd6Tc2rpfm/
On 2020-10-29 13:11, Frank Schilder wrote:
... I will now use only one site, but first need to stabilize the
cluster to remove the EC erasure coding and use replication ...
If you change to one site only, th
Typically, the number of nodes is 2n+1 to cover n failures.
It's OK to have 4 nodes; from a failure-coverage POV, it's the same
as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the
cluster is down. It works, it just doesn't make much sense.
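To spell out the arithmetic behind that (a general note, not specific to any
cluster here): a monitor quorum needs a strict majority, i.e. floor(n/2) + 1
monitors, so the number of failures tolerated is n - (floor(n/2) + 1):

n = 3 -> quorum 2 -> tolerates 1 failure
n = 4 -> quorum 3 -> tolerates 1 failure
n = 5 -> quorum 3 -> tolerates 2 failures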
Thanks!
Tony
> -----Original Message-----
> From: Marc R
Hello,
I played around with some log level I can't remember and my logs are
now getting bigger than my DVD-Movie collection.
E.g.: journalctl -b -u
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service >
out.file is 1.1 GB big.
I already tried:
ceph tell mon.ceph03 config set debug_mon 0/1
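A hedged sketch of what to try next; the daemon name matches the one above, the
admin socket command has to be run where that mon runs (inside the container if
containerized), and the journald step assumes you can discard archived journal
data:

ceph daemon mon.ceph03 config diff | grep -A3 debug     # which debug_* options differ from default
ceph tell mon.ceph03 config set debug_ms 0/0            # messenger logging is a common source of huge logs
journalctl --vacuum-size=500M                           # shrink the existing journal on disk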
On Thu, 29 Oct 2020 at 20:16, Tony Liu wrote:
> Typically, the number of nodes is 2n+1 to cover n failures.
> It's OK to have 4 nodes; from a failure-coverage POV, it's the same
> as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the
> cluster is down. It works, it just doesn't make much sense.
> ... I will now use only one site, but first need to stabilize the
> cluster to remove the EC erasure coding and use replication ...
If you change to one site only, there is no point in getting rid of the EC
pool. Your main problem will be restoring the lost data. Do you have backup of
everything? D
On 2020-10-29 06:55, Mark Johnson wrote:
> I've been struggling with this one for a few days now. We had an OSD report
> as near full a few days ago. Had this happen a couple of times before and a
> reweight-by-utilization has sorted it out in the past. Tried the same again
> but this time we
Can anyone help? BlueFS mount failed after a long time.
The error message:
2020-10-30 05:33:54.906725 7f1ad73f5e00 1 bluefs add_block_device bdev 1
path /var/lib/ceph/osd/ceph-30/block size 7.28TiB
2020-10-30 05:33:54.906758 7f1ad73f5e00 1 bluefs mount
2020-10-30 06:00:32.881850 7f1ad73f5e00 -1 ***
Hi:
I have this ceph status:
-
cluster:
id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
health: HEALTH_WARN
noout flag(s) set
1 osds down
Reduced data availability: 191 pgs inactiv
Hi,
I have not tried it, but maybe this will help with the unknown PGs, if you don’t
care about data loss.
ceph osd force-create-pg
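If it helps, a usage sketch (the PG id is a placeholder; this recreates the PG
empty, so whatever data it held is gone for good):

ceph health detail | grep unknown                      # list the PG ids stuck unknown
ceph osd force-create-pg 2.1f --yes-i-really-mean-it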
On 30 October 2020, at 10:46, Ing. Luis Felipe Domínguez Vega wrote:
Hi:
I have this ceph status:
---
Great, and thanks. I fixed all the unknown PGs with the command; now the
incomplete, down, etc. remain.
On 2020-10-29 23:57, 胡 玮文 wrote:
Hi,
I have not tried it, but maybe this will help with the unknown PGs, if
you don’t care about data loss.
ceph osd force-create-pg
On 30 October 2020, at 10:46, Ing. Luis Fel
Hi List,
After a successful upgrade from Mimic 13.2.8 to Nautilus 14.2.12 we
enabled msgr2. Soon after that both of the MDS servers (active /
active-standby) restarted.
We did not hit any ASSERTS this time, so that's good :>.
However, I have not seen this happening on four different test cluster
Hi:
I tried to get info from an RBD image but:
-
root@fond-beagle:/# rbd list --pool cinder-ceph | grep
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
root@fond-beagle:/# rbd info --p