[ceph-users] orch adoption and disk encryption without cephx?

2024-08-18 Thread Boris
Hi,

I have some legacy clusters that I cannot move to cephx due to customer
workload. (We have a plan to move everything iteratively to a new cluster,
but that might still take a lot of time.)

I would like to adopt the orchestrator and use the Ceph disk encryption
feature. Is this possible without using cephx? It is an RBD-only workload.

I ask because I remember that RGW started requiring cephx after some
update, because there might be certificates in the mon DB, and it took me
some time to find the problem and the solution back then. I would like to
avoid running into something similar with the RBD workload :)

Cheers
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Identify laggy PGs

2024-08-18 Thread Boris
Good to know. Everything is BlueStore and usually 5 spinners share an SSD
for block.db.
Memory should not be a problem. We plan with 4 GB per OSD and a minimum of
256 GB of memory.

The primary affinity is a nice idea. I had only thought about it for our S3
cluster, because the index is on SAS and SATA SSDs and I use the SAS drives as
primaries and the SATA drives only for replication.
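
For reference, both knobs discussed here (and in Anthony's reply below) would
look roughly like this; the OSD IDs are made up for illustration:

  # raise the hard cap on PGs per OSD
  ceph config set global mon_max_pg_per_osd 1000
  # prefer the faster OSD as primary, de-prefer the slower one (weights are 0-1)
  ceph osd primary-affinity 12 1.0
  ceph osd primary-affinity 37 0.25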

On Sat, Aug 17, 2024 at 15:23, Anthony D'Atri <
a...@dreamsnake.net> wrote:

>
> Mostly when they’re spinners.  Especially back in the Filestore days with
> a colocated journal.  Don’t get me started on that.
>
> Too many PGs can exhaust RAM if you’re tight - or using Filestore still.
>
> For a SATA SSD I’d set pg_num to average 200-300 PGs per drive.  Your size
> mix complicates things, though, because the larger OSDs will get many more than
> the smaller ones.   Be sure to set mon_max_pg_per_osd to something like 1000.
>
> You might experiment with primary affinity, so that the smaller OSDs
> are more likely to be primaries and thus will get more load.  I’ve seen a
> first-order approximation here increase read throughput by 20%.
>
-- 
The "UTF-8 Problems" self-help group will, as an exception, meet in the
large hall this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS troubleshooting

2024-08-18 Thread Eugenio Tampieri
Hello,
I'm writing to troubleshoot an otherwise functional Ceph Quincy cluster that
has issues with CephFS.
I cannot mount it with ceph-fuse (it gets stuck), and if I mount it over NFS I
can list the directories but I cannot read or write anything.
Here's the output of ceph -s:
  cluster:
    id:     3b92e270-1dd6-11ee-a738-000c2937f0ec
    health: HEALTH_WARN
            mon ceph-storage-a is low on available space
            1 daemons have recently crashed
            too many PGs per OSD (328 > max 250)

  services:
    mon:        5 daemons, quorum ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d (age 105m)
    mgr:        ceph-storage-a.ioenwq (active, since 106m), standbys: ceph-mon-a.tiosea
    mds:        1/1 daemons up, 2 standby
    osd:        4 osds: 4 up (since 104m), 4 in (since 24h)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw:        2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 481 pgs
    objects: 231.83k objects, 648 GiB
    usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
    pgs:     481 active+clean

  io:
    client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
Best regards,

Eugenio Tampieri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: weird outage of ceph

2024-08-18 Thread Anthony D'Atri


> 
> You may want to look into https://github.com/digitalocean/pgremapper to get 
> the situation under control first.
> 
> --
> Alex Gorbachev
> ISS

Not a bad idea.

>> We had a really weird outage of Ceph today and I wonder how it came about.
>> The problem seems to have started around midnight. I still need to check whether
>> it was already at the extent I found it in this morning or if it grew more
>> gradually, but when I found it, several OSD servers had most or all OSD
>> processes down, to the point where our EC 8+3 buckets didn't work anymore.

Look at your metrics and systems.  Were the OSDs OOMkilled?
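
For example, a quick check for OOM kills on an OSD host (generic Linux
commands; the systemd unit name is just an example, cephadm deployments use
ceph-<fsid>@osd.N instead):

  dmesg -T | grep -i -E 'out of memory|oom-killer'
  journalctl -k --since '2 days ago' | grep -i oom
  journalctl -u ceph-osd@12 --since '2 days ago' | grep -i -E 'killed|oom'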

>> I see some of our OSDs are coming close to (but not quite) 80-85% full.
>> There are many times when I've seen an overfull error lead to cascading and 
>> catastrophic failures. I suspect this may have been one of them.

One can (temporarily) raise the backfillfull / full ratios to help get out of a 
bad situation, but leaving them raised can lead to an even worse situation 
later.
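
For illustration, a temporary bump and later restore might look like this
(the exact values are a judgment call; the defaults are 0.85 / 0.90 / 0.95):

  ceph osd set-backfillfull-ratio 0.92
  ceph osd set-full-ratio 0.97
  # ... let backfill drain the overfull OSDs, then put the defaults back ...
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95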


>> Which brings me to another question: why is our balancer doing so badly at 
>> balancing the OSDs?

There are certain situations where the bundled balancer is confounded, 
including CRUSH trees with multiple roots.  Subtly, that may include a cluster 
where some CRUSH rules specify a deviceclass and some don’t, as with the .mgr 
pool if deployed by Rook.  That situation confounds the PG autoscaler for sure. 
 If this is the case, consider modifying CRUSH rules so that all specify a 
deviceclass, and/or simplifying your CRUSH tree if you have explicit multiple 
roots.
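
A couple of read-only commands to check for that kind of mix (a rule that
takes a device class shows a shadow root such as default~hdd or default~ssd
in its take step):

  ceph osd crush rule dump | grep -E 'rule_name|item_name'
  ceph osd crush tree --show-shadow | head -20
  ceph balancer status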

>> It's configured with upmap mode and it should work great with the amount of 
>> PGs per OSD we have

Which is?

>> , but it is letting some OSDs reach 80% full while others are not yet 50% full 
>> (we're just over 61% full in total).
>> 
>> The current health status is:
>> HEALTH_WARN Low space hindering backfill (add storage if this doesn't 
>> resolve itself): 1 pg backfill_toofull 
>> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this 
>> doesn't resolve itself): 1 pg backfill_toofull 
>>pg 30.3fc is active+remapped+backfill_wait+backfill_toofull, acting 
>> [66,105,124,113,89,132,206,242,179]
>> 
>> I've started reweighting again, because the balancer is not doing its job 
>> in our cluster for some reason...

Reweighting … are you doing “ceph osd crush reweight”, or “ceph osd reweight / 
reweight-by-utilization”? The latter, in conjunction with pg-upmap, confuses the 
balancer.  If that’s the situation you have, I might:

* Use pgremapper or Dan’s 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
 to freeze the PG mappings temporarily.
* Temporarily raise the backfillfull/full ratios for some working room, say to
95 / 98 %.
* One at a time, reset the override reweights to 1.0.  No data should move.
* Remove the manual upmaps one at a time, in order of PGs on the most-full OSDs.
You should see a brief spurt of backfill.
* Rinse, lather, repeat.
* This should progressively get you to a state where you no longer have any 
old-style override reweights, i.e. all OSDs have 1.0 for that value.
* Proceed with removing the remaining manual upmaps one or a few at a time.
* The balancer should work now.
* Set the ratios back to the default values.
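
Roughly, in commands — the OSD ID, PG ID and ratio values below are only
illustrative (taken from your output), and upmap-remapped.py's invocation
should be checked against its own README before piping anything to a shell:

  # freeze current placements as upmaps so nothing moves yet
  ./upmap-remapped.py | sh
  # temporary working room
  ceph osd set-backfillfull-ratio 0.95
  ceph osd set-full-ratio 0.98
  # clear old-style override reweights, one OSD at a time
  ceph osd reweight 66 1.0
  # then drop the freezing upmaps a few at a time, most-full OSDs first
  ceph osd rm-pg-upmap-items 30.3fc
  # once all reweights are 1.0 and the manual upmaps are gone
  ceph balancer on
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95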


>> 
>> Below is our dashboard overview, you can see the start and recovery in the 
>> 24h graph...
>> 
>> Cheers
>> 
>> /Simon
>> 

>> 
>> 
>> --
>> I'm using my gmail.com address because the gmail.com dmarc policy is "none".
>> Some mail servers will reject this (Microsoft?); others will instead allow it when I send mail to a 
>> mailing list which has not yet been configured to send mail "on behalf of" 
>> the sender, but rather does a kind of "forward". The latter situation causes 
>> dkim/dmarc failures and the dmarc policy will be applied. See 
>> https://wiki.list.org/DEV/DMARC for more details.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io 
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] The snaptrim queue of PGs has not decreased for several days.

2024-08-18 Thread Giovanna Ratini

Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a Kubernetes 
environment. Last week, we had a problem with the MDS falling behind on 
trimming every 4-5 days (GitHub issue link). We resolved the issue 
using the steps outlined in the GitHub issue.


We have 3 hosts (I know, I need to increase this as soon as possible, 
and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail <fs_name>, and

ceph fs set <fs_name> joinable true,

After that, the snaptrim queue for our PGs stopped decreasing. All 
PGs of our CephFS are in either the active+clean+snaptrim_wait or the 
active+clean+snaptrim state. For example, PG 3.12 is in the 
active+clean+snaptrim state, and its snap_trimq_len was 4077 
yesterday but has increased to 4538 today.


I increased the osd_snap_trim_priority to 10 (ceph config set osd 
osd_snap_trim_priority 10), but it didn't help. Only the PGs of our 
CephFS have this problem.


Do you have any ideas on how we can resolve this issue?

Thanks in advance,

Giovanna
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bug with Cephadm module osd service preventing orchestrator start

2024-08-18 Thread benjaminmhuth
Hey there, so I went to upgrade my Ceph cluster from 18.2.2 to 18.2.4 and 
encountered a problem with my managers. After they had been upgraded, my ceph 
orch module broke because the cephadm module would not load. This obviously 
halted the upgrade, because you can't really upgrade without the orchestrator. 
Here are the logs related to why the cephadm module fails to start:

https://pastebin.com/SzHbEDVA

and the relevant part here:

"backtrace": [
  " File \"/usr/share/ceph/mgr/cephadm/module.py\", line 591, in __init__\n self.to_remove_osds.load_from_store()",
  " File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 918, in load_from_store\n osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
  " File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 783, in from_json\n return cls(**inp)",
  "TypeError: __init__() got an unexpected keyword argument 'original_weight'"
]

Unfortunately, I am at a loss as to what passes the original_weight argument here. 
I have attempted to roll back to 18.2.2 and successfully redeployed a 
manager of that version, but it has the same issue with the cephadm 
module. I believe this may be because I recently started several OSD drains and 
then canceled them, causing this to manifest once the managers restarted.

I've attempted to dig into this myself a bit and found the associated function 
in the ceph repo: 
https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779
My issue may also have to do with this PR: 
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
I unfortunately do not have enough familiarity with the source to see where 
these values are being set, or to know what keys to remove from the ceph 
config. Hopefully someone who knows about this will see this and take a look 
for me!
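
For reference, the direction I'm considering (completely untested, and
assuming cephadm keeps this queue in the mgr config-key store under
mgr/cephadm/osd_remove_queue — please correct me if that's wrong):

  # back up and inspect the stored removal queue
  ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json
  # either strip the 'original_weight' fields from the JSON and put it back ...
  ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.fixed.json
  # ... or, if the queued removals are no longer needed, drop the key entirely
  ceph config-key rm mgr/cephadm/osd_remove_queue
  # then restart the orchestrator
  ceph mgr fail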

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Data recovery after resharding mishap

2024-08-18 Thread Gauvain Pocentek
Hello list,

We have made a mistake and dynamically resharded a bucket in a multi-site
RGW setup running Quincy (support for this was only added in Reef). So we
now have ~200 million objects still stored in the RADOS cluster, but
completely removed from the bucket index (basically Ceph has created a new
index for the bucket).

We would really like to recover these objects, but we are facing a few
issues with our ideas. Any help would be appreciated.

The main problem we face is that the RADOS objects contain binary data
where we expected JSON data. RGW is configured to use zlib compression,
so we think that may be the reason (although using zlib to decompress doesn't
work). Has anyone already faced this and managed to recover the data from
RADOS objects?

Another idea we have is to update the new bucket index to inject the old
data. This looks possible as the object marker hasn't changed. We also have
access to the old index objects/shards, so we could get all the omap key/value
pairs and inject them into the new index. Has anyone been mad enough to try
this, by any chance?
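
To make the idea concrete, the read side would look something like this. The
pool name and the .dir.<bucket_id>.<shard> index object naming are assumptions
based on a default zone layout; writing the keys back into the new shards
would need a small librados script, since the omap values are binary-encoded
cls_rgw entries:

  # find the bucket id / marker
  radosgw-admin metadata get bucket:BUCKET_NAME
  # dump keys and values from one old index shard
  rados -p default.rgw.buckets.index listomapkeys .dir.OLD_BUCKET_ID.0 | head
  rados -p default.rgw.buckets.index listomapvals .dir.OLD_BUCKET_ID.0 > shard0.dump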

Any other idea to recover the data would help of course.

Thank you!

Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-18 Thread Eugen Block
Can you share the current ceph status? Are the OSDs reporting anything  
suspicious? How is the disk utilization?
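
For example (read-only; the PG ID is the one from your mail, and I believe
recent releases accept state filters for "ceph pg ls"):

  ceph -s
  ceph osd df tree
  ceph pg ls snaptrim snaptrim_wait | head
  ceph pg 3.12 query | grep -E 'snap_trimq_len|objects_trimmed|snaptrim_duration'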


Quoting Giovanna Ratini:


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


On 17.08.2024 at 14:37, Giovanna Ratini wrote:

Hello again,

I checked the PG dump. The snapshot trim queue keeps growing.

Query for PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

    "snap_trimq_len": 5421,
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query for PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

    "snap_trimq_len": 5741,
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know of a way to see whether the snaptrim "process" is working?

Best Regard

Gio


On 17.08.2024 at 12:59, Giovanna Ratini wrote:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing  
has changed.


OK, I deactivated the snapshot schedule: ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Regards,

Gio





On 17.08.2024 at 10:12, Eugen Block wrote:

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not  
correct. You could also temporarily disable snapshots to see if  
things settle down.


Quoting Giovanna Ratini:


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a  
Kubernetes environment. Last week, we had a problem with the MDS  
falling behind on trimming every 4-5 days (GitHub issue link).  
We resolved the issue using the steps outlined in the GitHub  
issue.


We have 3 hosts (I know, I need to increase this as soon as  
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail <fs_name>, and

ceph fs set <fs_name> joinable true,

After that, the snaptrim queue for our PGs has stopped  
decreasing. All PGs of our CephFS are in either  
active+clean+snaptrim_wait or active+clean+snaptrim states. For  
example, the PG 3.12 is in the active+clean+snaptrim state, and  
its snap_trimq_len was 4077 yesterday but has increased to 4538  
today.


I increased the osd_snap_trim_priority to 10 (ceph config set  
osd osd_snap_trim_priority 10), but it didn't help. Only the PGs  
of our CephFS have this problem.


Do you have any ideas on how we can resolve this issue?

Thanks in advance,
Giovanna
p.s. I'm not a ceph expert :-).
Faulkener asked me for more information, so here it is:
MDS Memory: 11GB
mds_cache_memory_limit: 11,811,160,064 bytes

root@kube-master02:~# ceph fs snap-schedule status /
{
    "fs": "rook-cephfs",
    "subvol": null,
    "path": "/",
    "rel_path": "/",
    "schedule": "3h",
    "retention": {"h": 24, "w": 4},
    "start": "2024-05-05T00:00:00",
    "created": "2024-05-05T17:28:18",
    "first": "2024-05-05T18:00:00",
    "last": "2024-08-15T18:00:00",
    "last_pruned": "2024-08-15T18:00:00",
    "created_count": 817,
    "pruned_count": 817,
    "active": true
}
I do not understand whether the snapshots in the PGs are correlated  
with the snapshots on CephFS. Until we encountered the issue  
with the "MDS falling behind on trimming every 4-5 days," we  
didn't have any problems with snapshots.


Could someone explain this to me or point me to the documentation?
Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: ceph device ls missing disks

2024-08-18 Thread Alfredo Rezinovsky
No, I don't monitor drives in non-Ceph nodes. All my non-Ceph nodes are
disposable. Only Ceph has data I can't lose.

On Thu, Aug 15, 2024 at 20:49, Anthony D'Atri ()
wrote:

> Do you monitor OS drives on non-Ceph nodes?
>
>
>
> > On Aug 15, 2024, at 8:17 AM, Alfredo Rezinovsky 
> wrote:
> >
> > With: "ceph device ls" I can see which physical disks are used by which
> > daemons.
> >
> > OSD data and wal disks are ok.
> >
> > The MONs appear as using the operating system disks, or the disks where
> > /var/lib/ceph is mounted, which is also OK.
> >
> > The problem is that the operating system disks of non-monitor nodes are missing
> > when they are really being used by all the daemons on the node.
> >
> > The problem with this is that there is no monitoring for those disks, even
> > though they are essential and are failure points for all the daemons running
> > on those nodes.
> >
> > I think the OS devices should appear as used by all the daemons that keep
> > essential information on those disks.
> >
> > --
> > Alfrenovsky
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid release codename

2024-08-18 Thread Alfredo Rezinovsky
Sorry for initiating this.

On Sat, Aug 17, 2024 at 10:11, Anthony D'Atri ()
wrote:

> > It's going to wreak havoc on search engines that can't tell when
> > someone's looking up Ceph versus the long-established Squid Proxy.
>
> Search engines are way smarter than that, and I daresay that people are
> far more likely to search for “Ceph” or “Ceph squid" than for “squid” alone
> looking for Ceph.
>
>
> > I don’t know how many more (sub)species there are to start over from A
> (the first release was Argonaut)
>
> Ammonite is a natural, and two years later we *must* release Cthulhu.
>
> Cartoon names run some risk of trademark issues.
>
> > ...  that said, naming a *release* of a software with the name of
> > well known other open source software is pure crazyness.
>
> I haven’t seen the web cache used in years — maybe still in Antarctica?
> These are vanity names for fun.  I’ve found that more people know the
> numeric release they run than the codename anyway.
>
> > What's coming next? Ceph Redis? Ceph Apache? Or Apache Ceph?
>
> Since you mention Apache, their “Spark” is an overload.  And Apache itself
> is cultural appropriation but that’s a tangent.
>
> When I worked for Advanced Micro Devices we used the Auto Mounter Daemon
>
> I’ve also used AMANDA for backups, which was not a Boston song.
>
> Let’s not forget Apple’s iOS and Cisco’s IOS.
>
> Ceph Octopus, and this cable
> https://usb.brando.com/usb-octopus-4-port-hub-cable_p999c39d15.html and
> of course this one https://www.ebay.com/itm/110473961774
>
> The first Ceph release named after Jason’s posse.
> Bobcat colliding with skid-loaders and Goldthwaite
> Dumpling and gyoza
> Firefly and the Uriah Heep album (though Demons & Wizards was better)
> Giant and the Liz Taylor movie (and grocery store)
> Hammer and Jan
> Jewel and the singer
> Moreover, Ceph Nautilus:
> Korg software
> Process engineering software
> CMS
> GNOME file manager
> Firefox and the Clint Eastwood movie
> Chrome and the bumper on a 1962 Karmann Ghia
> Slack and the Linux distribution
>
> When I worked for Cisco, people thought I was in food service.  Namespaces
> are crowded.  Overlap happens.  Context resolves readily.
>
> Within the Cephapod scheme we’ve used Octopus and Nautilus, to not use
> Squid would be odd.  And Shantungendoceras doesn’t roll off the tongue.
>
>
>
> “What’s in a name?”  - Shakespeare
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory leak in mds?

2024-08-18 Thread Frédéric Nass
Hi Dario,

A workaround may be to downgrade the client's kernel or ceph-fuse to a version 
lower than those listed in Enrico's comment #22, I believe.
I can't say for sure, though, since I couldn't verify it myself.
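
To see which client versions are currently connected, something like this
should show it (the exact mds target syntax depends on your deployment,
daemon name or rank):

  ceph tell mds.0 session ls | grep -E 'kernel_version|ceph_version' | sort | uniq -c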

Cheers,
Frédéric.


From: Dario Graña 
Sent: Friday, August 16, 2024 16:41
To: ceph-users
Subject: [ceph-users] memory leak in mds?

Hi all,
We’re experiencing an issue with CephFS. I think we are facing this issue.
The main symptom is that the MDS starts using a lot of memory within a few
minutes and finally gets killed by the OS (out of memory). Sometimes it
happens once a week and sometimes twice a day. We are running Ceph Quincy
17.2.7 on both the cluster and the clients. I have read through some emails
on the mailing list about it, but I didn't find a workaround. Does anyone
have any suggestions?
Thanks in advance.

-- 
Dario Graña 
PIC (Port d'Informació Científica) 
Campus UAB, Edificio D 
E-08193 Bellaterra, Barcelona 
http://www.pic.es 
Avis - Aviso - Legal Notice: http://legal.ifae.es 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory leak in mds?

2024-08-18 Thread Venky Shankar
[cc Xiubo]

On Fri, Aug 16, 2024 at 8:10 PM Dario Graña  wrote:
>
> Hi all,
> We’re experiencing an issue with CephFS. I think we are facing this issue
> . The main symptom is that the MDS
> starts using a lot of memory within a few minutes and finally it gets
> killed by OS (Out Of Memory). Sometimes it happens once a week and
> sometimes 2 times a day. We are running ceph quincy 17.2.7 on both the
> cluster and clients. I have read through some emails on the mailing list
> about it, but I didn't find a workaround. Does anyone have any suggestions?

It is likely that you are running into the issue described in the
mentioned tracker. The change is pending backport to Quincy as of now,
so the alternative approach for now might be downgrading the client.

> Thanks in advance.
>
> --
> Dario Graña
> PIC (Port d'Informació Científica)
> Campus UAB, Edificio D
> E-08193 Bellaterra, Barcelona
> http://www.pic.es
> Avis - Aviso - Legal Notice: http://legal.ifae.es
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io