[ceph-users] Re: Odd RBD stats metrics

2025-07-28 Thread Joshua Baergen
Hi Chris, Assuming that the scrape period for Prometheus is set to 1 minute, you could simply be racing against the scrape. Usually it's not a good idea to create range vectors with the same time range as the scrape period. Given that you're using irate(), you could increase that to [2m] or higher and s
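
A minimal sketch of the suggested adjustment, assuming the mgr prometheus module's per-image counter ceph_rbd_write_ops and a local Prometheus endpoint (both are placeholders, not taken from the thread):

    # Use a range at least twice the scrape interval so irate() always
    # has two samples to work with (metric name and URL are assumptions):
    promtool query instant http://localhost:9090 'irate(ceph_rbd_write_ops[2m])'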

[ceph-users] Re: HELP! Cluster usage increased after adding new nodes/osd's

2025-07-21 Thread Joshua Baergen
Hello, Any chance that these OSDs were deployed with different bluestore_min_alloc_size settings? Josh On Mon, Jul 7, 2025 at 2:39 PM mhnx wrote: > > Hello Stefan! > > All of my nodes and clients = Octopus 15.2.14 > > I have 1x RBD pool and 2000x rbd volumes with 100Gb / each > > > This is upma
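
A quick way to compare what each OSD was built with, assuming the running release exposes bluestore_min_alloc_size in OSD metadata (older releases such as Octopus may not report it there):

    # Dump metadata for all OSDs and pull out the allocation unit each
    # one was created with; field availability varies by release.
    ceph osd metadata | grep -E '"id"|"bluestore_min_alloc_size"'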

[ceph-users] Re: ceph-osd/bluestore using page cache

2025-03-17 Thread Joshua Baergen
Hey Brian, The setting you're looking for is bluefs_buffered_io. This is very much a YMMV setting, so it's best to test with both modes, but I usually recommend turning it off for all but omap-intensive workloads (e.g. RGW index) due to it causing writes to tend to be split up into smaller pieces.
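
A sketch of flipping the setting and checking what a given OSD is actually running with (osd.0 is just an example id; depending on the release, a restart may be needed for the change to apply):

    # Disable buffered BlueFS I/O cluster-wide, then verify on one OSD:
    ceph config set osd bluefs_buffered_io false
    ceph daemon osd.0 config get bluefs_buffered_io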

[ceph-users] Re: Understanding how crush works

2025-01-27 Thread Joshua Baergen
Hey Andre, Clients actually have access to more information than just the crushmap, which includes temporary PG mappings generated when a backfill is pending, as well as upmap items which override CRUSH's placement decision. You can see these in "ceph osd dump", for example. Josh On Mon, Jan 27,
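
For example, both kinds of overrides show up directly in the osdmap dump:

    # pg_temp entries cover pending backfills; pg_upmap_items are the
    # balancer/operator overrides of CRUSH's placement decisions.
    ceph osd dump | grep -E 'pg_temp|pg_upmap'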

[ceph-users] Re: radosgw daemons with "stuck ops"

2025-01-27 Thread Joshua Baergen
Hey Reid, This sounds similar to what we saw in https://tracker.ceph.com/issues/62256, in case that helps with your investigation. Josh On Mon, Jan 27, 2025 at 8:07 AM Reid Guyett wrote: > > Hello, > > We are experiencing slowdowns on one of our radosgw clusters. We restart > the radosgw daemon

[ceph-users] Re: Snaptriming speed degrade with pg increase

2025-01-14 Thread Joshua Baergen
Hey Istvan, > Quick update on this topic, seems to be the solution for us to offline compact all osds. > After that all snaptrimming can finish in an hour rather than a day. Ah, this might be tombstone accumulation, then. You'd probably benefit from going to at least latest Pacific, enabling r
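
A sketch of the offline compaction being described, assuming a package-based deployment with the default data path (osd.0 as an example; unit names and paths differ under cephadm):

    # The OSD must be stopped for offline compaction:
    systemctl stop ceph-osd@0
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0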

[ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience

2025-01-13 Thread Joshua Baergen
Note that 'norebalance' disables the balancer but doesn't prevent backfill; you'll want to set 'nobackfill' as well. Josh On Sun, Jan 12, 2025 at 1:49 PM Anthony D'Atri wrote: > > [ ed: snag during moderation (somehow a newline was interpolated in the > Subject), so I’m sending this on behalf o
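
A sketch of pausing data movement around a large CRUSH change with both flags set:

    # Pause movement before editing the crushmap...
    ceph osd set norebalance
    ceph osd set nobackfill
    # ...and release it once the new map is in place:
    ceph osd unset nobackfill
    ceph osd unset norebalance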

[ceph-users] Re: Slow initial boot of OSDs in large cluster with unclean state

2025-01-10 Thread Joshua Baergen
> > FWIW, having encountered these long-startup issues many times in the past on both HDD and QLC OSDs, I can pretty confidently say that throwing flash at the problem doesn't make it go away. Fewer issues with DB IOs contending with client IOs, but flapping can still occur during P

[ceph-users] Re: Slow initial boot of OSDs in large cluster with unclean state

2025-01-09 Thread Joshua Baergen
> I'm wondering about the influence of WAL/DBs collocated on HDDs on OSD creation time, OSD startup time, peering and osdmap updates, and the role it might play regarding flapping, when DB IOs compete with client IOs, even with 100% active+clean PGs. FWIW, having encountered these long-st

[ceph-users] Re: MONs not trimming

2024-12-17 Thread Joshua Baergen
I think it was mentioned elsewhere in this thread that there are limitations to what upmap can do, especially in significant crush map change situations. It can't violate crush rules (mon-enforced), and if the same OSD shows up multiple times in a backfill then upmap can't deal with it. Creeping b

[ceph-users] Re: MONs not trimming

2024-12-17 Thread Joshua Baergen
Hey Janek, Ah, yes, we ran into that invalid json output in https://github.com/digitalocean/ceph_exporter as well. I have a patch I wrote for ceph_exporter that I can port over to pgremapper (that does similar to what your patch does). Josh On Tue, Dec 17, 2024 at 9:38 AM Janek Bevendorff wrote

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-18 Thread Joshua Baergen
Hi Frank, > Does this setting affect PG removal only or is it affecting other operations as well? Essentially: can I leave it at its current value or should I reset it to default? Only PG removal, which is why we set it high enough that it effectively disables that process. Josh

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-17 Thread Joshua Baergen
Ah yes, if you see disk read IOPS going up and up on those draining OSDs then you might be having issues with older PG deletion logic interacting poorly with rocksdb tombstones. Josh On Thu, Oct 17, 2024 at 8:13 AM Eugen Block wrote: > > Hi Frank, > > how high is the disk utilization? We see thi

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-17 Thread Joshua Baergen
Is this a high-object-count application (S3 or small files in cephfs)? My guess is that they're going down at the end of PG deletions, where a rocksdb scan needs to happen. This scan can be really slow and can exceed heartbeat timeouts, among other things. Some improvements have been made over majo

[ceph-users] Re: 9 out of 11 missing shards of shadow object in ERC 8:3 pool.

2024-10-04 Thread Joshua Baergen
We saw this a fair bit in Nautilus, and I also suspected that there was something up with GC'd and/or deleted objects, but we never determined the cause. Notably it seemed to happen on PGs ending in 'ff' or 'fff', which was extra suspicious. We haven't seen it since Pacific. Josh On Fri, Oct 4, 2

[ceph-users] Re: High usage (DATA column) on dedicated for OMAP only OSDs

2024-09-19 Thread Joshua Baergen
Ah, yes, that's a good point - if there's backfill going on then buildup like this can happen. On Thu, Sep 19, 2024 at 10:08 AM Konstantin Shalygin wrote: > > Hi, > > On 19 Sep 2024, at 18:26, Joshua Baergen wrote: > > Whenever we've seen osdmaps not being tr

[ceph-users] Re: High usage (DATA column) on dedicated for OMAP only OSDs

2024-09-19 Thread Joshua Baergen
Whenever we've seen osdmaps not being trimmed, we've made sure that any down OSDs are out+destroyed, and then have rolled a restart through the mons. As of recent Pacific at least this seems to have reliably gotten us out of this situation. Josh On Thu, Sep 19, 2024 at 9:14 AM Igor Fedotov wrote
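
A sketch of that sequence, using osd.12 as a placeholder id (destructive; only for OSDs that are already down and slated for removal):

    # Make sure no down OSD is holding back osdmap trimming...
    ceph osd out 12
    ceph osd destroy 12 --yes-i-really-mean-it
    # ...then roll a restart through the monitors, one at a time
    # (daemon naming shown is the usual cephadm form):
    ceph orch daemon restart mon.<hostname>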

[ceph-users] Re: How to detect condition for offline compaction of RocksDB?

2024-07-19 Thread Joshua Baergen
Hey Frédéric, > Can I ask what symptoms made you interested in tombstones? Mostly poor index performance due to slow rocksdb iterators (the cause being excessive tombstone accumulation). > Do you think this phenomenon could be related to tombstones? And that enabling rocksdb_cf_compact_on_del

[ceph-users] Re: How to detect condition for offline compaction of RocksDB?

2024-07-18 Thread Joshua Baergen
Hey Aleksandr, > In the Pacific we have RocksDB column families. It will be helpful in the case of many tombstones to do resharding of our old OSDs? > Do you think It can help without rocksdb_cf_compact_on_deletion? > Or, maybe It can help much more with rocksdb_cf_compact_on_deletion? Ah, I'm
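
For reference, resharding an existing OSD's RocksDB into column families is done offline with ceph-bluestore-tool; the sharding spec below is the Pacific-era default and is shown only as an example:

    # The OSD must be stopped; be prepared to redeploy it if the reshard fails.
    systemctl stop ceph-osd@0
    ceph-bluestore-tool reshard --path /var/lib/ceph/osd/ceph-0 \
        --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"
    systemctl start ceph-osd@0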

[ceph-users] Re: How to detect condition for offline compaction of RocksDB?

2024-07-18 Thread Joshua Baergen
> And my question is: we have regular compaction that does some work. Why It doesn't help with tombstones? > Why only offline compaction can help in our case? Regular compaction will take care of any tombstones in the files that end up being compacted, and compaction, when triggered, may even f

[ceph-users] Re: How to detect condition for offline compaction of RocksDB?

2024-07-17 Thread Joshua Baergen
…generated in RGW scenario? > We have another option in our version: rocksdb_delete_range_threshold > Do you think it can be helpful? > I think our problem is raised due to massive deletion generated by the lifecycle rule of big bucket. > On 16.07.2024, 19:25, "Josh

[ceph-users] Re: How to detect condition for offline compaction of RocksDB?

2024-07-16 Thread Joshua Baergen
Hello Aleksandr, What you're probably experiencing is tombstone accumulation, a known issue for Ceph's use of rocksdb. > 1. Why can't automatic compaction manage this on its own? rocksdb compaction is normally triggered by level fullness and not tombstone counts. However, there is a feature in r
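
On releases that ship that feature, it is exposed through these OSD options (the values shown are believed to be the upstream defaults and are only illustrative; an OSD restart is typically needed for them to take effect):

    # Compact an SST file once a sliding window of 32768 keys contains
    # at least 16384 tombstones:
    ceph config set osd rocksdb_cf_compact_on_deletion true
    ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384
    ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768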

[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
> I don't think the change took effect even with updating ceph.conf, restart and a direct asok config set. target memory value is confirmed to be set via asok config get. > Nothing has helped. I still cannot break the 21 MiB/s barrier. > Does anyone have any more idea

[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
It requires an OSD restart, unfortunately. Josh On Fri, May 24, 2024 at 11:03 AM Mazzystr wrote: > > Is that a setting that can be applied runtime or does it req osd restart? > > On Fri, May 24, 2024 at 9:59 AM Joshua Baergen wrote: > > > Hey Chris,

[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
Hey Chris, A number of users have been reporting issues with recovery on Reef with mClock. Most folks have had success reverting to osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but I haven't looked at the list myself yet. Josh On Fri, May 24, 2024 at 10:55 AM Mazzystr wrot
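
A sketch of the revert, which takes effect once the OSDs are restarted:

    # Switch back to the WPQ scheduler; restart the OSDs afterwards so
    # the new op queue is picked up:
    ceph config set osd osd_op_queue wpq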

[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Joshua Baergen
> Might appropriate values vary by pool type and/or media? > On Apr 3, 2024, at 13:38, Joshua Baergen wrote: > > We've had success using osd_async_recovery_min_cost=0 to drastically reduce slow ops during index recovery.

[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Joshua Baergen
We've had success using osd_async_recovery_min_cost=0 to drastically reduce slow ops during index recovery. Josh On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham wrote: > > I am fighting an issue on an 18.2.0 cluster where a restart of an OSD which > supports the RGW index pool causes cripplin
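
A sketch of applying that workaround cluster-wide (it can also be scoped more narrowly, e.g. to a device class, if preferred):

    # 0 makes the OSD choose async recovery whenever it is eligible, so
    # client ops on the index aren't blocked behind synchronous recovery:
    ceph config set osd osd_async_recovery_min_cost 0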

[ceph-users] Re: S3 Partial Reads from Erasure Pool

2024-04-01 Thread Joshua Baergen
I think it depends what you mean by rados objects and s3 objects here. If you're talking about an object that was uploaded via MPU, and thus may comprise many rados objects, I don't think there's a difference in read behaviors based on pool type. If you're talking about reading a subset byte range

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Joshua Baergen
Personally, I don't think the compaction is actually required. Reef has compact-on-iteration enabled, which should take care of this automatically. We see this sort of delay pretty often during PG cleaning, at the end of a PG being cleaned, when the PG has a high count of objects, whether or not OS

[ceph-users] Re: Why a lot of pgs are degraded after host(+osd) restarted?

2024-03-20 Thread Joshua Baergen
Hi Jaemin, It is normal for PGs to become degraded during a host reboot, since a copy of the data was taken offline and needs to be resynchronized after the host comes back. Normally this is quick, as the recovery mechanism only needs to modify those objects that have changed while the host is dow

[ceph-users] Re: OSDs not balanced

2024-03-04 Thread Joshua Baergen
The balancer will operate on all pools unless otherwise specified. Josh On Mon, Mar 4, 2024 at 1:12 PM Cedric wrote: > > Did the balancer has enabled pools ? "ceph balancer pool ls" > > Actually I am wondering if the balancer do something when no pools are > added. > > > > On Mon, Mar 4, 2024, 1
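
For reference, an empty explicit pool list is what makes the balancer consider every pool:

    # "pool ls" returning nothing means no pool restriction is set;
    # "status" shows whether the balancer is active at all.
    ceph balancer status
    ceph balancer pool ls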

[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread Joshua Baergen
Periodic discard was actually attempted in the past: https://github.com/ceph/ceph/pull/20723 A proper implementation would probably need appropriate scheduling/throttling that can be tuned so as to balance against client I/O impact. Josh On Sat, Mar 2, 2024 at 6:20 AM David C. wrote: > > Could