[ceph-users] DB/WALL and RGW index on the same NVME

2024-04-07 Thread Lukasz Borek
Hi!

I'm working on a POC cluster setup dedicated to a backup app writing objects
via S3 (large objects, up to 1TB, transferred via multipart upload).

Initial setup is 18 storage nodes (12 HDDs + 1 NVMe card for DB/WAL) + an EC
pool. Plan is to use cephadm.

I'd like to follow good practice and put the RGW index pool on a
non-rotational drive. Question is how to do it?

   - replace a few HDDs (1 per node) with SSDs (how many? 4-6-8?)
   - reserve space on the NVMe drive on each node, create an LV-based OSD and
   let the RGW index use the same NVMe drive as DB/WAL

Thoughts?

-- 
Lukasz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Lukasz Borek
Thanks for clarifying.

So the Red Hat doc
<https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/ceph_object_gateway_for_production/index#adv-rgw-hardware-bucket-index>
is outdated?

3.6. Selecting SSDs for Bucket Indexes

> When selecting OSD hardware for use with a Ceph Object Gateway—irrespective
> of the use case—Red Hat recommends considering an OSD node that has at
> least one SSD drive used exclusively for the bucket index pool. This is
> particularly important when buckets will contain a large number of objects.
>
> A bucket index entry is approximately 200 bytes of data, stored as an
> object map (omap) in leveldb. While this is a trivial amount of data, some
> uses of Ceph Object Gateway can result in tens or hundreds of millions of
> objects in a single bucket. By mapping the bucket index pool to a CRUSH
> hierarchy of SSD nodes, the reduced latency provides a dramatic performance
> improvement when buckets contain very large numbers of objects.
>
> Important
> In a production cluster, a typical OSD node will have at least one SSD for
> the bucket index, AND at least one SSD for the journal.


Is the current utilisation what the ceph osd df command shows in the OMAP
column?

root@cephbackup:/# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0  hdd    7.39870       1.0  7.4 TiB  894 GiB  769 GiB  1.5 MiB  3.4 GiB  6.5 TiB  11.80  1.45   40      up
 1  hdd    7.39870       1.0  7.4 TiB  703 GiB  578 GiB  6.0 MiB  2.9 GiB  6.7 TiB   9.27  1.14   37      up
 2  hdd    7.39870       1.0  7.4 TiB  700 GiB  576 GiB  3.1 MiB  3.1 GiB  6.7 TiB   9.24  1.13   39      up
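
Side question: to double-check that the omap/DB data really stays on the
NVMe, my plan was to look at something like this (not sure it's the canonical
way) - spillover from the DB device back to the HDD should show up in health
detail, and the OSD metadata shows whether a dedicated DB/WAL device is
attached:

ceph health detail | grep -i spillover
ceph osd metadata 0 | grep bluefs_dedicated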





On Mon, 8 Apr 2024 at 08:42, Daniel Parkes  wrote:

> Hi Lukasz,
>
> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
> database of each osd, not on the actual index pool, so by putting DB/WALL
> on an NVMe as you mentioned, you are already configuring the index pool on
> a non-rotational drive, you don't need to do anything else.
>
> You just need to size your DB/WALL partition accordingly. For RGW/object
> storage, a good starting point for the DB/Wall sizing is 4%.
>
> Example of Omap entries in the index pool using 0 bytes, as they are
> stored in Rocksdb:
>
> # rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> file1
> file2
> file4
> file10
>
> rados df -p default.rgw.buckets.index
> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD  WR_OPS      WR  USED COMPR  UNDER COMPR
> default.rgw.buckets.index   0 B       11       0      33                   0        0         0     208  207 KiB      41  20 KiB         0 B          0 B
>
> # rados -p default.rgw.buckets.index stat .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2 mtime 2022-12-20T07:32:11.00-0500, size 0
>
>
> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>
>> Hi!
>>
>> I'm working on a POC cluster setup dedicated to backup app writing objects
>> via s3 (large objects, up to 1TB transferred via multipart upload
>> process).
>>
>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>> pool.  Plan is to use cephadm.
>>
>> I'd like to follow good practice and put the RGW index pool on a
>> no-rotation drive. Question is how to do it?
>>
>>- replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>- reserve space on NVME drive on each node, create lv based OSD and let
>>rgb index use the same NVME drive as DB/WALL
>>
>> Thoughts?
>>
>> --
>> Lukasz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>

-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Lukasz Borek
>
> My understanding is that omap and EC are incompatible, though.

Is that the reason why multipart upload uses a non-EC pool to save its
metadata to an omap database?
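
If I read the zone config right, the pool in question is the placement's
data_extra_pool - it can be checked with something like:

radosgw-admin zone get | grep data_extra_pool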




On Mon, 8 Apr 2024 at 20:21, Anthony D'Atri  wrote:

> My understanding is that omap and EC are incompatible, though.
>
> > On Apr 8, 2024, at 09:46, David Orman  wrote:
> >
> > I would suggest that you might consider EC vs. replication for index
> data, and the latency implications. There's more than just the nvme vs.
> rotational discussion to entertain, especially if using the more widely
> spread EC modes like 8+3. It would be worth testing for your particular
> workload.
> >
> > Also make sure to factor in storage utilization if you expect to see
> versioning/object lock in use. This can be the source of a significant
> amount of additional consumption that isn't planned for initially.
> >
> > On Mon, Apr 8, 2024, at 01:42, Daniel Parkes wrote:
> >> Hi Lukasz,
> >>
> >> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
> >> database of each osd, not on the actual index pool, so by putting
> DB/WALL
> >> on an NVMe as you mentioned, you are already configuring the index pool
> on
> >> a non-rotational drive, you don't need to do anything else.
> >>
> >> You just need to size your DB/WALL partition accordingly. For RGW/object
> >> storage, a good starting point for the DB/Wall sizing is 4%.
> >>
> >> Example of Omap entries in the index pool using 0 bytes, as they are
> stored
> >> in Rocksdb:
> >>
> >> # rados -p default.rgw.buckets.index listomapkeys
> >> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >> file1
> >> file2
> >> file4
> >> file10
> >>
> >> rados df -p default.rgw.buckets.index
> >> POOL_NAME  USED  OBJECTS  CLONES  COPIES
> >> MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD  WR_OPS  WR
> >> USED COMPR  UNDER COMPR
> >> default.rgw.buckets.index   0 B   11   0  33
> >>00 0 208  207 KiB  41  20 KiB 0 B
> >>0 B
> >>
> >> # rados -p default.rgw.buckets.index stat
> >> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >>
> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >> mtime 2022-12-20T07:32:11.00-0500, size 0
> >>
> >>
> >> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek 
> wrote:
> >>
> >>> Hi!
> >>>
> >>> I'm working on a POC cluster setup dedicated to backup app writing
> objects
> >>> via s3 (large objects, up to 1TB transferred via multipart upload
> process).
> >>>
> >>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) +
> EC
> >>> pool.  Plan is to use cephadm.
> >>>
> >>> I'd like to follow good practice and put the RGW index pool on a
> >>> no-rotation drive. Question is how to do it?
> >>>
> >>>   - replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
> >>>   - reserve space on NVME drive on each node, create lv based OSD and
> let
> >>>   rgb index use the same NVME drive as DB/WALL
> >>>
> >>> Thoughts?
> >>>
> >>> --
> >>> Lukasz
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] tuning for backup target cluster

2024-05-24 Thread Lukasz Borek
Hi Everyone,

I'm putting together an HDD cluster with an EC pool dedicated to the backup
environment. Traffic via S3. Version 18.2, 7 OSD nodes, 12 * 12TB HDD +
1 NVMe each, 4+2 EC pool.

Wondering if there is some general guidance for initial setup/tuning with
regards to S3 object size. Files are read from fast storage (SSD/NVMe) and
written to S3. File sizes are 10MB-1TB, so it's not standard S3 traffic.
Backups of big files take hours to complete.

My first shot would be to increase the default bluestore_min_alloc_size_hdd
to reduce the number of stored objects, but I'm not sure if it's a
good direction? Any other parameters worth checking to support such a
traffic pattern?
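
For reference, the value existing OSDs were built with can be read from their
metadata (it only takes effect at OSD creation time), something along these
lines:

ceph osd metadata 0 | grep bluestore_min_alloc_size
ceph config get osd bluestore_min_alloc_size_hdd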

Thanks!

-- 
Łukasz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-05-27 Thread Lukasz Borek
Anthony, Darren
Thanks for response.

Answering your questions:

What is the network you have for this cluster?

25 Gbit/s

> Is this a chassis with universal slots, or is that NVMe device maybe M.2
> or rear-cage?

12 HDDs via an LSI JBOD + 1 PCIe NVMe. It's 1.6TB now; for production the
plan is to use 3.2TB.


> `ceph df`
> `ceph osd dump | grep pool`
> So we can see what's going on HDD and what's on NVMe.


--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    703 TiB  587 TiB  116 TiB  116 TiB   16.51
TOTAL  703 TiB  587 TiB  116 TiB  116 TiB   16.51

--- POOLS ---
POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
default.rgw.meta            52    64  6.0 KiB       13  131 KiB      0    177 TiB
.mgr                        54    32   28 MiB        8   83 MiB      0    177 TiB
.rgw.root                   55    64  2.0 KiB        4   48 KiB      0    177 TiB
default.rgw.control         56    64      0 B        8      0 B      0    177 TiB
default.rgw.buckets.index   59    32   34 MiB       33  102 MiB      0    177 TiB
default.rgw.log             63    32  3.6 KiB      209  408 KiB      0    177 TiB
default.rgw.buckets.non-ec  65    32   44 MiB       40  133 MiB      0    177 TiB
4_2_EC                      67  1024   71 TiB   18.61M  106 TiB  16.61    355 TiB

# ceph osd dump | grep pool
pool 52 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 6
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change
18206 lfor 0/0/13123 flags hashpspool stripe_width 0 application rgw
read_balance_score 5.27
pool 54 '.mgr' replicated size 3 min_size 2 crush_rule 6 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 18206 lfor
0/0/13186 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
application mgr read_balance_score 5.25
pool 55 '.rgw.root' replicated size 3 min_size 2 crush_rule 6 object_hash
rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor
0/0/13191 flags hashpspool stripe_width 0 application rgw
read_balance_score 3.92
pool 56 'default.rgw.control' replicated size 3 min_size 2 crush_rule 6
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change
18206 lfor 0/0/13200 flags hashpspool stripe_width 0 application rgw
read_balance_score 6.55
pool 59 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
18206 lfor 0/0/13594 flags hashpspool stripe_width 0 pg_autoscale_bias 4
application rgw read_balance_score 5.27
pool 63 'default.rgw.log' replicated size 3 min_size 2 crush_rule 6
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
18397 lfor 0/0/18386 flags hashpspool stripe_width 0 application rgw
read_balance_score 10.56
pool 65 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 18923 lfor 0/0/18921 flags hashpspool stripe_width 0
application rgw read_balance_score 7.89
pool 67 '4_2_EC' erasure profile 4_2 size 6 min_size 5 crush_rule 13
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off
last_change 23570 flags hashpspool stripe_width 16384 application rgw


You also have the metadata pools used by RGW that ideally need to be on
> NVME.
> Because you are using EC then there is the buckets.non-ec pool which is
> used to manage the OMAPS for the multipart uploads this is usually down at
> 8 PG’s and that will be limiting things as well.

This part is very interesting. Some time ago I asked a similar question here.
The conclusion was that the index omaps are covered by BlueStore's RocksDB on
the DB/WAL device. Should we consider removing a few HDDs and replacing them
with SSDs for the non-ec pool?

So now you have the question of do you have enough streams running in
> parallel? Have you tried a benchmarking tool such as minio warp to see what
> it can achieve.

I think so; warp shows 1.6 GiB/s for 20GB objects in 50 streams - acceptable.
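
Roughly the invocation I used (endpoint, keys and duration are placeholders):

warp put --host rgw.example:8080 --access-key KEY --secret-key SECRET \
  --bucket warp-test --obj.size 20GiB --concurrent 50 --duration 10m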

Changing the bluestone_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn’t be changing it as you are in untested
> territory.

ACK! :)

Thanks!

On Mon, 27 May 2024 at 09:27, Darren Soothill 
wrote:

> So a few questions I have around this.
>
> What is the network you have for this cluster?
>
> Changing the bluestone_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn’t be changing it as you are in untested
> territory.
>
> The challenge with making these sort of things perform is to generate lots
> of parallel streams so what ever is doing the uploading needs to be doing
> parallel multipart uploads. There is no mention of the uploading code that
> is being used.
>
> So with 7 Nodes each with 12 Disks and doing large files like this I would
> be expecting to see 50-70MB/s per useable HDD. By us

[ceph-users] Re: tuning for backup target cluster

2024-06-04 Thread Lukasz Borek
>
> I have certainly seen cases where the OMAPS have not stayed within the
> RocksDB/WAL NVME space and have been going down to disk.

How can we monitor the OMAP size and make sure it does not spill out of the
NVMe?
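
My naive idea for now is to watch the bluefs counters on each OSD, e.g.:

ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'

(non-zero slow_used_bytes would mean BlueFS started using the HDD), but maybe
there is a better way?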

> The OP's number suggest IIRC like 120GB-ish for WAL+DB, though depending on
> workload spillover could of course still be a thing.

Correct. But for the production deployment the plan is to use 3.2TB for 10
HDDs. In case of performance problems we will move the non-ec pool to SSDs
(by replacing a few HDDs with SSDs).
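
The move itself should just be a CRUSH rule switch, something like (rule name
is an example):

ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
ceph osd pool set default.rgw.buckets.non-ec crush_rule rgw-meta-ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-ssd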

Using cephadm, is it possible to carve out part of the NVMe drive for an OSD
and leave the rest for RocksDB/WAL? Right now my deployment is as simple as:

# ceph orch  ls osd osd.dashboard-admin-1710711254620 --export
service_type: osd
service_id: dashboard-admin-1710711254620
service_name: osd.dashboard-admin-1710711254620
placement:
  host_pattern: cephbackup-osd3
spec:
  data_devices:
rotational: true
  db_devices:
rotational: false
  filter_logic: AND
  objectstore: bluestore

Thanks

On Mon, 3 Jun 2024 at 17:28, Anthony D'Atri  wrote:

>
> The OP's number suggest IIRC like 120GB-ish for WAL+DB, though depending
> on workload spillover could of course still be a thing.
>
> >
> > I have certainly seen cases where the OMAPS have not stayed within the
> RocksDB/WAL NVME space and have been going down to disk.
> >
> > This was on a large cluster with a lot of objects but the disks that
> where being used for the non-ec pool where seeing a lot more actual disk
> activity than the other disks in the system.
> >
> > Moving the non-ec pool onto NVME helped with a lot of operations that
> needed to be done to cleanup a lot of orphaned objects.
> >
> > Yes this was a large cluster with a lot of ingress data admitedly.
> >
> > Darren Soothill
> >
> > Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io/
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> >
> >
> >
> >
> >> On 29 May 2024, at 21:24, Anthony D'Atri  wrote:
> >>
> >>
> >>
> >>> You also have the metadata pools used by RGW that ideally need to be
> on NVME.
> >>
> >> The OP seems to intend shared NVMe for WAL+DB, so that the omaps are on
> NVMe that way.
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-06-04 Thread Lukasz Borek
>
> You could check if your devices support NVMe namespaces and create more
> than one namespace on the device.

Wow, tricky. Will give it a try.
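
Sketch of what I plan to try, straight from the nvme-cli man pages (untested;
sizes are in blocks and values are just examples, and the controller id for
attach-ns has to be taken from the cntlid field of id-ctrl):

nvme id-ctrl /dev/nvme0 | grep -E '^nn |tnvmcap|cntlid'
nvme create-ns /dev/nvme0 --nsze=390625000 --ncap=390625000 --flbas=0
nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=0
nvme list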

Thanks!


Łukasz Borek
luk...@borek.org.pl


On Tue, 4 Jun 2024 at 16:26, Robert Sander 
wrote:

> Hi,
>
> On 6/4/24 16:15, Anthony D'Atri wrote:
>
> > I've wondered for years what the practical differences are between using
> a namespace and a conventional partition.
>
> Namespaces show up as separate block devices in the kernel.
>
> The orchestrator will not touch any devices that contain a partition
> table or logical volume signatures.
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lifecycle Stuck PROCESSING and UNINITIAL

2024-10-18 Thread Lukasz Borek
I don't think the root cause has been found. I disabled versioning, as I have
to manually remove expired objects using an S3 client.
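
The manual cleanup boils down to listing old versions and deleting them one
by one, e.g. with awscli (endpoint/bucket/key are placeholders):

aws --endpoint-url http://rgw.example:8080 s3api list-object-versions --bucket mybucket
aws --endpoint-url http://rgw.example:8080 s3api delete-object --bucket mybucket \
  --key path/to/object --version-id <versionId>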

On Thu, 17 Oct 2024 at 17:50, Reid Guyett  wrote:

> Hello,
>
> I am experiencing an issue where it seems all lifecycles are showing either
> PROCESSING or UNINITIAL.
>
> > # radosgw-admin lc list
> > [
> > {
> > "bucket":
> > ":tesra:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.833499554.20",
> > "shard": "lc.0",
> > "started": "Thu, 17 Oct 2024 00:00:01 GMT",
> > "status": "PROCESSING"
> > },
> > {
> > "bucket":
> > ":primevideos:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.31134.886",
> > "shard": "lc.3",
> > "started": "Wed, 16 Oct 2024 00:00:01 GMT",
> > "status": "PROCESSING"
> > },
> > {
> > "bucket":
> > ":editorimages:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.31134.3",
> > "shard": "lc.4",
> > "started": "Wed, 16 Oct 2024 00:00:01 GMT",
> > "status": "PROCESSING"
> > },
> > {
> > "bucket":
> > ":osbackup:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.668063260.1",
> > "shard": "lc.10",
> > "started": "Thu, 17 Oct 2024 00:00:01 GMT",
> > "status": "PROCESSING"
> > },
> > {
> > "bucket":
> > ":projects0609:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.856877269.1147",
> > "shard": "lc.10",
> > "started": "Thu, 01 Jan 1970 00:00:00 GMT",
> > "status": "UNINITIAL"
> > },
> > ...
> >
>
> I turned up the log level for rgw and see a most have some sort of error:
> "failed to put head" or "returned error ret==-2"
>
> > 2024-10-17T15:08:31.720+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.10 head last stored at Mon
> Sep
> > 23 00:00:00 2024
> > 2024-10-17T15:08:31.720+ 7fc9cf968640 16 lifecycle:
> > RGWLC::expired_session started: 1729123201 interval: 86400(*2==172800)
> now:
> > 1729177711
> > 2024-10-17T15:08:31.720+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ACTIVE entry:
> > :osbackup:5e9bc383-f7bd-4fd1-b607-1e563bfe0011.668063260.1::1729123201:1
> > index: 10 worker ix: 0
> > 2024-10-17T15:08:31.744+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > failed to put head lc.10
> > 2024-10-17T15:08:31.779+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 2 worker ix: 0
> > 2024-10-17T15:08:31.781+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.2 head last stored at Wed Jun
> > 19 00:00:02 2024
> > 2024-10-17T15:08:31.781+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2
> > 2024-10-17T15:08:31.782+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 23 worker ix: 0
> > 2024-10-17T15:08:31.783+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.23 head last stored at Mon
> Jan
> >  8 00:00:00 2024
> > 2024-10-17T15:08:31.784+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2
> > 2024-10-17T15:08:31.784+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 22 worker ix: 0
> > 2024-10-17T15:08:31.786+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.22 head last stored at Mon
> Aug
> > 19 00:00:00 2024
> > 2024-10-17T15:08:31.786+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2
> > 2024-10-17T15:08:31.787+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 17 worker ix: 0
> > 2024-10-17T15:08:31.788+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.17 head last stored at Mon
> Jul
> > 22 00:00:00 2024
> > 2024-10-17T15:08:31.788+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2
> > 2024-10-17T15:08:31.789+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 7 worker ix: 0
> > 2024-10-17T15:08:31.790+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.7 head last stored at Sat Jun
> >  8 00:00:02 2024
> > 2024-10-17T15:08:31.791+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2
> > 2024-10-17T15:08:31.791+ 7fc9cf968640  5 lifecycle: RGWLC::process():
> > ENTER: index: 0 worker ix: 0
> > 2024-10-17T15:08:31.793+ 7fc9cf968640  0 lifecycle: RGWLC::process()
> > head.marker !empty() at START for shard==lc.0 head last stored at Mon Sep
> > 23 00:00:00 2024
> > 2024-10-17T15:08:31.793+ 7fc9cf968640 16 lifecycle:
> > RGWLC::expired_session started: 1729123201 interval: 86400(*2==172800)
> now:
> > 1729177711
> > 2024-10-17T15:08:31.793+ 7fc9cf968640  5 lifecycle: RGWLC::

[ceph-users] Re: WAL on NVMe/SSD not used after OSD/HDD replace

2024-09-27 Thread Lukasz Borek
Adding --zap to the orch command cleans the WAL logical volume:

ceph orch osd rm 37 --replace --zap

After replacement, new OSD is correctly created. Tested a few times with
18.2.4.
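
For reference, roughly the sequence (OSD id is just an example; the last two
commands are only the checks I use to confirm the removal finished and the
old WAL LV tags are gone - the lvm list one is run on the OSD host):

ceph orch osd rm 37 --replace --zap
ceph orch osd rm status
cephadm shell -- ceph-volume lvm list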

Thanks.

On Fri, 27 Sept 2024 at 19:31, Igor Fedotov  wrote:

> Hi!
>
> I'm not an expert in the Ceph orchestrator but it looks to me like WAL
> volume hasn't been properly cleaned up during osd.1 removal.
>
> Please compare LVM tags for osd.0 and .1:
>
> osd.0:
>
> "devices": [
>  "/dev/sdc"
>  ],
>
> ...
>
>  "lv_tags":
> "...,ceph.osd_fsid=d472bf9f-c17d-4939-baf5-514a07db66bc,ceph.osd_id=0,...",
>
>   "devices": [
>  "/dev/sdb"
>  ],
>  "lv_tags":
> "...ceph.osd_fsid=d472bf9f-c17d-4939-baf5-514a07db66bc,ceph.osd_id=0,...",
>
> osd_fsid (OSD daemon UUID) is the same for both devices, which allows
> ceph-volume to bind these volumes to the relevant OSD.
>
> OSD.1:
>
>{
>  "devices": [
>  "/dev/sdd"
>  ],
>  "lv_tags":
> "...ceph.osd_fsid=d94bda82-e59f-4d3d-81cd-28ea69c5e02f,ceph.osd_id=1,...",
>
> ...
>
>  {
>  "devices": [
>  "/dev/sdb"
>  ],
>  "lv_tags":
> "...ceph.osd_fsid=7a1d0007-71ff-4011-8a18-e6de1499cbdf,ceph.osd_id=1,...",
>
> osd_fsid tags are different, WAL volume's one is apparently a legacy UUID.
>
> This WAL volume is not bound to new osd.1  (lvtags for osd.1 main volume
> confirms that since there are no WAL related members there) and it still
> keeps setting for the legacy OSD.1.
>
> In other words this is an orpan volume for now and apparently could be
> safely recreated and assigned back to osd.1 via ceph-colume lvm new-wal
> command. Certainly better try in the test env first.
>
> Hope this helps.
>
> Thanks,
>
> Igor
>
> On 9/27/2024 3:48 PM, mailing-lists wrote:
> > Dear Ceph-users,
> > I have a problem that I'd like to have your input for.
> >
> > Preface:
> > I have got a test-cluster and a productive-cluster. Both are setup the
> > same and both are having the same "issue". I am running Ubuntu 22.04
> > and deployed ceph 17.2.3 via cephadm. Upgraded to 17.2.7 later on,
> > which is the version we are currently running. Since the issue seem to
> > be the exact same on the test-cluster, I will post
> > test-cluster-outputs here for better readability.
> >
> > The issue:
> > I have replaced disks and after the replacement, it does not show that
> > it would use the NVMe as WAL device anymore. The LV still exists, but
> > the metadata of the osd does not show it, as it would be with any
> > other osd/hdd, that hasnt been replaced.
> >
> > ODS.1 (incorrect, bluefs_dedicated_wal: "0")
> > ```
> > {
> > "id": 1,
> > "arch": "x86_64",
> > "back_addr":
> > "[v2:192.168.6.241:6802/3213655489,v1:192.168.6.241:6803/3213655489]",
> > "back_iface": "",
> > "bluefs": "1",
> > "bluefs_dedicated_db": "0",
> > "bluefs_dedicated_wal": "0",
> > "bluefs_single_shared_device": "1",
> > "bluestore_bdev_access_mode": "blk",
> > "bluestore_bdev_block_size": "4096",
> > "bluestore_bdev_dev_node": "/dev/dm-3",
> > "bluestore_bdev_devices": "sdd",
> > "bluestore_bdev_driver": "KernelDevice",
> > "bluestore_bdev_optimal_io_size": "0",
> > "bluestore_bdev_partition_path": "/dev/dm-3",
> > "bluestore_bdev_rotational": "1",
> > "bluestore_bdev_size": "17175674880",
> > "bluestore_bdev_support_discard": "1",
> > "bluestore_bdev_type": "hdd",
> > "bluestore_min_alloc_size": "4096",
> > "ceph_release": "quincy",
> > "ceph_version": "ceph version 17.2.7
> > (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
> > "ceph_version_short": "17.2.7",
> > "ceph_version_when_created": "",
> > "container_hostname": "bi-ubu-srv-ceph2-01",
> > "container_image":
> > "
> quay.io/ceph/ceph@sha256:28323e41a7d17db238bdcc0a4d7f38d272f75c1a499bc30f59b0b504af132c6b
> ",
> > "cpu": "AMD EPYC 75F3 32-Core Processor",
> > "created_at": "",
> > "default_device_class": "hdd",
> > "device_ids": "sdd=QEMU_HARDDISK_drive-scsi3",
> > "device_paths":
> > "sdd=/dev/disk/by-path/pci-:00:05.0-scsi-0:0:3:0",
> > "devices": "sdd",
> > "distro": "centos",
> > "distro_description": "CentOS Stream 8",
> > "distro_version": "8",
> > "front_addr":
> >
> "[v2:.241:6800/3213655489,v1:.241:6801/3213655489]",
> > "front_iface": "",
> > "hb_back_addr":
> > "[v2:192.168.6.241:6806/3213655489,v1:192.168.6.241:6807/3213655489]",
> > "hb_front_addr":
> >
> "[v2:.241:6804/3213655489,v1:.241:6805/3213655489]",
> > "hostname": "bi-ubu-srv-ceph2-01",
> > "journal_rotational": "1",
> > "kernel_description": "#132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024",
> > "kernel_version": "5.15.0-122-generic",
> > "mem_swap_kb": "4018172",
> > "mem_total_kb": "5025288",
> > "network_numa_unknown_ifaces": "back_iface,fr

[ceph-users] lifecycle for versioned bucket

2024-09-17 Thread Lukasz Borek
Hi,

I'm having issue with lifecycle jobs for 18.2.4 cluster with versioning
enabled bucket.


/# radosgw-admin lc list
[
{
"bucket":
":mongobackup-prod:c3e0a369-71df-40f5-a5c0-51e859efe0e0.96754.1",
"shard": "lc.0",
"started": "Thu, 01 Jan 1970 00:00:00 GMT",
"status": "COMPLETE"
}
]


During RGW startup :

2024-09-17T16:51:06.148+ 7f944d02d640  2 garbage collection: garbage
collection: start
2024-09-17T16:51:06.152+ 7f944501d640  2 lifecycle: life cycle: start
2024-09-17T16:51:06.152+ 7f944501d640  5 lifecycle: RGWLC::process():
ENTER: index: 26 worker ix: 0
2024-09-17T16:51:06.152+ 7f943f812640  2 lifecycle: life cycle: start
2024-09-17T16:51:06.152+ 7f943f812640  5 lifecycle: RGWLC::process():
ENTER: index: 18 worker ix: 1
2024-09-17T16:51:06.152+ 7f944501d640  5 lifecycle: RGWLC::process()
process shard rollover lc_shard=lc.26 head.marker=
head.shard_rollover_date=0
2024-09-17T16:51:06.152+ 7f944501d640  5 lifecycle: RGWLC::process()
entry.bucket.empty() == true at START 1 (this is possible mainly before any
lc policy has been stored or after removal of an lc_shard object)
2024-09-17T16:51:06.152+ 7f943a007640  2 lifecycle: life cycle: start
2024-09-17T16:51:06.152+ 7f943a007640  5 lifecycle: RGWLC::process():
ENTER: index: 17 worker ix: 2
2024-09-17T16:51:06.156+ 7f943f812640  5 lifecycle: RGWLC::process()
process shard rollover lc_shard=lc.18 head.marker=
head.shard_rollover_date=0
2024-09-17T16:51:06.156+ 7f943f812640  5 lifecycle: RGWLC::process()
entry.bucket.empty() == true at START 1 (this is possible mainly before any
lc policy has been stored or after removal of an lc_shard object)
2024-09-17T16:51:06.156+ 7f944501d640  5 lifecycle: RGWLC::process():
ENTER: index: 9 worker ix: 0
[...]
2024-09-17T16:51:06.568+ 7f94137ba640  2 lifecycle: life cycle: stop
2024-09-17T16:51:06.572+ 7f94137ba640  5 lifecycle: schedule life cycle
next start time: Wed Sep 18 01:00:00 2024

Scheduled job  ends with:

2024-09-17T00:00:01.379+ 7fd17f843700  0 lifecycle: RGWLC::process()
head.marker !empty() at START for shard==lc.31 head last stored at Mon Sep
16 06:34:39 2024
2024-09-17T00:00:01.379+ 7fd17f843700  0 lifecycle: RGWLC::process()
sal_lc->get_entry(lc_shard, head.marker, entry) returned error ret==-2

Anyone can explain the logic behind that?

Thanks.

-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About erasure code for larger hdd

2024-12-09 Thread Lukasz Borek
I'd start with 3+2, so you have one node left for recovery in case one
fails. A 6-node cluster with 90 HDDs per node sounds like a long recovery
that needs to be tested for sure.
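
For reference, a 3+2 profile with host failure domain would be created with
something like (profile/pool names and PG count are just examples):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd pool create backup-ec 1024 1024 erasure ec-3-2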

On Mon, 9 Dec 2024 at 06:10, Phong Tran Thanh 
wrote:

> Hi community,
>
> Please help with advice on selecting an erasure coding algorithm for a
> 6-node cluster with 540 OSDs. What would be the appropriate values for *k*
> and *m*? The cluster requires a high level of HA and consistent
> throughput.
>
> Email: tranphong...@gmail.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fqdn in spec

2025-01-08 Thread Lukasz Borek
I've never used FQDNs this way, but there is an option for the cephadm
bootstrap command:

  --allow-fqdn-hostname    allow hostname that is fully-qualified (contains ".")

Worth checking. Not sure what's behind it.
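
I.e. at bootstrap time something like (the IP is a placeholder):

cephadm bootstrap --mon-ip 10.0.0.1 --allow-fqdn-hostname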

Thanks


On Wed, 8 Jan 2025 at 12:14, Piotr Pisz  wrote:

> Hi,
>
> We add hosts to the cluster using fqdn, manually (ceph orch host add)
> everything works fine.
> However, if we use the spec file as below, the whole thing falls apart.
>
> ---
> service_type: host
> addr: xx.xx.xx.xx
> hostname: ceph001.xx002.xx.xx.xx.com
> location:
>   root: xx002
>   rack: rack01
> labels:
>   - osd
>   - rgw
> ---
> service_type: osd
> service_id: object_hdd
> service_name: osd.object_hdd
> placement:
>   host_pattern: ceph*
> crush_device_class: object_hdd
> spec:
>   data_devices:
> rotational: 1
>   db_devices:
> rotational: 0
> size: '3000G:'
> ---
> service_type: osd
> service_id: index_nvme
> service_name: osd.index_nvme
> placement:
>   host_pattern: ceph*
> crush_device_class: index_nvme
> spec:
>   data_devices:
> rotational: 0
> size: ':900G'
>
> Applying this spec results in two hosts, one fqdn and the other short:
>
> root@mon001(xx002):~/cephadm# ceph osd df tree
> ID  CLASS   WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
> META
> AVAIL%USE  VAR   PGS  STATUS  TYPE NAME
> -4  0 -  0 B  0 B  0 B 0 B
> 0 B  0 B 0 0-  root dias002
> -3  0 -  0 B  0 B  0 B 0 B
> 0 B  0 B 0 0-  rack rack01
> -2  0 -  0 B  0 B  0 B 0 B
> 0 B  0 B 0 0-  host
> ceph001.xx002.xx.xx.xx.com
> -1  662.71497 -  663 TiB  7.0 TiB  102 MiB  37 KiB  1.7
> GiB  656 TiB  1.05  1.00-  root default
> -9  662.71497 -  663 TiB  7.0 TiB  102 MiB  37 KiB  1.7
> GiB  656 TiB  1.05  1.00-  host ceph001
> 36  index_nvme0.87329   1.0  894 GiB   33 MiB  2.7 MiB   1 KiB   30
> MiB  894 GiB  0.00  0.000  up  osd.36
>  0  object_hdd   18.38449   1.0   18 TiB  199 GiB  2.7 MiB   1 KiB   56
> MiB   18 TiB  1.06  1.000  up  osd.0
>  1  object_hdd   18.38449   1.0   18 TiB  199 GiB  2.7 MiB   1 KiB   74
> MiB   18 TiB  1.06  1.000  up  osd.1
>  2  object_hdd   18.38449   1.0   18 TiB  199 GiB  2.7 MiB   1 KiB   56
> MiB   18 TiB  1.06  1.000  up  osd.2
>
> This looks like a bug, but I'm not sure, maybe someone has encountered
> something similar?
>
> Regards,
> Piotr
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Diskprediction_local mgr module removal - Call for feedback

2025-04-10 Thread Lukasz Borek
+1

I wasn't aware that this module is obsolete and was trying to start it a
few weeks ago.

We developed a home-made solution some time ago to monitor SMART data from
both HDDs (uncorrected errors, grown defect list) and SSDs (WLC/TBW). But
keeping it up to date with non-unified disk models is a nightmare.
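
Roughly the kind of fields we pull (device paths are examples; the exact
attribute names differ per vendor/model, which is exactly the painful part):

smartctl -x /dev/sda | grep -iE 'grown defect|uncorrected'
nvme smart-log /dev/nvme0 | grep -iE 'percentage_used|data_units_written'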

Alert : "OSD.12 is going to fail. Replace it soon" before seeing SLOW_OPS
would be a game changer!

Thanks!

On Tue, 8 Apr 2025 at 10:00, Michal Strnad  wrote:

> Hi.
>
>  From our point of view, it's important to keep disk failure prediction
> tool as part of Ceph, ideally as an MGR module. In environments with
> hundreds or thousands of disks, it's crucial to know whether, for
> example, a significant number of them are likely to fail within a month
> - which, in the best-case scenario, would mean performance degradation,
> and in the worst-case, data loss.
>
> Some have already responded to the deprecation of diskprediction by
> starting to develop their own solutions. For instance, just yesterday,
> Daniel Persson published a solution [1] on his website that addresses
> the same problem.
>
> Would it be possible to join forces and try to revive that module?
>
> [1] https://www.youtube.com/watch?v=Gr_GtC9dcMQ
>
> Thanks,
> Michal
>
>
> On 4/8/25 01:18, Yaarit Hatuka wrote:
> > Hi everyone,
> >
> > On today's Ceph Steering Committee call we discussed the idea of removing
> > the diskprediction_local mgr module, as the current prediction model is
> > obsolete and not maintained.
> >
> > We would like to gather feedback from the community about the usage of
> this
> > module, and find out if anyone is interested in maintaining it.
> >
> > Thanks,
> > Yaarit
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Request for Recommendations: Tiering/Archiving from NetApp to Ceph (with stub file support)

2025-05-05 Thread Lukasz Borek
Hi MJ,

Not sure if it's the right direction, but backup software (for example
Commvault) has an archiving option with stub support. You can configure an S3
cloud library with Ceph S3 as the backend. I've never tested it - just a
thought.

On Mon, 5 May 2025 at 00:49, sacawulu  wrote:

> Hi all,
>
> We're exploring solutions to offload large volumes of data (on the order
> of petabytes) from our NetApp all-flash storage to our more
> cost-effective, HDD-based Ceph storage cluster, based on criteria such
> as: last access time older than X years.
>
> Ideally, we would like to leave behind a 'stub' or placeholder file on
> the NetApp side to preserve the original directory structure and
> potentially enable some sort of transparent access or recall if needed.
> This kind of setup is commonly supported by solutions like
> DataCore/FileFly, but as far as we can tell, FileFly doesn’t support
> Ceph as a backend and instead favors its own Swarm object store.
>
> Has anyone here implemented a similar tiering/archive/migration solution
> involving NetApp and Ceph?
>
> We’re specifically looking for:
>
> *Enterprise-grade tooling
>
> *Stub file support or similar metadata-preserving offload
>
> *Support and reliability (given the scale, we can’t afford data loss
> or inconsistency)
>
> *Either commercial or well-supported open source solutions
>
> Any do’s/don’ts, war stories, or product recommendations would be
> greatly appreciated. We’re open to paying for software or services if it
> brings us the reliability and integration we need.
>
> Thanks in advance!
>
> MJ
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Patching Ceph cluster

2025-04-24 Thread Lukasz Borek
>
> For upgrade the OS we have something similar, but exiting maintenance mode
> is broken (with 17.2.7) :(
> I need to check the tracker for similar issues and if I can't find
> anything, I will create a ticket

With 18.2.2 the first maintenance-exit command threw an exception for some
reason. In my patching script I execute the command in a loop and the 2nd
attempt usually works (rough sketch of the loop after the log below).

exit maint 1/3
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1809, in _handle_command
return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 183, in
handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 119, in

wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
 # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 108, in
wrapper
return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 778, in
_host_maintenance_exit
raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 237, in
raise_if_exception
e = pickle.loads(c.serialized_exception)
TypeError: __init__() missing 2 required positional arguments: 'hostname'
and 'addr'

exit maint 2/3
Ceph cluster f3e63d9e-2f4c-11ef-87a2-0f1170f55ed5 on cephbackup-osd1 has
exited maintenance mode
exit maint 3/3
Error EINVAL: Host cephbackup-osd1 is not in maintenance mode
Fri Apr 25 07:17:58 CEST 2025 cluster state is HEALTH_WARN
Fri Apr 25 07:18:02 CEST 2025 cluster state is HEALTH_WARN
[...]
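
The loop itself is nothing fancy, roughly this (host name is an example):

for i in 1 2 3; do
  echo "exit maint $i/3"
  ceph orch host maintenance exit cephbackup-osd1
  sleep 10
done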




On Thu, 13 Jun 2024 at 22:07, Sake Ceph  wrote:

>
>
> For upgrade the OS we have something similar, but exiting maintenance mode
> is broken (with 17.2.7) :(
> I need to check the tracker for similar issues and if I can't find
> anything, I will create a ticket.
>
> Kind regards,
> Sake
>
> > Op 12-06-2024 19:02 CEST schreef Daniel Brown
> :
> >
> >
> > I have two ansible roles, one for enter, one for exit. There’s likely
> better ways to do this — and I’ll not be surprised if someone here lets me
> know. They’re using orch commands via the cephadm shell. I’m using Ansible
> for other configuration management in my environment, as well, including
> setting up clients of the ceph cluster.
> >
> >
> > Below excerpts from main.yml in the “tasks” for the enter/exit roles.
> The host I’m running ansible from is one of my CEPH servers - I’ve limited
> which process run there though so it’s in the cluster but not equal to the
> others.
> >
> >
> > —
> > Enter
> > —
> >
> > - name: Ceph Maintenance Mode Enter
> >   shell:
> >
> > cmd: ' cephadm shell ceph orch host maintenance enter {{
> (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}
> --force --yes-i-really-mean-it ‘
> >   become: True
> >
> >
> >
> > —
> > Exit
> > —
> >
> >
> > - name: Ceph Maintenance Mode Exit
> >   shell:
> > cmd: 'cephadm shell ceph orch host maintenance exit {{
> (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }} ‘
> >   become: True
> >   connection: local
> >
> >
> > - name: Wait for Ceph to be available
> >   ansible.builtin.wait_for:
> > delay: 60
> > host: '{{
> (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}’
> > port: 9100
> >   connection: local
> >
> >
> >
> >
> >
> >
> > > On Jun 12, 2024, at 11:28 AM, Michael Worsham <
> mwors...@datadimensions.com> wrote:
> > >
> > > Interesting. How do you set this "maintenance mode"? If you have a
> series of documented steps that you have to do and could provide as an
> example, that would be beneficial for my efforts.
> > >
> > > We are in the process of standing up both a dev-test environment
> consisting of 3 Ceph servers (strictly for testing purposes) and a new
> production environment consisting of 20+ Ceph servers.
> > >
> > > We are using Ubuntu 22.04.
> > >
> > > -- Michael
> > > From: Daniel Brown 
> > > Sent: Wednesday, June 12, 2024 9:18 AM
> > > To: Anthony D'Atri 
> > > Cc: Michael Worsham ; ceph-users@ceph.io
> 
> > > Subject: Re: [ceph-users] Patching Ceph cluster
> > >  This is an external email. Please take care when clicking links or
> opening attachments. When in doubt, check with the Help Desk or Security.
> > >
> > >
> > > There’s also a Maintenance mode that you can set for each server, as
> you’re doing updates, so that the cluster doesn’t try to move data from
> affected OSD’s, while the server being updated is offline or down. I’ve
> worked some on automating this with Ansible, but have found my process
> (and/or my cluster) still requires some manual intervention while it’s
> running to get things done cleanly.
> > >
> > >
> > >
> > > > On Jun 12, 2024, at 8:49 AM, Anthony D'Atri 
> wrote:
> > > >
> > > > Do you mean patching the OS?
> > > >
> > > > If so, easy -- one 

[ceph-users] Re: ceph health mute behavior

2025-06-25 Thread Lukasz Borek
Looks like I'm not alone in seeing scrub performance drop off after the last
update? :)


Łukasz Borek
luk...@borek.org.pl


On Wed, 25 Jun 2025 at 11:58, Eugen Block  wrote:

> Thanks Frédéric.
> The customer found the sticky flag, too. I must admit, I haven't used
> the mute command too often yet, usually I try to get to the bottom of
> a warning and rather fix the underlying issue. :-D
> So the mute clears if the number increases:
>
> >>  if (q->second.count > p->second.count)
>
> That makes sense, and I agree that an admin might want to know about
> that. Then this is resolved for me, thanks for the quick response!
>
> Eugen
>
> Zitat von Frédéric Nass :
>
> > Hi Eugen,
> >
> > Reading the code, the muted alert was cleared because it was
> > non-sticky and the number of affected PGs increased (which was
> > decided to be a good reason to alert the admin).
> >
> > Have you tried to use the --sticky argument on the 'ceph health
> > mute' command?
> >
> > Cheers,
> > Frédéric.
> >
> > - Le 25 Juin 25, à 9:21, Eugen Block ebl...@nde.ag a écrit :
> >
> >> Hi,
> >>
> >> I'm trying to understand the "ceph health mute" behavior. In this
> >> case, I'm referring to the warning PG_NOT_DEEP_SCRUBBED. If you mute
> >> it for a week and the cluster continues deep-scrubbing, the "mute"
> >> will clear at some point although there are still PGs not
> >> deep-scrubbed in time warnings. I could verify this in a tiny lab with
> >> 19.2.2, setting osd_deep_scrub_interval to 10 minutes, the warning
> >> pops up. Then I mute that warning, issue deep-scrubs for several
> >> pools, and at some point I see this in the mon log:
> >>
> >> Jun 25 08:53:28 host1 ceph-mon[823315]: log_channel(cluster) log [WRN]
> >> : Health check update: 61 pgs not deep-scrubbed in time
> >> (PG_NOT_DEEP_SCRUBBED)
> >> Jun 25 08:53:28 host1 ceph-mon[823315]: Health check update: 61 pgs
> >> not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
> >> Jun 25 08:53:29 host1 ceph-mon[823315]: pgmap v164176: 389 pgs: 389
> >> active+clean; 428 MiB data, 57 GiB used, 279 GiB / 336 GiB avail
> >> ...
> >> Jun 25 08:53:31 host1 ceph-mon[823315]: log_channel(cluster) log [INF]
> >> : Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from
> >> 60 to 61)
> >> Jun 25 08:53:31 host1 ceph-mon[823315]: Health alert mute
> >> PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
> >>
> >>
> >> I don't really understand what the code does [0] (I'm not a dev):
> >>
> >> ---snip---
> >> if (!p->second.sticky) {
> >>   auto q = all.checks.find(p->first);
> >>   if (q == all.checks.end()) {
> >>  mon.clog->info() << "Health alert mute " << p->first
> >><< " cleared (health alert cleared)";
> >>  p = pending_mutes.erase(p);
> >>  changed = true;
> >>  continue;
> >>   }
> >>   if (p->second.count) {
> >>  // count-based mute
> >>  if (q->second.count > p->second.count) {
> >>mon.clog->info() << "Health alert mute " << p->first
> >>  << " cleared (count increased from " <<
> p->second.count
> >>  << " to " << q->second.count << ")";
> >>p = pending_mutes.erase(p);
> >>changed = true;
> >>continue;
> >> ---snip---
> >>
> >> Could anyone shed some light what I'm not understanding? Why would the
> >> mute clear although there are still PGs not deep-scrubbed?
> >>
> >> Thanks!
> >> Eugen
> >>
> >> [0]
> >>
> https://github.com/ceph/ceph/blob/d78ffd1247d6cef5cbd829e77204185dc0d3a8ba/src/mon/HealthMonitor.cc#L431
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd latencies and grafana dashboards, squid 19.2.2

2025-07-24 Thread Lukasz Borek
>
> So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> Overview' dashboard? Or am I not understanding things correctly?

latency_count is an integer that returns the number of samples; latency_sum
is the sum of the latencies of those _count samples, in seconds (so you
multiply it by 1000 to get ms).
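
So the average write latency in ms per OSD would be something like:

(rate(ceph_osd_op_w_latency_sum[5m]) / rate(ceph_osd_op_w_latency_count[5m])) * 1000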

On Thu, 24 Jul 2025 at 20:44, Christopher Durham  wrote:

> In my 19.2.2/squid cluster, (Rocky 9 Linux)  I am trying to determine if I
> am having
> issues with OSD latency. The following URL:
>
> https://sysdig.com/blog/monitoring-ceph-prometheus/
>
> states the following about prometheus metrics:
>
> * ceph_osd_op_r_latency_count: Returns the number of reading operations
> running.
> * ceph_osd_op_r_latency_sum: Returns the time, in milliseconds, taken by
> the reading operations. This metric includes the queue time.
> * ceph_osd_op_w_latency_count: Returns the number of writing operations
> running.
> * ceph_osd_op_w_latency_sum: Returns the time, in milliseconds, taken by
> the writing operations. This metric includes the queue time.
>
> and
>
> * ceph_osd_commit_latency_ms: Returns the time it takes OSD to read or
> write to the journal.
> * ceph_osd_apply_latency_ms: Returns the time it takes to write the
> journal to the physical disk.
>
>
> The first set states 'includes the queue time'. What exactly does this
> mean? Does this mean that this is the time waiting before writing to the
> journal while in the memory of the ceph-osd daemon? If so, does the latter
> two metrics mean that once the writes start,
> this is the time it takes to write to the journal or the disk?
>
> Does the first set of metrics *include* the latter? In other words, are
> the apply/commit latencies included in the *[r,w]_latency_sum?
>
> The URL above suggests that to calculate the write latency for a given
> OSD, you do the following:
>
> (rate(ceph_osd_op_w_latency_sum[5m]) /
> rate(ceph_osd_op_w_latency_count[5m]) >= 0)
>
> However, the grafana dashboard 'OSD Overview' (in ceph-dashboards-19.2.2
> rpm), does something very similar for max,avg,quantile:
>
> max(rate(ceph_osd_op_w_max_latency_sum(cluster=\$cluster\,}[$__rate_interval])/
> on (ceph_daemon)
> rate(ceph_osd_op_w_max_latency_count(cluster=\$cluster\,}[$__rate_interval])
> * 1000))
>
> The extra multiple of 1000 seems extraneous based on the fact that the
> *latency_count is already in milliseconds, and the graph itself shows 'ms'.
> and led me to think I have disk latency issues, as these numbers are high.
> (maybe I do, and maybe I misunderstand something)
>
> Other dashboards such as 'Ceph Cluster - Advanced' which, in the panel
> 'OSD Commit Latency Distribution' use
> similar promQL expressions but without the extra multiple of 1000, which
> looks alot better for evaluation of my latencies.
>
> So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> Overview' dashboard? Or am I not understanding things correctly?
>
> Also, does 'ceph osd perf' just show the apply/commit sum/count metrics
> from above?
>
> Thanks for any assistance.
> -Chris
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io