So you have NVMe SSD OSDs, SATA SSD OSDs, and HDD OSDs with WAL+DB offload onto
NVMe SSDs.
Did you have a specific reason to explicitly specify `wal_devices`? It’s
usually fine to just run with the default WAL size, with the WAL colocated
with the DB, which gives your DB partitions a bit more space.
What are your use-cases for these three classes of OSDs? Looks like you
have 42x 20T HDD OSDs, 63x NVMe OSDs, and 84x 7.6T SATA SSD OSDs?
Apparently with the 15T SSDs divided into 3x OSDs each? How much CPU do
you have on these nodes? Any specific reason to have chopped up the NVMe
SSDs into thirds?
It looks to me as though your .mgr pool is using the default
replicated_rule, which does not specify a device class. This will confound
the balancer and, if enabled, the pg_autoscaler.
I recommend changing the .mgr pool to use the CRUSH rule that the
non-buckets.data pools use, which should be one that specifies
3x replication constrained to one of the SSD device classes. As it is, the
.mgr pool may be placed on any of the three device classes, which is
trivial with respect to space, but confounds things as I mentioned.
Or you could manually edit the CRUSH map and change rule #0,
replicated_rule, to specify the nvme device class, but it sounds like you’re
new to Ceph and I don’t want to frighten you with that process, which
unfortunately is still old-school. Changing the pool’s rule as I suggested
will be much safer.
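The safer route is just a couple of CLI commands. A hedged sketch (the rule name `replicated_nvme` is an example; check what rules you actually have first):

```shell
# List existing CRUSH rules to find the one your SSD pools already use
ceph osd crush rule ls

# Or create a fresh replicated rule constrained to the nvme device class
ceph osd crush rule create-replicated replicated_nvme default host nvme

# Point the .mgr pool at it
ceph osd pool set .mgr crush_rule replicated_nvme
```

These only need a live cluster and admin keyring; nothing touches the CRUSH map by hand.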
The numbers suggest you have all of the RGW pools except buckets.data
on the nvme_class SSDs, which is fine, but you won’t begin to use all their
capacity: the index pool will maybe use 5-10% of the capacity used by your
buckets.data pool over time, depending on your distribution of object sizes
and the replication strategy of your buckets.data pool. Doing the math,
I’ll speculate that your buckets.data pool is using a … EC 5+2 profile?
True? If so I might suggest rebuilding if/while you still can. There are
distinct advantages to having EC K+M < the number of OSD nodes.
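With 7 nodes, a profile where K+M is less than the node count, e.g. 4+2, leaves a spare failure domain for recovery. A sketch of what the rebuild might look like (profile and pool names are examples, not your actual names; EC profiles are fixed at pool creation, so this means a new pool and migrating data):

```shell
# Create an EC profile that fits within 7 hosts with one to spare
ceph osd erasure-code-profile set ec42_hdd k=4 m=2 \
    crush-failure-domain=host crush-device-class=hdd

# Create a replacement data pool using that profile
ceph osd pool create default.rgw.buckets.data.new 256 256 erasure ec42_hdd
```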
Hi,
Yes, I have separate NVMe namespaces allocated for WAL and DB for each
spinning disk
Namespaces, or partitions?
Does that mean I still have to hunt for the 8TB culprit ?
Okay, so `ceph df` shows 8.2 TB of raw space used on the hdd_class OSDs;
that’s your concern, right?
Please share outputs of the following:
`ceph osd df` (showing a few of each device class)
`ceph osd dump | grep pool`
`ceph osd metadata NNNN | egrep '/dev|bluefs_|bluestore_bdev'` for at
least one OSD of each device class. Or run it without specifying an OSD ID
so it captures all OSDs, and check whether all OSDs in each device class
look the same.
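To eyeball all OSDs at once, something along these lines works if jq is available (field names are from memory; verify them against your own `ceph osd metadata` output):

```shell
# One line per OSD: id, data device(s), DB device(s), WAL device(s)
ceph osd metadata | jq -r \
  '.[] | [.id, .devices, .bluefs_db_devices // "-", .bluefs_wal_devices // "-"] | @tsv'
```

HDD OSDs with a proper offload should show an SSD device in the DB column; a `-` there would support the theory below.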
`ceph device ls-by-host ceph-host-1`
It’s entirely possible that your WAL+DB aren’t actually offloaded to SSDs
as you intended. Advanced OSD service specs can be tricky.
That’s my suspicion, that the WAL+DB are actually still on your HDDs.
Which can be migrated in-situ, or you can nuke the site from orbit and
redeploy.
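If you do go the in-situ route, the rough per-OSD shape is below. Device paths, IDs, and the VG/LV names are placeholders; stop the OSD first, and try it on a single OSD before looping over all 42. Exact tooling varies by release:

```shell
# With the OSD stopped, attach a new DB device and migrate BlueFS onto it
ceph-volume lvm new-db --osd-id 12 --osd-fsid <fsid> --target vg_nvme/db-12

# Or drive it with ceph-bluestore-tool directly:
# ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-12 \
#     --dev-target /dev/vg_nvme/db-12
```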
A note about your OSD specs. Specifying the models as you’re doing is
totally supported. But think about what happens if you add nodes in the
future that have different drive SKUs, or you RMA a drive and they send you
a different SKU as the replacement.
It’s usually more future-proof to use a size range in the spec for each
OSD service instead of `model`, with a bit of margin to account for base-2
units vs base-10 units.
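The margin matters because drives are sold in base-10 units while tools often report base-2. A quick illustration:

```shell
# A "7.68 TB" (decimal) SSD is only ~6.98 TiB (binary), and a "20 TB"
# HDD is ~18.19 TiB -- hence ranges like 490G:1200G or 18T: with slack
awk 'BEGIN { printf "%.2f TiB\n", 7.68e12 / 2^40 }'   # 6.98 TiB
awk 'BEGIN { printf "%.2f TiB\n", 20e12  / 2^40 }'    # 18.19 TiB
```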
Here’s an example that creates OSDs on SSDs between 490 and 1200 GB; this
is on systems that have ~1TB nominal drives. The systems also have 2TB
SATA SSDs that are used for WAL+DB offload, which fall above the 1200GB
limit specified, so they aren’t matched.
service_type: osd
service_id: dashboard-admin-1705602677615
service_name: osd.dashboard-admin-1705602677615
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 490G:1200G
  filter_logic: AND
  objectstore: bluestore
And here is a spec that matches any HDD larger than 18T and deploys OSDs
on them without offload. This cluster has 20TB HDDs, so the range of 18+
TB matches both the SEAGATE_ST20000NM007H and SEAGATE_ST20000NM002D drives
present.
service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: noactuallyusedanymore
spec:
  data_devices:
    rotational: 1
    size: '18T:'
  filter_logic: AND
  objectstore: bluestore
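Whichever spec you end up with, you can preview what cephadm would do before committing (the filename is a placeholder):

```shell
# Preview which devices a spec would claim without creating anything
ceph orch apply -i osd-spec.yaml --dry-run

# Export the OSD specs currently in effect, for comparison
ceph orch ls osd --export
```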
Oh, and make sure that your HDDs and SSDs are all updated to the most
recent firmware. If you have Dell chassis, run DSU on the nodes and update
all firmware, but skip the OS drivers. If you have HP chassis, you can get
firmware update scripts from their web site, but I suspect these aren’t
HP. If anyone else, they’re likely generic drives and you can get firmware
updaters from the respective manufacturers’ web sites.
Then reboot the nodes one at a time to put the new firmware into effect,
letting the cluster completely recover between each reboot.
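During that rolling reboot, it helps to keep the cluster from starting a rebalance while each node is down:

```shell
ceph osd set noout       # don't mark down OSDs out while a node reboots
# ... reboot the node, wait for HEALTH_OK / all PGs active+clean ...
ceph osd unset noout     # restore normal behavior when all nodes are done
```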
If yes, what would be the most efficient way of finding out what is taking
the space?
Apologies for sending pictures, but we are operating in an air-gapped
environment
I used this spec file to create the OSDs
<image.png>
Here is the osd tree of one of the servers
all the other 6 are similar
<image.png>
Steven
On Sun, 29 Jun 2025 at 14:25, Anthony D'Atri <a...@dreamsnake.net> wrote:
WAL by default rides along with the DB and rarely warrants a separate or
larger allocation.
Since you say you’ve allocated DB space, does that mean that you have
WAL+DB offloaded onto SSDs? If so they don’t contribute to the space used
on the hdd device class.
> On Jun 29, 2025, at 1:56 PM, Steven Vacaroaia <ste...@gmail.com>
wrote:
>
> Hi Janne
>
> Thanks
> That make sense since I have allocated 196GB for DB and 5 GB for WALL
for
> all 42 spinning OSDs
> Again, thanks
> Steveb
>
> On Sun, 29 Jun 2025 at 12:02, Janne Johansson <icepic...@gmail.com>
wrote:
>
>> Den sön 29 juni 2025 kl 17:22 skrev Steven Vacaroaia <
ste...@gmail.com>:
>>
>>> Hi,
>>>
>>> I just built a new CEPH squid cluster with 7 nodes
>>> Since this is brand new, there is no actuall data on it except few
test
>>> files in the S3 data.bucket
>>>
>>> Why is "ceph -s" reporting 8 TB of used capacity ?
>>>
>>
>> Because each OSD will have GBs of preallocated data for the RocksDB,
>> write-ahead-logs and other structures, and this counts against "raw
>> available space", even if you don't have objects of this size put
into the
>> pools, the creation of the DBs and other things happened at osd
creation,
>> or when the first object was made, and are there even if you delete
the
>> object later.
>>
>> --
>> May the most significant bit of your life be positive.
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io