Hi,

That is correct, there is no need to specify WAL devices; they will be
automatically colocated on the DB devices.
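
If you want to double-check after deployment, the OSD metadata should show a
dedicated DB and no separate WAL. Something like this (field names from
memory, so verify against your actual output):

  ceph osd metadata <osd-id> | grep bluefs
  # expect bluefs_dedicated_db: "1" and bluefs_dedicated_wal: "0"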

Quoting Steven Vacaroaia <ste...@gmail.com>:

Hello

I have redeployed the cluster

I am planning to use the spec file below.

--dry-run  shows that DB partitions will be created BUT not WAL ones

My understanding, based on your comments, is that there is no need to
specify wal_devices, as the WAL will automatically be colocated on the
db_devices.
Is that correct?

If not, how do I "tell" it to colocate the WAL on the same devices as the DB?

Many thanks for your help

[image: image.png]


[image: image.png]


On Mon, 30 Jun 2025 at 09:26, Anthony D'Atri <a...@dreamsnake.net> wrote:


Hi Anthony

Appreciate you taking the time to provide so much guidance
All your messages in this mailing list are well documented and VERY
helpful


You’re most welcome.  The community is a vital part of Ceph.

I have attached a text file with the output of the commands you mentioned


Didn’t you say you were airgapped?

You are right, none of the bluefs_*/bluestore_bdev paths point to the
NVMe namespaces.
No specific reason to use a separate WAL, I just remembered it was
recommended back in the "mimic" days (yes, I am a bit "rusty")


I think the wording may have been, and perhaps still is, a bit misleading.
It’s really more of an “if you happen to have drives of 3 different speeds”
thing.

I am using 4+2 EC


Groovy, that’s fine with 7 hosts.

I have 2 separate NVMe drives (1.6 TB each) dedicated to DB/WAL for the HDDs.
I want to use one for the 6 HDDs that I currently have, and save the other
for when I will be adding more HDDs.

The servers are SuperMicro SSG-641E with 2 x Intel GOLD 6530 (32 cores
each) and 1 TB RAM


So 64 pcores / 128 vcores / threads per server?


My plan was/is to use the 3 x 15TB NVME on each server for high
performance pools (like metadata for cephfs or index for RGW)


That works, but do note that both of those pools require relatively little
capacity and this strategy results in your best media being mostly unused.
With CephFS I might suggest configuring the first data pool on this device
class to accelerate head objects, and one or more additional data pools on
the other media that you can pin to specific subdirectories.  So if your
mountpoint is /mycephfs, you might have directories called

/mycephfs/slow  # on HDD
/mycephfs/faster # on SATA SSDs
/mycephfs/wickedfast # on NVMe SSDs

Assuming that these drives are what you have to work with.  If you have
other uses for drives you might consider reallocating them to more closely
and efficiently meet your plans for the cluster.
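
For reference, the subdirectory pinning is done with file layouts, roughly
like this (filesystem and pool names here are made up, and each pool has to
be added to the filesystem first):

  ceph fs add_data_pool mycephfs cephfs_data_hdd
  ceph fs add_data_pool mycephfs cephfs_data_sata
  # new files created under each directory land in the pool set on it
  setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mycephfs/slow
  setfattr -n ceph.dir.layout.pool -v cephfs_data_sata /mycephfs/faster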

I have carved them into 3 (5 TB each) so I can deploy more PGs and hence
increase performance


You really did just step out of a time machine from Mimic ;). This
strategy hasn’t been necessary for several releases.  Before, say, Quincy
it was common practice to split NVMe SSDs into 2 or more OSDs each in order
to increase parallelism at the OSD level.  With recent releases, though,
this is no longer necessary in most cases, and you’ll be better off by
provisioning a single OSD per device.  You could do this in-situ by
adjusting your service spec and zapping one device’s complement of OSDs at
a time, letting the orchestrator redeploy a single OSD in their place with
subsequent rebalancing.
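
Roughly, per device (OSD IDs and the spec filename below are placeholders;
double-check what maps to what before zapping anything):

  # remove and zap the three OSDs sharing one NVMe device
  ceph orch osd rm 12 13 14 --zap
  # once draining/recovery finishes, the updated spec redeploys
  # a single OSD on the freed device
  ceph orch apply -i osd_spec.yml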

That said, given that your wording mentions more PGs … you don’t need more
OSDs to deploy more PGs.  You can accommodate more PGs by setting

global        advanced  mon_target_pg_per_osd        250
global        advanced  mon_max_pg_per_osd          1000
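
e.g. with ceph config, using those same values:

  ceph config set global mon_target_pg_per_osd 250
  ceph config set global mon_max_pg_per_osd 1000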


The SSDs are meant to be used for RBD (Proxmox)


Groovy.  You might deploy two RBD pools, one on the SATA SSDs and one on
the NVMe SSDs, with two Proxmox storage classes.
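
A rough sketch, with made-up rule and pool names and your own device class
names substituted (PG counts are only starting points for the autoscaler):

  ceph osd crush rule create-replicated rep_sata default host ssd_class
  ceph osd crush rule create-replicated rep_nvme default host nvme_class
  ceph osd pool create rbd_sata 128 128 replicated rep_sata
  ceph osd pool create rbd_nvme 128 128 replicated rep_nvme
  rbd pool init rbd_sata
  rbd pool init rbd_nvme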


The HDDs are meant to be used for archiving data using S3 and CephFS.

I can redeploy - please clarify 2 things about the OSD spec file:

  1. Can I use NVMe namespaces (created with the nvme command - see below) or
should I let Ceph partition the NVMe disk?


I’ve wondered that myself for years.  My research has yet to show a
benefit to using a namespace instead of LVM partitioning.  If there is a
salient benefit to that approach that someone else can pipe in with, I’d
love to know the details.

So if using an SSD for WAL/DB offload:

* Do not specify a separate WAL device
* Let Ceph carve it into appropriate LVM slices so that it’s fully
utilized; that’s a lot less hassle than namespaces.

If using an SSD for OSDs
* One per drive

       In either case, how do you specify WHICH device/NVMe to use for
DB/WAL (as I have 2 identical ones and want to keep the second one for
adding more HDDs in the future)?

        nvme create-ns $device --nsze=$per_ns_blocks --ncap=$per_ns_blocks
--flbas=0 --dps=0
        nvme attach-ns $device --namespace-id=$i --controllers=`nvme
id-ctrl $device


Check out advanced OSD service specs in the docs.  There are multiple ways
to skin that cat.
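
One way, assuming device paths are stable on your hosts (the path below is a
placeholder for whichever of the two NVMe drives you want used now; size or
vendor/model filters are alternatives):

service_type: osd
service_id: hdd_osd
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    paths:
      - /dev/nvme0n1   # the one offload device to use for now
  filter_logic: AND
  objectstore: bluestore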



 2. To deploy 3 different types of OSDs using the spec I have, with the
advice from you, I am guessing this is the correct spec?


I’ll defer to others here, I’ve got a major headache this morning and
don’t trust my ability right now to properly interpret these



service_type: osd
service_id: hdd_osd
crush_device_class: hdd_class
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
    size: 1000G:1600G # I think as written this will slice up all available
                      # drives evenly.  You probably want to specify the path
                      # to just one offload drive for now, and an explicit DB
                      # size that is capacity/N
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: ssd_osd
crush_device_class: ssd_class
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 6000G:8000G
---
service_type: osd
service_id: nvme_osd
crush_device_class: nvme_class
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 4000G:5500G

Many thanks
Steven

On Sun, 29 Jun 2025 at 17:45, Anthony D'Atri <a...@dreamsnake.net> wrote:

So you have NVMe SSD OSDs, SATA SSD OSDs, and HDD OSDs with offload onto
NVMe SSDs.

Did you have a specific reason to explicitly specify wal_devices?  It’s
usually fine to just run with the default WAL size, with the WAL colocated
with the DB, which also gives your DB partitions a bit more space.

What are your use-cases for these three classes of OSDs?  Looks like you
have 42x 20T HDD OSDs, 63x NVMe OSDs,  and 84x 7.6T SATA SSD OSDs?
Apparently with the 15T SSDs divided into 3x OSDs each?  How much CPU do
you have on these nodes? Any specific reason to have chopped up the NVMe
SSDs into thirds?

It looks to me as though your .mgr pool is using the default
replicated_rule, which does not specify a device class.  This will confound
the balancer and, if enabled, the pg_autoscaler.
I recommend changing the .mgr pool to use the CRUSH rule that the
non-buckets.data pools use, which should be one that specifies
3-replication constrained to one of the SSD device classes.  As it is, the
.mgr pool may be placed on any of the three device classes, which is
trivial with respect to space, but confounds things as I mentioned.

Or you could manually edit the CRUSH map and change the #0
replicated_rule to specify nvme_class but it sounds like you’re new to Ceph
and I don’t want to frighten you with that process which unfortunately is
still old-school.  Changing the rule as I suggested will be much safer.
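
That change is a one-liner; the rule name below is a placeholder for whatever
your SSD-backed pools already use:

  ceph osd pool set .mgr crush_rule <your_ssd_replicated_rule>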

The numbers look like you have all of the RGW pools except buckets.data
on the nvme_class SSDs, which is fine, but you won’t begin to use all their
capacity; the index pool will maybe use 5-10% of the capacity used by your
buckets.data pool over time, depending on your distribution of object sizes
and the replication strategy of your buckets.data pool.  Doing the math
I’ll speculate that your buckets.data pool is using a … EC 5+2 profile?
True?  If so I might suggest rebuilding if/while you still can.  There are
distinct advantages to having EC K+M < the number of OSD nodes.
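
To confirm what you actually have (the pool name below assumes the default
RGW zone naming):

  ceph osd pool get default.rgw.buckets.data erasure_code_profile
  ceph osd erasure-code-profile get <profile_from_previous_command>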




Hi,

Yes, I have separate NVME namespaces allocated  for WAL and DB to each
spinning disk


Namespaces, or partitions?


Does that mean I still have to hunt for the 8TB culprit ?


Okay so `ceph df` shows 8.2 TB of raw space used on the hdd_class OSDs,
that’s your concern, right?

Please share outputs of the following:

`ceph osd df` (showing a few of each device class)
`ceph osd dump | grep pool`
`ceph osd metadata NNNN | egrep '/dev|bluefs_|bluestore_bdev'` for at
least one OSD of each device class.  And run it yourself without specifying
an OSD ID so it captures all of them, and see if all OSDs in each device
class look the same.
`ceph device ls-by-host ceph-host-1`

It’s entirely possible that your WAL+DB aren’t actually offloaded to SSDs
as you intended.  Advanced OSD service specs can be tricky.

That’s my suspicion, that the WAL+DB are actually still on your HDDs.
Which can be migrated in-situ, or you can nuke the site from orbit and
redeploy.
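
If you go the in-situ route, the rough shape is something like this; the OSD
id, fsid, and VG/LV names are placeholders, so check the ceph-volume docs
before running it:

  ceph orch daemon stop osd.12
  cephadm shell --name osd.12 -- ceph-volume lvm new-db \
      --osd-id 12 --osd-fsid <fsid> --target <db-vg>/<db-lv>
  ceph orch daemon start osd.12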

A note about your OSD specs.  Specifying the models as you’re doing is
totally supported.  But think about what happens if you add nodes in the
future that have different drive SKUs, or you RMA a drive and they send you
a different SKU as the replacement.

It’s usually more future-proof to use a size range in the spec for each
osd service instead of `model`, with a bit of margin to account for base 2
units vs base 10 units.

Here’s an example that creates OSDs on SSDs between 490 and 1200 GB; this
is on systems that have ~1 TB nominal drives.  The systems also have 2 TB
SATA SSDs that are used for WAL+DB offload, which are above the 1200 GB
limit specified, so they aren’t matched.

service_type: osd
service_id: dashboard-admin-1705602677615
service_name: osd.dashboard-admin-1705602677615
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 490G:1200G
  filter_logic: AND
  objectstore: bluestore

And here is a spec that matches any HDD larger than 18T and deploys OSDs
on them without offload.  This cluster has 20TB HDDs, so the range of 18+
TB matches both the SEAGATE_ST20000NM007H and SEAGATE_ST20000NM002D drives
present.

service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: noactuallyusedanymore
spec:
  data_devices:
    rotational: 1
    size: '18T:'
  filter_logic: AND
  objectstore: bluestore

Oh, and make sure that your HDDs and SSDs are all updated to the most
recent firmware.  If you have Dell chassis, run DSU on the nodes and update
all firmware, but skip the OS drivers.  If you have HP chassis, you can get
firmware update scripts from their web site, but I suspect these aren’t
HP.  If anyone else, they’re likely generic drives and you can get firmware
updaters from the manufacturers’ respective web sites.

Then reboot nodes one at a time to put the new firmware into effect, letting
the cluster completely recover between each reboot.
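
One common pattern for that, which avoids pointless rebalancing during the
reboot window:

  ceph osd set noout
  # reboot one node, wait until `ceph -s` shows all PGs active+clean,
  # then move on to the next node
  ceph osd unset noout   # once the last node is back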




If yes, what would be the most efficient way of finding out what takes up
the space?

Apologies for sending pictures but we are operating in an air gapped
environment

I used this spec file to create the OSDs

<image.png>

Here is the osd tree of one of the servers
all the other 6 are similar

<image.png>

Steven


On Sun, 29 Jun 2025 at 14:25, Anthony D'Atri <a...@dreamsnake.net> wrote:

WAL by default rides along with the DB and rarely warrants a separate or
larger allocation.

Since you say you’ve allocated DB space, does that mean that you have
WAL+DB offloaded onto SSDs?  If so they don’t contribute to the space used
on the hdd device class.


> On Jun 29, 2025, at 1:56 PM, Steven Vacaroaia <ste...@gmail.com>
wrote:
>
> Hi Janne
>
> Thanks
> That makes sense, since I have allocated 196 GB for DB and 5 GB for WAL
> for all 42 spinning OSDs.
> Again, thanks
> Steven
>
> On Sun, 29 Jun 2025 at 12:02, Janne Johansson <icepic...@gmail.com>
wrote:
>
>> On Sun, 29 Jun 2025 at 17:22, Steven Vacaroaia <ste...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I just built a new CEPH squid cluster with 7 nodes
>>> Since this is brand new, there is no actual data on it except a few
>>> test files in the S3 data.bucket
>>>
>>> Why is "ceph -s" reporting 8 TB of used capacity ?
>>>
>>
>> Because each OSD will have GBs of preallocated data for the RocksDB,
>> write-ahead logs and other structures, and this counts against "raw
>> available space" even if you don't have objects of this size in the
>> pools.  The creation of the DBs and other things happened at OSD
>> creation, or when the first object was made, and that space stays
>> allocated even if you delete the object later.
>>
>> --
>> May the most significant bit of your life be positive.
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io






_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
