Yes, if not otherwise specified the WAL will be placed alongside the DB. The WAL is usually a modest fixed size; in many cases the DB benefits considerably more from additional space, depending on the use case. RBD-only clusters tend to make much lower demands on the DB than RGW clusters do, especially when there are a lot of versioned or tiny S3 objects.
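When one offload SSD is shared by several HDD OSDs, the DB space each OSD gets is roughly the drive capacity divided by the number of OSDs. A quick sketch with assumed figures (a 1.6 TB NVMe shared by 6 HDDs; the headroom factor is illustrative, not a Ceph default):

```python
# Sketch: per-OSD DB slice from one shared offload drive (capacity / N).
# Assumed figures: a 1.6 TB (base-10) NVMe serving 6 HDD OSDs,
# with a little headroom left for LVM metadata.
drive_bytes = 1.6 * 10**12
num_hdds = 6
headroom = 0.98
db_slice_gb = int(drive_bytes * headroom / num_hdds / 10**9)
print(db_slice_gb)  # ~261 GB per DB slice
```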
Verify after deploying that all is as you expect with `ceph osd metadata` and `ceph device ls`. The shared offload devices should show an association with multiple OSDs. As you’ve probably found, you will need to completely zap all drives, including the partition tables, before the orchestrator will recognize them.

ATA_Micron_5400_MTFD_242849F1A773           m1833:sdad    osd.53 osd.62 osd.71 osd.80 osd.89       0%
ATA_Micron_5400_MTFD_242849F1A9A5           dd1329:sdv    osd.306 osd.307 osd.308 osd.309 osd.310  0%
SEAGATE_ST20000NM002D_ZVT74XC40000W319HWKU  x828:sdf      osd.32
SEAGATE_ST20000NM002D_ZVT88MQH0000C3241QSB  xxdd1333:sdr  osd.189

> On Jul 2, 2025, at 6:12 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>
> Hello
>
> I have redeployed the cluster
>
> I am planning to use the below spec file
>
> --dry-run shows that DB partitions will be created BUT not WAL ones
>
> My understanding, based on your comments, is that there is no need to specify
> wal_devices as they will automatically be colocated on the db_devices
> Is that correct ?
>
> If not, how do I "tell" it to colocate the WAL on the same devices as the DB ?
>
> Many thanks for your help
>
> On Mon, 30 Jun 2025 at 09:26, Anthony D'Atri <a...@dreamsnake.net
> <mailto:a...@dreamsnake.net>> wrote:
>>
>>> Hi Anthony
>>>
>>> Appreciate you taking the time to provide so much guidance
>>> All your messages in this mailing list are well documented and VERY helpful
>>
>> You’re most welcome. The community is a vital part of Ceph.
>>
>>> I have attached a text file with the output of the commands you mentioned
>>
>> Didn’t you say you were airgapped?
>>
>>> You are right, there is no /dev/bluefs_/bluefsstore_bdev pointing to the
>>> NVMe namespaces
>>> No specific reason to use a separate WAL, I just remembered it was recommended
>>> back in the "Mimic" days ( yes, I am a bit "rusty")
>>
>> I think the wording may have been, and perhaps still is, a bit misleading.
>> It’s really more of a “if you happen to have drives of 3 different speeds”
>> thing.
>>
>>> I am using 4+2 EC
>>
>> Groovy, that’s fine with 7 hosts.
>>
>>> I have 2 separate NVMe drives ( 1.6 TB each) dedicated to DB/WAL for the HDDs
>>> I want to use one for the 6 HDDs that I currently have, and save the other
>>> for when I will be adding more HDDs
>>>
>>> The servers are SuperMicro SSG-641E with 2 x Intel GOLD 6530 ( 32 cores
>>> each) and 1TB RAM
>>
>> So 64 pcores / 128 vcores / threads per server?
>>
>>> My plan was/is to use the 3 x 15TB NVMe on each server for high performance
>>> pools (like metadata for CephFS or index for RGW)
>>
>> That works, but do note that both of those pools require relatively little
>> capacity, and this strategy results in your best media being mostly unused.
>> With CephFS I might suggest configuring the first data pool on this device
>> class to accelerate head objects, and one or more additional data pools on
>> the other media that you can pin to specific subdirectories. So if your
>> mountpoint is /mycephfs, you might have directories called
>>
>> /mycephfs/slow        # on HDD
>> /mycephfs/faster      # on SATA SSDs
>> /mycephfs/wickedfast  # on NVMe SSDs
>>
>> Assuming that these drives are what you have to work with. If you have
>> other uses for the drives you might consider reallocating them to more closely
>> and efficiently meet your plans for the cluster.
>>
>>> I have carved them in 3 ( 5 TB each) so I can deploy more PGs and hence
>>> increase performance
>>
>> You really did just step out of a time machine from Mimic ;). This strategy
>> hasn’t been necessary for several releases. Before, say, Quincy it was
>> common practice to split NVMe SSDs into 2 or more OSDs each in order to
>> increase parallelism at the OSD level. With recent releases, though, this
>> is no longer necessary in most cases, and you’ll be better off
>> provisioning a single OSD per device.
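The per-directory pinning described above is done with CephFS file layouts. A hedged CLI sketch, assuming the extra data pools already exist; the pool and filesystem names here are made up for illustration, and these commands need a live cluster and a mounted filesystem:

```
# Add the extra data pools to the filesystem (names are hypothetical)
ceph fs add_data_pool mycephfs cephfs.data.hdd
ceph fs add_data_pool mycephfs cephfs.data.nvme

# Pin subdirectories to pools via the layout xattr; the layout
# applies to files created in the directory afterward
setfattr -n ceph.dir.layout.pool -v cephfs.data.hdd  /mycephfs/slow
setfattr -n ceph.dir.layout.pool -v cephfs.data.nvme /mycephfs/wickedfast
```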
You could do this in-situ by
>> adjusting your service spec and zapping one device’s complement of OSDs at a
>> time, letting the orchestrator redeploy a single OSD in their place with
>> subsequent rebalancing.
>>
>> That said, given that your wording mentions more PGs … you don’t need more
>> OSDs to deploy more PGs. You can accommodate more PGs by setting
>>
>> global advanced mon_target_pg_per_osd 250
>> global advanced mon_max_pg_per_osd 1000
>>
>>> The SSDs are meant to be used for RBD ( Proxmox )
>>
>> Groovy. You might deploy two RBD pools, one on the SATA SSDs and one on the
>> NVMe SSDs, with two Proxmox storage classes.
>>
>>> The HDDs are meant to be used for archiving data using S3 and CephFS.
>>>
>>> I can redeploy - please clarify 2 things about the OSD spec file
>>>
>>> 1. can I use NVMe namespaces ( created with the nvme command - see below) or
>>> should I let Ceph partition the NVMe disk ?
>>
>> I’ve wondered that myself for years. My research has yet to show a benefit
>> to using a namespace instead of LVM partitioning. If there is a salient
>> benefit to that approach that someone else can pipe in with, I’d love to
>> know the details.
>>
>> So if using an SSD for WAL/DB offload:
>>
>> * Do not specify a separate WAL device
>> * Let Ceph carve it into appropriate LVM slices so that it’s fully utilized;
>>   that’s a lot less hassle than namespaces.
>>
>> If using an SSD for OSDs:
>> * One per drive
>>
>>> In either case, how do you specify WHICH device/NVMe to use for
>>> DB/WAL ( as I have 2 identical ones and want to keep the second one for
>>> adding more HDDs in the future ) ?
>>>
>>> nvme create-ns $device --nsze=$per_ns_blocks --ncap=$per_ns_blocks
>>> --flbas=0 --dps=0
>>> nvme attach-ns $device --namespace-id=$i --controllers=`nvme
>>> id-ctrl $device
>>
>> Check out advanced OSD service specs in the docs. There are multiple ways to
>> skin that cat.
>>
>>> 2.
>>> to deploy 3 different types of OSDs using the spec I have, with the
>>> advice from you, I am guessing this is the correct spec
>>
>> I’ll defer to others here, I’ve got a major headache this morning and don’t
>> trust my ability right now to properly interpret these
>>
>>> service_type: osd
>>> service_id: hdd_osd
>>> crush_device_class: hdd_class
>>> placement:
>>>   host_pattern: *
>>> spec:
>>>   data_devices
>>>     rotational: 1
>>>   db_devices:
>>>     rotational: 0
>>>     size: 1000G:1600G  # I think as this is it will slice up all available
>>> drives evenly. You probably want to specify the path to just one offload
>>> drive for now, and an explicit DB size that is capacity/N
>>>   filter_logic: AND
>>>   objectstore: bluestore
>>>
>>> service_id: ssd_osd
>>> crush_device_class: ssd_class
>>> placement:
>>>   host_pattern: *
>>> spec:
>>>   data_devices
>>>     rotational: 0
>>>     size: 6000G:8000G
>>>
>>> service_id: nvme_osd
>>> crush_device_class: nvme_class
>>> placement:
>>>   host_pattern: *
>>> spec:
>>>   data_devices
>>>     rotational: 0
>>>     size: 4000G:5500G
>>>
>>> Many thanks
>>> Steven
>>>
>>> On Sun, 29 Jun 2025 at 17:45, Anthony D'Atri <a...@dreamsnake.net
>>> <mailto:a...@dreamsnake.net>> wrote:
>>>> So you have NVMe SSD OSDs, SATA SSD OSDs, and HDD OSDs with offload onto
>>>> NVMe SSDs.
>>>>
>>>> Did you have a specific reason to explicitly specify wal_devices? It’s
>>>> usually fine to just run with the default WAL size, with the WAL colocated
>>>> with the DB, and thus give your DB partitions a bit more space.
>>>>
>>>> What are your use-cases for these three classes of OSDs? Looks like you
>>>> have 42x 20T HDD OSDs, 63x NVMe OSDs, and 84x 7.6T SATA SSD OSDs?
>>>> Apparently with the 15T SSDs divided into 3x OSDs each? How much CPU do
>>>> you have on these nodes? Any specific reason to have chopped up the NVMe
>>>> SSDs into thirds?
>>>>
>>>> It looks to me as though your .mgr pool is using the default
>>>> replicated_rule, which does not specify a device class.
>>>> This will
>>>> confound the balancer and, if enabled, the pg_autoscaler.
>>>> I recommend changing the .mgr pool to use the CRUSH rule that the
>>>> non-buckets.data pools use, which should be one that specifies
>>>> 3-replication constrained to one of the SSD device classes. As it is, the
>>>> .mgr pool may be placed on any of the three device classes, which is
>>>> trivial with respect to space, but confounds as I mentioned.
>>>>
>>>> Or you could manually edit the CRUSH map and change the #0 replicated_rule
>>>> to specify nvme_class, but it sounds like you’re new to Ceph and I don’t
>>>> want to frighten you with that process, which unfortunately is still
>>>> old-school. Changing the rule as I suggested will be much safer.
>>>>
>>>> The numbers look like you have all of the RGW pools except buckets.data on
>>>> the nvme_class SSDs, which is fine, but you won’t begin to use all their
>>>> capacity; the index pool will maybe use 5-10% of the capacity used by your
>>>> buckets.data pool over time, depending on your distribution of object
>>>> sizes and the replication strategy of your buckets.data pool. Doing the
>>>> math I’ll speculate that your buckets.data pool is using a … EC 5+2
>>>> profile? True? If so I might suggest rebuilding if/while you still can.
>>>> There are distinct advantages to having EC K+M < the number of OSD nodes.
>>>>
>>>>> Hi,
>>>>>
>>>>> Yes, I have separate NVMe namespaces allocated for WAL and DB for each
>>>>> spinning disk
>>>>
>>>> Namespaces, or partitions?
>>>>
>>>>> Does that mean I still have to hunt for the 8TB culprit ?
>>>>
>>>> Okay, so `ceph df` shows 8.2 TB of raw space used on the hdd_class OSDs,
>>>> and that’s your concern, right?
>>>>
>>>> Please share outputs of the following:
>>>>
>>>> `ceph osd df` (showing a few of each device class)
>>>> `ceph osd dump | grep pool`
>>>> `ceph osd metadata NNNN | egrep /dev\|bluefs_\|bluestore_bdev` for at
>>>> least one OSD of each device class.
>>>> And run it yourself without
>>>> specifying an OSD ID so it captures all, and see if all OSDs in each
>>>> device class look the same.
>>>> `ceph device ls-by-host ceph-host-1`
>>>>
>>>> It’s entirely possible that your WAL+DB aren’t actually offloaded to SSDs
>>>> as you intended. Advanced OSD service specs can be tricky.
>>>>
>>>> That’s my suspicion: that the WAL+DB are actually still on your HDDs.
>>>> Which can be migrated in-situ, or you can nuke the site from orbit and
>>>> redeploy.
>>>>
>>>> A note about your OSD specs. Specifying the models as you’re doing is
>>>> totally supported. But think about what happens if you add nodes in the
>>>> future that have different drive SKUs, or you RMA a drive and they send
>>>> you a different SKU as the replacement.
>>>>
>>>> It’s usually more future-proof to use a size range in the spec for each
>>>> OSD service instead of `model`, with a bit of margin to account for base-2
>>>> units vs base-10 units.
>>>>
>>>> Here’s an example that creates OSDs on SSDs between 490 and 1200 GB; this
>>>> is on systems that have ~ 1TB nominal drives. The systems also have 2TB
>>>> SATA SSDs that are used for WAL+DB offload, which are above the 1200GB
>>>> limit specified, so they aren’t matched.
>>>>
>>>> service_type: osd
>>>> service_id: dashboard-admin-1705602677615
>>>> service_name: osd.dashboard-admin-1705602677615
>>>> placement:
>>>>   host_pattern: *
>>>> spec:
>>>>   data_devices:
>>>>     rotational: 0
>>>>     size: 490G:1200G
>>>>   filter_logic: AND
>>>>   objectstore: bluestore
>>>>
>>>> And here is a spec that matches any HDD larger than 18T and deploys OSDs
>>>> on them without offload. This cluster has 20TB HDDs, so the range of 18+
>>>> TB matches both the SEAGATE_ST20000NM007H and SEAGATE_ST20000NM002D drives
>>>> present.
>>>>
>>>> service_type: osd
>>>> service_id: cost_capacity
>>>> service_name: osd.cost_capacity
>>>> placement:
>>>>   host_pattern: noactuallyusedanymore
>>>> spec:
>>>>   data_devices:
>>>>     rotational: 1
>>>>     size: '18T:'
>>>>   filter_logic: AND
>>>>   objectstore: bluestore
>>>>
>>>> Oh, and make sure that your HDDs and SSDs are all updated to the most
>>>> recent firmware. If you have Dell chassis, run DSU on the nodes and
>>>> update all firmware, but skip the OS drivers. If you have HP chassis, you
>>>> can get firmware update scripts from their web site, but I suspect these
>>>> aren’t HP. If anyone else, they’re likely generic drives and you can get
>>>> firmware updaters from the manufacturers’ respective web sites.
>>>>
>>>> Then reboot the nodes one at a time to activate the new firmware, letting
>>>> the cluster completely recover between each reboot.
>>>>
>>>>> If yes, what would be the most efficient way of finding out what takes
>>>>> the space ?
>>>>>
>>>>> Apologies for sending pictures, but we are operating in an air-gapped
>>>>> environment
>>>>>
>>>>> I used this spec file to create the OSDs
>>>>>
>>>>> <image.png>
>>>>>
>>>>> Here is the osd tree of one of the servers;
>>>>> all the other 6 are similar
>>>>>
>>>>> <image.png>
>>>>>
>>>>> Steven
>>>>>
>>>>> On Sun, 29 Jun 2025 at 14:25, Anthony D'Atri <a...@dreamsnake.net
>>>>> <mailto:a...@dreamsnake.net>> wrote:
>>>>>> WAL by default rides along with the DB and rarely warrants a separate or
>>>>>> larger allocation.
>>>>>>
>>>>>> Since you say you’ve allocated DB space, does that mean that you have
>>>>>> WAL+DB offloaded onto SSDs? If so, they don’t contribute to the space
>>>>>> used on the hdd device class.
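The base-2 vs base-10 margin mentioned above is quick to check with arithmetic; this sketch uses the 20 TB drives from the example:

```python
# Sketch: vendors label drives in base-10 (TB), while some tools report
# base-2 (TiB). A "20 TB" HDD is ~18.19 TiB, which is why an '18T:' lower
# bound still matches it while leaving margin against unit confusion.
nominal_bytes = 20 * 10**12   # a "20 TB" drive
tib = nominal_bytes / 2**40
print(round(tib, 2))  # 18.19
```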
>>>>>>
>>>>>> > On Jun 29, 2025, at 1:56 PM, Steven Vacaroaia <ste...@gmail.com
>>>>>> > <mailto:ste...@gmail.com>> wrote:
>>>>>> >
>>>>>> > Hi Janne
>>>>>> >
>>>>>> > Thanks
>>>>>> > That makes sense, since I have allocated 196GB for DB and 5 GB for WAL
>>>>>> > for
>>>>>> > all 42 spinning OSDs
>>>>>> > Again, thanks
>>>>>> > Steven
>>>>>> >
>>>>>> > On Sun, 29 Jun 2025 at 12:02, Janne Johansson <icepic...@gmail.com
>>>>>> > <mailto:icepic...@gmail.com>> wrote:
>>>>>> >
>>>>>> >> On Sun, 29 June 2025 at 17:22, Steven Vacaroaia
>>>>>> >> <ste...@gmail.com <mailto:ste...@gmail.com>> wrote:
>>>>>> >>
>>>>>> >>> Hi,
>>>>>> >>>
>>>>>> >>> I just built a new Ceph Squid cluster with 7 nodes
>>>>>> >>> Since this is brand new, there is no actual data on it except a few
>>>>>> >>> test
>>>>>> >>> files in the S3 data.bucket
>>>>>> >>>
>>>>>> >>> Why is "ceph -s" reporting 8 TB of used capacity ?
>>>>>> >>>
>>>>>> >>
>>>>>> >> Because each OSD will have GBs of preallocated space for RocksDB,
>>>>>> >> write-ahead logs and other structures, and this counts against "raw
>>>>>> >> available space" even if you don't have objects of this size in the
>>>>>> >> pools. The creation of the DBs and other structures happened at OSD
>>>>>> >> creation,
>>>>>> >> or when the first object was made, and they are there even if you
>>>>>> >> delete the
>>>>>> >> object later.
>>>>>> >>
>>>>>> >> --
>>>>>> >> May the most significant bit of your life be positive.
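Janne's point about preallocated space can be sanity-checked against the figures Steven gives earlier in the thread; a rough sketch (numbers from the thread, result approximate):

```python
# Rough accounting sketch using figures from the thread: 42 spinning OSDs,
# each with a 196 GB DB and a 5 GB WAL allocation. That alone is in the
# ballpark of the ~8 TB of "used" raw capacity on an empty cluster.
num_osds = 42
db_gb, wal_gb = 196, 5
total_tb = num_osds * (db_gb + wal_gb) / 1000
print(round(total_tb, 2))  # ~8.44 TB consumed up front
```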
>>>>>> >>
>>>>>> > _______________________________________________
>>>>>> > ceph-users mailing list -- ceph-users@ceph.io
>>>>>> > <mailto:ceph-users@ceph.io>
>>>>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>> > <mailto:ceph-users-le...@ceph.io>
>>

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io