[ceph-users] Erasure coding and backfilling speed
Hi. I have a Ceph (NVMe) based cluster with 12 hosts and 40 OSDs. It is currently backfilling PGs, but I cannot get it to run more than 20 backfilling PGs at the same time (6+2 EC profile). osd_max_backfills = 100 and osd_recovery_max_active_ssd = 50 (non-sane values), but it still stops at 20 with 40+ PGs in backfill_wait. Any idea how to speed it up? Thanks.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
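For reference, a quick way to double-check what the running OSDs actually use (a minimal sketch; osd.0 is just an example daemon, ceph config show needs Mimic or newer, and injectargs is only required if the values were set in ceph.conf without a restart):

$ ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'
$ ceph tell osd.* injectargs '--osd_max_backfills=100 --osd_recovery_max_active_ssd=50'

Note that the number of concurrently backfilling PGs is also bounded by per-OSD backfill reservations on both the source and target OSDs, so a small cluster can plateau well below the configured maximum.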
[ceph-users] HBase/HDFS on Ceph/CephFS
Hi. We have a 3-year-old Hadoop cluster - up for refresh - so it is time to evaluate options. The "only" use case is running an HBase installation which is important for us, and migrating out of HBase would be a hassle. Our Ceph usage has expanded and in general we really like what we see. Thus: can this be "sanely" consolidated somehow? I have seen this: https://docs.ceph.com/docs/jewel/cephfs/hadoop/ but it seems really, really bogus to me. It recommends that you set:

pool 3 'hadoop1' rep size 1 min_size 1

which would - if I understand correctly - be disastrous. The Hadoop end would replicate 3x across nodes, but within Ceph the replication would be 1. The 1x replication in Ceph means pulling an OSD node would "guarantee" that its PGs go inactive - which could be OK - but there is nothing guaranteeing that the other Hadoop replicas are not served out of the same OSD node/PG? In which case rebooting an OSD node would make the Hadoop cluster unavailable. Is anyone serving HBase out of Ceph - how does the stack and configuration look? If I went for 3x replication in both Ceph and HDFS then it would definitely work, but 9x copies of the dataset is a bit more than what looks feasible at the moment.

Thanks for your reflections/input.
Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: HBase/HDFS on Ceph/CephFS
> local filesystem is a bit tricky, we just tried a POC mounting CephFS
> into every Hadoop node, configuring Hadoop to use LocalFS with Replica = 1. Which
> ends up with each piece of data written only once into CephFS, and CephFS takes care of
> the data durability.

Can you tell a bit more about this? Well, yes, I lose data locality - but HBase is not that good at maintaining that anyway. When starting up it does not distribute shards to the HDFS nodes that have the data, but pulls randomly. It gets locality back either by a "major compact" or by waiting for compaction to re-write everything again. I may get equally good data locality with Ceph-backed SSDs as with the local HDDs I currently have.

Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Ceph MDS - busy?
Hi. How do I find out if the MDS is "busy" - being the one limiting CephFS metadata throughput (12.2.8)?

$ time find . | wc -l
1918069

real    8m43.008s
user    0m2.689s
sys     0m7.818s

That is roughly 3,667 files per second, or ~0.27 ms per file. In light of potential batching and a network latency of ~0.20 ms to the MDS, I have a feeling this could be significantly improved. I additionally tried to do the same through the NFS-Ganesha gateway.

For reference, the same on local DAS (xfs):

$ time find . | wc -l
1918061

real    0m4.848s
user    0m2.360s
sys     0m2.816s

The same, but with the local DAS above exported over NFS:

$ time find . | wc -l
1918061

real    5m56.546s
user    0m2.903s
sys     0m34.381s

jk@ceph-mon1:~$ sudo ceph fs status
cephfs - 84 clients
======
+------+----------------+-----------+---------------+-------+-------+
| Rank |     State      |    MDS    |    Activity   |  dns  |  inos |
+------+----------------+-----------+---------------+-------+-------+
|  0   |     active     | ceph-mds2 | Reqs: 1369 /s | 11.3M | 11.3M |
| 0-s  | standby-replay | ceph-mds1 | Evts:    0 /s |   0   |   0   |
+------+----------------+-----------+---------------+-------+-------+
+------------------+----------+-------+-------+
|       Pool       |   type   |  used | avail |
+------------------+----------+-------+-------+
| cephfs_metadata  | metadata |  226M | 16.4T |
|   cephfs_data    |   data   |  164T |  132T |
| cephfs_data_ec42 |   data   |  180T |  265T |
+------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: ceph version 12.2.5-45redhat1xenial (d4b9f17b56b3348566926849313084dd6efc2ca2) luminous (stable)

How can we assess where the bottleneck is and what to do to speed it up?
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
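To gauge whether the MDS itself is the limiter, the admin-socket counters are one place to look (a minimal sketch; run on the MDS host, daemon name taken from the fs status output above):

$ ceph daemonperf mds.ceph-mds2                  # live per-second counters (requests, cache, journal)
$ ceph daemon mds.ceph-mds2 perf dump            # full counter dump; mds_server.handle_client_request tracks the request rate
$ ceph daemon mds.ceph-mds2 dump_ops_in_flight   # operations currently queued/in flight

If the request rate plateaus while client-side wall time keeps growing, the MDS (largely single-threaded for metadata work) is a likely bottleneck; otherwise the per-file round trips and readdir batching on the client side are worth investigating first.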
[ceph-users] Re: OSD weight on Luminous
unless uou have enabled some balancing - then this is very normal (actually pretty good normal) Jesper Sent from myMail for iOS Thursday, 14 May 2020, 09.35 +0200 from Florent B. : >Hi, > >I have something strange on a Ceph Luminous cluster. > >All OSDs have the same size, the same weight, and one of them is used at >88% by Ceph (osd.3) while others are around 40 to 50% usage : > ># ceph osd df >ID CLASS WEIGHT REWEIGHT SIZE USE DATA OMAP META >AVAIL %USE VAR PGS > 2 hdd 0.49179 1.0 504GiB 264GiB 263GiB 63.7MiB 960MiB >240GiB 52.34 1.14 81 >13 hdd 0.49179 1.0 504GiB 267GiB 266GiB 55.7MiB 1.37GiB >236GiB 53.09 1.16 94 >20 hdd 0.49179 1.0 504GiB 235GiB 234GiB 62.5MiB 962MiB >268GiB 46.70 1.02 99 >21 hdd 0.49179 1.0 504GiB 306GiB 305GiB 65.2MiB 991MiB >198GiB 60.75 1.32 87 >22 hdd 0.49179 1.0 504GiB 185GiB 184GiB 51.9MiB 972MiB >318GiB 36.83 0.80 73 >23 hdd 0.49179 1.0 504GiB 167GiB 166GiB 60.9MiB 963MiB >337GiB 33.07 0.72 80 >24 hdd 0.49179 1.0 504GiB 235GiB 234GiB 67.5MiB 956MiB >268GiB 46.74 1.02 90 >25 hdd 0.49179 1.0 504GiB 183GiB 182GiB 68.8MiB 955MiB >321GiB 36.32 0.79 100 > 3 hdd 0.49179 1.0 504GiB 442GiB 440GiB 77.5MiB 1.15GiB >61.9GiB 87.70 1.91 103 >26 hdd 0.49179 1.0 504GiB 220GiB 219GiB 61.2MiB 963MiB >283GiB 43.78 0.95 80 >29 hdd 0.49179 1.0 504GiB 298GiB 296GiB 77.4MiB 1013MiB >206GiB 59.09 1.29 106 >30 hdd 0.49179 1.0 504GiB 183GiB 182GiB 60.2MiB 964MiB >321GiB 36.32 0.79 88 >10 hdd 0.49179 1.0 504GiB 176GiB 175GiB 56.5MiB 968MiB >327GiB 35.02 0.76 85 >11 hdd 0.49179 1.0 504GiB 209GiB 208GiB 62.5MiB 961MiB >295GiB 41.42 0.90 89 > 0 hdd 0.49179 1.0 504GiB 253GiB 252GiB 55.7MiB 968MiB >251GiB 50.18 1.09 76 > 1 hdd 0.49179 1.0 504GiB 199GiB 198GiB 60.4MiB 964MiB >305GiB 39.51 0.86 92 >16 hdd 0.49179 1.0 504GiB 219GiB 218GiB 58.2MiB 966MiB >284GiB 43.51 0.95 85 >17 hdd 0.49179 1.0 504GiB 231GiB 230GiB 69.0MiB 955MiB >272GiB 45.97 1.00 97 >14 hdd 0.49179 1.0 504GiB 210GiB 209GiB 61.0MiB 963MiB >293GiB 41.72 0.91 74 >15 hdd 0.49179 1.0 504GiB 182GiB 181GiB 50.7MiB 973MiB >322GiB 36.10 0.79 72 >18 hdd 0.49179 1.0 504GiB 297GiB 296GiB 53.7MiB 978MiB >206GiB 59.03 1.29 87 >19 hdd 0.49179 1.0 504GiB 125GiB 124GiB 61.9MiB 962MiB >379GiB 24.81 0.54 82 > TOTAL 10.8TiB 4.97TiB 4.94TiB 1.33GiB 21.4GiB >5.85TiB 45.91 >MIN/MAX VAR: 0.54/1.91 STDDEV: 12.80 > > >Is it a normal situation ? Is there any way to let Ceph handle this >alone or am I forced to reweight the OSD manually ? > >Thank you. > >Florent >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
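If you do want Ceph to even this out on its own, the upmap balancer is the usual answer (a minimal sketch; it requires that all clients are Luminous or newer before the compat requirement is raised):

$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer on
$ ceph balancer status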
[ceph-users] Change crush rule on pool
Hi I would like to change the crush rule so data lands on ssd instead of hdd, can this be done on the fly and migration will just happen or do I need to do something to move data? Jesper Sent from myMail for iOS ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
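For reference, a minimal sketch of what that change looks like for a replicated pool using device classes (rule and pool names are placeholders, and data starts migrating the moment the pool's rule is switched):

$ ceph osd crush rule create-replicated rule-ssd default host ssd
$ ceph osd pool set <poolname> crush_rule rule-ssd
$ ceph -s     # watch the resulting remapped/backfilling PGs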
[ceph-users] Re: Change crush rule on pool
> I would like to change the crush rule so data lands on ssd instead of hdd,
> can this be done on the fly and migration will just happen or do I need to
> do something to move data?

I would actually like to relocate my object store to a new storage tier. Is the best approach to:

1) create a new pool on the new storage tier (SSD)
2) stop activity
3) rados cppool the data to the new pool
4) rename the pool back to "default.rgw.buckets.data"

Done? Thanks.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
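A rough sketch of that sequence (pool names are examples; rados cppool has known limitations and is generally discouraged for pools with snapshots or heavy omap use, so the RGWs should be stopped and the copy verified first - simply switching the pool's crush_rule, as discussed above, is usually the safer route):

$ rados cppool default.rgw.buckets.data default.rgw.buckets.data.ssd
$ ceph osd pool rename default.rgw.buckets.data default.rgw.buckets.data.old
$ ceph osd pool rename default.rgw.buckets.data.ssd default.rgw.buckets.data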
[ceph-users] Re: Change crush rule on pool
Can I do that when the SSDs are already used in another crush rule - backing the kvm_ssd RBDs?

Jesper
Sent from myMail for iOS

Saturday, 12 September 2020, 11.01 +0200 from anthony.da...@gmail.com :
>If you have capacity to have both online at the same time, why not add the
>SSDs to the existing pool, let the cluster converge, then remove the HDDs?
>Either all at once or incrementally? With care you’d have zero service
>impact. If you want to change the replication strategy at the same time, that
>would be more complex.
>
>— Anthony
>
>> On Sep 12, 2020, at 12:42 AM, jes...@krogh.cc wrote:
>>
>>> I would like to change the crush rule so data lands on ssd instead of hdd,
>>> can this be done on the fly and migration will just happen or do I need to
>>> do something to move data?
>>
>> I would actually like to relocate my object store to a new storage tier.
>> Is the best approach to:
>>
>> 1) create a new pool on the new storage tier (SSD)
>> 2) stop activity
>> 3) rados cppool the data to the new pool
>> 4) rename the pool back to "default.rgw.buckets.data"
>>
>> Done?
>>
>> Thanks.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: krdb upmap compatibility
What will actually happen if an old client comes by, potential data damage - or just broken connections from the client? jesper Sent from myMail for iOS Monday, 26 August 2019, 20.16 +0200 from Paul Emmerich : >4.13 or newer is enough for upmap > >-- >Paul Emmerich > >Looking for help with your Ceph cluster? Contact us at https://croit.io > >croit GmbH >Freseniusstr. 31h >81247 München >www.croit.io >Tel: +49 89 1896585 90 > >On Mon, Aug 26, 2019 at 8:01 PM Frank R < frankaritc...@gmail.com > wrote: >> >> It seems that with Linux kernel 4.16.10 krdb clients are seen as Jewel >> rather than Luminous. Can someone tell me which kernel version will be seen >> as Luminous as I want to enable the Upmap Balancer. >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
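For reference, a quick way to see what feature level the currently connected clients report before flipping the requirement (a minimal sketch; my understanding is that once the luminous requirement is set, too-old clients are simply refused at connection time rather than risking data damage):

$ ceph features                                     # summarizes client feature bits / release per connection
$ ceph osd set-require-min-compat-client luminous   # from then on, pre-luminous clients cannot connect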
[ceph-users] Re: the ceph rbd read dd with fio performance diffrent so huge?
concurrency is widely different 1:30 Jesper Sent from myMail for iOS Tuesday, 27 August 2019, 16.25 +0200 from linghucongs...@163.com : >The performance with the dd and fio diffrent is so huge? > >I have 25 OSDS with 8TB hdd. with dd I only get 410KB/s read perfomance,but >with fio I get 991.23MB/s read perfomance. > >like below: > >Thanks in advance! > >root@Server-d5754749-cded-4964-8129-ba1accbe86b3:~# time dd of=/dev/zero >if=/mnt/testw.dbf bs=4k count=1 iflag=direct >1+0 records in >1+0 records out >4096 bytes (41 MB, 39 MiB) copied, 99.9445 s, 410 kB/s > >real 1m39.950s >user 0m0.040s >sys 0m0.292s > > > >root@Server-d5754749-cded-4964-8129-ba1accbe86b3:~# > fio --filename=/mnt/test1 -direct=1 -iodepth 1 -thread -rw=read >-ioengine=libaio -bs=4k -size=1G -numjobs=30 -runtime=10 >-group_reporting -name=mytest >mytest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1 >... >fio-2.2.10 >Starting 30 threads >Jobs: 30 (f=30): [R(30)] [100.0% done] [1149MB/0KB/0KB /s] [294K/0/0 iops] >[eta 00m:00s] >mytest: (groupid=0, jobs=30): err= 0: pid=5261: Tue Aug 27 13:37:28 2019 > read : io=9915.2MB, bw=991.23MB/s, iops=253752, runt= 10003msec > slat (usec): min=2, max=200020, avg=39.10, stdev=1454.14 > clat (usec): min=1, max=160019, avg=38.57, stdev=1006.99 > lat (usec): min=4, max=200022, avg=87.37, stdev=1910.99 > clat percentiles (usec): > | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1], > | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], > | 70.00th=[ 1], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 2], > | 99.00th=[ 612], 99.50th=[ 684], 99.90th=[ 780], 99.95th=[ 1020], > | 99.99th=[56064] > bw (KB /s): min= 7168, max=46680, per=3.30%, avg=33460.79, stdev=12024.35 > lat (usec) : 2=73.62%, 4=22.38%, 10=0.05%, 20=0.03%, 50=0.01% > lat (usec) : 100=0.01%, 250=0.03%, 500=1.93%, 750=1.75%, 1000=0.14% > lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% > lat (msec) : 100=0.03%, 250=0.01% > cpu : usr=1.83%, sys=4.30%, ctx=104743, majf=0, minf=59 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=2538284/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=1 > >Run status group 0 (all jobs): > READ: io=9915.2MB, aggrb=991.23MB/s, minb=991.23MB/s, maxb=991.23MB/s, >mint=10003msec, maxt=10003msec > >Disk stats (read/write): > vdb: ios=98460/0, merge=0/0, ticks=48840/0, in_queue=49144, util=17.28% > > > > > >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
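To make the comparison apples-to-apples, the fio run can be constrained to the same effective queue depth as the dd command (a sketch reusing the same test file; with one job at iodepth=1 the result should land much closer to the dd number):

$ fio --name=singleread --filename=/mnt/testw.dbf --rw=read --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=1 --numjobs=1 --runtime=60 --time_based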
[ceph-users] Re: Danish ceph users
yes Sent from myMail for iOS Thursday, 29 August 2019, 15.52 +0200 from fr...@dtu.dk : >I would be in. > >= >Frank Schilder >AIT Risø Campus >Bygning 109, rum S14 > > >From: Torben Hørup < tor...@t-hoerup.dk > >Sent: 29 August 2019 14:03:13 >To: ceph-users@ceph.io >Subject: [ceph-users] Danish ceph users > >Hi > >A colleague and I are talking about making an event in Denmark for the >danish ceph community, and we would like to get a feeling of how many >ceph users are there in Denmark and hereof who would be interested in a >Danish ceph event ? > > >Regards, >Torben >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS blocked ops; kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]
> Hi, I encountered a problem with blocked MDS operations and a client
> becoming unresponsive. I dumped the MDS cache, ops, blocked ops and some
> further log information here:
>
> https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l
>
> A user of our HPC system was running a job that creates a somewhat
> stressful MDS load. This workload tends to lead to MDS warnings like "slow
> metadata ops" and "client does not respond to caps release", which usually
> disappear without intervention after a while.

We have an HPC cluster with 4K cores across 30+ (largish) servers - 128GB to 768GB compute nodes - and have experienced similar issues. This bug seems very related: https://tracker.ceph.com/issues/41467 (we haven't gotten a version with that patch yet). Upgrading to a 5.2 kernel with this commit: 3e1d0452edceebb903d23db53201013c940bf000 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e1d0452edceebb903d23db53201013c940bf000 was capable of deadlocking the kernel when memory pressure caused the MDS to reclaim capabilities - smells similar.

Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Building a petabyte cluster from scratch
> After years of using Ceph, we plan to soon build a new cluster bigger than what
> we've done in the past. As the project is still in reflection, I'd like to
> have your thoughts on our planned design: any feedback is welcome :)
>
> ## Requirements
>
> * ~1 PB usable space for file storage, extensible in the future
> * The files are mostly "hot" data, no cold storage
> * Purpose: storage for big files being essentially used on Windows
>   workstations (10G access)
> * Performance is better :)
>
> ## Global design
>
> * 8+3 Erasure Coded pool
> * ZFS on RBD, exposed via samba shares (cluster with failover)
>
> ## Hardware
>
> * 1 rack (multi-site would be better, of course...)
>
> * OSD nodes: 14 x supermicro servers
>   * 24 usable bays in 2U rackspace
>   * 16 x 10 TB nearline SAS HDD (8 bays for future needs)
>   * 2 x Xeon Silver 4212 (12C/24T)
>   * 128 GB RAM
>   * 4 x 40G QSFP+
>
> * Networking: 2 x Cisco N3K 3132Q or 3164Q
>   * 2 x 40G per server for ceph network (LACP/VPC for HA)
>   * 2 x 40G per server for public network (LACP/VPC for HA)
>   * QSFP+ DAC cables
>
> ## Sizing
>
> If we've done the maths well, we expect to have:
>
> * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
> * 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
> * ~1 PB of usable space if we want to keep the OSD use under 66% to allow
>   losing nodes without problem, extensible to 1.6 PB (same condition)
>
> ## Reflections
>
> * We're used to running mon and mgr daemons on a few of our OSD nodes, without
>   any issue so far: is this a bad idea for a big cluster?
>
> * We thought about using cache tiering on an SSD pool, but a large part of the PB is
>   used on a daily basis, so we expect the cache to be not so effective and
>   really expensive?
>
> * Could a 2x10G network be enough?

I would say yes, those slow disks will not deliver more anyway.

This is going to be a relatively "slow" setup with a limited amount of read caching - with 16 drives / 128GB memory it'll be a few GB per OSD for read caching - meaning that all reads and writes will hit the slow drives underneath. And that in a "double slow" fashion - where one write will hit 8+3 OSDs and wait for sync-ack back to the primary - same with reads that will hit 8+3 OSDs before returning to the client. Depending on the workload this may just work for you - but it is definitely not fast.

Suggestions for improvements:
* Hardware RAID with battery-backed write cache - will allow the OSD to ack writes before hitting spinning rust.
* More memory for OSD-level read caching.
* 3x replication instead of EC.
(we have all of the above in a "similar" setup: ~1PB, 10 OSD hosts)
* An SSD tiering pool (haven't been there, but would like to test it out).

-- Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Building a petabyte cluster from scratch
>> * Hardware RAID with battery-backed write cache - will allow the OSD to ack
>> writes before hitting spinning rust.
>
> Disagree. See my litany from a few months ago. Use a plain, IT-mode HBA.
> Take the $$ you save and put it toward building your cluster out of SSDs
> instead of HDDs. That way you don't have to mess with the management
> hassles of maintaining and allocating external WAL+DB partitions too.

These things are not really comparable - are they? The cost of SSD vs. HDD is still 6:1 in favor of HDDs. Yes, SSDs would be great, but not necessarily affordable - or have I missed something that makes the math work?

-- Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Building a petabyte cluster from scratch
> If k=8,m=3 is too slow on HDDs, so you need replica 3 and SSD DB/WAL,
> vs EC 8,3 on SSD, then that's (1/3) / (8/11) = 0.45 multiplier on the
> SSD space required vs HDDs.
> That brings it from 6x to 2.7x. Then you have the benefit of not
> needing separate SSDs for DB/WAL both in hardware cost and complexity.
> SSDs will still be more expensive; but perhaps justifiable given the
> performance, rebuild times, etc.
>
> If you only need cold-storage, then EC 8,3 on HDDs will be cheap. But
> is that fast enough?

OK, I understand. We have a "hot" fraction of our dataset - 10GB of cache on each of the 113 HDD OSDs gives ~1TB of effective read cache - and writes hit the battery-backed write cache. This can overspill, and when hitting "cold" data, performance varies. But the read/write amplification of EC is still unmanageable in practice on HDDs with an active dataset.

-- Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Performance of old vs new hw?
Hi. We have some oldish servers with SSDs - all on 25Gbit NICs. R815 AMD - 2.4GHz+. Are there significant performance benefits in moving to new NVMe-based servers with new CPUs? +20% IOPS? +50% IOPS?

Jesper
Sent from myMail for iOS
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph Performance of Micron 5210 SATA?
But is random/sequential read performance still good, even during saturated write performance? If so, the trade-off could fit quite a few applications.

Sent from myMail for iOS

Friday, 6 March 2020, 14.06 +0100 from vitalif :
>Hi,
>
>Current QLC drives are total shit in terms of steady-state performance.
>First 10-100 GB of data is written into the SLC cache which is fast, but
>then the drive switches to its QLC memory and even the linear write
>performance drops to ~90 MB/s which is actually worse than with HDDs!
>
>So, try to run a long linear write test and check the performance after
>writing a lot of data.
>
>> Last Monday I performed a quick test with those two disks already,
>> probably not that relevant, but posting it anyway:
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
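One way to answer that empirically is to run a saturating sequential writer and a small random reader against the drive at the same time with fio (a minimal sketch; /dev/sdX is a placeholder for the drive under test and the write job is destructive to its contents):

$ fio --filename=/dev/sdX --direct=1 --ioengine=libaio --runtime=1800 --time_based --group_reporting \
      --name=writer --rw=write --bs=1M --iodepth=16 \
      --name=reader --rw=randread --bs=4k --iodepth=16

Watching the reader's IOPS and latency well into the run - after the SLC cache has been exhausted - should show whether reads hold up while writes are saturated.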
[ceph-users] Re: New 3 node Ceph cluster
Hi. Unless there are plans for going to petabyte scale with it, I really don't see the benefit of getting CephFS involved over just an RBD image with a VM running standard Samba on top. More performant and less complexity to handle - zero gains (by my book).

Jesper

> Hi,
>
> I am planning to create a new 3 node ceph storage cluster.
>
> I will be using CephFS with Samba for max 10 clients for upload and
> download.
>
> Storage node HW is Intel Xeon E5v2 8-core single proc, 32GB RAM and 2 x 10Gb
> NIC, 24 x 6TB SATA HDD per node, with the OS on a separate SSD disk.
>
> Earlier I have tested orchestration using ceph-deploy in the test setup.
> Now, is there any other alternative to ceph-deploy?
>
> Can I restrict folder access for a user using CephFS + Samba's ceph vfs module, or should
> I use a ceph client mount + Samba?
>
> Ubuntu or CentOS?
>
> Any block size considerations for object size, metadata when using CephFS?
>
> Ideas or suggestions from existing users. I am also going to start to explore
> all the above.
>
> regards
> Amudhan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Recommendation for decent write latency performance from HDDs
Hi. We have a need for "bulk" storage - but with decent write latencies. Normally we would do this with a DAS with RAID5 and 2GB of battery-backed write cache in front - as cheap as possible but still getting the scalability features of Ceph. In our "first" Ceph cluster we did the same - just stuffed BBWC into the OSD nodes and we're fine - but now we're onto the next one, and systems like: https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm do not support a RAID controller like that - yet are branded as "Ceph Storage Solutions". It does however support 4 NVMe slots in the front. So some level of "tiering" using the NVMe drives seems to be what is "suggested" - but what do people do? What is recommended? I see multiple options:

Ceph tiering at the pool layer: https://docs.ceph.com/docs/master/rados/operations/cache-tiering/ - and rumors that it is deprecated: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
Pro: Abstract layer.
Con: Deprecated? Lots of warnings?

Offloading block.db onto NVMe / SSD: https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
Pro: Easy to deal with - seems heavily supported.
Con: As far as I can tell, this only benefits the metadata of the OSD, not the actual data. Thus a data commit to the OSD will still be dominated by the write latency of the underlying - very slow - HDD.

Bcache: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
Pro: Closest to the BBWC mentioned above - but with way, way larger cache sizes.
Con: It is hard to tell whether I'd end up being the only one on the planet using this solution.

Eat it - writes will be as slow as hitting dead rust, and anything that cannot live with that needs to be entirely on SSD/NVMe.

Other? Thanks for your input.

Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
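For the block.db option, the deployment side is straightforward with ceph-volume (a minimal sketch; device names are placeholders, and one NVMe is typically partitioned or sliced with LVM to serve several HDD OSDs):

$ ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

Note that BlueStore also defers small writes through the DB/WAL device (up to a configurable size), so this buys some write-latency relief for small I/O, not only for metadata - see the follow-up messages in this thread.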
[ceph-users] Re: Recommendation for decent write latency performance from HDDs
> On Sat, Apr 4, 2020 at 4:13 PM wrote:
>> Offloading block.db onto NVMe / SSD:
>> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>>
>> Pro: Easy to deal with - seems heavily supported.
>> Con: As far as I can tell, this only benefits the metadata of the
>> OSD, not the actual data. Thus a data commit to the OSD will still be
>> dominated
>> by the write latency of the underlying - very slow - HDD.
>
> small writes (<= 32kb, configurable) are written to db first and
> written back to the slow disk asynchronous to the original request.

Now, that sounds really interesting - I haven't been able to find that in the documentation - can you provide a pointer? What is the configuration parameter named? Meaning that moving block.db to, say, a 256GB NVMe will do "the right thing" for the system and deliver a fast write cache for smallish writes. Would setting the parameter to 1MB be "insane"?

Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
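I believe the knob being referred to is BlueStore's deferred-write threshold (hedged - verify against the option reference for your release). A quick way to inspect and change it:

$ ceph config get osd.0 bluestore_prefer_deferred_size_hdd      # commonly 32 KiB by default on HDD OSDs
$ ceph config set osd bluestore_prefer_deferred_size_hdd 65536  # example: defer writes up to 64 KiB via the DB/WAL device

Keep in mind that every deferred write is written twice - once to the NVMe WAL and again to the HDD - so pushing the threshold toward 1MB can make the DB device the bottleneck; treat large values as an experiment rather than a recommendation.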
[ceph-users] MDS_CACHE_OVERSIZED warning
Hi. I have a cluster that has been running for close to 2 years now - pretty much with the same settings - but over the past day I'm seeing this warning (and the cache seems to keep growing). Can I figure out which clients are accumulating the inodes? Ceph 12.2.8 - is it OK to just "bump" the memory to, say, 128GB - any negative side effects?

jk@ceph-mon1:~$ sudo ceph health detail
HEALTH_WARN 1 MDSs report oversized cache; 3 clients failing to respond to cache pressure
MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
    mdsceph-mds1(mds.0): MDS cache is too large (91GB/32GB); 34400070 inodes in use by clients, 3293 stray files

Thanks - Jesper
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
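The per-client caps counts on the MDS admin socket usually point at the culprits (a rough sketch; run on the host where ceph-mds1 runs, and the exact field names may differ slightly between releases):

$ ceph daemon mds.ceph-mds1 session ls | grep -E '"id"|num_caps|hostname'

On 12.2.x the cache limit itself is mds_cache_memory_limit; it can be raised at runtime before committing to more RAM, e.g.:

$ ceph tell mds.ceph-mds1 injectargs '--mds_cache_memory_limit=68719476736'   # 64GB, example value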
[ceph-users] pg_num != pgp_num - and unable to change.
Hi. Fresh cluster - after a dance where the autoscaler did not work (returned blank output) as described in the docs - I now seemingly have it working. It has bumped the target to something reasonable, and is slowly incrementing pg_num and pgp_num by 2 over time (I hope this is correct?). But:

jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application cephfs

pg_num = 150, pgp_num = 22, and setting pgp_num seemingly has zero effect on the system - not even with autoscaling set to off:

jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_autoscale_mode off
set pool 22 pg_autoscale_mode to off
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pgp_num 150
set pool 22 pgp_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_num_min 128
set pool 22 pg_num_min to 128
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_num 150
set pool 22 pg_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_autoscale_mode on
set pool 22 pg_autoscale_mode to on
jskr@dkcphhpcmgt028:/$ sudo ceph progress
PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
[]
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application cephfs

pgp_num != pg_num? In earlier versions of Ceph (without the autoscaler) I have only experienced that setting pg_num and pgp_num took immediate effect?

Jesper

jskr@dkcphhpcmgt028:/$ sudo ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
jskr@dkcphhpcmgt028:/$ sudo ceph health
HEALTH_OK
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
    id: 5c384430-da91-11ed-af9c-c780a5227aff
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 15h)
    mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
    mds: 2/2 daemons up, 1 standby
    osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs
  data:
    volumes: 2/2 healthy
    pools: 9 pools, 495 pgs
    objects: 24.85M objects, 60 TiB
    usage: 117 TiB used, 158 TiB / 276 TiB avail
    pgs: 13494029/145763897 objects misplaced (9.257%)
         462 active+clean
         23 active+remapped+backfilling
         10 active+remapped+backfill_wait
  io:
    client: 0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
    recovery: 705 MiB/s, 208 objects/s
  progress:

-- Jesper Krogh
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
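My understanding of the behaviour on recent releases (hedged): once pg_num_target/pgp_num_target are set, Ceph ramps the actual pg_num/pgp_num up gradually, pacing the splits so that no more than target_max_misplaced_ratio of objects are misplaced at any one time - which is why a manual "ceph osd pool set ... pgp_num" appears to be ignored. If it should move faster, the knob to look at is:

$ ceph config get mgr target_max_misplaced_ratio       # default 0.05, i.e. 5% misplaced at a time
$ ceph config set mgr target_max_misplaced_ratio 0.10  # example: allow up to 10% misplaced during splits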
[ceph-users] Cannot get backfill speed up
Hi. Fresh cluster - but despite setting: jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd osd_recovery_max_active_ssd 50 mon default[20] jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills osd_max_backfills100 mon default[10] I still get jskr@dkcphhpcmgt028:/$ sudo ceph status cluster: id: 5c384430-da91-11ed-af9c-c780a5227aff health: HEALTH_OK services: mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 16h) mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd mds: 2/2 daemons up, 1 standby osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs data: volumes: 2/2 healthy pools: 9 pools, 495 pgs objects: 24.85M objects, 60 TiB usage: 117 TiB used, 159 TiB / 276 TiB avail pgs: 10655690/145764002 objects misplaced (7.310%) 474 active+clean 15 active+remapped+backfilling 6 active+remapped+backfill_wait io: client: 0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr recovery: 328 MiB/s, 108 objects/s progress: Global Recovery Event (9h) [==..] (remaining: 25m) With these numbers for the setting - I would expect to get more than 15 active backfilling... (and based on SSD's and 2x25gbit network, I can also spend more resources on recovery than 328 MiB/s Thanks, . -- Jesper Krogh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
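One thing worth checking on a 17.2.x cluster (hedged - this assumes default settings): Quincy uses the mClock scheduler on the OSDs, which throttles recovery/backfill on its own and largely ignores osd_max_backfills / osd_recovery_max_active unless told otherwise. A minimal sketch:

$ ceph config show osd.0 osd_op_queue                               # 'mclock_scheduler' is the Quincy default
$ ceph config set osd osd_mclock_profile high_recovery_ops          # bias QoS toward recovery/backfill
$ ceph config set osd osd_mclock_override_recovery_settings true    # if present in your release, lets the manual limits apply again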
[ceph-users] Cephfs metadata and MDS on same node
Dear Ceph'ers, I am about to upgrade the MDS nodes for CephFS in the Ceph cluster (erasure code 8+3) I am administering. Since they will get plenty of memory and CPU cores, I was wondering if it would be a good idea to move the metadata OSDs (NVMes, currently on OSD nodes together with the cephfs_data OSDs (HDD)) to the MDS nodes? Configured as: 4 x MDS, each with a metadata OSD, and configured with 4x replication so each metadata OSD would have a complete copy of the metadata. I know the MDS stores a lot of metadata in RAM, but if the metadata OSDs were on the MDS nodes, would that not bring down latency? Anyway, I am just asking for your opinion on this - pros and cons, or even better, somebody who has actually tried this?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Recover data from Cephfs snapshot
Hi Ceph'ers, I love the possibility of making snapshots on CephFS. There is one thing that puzzles me, though. Creating a snapshot takes no time, and deleting snapshots can bring PGs into the snaptrim state for some hours. But recovering data from a snapshot always involves a full data transfer, where data is "physically" copied back into place. This can make recovering from snapshots on CephFS a rather heavy procedure. I have even tried the "mv" command, but that also transfers the real data instead of just moving metadata pointers. Am I missing some "ceph snapshot recover" command that can move metadata pointers and make recovery much lighter, or is this just the way it is?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
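For reference, to my knowledge there is no metadata-only "promote" of a CephFS snapshot: recovery is a copy out of the hidden .snap directory (a minimal sketch; paths and the snapshot name are placeholders):

$ ls /mnt/cephfs/mydir/.snap/                                  # snapshots show up as subdirectories
$ cp -a /mnt/cephfs/mydir/.snap/weekly_2022-08-01/lostfile .   # copies the data back, i.e. a full data transfer

which matches the behaviour described above - the copy really does move the data again.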
[ceph-users] replacing OSD nodes
5 377 46 322 24 306 53 200 240 338 #1.9TiB bytes available on most full OSD (306) ceph osd pg-upmap-items 20.6c5 334 371 30 340 70 266 241 407 3 233 186 356 40 312 294 391 #1.9TiB bytes available on most full OSD (233) ceph osd pg-upmap-items 20.6b4 344 338 226 389 319 362 309 411 85 379 248 233 121 318 0 254 #1.9TiB bytes available on most full OSD (233) ceph osd pg-upmap-items 20.6b1 325 292 35 371 347 153 146 390 12 343 88 327 27 355 54 250 192 408 #1.9TiB bytes available on most full OSD (153) ceph osd pg-upmap-items 20.57 82 389 282 356 103 165 62 284 67 408 252 366 #1.9TiB bytes available on most full OSD (165) ceph osd pg-upmap-items 20.50 244 355 319 228 154 397 63 317 113 378 97 276 288 150 #1.9TiB bytes available on most full OSD (228) ceph osd pg-upmap-items 20.47 343 351 107 283 81 332 76 398 160 410 26 378 #1.9TiB bytes available on most full OSD (283) ceph osd pg-upmap-items 20.3e 56 322 31 283 330 377 107 360 199 309 190 385 78 406 #1.9TiB bytes available on most full OSD (283) ceph osd pg-upmap-items 20.3b 91 349 312 414 268 386 45 244 125 371 #1.9TiB bytes available on most full OSD (244) ceph osd pg-upmap-items 20.3a 277 371 290 359 91 415 165 392 107 167 #1.9TiB bytes available on most full OSD (167) ceph osd pg-upmap-items 20.39 74 175 18 302 240 393 3 269 224 374 194 408 173 364 #1.9TiB bytes available on most full OSD (302) ... ... If I were to set this into effect, I would first set norecover and nobackfill, then run the script and unset norecover and nobackfill again. But I am uncertain if it would work? Or even if this is a good idea? It would be nice if Ceph did something similar automatically 🙂 Or maybe Ceph already does something similar, and I have just not been able to find it? If Ceph were to do this, it could be nice if the priority of backfill_wait PGs was rerun, perharps every 24 hours, as OSD availability landscape of course changes during backfill. I imagine this, especially, could stabilize recovery/rebalance on systems where space is a little tight. Best regards, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: replacing OSD nodes
Thanks for you answer Janne. Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once they get too close for comfort. But I just though a continuous prioritization of rebalancing PGs, could make this process more smooth, with less/no need for handheld operations. Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Janne Johansson Sendt: 20. juli 2022 10:47 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] replacing OSD nodes Den tis 19 juli 2022 kl 13:09 skrev Jesper Lykkegaard Karlsen : > > Hi all, > Setup: Octopus - erasure 8-3 > I had gotten to the point where I had some rather old OSD nodes, that I > wanted to replace with new ones. > The procedure was planned like this: > > * add new replacement OSD nodes > * set all OSDs on the retiring nodes to out. > * wait for everything to rebalance > * remove retiring nodes > After around 50% misplaced objects remaining, the OSDs started to complain > about backfillfull OSDs and nearfull OSDs. > A bit of a surprise to me, as RAW size is only 47% used. > It seems that rebalancing does not happen in a prioritized manner, where > planed backfill starts with the OSD with most space available space, but > "alphabetically" according to pg-name. > Is this really true? I don't know if it does it in any particular order, just that it certainly doesn't fire off requests to the least filled OSD to receive data first, so when I have gotten into similar situations, it just tried to run as many moves as possible given max_backfill and all that, then some/most might get stuck in toofull, but as the rest of the slots progress, space gets available and at some point those toofull ones get handled. It delays the completion but hasn't caused me any other specific problems. Though I will admit I have used "ceph osd reweight osd.123 " at times to force emptying of some OSDs, but that was more my impatience than anything else. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: replacing OSD nodes
It seems like a low hanging fruit to fix? There must be a reason why the developers have not made a prioritized order of backfilling PGs. Or maybe the prioritization is something else than available space? The answer remains unanswered, as well as if my suggested approach/script would work or not? Summer vacation? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Janne Johansson Sendt: 20. juli 2022 19:39 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] replacing OSD nodes Den ons 20 juli 2022 kl 11:22 skrev Jesper Lykkegaard Karlsen : > Thanks for you answer Janne. > Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once they > get too close for comfort. > > But I just though a continuous prioritization of rebalancing PGs, could make > this process more smooth, with less/no need for handheld operations. You are absolutely right there, just wanted to chip in with my experiences of "it nags at me but it will still work out" so other people finding these mails later on can feel a bit relieved at knowing that a few toofull warnings aren't a major disaster and that it sometimes happens, because ceph looks for all possible moves, even those who will run late in the rebalancing. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: PG does not become active
Hi Frank, I think you need at least 6 OSD hosts to make EC 4+2 with faillure domain host. I do not know how it was possible for you to create that configuration at first? Could it be that you have multiple name for the OSD hosts? That would at least explain the one OSD down, being show as two OSDs down. Also, I believe that min_size should never be smaller than “coding” shards, which is 4 in this case. You can either make a new test setup with your three test OSD hosts using EC 2+1 or make e.g. 4+2, but with failure domain set to OSD. Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 27 Jul 2022, at 17.32, Frank Schilder wrote: > > Update: the inactive PG got recovered and active after a lnngg wait. The > middle question is now answered. However, these two questions are still of > great worry: > > - How can 2 OSDs be missing if only 1 OSD is down? > - If the PG should recover, why is it not prioritised considering its severe > degradation > compared with all other PGs? > > I don't understand how a PG can loose 2 shards if 1 OSD goes down. That looks > really really bad to me (did ceph loose track of data??). > > The second is of no less importance. The inactive PG was holding back client > IO, leading to further warnings about slow OPS/requests/... Why are such > critically degraded PGs not scheduled for recovery first? There is a service > outage but only a health warning? > > Thanks and best regards. > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 27 July 2022 17:19:05 > To: ceph-users@ceph.io > Subject: [ceph-users] PG does not become active > > I'm testing octopus 15.2.16 and run into a problem right away. I'm filling up > a small test cluster with 3 hosts 3x3 OSDs and killed one OSD to see how > recovery works. I have one 4+2 EC pool with failure domain host and on 1 PGs > of this pool 2 (!!!) shards are missing. This most degraded PG is not > becoming active, its stuck inactive but peered. > > Questions: > > - How can 2 OSDs be missing if only 1 OSD is down? > - Wasn't there an important code change to allow recovery for an EC PG with at > least k shards present even if min_size>k? Do I have to set something? > - If the PG should recover, why is it not prioritised considering its severe > degradation > compared with all other PGs? > > I have already increased these crush tunables and executed a pg repeer to no > avail: > > tunable choose_total_tries 250 <-- default 100 > rule fs-data { >id 1 >type erasure >min_size 3 >max_size 6 >step set_chooseleaf_tries 50 <-- default 5 >step set_choose_tries 200 <-- default 100 >step take default >step choose indep 0 type osd >step emit > } > > Ceph health detail says to that: > > [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive >pg 4.32 is stuck inactive for 37m, current state > recovery_wait+undersized+degraded+remapped+peered, last acting > [1,2147483647,2147483647,4,5,2] > > I don't want to cheat and set min_size=k on this pool. It should work by > itself. > > Thanks for any pointers! 
> = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: PG does not become active
Ah I see, should have look at the “raw” data instead ;-) Then I agree this very weird? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 28 Jul 2022, at 12.45, Frank Schilder wrote: > > Hi Jesper, > > thanks for looking at this. The failure domain is OSD and not host. I typed > it wrong in the text, the copy of the crush rule shows it right: step choose > indep 0 type osd. > > I'm trying to reproduce the observation to file a tracker item, but it is > more difficult than expected. It might be a race condition, so far I didn't > see it again. I hope I can figure out when and why this is happening. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Jesper Lykkegaard Karlsen > Sent: 28 July 2022 12:02:51 > To: Frank Schilder > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] PG does not become active > > Hi Frank, > > I think you need at least 6 OSD hosts to make EC 4+2 with faillure domain > host. > > I do not know how it was possible for you to create that configuration at > first? > Could it be that you have multiple name for the OSD hosts? > That would at least explain the one OSD down, being show as two OSDs down. > > Also, I believe that min_size should never be smaller than “coding” shards, > which is 4 in this case. > > You can either make a new test setup with your three test OSD hosts using EC > 2+1 or make e.g. 4+2, but with failure domain set to OSD. > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Universitetsbyen 81 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > >> On 27 Jul 2022, at 17.32, Frank Schilder wrote: >> >> Update: the inactive PG got recovered and active after a lnngg wait. The >> middle question is now answered. However, these two questions are still of >> great worry: >> >> - How can 2 OSDs be missing if only 1 OSD is down? >> - If the PG should recover, why is it not prioritised considering its severe >> degradation >> compared with all other PGs? >> >> I don't understand how a PG can loose 2 shards if 1 OSD goes down. That >> looks really really bad to me (did ceph loose track of data??). >> >> The second is of no less importance. The inactive PG was holding back client >> IO, leading to further warnings about slow OPS/requests/... Why are such >> critically degraded PGs not scheduled for recovery first? There is a service >> outage but only a health warning? >> >> Thanks and best regards. >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Frank Schilder >> Sent: 27 July 2022 17:19:05 >> To: ceph-users@ceph.io >> Subject: [ceph-users] PG does not become active >> >> I'm testing octopus 15.2.16 and run into a problem right away. I'm filling >> up a small test cluster with 3 hosts 3x3 OSDs and killed one OSD to see how >> recovery works. I have one 4+2 EC pool with failure domain host and on 1 PGs >> of this pool 2 (!!!) shards are missing. This most degraded PG is not >> becoming active, its stuck inactive but peered. >> >> Questions: >> >> - How can 2 OSDs be missing if only 1 OSD is down? >> - Wasn't there an important code change to allow recovery for an EC PG with >> at >> least k shards present even if min_size>k? Do I have to set something? 
>> - If the PG should recover, why is it not prioritised considering its severe >> degradation >> compared with all other PGs? >> >> I have already increased these crush tunables and executed a pg repeer to no >> avail: >> >> tunable choose_total_tries 250 <-- default 100 >> rule fs-data { >> id 1 >> type erasure >> min_size 3 >> max_size 6 >> step set_chooseleaf_tries 50 <-- default 5 >> step set_choose_tries 200 <-- default 100 >> step take default >> step choose indep 0 type osd >> step emit >> } >> >> Ceph health detail says to that:
[ceph-users] Re: cannot set quota on ceph fs root
Hi Frank, I guess there is alway the possibility to set quota on pool level with "target_max_objects" and “target_max_bytes” The cephfs quotas through attributes are only for sub-directories as far as I recall. Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 28 Jul 2022, at 17.22, Frank Schilder wrote: > > Hi Gregory, > > thanks for your reply. It should be possible to set a quota on the root, > other vattribs can be set as well despite it being a mount point. There must > be something on the ceph side (or another bug in the kclient) preventing it. > > By the way, I can't seem to find cephfs-tools like cephfs-shell. I'm using > the image quay.io/ceph/ceph:v15.2.16 and its not installed in the image. A > "yum provides cephfs-shell" returns no candidate and I can't find > installation instructions. Could you help me out here? > > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Gregory Farnum > Sent: 28 July 2022 16:59:50 > To: Frank Schilder > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] cannot set quota on ceph fs root > > On Thu, Jul 28, 2022 at 1:01 AM Frank Schilder wrote: >> >> Hi all, >> >> I'm trying to set a quota on the ceph fs file system root, but it fails with >> "setfattr: /mnt/adm/cephfs: Invalid argument". I can set quotas on any >> sub-directory. Is this intentional? The documentation >> (https://docs.ceph.com/en/octopus/cephfs/quota/#quotas) says >> >>> CephFS allows quotas to be set on any directory in the system. >> >> Any includes the fs root. Is the documentation incorrect or is this a bug? > > I'm not immediately seeing why we can't set quota on the root, but the > root inode is special in a lot of ways so this doesn't surprise me. > I'd probably regard it as a docs bug. > > That said, there's also a good chance that the setfattr is getting > intercepted before Ceph ever sees it, since by setting it on the root > you're necessarily interacting with a mount point in Linux and those > can also be finicky...You could see if it works by using cephfs-shell. > -Greg > > >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
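For completeness, a minimal sketch of both quota mechanisms (pool and path names are placeholders; note that target_max_bytes/target_max_objects are cache-tiering knobs, whereas plain pool quotas are set with set-quota):

$ ceph osd pool set-quota cephfs_data max_bytes 1099511627776        # 1 TiB quota on the pool
$ setfattr -n ceph.quota.max_bytes -v 1099511627776 /mnt/cephfs/dir  # directory quota on any (sub)directory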
[ceph-users] Re: replacing OSD nodes
Thank you for your suggestions Josh, it is really appreciated. Pgremapper looks interesting and definitely something I will look into. I know the balancer will reach a well balanced PG landscape eventually, but I am not sure that it will prioritise backfills by "most available location" first. Then I might end up in the same situation, where some of the old (but not retired) OSDs start getting full. Then there is the "undo-upmaps" script left, or maybe even the script that I proposed in combination with "cancel-backfill", as it just moves what Ceph was planning to move anyway, just in a prioritised manner.

Have you tried pgremapper yourself, Josh? Is it safe to use? And do the Ceph developers vouch for this method?

Status now: ~1,600,000,000 objects have been moved, which is about half of all the planned backfills. I have been reweighting OSDs down as they get too close to maximum usage, which works to some extent. The monitors, on the other hand, are now complaining about using a lot of disk space, due to the long-running backfill. There is still plenty of disk space on the mons, but I feel that the backfill is getting slower and slower, although the same number of PGs are still backfilling. Can large disk usage on the mons slow down backfill and other operations? Is it dangerous?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 28 Jul 2022, at 22.26, Josh Baergen wrote:
>
> I don't have many comments on your proposed approach, but just wanted
> to note that how I would have approached this, assuming that you have
> the same number of old hosts, would be to:
> 1. Swap-bucket the hosts.
> 2. Downweight the OSDs on the old hosts to 0.001. (Marking them out
> (i.e. weight 0) prevents maps from being applied.)
> 3. Add the old hosts back to the CRUSH map in their old racks or whatever.
> 4. Use https://github.com/digitalocean/pgremapper#cancel-backfill.
> 5. Then run https://github.com/digitalocean/pgremapper#undo-upmaps in
> a loop to drain the old OSDs.
>
> This gives you the maximum concurrency and efficiency of movement, but
> doesn't necessarily solve your balance issue if it's the new OSDs that
> are getting full (that wasn't clear to me). It's still possible to
> apply steps 2, 4, and 5 if the new hosts are in place. If you're not
> in a rush could actually use the balancer instead of undo-upmaps in
> step 5 to perform the rest of the data migration and then you wouldn't
> have full OSDs.
>
> Josh
>
> On Fri, Jul 22, 2022 at 1:57 AM Jesper Lykkegaard Karlsen
> wrote:
>>
>> It seems like a low hanging fruit to fix?
>> There must be a reason why the developers have not made a prioritized order
>> of backfilling PGs.
>> Or maybe the prioritization is something else than available space?
>>
>> The answer remains unanswered, as well as if my suggested approach/script
>> would work or not?
>>
>> Summer vacation?
>>
>> Best,
>> Jesper
>>
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>>
>> E-mail: je...@mbg.au.dk
>> Tlf: +45 50906203
>>
>>
>> Fra: Janne Johansson
>> Sendt: 20.
juli 2022 19:39 >> Til: Jesper Lykkegaard Karlsen >> Cc: ceph-users@ceph.io >> Emne: Re: [ceph-users] replacing OSD nodes >> >> Den ons 20 juli 2022 kl 11:22 skrev Jesper Lykkegaard Karlsen >> : >>> Thanks for you answer Janne. >>> Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once >>> they get too close for comfort. >>> >>> But I just though a continuous prioritization of rebalancing PGs, could >>> make this process more smooth, with less/no need for handheld operations. >> >> You are absolutely right there, just wanted to chip in with my >> experiences of "it nags at me but it will still work out" so other >> people finding these mails later on can feel a bit relieved at knowing >> that a few toofull warnings aren't a major disaster and that it >> sometimes happens, because ceph looks for all possible moves, even >> those who will run late in the rebalancing. >> >> -- >> May the most significant bit of your life be positive. >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: replacing OSD nodes
Cool thanks a lot! I will definitely put it in my toolbox. Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 29 Jul 2022, at 00.35, Josh Baergen wrote: > >> I know the balancer will reach a well balanced PG landscape eventually, but >> I am not sure that it will prioritise backfill after “most available >> location” first. > > Correct, I don't believe it prioritizes in this way. > >> Have you tried the pgremapper youself Josh? > > My team wrote and maintains pgremapper and we've used it extensively, > but I'd always recommend trying it in test environments first. Its > effect on the system isn't much different than what you're proposing > (it simply manipulates the upmap exception table). > > Josh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Potential bug in cephfs-data-scan?
Hi, I have recently been scanning the files in a PG with "cephfs-data-scan pg_files ...". After a long time the scan was still running and the list of files had consumed 44 GB, so I stopped it, as something was obviously very wrong. It turns out some users had symlinks that looped, and one user even had a symlink to "/". It does not make sense that cephfs-data-scan follows symlinks, as this gives a wrong picture of which files are in the target PG. I have looked through Ceph's bug reports, but I do not see anyone mentioning this. Although I am still on the recently deprecated Octopus, I suspect this bug is also present in Pacific and Quincy? It might be related to this bug: https://tracker.ceph.com/issues/46166 - but the symptoms are different. Or maybe there is a way to disable the following of symlinks in "cephfs-data-scan pg_files ..."?

Best,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
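For context, the invocation in question looks roughly like this (the path and PG id are placeholders for the real mount point and the affected PG):

$ cephfs-data-scan pg_files /mnt/cephfs/some/subtree 20.6c5 > files_in_pg.txt

i.e. it walks the given directory tree and prints the paths whose objects fall in the listed PG(s) - which is why a symlink pointing back up the tree (or to "/") can send the walk into a loop if links are followed.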
[ceph-users] Re: Potential bug in cephfs-data-scan?
Fra: Patrick Donnelly Sendt: 19. august 2022 16:16 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Potential bug in cephfs-data-scan? On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen wrote: >> > >Hi, >> >> I have recently been scanning the files in a PG with "cephfs-data-scan >> pg_files ...". >Why? I had an incident where a PG went down+incomplete after some OSD crashes + heavy load + ongoing snap trimming. I got it back up again with the objectstore tool by marking it complete. Then I wanted to list the possibly affected files in the unfortunate PG with cephfs-data-scan, so I could recover potential losses from backup. >> Although, after a long time the scan was still running and the list of files >> consumed 44 GB, I stopped it, as something obviously was very wrong. >> >> It turns out some users had symlinks that looped and even a user had a >> symlink to "/". >Symlinks are not stored in the data pool. This should be irrelevant. Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan pg_files" trying to follow any global or local symlink in the file structure, which leads to many more files being registered than could possibly be in that PG, and even endless loops in some cases. If the symlinks are not stored in the data pool, how can cephfs-data-scan then follow the link? And how do I get "cephfs-data-scan" to just show the symlinks as links and not follow them up or down the directory structure? Best, Jesper ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Potential bug in cephfs-data-scan?
Actually, it might have worked better if the PG had stayed down while running cephfs-data-scan, as it could only then get file structure from metadata pool and not touch each file/link in data pool? This would at least properly have given the list of files in (only) the affected PG? //Jesper Fra: Jesper Lykkegaard Karlsen Sendt: 19. august 2022 22:49 Til: Patrick Donnelly Cc: ceph-users@ceph.io Emne: [ceph-users] Re: Potential bug in cephfs-data-scan? Fra: Patrick Donnelly Sendt: 19. august 2022 16:16 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Potential bug in cephfs-data-scan? On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen wrote: >> > >Hi, >> >> I have recently been scanning the files in a PG with "cephfs-data-scan >> pg_files ...". >Why? I had an incident where a PG that went down+incomplete after some OSD crashed + heavy load + ongoing snap trimming. Got it back up again with object store tool by marking complete. Then I wanted to show possible affected files with cephfs-data-scan in the unfortunate PG, so I could recover potential loss from backup. >> Although, after a long time the scan was still running and the list of files >> consumed 44 GB, I stopped it, as something obviously was very wrong. >> >> It turns out some users had symlinks that looped and even a user had a >> symlink to "/". >Symlinks are not stored in the data pool. This should be irrelevant. Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan pg_files" trying to follow any global or local symlink in the file structure, which leads to many more files registrered than possibly could be in that PG and even endless loops in some cases. If the symlinks are not stored in data pool, how can cephfs-data-scan then follow the link? And how do I get "cephfs-data-scan" to just show the symlinks as links and not follow them up or down in directory structure? Best, Jesper ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Remove corrupt PG
Hi all, I wanted to move a PG to an empty OSD, so I could do repairs on it without the whole OSD, which is full of other PGs, being affected by extensive downtime. Thus, I exported the PG with ceph-objectstore-tool, and after a successful export I removed it. Unfortunately, the remove command was interrupted midway. This resulted in a PG that could not be removed with “ceph-objectstore-tool --op remove ….”, since the header is gone. Worse, the OSD does not boot, because it can still see objects from the removed PG but cannot access them. I have tried to remove the individual objects in that PG (also with objectstore-tool), but this process is extremely slow. When looping over the >65,000 objects, each remove takes ~10 sec and is very compute intensive, which adds up to approximately 7.5 days. Is there a faster way to get around this? Mvh. Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
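For reference, the per-object loop described above looks roughly like this (OSD path and PG id are placeholders; the OSD must be stopped, and some releases may additionally require --force on the object remove). Each iteration starts ceph-objectstore-tool from scratch against the whole store, which is a large part of why it is so slow:

OSD=/var/lib/ceph/osd/ceph-42    # placeholder data path
PG=20.13f                        # placeholder PG id
ceph-objectstore-tool --data-path "$OSD" --pgid "$PG" --op list |
while IFS= read -r obj; do
    # "$obj" is the JSON object descriptor printed by --op list
    ceph-objectstore-tool --data-path "$OSD" "$obj" remove
done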
[ceph-users] Re: Remove corrupt PG
To answer my own question. The removal of the corrupt PG, could be fixed by doing ceph-objectstore-tool fuse mount-thingy. Then from the mount point, delete everything in the PGs head directory. This took only a few seconds (compared to 7.5 days) and after unmount and restart of the OSD it came back online. Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen wrote: > > Hi all, > > I wanted to move a PG to an empty OSD, so I could do repairs on it without > the whole OSD, which is full of other PG’s, would be effected with extensive > downtime. > > Thus, I exported the PG with ceph-objectstore-tool, an after successful > export I removed it. Unfortunately, the remove command was interrupted > midway. > This resulted in a PG that could not be remove with “ceph-objectstore-tool > —op remove ….”, since the header is gone. > Worse is that the OSD does not boot, due to it can see objects from the > removed PG, but cannot access them. > > I have tried to remove the individual objects in that PG (also with > objectstore-tool), but this process is extremely slow. > When looping over the >65,000 object, each remove takes ~10 sec and is very > compute intensive, which is approximately 7.5 days. > > Is the a faster way to get around this? > > Mvh. Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Universitetsbyen 81 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
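A sketch of the fuse-mount approach, with placeholder paths and PG id; the directory names under the mountpoint are only illustrative (check what actually appears there), and the OSD must stay stopped while the store is mounted:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 --op fuse --mountpoint /mnt/osd42
ls /mnt/osd42                      # PGs show up as directories named after their pgid/head
rm -rf /mnt/osd42/20.13fs0_head/*  # wipe the broken PG's contents (example name only)
fusermount -u /mnt/osd42           # unmount, then start the OSD again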
[ceph-users] Re: Remove corrupt PG
Well, not the total solution after all. There is still some metadata and header structure left that I cannot delete with ceph-objectstore-tool --op remove. It just produces a core dump. I think I need to declare the OSD lost anyway to get through this. Unless somebody has a better suggestion? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 1 Sep 2022, at 22.01, Jesper Lykkegaard Karlsen wrote: > > To answer my own question. > > The removal of the corrupt PG, could be fixed by doing ceph-objectstore-tool > fuse mount-thingy. > Then from the mount point, delete everything in the PGs head directory. > > This took only a few seconds (compared to 7.5 days) and after unmount and > restart of the OSD it came back online. > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Universitetsbyen 81 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > >> On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen wrote: >> >> Hi all, >> >> I wanted to move a PG to an empty OSD, so I could do repairs on it without >> the whole OSD, which is full of other PG’s, would be effected with extensive >> downtime. >> >> Thus, I exported the PG with ceph-objectstore-tool, an after successful >> export I removed it. Unfortunately, the remove command was interrupted >> midway. >> This resulted in a PG that could not be remove with “ceph-objectstore-tool >> —op remove ….”, since the header is gone. >> Worse is that the OSD does not boot, due to it can see objects from the >> removed PG, but cannot access them. >> >> I have tried to remove the individual objects in that PG (also with >> objectstore-tool), but this process is extremely slow. >> When looping over the >65,000 object, each remove takes ~10 sec and is very >> compute intensive, which is approximately 7.5 days. >> >> Is the a faster way to get around this? >> >> Mvh. Jesper >> >> -- >> Jesper Lykkegaard Karlsen >> Scientific Computing >> Centre for Structural Biology >> Department of Molecular Biology and Genetics >> Aarhus University >> Universitetsbyen 81 >> 8000 Aarhus C >> >> E-mail: je...@mbg.au.dk >> Tlf:+45 50906203 >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] force-create-pg not working
Dear all, System: latest Octopus, 8+3 erasure-coded CephFS I have a PG that has been driving me crazy. It had gotten into a bad state after heavy backfilling, combined with OSDs going down in turn. State is: active+recovery_unfound+undersized+degraded+remapped I have tried repairing it with ceph-objectstore-tool, but no luck so far. Given the time recovery takes this way, and since the data are under backup, I thought that I would do the "easy" approach instead and: * scan pg_files with cephfs-data-scan * delete data belonging to that pool * recreate the PG with "ceph osd force-create-pg" * restore the data However, this has turned out not to be so easy after all. ceph osd force-create-pg 20.13f --yes-i-really-mean-it seems to be accepted well enough with "pg 20.13f now creating, ok", but then nothing happens. Issuing the command again just gives a "pg 20.13f already creating" response. If I restart the primary OSD, then the pending force-create-pg disappears. I read that this could be due to a CRUSH map issue, but I have checked and that does not seem to be the case. Would it, for instance, be possible to do the force-create-pg manually with something like this?: * set nobackfill and norecovery * delete the PG's shards one by one * unset nobackfill and norecovery Any idea on how to proceed from here is most welcome. Thanks, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
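If the manual route is attempted, the first step is finding where the shards actually live. A couple of read-only commands for that (the PG id is an example; on an EC pool the per-OSD shards carry a suffix such as 20.13fs0, 20.13fs1, ...):

ceph pg map 20.13f            # shows the up and acting OSD sets for the PG
ceph pg 20.13f query | less   # peering info, including which shards each OSD reports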
[ceph-users] Re: force-create-pg not working
Hi Josh, Thanks for your reply. But I already tried that, with no luck. The primary OSD goes down and hangs forever upon the "mark_unfound_lost delete” command. I guess it is too damaged to salvage, unless one really starts deleting individual corrupt objects? Anyway, as I said, the files in the PG are identified and under backup, so I just want it healthy, no matter what ;-) I actually discovered that removing the PG's shards with objectstore-tool indeed works in getting the PG back to active+clean (containing 0 objects, though). One just needs to run a final remove - start/stop OSD - repair - mark-complete on the primary OSD. A scrub tells me that the "active+clean” state is for real. I also found out that the more automated "force-create-pg" command only works on PGs that are in a down state. Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 > On 20 Sep 2022, at 15.40, Josh Baergen wrote: > > Hi Jesper, > > Given that the PG is marked recovery_unfound, I think you need to > follow > https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects. > > Josh > > On Tue, Sep 20, 2022 at 12:56 AM Jesper Lykkegaard Karlsen > wrote: >> >> Dear all, >> >> System: latest Octopus, 8+3 erasure Cephfs >> >> I have a PG that has been driving me crazy. >> It had gotten to a bad state after heavy backfilling, combined with OSD >> going down in turn. >> >> State is: >> >> active+recovery_unfound+undersized+degraded+remapped >> >> I have tried repairing it with ceph-objectstore-tool, but no luck so far. >> Given the time recovery takes this way and since data are under backup, I >> thought that I would do the "easy" approach instead and: >> >> * scan pg_files with cephfs-data-scan >> * delete data beloging to that pool >> * recreate PG with "ceph osd force-create-pg" >> * restore data >> >> Although, this has shown not to be so easy after all. >> >> ceph osd force-create-pg 20.13f --yes-i-really-mean-it >> >> seems to be accepted well enough with "pg 20.13f now creating, ok", but then >> nothing happens. >> Issuing the command again just gives a "pg 20.13f already creating" response. >> >> If I restart the primary OSD, then the pending force-create-pg disappears. >> >> I read that this could be due to crush map issue, but I have checked and >> that does not seem to be the case. >> >> Would it, for instance, be possible to do the force-create-pg manually with >> something like this?: >> >> * set nobackfill and norecovery >> * delete the pgs shards one by one >> * unset nobackfill and norecovery >> >> >> Any idea on how to proceed from here is most welcome. >> >> Thanks, >> Jesper >> >> >> -- >> Jesper Lykkegaard Karlsen >> Scientific Computing >> Centre for Structural Biology >> Department of Molecular Biology and Genetics >> Aarhus University >> Universitetsbyen 81 >> 8000 Aarhus C >> >> E-mail: je...@mbg.au.dk >> Tlf:+45 50906203 >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
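A hedged sketch of that remove / repair / mark-complete sequence on the primary OSD, with placeholder ids and paths; on an EC pool the --pgid may need the shard suffix (e.g. 20.13fs0), and the flags should be double-checked against your release before running anything:

OSD_ID=42                          # placeholder primary OSD
OSD=/var/lib/ceph/osd/ceph-$OSD_ID
PG=20.13f                          # placeholder PG id

systemctl stop ceph-osd@$OSD_ID
ceph-objectstore-tool --data-path "$OSD" --pgid "$PG" --op remove --force
ceph-objectstore-tool --data-path "$OSD" --op repair
ceph-objectstore-tool --data-path "$OSD" --pgid "$PG" --op mark-complete
systemctl start ceph-osd@$OSD_ID
ceph pg deep-scrub "$PG"           # confirm the resulting active+clean state is real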
[ceph-users] cephfs quota used
Hi all, CephFS quotas work really well for me. A cool feature is that if one mounts a folder which has quotas enabled, then the mountpoint will show up as a partition of the quota size and how much of it is used (e.g. with the df command), nice! Now, I want to access the usage information of folders with quotas from the root level of the cephfs. I have failed to find this information through getfattr commands, as only the quota limits are shown there, and running du on individual folders is a suboptimal solution. The usage information must be somewhere in the ceph metadata/mon db, but where, and how do I read it? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them in forehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all atributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Sebastian Knust Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen ; ceph-users@ceph.io Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi > printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt > --to=iec $QUOTA` $PERC > fi > done Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse clients defaults to it, for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 eighter. Weird? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used >From what I understand you used to be able to do that but cannot on later >kernels? Seems there would be a list somewhere, but I can’t find it, maybe it’s changing too often depending on the kernel your using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph, the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them in forehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all atributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk> Tlf:+45 50906203 Fra: Sebastian Knust mailto:skn...@physik.uni-bielefeld.de>> Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi > printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt > --to=iec $QUOTA` $PERC > fi > done Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse clients defaults to it, for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. 
Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Woops, wrong copy/pasta: getfattr -n ceph.dir.rbytes $DIR works on all distributions I have tested. It is: getfattr -d -m 'ceph.*' $DIR that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS 7. Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 13:57 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 eighter. Weird? Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used >From what I understand you used to be able to do that but cannot on later >kernels? Seems there would be a list somewhere, but I can’t find it, maybe it’s changing too often depending on the kernel your using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph, the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them in forehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all atributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk> Tlf:+45 50906203 Fra: Sebastian Knust mailto:skn...@physik.uni-bielefeld.de>> Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi > printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt > --to=iec $QUOTA` $PERC > fi > done Note that you can also mount CephFS with the "rbytes" mount option. 
IIRC the fuse clients defaults to it, for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
To answer my own question. It seems Frank Schilder asked a similar question two years ago: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ listxattr() was aparrently removed and not much have happen since then it seems. Anyway, I just made my own ceph-fs version of "du". ceph_du_dir: #!/bin/bash # usage: ceph_du_dir $DIR SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) numfmt --to=iec-i --suffix=B --padding=7 $SIZE Prints out ceph-fs dir size in "human-readble" It works like a charm and my god it is fast!. Tools like that could be very useful, if provided by the development team 🙂 Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 14:37 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Woops, wrong copy/pasta: getfattr -n ceph.dir.rbytes $DIR works on all distributions I have tested. It is: getfattr -d -m 'ceph.*' $DIR that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS 7. Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 13:57 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 eighter. Weird? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ____ Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used From what I understand you used to be able to do that but cannot on later kernels? Seems there would be a list somewhere, but I can’t find it, maybe it’s changing too often depending on the kernel your using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph, the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them in forehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all atributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk> Tlf:+45 50906203 Fra: Sebastian Knust mailto:skn...@physik.uni-bielefeld.de>> Sendt: 16. 
december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi >
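Since listing the ceph.* vxattrs no longer works on newer kernels, one workaround is to probe the commonly documented names one by one; the name list below is from memory and may differ between releases:

#!/bin/bash
# usage: ceph_attrs $DIR  (prints whichever ceph.* vxattrs respond on that directory)
DIR=$1
for attr in ceph.dir.entries ceph.dir.files ceph.dir.subdirs \
            ceph.dir.rentries ceph.dir.rfiles ceph.dir.rsubdirs \
            ceph.dir.rbytes ceph.dir.rctime \
            ceph.quota.max_bytes ceph.quota.max_files; do
    val=$(getfattr --only-values -n "$attr" "$DIR" 2>/dev/null) && \
        printf '%-24s %s\n' "$attr" "$val"
done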
[ceph-users] Re: cephfs quota used
Brilliant, thanks Jean-François Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jean-Francois GUILLAUME Sendt: 16. december 2021 23:03 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used Hi, You can avoid using awk by passing --only-values to getfattr. This should look something like this : > #!/bin/bash > numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n > ceph.dir.rbytes $1 2>/dev/null) Best, --- Cordialement, Jean-François GUILLAUME Plateforme Bioinformatique BiRD Tél. : +33 (0)2 28 08 00 57 www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr> Inserm UMR 1087/CNRS UMR 6291 IRS-UN - 8 quai Moncousu - BP 70721 44007 Nantes Cedex 1 Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit : > To answer my own question. > It seems Frank Schilder asked a similar question two years ago: > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ > > listxattr() was aparrently removed and not much have happen since then > it seems. > > Anyway, I just made my own ceph-fs version of "du". > > ceph_du_dir: > > #!/bin/bash > # usage: ceph_du_dir $DIR > SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep > "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) > numfmt --to=iec-i --suffix=B --padding=7 $SIZE > > Prints out ceph-fs dir size in "human-readble" > It works like a charm and my god it is fast!. > > Tools like that could be very useful, if provided by the development > team 🙂 > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 14:37 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Woops, wrong copy/pasta: > > getfattr -n ceph.dir.rbytes $DIR > > works on all distributions I have tested. > > It is: > > getfattr -d -m 'ceph.*' $DIR > > that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS > 7. > > Best, > Jesper > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > ____ > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 13:57 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Just tested: > > getfattr -n ceph.dir.rbytes $DIR > > Works on CentOS 7, but not on Ubuntu 18.04 eighter. > Weird? > > Best, > Jesper > ------ > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Robert Gallop > Sendt: 16. december 2021 13:42 > Til: Jesper Lykkegaard Karlsen > Cc: ceph-users@ceph.io > Emne: Re: [ceph-users] Re: cephfs quota used > > From what I understand you used to be able to do that but cannot on > later kernels? 
> > Seems there would be a list somewhere, but I can’t find it, maybe > it’s changing too often depending on the kernel your using or > something. > > But yeah, these attrs are one of the major reasons we are moving from > traditional appliance NAS to ceph, the many other benefits come with > it. > > On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen > mailto:je...@mbg.au.dk>> wrote: > Thanks everybody, > > That was a quick answer. > > getfattr -n ceph.dir.rbytes $DIR > > Was the answer that worked for me. So getfattr was the solution after > all. > > Is there some way I can display all attributes, without knowing them > in forehand? > > I have tried: > > getfattr -d -m 'ceph.*' $DIR > > which gives me no output. Should that not list all atributes? > > This
[ceph-users] Re: cephfs quota used
Not to spam, but to make it output prettier, one can also separate the number from the byte-size prefix. numfmt --to=iec --suffix=B --padding=7 $(getfattr --only-values -n ceph.dir.rbytes $1 2>/dev/nul) | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g' //Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ____ Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 23:07 Til: Jean-Francois GUILLAUME Cc: Robert Gallop ; ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Brilliant, thanks Jean-François Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jean-Francois GUILLAUME Sendt: 16. december 2021 23:03 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used Hi, You can avoid using awk by passing --only-values to getfattr. This should look something like this : > #!/bin/bash > numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n > ceph.dir.rbytes $1 2>/dev/null) Best, --- Cordialement, Jean-François GUILLAUME Plateforme Bioinformatique BiRD Tél. : +33 (0)2 28 08 00 57 www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr><http://www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr>> Inserm UMR 1087/CNRS UMR 6291 IRS-UN - 8 quai Moncousu - BP 70721 44007 Nantes Cedex 1 Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit : > To answer my own question. > It seems Frank Schilder asked a similar question two years ago: > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ > > listxattr() was aparrently removed and not much have happen since then > it seems. > > Anyway, I just made my own ceph-fs version of "du". > > ceph_du_dir: > > #!/bin/bash > # usage: ceph_du_dir $DIR > SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep > "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) > numfmt --to=iec-i --suffix=B --padding=7 $SIZE > > Prints out ceph-fs dir size in "human-readble" > It works like a charm and my god it is fast!. > > Tools like that could be very useful, if provided by the development > team 🙂 > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 14:37 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Woops, wrong copy/pasta: > > getfattr -n ceph.dir.rbytes $DIR > > works on all distributions I have tested. > > It is: > > getfattr -d -m 'ceph.*' $DIR > > that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS > 7. > > Best, > Jesper > ------ > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > ____ > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. 
december 2021 13:57 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Just tested: > > getfattr -n ceph.dir.rbytes $DIR > > Works on CentOS 7, but not on Ubuntu 18.04 eighter. > Weird? > > Best, > Jesper > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Robert Gallop > Sendt: 16. december 2021 13:42 > Til: Jesper Lykkegaard Karlsen > Cc: ceph-users@ceph.io > Emne: Re: [ceph-users] Re: cephfs quota used > > From what I understand you used to be able to do that but cannot on > later kernels? > > Seems there would be a list somewhere,
[ceph-users] Re: cephfs quota used
Thanks Konstantin, Actually, I went a bit further and made the script more universal in usage: ceph_du_dir: #!/bin/bash # usage: ceph_du_dir $DIR1 ($DIR2 .) for i in "$@"; do if [[ -d $i && ! -L $i ]]; then echo "$(numfmt --to=iec --suffix=B --padding=7 $(getfattr --only-values -n ceph.dir.rbytes "$i" 2>/dev/null) | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g') $i" fi done The above can be run as: ceph_du_dir $DIR with multiple directories: ceph_du_dir $DIR1 $DIR2 $DIR3 .. Or even with wildcard: ceph_du_dir $DIR/* Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Konstantin Shalygin Sendt: 17. december 2021 09:17 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] cephfs quota used Or you can mount with 'dirstat' option and use 'cat .' for determine CephFS stats: alias fsdf="cat . | grep rbytes | awk '{print \$2}' | numfmt --to=iec --suffix=B" [root@host catalog]# fsdf 245GB [root@host catalog]# Cheers, k On 17 Dec 2021, at 00:25, Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>> wrote: Anyway, I just made my own ceph-fs version of "du". ceph_du_dir: #!/bin/bash # usage: ceph_du_dir $DIR SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) numfmt --to=iec-i --suffix=B --padding=7 $SIZE Prints out ceph-fs dir size in "human-readble" It works like a charm and my god it is fast!. Tools like that could be very useful, if provided by the development team 🙂 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Healthy objects trapped in incomplete pgs
Dear Cephers, A few days ago disaster struck the Ceph cluster (erasure-coded) I am administrating, as the UPS power was pulled from the cluster, causing a power outage. After rebooting the system, 6 OSDs were lost (spread over 5 OSD nodes) as they could not mount anymore, and several others had damage. This was more than the host failure domain was set up to handle; auto-recovery failed and OSDs started going down in a cascading manner. When the dust settled, there were 8 PGs (of 2048) inactive and a bunch of OSDs down. I managed to recover 5 PGs, mainly with ceph-objectstore-tool export/import/repair commands, but now I am left with 3 PGs that are inactive and incomplete. One of the PGs seems unsalvageable, as I cannot get it to become active at all (repair/import/export/lowering min_size), but the other two I can get active if I export/import one of the PG shards and restart the OSD. Rebuilding then starts, but after a while one of the OSDs holding the PGs goes down, with a "FAILED ceph_assert(clone_size.count(clone))" message in the log. If I set the OSDs to noout/nodown, then I can see that it is only rather few objects, e.g. 161 of a PG of >10, that are failing to be remapped. Since most of the objects in the two PGs seem intact, it would be sad to delete the whole PG (force-create-pg) and lose all that data. Is there a way to show and delete the failing objects? I have thought of a recovery plan and want to share it with you, so you can comment on whether it sounds doable or not: * Stop OSDs from recovering: ceph osd set norecover * Bring the PGs back active: ceph-objectstore-tool export/import and restart the OSD * Find the files in the PGs: cephfs-data-scan pg_files * Pull out as many of those files as possible to another location. * Recreate the PGs: ceph osd force-create-pg * Restart recovery: ceph osd unset norecover * Copy the recovered files back in. Would that work, or do you have a better suggestion? Cheers, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
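A rough sketch of the salvage part of that plan (mount point, PG ids and destination are placeholders, and it assumes cephfs-data-scan pg_files prints paths under the mount point it is given); whatever still reads cleanly gets copied out before the PGs are recreated:

ceph osd set norecover
for pg in 20.2a 20.1f3 20.3b7; do           # the three incomplete PGs (example ids)
    cephfs-data-scan pg_files /mnt/cephfs "$pg"
done | sort -u > /root/pg_files.list
# copy out whatever is still readable, preserving the directory layout
while IFS= read -r f; do
    cp --parents -a -- "$f" /safe/location/ 2>> /root/copy_errors.log
done < /root/pg_files.list
# afterwards: ceph osd force-create-pg <pgid> --yes-i-really-mean-it  (per PG)
#             ceph osd unset norecover
# and finally copy the recovered files back in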
[ceph-users] Cephadm stacktrace on copying ceph.conf
y", line 343, in _make_cd_request self._fs.basename(path)) File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request raise exc asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied 3/26/24 9:38:09 PM[INF]Updating dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf It seems to be related to the permissions that the manager writes the files with and the process copying them around. $ sudo ceph -v [sudo] password for adminjskr: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) Best regards, Jesper Agerbo Krogh Director Digitalization Digitalization Topsoe A/S Haldor Topsøes Allé 1 2800 Kgs. Lyngby Denmark Phone (direct): 27773240 Read more at topsoe.com Topsoe A/S and/or its affiliates. This e-mail message (including attachments, if any) is confidential and may be privileged. It is intended only for the addressee. Any unauthorised distribution or disclosure is prohibited. Disclosure to anyone other than the intended recipient does not constitute waiver of privilege. If you have received this email in error, please notify the sender by email and delete it and any attachments from your computer system and records. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io