[ceph-users] Erasure coding and backfilling speed

2023-07-05 Thread jesper
Hi. 

I have a Ceph (NVMe) based cluster with 12 hosts and 40 OSDs. It is currently 
backfilling PGs, but I cannot get it to run more than 20 backfills 
at the same time (6+2 EC profile).
osd_max_backfills = 100 and osd_recovery_max_active_ssd = 50 (non-sane values), but it 
still stops at 20, with 40+ PGs in backfill_wait.
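For reference, this is roughly how the values were applied and checked (a sketch; assumes the central config database):

ceph config set osd osd_max_backfills 100
ceph config set osd osd_recovery_max_active_ssd 50
ceph config show osd.0 | grep -e osd_max_backfills -e osd_recovery_max_active_ssd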

Any idea about how to speed it up? 

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] HBase/HDFS on Ceph/CephFS

2020-04-23 Thread jesper
Hi

We have a 3-year-old Hadoop cluster - up for refresh - so it is time
to evaluate options. The "only" use case is running an HBase installation,
which is important for us, and migrating out of HBase would be a hassle.

Our Ceph usage has expanded and in general - we really like what we see.

Thus - can this be "sanely" consolidated somehow? I have seen this:
https://docs.ceph.com/docs/jewel/cephfs/hadoop/
But it seems really, really bogus to me.

It recommends that you set:
pool 3 'hadoop1' rep size 1 min_size 1

Which would - if I understand correctly - be disastrous. The Hadoop end would
replicate 3 times across nodes, but within Ceph the replication would be 1.
Replication 1 in Ceph means pulling an OSD node would "guarantee" that the
PGs go inactive - which could be OK - but there is nothing
guaranteeing that the other Hadoop replicas are not served out of the same
OSD node/PG. In which case, rebooting an OSD node would make the Hadoop
cluster unavailable.

Is anyone serving HBase out of Ceph - how does the stack and
configuration look? If I went for 3x replication in both Ceph and HDFS
then it would definitely work, but 9 copies of the dataset is a bit more
than what looks feasible at the moment.

Thanks for your reflections/input.

Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBase/HDFS on Ceph/CephFS

2020-04-27 Thread jesper
> Using a local filesystem is a bit tricky. We just tried a POC mounting CephFS
> on every Hadoop node and configuring Hadoop to use LocalFS with replica = 1,
> which ends up with each piece of data written only once into CephFS, with
> CephFS taking care of the data durability.

Can you tell a bit more about this?

Well, yes, I lose data locality - but HBase is not that good at maintaining
that anyway. When starting up, it does not distribute shards to the HDFS
nodes that have the data but pulls randomly. It gets locality either by "major
compact" or by waiting for compaction to rewrite everything again. I may get
equally good data locality with Ceph-based SSDs as with local HDDs (which I
currently have).
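Just to make sure I understand the setup - something like this minimal sketch (mount options, paths and the HBase rootdir are illustrative, not taken from your POC)?

# mount CephFS on every Hadoop/HBase node
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=hbase,secretfile=/etc/ceph/hbase.secret
# then point Hadoop/HBase at it as a local path, e.g. hbase.rootdir = file:///mnt/cephfs/hbase,
# and leave the Hadoop-level replication at 1, since CephFS already provides the durability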

Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph MDS - busy?

2020-04-30 Thread jesper
Hi.

How do I find out if the MDS is "busy" - i.e. the component limiting CephFS
metadata throughput (12.2.8)?

$ time find . | wc -l
1918069

real    8m43.008s
user    0m2.689s
sys     0m7.818s

or about 0.27 ms per file (roughly 3,667 files/s).
In the light of "potentially batching" and a network latency of ~0.20ms to
the MDS - I have a feeling that this could be significantly improved.

Then I additionally tried to do the same through the NFS-Ganesha gateway.

For reference:
Same - but on "local DAS - xfs".
$ time find . | wc -l
1918061

real    0m4.848s
user    0m2.360s
sys     0m2.816s

Same but "above local DAS over NFS":
$ time find . | wc -l
1918061

real    5m56.546s
user    0m2.903s
sys     0m34.381s


jk@ceph-mon1:~$ sudo ceph fs status
[sudo] password for jk:
cephfs - 84 clients
==
+--++---+---+---+---+
| Rank | State  |MDS|Activity   |  dns  |  inos |
+--++---+---+---+---+
|  0   | active | ceph-mds2 | Reqs: 1369 /s | 11.3M | 11.3M |
| 0-s  | standby-replay | ceph-mds1 | Evts:0 /s |0  |0  |
+--++---+---+---+---+
+--+--+---+---+
|   Pool   |   type   |  used | avail |
+--+--+---+---+
| cephfs_metadata  | metadata |  226M | 16.4T |
|   cephfs_data|   data   |  164T |  132T |
| cephfs_data_ec42 |   data   |  180T |  265T |
+--+--+---+---+

+-+
| Standby MDS |
+-+
+-+
MDS version: ceph version 12.2.5-45redhat1xenial
(d4b9f17b56b3348566926849313084dd6efc2ca2) luminous (stable)

How can we assess where the bottleneck is and what to do to speed it up?
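A sketch of the obvious places to look (run on the active MDS host via the admin socket; illustrative, not output from this cluster):

ceph daemonperf mds.ceph-mds2                  # live request/latency counters
ceph daemon mds.ceph-mds2 perf dump            # full counter dump (mds, mds_server, objecter)
ceph daemon mds.ceph-mds2 dump_ops_in_flight   # anything stuck?
ceph daemon mds.ceph-mds2 session ls           # per-client caps and request load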



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD weight on Luminous

2020-05-14 Thread jesper

Unless you have enabled some balancing, this is very normal (actually, pretty 
good for "normal").
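If you want Ceph to even this out on its own, a minimal sketch of enabling the upmap balancer (assuming all clients are Luminous or newer - check with "ceph features" first):

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status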

Jesper


Sent from myMail for iOS


Thursday, 14 May 2020, 09.35 +0200 from Florent B.  :
>Hi,
>
>I have something strange on a Ceph Luminous cluster.
>
>All OSDs have the same size, the same weight, and one of them is used at
>88% by Ceph (osd.3) while others are around 40 to 50% usage :
>
>#  ceph osd df
>ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP    META    AVAIL   %USE  VAR  PGS
> 2   hdd 0.49179  1.0  504GiB  264GiB  263GiB 63.7MiB  960MiB  240GiB 52.34 1.14  81
>13   hdd 0.49179  1.0  504GiB  267GiB  266GiB 55.7MiB 1.37GiB  236GiB 53.09 1.16  94
>20   hdd 0.49179  1.0  504GiB  235GiB  234GiB 62.5MiB  962MiB  268GiB 46.70 1.02  99
>21   hdd 0.49179  1.0  504GiB  306GiB  305GiB 65.2MiB  991MiB  198GiB 60.75 1.32  87
>22   hdd 0.49179  1.0  504GiB  185GiB  184GiB 51.9MiB  972MiB  318GiB 36.83 0.80  73
>23   hdd 0.49179  1.0  504GiB  167GiB  166GiB 60.9MiB  963MiB  337GiB 33.07 0.72  80
>24   hdd 0.49179  1.0  504GiB  235GiB  234GiB 67.5MiB  956MiB  268GiB 46.74 1.02  90
>25   hdd 0.49179  1.0  504GiB  183GiB  182GiB 68.8MiB  955MiB  321GiB 36.32 0.79 100
> 3   hdd 0.49179  1.0  504GiB  442GiB  440GiB 77.5MiB 1.15GiB 61.9GiB 87.70 1.91 103
>26   hdd 0.49179  1.0  504GiB  220GiB  219GiB 61.2MiB  963MiB  283GiB 43.78 0.95  80
>29   hdd 0.49179  1.0  504GiB  298GiB  296GiB 77.4MiB 1013MiB  206GiB 59.09 1.29 106
>30   hdd 0.49179  1.0  504GiB  183GiB  182GiB 60.2MiB  964MiB  321GiB 36.32 0.79  88
>10   hdd 0.49179  1.0  504GiB  176GiB  175GiB 56.5MiB  968MiB  327GiB 35.02 0.76  85
>11   hdd 0.49179  1.0  504GiB  209GiB  208GiB 62.5MiB  961MiB  295GiB 41.42 0.90  89
> 0   hdd 0.49179  1.0  504GiB  253GiB  252GiB 55.7MiB  968MiB  251GiB 50.18 1.09  76
> 1   hdd 0.49179  1.0  504GiB  199GiB  198GiB 60.4MiB  964MiB  305GiB 39.51 0.86  92
>16   hdd 0.49179  1.0  504GiB  219GiB  218GiB 58.2MiB  966MiB  284GiB 43.51 0.95  85
>17   hdd 0.49179  1.0  504GiB  231GiB  230GiB 69.0MiB  955MiB  272GiB 45.97 1.00  97
>14   hdd 0.49179  1.0  504GiB  210GiB  209GiB 61.0MiB  963MiB  293GiB 41.72 0.91  74
>15   hdd 0.49179  1.0  504GiB  182GiB  181GiB 50.7MiB  973MiB  322GiB 36.10 0.79  72
>18   hdd 0.49179  1.0  504GiB  297GiB  296GiB 53.7MiB  978MiB  206GiB 59.03 1.29  87
>19   hdd 0.49179  1.0  504GiB  125GiB  124GiB 61.9MiB  962MiB  379GiB 24.81 0.54  82
>            TOTAL 10.8TiB 4.97TiB 4.94TiB 1.33GiB 21.4GiB 5.85TiB 45.91
>MIN/MAX VAR: 0.54/1.91  STDDEV: 12.80
>
>
>Is it a normal situation ? Is there any way to let Ceph handle this
>alone or am I forced to reweight the OSD manually ?
>
>Thank you.
>
>Florent
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Change crush rule on pool

2020-08-05 Thread jesper


Hi

I would like to change the crush rule so data lands on SSDs instead of HDDs. Can 
this be done on the fly, with the migration just happening, or do I need to do 
something to move the data?
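For context, the kind of change I have in mind is roughly this (rule and pool names are illustrative):

ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set <pool> crush_rule replicated-ssd
# data should then remap and backfill onto the SSD OSDs by itself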

Jesper



Sent from myMail for iOS
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change crush rule on pool

2020-09-12 Thread jesper
> I would like to change the crush rule so data lands on ssd instead of hdd,
> can this be done on the fly and migration will just happen or do I need to
> do something to move data?

I would actually like to relocate my object store to a new storage tier.
Is the best approach to:

1) create new pool on storage tier (SSD)
2) stop activity
3) rados cppool data to the new one.
4) rename the pool back into the "default.rgw.buckets.data" pool.

Done?
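Roughly, as commands (a hedged sketch - pool names and PG counts illustrative, and with client traffic stopped first):

ceph osd pool create default.rgw.buckets.data.new 128 128 replicated replicated-ssd
rados cppool default.rgw.buckets.data default.rgw.buckets.data.new
ceph osd pool rename default.rgw.buckets.data default.rgw.buckets.data.old
ceph osd pool rename default.rgw.buckets.data.new default.rgw.buckets.data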

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change crush rule on pool

2020-09-12 Thread jesper

 Can I do that when the SSDs are already used by another crush rule - backing the 
kvm_ssd RBDs?

Jesper



Sent from myMail for iOS


Saturday, 12 September 2020, 11.01 +0200 from anthony.da...@gmail.com  
:
>If you have capacity to have both online at the same time, why not add the 
>SSDs to the existing pool, let the cluster converge, then remove the HDDs?  
>Either all at once or incrementally?  With care you’d have zero service 
>impact.  If you want to change the replication strategy at the same time, that 
>would be more complex.
>
>— Anthony
>
>> On Sep 12, 2020, at 12:42 AM,  jes...@krogh.cc wrote:
>> 
>>> I would like to change the crush rule so data lands on ssd instead of hdd,
>>> can this be done on the fly and migration will just happen or do I need to
>>> do something to move data?
>> 
>> I would actually like to relocate my object store to a new storage tier.
>> Is the best to:
>> 
>> 1) create new pool on storage tier (SSD)
>> 2) stop activity
>> 3) rados cppool data to the new one.
>> 4) rename the pool back into the "default.rgw.buckets.data" pool.
>> 
>> Done?
>> 
>> Thanks.
>> ___
>> ceph-users mailing list --  ceph-users@ceph.io
>> To unsubscribe send an email to  ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: krdb upmap compatibility

2019-08-26 Thread jesper

What will actually happen if an old client comes by - potential data damage, or 
just broken connections from the client?
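For reference, the checks I would do first (a hedged sketch using the standard commands):

ceph features                                     # shows which release each connected client reports
ceph osd set-require-min-compat-client luminous   # refuses to apply if pre-luminous clients are still connected
# once set, pre-luminous clients are rejected at connect time rather than being fed maps they cannot decode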

jesper 



Sent from myMail for iOS


Monday, 26 August 2019, 20.16 +0200 from Paul Emmerich  
:
>4.13 or newer is enough for upmap
>
>-- 
>Paul Emmerich
>
>Looking for help with your Ceph cluster? Contact us at  https://croit.io
>
>croit GmbH
>Freseniusstr. 31h
>81247 München
>www.croit.io
>Tel:  +49 89 1896585 90
>
>On Mon, Aug 26, 2019 at 8:01 PM Frank R < frankaritc...@gmail.com > wrote:
>>
>> It seems that with Linux kernel 4.16.10 krdb clients are seen as Jewel 
>> rather than Luminous. Can someone tell me which kernel version will be seen 
>> as Luminous as I want to enable the Upmap Balancer.
>> ___
>> ceph-users mailing list --  ceph-users@ceph.io
>> To unsubscribe send an email to  ceph-users-le...@ceph.io
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: the ceph rbd read dd with fio performance diffrent so huge?

2019-08-27 Thread jesper

The concurrency is wildly different - 1 job for dd vs. 30 for fio.
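A more apples-to-apples comparison would be a single-job, queue-depth-1 fio run (sketch, reusing the same file as in the dd test):

fio --filename=/mnt/testw.dbf --direct=1 --rw=read --ioengine=libaio \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=10 --group_reporting --name=single
# expect numbers much closer to the dd result, since both now issue one 4k read at a time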

Jesper 



Sent from myMail for iOS


Tuesday, 27 August 2019, 16.25 +0200 from linghucongs...@163.com  
:
>Why is the performance difference between dd and fio so huge?
>
>I have 25 OSDs with 8TB HDDs. With dd I only get 410KB/s read performance, but 
>with fio I get 991.23MB/s read performance.
>
>like below:
>
>Thanks in advance!
>
>root@Server-d5754749-cded-4964-8129-ba1accbe86b3:~# time dd of=/dev/zero 
>if=/mnt/testw.dbf bs=4k count=10000 iflag=direct
>10000+0 records in
>10000+0 records out
>40960000 bytes (41 MB, 39 MiB) copied, 99.9445 s, 410 kB/s
>
>real    1m39.950s
>user    0m0.040s
>sys 0m0.292s
>
>
>
>root@Server-d5754749-cded-4964-8129-ba1accbe86b3:~#
> fio --filename=/mnt/test1 -direct=1 -iodepth 1 -thread -rw=read 
>-ioengine=libaio -bs=4k -size=1G -numjobs=30 -runtime=10 
>-group_reporting -name=mytest  
>mytest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
>...
>fio-2.2.10
>Starting 30 threads
>Jobs: 30 (f=30): [R(30)] [100.0% done] [1149MB/0KB/0KB /s] [294K/0/0 iops] 
>[eta 00m:00s] 
>mytest: (groupid=0, jobs=30): err= 0: pid=5261: Tue Aug 27 13:37:28 2019
>  read : io=9915.2MB, bw=991.23MB/s, iops=253752, runt= 10003msec
>    slat (usec): min=2, max=200020, avg=39.10, stdev=1454.14
>    clat (usec): min=1, max=160019, avg=38.57, stdev=1006.99
> lat (usec): min=4, max=200022, avg=87.37, stdev=1910.99
>    clat percentiles (usec):
> |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
> | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
> | 70.00th=[    1], 80.00th=[    2], 90.00th=[    2], 95.00th=[    2],
> | 99.00th=[  612], 99.50th=[  684], 99.90th=[  780], 99.95th=[ 1020],
> | 99.99th=[56064]
>    bw (KB  /s): min= 7168, max=46680, per=3.30%, avg=33460.79, stdev=12024.35
>    lat (usec) : 2=73.62%, 4=22.38%, 10=0.05%, 20=0.03%, 50=0.01%
>    lat (usec) : 100=0.01%, 250=0.03%, 500=1.93%, 750=1.75%, 1000=0.14%
>    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>    lat (msec) : 100=0.03%, 250=0.01%
>  cpu  : usr=1.83%, sys=4.30%, ctx=104743, majf=0, minf=59
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued    : total=r=2538284/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency   : target=0, window=0, percentile=100.00%, depth=1
>
>Run status group 0 (all jobs):
>   READ: io=9915.2MB, aggrb=991.23MB/s, minb=991.23MB/s, maxb=991.23MB/s, 
>mint=10003msec, maxt=10003msec
>
>Disk stats (read/write):
>  vdb: ios=98460/0, merge=0/0, ticks=48840/0, in_queue=49144, util=17.28%
>
>
>
>
> 
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Danish ceph users

2019-08-29 Thread jesper

yes



Sent from myMail for iOS


Thursday, 29 August 2019, 15.52 +0200 from fr...@dtu.dk  :
>I would be in.
>
>=
>Frank Schilder
>AIT Risø Campus
>Bygning 109, rum S14
>
>
>From: Torben Hørup < tor...@t-hoerup.dk >
>Sent: 29 August 2019 14:03:13
>To:  ceph-users@ceph.io
>Subject: [ceph-users] Danish ceph users
>
>Hi
>
>A colleague and I are talking about making an event in Denmark for the
>danish ceph community, and we would like to get a feeling of how many
>ceph users are there in Denmark and hereof who would be interested in a
>Danish ceph event ?
>
>
>Regards,
>Torben
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS blocked ops; kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]

2019-09-03 Thread jesper
> Hi, I encountered a problem with blocked MDS operations and a client
> becoming unresponsive. I dumped the MDS cache, ops, blocked ops and some
> further log information here:
>
> https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l
>
> A user of our HPC system was running a job that creates a somewhat
> stressful MDS load. This workload tends to lead to MDS warnings like "slow
> metadata ops" and "client does not respond to caps release", which usually
> disappear without intervention after a while.

We have an HPC cluster with 4K cores and 30+ (largish) servers - 128GB
to 768GB compute nodes - and have experienced similar issues.

This bug seems very related:
https://tracker.ceph.com/issues/41467
(we haven't gotten a version with that patch yet).

Upgrading to a 5.2 kernel with this commit:
3e1d0452edceebb903d23db53201013c940bf000
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e1d0452edceebb903d23db53201013c940bf000

was capable of deadlocking the kernel when memory pressure caused the MDS to
reclaim capabilities - smells similar.



Jesper



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building a petabyte cluster from scratch

2019-12-03 Thread jesper
> After years of using Ceph, we plan to build soon a new cluster bigger than
> what
> we've done in the past. As the project is still in reflection, I'd like to
> have your thoughts on our planned design : any feedback is welcome :)
>
>
> ## Requirements
>
>  * ~1 PB usable space for file storage, extensible in the future
>  * The files are mostly "hot" data, no cold storage
>  * Purpose : storage for big files being essentially used on windows
> workstations (10G access)
>  * Performance is better :)
>
> ## Global design
>
>  * 8+3 Erasure Coded pool
>  * ZFS on RBD, exposed via samba shares (cluster with failover)
>
>
> ## Hardware
>
>  * 1 rack (multi-site would be better, of course...)
>
>  * OSD nodes : 14 x supermicro servers
>* 24 usable bays in 2U rackspace
>* 16 x 10 TB nearline SAS HDD (8 bays for future needs)
>* 2 x Xeon Silver 4212 (12C/24T)
>* 128 GB RAM
>* 4 x 40G QSFP+
>
>  * Networking : 2 x Cisco N3K 3132Q or 3164Q
>* 2 x 40G per server for ceph network (LACP/VPC for HA)
>* 2 x 40G per server for public network (LACP/VPC for HA)
>* QSFP+ DAC cables
>
>
> ## Sizing
>
> If we've done the maths well, we expect to have :
>
>  * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
>  * 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
>  * ~1 PB of usable space if we want to keep the OSD use under 66% to allow
>    losing nodes without problems, extensible to 1.6 PB (same condition)
>
>
> ## Reflections
>
>  * We're used to run mons and mgrs daemons on a few of our OSD nodes,
> without
>any issue so far : is this a bad idea for a big cluster ?
>
>  * We thought using cache tiering on an SSD pool, but a large part of the
> PB is
>used on a daily basis, so we expect the cache to be not so effective
> and
>really expensive ?
>
>  * Could a 2x10G network be enough ?

I would say yes - those slow disks will not deliver more anyway.
This is going to be a relatively "slow" setup with a limited amount of
read caching - with 16 drives / 128GB of memory it'll be a few GB per
OSD for read caching - meaning that all reads and writes will hit
the slow drives underneath.

And that in a "double slow" fashion - where one write hits 8+3 OSDs
and waits for sync-acks back to the primary - and similarly reads have to
gather shards from the EC OSDs before returning to the client.

Depending on the workload, this may just work for you - but it is definitely
not fast.

Suggestions for improvements:

* Hardware raid with Battery Backed write-cache - will allow OSD to ack
writes before hitting spinning rust.
* More memory for OSD-level read-caching.
* 3x replication instead of EC.
(We have all of the above in a "similar" setup: ~1PB, 10 OSD hosts.)
* SSD tiering pool (haven't been there - but would like to test it out).

-- 
Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building a petabyte cluster from scratch

2019-12-03 Thread jesper
>> * Hardware raid with Battery Backed write-cache - will allow OSD to ack
>> writes before hitting spinning rust.
>
> Disagree.  See my litany from a few months ago.  Use a plain, IT-mode HBA.
>  Take the $$ you save and put it toward building your cluster out of SSDs
> instead of HDDs.  That way you don’t have to mess with the management
> hassles of maintaining and allocating external WAL+DB partitions too.

These things are not really comparable - are they? The cost of SSD vs. HDD is
still about 6:1 in favor of HDDs. Yes, SSDs would be great, but not
necessarily affordable - or have I missed something that makes the math
work?

-- 
Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building a petabyte cluster from scratch

2019-12-03 Thread jesper
> If k=8,m=3 is too slow on HDDs, so you need replica 3 and SSD DB/WAL,
> vs EC 8,3 on SSD, then that's (1/3) / (8/11) = 0.45 multiplier on the
> SSD space required vs HDDs.
> That brings it from 6x to 2.7x. Then you have the benefit of not
> needing separate SSDs for DB/WAL both in hardware cost and complexity.
> SSDs will still be more expensive; but perhaps justifiable given the
> performance, rebuild times, etc.
>
> If you only need cold-storage, then EC 8,3 on HDDs will be cheap. But
> is that fast enough?

Ok, I understand.
We have a "hot" fraction of our dataset - and 10GB of cache for each of the 113
HDDs, ~1TB effective read cache - and then writes hitting the battery-backed
write cache. This can overspill, and when hitting "cold" data, performance
varies. But the read/write amplification of EC is still unmanageable in
practice on HDDs with an active dataset.


-- 
Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Performance of old vs new hw?

2020-02-17 Thread jesper


Hi

We have some oldish servers with SSDs - all on 25Gbit NICs. R815 AMD - 2.4GHz+.

Are there significant performance benefits in moving to new NVMe-based servers
with new CPUs?

+20% IOPS? +50% IOPS?

Jesper



Sent from myMail for iOS
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Performance of Micron 5210 SATA?

2020-03-06 Thread jesper

But is random/sequential read performance still good, even during saturated 
writes?

If so, the tradeoff could fit quite a few applications.
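Something like this fio sketch is what I have in mind (the device path is a placeholder, and it writes to the raw device - scratch drives only):

fio --ioengine=libaio --direct=1 --time_based --runtime=1800 --group_reporting \
    --name=seqwrite --filename=/dev/sdX --rw=write --bs=1M --iodepth=32 \
    --name=randread --filename=/dev/sdX --rw=randread --bs=4k --iodepth=1
# the interesting part is whether the 4k read latency stays sane once the SLC cache is exhausted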



Sent from myMail for iOS


Friday, 6 March 2020, 14.06 +0100 from vitalif  :
>Hi,
>
>Current QLC drives are total shit in terms of steady-state performance. 
>First 10-100 GB of data is written into the SLC cache which is fast, but 
>then the drive switches to its QLC memory and even the linear write 
>performance drops to ~90 MB/s which is actually worse than with HDDs!
>
>So, try to run a long linear write test and check the performance after 
>writing a lot of data.
>
>> Last monday I performed a quick test with those two disks already,
>> probably not that relevant, but posting it anyway:
>___
>ceph-users mailing list --  ceph-users@ceph.io
>To unsubscribe send an email to  ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: New 3 node Ceph cluster

2020-03-14 Thread jesper
Hi.

Unless there are plans for going to petabyte scale with it, I really don't
see the benefit of getting CephFS involved over just an RBD image with a VM
running standard Samba on top.

More performant and less complexity to handle - zero gains (by my book).
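Roughly this, as a sketch (pool/image names and sizes are illustrative):

ceph osd pool create rbd_samba 128
rbd pool init rbd_samba
rbd create rbd_samba/shares --size 10T
# attach the image to the file-server VM (librbd/virtio), then inside the VM:
#   mkfs.xfs /dev/vdb && mount /dev/vdb /srv/shares
# and export /srv/shares with a standard Samba configuration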

Jesper

> Hi,
>
> I am planning to create a new 3 node ceph storage cluster.
>
> I will be using Cephfs + with samba for max 10 clients for upload and
> download.
>
> Storage Node HW is Intel Xeon E5v2 8 core single Proc, 32GB RAM and 10Gb
> Nic 2 nos., 6TB SATA  HDD 24 Nos. each node, OS separate SSD disk.
>
> Earlier I have tested orchestration using ceph-deploy in the test setup.
> now, is there any other alternative to ceph-deploy?
>
> Can I restrict folder access to the user using cephfs + vfs samba or
> should
> I use ceph client + samba?
>
> Ubuntu or Centos?
>
> Any block size consideration for object size, metadata when using cephfs?
>
> Idea or suggestion from existing users. I am also going to start to
> explore
> all the above.
>
> regards
> Amudhan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recommendation for decent write latency performance from HDDs

2020-04-04 Thread jesper
Hi.

We have a need for "bulk" storage - but with decent write latencies.
Normally we would do this with a DAS with RAID 5 and a 2GB battery-backed
write cache in front - as cheap as possible, but still getting the
scalability features of Ceph.

In our "first" ceph cluster we did the same - just stuffed in BBWC
in the OSD nodes and we're fine - but now we're onto the next one and
systems like:
https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
Does not support a Raid controller like that - but is branded as for "Ceph
Storage Solutions".

It do however support 4 NVMe slots in the front - So - some level of
"tiering" using the NVMe drives should be what is "suggested" - but what
do people do? What is recommeneded. I see multiple options:

Ceph tiering at the "pool - layer":
https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
And rumors that it is "deprecated":
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality

Pro: Abstract layer
Con: Deprecated? - Lots of warnings?

Offloading the block.db on NVMe / SSD:
https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/

Pro: Easy to deal with - seem heavily supported.
Con: As far as I can tell, this will only benefit the metadata of the
OSD - not the actual data. Thus a data commit to the OSD will still be dominated
by the write latency of the underlying - very slow - HDD.
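For completeness, the way this offload is usually set up (a sketch - device names are placeholders):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# one NVMe partition (or LV) per HDD OSD; sizing guidance is in the bluestore-config-ref linked above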

Bcache:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html

Pro: Closest to the BBWC mentioned above - but with way, way larger cache
sizes.
Con: It is hard to tell whether I would end up being the only one on the planet
using this solution.
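For reference, the bcache variant would look roughly like this (a sketch - devices are placeholders, and I have not validated it under an OSD):

make-bcache -C /dev/nvme0n1p2 -B /dev/sdb            # NVMe cache device + HDD backing device
echo writeback > /sys/block/bcache0/bcache/cache_mode
ceph-volume lvm create --bluestore --data /dev/bcache0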

Eat it - writes will be as slow as hitting dead rust - anything that cannot
live with that needs to be entirely on SSD/NVMe.

Other?

Thanks for your input.

Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-04 Thread jesper
> On Sat, Apr 4, 2020 at 4:13 PM  wrote:
>> Offloading the block.db on NVMe / SSD:
>> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>>
>> Pro: Easy to deal with - seem heavily supported.
>> Con: As far as I can tell - this will only benefit the metadata of the
>> osd- not actual data. Thus a data-commit to the osd til still be
>> dominated
>> by the writelatency of the underlying - very slow HDD.
>
> small writes (<= 32kb, configurable) are written to db first and
> written back to the slow disk asynchronous to the original request.

Now, that sounds really interesting - I haven't been able to find that in
the documentation - can you provide a pointer? What's the configuration
parameter named?

Meaning that moving block.db to, say, a 256GB NVMe will do "the right thing"
for the system and deliver a fast write cache for smallish writes.

Would setting the parameter to 1MB be "insane"?
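A hedged guess at the knob being referred to (the name is from the BlueStore config reference - worth double-checking against the running release before touching it):

ceph config get osd.0 bluestore_prefer_deferred_size_hdd   # defaults to 32768 (32 KiB) on HDD OSDs
# there are also bluestore_prefer_deferred_size and bluestore_prefer_deferred_size_ssd variants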

Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS_CACHE_OVERSIZED warning

2020-04-16 Thread jesper
Hi.

I have a cluster that has been running for close to 2 years now - pretty
much with the same settings - but over the past day I'm seeing this warning.

(And the cache seems to keep growing.) Can I figure out which clients are
accumulating the inodes?

Ceph 12.2.8 - is it OK just to "bump" the memory to, say, 128GB - any
negative side effects?

jk@ceph-mon1:~$ sudo ceph health detail
HEALTH_WARN 1 MDSs report oversized cache; 3 clients failing to respond to
cache pressure
MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
mdsceph-mds1(mds.0): MDS cache is too large (91GB/32GB); 34400070
inodes in use by clients, 3293 stray files
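For reference, the obvious place to look seems to be the session list, plus the cache limit (hedged - on 12.2.x the memory limit has to go into ceph.conf or via injectargs; the central config store only arrives in Mimic):

ceph daemon mds.ceph-mds1 session ls      # per-client num_caps shows who is holding the inodes
# bumping the cache would be something like:
#   mds_cache_memory_limit = 64424509440  # 60 GiB, illustrative value, in ceph.conf on the MDS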


Thanks - Jesper

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg_num != pgp_num - and unable to change.

2023-07-05 Thread Jesper Krogh

Hi.

Fresh cluster - after a dance where the autoscaler did not work (returned 
blank output), as described in the docs, I now seemingly have it 
working. It has bumped the target to something reasonable - and is slowly 
incrementing pg_num and pgp_num by 2 over time (I hope this is correct?)


But...
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk 
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application 
cephfs


pg_num = 150
pgp_num = 22

and setting pgp_num seemingly has zero effect on the system - not even 
with autoscaling set to off.


jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_autoscale_mode off

set pool 22 pg_autoscale_mode to off
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pgp_num 150

set pool 22 pgp_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_num_min 128

set pool 22 pg_num_min to 128
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_num 150

set pool 22 pg_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_autoscale_mode on

set pool 22 pg_autoscale_mode to on
jskr@dkcphhpcmgt028:/$ sudo ceph progress
PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
[]
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk 
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application 
cephfs


pgp_num != pg_num ?

In earlier versions of Ceph (without the autoscaler), I have only experienced 
setting pg_num and pgp_num taking immediate effect.


Jesper

jskr@dkcphhpcmgt028:/$ sudo ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy 
(stable)

jskr@dkcphhpcmgt028:/$ sudo ceph health
HEALTH_OK
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
id: 5c384430-da91-11ed-af9c-c780a5227aff
health: HEALTH_OK

  services:
mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 
(age 15h)
mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys: 
dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd

mds: 2/2 daemons up, 1 standby
osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs

  data:
volumes: 2/2 healthy
pools:   9 pools, 495 pgs
objects: 24.85M objects, 60 TiB
usage:   117 TiB used, 158 TiB / 276 TiB avail
pgs: 13494029/145763897 objects misplaced (9.257%)
 462 active+clean
 23  active+remapped+backfilling
 10  active+remapped+backfill_wait

  io:
client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
recovery: 705 MiB/s, 208 objects/s

  progress:


--
Jesper Krogh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cannot get backfill speed up

2023-07-05 Thread Jesper Krogh



Hi.

Fresh cluster - but despite setting:
jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
osd_recovery_max_active_ssd    50     mon    default[20]
jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
osd_max_backfills              100    mon    default[10]


I still get
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
id: 5c384430-da91-11ed-af9c-c780a5227aff
health: HEALTH_OK

  services:
mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 
(age 16h)
mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: 
dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd

mds: 2/2 daemons up, 1 standby
osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs

  data:
volumes: 2/2 healthy
pools:   9 pools, 495 pgs
objects: 24.85M objects, 60 TiB
usage:   117 TiB used, 159 TiB / 276 TiB avail
pgs: 10655690/145764002 objects misplaced (7.310%)
 474 active+clean
 15  active+remapped+backfilling
 6   active+remapped+backfill_wait

  io:
client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
recovery: 328 MiB/s, 108 objects/s

  progress:
Global Recovery Event (9h)
  [==..] (remaining: 25m)

With these numbers for the settings, I would expect to see more than 15 
PGs actively backfilling (and given the SSDs and the 2x25Gbit network, I can 
also spend more resources on recovery than 328 MiB/s).
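Side note, from the Quincy docs and not yet verified on this cluster: with the mClock scheduler these limits may effectively be ignored, and the recovery profile is what needs changing, e.g.:

sudo ceph config show osd.0 osd_op_queue                        # "mclock_scheduler" is the Quincy default
sudo ceph config set osd osd_mclock_profile high_recovery_ops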


Thanks.

--
Jesper Krogh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephfs metadata and MDS on same node

2021-03-09 Thread Jesper Lykkegaard Karlsen
Dear Ceph’ers

I am about to upgrade the MDS nodes for CephFS in the Ceph cluster (erasure code 
8+3) I am administering.

Since they will get plenty of memory and CPU cores, I was wondering if it would 
be a good idea to move the metadata OSDs (NVMes, currently on the OSD nodes together 
with the cephfs_data OSDs (HDD)) to the MDS nodes?

Configured as:

4 x MDS, each with a metadata OSD, and the metadata pool configured with 4x replication,

so each metadata OSD would have a complete copy of metadata.
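In CRUSH/pool terms, roughly this (a sketch - rule and device-class names are illustrative, and pinning the rule to exactly the four MDS hosts would additionally need its own CRUSH bucket):

ceph osd crush rule create-replicated meta-nvme default host nvme
ceph osd pool set cephfs_metadata crush_rule meta-nvme
ceph osd pool set cephfs_metadata size 4
ceph osd pool set cephfs_metadata min_size 2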

I know the MDS stores a lot of metadata in RAM, but if the metadata OSDs were on the 
MDS nodes, would that not bring down latency?

Anyway, I am just asking for your opinion on this. Pros and cons - or, even better, 
somebody who has actually tried this?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk>
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recover data from Cephfs snapshot

2021-03-12 Thread Jesper Lykkegaard Karlsen
Hi Ceph'ers,

I love the possibility to make snapshots on Cephfs systems.

Although there is one thing that puzzles me.

Creating a snapshot takes no time, and deleting snapshots can bring PGs into 
the snaptrim state for some hours - while recovering data from a snapshot will 
always invoke a full data transfer, where the data is "physically" copied back 
into place.

This can make recovering from snapshots on CephFS a rather heavy procedure.
I have even tried the "mv" command, but that also starts transferring real data 
instead of just moving metadata pointers.
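For context, what a restore looks like today (paths purely illustrative):

# restoring a directory from a snapshot currently means a full copy out of .snap:
cp -a /cephfs/projects/.snap/daily_2021-03-11/projectX /cephfs/projects/projectX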

Am I missing some "ceph snapshot recover" command that can move metadata 
pointers and make recovery much lighter, or is this just the way it is?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] replacing OSD nodes

2022-07-19 Thread Jesper Lykkegaard Karlsen
5 377 46 322 24 
306 53 200 240 338   #1.9TiB bytes available on most full OSD (306)
ceph osd pg-upmap-items 20.6c5 334 371 30 340 70 266 241 407 3 233 186 356 40 
312 294 391   #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b4 344 338 226 389 319 362 309 411 85 379 248 233 
121 318 0 254   #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b1 325 292 35 371 347 153 146 390 12 343 88 327 27 
355 54 250 192 408   #1.9TiB bytes available on most full OSD (153)
ceph osd pg-upmap-items 20.57 82 389 282 356 103 165 62 284 67 408 252 366   
#1.9TiB bytes available on most full OSD (165)
ceph osd pg-upmap-items 20.50 244 355 319 228 154 397 63 317 113 378 97 276 288 
150   #1.9TiB bytes available on most full OSD (228)
ceph osd pg-upmap-items 20.47 343 351 107 283 81 332 76 398 160 410 26 378   
#1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3e 56 322 31 283 330 377 107 360 199 309 190 385 78 
406   #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3b 91 349 312 414 268 386 45 244 125 371   #1.9TiB 
bytes available on most full OSD (244)
ceph osd pg-upmap-items 20.3a 277 371 290 359 91 415 165 392 107 167   #1.9TiB 
bytes available on most full OSD (167)
ceph osd pg-upmap-items 20.39 74 175 18 302 240 393 3 269 224 374 194 408 173 
364   #1.9TiB bytes available on most full OSD (302)
...
...

If I were to put this into effect, I would first set norecover and nobackfill, 
then run the script, and unset norecover and nobackfill again.
But I am uncertain whether it would work - or even whether this is a good idea.
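Concretely, the flow I have in mind (a sketch - the script name is a placeholder for the generator producing the pg-upmap-items lines above):

ceph osd set norecover
ceph osd set nobackfill
bash prioritized-upmaps.sh     # emits/applies the "ceph osd pg-upmap-items ..." lines shown above
ceph osd unset nobackfill
ceph osd unset norecover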

It would be nice if Ceph did something similar automatically 🙂
Or maybe Ceph already does something similar, and I have just not been able to 
find it?

If Ceph were to do this, it would be nice if the priority of backfill_wait PGs 
were re-evaluated, perhaps every 24 hours, as the OSD availability landscape of 
course changes during backfill.

I imagine this, especially, could stabilize recovery/rebalance on systems where 
space is a little tight.

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replacing OSD nodes

2022-07-20 Thread Jesper Lykkegaard Karlsen
Thanks for you answer Janne.

Yes, I am also running "ceph osd reweight" on the "nearfull" OSDs once they 
get too close for comfort.

But I just thought a continuous prioritization of the rebalancing PGs could make 
this process smoother, with less/no need for manual operations.

Best,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


From: Janne Johansson 
Sent: 20 July 2022 10:47
To: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] replacing OSD nodes

On Tue, 19 July 2022 at 13:09, Jesper Lykkegaard Karlsen wrote:
>
> Hi all,
> Setup: Octopus - erasure 8-3
> I had gotten to the point where I had some rather old OSD nodes, that I 
> wanted to replace with new ones.
> The procedure was planned like this:
>
>   *   add new replacement OSD nodes
>   *   set all OSDs on the retiring nodes to out.
>   *   wait for everything to rebalance
>   *   remove retiring nodes

> After around 50% misplaced objects remaining, the OSDs started to complain 
> about backfillfull OSDs and nearfull OSDs.
> A bit of a surprise to me, as RAW size is only 47% used.
> It seems that rebalancing does not happen in a prioritized manner, where 
> planed backfill starts with the OSD with most space available space, but 
> "alphabetically" according to pg-name.
> Is this really true?

I don't know if it does it in any particular order, just that it
certainly doesn't fire off requests to the least filled OSD to receive
data first, so when I have gotten into similar situations, it just
tried to run as many moves as possible given max_backfill and all
that, then some/most might get stuck in toofull, but as the rest of
the slots progress, space gets available and at some point those
toofull ones get handled. It delays the completion but hasn't caused
me any other specific problems.

Though I will admit I have used "ceph osd reweight osd.123
" at times to force emptying of some OSDs, but that was
more my impatience than anything else.


--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replacing OSD nodes

2022-07-22 Thread Jesper Lykkegaard Karlsen
It seems like low-hanging fruit to fix?
There must be a reason why the developers have not made a prioritized order for 
backfilling PGs.
Or maybe the prioritization is based on something other than available space?

The question remains unanswered - as does whether my suggested approach/script would 
work or not.

Summer vacation?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


From: Janne Johansson 
Sent: 20 July 2022 19:39
To: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] replacing OSD nodes

On Wed, 20 July 2022 at 11:22, Jesper Lykkegaard Karlsen wrote:
> Thanks for you answer Janne.
> Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once they 
> get too close for comfort.
>
> But I just though a continuous prioritization of rebalancing PGs, could make 
> this process more smooth, with less/no need for handheld operations.

You are absolutely right there, just wanted to chip in with my
experiences of "it nags at me but it will still work out" so other
people finding these mails later on can feel a bit relieved at knowing
that a few toofull warnings aren't a major disaster and that it
sometimes happens, because ceph looks for all possible moves, even
those who will run late in the rebalancing.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG does not become active

2022-07-28 Thread Jesper Lykkegaard Karlsen
Hi Frank, 

I think you need at least 6 OSD hosts to make EC 4+2 work with failure domain host. 

I do not know how it was possible for you to create that configuration in the 
first place? 
Could it be that you have multiple names for the OSD hosts? 
That would at least explain one OSD down being shown as two OSDs down. 

Also, I believe that min_size should never be smaller than the number of data 
shards (k), which is 4 in this case. 

You can either make a new test setup with your three test OSD hosts using EC 
2+1, or make e.g. 4+2 but with the failure domain set to OSD. 
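For the latter, roughly (profile and pool names illustrative):

ceph osd erasure-code-profile set ec42-osd k=4 m=2 crush-failure-domain=osd
ceph osd pool create fs-data-ec42 64 64 erasure ec42-osd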

Best, 
Jesper
  
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 27 Jul 2022, at 17.32, Frank Schilder  wrote:
> 
> Update: the inactive PG got recovered and active after a lnngg wait. The 
> middle question is now answered. However, these two questions are still of 
> great worry:
> 
> - How can 2 OSDs be missing if only 1 OSD is down?
> - If the PG should recover, why is it not prioritised considering its severe 
> degradation
>  compared with all other PGs?
> 
> I don't understand how a PG can loose 2 shards if 1 OSD goes down. That looks 
> really really bad to me (did ceph loose track of data??).
> 
> The second is of no less importance. The inactive PG was holding back client 
> IO, leading to further warnings about slow OPS/requests/... Why are such 
> critically degraded PGs not scheduled for recovery first? There is a service 
> outage but only a health warning?
> 
> Thanks and best regards.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Frank Schilder 
> Sent: 27 July 2022 17:19:05
> To: ceph-users@ceph.io
> Subject: [ceph-users] PG does not become active
> 
> I'm testing octopus 15.2.16 and run into a problem right away. I'm filling up 
> a small test cluster with 3 hosts 3x3 OSDs and killed one OSD to see how 
> recovery works. I have one 4+2 EC pool with failure domain host and on 1 PGs 
> of this pool 2 (!!!) shards are missing. This most degraded PG is not 
> becoming active, its stuck inactive but peered.
> 
> Questions:
> 
> - How can 2 OSDs be missing if only 1 OSD is down?
> - Wasn't there an important code change to allow recovery for an EC PG with at
>  least k shards present even if min_size>k? Do I have to set something?
> - If the PG should recover, why is it not prioritised considering its severe 
> degradation
>  compared with all other PGs?
> 
> I have already increased these crush tunables and executed a pg repeer to no 
> avail:
> 
> tunable choose_total_tries 250 <-- default 100
> rule fs-data {
>id 1
>type erasure
>min_size 3
>max_size 6
>step set_chooseleaf_tries 50 <-- default 5
>step set_choose_tries 200 <-- default 100
>step take default
>step choose indep 0 type osd
>step emit
> }
> 
> Ceph health detail says to that:
> 
> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>pg 4.32 is stuck inactive for 37m, current state 
> recovery_wait+undersized+degraded+remapped+peered, last acting 
> [1,2147483647,2147483647,4,5,2]
> 
> I don't want to cheat and set min_size=k on this pool. It should work by 
> itself.
> 
> Thanks for any pointers!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG does not become active

2022-07-28 Thread Jesper Lykkegaard Karlsen
Ah, I see - I should have looked at the "raw" data instead ;-)

Then I agree, this is very weird.

Best, 
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 28 Jul 2022, at 12.45, Frank Schilder  wrote:
> 
> Hi Jesper,
> 
> thanks for looking at this. The failure domain is OSD and not host. I typed 
> it wrong in the text, the copy of the crush rule shows it right: step choose 
> indep 0 type osd.
> 
> I'm trying to reproduce the observation to file a tracker item, but it is 
> more difficult than expected. It might be a race condition, so far I didn't 
> see it again. I hope I can figure out when and why this is happening.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Jesper Lykkegaard Karlsen 
> Sent: 28 July 2022 12:02:51
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] PG does not become active
> 
> Hi Frank,
> 
> I think you need at least 6 OSD hosts to make EC 4+2 with faillure domain 
> host.
> 
> I do not know how it was possible for you to create that configuration at 
> first?
> Could it be that you have multiple name for the OSD hosts?
> That would at least explain the one OSD down, being show as two OSDs down.
> 
> Also, I believe that min_size should never be smaller than “coding” shards, 
> which is 4 in this case.
> 
> You can either make a new test setup with your three test OSD hosts using EC 
> 2+1 or make e.g. 4+2, but with failure domain set to OSD.
> 
> Best,
> Jesper
> 
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
> 
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
> 
>> On 27 Jul 2022, at 17.32, Frank Schilder  wrote:
>> 
>> Update: the inactive PG got recovered and active after a lnngg wait. The 
>> middle question is now answered. However, these two questions are still of 
>> great worry:
>> 
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - If the PG should recover, why is it not prioritised considering its severe 
>> degradation
>> compared with all other PGs?
>> 
>> I don't understand how a PG can loose 2 shards if 1 OSD goes down. That 
>> looks really really bad to me (did ceph loose track of data??).
>> 
>> The second is of no less importance. The inactive PG was holding back client 
>> IO, leading to further warnings about slow OPS/requests/... Why are such 
>> critically degraded PGs not scheduled for recovery first? There is a service 
>> outage but only a health warning?
>> 
>> Thanks and best regards.
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Frank Schilder 
>> Sent: 27 July 2022 17:19:05
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] PG does not become active
>> 
>> I'm testing octopus 15.2.16 and run into a problem right away. I'm filling 
>> up a small test cluster with 3 hosts 3x3 OSDs and killed one OSD to see how 
>> recovery works. I have one 4+2 EC pool with failure domain host and on 1 PGs 
>> of this pool 2 (!!!) shards are missing. This most degraded PG is not 
>> becoming active, its stuck inactive but peered.
>> 
>> Questions:
>> 
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - Wasn't there an important code change to allow recovery for an EC PG with 
>> at
>> least k shards present even if min_size>k? Do I have to set something?
>> - If the PG should recover, why is it not prioritised considering its severe 
>> degradation
>> compared with all other PGs?
>> 
>> I have already increased these crush tunables and executed a pg repeer to no 
>> avail:
>> 
>> tunable choose_total_tries 250 <-- default 100
>> rule fs-data {
>>   id 1
>>   type erasure
>>   min_size 3
>>   max_size 6
>>   step set_chooseleaf_tries 50 <-- default 5
>>   step set_choose_tries 200 <-- default 100
>>   step take default
>>   step choose indep 0 type osd
>>   step emit
>> }
>> 
>> Ceph health detail says to that:

[ceph-users] Re: cannot set quota on ceph fs root

2022-07-28 Thread Jesper Lykkegaard Karlsen
Hi Frank, 

I guess there is always the possibility to set a quota at the pool level with 
"ceph osd pool set-quota" (max_objects / max_bytes).
The CephFS quotas set through extended attributes only work on sub-directories, as far 
as I recall. 
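For reference, both variants as commands (pool, path and sizes illustrative):

ceph osd pool set-quota cephfs_data max_bytes 1099511627776        # 1 TiB quota on the data pool
setfattr -n ceph.quota.max_bytes -v 1099511627776 /mnt/cephfs/some/subdir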

Best, 
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 28 Jul 2022, at 17.22, Frank Schilder  wrote:
> 
> Hi Gregory,
> 
> thanks for your reply. It should be possible to set a quota on the root, 
> other vattribs can be set as well despite it being a mount point. There must 
> be something on the ceph side (or another bug in the kclient) preventing it.
> 
> By the way, I can't seem to find cephfs-tools like cephfs-shell. I'm using 
> the image quay.io/ceph/ceph:v15.2.16 and its not installed in the image. A 
> "yum provides cephfs-shell" returns no candidate and I can't find 
> installation instructions. Could you help me out here?
> 
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Gregory Farnum 
> Sent: 28 July 2022 16:59:50
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] cannot set quota on ceph fs root
> 
> On Thu, Jul 28, 2022 at 1:01 AM Frank Schilder  wrote:
>> 
>> Hi all,
>> 
>> I'm trying to set a quota on the ceph fs file system root, but it fails with 
>> "setfattr: /mnt/adm/cephfs: Invalid argument". I can set quotas on any 
>> sub-directory. Is this intentional? The documentation 
>> (https://docs.ceph.com/en/octopus/cephfs/quota/#quotas) says
>> 
>>> CephFS allows quotas to be set on any directory in the system.
>> 
>> Any includes the fs root. Is the documentation incorrect or is this a bug?
> 
> I'm not immediately seeing why we can't set quota on the root, but the
> root inode is special in a lot of ways so this doesn't surprise me.
> I'd probably regard it as a docs bug.
> 
> That said, there's also a good chance that the setfattr is getting
> intercepted before Ceph ever sees it, since by setting it on the root
> you're necessarily interacting with a mount point in Linux and those
> can also be finicky...You could see if it works by using cephfs-shell.
> -Greg
> 
> 
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replacing OSD nodes

2022-07-28 Thread Jesper Lykkegaard Karlsen
Thank you for your suggestions, Josh - it is really appreciated. 

Pgremapper looks interesting and is definitely something I will look into.
 
I know the balancer will reach a well-balanced PG landscape eventually, but I am 
not sure that it will prioritise backfill by "most available location" 
first. 
Then I might end up in the same situation, where some of the old (but not 
retired) OSDs start getting full. 

Then there is the "undo-upmaps" script, or maybe even the script that I propose 
in combination with "cancel-backfill", as it just moves what Ceph was 
planning to move anyway, just in a prioritised manner. 

Have you tried pgremapper yourself, Josh? 
Is it safe to use? 
And do the Ceph developers vouch for this method? 

Status now is that ~1,600,000,000 objects have been moved, which is about half of 
all of the planned backfills. 
I have been reweighting OSDs down as they get too close to maximum usage, which 
works to some extent. 

The monitors, on the other hand, are now complaining about using a lot of disk 
space, due to the long-running backfill. 
There is still plenty of disk space on the mons, but I feel that the backfill 
is getting slower and slower, although the same number of PGs are still 
backfilling. 

Can large disk usage on the mons slow down backfill and other operations? 
Is it dangerous? 

Best, 
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 28 Jul 2022, at 22.26, Josh Baergen  wrote:
> 
> I don't have many comments on your proposed approach, but just wanted
> to note that how I would have approached this, assuming that you have
> the same number of old hosts, would be to:
> 1. Swap-bucket the hosts.
> 2. Downweight the OSDs on the old hosts to 0.001. (Marking them out
> (i.e. weight 0) prevents maps from being applied.)
> 3. Add the old hosts back to the CRUSH map in their old racks or whatever.
> 4. Use https://github.com/digitalocean/pgremapper#cancel-backfill.
> 5. Then run https://github.com/digitalocean/pgremapper#undo-upmaps in
> a loop to drain the old OSDs.
> 
> This gives you the maximum concurrency and efficiency of movement, but
> doesn't necessarily solve your balance issue if it's the new OSDs that
> are getting full (that wasn't clear to me). It's still possible to
> apply steps 2, 4, and 5 if the new hosts are in place. If you're not
> in a rush could actually use the balancer instead of undo-upmaps in
> step 5 to perform the rest of the data migration and then you wouldn't
> have full OSDs.
> 
> Josh
> 
> On Fri, Jul 22, 2022 at 1:57 AM Jesper Lykkegaard Karlsen
>  wrote:
>> 
>> It seems like a low hanging fruit to fix?
>> There must be a reason why the developers have not made a prioritized order 
>> of backfilling PGs.
>> Or maybe the prioritization is something else than available space?
>> 
>> The answer remains unanswered, as well as if my suggested approach/script 
>> would work or not?
>> 
>> Summer vacation?
>> 
>> Best,
>> Jesper
>> 
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>> 
>> E-mail: je...@mbg.au.dk
>> Tlf:    +45 50906203
>> 
>> 
>> From: Janne Johansson 
>> Sent: 20 July 2022 19:39
>> To: Jesper Lykkegaard Karlsen 
>> Cc: ceph-users@ceph.io 
>> Subject: Re: [ceph-users] replacing OSD nodes
>> 
>> On Wed, 20 July 2022 at 11:22, Jesper Lykkegaard Karlsen wrote:
>>> Thanks for you answer Janne.
>>> Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once 
>>> they get too close for comfort.
>>> 
>>> But I just though a continuous prioritization of rebalancing PGs, could 
>>> make this process more smooth, with less/no need for handheld operations.
>> 
>> You are absolutely right there, just wanted to chip in with my
>> experiences of "it nags at me but it will still work out" so other
>> people finding these mails later on can feel a bit relieved at knowing
>> that a few toofull warnings aren't a major disaster and that it
>> sometimes happens, because ceph looks for all possible moves, even
>> those who will run late in the rebalancing.
>> 
>> --
>> May the most significant bit of your life be positive.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replacing OSD nodes

2022-07-28 Thread Jesper Lykkegaard Karlsen
Cool thanks a lot! 
I will definitely put it in my toolbox. 

Best, 
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 29 Jul 2022, at 00.35, Josh Baergen  wrote:
> 
>> I know the balancer will reach a well balanced PG landscape eventually, but 
>> I am not sure that it will prioritise backfill after “most available 
>> location” first.
> 
> Correct, I don't believe it prioritizes in this way.
> 
>> Have you tried the pgremapper youself Josh?
> 
> My team wrote and maintains pgremapper and we've used it extensively,
> but I'd always recommend trying it in test environments first. Its
> effect on the system isn't much different than what you're proposing
> (it simply manipulates the upmap exception table).
> 
> Josh

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Potential bug in cephfs-data-scan?

2022-08-19 Thread Jesper Lykkegaard Karlsen
Hi,

I have recently been scanning the files in a PG with "cephfs-data-scan pg_files 
...".

After a long time the scan was still running and the list of files had 
consumed 44 GB, so I stopped it, as something obviously was very wrong.

It turns out some users had symlinks that looped, and one user even had a 
symlink to "/".

It does not make sense that cephfs-data-scan follows symlinks, as this will 
give a wrong picture of what files are in the target PG.
I have looked through Ceph's bug reports, but I do not see anyone mentioning this.

Although I am still on the recently deprecated Octopus, I suspect that this bug 
is also present in Pacific and Quincy?

It might be related to this bug?

https://tracker.ceph.com/issues/46166

But symptoms are different.

Or, maybe there is a way to disable the following of symlinks in 
"cephfs-data-scan pg_files ..."?

Best,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Potential bug in cephfs-data-scan?

2022-08-19 Thread Jesper Lykkegaard Karlsen



Fra: Patrick Donnelly 
Sendt: 19. august 2022 16:16
Til: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Emne: Re: [ceph-users] Potential bug in cephfs-data-scan?

On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen
 wrote:
>>
> >Hi,
>>
>> I have recently been scanning the files in a PG with "cephfs-data-scan 
>> pg_files ...".

>Why?

I had an incident where a PG went down+incomplete after some OSDs crashed + 
heavy load + ongoing snap trimming.
Got it back up again with the objectstore tool by marking it complete.
Then I wanted to list the possibly affected files in the unfortunate PG with 
cephfs-data-scan, so I could recover potential losses from backup.


>> Although, after a long time the scan was still running and the list of files 
>> consumed 44 GB, I stopped it, as something obviously was very wrong.
>>
>> It turns out some users had symlinks that looped and even a user had a 
>> symlink to "/".

>Symlinks are not stored in the data pool. This should be irrelevant.

Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan 
pg_files" trying to follow any global or local symlink in the file structure, 
which leads to many more files being registered than could possibly be in that 
PG, and even endless loops in some cases.

If symlinks are not stored in the data pool, how can cephfs-data-scan then 
follow the link?
And how do I get "cephfs-data-scan" to just show the symlinks as links and not 
follow them up or down in the directory structure?

Best,
Jesper


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Potential bug in cephfs-data-scan?

2022-08-19 Thread Jesper Lykkegaard Karlsen
Actually, it might have worked better if the PG had stayed down while running 
cephfs-data-scan, as it would then only get the file structure from the 
metadata pool and not touch each file/link in the data pool?
This would at least properly have given the list of files in (only) the 
affected PG?

//Jesper


Fra: Jesper Lykkegaard Karlsen 
Sendt: 19. august 2022 22:49
Til: Patrick Donnelly 
Cc: ceph-users@ceph.io 
Emne: [ceph-users] Re: Potential bug in cephfs-data-scan?



Fra: Patrick Donnelly 
Sendt: 19. august 2022 16:16
Til: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Emne: Re: [ceph-users] Potential bug in cephfs-data-scan?

On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen
 wrote:
>>
> >Hi,
>>
>> I have recently been scanning the files in a PG with "cephfs-data-scan 
>> pg_files ...".

>Why?

I had an incident where a PG that went down+incomplete after some OSD crashed + 
heavy load + ongoing snap trimming.
Got it back up again with object store tool by marking complete.
Then I wanted to show possible affected files with cephfs-data-scan in the 
unfortunate PG, so I could recover potential loss from backup.


>> Although, after a long time the scan was still running and the list of files 
>> consumed 44 GB, I stopped it, as something obviously was very wrong.
>>
>> It turns out some users had symlinks that looped and even a user had a 
>> symlink to "/".

>Symlinks are not stored in the data pool. This should be irrelevant.

Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan 
pg_files" trying to follow any global or local symlink in the file structure, 
which leads to many more files registrered than possibly could be in that PG 
and even endless loops in some cases.

If the symlinks are not stored in data pool, how can cephfs-data-scan then 
follow the link?
And how do I get "cephfs-data-scan" to just show the symlinks as links and not 
follow them up or down in directory structure?

Best,
Jesper


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Remove corrupt PG

2022-08-31 Thread Jesper Lykkegaard Karlsen
Hi all, 

I wanted to move a PG to an empty OSD, so I could do repairs on it without the 
whole OSD, which is full of other PGs, being affected by extensive downtime. 

Thus, I exported the PG with ceph-objectstore-tool, and after a successful 
export I removed it. Unfortunately, the remove command was interrupted midway. 
This resulted in a PG that could not be removed with "ceph-objectstore-tool 
--op remove ...", since the header is gone. 
Worse, the OSD does not boot, because it can see objects from the removed PG 
but cannot access them. 

I have tried to remove the individual objects in that PG (also with 
objectstore-tool), but this process is extremely slow. 
When looping over the >65,000 objects, each remove takes ~10 sec and is very 
compute intensive, which adds up to approximately 7.5 days. 

Is there a faster way to get around this? 

Mvh. Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remove corrupt PG

2022-09-01 Thread Jesper Lykkegaard Karlsen
To answer my own question. 

The removal of the corrupt PG could be fixed by using the ceph-objectstore-tool 
fuse mount. 
Then, from the mount point, delete everything in the PG's head directory. 

This took only a few seconds (compared to 7.5 days), and after unmounting and 
restarting the OSD it came back online. 
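
Roughly, the steps looked like this (a sketch only; the OSD id, mount point and 
PG id are placeholders, not the actual values used): 

  systemctl stop ceph-osd@<id>
  # mount the OSD's object store via fuse (runs in the foreground,
  # so use a second terminal or background it)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --op fuse --mountpoint /mnt/osd-<id>
  # delete the remaining contents of the PG's head directory
  rm -rf /mnt/osd-<id>/<pgid>_head/*
  fusermount -u /mnt/osd-<id>
  systemctl start ceph-osd@<id>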

Best, 
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen  wrote:
> 
> Hi all, 
> 
> I wanted to move a PG to an empty OSD, so I could do repairs on it without 
> the whole OSD, which is full of other PG’s, would be effected with extensive 
> downtime. 
> 
> Thus, I exported the PG with ceph-objectstore-tool, an after successful 
> export I removed it. Unfortunately, the remove command was interrupted 
> midway. 
> This resulted in a PG that could not be remove with “ceph-objectstore-tool 
> —op remove ….”, since the header is gone. 
> Worse is that the OSD does not boot, due to it can see objects from the 
> removed PG, but cannot access them. 
> 
> I have tried to remove the individual objects in that PG (also with 
> objectstore-tool), but this process is extremely slow. 
> When looping over the >65,000 object, each remove takes ~10 sec and is very 
> compute intensive, which is approximately 7.5 days. 
> 
> Is the a faster way to get around this? 
> 
> Mvh. Jesper
> 
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
> 
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remove corrupt PG

2022-09-01 Thread Jesper Lykkegaard Karlsen
Well, not the total solution after all.
There is still some metadata and header structure left that I cannot delete 
with ceph-objectstore-tool --op remove. 
It makes a core dump. 

I think I need to declare the OSD lost anyway to get through this. 
Unless somebody has a better suggestion?

Best, 
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 1 Sep 2022, at 22.01, Jesper Lykkegaard Karlsen  wrote:
> 
> To answer my own question. 
> 
> The removal of the  corrupt PG, could be fixed by doing ceph-objectstore-tool 
> fuse mount-thingy. 
> Then from the mount point, delete everything in the PGs head directory. 
> 
> This took only a few seconds (compared to 7.5 days) and after unmount and 
> restart of the OSD it came back online. 
> 
> Best, 
> Jesper
> 
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
> 
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
> 
>> On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen  wrote:
>> 
>> Hi all, 
>> 
>> I wanted to move a PG to an empty OSD, so I could do repairs on it without 
>> the whole OSD, which is full of other PG’s, would be effected with extensive 
>> downtime. 
>> 
>> Thus, I exported the PG with ceph-objectstore-tool, an after successful 
>> export I removed it. Unfortunately, the remove command was interrupted 
>> midway. 
>> This resulted in a PG that could not be remove with “ceph-objectstore-tool 
>> —op remove ….”, since the header is gone. 
>> Worse is that the OSD does not boot, due to it can see objects from the 
>> removed PG, but cannot access them. 
>> 
>> I have tried to remove the individual objects in that PG (also with 
>> objectstore-tool), but this process is extremely slow. 
>> When looping over the >65,000 object, each remove takes ~10 sec and is very 
>> compute intensive, which is approximately 7.5 days. 
>> 
>> Is the a faster way to get around this? 
>> 
>> Mvh. Jesper
>> 
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>> 
>> E-mail: je...@mbg.au.dk
>> Tlf:+45 50906203
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] force-create-pg not working

2022-09-19 Thread Jesper Lykkegaard Karlsen
Dear all,

System: latest Octopus, 8+3 erasure Cephfs

I have a PG that has been driving me crazy.
It had gotten into a bad state after heavy backfilling, combined with OSDs 
going down in turn.

State is:

active+recovery_unfound+undersized+degraded+remapped

I have tried repairing it with ceph-objectstore-tool, but no luck so far.
Given the time recovery takes this way and since data are under backup, I 
thought that I would do the "easy" approach instead and:

  *   scan pg_files with cephfs-data-scan
  *   delete data belonging to that pool
  *   recreate PG with "ceph osd force-create-pg"
  *   restore data

However, this has turned out not to be so easy after all.

ceph osd force-create-pg 20.13f --yes-i-really-mean-it

seems to be accepted well enough with "pg 20.13f now creating, ok", but then 
nothing happens.
Issuing the command again just gives a "pg 20.13f already creating" response.

If I restart the primary OSD, then the pending force-create-pg disappears.

I read that this could be due to a CRUSH map issue, but I have checked and that 
does not seem to be the case.

Would it, for instance, be possible to do the force-create-pg manually with 
something like this (see the command sketch after the list)?:

  *   set nobackfill and norecovery
  *   delete the PG's shards one by one
  *   unset nobackfill and norecovery
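
In commands, a rough sketch of what I have in mind (OSD ids and data paths are 
placeholders; the PG id is the one from above):

  ceph osd set nobackfill
  ceph osd set norecover
  # on each OSD host holding a shard of 20.13f, with that OSD stopped:
  systemctl stop ceph-osd@<id>
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid 20.13f --op remove --force
  systemctl start ceph-osd@<id>
  # once all shards are gone:
  ceph osd unset nobackfill
  ceph osd unset norecover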


Any idea on how to proceed from here is most welcome.

Thanks,
Jesper


--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: force-create-pg not working

2022-09-20 Thread Jesper Lykkegaard Karlsen
Hi Josh, 

Thanks for your reply. 
But I already tried that, with no luck. 
The primary OSD goes down and hangs forever upon the "mark_unfound_lost delete" 
command. 

I guess it is too damaged to salvage, unless one really starts deleting 
individual corrupt objects?

Anyway, as I said, the files in the PG are identified and under backup, so I 
just want it healthy, no matter what ;-)

I actually discovered that removing the PG's shards with objectstore-tool 
indeed works for getting the PG back to active+clean (containing 0 objects, 
though). 

One just needs to run a final remove - start/stop OSD - repair - mark-complete 
on the primary OSD. 
A scrub tells me that the "active+clean" state is for real.
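
In commands, that sequence could look roughly like this (a sketch; the OSD id 
and data path are placeholders, and the exact order of OSD restarts may have 
differed):

  # with the OSD stopped (restart it between steps as needed):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid 20.13f --op remove --force
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op repair
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid 20.13f --op mark-complete
  # bring the OSD back up and verify:
  systemctl start ceph-osd@<id>
  ceph pg scrub 20.13f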

I also found out that the more automated "force-create-pg" command only works 
on PGs that are in a down state. 

Best, 
Jesper  
 

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

> On 20 Sep 2022, at 15.40, Josh Baergen  wrote:
> 
> Hi Jesper,
> 
> Given that the PG is marked recovery_unfound, I think you need to
> follow 
> https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
> 
> Josh
> 
> On Tue, Sep 20, 2022 at 12:56 AM Jesper Lykkegaard Karlsen
>  wrote:
>> 
>> Dear all,
>> 
>> System: latest Octopus, 8+3 erasure Cephfs
>> 
>> I have a PG that has been driving me crazy.
>> It had gotten to a bad state after heavy backfilling, combined with OSD 
>> going down in turn.
>> 
>> State is:
>> 
>> active+recovery_unfound+undersized+degraded+remapped
>> 
>> I have tried repairing it with ceph-objectstore-tool, but no luck so far.
>> Given the time recovery takes this way and since data are under backup, I 
>> thought that I would do the "easy" approach instead and:
>> 
>>  *   scan pg_files with cephfs-data-scan
>>  *   delete data beloging to that pool
>>  *   recreate PG with "ceph osd force-create-pg"
>>  *   restore data
>> 
>> Although, this has shown not to be so easy after all.
>> 
>> ceph osd force-create-pg 20.13f --yes-i-really-mean-it
>> 
>> seems to be accepted well enough with "pg 20.13f now creating, ok", but then 
>> nothing happens.
>> Issuing the command again just gives a "pg 20.13f already creating" response.
>> 
>> If I restart the primary OSD, then the pending force-create-pg disappears.
>> 
>> I read that this could be due to crush map issue, but I have checked and 
>> that does not seem to be the case.
>> 
>> Would it, for instance, be possible to do the force-create-pg manually with 
>> something like this?:
>> 
>>  *   set nobackfill and norecovery
>>  *   delete the pgs shards one by one
>>  *   unset nobackfill and norecovery
>> 
>> 
>> Any idea on how to proceed from here is most welcome.
>> 
>> Thanks,
>> Jesper
>> 
>> 
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>> 
>> E-mail: je...@mbg.au.dk
>> Tlf:+45 50906203
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Hi all,

Cephfs quotas work really well for me.
A cool feature is that if one mounts a folder which has quotas enabled, the 
mountpoint will show up as a partition of the quota size, including how much is 
used (e.g. with the df command). Nice!
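
For illustration, something along these lines (the paths, monitor address and 
quota size are just examples):

  # set a 1 TB quota on a directory
  setfattr -n ceph.quota.max_bytes -v 1000000000000 /mnt/cephfs/projects/foo
  # mount that directory directly on a client
  mount -t ceph mon1:6789:/projects/foo /mnt/foo \
      -o name=myuser,secretfile=/etc/ceph/myuser.secret
  # df now reports a ~1 TB "partition" and how much of it is used
  df -h /mnt/foo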

Now, I want to access the usage information of folders with quotas from the 
root level of the cephfs.
I have failed to find this information through getfattr commands (only the 
quota limits are shown there), and running du on individual folders is a 
suboptimal solution.
The usage information must be somewhere in the ceph metadata/mon db, but where 
and how do I read it?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Thanks everybody,

That was a quick answer.

getfattr -n ceph.dir.rbytes $DIR

was the answer that worked for me. So getfattr was the solution after all.

Is there some way I can display all attributes, without knowing them 
beforehand?

I have tried:

getfattr -d -m 'ceph.*' $DIR

which gives me no output. Should that not list all attributes?

This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64

Best,
Jesper
------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Sebastian Knust 
Sendt: 16. december 2021 13:01
Til: Jesper Lykkegaard Karlsen ; ceph-users@ceph.io 

Emne: Re: [ceph-users] cephfs quota used

Hi Jasper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:
> Now, I want to access the usage information of folders with quotas from root 
> level of the cephfs.
> I have failed to find this information through getfattr commands, only quota 
> limits are shown here, and du-command on individual folders is a suboptimal 
> solution.

`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a
given path.
`getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you
would usually get with du for conventional file systems.

As an example, I am using this script for weekly utilisation reports:
> for i in /ceph-path-to-home-dirs/*; do
> if [ -d "$i" ]; then
> SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
> QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 
> 2>/dev/null || echo 0)
> PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null)
> if [ -z "$PERC" ]; then PERC="--"; fi
> printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt 
> --to=iec $QUOTA` $PERC
> fi
> done


Note that you can also mount CephFS with the "rbytes" mount option. IIRC
the fuse clients defaults to it, for the kernel client you have to
specify it in the mount command or fstab entry.

The rbytes option returns the recursive path size (so the
ceph.dir.rbytes fattr) in stat calls to directories, so you will see it
with ls immediately. I really like it!

Just beware that some software might have issues with this behaviour -
alpine is the only example (bug report and patch proposal have been
submitted) that I know of.

Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Just tested:

getfattr -n ceph.dir.rbytes $DIR

Works on CentOS 7, but not on Ubuntu 18.04 either.
Weird?

Best,
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Robert Gallop 
Sendt: 16. december 2021 13:42
Til: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Emne: Re: [ceph-users] Re: cephfs quota used

From what I understand you used to be able to do that but cannot on later 
kernels?

Seems there would be a list somewhere, but I can’t find it, maybe it’s changing 
too often depending on the kernel your using or something.

But yeah, these attrs are one of the major reasons we are moving from 
traditional appliance NAS to ceph, the many other benefits come with it.

On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen 
mailto:je...@mbg.au.dk>> wrote:
Thanks everybody,

That was a quick answer.

getfattr -n ceph.dir.rbytes $DIR

Was the answer that worked for me. So getfattr was the solution after all.

Is there some way I can display all attributes, without knowing them in 
forehand?

I have tried:

getfattr -d -m 'ceph.*' $DIR

which gives me no output. Should that not list all atributes?

This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64

Best,
Jesper
------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk>
Tlf:+45 50906203


Fra: Sebastian Knust 
mailto:skn...@physik.uni-bielefeld.de>>
Sendt: 16. december 2021 13:01
Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; 
ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Emne: Re: [ceph-users] cephfs quota used

Hi Jasper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:
> Now, I want to access the usage information of folders with quotas from root 
> level of the cephfs.
> I have failed to find this information through getfattr commands, only quota 
> limits are shown here, and du-command on individual folders is a suboptimal 
> solution.

`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a
given path.
`getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you
would usually get with du for conventional file systems.

As an example, I am using this script for weekly utilisation reports:
> for i in /ceph-path-to-home-dirs/*; do
> if [ -d "$i" ]; then
> SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
> QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 
> 2>/dev/null || echo 0)
> PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null)
> if [ -z "$PERC" ]; then PERC="--"; fi
> printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt 
> --to=iec $QUOTA` $PERC
> fi
> done


Note that you can also mount CephFS with the "rbytes" mount option. IIRC
the fuse clients defaults to it, for the kernel client you have to
specify it in the mount command or fstab entry.

The rbytes option returns the recursive path size (so the
ceph.dir.rbytes fattr) in stat calls to directories, so you will see it
with ls immediately. I really like it!

Just beware that some software might have issues with this behaviour -
alpine is the only example (bug report and patch proposal have been
submitted) that I know of.

Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Whoops, wrong copy/paste:

getfattr -n ceph.dir.rbytes $DIR

works on all distributions I have tested.

It is:

getfattr -d -m 'ceph.*' $DIR

that does not work on Rocky Linux 8 or Ubuntu 18.04, but works on CentOS 7.

Best,
Jesper
------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Jesper Lykkegaard Karlsen 
Sendt: 16. december 2021 13:57
Til: Robert Gallop 
Cc: ceph-users@ceph.io 
Emne: [ceph-users] Re: cephfs quota used

Just tested:

getfattr -n ceph.dir.rbytes $DIR

Works on CentOS 7, but not on Ubuntu 18.04 eighter.
Weird?

Best,
Jesper
------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Robert Gallop 
Sendt: 16. december 2021 13:42
Til: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Emne: Re: [ceph-users] Re: cephfs quota used

From what I understand you used to be able to do that but cannot on later 
kernels?

Seems there would be a list somewhere, but I can’t find it, maybe it’s changing 
too often depending on the kernel your using or something.

But yeah, these attrs are one of the major reasons we are moving from 
traditional appliance NAS to ceph, the many other benefits come with it.

On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen 
mailto:je...@mbg.au.dk>> wrote:
Thanks everybody,

That was a quick answer.

getfattr -n ceph.dir.rbytes $DIR

Was the answer that worked for me. So getfattr was the solution after all.

Is there some way I can display all attributes, without knowing them in 
forehand?

I have tried:

getfattr -d -m 'ceph.*' $DIR

which gives me no output. Should that not list all atributes?

This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64

Best,
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk>
Tlf:+45 50906203


Fra: Sebastian Knust 
mailto:skn...@physik.uni-bielefeld.de>>
Sendt: 16. december 2021 13:01
Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; 
ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Emne: Re: [ceph-users] cephfs quota used

Hi Jasper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:
> Now, I want to access the usage information of folders with quotas from root 
> level of the cephfs.
> I have failed to find this information through getfattr commands, only quota 
> limits are shown here, and du-command on individual folders is a suboptimal 
> solution.

`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a
given path.
`getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you
would usually get with du for conventional file systems.

As an example, I am using this script for weekly utilisation reports:
> for i in /ceph-path-to-home-dirs/*; do
> if [ -d "$i" ]; then
> SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
> QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 
> 2>/dev/null || echo 0)
> PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null)
> if [ -z "$PERC" ]; then PERC="--"; fi
> printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt 
> --to=iec $QUOTA` $PERC
> fi
> done


Note that you can also mount CephFS with the "rbytes" mount option. IIRC
the fuse clients defaults to it, for the kernel client you have to
specify it in the mount command or fstab entry.

The rbytes option returns the recursive path size (so the
ceph.dir.rbytes fattr) in stat calls to directories, so you will see it
with ls immediately. I really like it!

Just beware that some software might have issues with this behaviour -
alpine is the only example (bug report and patch proposal have been
submitted) that I know of.

Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
To answer my own question.
It seems Frank Schilder asked a similar question two years ago:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/

listxattr() was apparently removed and not much has happened since then, it seems.

Anyway, I just made my own ceph-fs version of "du".

ceph_du_dir:

#!/bin/bash
# usage: ceph_du_dir $DIR
SIZE=$(getfattr -n ceph.dir.rbytes "$1" 2>/dev/null | grep "ceph\.dir\.rbytes" |
awk -F\= '{print $2}' | sed s/\"//g)
numfmt --to=iec-i --suffix=B --padding=7 $SIZE

Prints out the ceph-fs dir size in "human-readable" form.
It works like a charm and my god it is fast!

Tools like that could be very useful, if provided by the development team 🙂

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Jesper Lykkegaard Karlsen 
Sendt: 16. december 2021 14:37
Til: Robert Gallop 
Cc: ceph-users@ceph.io 
Emne: [ceph-users] Re: cephfs quota used

Woops, wrong copy/pasta:

getfattr -n ceph.dir.rbytes $DIR

works on all distributions I have tested.

It is:

getfattr -d -m 'ceph.*' $DIR

that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS 7.

Best,
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Jesper Lykkegaard Karlsen 
Sendt: 16. december 2021 13:57
Til: Robert Gallop 
Cc: ceph-users@ceph.io 
Emne: [ceph-users] Re: cephfs quota used

Just tested:

getfattr -n ceph.dir.rbytes $DIR

Works on CentOS 7, but not on Ubuntu 18.04 eighter.
Weird?

Best,
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

____
Fra: Robert Gallop 
Sendt: 16. december 2021 13:42
Til: Jesper Lykkegaard Karlsen 
Cc: ceph-users@ceph.io 
Emne: Re: [ceph-users] Re: cephfs quota used

From what I understand you used to be able to do that but cannot on later 
kernels?

Seems there would be a list somewhere, but I can’t find it, maybe it’s changing 
too often depending on the kernel your using or something.

But yeah, these attrs are one of the major reasons we are moving from 
traditional appliance NAS to ceph, the many other benefits come with it.

On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen 
mailto:je...@mbg.au.dk>> wrote:
Thanks everybody,

That was a quick answer.

getfattr -n ceph.dir.rbytes $DIR

Was the answer that worked for me. So getfattr was the solution after all.

Is there some way I can display all attributes, without knowing them in 
forehand?

I have tried:

getfattr -d -m 'ceph.*' $DIR

which gives me no output. Should that not list all atributes?

This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64

Best,
Jesper
--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk>
Tlf:+45 50906203


Fra: Sebastian Knust 
mailto:skn...@physik.uni-bielefeld.de>>
Sendt: 16. december 2021 13:01
Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; 
ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Emne: Re: [ceph-users] cephfs quota used

Hi Jasper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:
> Now, I want to access the usage information of folders with quotas from root 
> level of the cephfs.
> I have failed to find this information through getfattr commands, only quota 
> limits are shown here, and du-command on individual folders is a suboptimal 
> solution.

`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a
given path.
`getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you
would usually get with du for conventional file systems.

As an example, I am using this script for weekly utilisation reports:
> for i in /ceph-path-to-home-dirs/*; do
> if [ -d "$i" ]; then
> SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
> QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 
> 2>/dev/null || echo 0)
> PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null)
> if [ -z "$PERC" ]; then PERC="--"; fi
>   

[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Brilliant, thanks Jean-François

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Jean-Francois GUILLAUME 
Sendt: 16. december 2021 23:03
Til: Jesper Lykkegaard Karlsen 
Cc: Robert Gallop ; ceph-users@ceph.io 

Emne: Re: [ceph-users] Re: cephfs quota used

Hi,

You can avoid using awk by passing --only-values to getfattr.

This should look something like this :

> #!/bin/bash
> numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n
> ceph.dir.rbytes $1 2>/dev/null)

Best,
---
Cordialement,
Jean-François GUILLAUME
Plateforme Bioinformatique BiRD

Tél. : +33 (0)2 28 08 00 57
www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr>

Inserm UMR 1087/CNRS UMR 6291
IRS-UN - 8 quai Moncousu - BP 70721
44007 Nantes Cedex 1

Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit :
> To answer my own question.
> It seems Frank Schilder asked a similar question two years ago:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/
>
> listxattr() was aparrently removed and not much have happen since then
> it seems.
>
> Anyway, I just made my own ceph-fs version of "du".
>
> ceph_du_dir:
>
> #!/bin/bash
> # usage: ceph_du_dir $DIR
> SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep
> "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g)
> numfmt --to=iec-i --suffix=B --padding=7 $SIZE
>
> Prints out ceph-fs dir size in "human-readble"
> It works like a charm and my god it is fast!.
>
> Tools like that could be very useful, if provided by the development
> team 🙂
>
> Best,
> Jesper
>
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> 
> Fra: Jesper Lykkegaard Karlsen 
> Sendt: 16. december 2021 14:37
> Til: Robert Gallop 
> Cc: ceph-users@ceph.io 
> Emne: [ceph-users] Re: cephfs quota used
>
> Woops, wrong copy/pasta:
>
> getfattr -n ceph.dir.rbytes $DIR
>
> works on all distributions I have tested.
>
> It is:
>
> getfattr -d -m 'ceph.*' $DIR
>
> that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS
> 7.
>
> Best,
> Jesper
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> ____
> Fra: Jesper Lykkegaard Karlsen 
> Sendt: 16. december 2021 13:57
> Til: Robert Gallop 
> Cc: ceph-users@ceph.io 
> Emne: [ceph-users] Re: cephfs quota used
>
> Just tested:
>
> getfattr -n ceph.dir.rbytes $DIR
>
> Works on CentOS 7, but not on Ubuntu 18.04 eighter.
> Weird?
>
> Best,
> Jesper
> ------
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> 
> Fra: Robert Gallop 
> Sendt: 16. december 2021 13:42
> Til: Jesper Lykkegaard Karlsen 
> Cc: ceph-users@ceph.io 
> Emne: Re: [ceph-users] Re: cephfs quota used
>
> From what I understand you used to be able to do that but cannot on
> later kernels?
>
> Seems there would be a list somewhere, but I can’t find it, maybe
> it’s changing too often depending on the kernel your using or
> something.
>
> But yeah, these attrs are one of the major reasons we are moving from
> traditional appliance NAS to ceph, the many other benefits come with
> it.
>
> On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen
> mailto:je...@mbg.au.dk>> wrote:
> Thanks everybody,
>
> That was a quick answer.
>
> getfattr -n ceph.dir.rbytes $DIR
>
> Was the answer that worked for me. So getfattr was the solution after
> all.
>
> Is there some way I can display all attributes, without knowing them
> in forehand?
>
> I have tried:
>
> getfattr -d -m 'ceph.*' $DIR
>
> which gives me no output. Should that not list all atributes?
>
> This 

[ceph-users] Re: cephfs quota used

2021-12-16 Thread Jesper Lykkegaard Karlsen
Not to spam, but to make the output prettier, one can also separate the number 
from the byte-size prefix.

numfmt --to=iec --suffix=B --padding=7 $(getfattr --only-values -n 
ceph.dir.rbytes "$1" 2>/dev/null) | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; 
s/([a-zA-Z])([0-9])/\1 \2/g'

//Jesper
------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

____
Fra: Jesper Lykkegaard Karlsen 
Sendt: 16. december 2021 23:07
Til: Jean-Francois GUILLAUME 
Cc: Robert Gallop ; ceph-users@ceph.io 

Emne: [ceph-users] Re: cephfs quota used

Brilliant, thanks Jean-François

Best,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Jean-Francois GUILLAUME 
Sendt: 16. december 2021 23:03
Til: Jesper Lykkegaard Karlsen 
Cc: Robert Gallop ; ceph-users@ceph.io 

Emne: Re: [ceph-users] Re: cephfs quota used

Hi,

You can avoid using awk by passing --only-values to getfattr.

This should look something like this :

> #!/bin/bash
> numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n
> ceph.dir.rbytes $1 2>/dev/null)

Best,
---
Cordialement,
Jean-François GUILLAUME
Plateforme Bioinformatique BiRD

Tél. : +33 (0)2 28 08 00 57
www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr><http://www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr>>

Inserm UMR 1087/CNRS UMR 6291
IRS-UN - 8 quai Moncousu - BP 70721
44007 Nantes Cedex 1

Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit :
> To answer my own question.
> It seems Frank Schilder asked a similar question two years ago:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/
>
> listxattr() was aparrently removed and not much have happen since then
> it seems.
>
> Anyway, I just made my own ceph-fs version of "du".
>
> ceph_du_dir:
>
> #!/bin/bash
> # usage: ceph_du_dir $DIR
> SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep
> "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g)
> numfmt --to=iec-i --suffix=B --padding=7 $SIZE
>
> Prints out ceph-fs dir size in "human-readble"
> It works like a charm and my god it is fast!.
>
> Tools like that could be very useful, if provided by the development
> team 🙂
>
> Best,
> Jesper
>
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> 
> Fra: Jesper Lykkegaard Karlsen 
> Sendt: 16. december 2021 14:37
> Til: Robert Gallop 
> Cc: ceph-users@ceph.io 
> Emne: [ceph-users] Re: cephfs quota used
>
> Woops, wrong copy/pasta:
>
> getfattr -n ceph.dir.rbytes $DIR
>
> works on all distributions I have tested.
>
> It is:
>
> getfattr -d -m 'ceph.*' $DIR
>
> that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS
> 7.
>
> Best,
> Jesper
> ------
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> ____
> Fra: Jesper Lykkegaard Karlsen 
> Sendt: 16. december 2021 13:57
> Til: Robert Gallop 
> Cc: ceph-users@ceph.io 
> Emne: [ceph-users] Re: cephfs quota used
>
> Just tested:
>
> getfattr -n ceph.dir.rbytes $DIR
>
> Works on CentOS 7, but not on Ubuntu 18.04 eighter.
> Weird?
>
> Best,
> Jesper
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> 
> Fra: Robert Gallop 
> Sendt: 16. december 2021 13:42
> Til: Jesper Lykkegaard Karlsen 
> Cc: ceph-users@ceph.io 
> Emne: Re: [ceph-users] Re: cephfs quota used
>
> From what I understand you used to be able to do that but cannot on
> later kernels?
>
> Seems there would be a list somewhere,

[ceph-users] Re: cephfs quota used

2021-12-17 Thread Jesper Lykkegaard Karlsen
Thanks Konstantin,

Actually, I went a bit further and made the script more universal in usage:

ceph_du_dir:
#!/bin/bash
# usage: ceph_du_dir $DIR1 ($DIR2 ...)
for i in "$@"; do
if [[ -d $i && ! -L $i ]]; then
echo "$(numfmt --to=iec --suffix=B --padding=7 $(getfattr --only-values -n 
ceph.dir.rbytes "$i" 2>/dev/null) | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; 
s/([a-zA-Z])([0-9])/\1 \2/g') $i"
fi
done

The above can be run as:

ceph_du_dir $DIR

with multiple directories:

ceph_du_dir $DIR1 $DIR2 $DIR3 ..

Or even with wildcard:

ceph_du_dir $DIR/*

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203


Fra: Konstantin Shalygin 
Sendt: 17. december 2021 09:17
Til: Jesper Lykkegaard Karlsen 
Cc: Robert Gallop ; ceph-users@ceph.io 

Emne: Re: [ceph-users] cephfs quota used

Or you can mount with 'dirstat' option and use 'cat .' for determine CephFS 
stats:

alias fsdf="cat . | grep rbytes | awk '{print \$2}' | numfmt --to=iec 
--suffix=B"

[root@host catalog]# fsdf
245GB
[root@host catalog]#


Cheers,
k

On 17 Dec 2021, at 00:25, Jesper Lykkegaard Karlsen 
mailto:je...@mbg.au.dk>> wrote:

Anyway, I just made my own ceph-fs version of "du".

ceph_du_dir:

#!/bin/bash
# usage: ceph_du_dir $DIR
SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep "ceph\.dir\.rbytes" | 
awk -F\= '{print $2}' | sed s/\"//g)
numfmt --to=iec-i --suffix=B --padding=7 $SIZE

Prints out ceph-fs dir size in "human-readble"
It works like a charm and my god it is fast!.

Tools like that could be very useful, if provided by the development team 🙂

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Healthy objects trapped in incomplete pgs

2020-04-23 Thread Jesper Lykkegaard Karlsen
Dear Cephers,


A few days ago disaster struck the Ceph cluster (erasure-coded) I am 
administrating, as the UPS power was pulled from the cluster, causing a power 
outage.


After rebooting the system, 6 OSDs were lost (spread over 5 OSD nodes) as they 
could not mount anymore, and several others were damaged. This was more than 
the host-failure domain was set up to handle; auto-recovery failed and OSDs 
started going down in a cascading manner.


When the dust settled, there were 8 pgs (of 2048) inactive and a bunch of osds 
down. I managed to recover 5 pgs, mainly by ceph-objectstore-tool 
export/import/repair commands, but now I am left with 3 pgs that are inactive 
and incomplete.


One of the PGs seems unsalvageable, as I cannot get it to become active at all 
(repair/import/export/lowering min_size), but the two others I can get active 
if I export/import one of the PG shards and restart the OSD.


Rebuilding then starts but after a while one of the osds holding the pgs goes 
down, with a "FAILED ceph_assert(clone_size.count(clone))" message in the log.

If I set the OSDs to noout/nodown, then I can see that it is only rather few 
objects, e.g. 161 of a PG of >10, that are failing to be remapped.


Since most of the objects in the two PGs seem intact, it would be sad to delete 
the whole PG (force-create-pg) and lose all that data.


Is there a way to show and delete the failing objects?


I have thought of a recovery plan and want to share it with you, so you can 
comment on whether it sounds doable or not (a rough command sketch follows the 
list below).


  *   Stop OSDs from recovering:   ceph osd set norecover
  *   bring back PGs active:       ceph-objectstore-tool export/import and 
restart the OSD
  *   find files in PGs:           cephfs-data-scan pg_files  
  *   pull out as many as possible of those files to another location.
  *   recreate PGs:                ceph osd force-create-pg 
  *   restart recovery:            ceph osd unset norecover
  *   copy back in the recovered files
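
In commands, a rough sketch of the plan (the PG ids, paths and OSD ids are 
placeholders):

  ceph osd set norecover
  # per affected PG: export a surviving shard and import it where needed
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid <pgid> --op export --file /tmp/<pgid>.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<other-id> \
      --op import --file /tmp/<pgid>.export
  # list the files that live in the affected PGs, then copy them elsewhere
  cephfs-data-scan pg_files <path> <pgid> [<pgid> ...]
  # recreate the PGs and let recovery continue
  ceph osd force-create-pg <pgid>
  ceph osd unset norecover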


Would that work or do you have a better suggestion?


Cheers,

Jesper


------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm stacktrace on copying ceph.conf

2024-03-26 Thread Jesper Agerbo Krogh [JSKR]
y", line 343, in _make_cd_request 
self._fs.basename(path)) File "/lib/python3.6/site-packages/asyncssh/scp.py", 
line 224, in make_request raise exc asyncssh.sftp.SFTPFailure: scp: 
/tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: 
Permission denied
3/26/24 9:38:09 PM[INF]Updating 
dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf

It seems to be related to the permissions that the manager writes the files 
with and the process copying them around. 

$ sudo ceph -v
[sudo] password for adminjskr:
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)


Best regards,

Jesper Agerbo Krogh

Director Digitalization
Digitalization

 

Topsoe A/S
Haldor Topsøes Allé 1 
2800 Kgs. Lyngby
Denmark
Phone (direct): 27773240

   

Read more at topsoe.com

 

Topsoe A/S and/or its affiliates. This e-mail message (including attachments, 
if any) is confidential and may be privileged. It is intended only for the 
addressee.
Any unauthorised distribution or disclosure is prohibited. Disclosure to anyone 
other than the intended recipient does not constitute waiver of privilege.
If you have received this email in error, please notify the sender by email and 
delete it and any attachments from your computer system and records.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io