[ceph-users] Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-01 Thread Mason-Williams, Gabryel (RFI,RAL,-)
Hello,

We have been trying to use Ceph-Dokan to mount cephfs on Windows. When 
transferring any data below ~1GB the transfer speed is as quick as desired and 
works perfectly. However, once more than ~1GB has been transferred the 
connection stops being able to send data and everything seems to just hang.

I've ruled out it being a quota problem, as I can transfer just under 1GB, close 
the connection, reopen it, and then transfer just under 1GB again with no 
issues.

Windows Version: 10
Dokan Version: 1.3.1.1000

Does anyone have any idea why this is occurring and have any suggestions on how 
to fix it?

Kind regards

Gabryel Mason-Williams

Junior Research Software Engineer

Please bear in mind that I work part-time, so there may be a delay in my 
response.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-01 Thread Radoslav Milanov
Have you tried this with the native client under Linux? Could it just be 
slow CephFS?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NFS Ganesha Active Active Question

2021-11-01 Thread Daniel Gryniewicz
You can fail over from one running Ganesha to another using something like 
ctdb or pacemaker/corosync.  This is how some other clustered 
filesystems (e.g. Gluster) use Ganesha.  It is not how the Ceph 
community has decided to implement HA with Ganesha, so it will be a more 
manual setup for you, but it can be done.
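
For the pacemaker/corosync route, a very rough sketch (resource names and the 
IP are placeholders, not a tested recipe) could look something like:

    pcs resource create ganesha systemd:nfs-ganesha op monitor interval=10s
    pcs resource create ganesha-vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s
    pcs constraint colocation add ganesha-vip with ganesha INFINITY
    pcs constraint order ganesha then ganesha-vip

Clients mount the floating IP, so when the resources move to another node the 
NFS server address follows them.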


Daniel

On 10/31/21 1:47 PM, Xiaolong Jiang wrote:

Hi Maged

Yeah, it requires the cloud integration to quickly fail over the IP. For 
me, I probably need to have a standby server, and once I detect the 
instance is dead, ask cephadm to schedule Ganesha there and attach the 
IP to the new server.



On Oct 31, 2021, at 10:40 AM, Maged Mokhtar  wrote:



Hi Xiaolong

The grace period is 90 sec. The failover process should be automated 
and should run quicker than this, maybe 15-30 sec (not too quick, to 
avoid false alarms); this will make client I/O resume after a small 
pause.


/Maged

On 31/10/2021 17:37, Xiaolong Jiang wrote:

Hi Maged ,

Thank you for the response. That helps a lot!

Looks like I have to spin up a new server quickly and float the IP to 
the new server. If I spin up the server after about 20 mins, I guess 
I/O will recover after that, but the previous state will be gone since 
it passed the grace period?



On Oct 31, 2021, at 4:51 AM, Maged Mokhtar  wrote:




On 31/10/2021 05:29, Xiaolong Jiang wrote:

Hi Experts.

I am a bit confused about ganesha active-active setup.

We can set up multiple Ganesha servers on top of CephFS, and clients 
can point to different Ganesha servers to serve the traffic; that can 
scale out the traffic.


From the client side, is it using DNS round robin to connect directly 
to a Ganesha server?
Is it possible to front all Ganesha servers with a load balancer, so a 
client only connects to the load balancer IP and writes can be load 
balanced across all Ganesha servers?


My current feeling is that we probably have to use the DNS approach, and a 
specific client's read/write requests can only go to the same Ganesha 
server for the session.


--
Best regards,
Xiaolong Jiang

Senior Software Engineer at Netflix
Columbia University

___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io



Load balancing Ganesha means some clients are served by one gateway and 
other clients by other gateways: we distribute the clients and their 
load across the different gateways, but each client remains on a 
specific gateway. You cannot have a single client load balance across 
several gateways.


A good way to distribute clients across the gateways is via round-robin 
DNS, but you do not have to; you can distribute IPs manually among your 
clients if you want, but DNS automates the process in a scalable way.


One note about high availability: currently you cannot fail over 
clients to another Ganesha gateway in case of failure, but if you 
bring the failed gateway back online quickly enough, the client 
connections will resume. So to support HA in case of a host server 
failure, the Ganesha gateways are implemented as containers, so you 
can start the failed container on a new host server.


/Maged



___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true

2021-11-01 Thread Igor Fedotov

Hi Thilo,

theoretically this is a recoverable case - due to the bug, the new prefix was 
inserted at the beginning of every OMAP record instead of replacing the old 
one. So one just has to remove the old prefix to fix that (the to-be-removed 
prefix starts after the first '.' char and ends with the second one, 
inclusive). E.g. the following key:


p 
%00%00%00%00%00%00%00%03%00%00%00%00%00%00%00%00%00%00%04t.%00%00%00%00%00%00%04t._infover


to be converted to

p %00%00%00%00%00%00%00%03%00%00%00%00%00%00%00%00%00%00%04t._infover

One can use ceph-kvstore-tool's list command against the 'p' prefix to view 
all the omap keys in the DB.
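
For illustration, with the OSD stopped, the per-key workflow is roughly the 
following (the OSD path is an example, and <old_key>/<new_key> stand for the 
actual escaped key names taken from your own listing):

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 list p > keys.txt
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 get p <old_key> out /tmp/val
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 set p <new_key> in /tmp/val
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 rm p <old_key>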



Unfortunately we currently don't have any means to perform such a 
conversion in bulk.  There are single-key retrieval/update 
operations in ceph-kvstore-tool, but this would be terribly inefficient 
for tons of records due to the tool's startup/teardown overhead.


Potentially such a bulk recovery could be added to ceph-kvstore-tool or 
something, but given the release cycle procedure and related timings I doubt 
that's what you'd like to get at the moment. So I can probably make a 
source patch with the fix, but one would need to build it for their 
environment. Not to mention all the risks of using an urgent modification 
which bypasses the QA/review procedure...


Would it work for you?


Thanks,

Igor


On 10/30/2021 11:59 AM, Thilo Molitor wrote:

I have the exact same problem: I upgraded to 16.2.6 and set
bluestore_fsck_quick_fix_on_mount to true; after a rolling restart of my OSDs
only 2 of 5 came back (one of them was only recently added and holds only very
little data, so in essence there is only 1 OSD really running).
All other OSDs crashed with:

./src/osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,

spg_t, epoch_t*)' thread 7f68d999bd00 time 2021-10-30T08:59:11.782259+0200

./src/osd/PG.cc: 1009: FAILED ceph_assert(values.size() == 2)
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific

(stable)

My cluster does not come up anymore and I cannot access my data.
Any advice on how to recover here?

-tmolitor




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluestore zstd compression questions

2021-11-01 Thread Igor Fedotov


On 10/29/2021 1:06 PM, Elias Abacioglu wrote:

I don't have any data yet.
I set up a k8s cluster and set up CephFS, RGW and RBD for k8s, so it's
hard to tell beforehand what we will store or know the compression ratios.
That makes it hard to know how to benchmark, but I guess a mix of
everything from very compressible to non-compressible stuff.

What happens if you turn on/off compression in a pool? Is it possible to
change after the pool is created and data exists?


It's possible, but existing data would stay uncompressed; only new data 
would benefit from the compression. Later switching compression off 
wouldn't impact (decompress) existing data either. The data remains 
accessible in either case, though...
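
For reference, per-pool compression can be toggled at any time with something 
like the following (the pool name is just an example):

    ceph osd pool set mypool compression_algorithm zstd
    ceph osd pool set mypool compression_mode aggressive
    # optionally tune the zstd level (configurable since the PR discussed further down)
    ceph config set osd compressor_zstd_level 1

Again, only data written after the change is affected.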





/Elias

On Thu, Oct 21, 2021 at 4:50 PM Konstantin Shalygin  wrote:


What data do you have to compress? Did you benchmark compression efficiency?


k

Sent from my iPhone


On 21 Oct 2021, at 17:43, Elias Abacioglu  wrote:

Hi

I've been trying to Google about BlueStore compression, but most articles I
find are quite old and are from Ceph versions where the zstd compression level
was hardcoded to 5.
I've been thinking about enabling zstd compression with a
`compressor_zstd_level` of "-1" in most of my pools. Any thoughts?
Is there anyone that has any recent benchmarks of this?

zstd compression with a normal server CPU should be faster than HDD writes,
and decompression should also be faster than any HDD reads.
I don't know about NVMe drives; there, a lot of disks have faster write
speeds than the CPU can compress data, but if the compression ratio is good,
you also have less data to write. And when it comes to reads I guess it would
depend on the NVMe disk and the CPU you've got.

Also, I've been wondering about the pros and cons when it comes to
compression.
I guess that some pros would be:
- Less data to scrub (since less data is stored on the drives)
- Less network traffic (replicas and such; I guess it depends on where the
compression takes place)?
- Less wear and tear on the drives (since less data is written and read)
Also I wonder, where does the compression take place?

A con would be if it is slower, but I guess this might depend on
which CPU, drives and storage controller you use, and also what data you
write/read.

But it would be nice to have a fresh benchmark. This would be especially
interesting since this PR was merged:

https://github.com/ceph/ceph/pull/33790

which changed the default compression level to 1 and allows you to set your
own compression level for zstd.

/Elias
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance degradation with upgrade from Octopus to Pacific

2021-11-01 Thread Igor Fedotov

Hey Dustin,

what Pacific version have you got?


Thanks,

Igor

On 11/1/2021 7:08 PM, Dustin Lagoy wrote:

Hi everyone,

This is my first time posting here, so it's nice to meet you all!

I have a Ceph cluster that was recently upgraded from Octopus to Pacific and 
now the write performance is noticeably worse. We have an application which 
continually writes 270MB/s through the RGW which was working fine before the 
upgrade. Now it struggles to write 170MB/s continuously.

I don't have detailed benchmarks from before the upgrade other than what I just 
mentioned, but I have investigated some since. For background, the cluster is two 
hosts with 10 HDD OSDs each which is under 5% usage overall. The RGW pool is 
set up with a k=4, m=2 erasure code (across OSDs).

Profiling with `rados bench` to a pool with the same erasure code setup gives 
similar 170MB/s performance (so about 42MB/s per disk). Profiling a single OSD 
using `ceph tell osd.0 bench` yields an average of 40MB/s across all disks 
(which should yield close to 160MB/s at the pool level). Given this it seems to 
me to be an issue at the OSD level.
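
For reference, the kind of commands used for this profiling (pool name, 
duration, thread count and OSD id are examples):

    rados bench -p testpool 60 write -t 16 --no-cleanup
    rados -p testpool cleanup
    ceph tell osd.0 bench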

I tried setting `bluefs_buffered_io` to false as mentioned elsewhere. This 
reduced the `%wrqm` reported by iostat (which was previously close to 100%) and 
gave a slight performance gain (around 50MB/s per OSD), but nothing close to 
what was seen previously.

Before and after the above change both iostat and iotop report over 100MB/s 
written to disk while a single `ceph tell osd bench` command is running and 
iotop shows a near threefold write amplification. With the benchmark writing 
1GB to disk iotop shows both the `bstore_kv_sync` and `bstore_kv_final` threads 
writing about 1GB each and the threads `rocksdb:high0` and `rocksdb:low0` 
writing a total of 1GB. So 3GB total for the 1GB benchmark.

Looking at the OSD logs during the benchmark it seems rocksdb is compacting 
several times. I tried adding sharding to one of the OSDs as mentioned in the 
documentation (with `ceph-bluestore-tool`) but it didn't seem to make a 
difference.
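
For reference, the commands involved are along these lines (OSD id and path 
are examples; the sharding string must be taken from the documentation for 
your exact release):

    # manual online compaction of one OSD's RocksDB
    ceph tell osd.0 compact
    # offline resharding, with the OSD stopped
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard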

Does anyone have any idea what may have caused this performance loss? I am 
happy to post any more logs/detail if it would help.

Thanks!
Dustin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance degradation with upgrade from Octopus to Pacific

2021-11-01 Thread Igor Fedotov

Then you're highly likely bitten by https://tracker.ceph.com/issues/52089

This has been fixed starting with 16.2.6, so please update, or wait a bit 
until 16.2.7 is released, which is going to happen shortly.
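
For a cephadm-managed cluster like the one described below, that would be 
roughly:

    ceph orch upgrade start --ceph-version 16.2.6
    ceph orch upgrade status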



Thanks,

Igor


On 11/1/2021 7:25 PM, Dustin Lagoy wrote:

I am running a cephadm-based cluster with all images on 16.2.5.

Thanks for the quick response!
Dustin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Pg autoscaling and device_health_metrics pool pg sizing

2021-11-01 Thread Alex Petty
Hello,

I’m evaluating Ceph as a storage option, using ceph version 16.2.6, Pacific 
stable installed using cephadm. I was hoping to use PG autoscaling to reduce 
ops efforts. I’m standing this up on a cluster with 96 OSDs across 9 hosts.

The device_health_metrics pool was created by Ceph automatically once I started 
adding OSDs, and was created with 2048 PGs. This seems high and puts many PGs on 
each OSD. The documentation indicates that I should be targeting around 100 PGs per 
OSD; is that guideline out of date?

Also, when I created a pool to test erasure coding with a 6+2 config for CephFS, 
with PG autoscaling enabled, it was created with 1 PG to start and didn't scale 
up even as I loaded test data onto it, giving the entire CephFS the write 
performance of a single disk, as it was only writing to 1 disk and backfilling 
to 7 others. Should I be manually setting the PG count at a sane level (512, 
1024), or will autoscaling size this pool up? I have never seen any output from 
ceph osd pool autoscale-status when trying to see autoscaling information.

I’d appreciate some guidance about configuring PGs on Pacific.

Thanks,

Alex
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pg autoscaling and device_health_metrics pool pg sizing

2021-11-01 Thread Yury Kirsanov
Hi Alex,
Switch the autoscaler to the 'scale-up' profile; it will keep PGs at a minimum and
increase them as required. The default one is 'scale-down'.
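
For reference, something along these lines (the profile command follows the
Pacific docs; the pool name and pg_num_min value are examples):

    ceph osd pool set autoscale-profile scale-up
    ceph osd pool autoscale-status
    # or give a specific pool a floor so it doesn't sit at 1 PG
    ceph osd pool set cephfs_data_ec pg_num_min 128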

Regards,
Yury.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread Etienne Menguy
Hi,

Why do you think it’s used at 91%?

Ceph reports 47.51% usage for this pool.

-
Etienne Menguy
etienne.men...@croit.io




> On 1 Nov 2021, at 18:03, Szabo, Istvan (Agoda)  wrote:
> 
> Hi,
> 
> Theoretically my data pool is at 91% used but the fullest OSD is at 60%; 
> should I worry?
> 
> 
> 
> This is the ceph detail:
> 
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> nvme    10 TiB  9.3 TiB  292 MiB   1.2 TiB      11.68
> ssd    503 TiB  327 TiB  156 TiB   176 TiB      34.92
> TOTAL  513 TiB  337 TiB  156 TiB   177 TiB      34.44
> 
> --- POOLS ---
> POOL                    ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY   USED COMPR  UNDER COMPR
> device_health_metrics    1    1   41 MiB      0 B   41 MiB       54  123 MiB      0 B  123 MiB      0     57 TiB  N/A            N/A              54         0 B          0 B
> .rgw.root                2   32  981 KiB  978 KiB  3.3 KiB      163  4.0 MiB  4.0 MiB  9.8 KiB      0     57 TiB  N/A            N/A             163         0 B          0 B
> sin.rgw.log             19   32   49 GiB  2.6 MiB   49 GiB   40.85k  146 GiB   12 MiB  146 GiB   0.08     57 TiB  N/A            N/A          40.85k         0 B          0 B
> sin.rgw.buckets.index   20   32  416 GiB      0 B  416 GiB   58.88k  1.2 TiB      0 B  1.2 TiB  12.72    2.8 TiB  N/A            N/A          58.88k         0 B          0 B
> sin.rgw.buckets.non-ec  21   32   16 MiB    405 B   16 MiB       15   48 MiB  180 KiB   48 MiB      0     57 TiB  N/A            N/A              15         0 B          0 B
> sin.rgw.meta            22   32  5.5 MiB  300 KiB  5.2 MiB    1.13k   29 MiB   13 MiB   16 MiB      0    2.8 TiB  N/A            N/A           1.13k         0 B          0 B
> sin.rgw.control         23   32      0 B      0 B      0 B        8      0 B      0 B      0 B      0     57 TiB  N/A            N/A               8         0 B          0 B
> sin.rgw.buckets.data    24  128  104 TiB  104 TiB      0 B    1.30G  156 TiB  156 TiB      0 B  47.51    115 TiB  N/A            N/A           1.30G         0 B          0 B
> 
> 
> 
> 
> 
> This is the osd df:
> 
> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE   VAR   PGS  STATUS
> 36  nvme    1.74660   1.0  1.7 TiB  209 GiB   47 MiB  208 GiB   776 MiB  1.5 TiB  11.68  0.34   31  up
> 0   ssd    13.97069   1.0   14 TiB  5.2 TiB  4.6 TiB  8.7 MiB   583 GiB  8.8 TiB  37.00  1.07   39  up
> 8   ssd    13.97069   1.0   14 TiB  8.1 TiB  7.2 TiB  3.6 GiB   890 GiB  5.9 TiB  57.66  1.67   47  up
> 15  ssd    13.97069   1.0   14 TiB  2.8 TiB  2.5 TiB   10 MiB   299 GiB   11 TiB  19.69  0.57   19  up
> 18  ssd    13.97069   1.0   14 TiB  4.7 TiB  4.2 TiB  5.8 MiB   530 GiB  9.2 TiB  33.80  0.98   34  up
> 24  ssd    13.97069   1.0   14 TiB  3.9 TiB  3.5 TiB  6.6 MiB   477 GiB   10 TiB  28.22  0.82   21  up
> 30  ssd    13.97069   1.0   14 TiB  4.8 TiB  4.2 TiB  5.4 MiB   545 GiB  9.2 TiB  34.18  0.99   31  up
> 37  nvme    1.74660   1.0  1.7 TiB  273 GiB   47 MiB  271 GiB   1.3 GiB  1.5 TiB  15.25  0.44   39  up
> 1   ssd    14.55289   1.0   14 TiB  5.0 TiB  4.4 TiB   15 GiB   576 GiB  9.0 TiB  35.90  1.04   29  up
> 11  ssd    14.55289   1.0   14 TiB  7.3 TiB  6.5 TiB   15 GiB   798 GiB  6.7 TiB  52.11  1.51   42  up
> 17  ssd    14.55289   1.0   14 TiB  5.7 TiB  5.1 TiB  7.4 GiB   623 GiB  8.3 TiB  40.84  1.19   39  up
> 23  ssd    14.55289   1.0   14 TiB  5.1 TiB  4.5 TiB  2.4 GiB   578 GiB  8.9 TiB  36.41  1.06   31  up
> 28  ssd    14.55289   1.0   14 TiB  4.8 TiB  4.3 TiB  9.8 GiB   524 GiB  9.2 TiB  34.26  0.99   39  up
> 35  ssd    14.55289   1.0   14 TiB  1.3 TiB  1.2 TiB  4.9 GiB   143 GiB   13 TiB   9.41  0.27   21  up
> 41  nvme    1.74660   1.0  1.7 TiB  222 GiB   47 MiB  221 GiB   735 MiB  1.5 TiB  12.39  0.36   33  up
> 2   ssd    14.55289   1.0   14 TiB  4.2 TiB  3.6 TiB   22 GiB   511 GiB  9.8 TiB  29.73  0.86   33  up
> 6   ssd    14.55289   1.0   14 TiB  2.0 TiB  1.8 TiB   10 MiB   214 GiB   12 TiB  14.02  0.41   20  up
> 13  ssd    14.55289   1.0   14 TiB  5.2 TiB  4.6 TiB   15 MiB   600 GiB  8.8 TiB  37.14  1.08   30  up
> 19  ssd    14.55289   1.0   14 TiB  3.6 TiB  3.2 TiB   54 MiB   401 GiB   10 TiB  25.77  0.75   26  up
> 26  ssd    14.55289   1.0   14 TiB  5.8 TiB  5.2 TiB   14 MiB   635 GiB  8.2 TiB  41.45  1.20   38  up
> 32  ssd    13.97069   1.0   14 TiB  8.6 TiB  7.7 TiB   16 MiB  1014 GiB  5.3 TiB  61.85  1.80   46  up
> 38  nvme    1.74660   1.0  1.7 TiB  184 GiB   47 MiB  184 GiB   731 MiB  1.6 TiB  10.31  0.30   26  up
> 5ssd  14.55289   1.0   14 TiB  3.1 TiB  2.7 TiB   1

[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread Alexander Closs
Max available = free space actually usable now based on OSD usage, not 
including already-used space.

-Alex
MIT CSAIL

On 11/1/21, 2:18 PM, "Szabo, Istvan (Agoda)"  wrote:

It says max available is 115TB and current use is 104TB. What I don't 
understand is where the max available comes from, because on the pool no object and 
no size limit is set:

quotas for pool 'sin.rgw.buckets.data':
  max objects: N/A
  max bytes  : N/A

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Nov 1., at 18:48, Etienne Menguy  wrote:

sin.rgw.buckets.data24  128  104 TiB  104 TiB  0 B1.30G  156 
TiB  156 TiB  0 B  47.51115 TiB  N/AN/A   1.30G 
0 B  0 B
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread Alexander Closs
I can add another 2 positive datapoints for the balancer, my personal and work 
clusters are both happily balancing.

Good luck :)
-Alex

On 11/1/21, 3:05 PM, "Josh Baergen"  wrote:

Well, those who have negative reviews are often the most vocal. :)
We've had few, if any, problems with the balancer in our own use of
it.

Josh

On Mon, Nov 1, 2021 at 12:58 PM Szabo, Istvan (Agoda)
 wrote:
>
> Yeah, just following the autoscaler at the moment, it suggested 128. I might 
enable the balancer later, just scared a bit due to negative feedback about it.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2021. Nov 1., at 19:29, Josh Baergen  wrote:
>
> To expand on the comments below, "max avail" takes into account usage
> imbalance between OSDs. There's a pretty significant imbalance in this
> cluster and Ceph assumes that the imbalance will continue, and thus
> indicates that there's not much room left in the pool. Rebalancing
> that pool will make a big difference in terms of top-OSD fullness and
> the "max avail" metric.
>
> Josh
>


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread David Orman
The balancer does a pretty good job. It's the PG autoscaler that has bitten
us frequently enough that we always ensure it is disabled for all pools.
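
For reference, it can be disabled per pool, or by default for new pools, with 
something like:

    ceph osd pool set <pool> pg_autoscale_mode off
    ceph config set global osd_pool_default_pg_autoscale_mode off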

David

On Mon, Nov 1, 2021 at 2:08 PM Alexander Closs  wrote:

> I can add another 2 positive datapoints for the balancer, my personal and
> work clusters are both happily balancing.
>
> Good luck :)
> -Alex
>
> On 11/1/21, 3:05 PM, "Josh Baergen"  wrote:
>
> Well, those who have negative reviews are often the most vocal. :)
> We've had few, if any, problems with the balancer in our own use of
> it.
>
> Josh
>
> On Mon, Nov 1, 2021 at 12:58 PM Szabo, Istvan (Agoda)
>  wrote:
> >
> > Yeah, just follow the autoscaler at the moment, it suggested 128,
> might enable later the balancer, just scare a bit due to negative feedbacks
> about it.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2021. Nov 1., at 19:29, Josh Baergen 
> wrote:
> >
> > Email received from the internet. If in doubt, don't click any link
> nor open any attachment !
> > 
> >
> > To expand on the comments below, "max avail" takes into account usage
> > imbalance between OSDs. There's a pretty significant imbalance in
> this
> > cluster and Ceph assumes that the imbalance will continue, and thus
> > indicates that there's not much room left in the pool. Rebalancing
> > that pool will make a big difference in terms of top-OSD fullness and
> > the "max avail" metric.
> >
> > Josh
> >
> > On Mon, Nov 1, 2021 at 12:25 PM Alexander Closs <
> acl...@csail.mit.edu> wrote:
> >
> >
> > Max available = free space actually usable now based on OSD usage,
> not including already-used space.
> >
> >
> > -Alex
> >
> > MIT CSAIL
> >
> >
> > On 11/1/21, 2:18 PM, "Szabo, Istvan (Agoda)" <
> istvan.sz...@agoda.com> wrote:
> >
> >
> >It says max available: 115TB and current use is 104TB, what I
> don’t understand where the max available come from because on the pool no
> object and no size limit is set:
> >
> >
> >quotas for pool 'sin.rgw.buckets.data':
> >
> >  max objects: N/A
> >
> >  max bytes  : N/A
> >
> >
> >Istvan Szabo
> >
> >Senior Infrastructure Engineer
> >
> >---
> >
> >Agoda Services Co., Ltd.
> >
> >e: istvan.sz...@agoda.com
> >
> >---
> >
> >
> >On 2021. Nov 1., at 18:48, Etienne Menguy <
> etienne.men...@croit.io> wrote:
> >
> >
> >sin.rgw.buckets.data24  128  104 TiB  104 TiB  0 B
> 1.30G  156 TiB  156 TiB  0 B  47.51115 TiB  N/AN/A
>  1.30G 0 B  0 B
> >
> >___
> >
> >ceph-users mailing list -- ceph-users@ceph.io
> >
> >To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
> > ___
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-01 Thread Sage Weil
Hi Manuel,

I'm looking at the ticket for this issue (
https://tracker.ceph.com/issues/51463) and tried to reproduce.  This was
initially trivial to do with vstart (rados bench paused for many seconds
after stopping an OSD), but it turns out that was because the vstart
ceph.conf includes `osd_fast_shutdown = false`.  Once I enabled that again
(as it is by default on a normal cluster) I did not see any
noticeable interruption when an OSD was stopped.

Can you confirm what osd_fast_shutdown and osd_fast_shutdown_notify_mon are
set to on your cluster?
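
For reference, something like:

    ceph tell osd.0 config get osd_fast_shutdown
    ceph tell osd.0 config get osd_fast_shutdown_notify_mon

shows the values an individual OSD is actually running with.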

The intent is that when an OSD goes down, it will no longer accept
messenger connection attempts, and peer OSDs will inform the monitor with a
flag indicating the OSD is definitely dead (vs slow or unresponsive).  This
will allow the peering process to skip waiting for the read lease to time
out.  If you're seeing the laggy or 'waiting for readable' state, then that
isn't happening.. probably because the OSD shutdown isn't working as
originally intended.

If it's not one of those two options, maybe you can include a 'ceph config
dump' (or just a list of the changed values at least) so we can see what
else might be affecting OSD shutdown...

Thanks!
sage


On Thu, Oct 28, 2021 at 7:23 AM Manuel Lausch 
wrote:

> Hello Istvan,
>
> the state "waiting for readable" seems to be related to this read_lease
> topic documented here:
> https://docs.ceph.com/en/latest/dev/osd_internals/stale_read/
>
> The only parameter I know about to tune around this is
> "osd_pool_default_read_lease_ratio", which defaults to 0.8.
>
> My cluster complains about slow ops after 5 seconds due to its
> configuration. On starting, stopping or restarting OSDs I see slow
> ops of up to about 15 seconds. In normal clusters this would not be shown
> as a slow op because it is under the default 32-second limit.
>
> We plan to set this parameter to 0.15 to mitigate the issue before we
> upgrade to Ceph Octopus and later. I would also be happy if someone
> with deep knowledge could have a look at this issue.
>
>
> Regarding your cluster: I don't see what triggers your cluster to
> run into this state. Did you stop/start some OSDs beforehand?
> Maybe you can tell us some more details.
>
>
> Manuel
>
>
>
>
> On Thu, 28 Oct 2021 11:19:15 +
> "Szabo, Istvan (Agoda)"  wrote:
>
> > Hi,
> >
> > I found a couple of emails on the mailing list that might be related to
> > this issue, but there it came after an OSD restart or recovery. My
> > cluster is on Octopus 15.2.14 at the moment.
> >
> > When slow ops start to come I can see laggy PGs, and slowly the 3
> > RADOS gateways start to die behind the haproxy; when the slow ops are
> > gone they restart, but it causes user interruption.
> >
> > Digging deeper into the affected OSD's historical slow_ops log I can see a
> > couple of slow ops where 32 seconds were spent on the 'waiting for readable'
> > event. Is there anything to do with this? To avoid OSD restarts I'm
> > running with osd_op_thread_suicide_timeout=2000 and
> > osd_op_thread_timeout=90 enabled on all OSDs at the moment.
> >
> > Might it be that I'm using too small a number of PGs?
> > The data pool is a 4:2 EC, host based, on 7 nodes; currently the PG
> > number is 128, based on the balancer suggestion.
> >
> > An example slow ops:
> >
> > {
> > "description": "osd_op(client.141400841.0:290235021
> > 28.17s0
> >
> 28:e94c28ab:::9213182a-14ba-48ad-bde9-289a1c0c0de8.6034919.1_%2fWHITELABEL-1%2fPAGETPYE-5%2fDEVICE-4%2fLANGUAGE-38%2fSUBTYPE-0%2f148341:head
> > [create,setxattr user.rgw.idtag (56) in=14b,setxattr
> > user.rgw.tail_tag (56) in=17b,writefull 0~17706,setxattr
> > user.rgw.manifest (375) in=17b,setxattr user.rgw.acl (123)
> > in=12b,setxattr user.rgw.content_type (10) in=21b,setxattr
> > user.rgw.etag (32) in=13b,setxattr
> > user.rgw.x-amz-meta-storagetimestamp (40) in=36b,call
> > rgw.obj_store_pg_ver in=19b,setxattr user.rgw.source_zone (4) in=20b]
> > snapc 0=[] ondisk+write+known_if_redirected e37602)", "initiated_at":
> > "2021-10-28T16:52:44.652426+0700", "age": 3258.4773889160001,
> > "duration": 32.445993113, "type_data": { "flag_point": "started",
> > "client_info": { "client": "client.141400841", "client_addr":
> > "10.118.199.3:0/462844935", "tid": 290235021 }, "events":
> > [ { "event": "initiated", "time": "2021-10-28T16:52:44.652426+0700",
> > "duration": 0 },
> > {
> > "event": "throttled",
> > "time": "2021-10-28T16:52:44.652426+0700",
> > "duration": 6.11439996e-05
> > },
> > {
> > "event": "header_read",
> > "time": "2021-10-28T16:52:44.652487+0700",
> > "duration": 2.2341e-06
> > },
> > {
> > "event": "all_read",
> > "time": "2021-10-28T16:52:44.652489+0700",
> >  

[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread Anthony D'Atri
I think this thread has inadvertently conflated the two.

Balancer:  ceph-mgr module that uses pg-upmap to balance OSD utilization / 
fullness

Autoscaler:  attempts to set pg_num / pgp_num for each pool adaptively 


> 
> The balancer does a pretty good job. It's the PG autoscaler that has bitten
> us frequently enough that we always ensure it is disabled for all pools.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Best way to add multiple nodes to a cluster?

2021-11-01 Thread Zakhar Kirpichenko
Hi!

I have a 3-node 16.2.6 cluster with 33 OSDs, and plan to add another 3
nodes of the same configuration to it. What is the best way to add the new
nodes and OSDs so that I can avoid a massive rebalance and performance hit
until all new nodes and OSDs are in place and operational?

I would very much appreciate any advice.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io