[ceph-users] Don't know how to use bucket notification

2019-10-24 Thread 柯名澤
Hi, all.
Does anyone know where the endpoint of CREATE TOPIC is? (for bucket
notification) 
https://docs.ceph.com/docs/master/radosgw/notifications/#create-a-topic
Is it the same as the normal S3 API? I tried that but failed.
Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Lars Täuber
My question requires too complex an answer.
So let me ask a simple question:

What does the SIZE column of "osd pool autoscale-status" mean, and where does it come from?

Thanks
Lars

Wed, 23 Oct 2019 14:28:10 +0200
Lars Täuber  ==> ceph-users@ceph.io :
> Hello everybody!
> 
> What does this mean?
> 
> health: HEALTH_WARN
> 1 subtrees have overcommitted pool target_size_bytes
> 1 subtrees have overcommitted pool target_size_ratio
> 
> and what does it have to do with the autoscaler?
> When I deactivate the autoscaler the warning goes away.
> 
> 
> $ ceph osd pool autoscale-status
>  POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
>  cephfs_metadata  15106M               3.0   2454G         0.0180  0.3000        4.0   256                 on
>  cephfs_data      113.6T               1.5   165.4T        1.0306  0.9000        1.0   512                 on
> 
> 
> $ ceph health detail
> HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio
> POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
>     Pools ['cephfs_data'] overcommit available storage by 1.031x due to target_size_bytes 0 on pools []
> POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
>     Pools ['cephfs_data'] overcommit available storage by 1.031x due to target_size_ratio 0.900 on pools ['cephfs_data']
> 
> 
> Thanks
> Lars
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Lars Täuber
This question is answered here:
https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/

But it tells me that more data is stored in the pool than the raw capacity 
provides (taking the replication factor RATE into account), hence the RATIO 
being above 1.0.

How come this is the case? Is data stored outside of the pool?
And how come this only happens when the autoscaler is active?

Thanks
Lars


Thu, 24 Oct 2019 10:36:52 +0200
Lars Täuber  ==> ceph-users@ceph.io :
> My question requires too complex an answer.
> So let me ask a simple question:
> 
> What does the SIZE of "osd pool autoscale-status" tell/mean/comes from?
> 
> Thanks
> Lars
> 
> Wed, 23 Oct 2019 14:28:10 +0200
> Lars Täuber  ==> ceph-users@ceph.io :
> > Hello everybody!
> > 
> > What does this mean?
> > 
> > health: HEALTH_WARN
> > 1 subtrees have overcommitted pool target_size_bytes
> > 1 subtrees have overcommitted pool target_size_ratio
> > 
> > and what does it have to do with the autoscaler?
> > When I deactivate the autoscaler the warning goes away.
> > 
> > 
> > $ ceph osd pool autoscale-status
> >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> > RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> >  cephfs_metadata  15106M3.0 2454G  0.0180
> > 0.3000   4.0 256  on
> >  cephfs_data  113.6T1.5165.4T  1.0306
> > 0.9000   1.0 512  on
> > 
> > 
> > $ ceph health detail
> > HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 
> > subtrees have overcommitted pool target_size_ratio
> > POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool 
> > target_size_bytes
> > Pools ['cephfs_data'] overcommit available storage by 1.031x due to 
> > target_size_bytes0  on pools []
> > POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool 
> > target_size_ratio
> > Pools ['cephfs_data'] overcommit available storage by 1.031x due to 
> > target_size_ratio 0.900 on pools ['cephfs_data']
> > 
> > 
> > Thanks
> > Lars
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io  
> 
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-24 Thread Thomas Schneider
Hello,
this is understood.

I needed to start reweighting specific OSD because rebalancing was not
working and I got a warning in Ceph that some OSDs are running out of space.

KR


On 24.10.2019 at 05:58, Konstantin Shalygin wrote:
> On 10/23/19 2:46 PM, Thomas Schneider wrote:
>> Sure, here's the pastebin.
>
> Some of your 1.6TB OSDs are reweighted, e.g. osd.89 is 0.8,
> osd.100 is 0.7, etc.
>
> For this reason these OSDs get fewer PGs than the others.
>
>
>
> k
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change device class in EC profile

2019-10-24 Thread Eugen Block

Hi Frank,

just a short note on changing EC profiles. If you try to change only a  
single value you'll end up with a mess. See this example (Nautilus):


---snip---
# Created new profile
mon1:~ # ceph osd erasure-code-profile get ec-k2m4
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=4
plugin=jerasure
technique=reed_sol_van
w=8

# overwrite device-class
mon1:~ # ceph osd erasure-code-profile set ec-k2m4  
crush-device-class=ssd --force


# k and m have changed to default
mon1:~ # ceph osd erasure-code-profile get ec-k2m4
crush-device-class=ssd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
---snip---


If you really want to change the profile you should set all values of  
that profile again (similar to the auth caps changes). But even if you  
can change an existing EC profile it won't have an impact on existing  
pools. Also the docs [1] say it can't be modified for an existing pool:


Choosing the right profile is important because it cannot be  
modified after the pool is created: a new pool with a different  
profile needs to be created and all objects from the previous pool  
moved to the new.


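Concretely, re-creating the full profile from the example above would look 
something like this (just a sketch assembled from the values shown; --force 
is needed to overwrite an existing profile):

mon1:~ # ceph osd erasure-code-profile set ec-k2m4 \
    k=2 m=4 plugin=jerasure technique=reed_sol_van w=8 \
    crush-failure-domain=host crush-root=default \
    crush-device-class=ssd \
    jerasure-per-chunk-alignment=false --force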

But since you already moved the pool to different devices, cleaning up  
the profile should be OK, I guess. Newly created pools with the modified  
profile should honor the new configuration; I just replayed that in a  
small test environment and it worked there. But I haven't done this in a  
production environment, so there might be other issues I'm not aware of.


Regards,
Eugen

[1]  
https://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-code-profiles


Quoting Frank Schilder:

I recently moved an EC pool from HDD to SSD by changing the device  
class in the crush rule. I would like to complete this operation by  
cleaning up a dirty trail. The EC profile attached to this pool is  
called sr-ec-6-2-hdd and it is easy enough to rename that to  
sr-ec-6-2-ssd. However, the profile itself contains the device class  
as well:


crush-device-class=hdd
[...]

I can already see the confusion this will cause in the future. Is  
this situation one of the few instances where using the --force  
option is warranted to change the device class of a profile, as in:


osd erasure-code-profile set sr-ec-6-2-ssd crush-device-class=ssd --force

If not, how can I change the device class of the profile?

Many thanks and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Nathan Fish
The formatting is mangled on my phone, but if I am reading it correctly,
you have set Target Ratio to 4.0. This means you have told the balancer
that this pool will occupy 4x the space of your whole cluster, and to
optimize accordingly. This is naturally a problem. Setting it to 0 will
clear the setting and allow the autobalancer to work.

On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  wrote:

> This question is answered here:
> https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/
>
> But it tells me that there is more data stored in the pool than the raw
> capacity provides (taking the replication factor RATE into account) hence
> the RATIO being above 1.0 .
>
> How comes this is the case? - Data is stored outside of the pool?
> How comes this is only the case when the autoscaler is active?
>
> Thanks
> Lars
>
>
> Thu, 24 Oct 2019 10:36:52 +0200
> Lars Täuber  ==> ceph-users@ceph.io :
> > My question requires too complex an answer.
> > So let me ask a simple question:
> >
> > What does the SIZE of "osd pool autoscale-status" tell/mean/comes from?
> >
> > Thanks
> > Lars
> >
> > Wed, 23 Oct 2019 14:28:10 +0200
> > Lars Täuber  ==> ceph-users@ceph.io :
> > > Hello everybody!
> > >
> > > What does this mean?
> > >
> > > health: HEALTH_WARN
> > > 1 subtrees have overcommitted pool target_size_bytes
> > > 1 subtrees have overcommitted pool target_size_ratio
> > >
> > > and what does it have to do with the autoscaler?
> > > When I deactivate the autoscaler the warning goes away.
> > >
> > >
> > > $ ceph osd pool autoscale-status
> > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO
> TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
> > >  cephfs_metadata  15106M3.0 2454G  0.0180
>   0.3000   4.0 256  on
> > >  cephfs_data  113.6T1.5165.4T  1.0306
>   0.9000   1.0 512  on
> > >
> > >
> > > $ ceph health detail
> > > HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1
> subtrees have overcommitted pool target_size_ratio
> > > POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted
> pool target_size_bytes
> > > Pools ['cephfs_data'] overcommit available storage by 1.031x due
> to target_size_bytes0  on pools []
> > > POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted
> pool target_size_ratio
> > > Pools ['cephfs_data'] overcommit available storage by 1.031x due
> to target_size_ratio 0.900 on pools ['cephfs_data']
> > >
> > >
> > > Thanks
> > > Lars
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
>
>
> --
> Informationstechnologie
> Berlin-Brandenburgische Akademie der Wissenschaften
> Jägerstraße 22-23  10117 Berlin
> Tel.: +49 30 20370-352   http://www.bbaw.de
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Lars Täuber
Thanks Nathan for your answer,

but I set the Target Ratio to 0.9. It is the cephfs_data pool that causes 
the trouble.

The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is not explained 
on the page linked below. So I don't know its meaning.

How can a pool be overcommitted when it is the only pool on a set of OSDs?

Best regards,
Lars

Thu, 24 Oct 2019 09:39:51 -0400
Nathan Fish  ==> Lars Täuber  :
> The formatting is mangled on my phone, but if I am reading it correctly,
> you have set Target Ratio to 4.0. This means you have told the balancer
> that this pool will occupy 4x the space of your whole cluster, and to
> optimize accordingly. This is naturally a problem. Setting it to 0 will
> clear the setting and allow the autobalancer to work.
> 
> On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  wrote:
> 
> > This question is answered here:
> > https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/
> >
> > But it tells me that there is more data stored in the pool than the raw
> > capacity provides (taking the replication factor RATE into account) hence
> > the RATIO being above 1.0 .
> >
> > How comes this is the case? - Data is stored outside of the pool?
> > How comes this is only the case when the autoscaler is active?
> >
> > Thanks
> > Lars
> >
> >
> > Thu, 24 Oct 2019 10:36:52 +0200
> > Lars Täuber  ==> ceph-users@ceph.io :  
> > > My question requires too complex an answer.
> > > So let me ask a simple question:
> > >
> > > What does the SIZE of "osd pool autoscale-status" tell/mean/comes from?
> > >
> > > Thanks
> > > Lars
> > >
> > > Wed, 23 Oct 2019 14:28:10 +0200
> > > Lars Täuber  ==> ceph-users@ceph.io :  
> > > > Hello everybody!
> > > >
> > > > What does this mean?
> > > >
> > > > health: HEALTH_WARN
> > > > 1 subtrees have overcommitted pool target_size_bytes
> > > > 1 subtrees have overcommitted pool target_size_ratio
> > > >
> > > > and what does it have to do with the autoscaler?
> > > > When I deactivate the autoscaler the warning goes away.
> > > >
> > > >
> > > > $ ceph osd pool autoscale-status
> > > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  
> > TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  
> > > >  cephfs_metadata  15106M3.0 2454G  0.0180  
> >   0.3000   4.0 256  on  
> > > >  cephfs_data  113.6T1.5165.4T  1.0306  
> >   0.9000   1.0 512  on  
> > > >
> > > >
> > > > $ ceph health detail
> > > > HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1  
> > subtrees have overcommitted pool target_size_ratio  
> > > > POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted  
> > pool target_size_bytes  
> > > > Pools ['cephfs_data'] overcommit available storage by 1.031x due  
> > to target_size_bytes0  on pools []  
> > > > POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted  
> > pool target_size_ratio  
> > > > Pools ['cephfs_data'] overcommit available storage by 1.031x due  
> > to target_size_ratio 0.900 on pools ['cephfs_data']  
> > > >
> > > >
> > > > Thanks
> > > > Lars
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io  
> > >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Nathan Fish
Ah, I see! The BIAS reflects the number of placement groups it should
create. Since cephfs metadata pools are usually very small, but have
many objects and high IO, the autoscaler gives them 4x the number of
placement groups that it would normally give for that amount of data.

So, your cephfs_data is set to a ratio of 0.9, and cephfs_metadata to
0.3? Are the two pools using entirely different device classes, so
they are not sharing space?
Anyway, I see that your overcommit is only "1.031x". So if you set
cephfs_data to 0.85, it should go away.

On Thu, Oct 24, 2019 at 10:09 AM Lars Täuber  wrote:
>
> Thanks Nathan for your answer,
>
> but I set the the Target Ratio to 0.9. It is the cephfs_data pool that makes 
> the troubles.
>
> The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is not 
> explained on the page linked below. So I don't know its meaning.
>
> How can be a pool overcommited when it is the only pool on a set of OSDs?
>
> Best regards,
> Lars
>
> Thu, 24 Oct 2019 09:39:51 -0400
> Nathan Fish  ==> Lars Täuber  :
> > The formatting is mangled on my phone, but if I am reading it correctly,
> > you have set Target Ratio to 4.0. This means you have told the balancer
> > that this pool will occupy 4x the space of your whole cluster, and to
> > optimize accordingly. This is naturally a problem. Setting it to 0 will
> > clear the setting and allow the autobalancer to work.
> >
> > On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  wrote:
> >
> > > This question is answered here:
> > > https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/
> > >
> > > But it tells me that there is more data stored in the pool than the raw
> > > capacity provides (taking the replication factor RATE into account) hence
> > > the RATIO being above 1.0 .
> > >
> > > How comes this is the case? - Data is stored outside of the pool?
> > > How comes this is only the case when the autoscaler is active?
> > >
> > > Thanks
> > > Lars
> > >
> > >
> > > Thu, 24 Oct 2019 10:36:52 +0200
> > > Lars Täuber  ==> ceph-users@ceph.io :
> > > > My question requires too complex an answer.
> > > > So let me ask a simple question:
> > > >
> > > > What does the SIZE of "osd pool autoscale-status" tell/mean/comes from?
> > > >
> > > > Thanks
> > > > Lars
> > > >
> > > > Wed, 23 Oct 2019 14:28:10 +0200
> > > > Lars Täuber  ==> ceph-users@ceph.io :
> > > > > Hello everybody!
> > > > >
> > > > > What does this mean?
> > > > >
> > > > > health: HEALTH_WARN
> > > > > 1 subtrees have overcommitted pool target_size_bytes
> > > > > 1 subtrees have overcommitted pool target_size_ratio
> > > > >
> > > > > and what does it have to do with the autoscaler?
> > > > > When I deactivate the autoscaler the warning goes away.
> > > > >
> > > > >
> > > > > $ ceph osd pool autoscale-status
> > > > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO
> > > TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > > >  cephfs_metadata  15106M3.0 2454G  0.0180
> > >   0.3000   4.0 256  on
> > > > >  cephfs_data  113.6T1.5165.4T  1.0306
> > >   0.9000   1.0 512  on
> > > > >
> > > > >
> > > > > $ ceph health detail
> > > > > HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1
> > > subtrees have overcommitted pool target_size_ratio
> > > > > POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted
> > > pool target_size_bytes
> > > > > Pools ['cephfs_data'] overcommit available storage by 1.031x due
> > > to target_size_bytes0  on pools []
> > > > > POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted
> > > pool target_size_ratio
> > > > > Pools ['cephfs_data'] overcommit available storage by 1.031x due
> > > to target_size_ratio 0.900 on pools ['cephfs_data']
> > > > >
> > > > >
> > > > > Thanks
> > > > > Lars
> > > > > ___
> > > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [ceph-user] Ceph mimic support FIPS

2019-10-24 Thread Amit Ghadge
Hi all,

We have a FIPS-enabled cluster running ceph 12.2.12. After upgrading to
Mimic 13.2.6 it can't serve any requests, and we are not able to get/put
objects or buckets.

Does Mimic support FIPS?

Thanks,
Amit G
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rgw recovering shards

2019-10-24 Thread Frank R
Hi all,

After an RGW upgrade from 12.2.7 to 12.2.12 for RGW multisite a few days
ago the "sync status" has constantly shown a few "recovering shards", ie:

-

#  radosgw-admin sync status
  realm 8f7fd3fd-f72d-411d-b06b-7b4b579f5f2f (prod)
  zonegroup 60a2cb75-6978-46a3-b830-061c8be9dc75 (prod)
   zone ffce148e-3b24-462d-98bf-8c212de31de5 (us-east-1)
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: 7fe96e52-d6f7-4ad6-b66e-ecbbbffbc18e (us-east-2)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [48]
oldest incremental change not applied: 2019-10-21
22:34:11.0.293798s
5 shards are recovering
recovering shards: [11,37,48,110,117]

-

This is the secondary zone. I am worried about the "oldest incremental
change not applied" being from the 21st. Is there a way to have RGW just
stop trying to recover these shards and just sync them from this point in
time?

thx
Frank
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Don't know how to use bucket notification

2019-10-24 Thread Yuval Lifshitz
The endpoint is not the RGW endpoint; it is the server to which you
want to send the bucket notifications.
E.g. if you have a rabbitmq server running at address 1.2.3.4, you should use:
push-endpoint=amqp://1.2.3.4
Note that in such a case the amqp-exchange parameter must be set as well.

Assuming you have an HTTP server on the same address, listening on
port 8080, and you want your notifications to go to it, you should use:
push-endpoint=http://1.2.3.4:8080
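
For illustration, creating such a topic against the RGW could look roughly
like this with an SNS-style client (only a sketch: the RGW address and topic
name are made up, it assumes the CreateTopic request format from the docs
linked below, and it assumes the client is configured with the RGW user's
S3 credentials):

aws --endpoint-url http://<rgw-host>:8000 sns create-topic \
    --name mytopic \
    --attributes '{"push-endpoint": "http://1.2.3.4:8080"}'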

Yuval

> Subject:[ceph-users] Don't know how to use bucket notification
> Date:   Thu, 24 Oct 2019 16:26:54 +0800
> From:   柯名澤 
> To: ceph-users@ceph.io
>
>
>
> Hi, all.
> Does anyone know where the endpoint of CREATE TOPIC is? (for bucket
> notification)
> https://docs.ceph.com/docs/master/radosgw/notifications/#create-a-topic
> Is that the same with the normal S3 API? I tried but failed.
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] iSCSI write performance

2019-10-24 Thread Ryan
I'm in the process of testing the iscsi target feature of ceph. The cluster
is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts with 12
SSD OSDs per host. Some basic testing moving VMs to a ceph backed datastore
is only showing 60MB/s transfers. However moving these back off the
datastore is fast at 200-300MB/s.

What should I be looking at to track down the write performance issue? In
comparison with the Nimble Storage arrays I can see 200-300MB/s in both
directions.

Thanks,
Ryan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Nathan Fish
Are you using Erasure Coding or replication? What is your crush rule?
What SSDs and CPUs? Does each OSD use 100% of a core or more when
writing?

On Thu, Oct 24, 2019 at 1:22 PM Ryan  wrote:
>
> I'm in the process of testing the iscsi target feature of ceph. The cluster 
> is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts with 12 SSD 
> OSDs per host. Some basic testing moving VMs to a ceph backed datastore is 
> only showing 60MB/s transfers. However moving these back off the datastore is 
> fast at 200-300MB/s.
>
> What should I be looking at to track down the write performance issue? In 
> comparison with the Nimble Storage arrays I can see 200-300MB/s in both 
> directions.
>
> Thanks,
> Ryan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Drew Weaver
I was told by someone at Red Hat that iSCSI performance is still several 
orders of magnitude behind using the native client/driver.

Thanks,
-Drew


-Original Message-
From: Nathan Fish  
Sent: Thursday, October 24, 2019 1:27 PM
To: Ryan 
Cc: ceph-users 
Subject: [ceph-users] Re: iSCSI write performance

Are you using Erasure Coding or replication? What is your crush rule?
What SSDs and CPUs? Does each OSD use 100% of a core or more when writing?

On Thu, Oct 24, 2019 at 1:22 PM Ryan  wrote:
>
> I'm in the process of testing the iscsi target feature of ceph. The cluster 
> is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts with 12 SSD 
> OSDs per host. Some basic testing moving VMs to a ceph backed datastore is 
> only showing 60MB/s transfers. However moving these back off the datastore is 
> fast at 200-300MB/s.
>
> What should I be looking at to track down the write performance issue? In 
> comparison with the Nimble Storage arrays I can see 200-300MB/s in both 
> directions.
>
> Thanks,
> Ryan
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change device class in EC profile

2019-10-24 Thread Frank Schilder
Hi Eugen,

thanks for that comment. I did save the command line I used to create the EC 
profile. To force an update, I would just re-execute the same line with the 
device class set to SSD this time.

I would also expect that the pool only continues using k, m and algorithmic 
settings, which must not be changed ever. Device class and crush root should be 
safe to change as they are only relevant at pool creation. Sounds like what you 
write confirms this hypothesis.

Best regards and thanks,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 24 October 2019 14:32:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Change device class in EC profile

Hi Frank,

just a short note on changing EC profiles. If you try to change only a
single value you'll end up with a mess. See this example (Nautilus):

---snip---
# Created new profile
mon1:~ # ceph osd erasure-code-profile get ec-k2m4
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=4
plugin=jerasure
technique=reed_sol_van
w=8

# overwrite device-class
mon1:~ # ceph osd erasure-code-profile set ec-k2m4
crush-device-class=ssd --force

# k and m have changed to default
mon1:~ # ceph osd erasure-code-profile get ec-k2m4
crush-device-class=ssd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
---snip---


If you really want to change the profile you should set all values of
that profile again (similar to the auth caps changes). But even if you
can change an existing EC profile it won't have an impact on existing
pools. Also the docs [1] say it can't be modified for an existing pool:

> Choosing the right profile is important because it cannot be
> modified after the pool is created: a new pool with a different
> profile needs to be created and all objects from the previous pool
> moved to the new.


But since you already moved the pool to different devices cleaning up
the profile should be ok, I guess. Newly created pools with the
modified profile should honor the new configuration, I just replayed
that in a small test environment and it worked there. But I haven't
done this in a production environment so there might be other issues
I'm not aware of.

Regards,
Eugen

[1]
https://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-code-profiles

Quoting Frank Schilder:

> I recently moved an EC pool from HDD to SSD by changing the device
> class in the crush rule. I would like to complete this operation by
> cleaning up a dirty trail. The EC profile attached to this pool is
> called sr-ec-6-2-hdd and it is easy enough to rename that to
> sr-ec-6-2-ssd. However, the profile itself contains the device class
> as well:
>
> crush-device-class=hdd
> [...]
>
> I can already see the confusion this will cause in the future. Is
> this situation one of the few instances where using the --force
> option is warranted to change the device class of a profile, as in:
>
> osd erasure-code-profile set sr-ec-6-2-ssd crush-device-class=ssd --force
>
> If not, how can I change the device class of the profile?
>
> Many thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Martin Verges
Hello,

we did some local testing a few days ago on a new installation of a small
cluster.
Performance of our iSCSI implementation showed a performance drop of 20-30%
against krbd.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, Oct 24, 2019 at 19:37, Drew Weaver <drew.wea...@thenap.com> wrote:

> I was told by someone at Red Hat that ISCSI performance is still several
> magnitudes behind using the client / driver.
>
> Thanks,
> -Drew
>
>
> -Original Message-
> From: Nathan Fish 
> Sent: Thursday, October 24, 2019 1:27 PM
> To: Ryan 
> Cc: ceph-users 
> Subject: [ceph-users] Re: iSCSI write performance
>
> Are you using Erasure Coding or replication? What is your crush rule?
> What SSDs and CPUs? Does each OSD use 100% of a core or more when writing?
>
> On Thu, Oct 24, 2019 at 1:22 PM Ryan  wrote:
> >
> > I'm in the process of testing the iscsi target feature of ceph. The
> cluster is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts
> with 12 SSD OSDs per host. Some basic testing moving VMs to a ceph backed
> datastore is only showing 60MB/s transfers. However moving these back off
> the datastore is fast at 200-300MB/s.
> >
> > What should I be looking at to track down the write performance issue?
> In comparison with the Nimble Storage arrays I can see 200-300MB/s in both
> directions.
> >
> > Thanks,
> > Ryan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Choosing suitable SSD for Ceph cluster

2019-10-24 Thread Hermann Himmelbauer
Hi,
I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
interconnected via Infiniband 40.

The problem is that the ceph performance is quite bad (approx. 30MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing NVMe SSDs. The idea is to
have faster ceph storage and also some storage extension.

The question is now which SSDs I should use. If I understand it right,
not every SSD is suitable for ceph, as is denoted at the links below:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark

In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for ceph. As the 950 is not available anymore, I ordered a
Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.

Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:

fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test

The results are as the following:

---
1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec

Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec

Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
---

So the read speed is impressive, but the write speed is really bad.

Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
chips (MLC instead of TLC). The results are, however even worse for writing:

---
Samsung 970 PRO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec

Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec

Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
---

I did some research and found out that the "--sync" flag sets the
"O_DSYNC" flag, which seems to disable the SSD cache and leads to these
horrid write speeds.

It seems this relates to the fact that the write cache is only left
enabled for SSDs which implement some kind of battery/capacitor buffer
that guarantees a data flush to the flash in case of a power loss.

However, it seems impossible to find out which SSDs do have this
power-loss protection; moreover, these enterprise SSDs are crazy
expensive compared to the SSDs above, and it's unclear whether
power-loss protection is even available in the NVMe form factor. So
building a 1 or 2 TB cluster seems not really affordable/viable.

So, can anyone please give me hints on what to do? Is it possible to ensure
that the write cache is not disabled in some way (my server is situated
in a data center, so there will probably never be a loss of power)?

Or is the link above already outdated as newer ceph releases somehow
deal with this problem? Or maybe a later Debian release (10) will handle
the O_DSYNC flag differently?

Perhaps I should simply invest in faster (and bigger) harddisks and
forget the SSD-cluster idea?

Thank you in advance for any help,

Best Regards,
Hermann


-- 
herm...@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2019-10-24 Thread Martin Verges
Hello,

think about migrating to a way faster and better Ceph version and towards
bluestore to increase the performance with the existing hardware.

If you want to go with a PCIe card, the Samsung PM1725b can provide quite
good speeds, but at a much higher cost than the EVO. If you want to check
drives, take a look at the uncached write latency. The lower the value is,
the better the drive will be.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, Oct 24, 2019 at 21:09, Hermann Himmelbauer <herm...@qwer.tk> wrote:

> Hi,
> I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
> 3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
> interconnected via Infiniband 40.
>
> Problem is that the ceph performance is quite bad (approx. 30MiB/s
> reading, 3-4 MiB/s writing ), so I thought about plugging into each node
> a PCIe to NVMe/M.2 adapter and install SSD harddisks. The idea is to
> have a faster ceph storage and also some storage extension.
>
> The question is now which SSDs I should use. If I understand it right,
> not every SSD is suitable for ceph, as is denoted at the links below:
>
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> or here:
> https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
>
> In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
> fast SSD for ceph. As the 950 is not available anymore, I ordered a
> Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.
>
> Before equipping all nodes with these SSDs, I did some tests with "fio"
> as recommended, e.g. like this:
>
> fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> The results are as the following:
>
> ---
> 1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter
> Jobs: 1:
> read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
> write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
>
> Jobs: 4:
> read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
> write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
>
> Jobs: 10:
> read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
> write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
> ---
>
> So the read speed is impressive, but the write speed is really bad.
>
> Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
> chips (MLC instead of TLC). The results are, however even worse for
> writing:
>
> ---
> Samsung 970 PRO NVMe M.2 mit PCIe Adapter
> Jobs: 1:
> read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
> write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
>
> Jobs: 4:
> read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
> write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
>
> Jobs: 10:
> read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
> write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
> ---
>
> I did some research and found out, that the "--sync" flag sets the flag
> "O_DSYNC" which seems to disable the SSD cache which leads to these
> horrid write speeds.
>
> It seems that this relates to the fact that the write cache is only not
> disabled for SSDs which implement some kind of battery buffer that
> guarantees a data flush to the flash in case of a powerloss.
>
> However, It seems impossible to find out which SSDs do have this
> powerloss protection, moreover, these enterprise SSDs are crazy
> expensive compared to the SSDs above - moreover it's unclear if
> powerloss protection is even available in the NVMe form factor. So
> building a 1 or 2 TB cluster seems not really affordable/viable.
>
> So, can please anyone give me hints what to do? Is it possible to ensure
> that the write cache is not disabled in some way (my server is situated
> in a data center, so there will probably never be loss of power).
>
> Or is the link above already outdated as newer ceph releases somehow
> deal with this problem? Or maybe a later Debian release (10) will handle
> the O_DSYNC flag differently?
>
> Perhaps I should simply invest in faster (and bigger) harddisks and
> forget the SSD-cluster idea?
>
> Thank you in advance for any help,
>
> Best Regards,
> Hermann
>
>
> --
> herm...@qwer.tk
> PGP/GPG: 299893C7 (on keyservers)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2019-10-24 Thread Frank Schilder
Dear Hermann,

Try your tests again with the volatile write cache disabled ([s/h]dparm -W 0 
DEVICE). If your disks have super capacitors, you should then see spec 
performance (possibly starting with iodepth=2 or 4) with your fio test. A good 
article is this one here: 
https://yourcmc.ru/wiki/index.php?title=Ceph_performance .
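
As a sketch (the device name is only an example, and for NVMe drives a
different tool such as nvme-cli is needed to toggle the volatile write cache):

hdparm -W 0 /dev/sdX
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test

and then repeat with higher iodepth/numjobs values.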

The feature you are looking for is called "power loss protection". I would 
expect Samsung PRO disks to have it.

The fio test with iodepth=1 will give you an indication of what you can expect 
from a single OSD deployed on the disk. When choosing disks, also look for 
DWPD>=1.

In addition, as Martin writes, consider upgrading and deploy all new disks with 
bluestore.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Martin Verges 
Sent: 24 October 2019 21:21
To: Hermann Himmelbauer
Cc: ceph-users
Subject: [ceph-users] Re: Choosing suitable SSD for Ceph cluster

Hello,

think about migrating to a way faster and better Ceph version and towards 
bluestore to increase the performance with the existing hardware.

If you want to go with PCIe card, the Samsung PM1725b can provide quite good 
speeds but at much higher costs then the EVO. If you want to check drives, take 
a look at the uncached write latency. The lower the value is, the better will 
be the drive.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, Oct 24, 2019 at 21:09, Hermann Himmelbauer <herm...@qwer.tk> wrote:
Hi,
I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
interconnected via Infiniband 40.

Problem is that the ceph performance is quite bad (approx. 30MiB/s
reading, 3-4 MiB/s writing ), so I thought about plugging into each node
a PCIe to NVMe/M.2 adapter and install SSD harddisks. The idea is to
have a faster ceph storage and also some storage extension.

The question is now which SSDs I should use. If I understand it right,
not every SSD is suitable for ceph, as is denoted at the links below:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark

In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for ceph. As the 950 is not available anymore, I ordered a
Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.

Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:

fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test

The results are as the following:

---
1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec

Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec

Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
---

So the read speed is impressive, but the write speed is really bad.

Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
chips (MLC instead of TLC). The results are, however even worse for writing:

---
Samsung 970 PRO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec

Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec

Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
---

I did some research and found out, that the "--sync" flag sets the flag
"O_DSYNC" which seems to disable the SSD cache which leads to these
horrid write speeds.

It seems that this relates to the fact that the write cache is only not
disabled for SSDs which implement some kind of battery buffer that
guarantees a data flush to the flash in case of a powerloss.

However, It seems impossible to find out which SSDs do have this
powerloss protection, moreover, these enterprise SSDs are crazy
expensive compared to the SSDs above - moreover it's unclear if
powerloss protection is even available in the NVMe form factor. So
building a 1 or 2 TB cluster seems not really affordable/viable.

So, can please anyone give

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2019-10-24 Thread Vitaliy Filippov

It's easy:

https://yourcmc.ru/wiki/Ceph_performance


Hi,
I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
interconnected via Infiniband 40.

Problem is that the ceph performance is quite bad (approx. 30MiB/s
reading, 3-4 MiB/s writing ), so I thought about plugging into each node
a PCIe to NVMe/M.2 adapter and install SSD harddisks. The idea is to
have a faster ceph storage and also some storage extension.

The question is now which SSDs I should use. If I understand it right,
not every SSD is suitable for ceph, as is denoted at the links below:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark

In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for ceph. As the 950 is not available anymore, I ordered a
Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.

Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:

fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test

The results are as the following:

---
1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec

Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec

Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
---

So the read speed is impressive, but the write speed is really bad.

Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
chips (MLC instead of TLC). The results are, however even worse for  
writing:


---
Samsung 970 PRO NVMe M.2 mit PCIe Adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec

Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec

Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
---

I did some research and found out, that the "--sync" flag sets the flag
"O_DSYNC" which seems to disable the SSD cache which leads to these
horrid write speeds.

It seems that this relates to the fact that the write cache is only not
disabled for SSDs which implement some kind of battery buffer that
guarantees a data flush to the flash in case of a powerloss.

However, It seems impossible to find out which SSDs do have this
powerloss protection, moreover, these enterprise SSDs are crazy
expensive compared to the SSDs above - moreover it's unclear if
powerloss protection is even available in the NVMe form factor. So
building a 1 or 2 TB cluster seems not really affordable/viable.

So, can please anyone give me hints what to do? Is it possible to ensure
that the write cache is not disabled in some way (my server is situated
in a data center, so there will probably never be loss of power).

Or is the link above already outdated as newer ceph releases somehow
deal with this problem? Or maybe a later Debian release (10) will handle
the O_DSYNC flag differently?

Perhaps I should simply invest in faster (and bigger) harddisks and
forget the SSD-cluster idea?

Thank you in advance for any help,

Best Regards,
Hermann



--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2019-10-24 Thread Vitaliy Filippov
Especially https://yourcmc.ru/wiki/Ceph_performance#CAPACITORS.21, but I  
recommend reading the whole article.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Mike Christie
On 10/24/2019 12:22 PM, Ryan wrote:
> I'm in the process of testing the iscsi target feature of ceph. The
> cluster is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5

What kernel are you using?

> hosts with 12 SSD OSDs per host. Some basic testing moving VMs to a ceph
> backed datastore is only showing 60MB/s transfers. However moving these
> back off the datastore is fast at 200-300MB/s.

What is the workload and what are you using to measure the throughput?

If you are using fio, what arguments are you using? And, could you
change the ioengine to rbd and re-run the test from the target system so
we can check if rbd is slow or iscsi?

For small IOs, 60 is about right.

For 128-512K IOs you should be able to get around 300 MB/s for writes
and 600 for reads.

1. Increase max_data_area_mb. This is a kernel buffer lio/tcmu uses to
pass data between the kernel and tcmu-runner. The default is only 8MB.

In gwcli cd to your disk and do:

# reconfigure max_data_area_mb %N

where N is between 8 and 2048 MBs.

2. The Linux kernel target only allows 64 commands per iscsi session by
default. We increase that to 128, but you can increase this to 512.

In gwcli cd to the target dir and do

reconfigure cmdsn_depth 512

3. I think ceph-iscsi and lio work better with higher queue depths so if
you are using fio you want higher numjobs and/or iodepths.
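
For example, a direct rbd run could look roughly like this (a sketch only:
it requires fio built with rbd support, and the cephx user, pool and image
names are placeholders):

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test-img \
    --rw=write --bs=128k --iodepth=16 --runtime=60 --time_based \
    --name=rbd-write-test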

> 
> What should I be looking at to track down the write performance issue?
> In comparison with the Nimble Storage arrays I can see 200-300MB/s in
> both directions.
> 
> Thanks,
> Ryan
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Static website hosting with RGW

2019-10-24 Thread Oliver Freyermuth
Dear Cephers,

I have a question concerning static websites with RGW. 
To my understanding, it is best to run >=1 RGW client for "classic" S3 and in 
addition operate >=1 RGW client for website serving
(potentially with HAProxy or its friends in front) to prevent mix-ups between 
requests arriving via the different protocols. 

I'd prefer to avoid "*.example.com" entries in DNS if possible. 
So my current setup has these settings for the "web" RGW client:
 rgw_enable_static_website = true
 rgw_enable_apis = s3website
 rgw_dns_s3website_name = 
some_value_unused_when_A_records_are_used_pointing_to_the_IP_but_it_needs_to_be_set
and I create simple A records for each website pointing to the IP of this "web" 
RGW node. 

I can easily upload content for those websites to the other RGW instances which 
are serving S3,
so S3 and s3website APIs are cleanly separated in separate instances. 

However, one issue remains: How do I run
 s3cmd ws-create
on each website-bucket once? 
I can't do that against the "classic" S3-serving RGW nodes. This will give me a 
405 (not allowed),
since they do not have rgw_enable_static_website enabled. 
I also can not run it against the "web S3" nodes, since they do not have the S3 
API enabled. 
Of course I could enable that, but then the RGW node can't cleanly disentangle 
S3 and website requests since I use A records. 

Does somebody have a good idea on how to solve this issue? 
Setting "rgw_enable_static_website = true" on the S3-serving RGW nodes would 
solve it, but does that have any bad side-effects on their S3 operation? 

Also, if there's an expert on this: Exposing a bucket under a tenant as static 
website is not possible since the colon (:) can't be encoded in DNS, right? 


In case somebody also wants to set something like this up, here are the best 
docs I could find:
https://gist.github.com/robbat2/ec0a66eed28e5f0e1ef7018e9c77910c
and of course:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/object_gateway_guide_for_red_hat_enterprise_linux/index#configuring_gateways_for_static_web_hosting


Cheers,
Oliver



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Ryan
They are Samsung 860 EVO 2TB SSDs. The Dell R740xd servers have dual Intel
Gold 6130 CPUs and dual SAS controllers with 6 SSDs each. Top shows around
20-25% of a core being used by each OSD daemon. I am using erasure coding
with crush-failure-domain=host k=3 m=2.
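
For reference, that kind of layout can be recreated roughly like this (a
sketch only; the profile/pool names and PG count are made up, and
allow_ec_overwrites is needed for RBD images on an EC pool):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd pool create ec-data 128 128 erasure ec-3-2
ceph osd pool set ec-data allow_ec_overwrites true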

On Thu, Oct 24, 2019 at 1:37 PM Drew Weaver  wrote:

> I was told by someone at Red Hat that ISCSI performance is still several
> magnitudes behind using the client / driver.
>
> Thanks,
> -Drew
>
>
> -Original Message-
> From: Nathan Fish 
> Sent: Thursday, October 24, 2019 1:27 PM
> To: Ryan 
> Cc: ceph-users 
> Subject: [ceph-users] Re: iSCSI write performance
>
> Are you using Erasure Coding or replication? What is your crush rule?
> What SSDs and CPUs? Does each OSD use 100% of a core or more when writing?
>
> On Thu, Oct 24, 2019 at 1:22 PM Ryan  wrote:
> >
> > I'm in the process of testing the iscsi target feature of ceph. The
> cluster is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts
> with 12 SSD OSDs per host. Some basic testing moving VMs to a ceph backed
> datastore is only showing 60MB/s transfers. However moving these back off
> the datastore is fast at 200-300MB/s.
> >
> > What should I be looking at to track down the write performance issue?
> In comparison with the Nimble Storage arrays I can see 200-300MB/s in both
> directions.
> >
> > Thanks,
> > Ryan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-24 Thread Ryan
I'm using CentOS 7.7.1908 with kernel 3.10.0-1062.1.2.el7.x86_64. The
workload was a VMware Storage Motion from a local SSD backed datastore to
the ceph backed datastore. Performance was measured using dstat on the
iscsi gateway for network traffic and ceph status as this cluster is
basically idle.  I changed max_data_area_mb to 256 and cmdsn_depth to 128.
This appears to have given a slight improvement of maybe 10MB/s.

Moving VM to the ceph backed datastore
io:
client:   124 KiB/s rd, 76 MiB/s wr, 95 op/s rd, 1.26k op/s wr

Moving VM off the ceph backed datastore
  io:
client:   344 MiB/s rd, 625 KiB/s wr, 5.54k op/s rd, 62 op/s wr

I'm going to test bonnie++ with an rbd volume mounted directly on the iscsi
gateway. Also will test bonnie++ inside a VM on a ceph backed datastore.

On Thu, Oct 24, 2019 at 7:15 PM Mike Christie  wrote:

> On 10/24/2019 12:22 PM, Ryan wrote:
> > I'm in the process of testing the iscsi target feature of ceph. The
> > cluster is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5
>
> What kernel are you using?
>
> > hosts with 12 SSD OSDs per host. Some basic testing moving VMs to a ceph
> > backed datastore is only showing 60MB/s transfers. However moving these
> > back off the datastore is fast at 200-300MB/s.
>
> What is the workload and what are you using to measure the throughput?
>
> If you are using fio, what arguments are you using? And, could you
> change the ioengine to rbd and re-run the test from the target system so
> we can check if rbd is slow or iscsi?
>
> For small IOs, 60 is about right.
>
> For 128-512K IOs you should be able to get around 300 MB/s for writes
> and 600 for reads.
>
> 1. Increase max_data_area_mb. This is a kernel buffer lio/tcmu uses to
> pass data between the kernel and tcmu-runner. The default is only 8MB.
>
> In gwcli cd to your disk and do:
>
> # reconfigure max_data_area_mb %N
>
> where N is between 8 and 2048 MBs.
>
> 2. The Linux kernel target only allows 64 commands per iscsi session by
> default. We increase that to 128, but you can increase this to 512.
>
> In gwcli cd to the target dir and do
>
> reconfigure cmdsn_depth 512
>
> 3. I think ceph-iscsi and lio work better with higher queue depths so if
> you are using fio you want higher numjobs and/or iodepths.
>
> >
> > What should I be looking at to track down the write performance issue?
> > In comparison with the Nimble Storage arrays I can see 200-300MB/s in
> > both directions.
> >
> > Thanks,
> > Ryan
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw recovering shards

2019-10-24 Thread Konstantin Shalygin


On 10/24/19 11:00 PM, Frank R wrote:
After an RGW multisite upgrade from 12.2.7 to 12.2.12 a few days ago,
"sync status" has constantly shown a few "recovering shards", i.e.:


-

#  radosgw-admin sync status
          realm 8f7fd3fd-f72d-411d-b06b-7b4b579f5f2f (prod)
      zonegroup 60a2cb75-6978-46a3-b830-061c8be9dc75 (prod)
           zone ffce148e-3b24-462d-98bf-8c212de31de5 (us-east-1)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 7fe96e52-d6f7-4ad6-b66e-ecbbbffbc18e (us-east-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [48]
                        oldest incremental change not applied: 
2019-10-21 22:34:11.0.293798s

                        5 shards are recovering
                        recovering shards: [11,37,48,110,117]

-

This is the secondary zone. I am worried about the "oldest incremental 
change not applied" being from the 21st. Is there a way to have RGW 
just stop trying to recover these shards and just sync them from this 
point in time?



Did you run the new stale-shard maintenance commands after the upgrade
(`reshard stale-instances list`, `reshard stale-instances rm`)?
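
I.e. something along the lines of:

radosgw-admin reshard stale-instances list
radosgw-admin reshard stale-instances rm

(Review the list output first, and check the documentation for the multisite
caveats before running the rm.)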




k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-24 Thread Konstantin Shalygin

On 10/24/19 6:54 PM, Thomas Schneider wrote:

this is understood.

I needed to start reweighting specific OSDs because rebalancing was not
working and I got a warning in Ceph that some OSDs were running out of space.


Still, your main issue is that your buckets are uneven: 350TB vs 79TB, more
than a 4x difference.
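
A quick way to see the per-root and per-host totals is:

ceph osd df tree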


I suggest you disable the multiroot setup (use only the default root) and use
your 1.6TB drives from the current default root (I count ~48 1.6TB OSDs).


And mix your OSDs across the hosts so capacity is more evenly distributed in
the cluster; this is one of the basic Ceph best practices.



You can also try the offline upmap method; some folks get better results with
it (don't forget to disable the balancer first):


ceph osd getmap -o om
osdmaptool om --upmap upmap.sh --upmap-deviation 0
bash upmap.sh
rm -f upmap.sh om
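
(To disable the balancer first: `ceph balancer off`; `ceph balancer status`
shows whether it is still active.)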




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-10-24 Thread Lars Täuber
Hi Nathan,

Thu, 24 Oct 2019 10:59:55 -0400
Nathan Fish  ==> Lars Täuber  :
> Ah, I see! The BIAS reflects the number of placement groups it should
> create. Since cephfs metadata pools are usually very small, but have
> many objects and high IO, the autoscaler gives them 4x the number of
> placement groups that it would normally give for that amount of data.
> 
ah ok, I understand.

> So, your cephfs_data is set to a ratio of 0.9, and cephfs_metadata to
> 0.3? Are the two pools using entirely different device classes, so
> they are not sharing space?

Yes, the metadata is on SSDs and the data on HDDs.

> Anyway, I see that your overcommit is only "1.031x". So if you set
> cephfs_data to 0.85, it should go away.

This is not the case. I set the target_ratio to 0.7 and get this:

 POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 cephfs_metadata  15736M               3.0   2454G         0.0188  0.3000        4.0   256                 on
 cephfs_data      122.2T               1.5   165.4T        1.1085  0.7000        1.0   1024                on

The RATIO seems to have nothing to do with the TARGET RATIO; it depends only
on SIZE and RAW CAPACITY. Because the pool is still receiving data, the SIZE
keeps growing and therefore the RATIO grows as well.
The RATIO appears to be calculated by this formula:

RATIO = SIZE * RATE / RAW CAPACITY
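
With the numbers above this checks out: 122.2T * 1.5 / 165.4T ≈ 1.108, which
matches the reported RATIO of 1.1085 up to rounding of the displayed SIZE.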

This is what I don't understand. The data in the cephfs_data pool seems to need 
more space than the raw capacity of the cluster provides. Hence the situation 
is called "overcommitment".

But why is this only the case when the autoscaler is active?

Thanks
Lars

> 
> On Thu, Oct 24, 2019 at 10:09 AM Lars Täuber  wrote:
> >
> > Thanks Nathan for your answer,
> >
> > but I set the Target Ratio to 0.9. It is the cephfs_data pool that
> > makes the trouble.
> >
> > The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is not 
> > explained on the page linked below. So I don't know its meaning.
> >
> > How can a pool be overcommitted when it is the only pool on a set of OSDs?
> >
> > Best regards,
> > Lars
> >
> > Thu, 24 Oct 2019 09:39:51 -0400
> > Nathan Fish  ==> Lars Täuber  :  
> > > The formatting is mangled on my phone, but if I am reading it correctly,
> > > you have set Target Ratio to 4.0. This means you have told the balancer
> > > that this pool will occupy 4x the space of your whole cluster, and to
> > > optimize accordingly. This is naturally a problem. Setting it to 0 will
> > > clear the setting and allow the autobalancer to work.
> > >
> > > On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  wrote:
> > >  
> > > > This question is answered here:
> > > > https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/
> > > >
> > > > But it tells me that there is more data stored in the pool than the raw
> > > > capacity provides (taking the replication factor RATE into account) 
> > > > hence
> > > > the RATIO being above 1.0 .
> > > >
> > > > How comes this is the case? - Data is stored outside of the pool?
> > > > How comes this is only the case when the autoscaler is active?
> > > >
> > > > Thanks
> > > > Lars
> > > >
> > > >
> > > > Thu, 24 Oct 2019 10:36:52 +0200
> > > > Lars Täuber  ==> ceph-users@ceph.io :  
> > > > > My question requires too complex an answer.
> > > > > So let me ask a simple question:
> > > > >
> > > > > What does the SIZE of "osd pool autoscale-status" tell/mean/comes 
> > > > > from?
> > > > >
> > > > > Thanks
> > > > > Lars
> > > > >
> > > > > Wed, 23 Oct 2019 14:28:10 +0200
> > > > > Lars Täuber  ==> ceph-users@ceph.io :  
> > > > > > Hello everybody!
> > > > > >
> > > > > > What does this mean?
> > > > > >
> > > > > > health: HEALTH_WARN
> > > > > > 1 subtrees have overcommitted pool target_size_bytes
> > > > > > 1 subtrees have overcommitted pool target_size_ratio
> > > > > >
> > > > > > and what does it have to do with the autoscaler?
> > > > > > When I deactivate the autoscaler the warning goes away.
> > > > > >
> > > > > >
> > > > > > $ ceph osd pool autoscale-status
> > > > > >  POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > > > >  cephfs_metadata  15106M               3.0   2454G         0.0180  0.3000        4.0   256                 on
> > > > > >  cephfs_data      113.6T               1.5   165.4T        1.0306  0.9000        1.0   512                 on
> > > > > >
> > > > > >
> > > > > > $ ceph health detail
> > > > > > HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio
> > > > > > POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
> > > > > > Pools ['cephfs_data'] overcommit available storage by 1.031x due to target_size_bytes 0 on pools []