Re: [ceph-users] Cluster Re-balancing

2018-04-19 Thread Monis Monther
Hi Casper,

Thank you for the response, the problem is solved now. After some searching, it
turned out that since Luminous, setting mon_osd_backfillfull_ratio and
mon_osd_nearfull_ratio no longer takes effect. This is because these ratios are
now read from the OSD map, and the commands "ceph osd set-nearfull-ratio" and
"ceph osd set-backfillfull-ratio" are used to change them.

This was verified by running "ceph osd dump | head": all ratios were still at
their defaults (0.92, 0.95, etc.). After setting them to 0.85 the flags started
to work normally and we were able to control our cluster in a better way.
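
For reference, a minimal sketch of the commands involved (the 0.85/0.90
values below are only examples, not necessarily what you want):

# verify the ratios currently stored in the OSD map
ceph osd dump | head

# since Luminous the ratios live in the OSD map, so change them there
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90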

Moreover, setting the backfillfull ratio lower than the nearfull ratio would
show a HEALTH_ERR with "out of order" flags. Therefore, we set them to the same
value for now and started reweighting to rebalance the cluster.

The backfillfull flags actually prevented data movement to those OSDs, and data
was moved to other OSDs with more free space. Nevertheless, some PGs got stuck
and backfill_toofull was flagged. In the end we reweighted those and everything
returned to normal. Finally, we set the backfillfull ratio back to be higher
than the nearfull ratio. End of story.


Thanks






On Wed, Apr 18, 2018 at 11:20 AM, Caspar Smit 
wrote:

> Hi Monis,
>
> The settings you mention do not prevent data movement to overloaded OSDs;
> they are thresholds at which Ceph warns that an OSD is nearfull or
> backfillfull.
> No expert on this, but setting backfillfull lower than nearfull is not
> recommended; the nearfull state should be reached first, before
> backfillfull.
>
> You can reweight the overloaded OSDs manually by issuing: ceph osd
> reweight osd.X 0.95  (the last value should be between 0 and 1,
> where 1 is the default and can be seen as 100%; setting this to 0.95 means
> to only use 95% of the OSD. To move more PGs off this OSD you can set the
> value lower, to 0.9 or 0.85)
>
> Kind regards,
> Caspar
>
>
> 2018-04-18 9:07 GMT+02:00 Monis Monther :
>
>> Hi,
>>
>> We are running a cluster with Ceph Luminous 12.2.0. Some of the OSDs are
>> getting full and we are running ceph osd reweight-by-utilization to
>> re-balance the OSDs. We have also set
>>
>> mon_osd_backfillfull_ratio 0.8 (this is to prevent moving data to an
>> overloaded OSD when re-weighting)
>> mon_osd_nearfull_ratio 0.85
>>
>> However, reweighting is worsening the problem by moving data from an 85%
>> full OSD to an 84.7% full OSD instead of moving it to a half-empty OSD. This
>> is causing the latter to increase up to 85.6%. Some OSDs have now reached
>> 87% and 86%.
>>
>> Moreover, the cluster does not show any OSD as nearfull although some
>> OSDs have passed 86%, and it is totally ignoring the backfillfull setting by
>> moving data to OSDs that are above 80%.
>>
>> Are the settings above wrong? What can we do to prevent moving data to
>> overloaded OSDs?
>>
>> --
>> Best Regards
>> Monis
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


-- 
Best Regards
Monis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Steven Vacaroaia
Hi,

Any idea why 2 servers with one OSD each provide better performance
than 3?

The servers are identical.
Performance is impacted irrespective of whether I use an SSD for WAL/DB or not.
Basically, I am getting lots of "cur MB/s" readings of zero.

The network is separate 10 GbE for public and cluster traffic.
I tested it with iperf and I am getting 9.3 Gb/s.

I have tried replication size 2 and 3 with the same results (much better for 2
servers than 3).

I have reinstalled Ceph multiple times.
ceph.conf is very simple - no major customization (see below).
I am out of ideas - any hint will be TRULY appreciated.

Steven



auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24


osd_pool_default_size = 2
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
osd_crush_chooseleaf_type = 1


[mon]
mon_allow_pool_delete = true
mon_osd_min_down_reporters = 1

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs =
"rw,noatime,nodiratime,attr2,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=4M"
osd_mkfs_options_xfs = "-f -i size=2048"
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 1073741824

rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_osd01_383626
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      16        57        41   163.991       164     0.197929     0.065543
    2      16        57        41    81.992         0            -     0.065543
    3      16        67        51   67.9936        20    0.0164632     0.249939
    4      16        67        51   50.9951         0            -     0.249939
    5      16        71        55   43.9958         8    0.0171439     0.319973
    6      16       181       165   109.989       440    0.0159057     0.563746
    7      16       182       166   94.8476         4     0.221421     0.561684
    8      16       182       166   82.9917         0            -     0.561684
    9      16       240       224   99.5458       116    0.0232989     0.638292
   10      16       264       248   99.1901        96    0.0222669     0.583336
   11      16       264       248   90.1729         0            -     0.583336
   12      16       285       269   89.6579        42    0.0165706     0.600606
   13      16       285       269   82.7611         0            -     0.600606
   14      16       310       294   83.9918        50    0.0254241     0.756351
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Milanov, Radoslav Nikiforov
Try filestore instead of bluestore ?

- Rado

From: ceph-users  On Behalf Of Steven 
Vacaroaia
Sent: Thursday, April 19, 2018 8:11 AM
To: ceph-users 
Subject: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

Hi,

Any idea why 2 servers with one OSD each will provide better performance than 3 
?

Servers are identical
Performance  is impacted irrespective if I used SSD for WAL/DB or not
Basically, I am getting lots of cur MB/s zero

Network is separate 10 GB for public and private
I tested it with iperf and I am getting 9.3 Gbs

I have tried replication by 2 and 3 with same results ( much better for 2 
servers than 3 )

reinstalled CEPH multiple times
ceph.conf very simple - no major customization ( see below)
I am out of ideas - any hint will be TRULY appreciated

Steven



auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24


osd_pool_default_size = 2
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
osd_crush_chooseleaf_type = 1


[mon]
mon_allow_pool_delete = true
mon_osd_min_down_reporters = 1

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs = 
"rw,noatime,nodiratime,attr2,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=4M"
osd_mkfs_options_xfs = "-f -i size=2048"
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 1073741824

rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
for up to 120 seconds or 0 objects
Object prefix: benchmark_data_osd01_383626
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  165741   163.991   1640.1979290.065543
2  16574181.992 0   -0.065543
3  166751   67.993620   0.01646320.249939
4  166751   50.9951 0   -0.249939
5  167155   43.9958 8   0.01714390.319973
6  16   181   165   109.989   440   0.01590570.563746
7  16   182   166   94.8476 40.2214210.561684
8  16   182   166   82.9917 0   -0.561684
9  16   240   224   99.5458   116   0.02329890.638292
   10  16   264   248   99.190196   0.02226690.583336
   11  16   264   248   90.1729 0   -0.583336
   12  16   285   269   89.657942   0.01657060.600606
   13  16   285   269   82.7611 0   -0.600606
   14  16   310   294   83.991850   0.02542410.756351


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Marc Roos

If I may guess: because with 3 nodes it reads from 3, and with 2 it reads only
from 2. You should be able to verify this with something like dstat -d
-D sda,sdb,sdc,sdd,sde,sdf,sdg, no?

With a replication of 2, objects are still being stored among the 3 nodes.
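
If you want to check that, a quick sketch (assuming the bench pool is
called rbd, as in your output):

# per-OSD utilization and PG counts
ceph osd df tree

# PGs of the bench pool and the OSDs in their acting sets
ceph pg ls-by-pool rbd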

I am getting with iperf3 on 10Gbit
[ ID] Interval   Transfer Bandwidth   Retr  Cwnd
[  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec0   1.31 MBytes
[  4]  10.00-20.00  sec  11.5 GBytes  9.89 Gbits/sec0   1.79 MBytes
 


-Original Message-
From: Steven Vacaroaia [mailto:ste...@gmail.com] 
Sent: donderdag 19 april 2018 14:11
To: ceph-users
Subject: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

Hi,

Any idea why 2 servers with one OSD each will provide better performance 
than 3 ?

Servers are identical
Performance  is impacted irrespective if I used SSD for WAL/DB or not 
Basically, I am getting lots of cur MB/s zero  

Network is separate 10 GB for public and private I tested it with iperf 
and I am getting 9.3 Gbs 

I have tried replication by 2 and 3 with same results ( much better for 
2 servers than 3 )

reinstalled CEPH multiple times
ceph.conf very simple - no major customization ( see below) I am out of 
ideas - any hint will be TRULY appreciated 

Steven 



auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24


osd_pool_default_size = 2
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state 
osd_crush_chooseleaf_type = 1


[mon]
mon_allow_pool_delete = true
mon_osd_min_down_reporters = 1

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs = 
"rw,noatime,nodiratime,attr2,logbufs=8,logbsize=256k,largeio,inode64,swa
lloc,allocsize=4M"
osd_mkfs_options_xfs = "-f -i size=2048"
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 1073741824

rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq 
hints = 1 Maintaining 16 concurrent writes of 4194304 bytes to objects 
of size 4194304 for up to 120 seconds or 0 objects Object prefix: 
benchmark_data_osd01_383626
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg 
lat(s)
0   0 0 0 0 0   -
   0
1  165741   163.991   1640.197929
0.065543
2  16574181.992 0   -
0.065543
3  166751   67.993620   0.0164632
0.249939
4  166751   50.9951 0   -
0.249939
5  167155   43.9958 8   0.0171439
0.319973
6  16   181   165   109.989   440   0.0159057
0.563746
7  16   182   166   94.8476 40.221421
0.561684
8  16   182   166   82.9917 0   -
0.561684
9  16   240   224   99.5458   116   0.0232989
0.638292
   10  16   264   248   99.190196   0.0222669
0.583336
   11  16   264   248   90.1729 0   -
0.583336
   12  16   285   269   89.657942   0.0165706
0.600606
   13  16   285   269   82.7611 0   -
0.600606
   14  16   310   294   83.991850   0.0254241
0.756351




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Hans van den Bogert
Hi Steven,

There is only one bench. Could you show multiple benches of the different
scenarios you discussed? Also provide hardware details.

Hans

On Apr 19, 2018 13:11, "Steven Vacaroaia"  wrote:

Hi,

Any idea why 2 servers with one OSD each will provide better performance
than 3 ?

Servers are identical
Performance  is impacted irrespective if I used SSD for WAL/DB or not
Basically, I am getting lots of cur MB/s zero

Network is separate 10 GB for public and private
I tested it with iperf and I am getting 9.3 Gbs

I have tried replication by 2 and 3 with same results ( much better for 2
servers than 3 )

reinstalled CEPH multiple times
ceph.conf very simple - no major customization ( see below)
I am out of ideas - any hint will be TRULY appreciated

Steven



auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24


osd_pool_default_size = 2
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
osd_crush_chooseleaf_type = 1


[mon]
mon_allow_pool_delete = true
mon_osd_min_down_reporters = 1

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs =
"rw,noatime,nodiratime,attr2,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=4M"
osd_mkfs_options_xfs = "-f -i size=2048"
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 1073741824

rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_osd01_383626
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0   0 0 0 0 0   -
 0
1  165741   163.991   1640.197929
0.065543
2  16574181.992 0   -
0.065543
3  166751   67.993620   0.0164632
0.249939
4  166751   50.9951 0   -
0.249939
5  167155   43.9958 8   0.0171439
0.319973
6  16   181   165   109.989   440   0.0159057
0.563746
7  16   182   166   94.8476 40.221421
0.561684
8  16   182   166   82.9917 0   -
0.561684
9  16   240   224   99.5458   116   0.0232989
0.638292
   10  16   264   248   99.190196   0.0222669
0.583336
   11  16   264   248   90.1729 0   -
0.583336
   12  16   285   269   89.657942   0.0165706
0.600606
   13  16   285   269   82.7611 0   -
0.600606
   14  16   310   294   83.991850   0.0254241
0.756351


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Steven Vacaroaia
Thanks for helping

I thought that with Ceph, the more servers you have, the better the
performance - that is why I am so confused.

Also, I tried to add a 4th server (still no luck - in fact the rados
bench output I included was from 4 servers, one OSD on each, bluestore,
replication 2).

Here is the same rados bench but with only 2 servers/ 2 OSDs

rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_osd01_384454
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      16       101        85   339.981       340     0.244679     0.160362
    2      16       149       133   265.972       192     0.303997     0.226268
    3      16       197       181   241.307       192     0.179013     0.255609
    4      16       241       225   224.975       176     0.353464     0.272623
    5      16       289       273   218.376       192     0.303821     0.282425
    6      16       338       322   214.643       196     0.326009      0.29105
    7      16       387       371   211.977       196      0.27048     0.296497
    8      16       436       420   209.977       196     0.287188     0.299224
    9      16       479       463   205.755       172     0.380512     0.302272
   10      16       527       511   204.378       192     0.289532     0.306163
   11      16       576       560   203.614       196     0.406783     0.309271
   12      16       624       608   202.645       192     0.282266     0.312167
   13      16       667       651   200.286       172     0.377555     0.313733
   14      16       716       700   199.978       196     0.350938     0.315445
   15      16       764       748   199.445       192     0.183931     0.317474


Here are my iperf3 results

        0.00-1.00   sec  1.15 GBytes  9.92 Gbits/sec    0    850 KBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0    850 KBytes


On Thu, 19 Apr 2018 at 08:28, Marc Roos  wrote:

>
> If I may guess, because with 3 it reads from 3 and with 2 it reads only
> from 2. You should be able to verify this with something like dstat -d
> -D sda,sdb,sdc,sdd,sde,sdf,sdg not?
>
> With replication of 2, objects are still being stored among the 3 nodes.
>
> I am getting with iperf3 on 10Gbit
> [ ID] Interval   Transfer Bandwidth   Retr  Cwnd
> [  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec0   1.31 MBytes
> [  4]  10.00-20.00  sec  11.5 GBytes  9.89 Gbits/sec0   1.79 MBytes
>
>
>
> -Original Message-
> From: Steven Vacaroaia [mailto:ste...@gmail.com]
> Sent: donderdag 19 april 2018 14:11
> To: ceph-users
> Subject: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?
>
> Hi,
>
> Any idea why 2 servers with one OSD each will provide better performance
> than 3 ?
>
> Servers are identical
> Performance  is impacted irrespective if I used SSD for WAL/DB or not
> Basically, I am getting lots of cur MB/s zero
>
> Network is separate 10 GB for public and private I tested it with iperf
> and I am getting 9.3 Gbs
>
> I have tried replication by 2 and 3 with same results ( much better for
> 2 servers than 3 )
>
> reinstalled CEPH multiple times
> ceph.conf very simple - no major customization ( see below) I am out of
> ideas - any hint will be TRULY appreciated
>
> Steven
>
>
>
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
>
> public_network = 10.10.30.0/24
> cluster_network = 192.168.0.0/24
>
>
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
> osd_crush_chooseleaf_type = 1
>
>
> [mon]
> mon_allow_pool_delete = true
> mon_osd_min_down_reporters = 1
>
> [osd]
> osd_mkfs_type = xfs
> osd_mount_options_xfs =
> "rw,noatime,nodiratime,attr2,logbufs=8,logbsize=256k,largeio,inode64,swa
> lloc,allocsize=4M"
> osd_mkfs_options_xfs = "-f -i size=2048"
> bluestore_block_db_size = 32212254720
> bluestore_block_wal_size = 1073741824
>
> rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd 120 seq
> hints = 1 Maintaining 16 concurrent writes of 4194304 bytes to objects
> of size 4194304 for up to 120 seconds or 0 objects Object prefix:
> benchmark_data_osd01_383626
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>0
> 1  165741   163.991   1640.197929
> 0.065543
> 2  16574181.992 0   -
> 0.065543
> 3  166751   67.993620   0.0164632
> 0.249939
> 4  166751   50.9951 0   -
> 0.249939
> 5  167155   43.9958 8   0.0171439
> 0.319973
> 6  16   181   165   109.989   440   0.0159057
> 0.563746
> 7  16   182   1

Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Marc Roos

> I thought that with Ceph, the more servers you have, the better the
> performance - that is why I am so confused

You will get better overall performance across your concurrent client
connections, because all client reads/writes are spread over all
disks/servers.
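
A rough way to see that effect is to run several bench clients at the same
time and add up their results (sketch; run each from a different client
host, the run names are arbitrary):

# on client host A
rados bench -p rbd 50 write -t 16 --run-name clientA --no-cleanup
# on client host B, started at the same time
rados bench -p rbd 50 write -t 16 --run-name clientB --no-cleanup

The aggregate of the two should scale with the number of OSD nodes much
better than a single client does.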

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Steven Vacaroaia
Sure ..thanks for your willingness to help

Identical servers

Hardware
DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
(Toshiba PX05SMB040Y)


All tests done with the following command
rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq


ceph osd pool ls detail
"pool_name": "rbd",
"flags": 1,
"flags_names": "hashpspool",
"type": 1,
"size": 2,
"min_size": 1,
"crush_rule": 1,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "354",
"last_force_op_resend": "0",
"last_force_op_resend_preluminous": "0",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": [],
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"target_max_bytes": 0,
"target_max_objects": 0,
"cache_target_dirty_ratio_micro": 40,
"cache_target_dirty_high_ratio_micro": 60,
"cache_target_full_ratio_micro": 80,
"cache_min_flush_age": 0,
"cache_min_evict_age": 0,
"erasure_code_profile": "",
"hit_set_params": {
"type": "none"
},
"hit_set_period": 0,
"hit_set_count": 0,
"use_gmt_hitset": true,
"min_read_recency_for_promote": 0,
"min_write_recency_for_promote": 0,
"hit_set_grade_decay_rate": 0,
"hit_set_search_last_n": 0,
"grade_table": [],
"stripe_width": 0,
"expected_num_objects": 0,
"fast_read": false,
"options": {},
"application_metadata": {}
}


ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "rbd",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -9,
"item_name": "sas"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]









2 servers, 2 OSD

ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -9       4.0     root sas
-10       1.0         host osd01-sas
  2   hdd 1.0             osd.2          up        0     1.0
-11       1.0         host osd02-sas
  3   hdd 1.0             osd.3          up        0     1.0
-12       1.0         host osd03-sas
  5   hdd 1.0             osd.5          up      1.0     1.0
-19       1.0         host osd04-sas
  6   hdd 1.0             osd.6          up      1.0     1.0


2018-04-19 09:19:01.266010 min lat: 0.0412473 max lat: 1.03227 avg lat: 0.331163
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
   40      16      1941      1925   192.478       192     0.315461     0.331163
   41      16      1984      1968   191.978       172     0.262268     0.331529
   42      16      2032      2016   191.978       192     0.326608     0.332061
   43      16      2081      2065   192.071       196     0.345757     0.332389
   44      16      2123      2107   191.524       168     0.307759     0.332745
   45      16      2166      2150    191.09       172     0.318577     0.333613
   46      16      2214      2198   191.109       192     0.329559     0.333703
   47      16      2257      2241   190.702       172     0.423664      0.33427
   48      16      2305      2289   190.729       192     0.357342     0.334386
   49      16      2348      2332   190.346       172      0.30218     0.334735
   50      16      2396      2380   190.379       192     0.318226     0.334981
Total time run: 50.281886
Total writes made:  2397
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 190.685
Stddev Bandwidth:   24.5781
Max bandwidth (MB/sec): 340
Min bandwidth (MB/sec): 164
Average IOPS:   47
Stddev IOPS:6
Max IOPS:   85
Min IOPS:   41
Average Latency(s): 0.335515
Stddev Latency(s):  0.0867836
Max latency(s): 1.03227
Min latency(s): 0.0412473

2018-04-19 09:19:52.340092 min lat: 0.0209445 max lat: 14.9208 avg l

Re: [ceph-users] Memory leak in Ceph OSD?

2018-04-19 Thread Stefan Kooman
Hi,

Quoting Stefan Kooman (ste...@bit.nl):
> Hi,
> 
> TL;DR: we see "used" memory grow indefinitely on our OSD servers,
> until the point that either 1) an OSD process gets killed by the OOM killer,
> or 2) an OSD aborts (probably because malloc cannot provide more RAM). I
> suspect a memory leak in the OSDs.

I got quite a lot of feedback on this thread, thanks for that! I'm pretty
sure we were not hit by a Ceph memory leak, but by an Intel i40e driver
leak, specifically in Linux kernel 4.13 (Ubuntu Xenial HWE), see [1].

Running a 4.13 kernel with Intel X710 NICs? You will definitely want to update
to 4.13.0-38, where this issue is fixed.

We are running this kernel now for a week or so and memory is "under
control". Now it's time to crank bluestore cache again :-).

FYI.

Gr. Stefan

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1748408




-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Creating first Ceph cluster

2018-04-19 Thread Shantur Rathore
Hi,

I am building my first Ceph cluster from hardware left over from a previous
project. I have been reading a lot of Ceph documentation but need some help
to make sure I am going the right way.
To set the stage, below is what I have.

Rack-1

1 x HP DL360 G9 with
   - 256 GB Memory
   - 5 x 300GB HDD
   - 2 x HBA SAS
   - 4 x 10GBe Networking Card

1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
Enterprise 1.7TB HDD
Chassis and HP server are connected with 2 x SAS HBA for redundancy.


Rack-2 (Same as Rack-1)

1 x HP DL360 G9 with
   - 256 GB Memory
   - 5 x 300GB HDD
   - 2 x HBA SAS
   - 4 x 10GBe Networking Card

1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
Enterprise 1.7TB HDD
Chassis and HP server are connected with 2 x SAS HBA for redundancy.


Rack-3

5 x HP DL360 G8 with
   - 128 GB Memory
   - 2 x 400GB HP Enterprise SSD
   - 3 x 1.7TB Enterprise HDD

Requirements
- To serve storage to around 200 VMware VMs via iSCSI. VMs use disks
moderately.
- To serve storage to some docker containers using ceph volume driver
- To serve storage to some legacy apps using NFS

Plan

- Create a ceph cluster with all machines
- Use Bluestore as osd backing ( 3 x SSD for DB and WAL in SuperMicro
Chassis and 1 x SSD for DB and WAL in Rack 3 G8s)
- Use remaining SSDs ( 14 x in SuperMicro and 1 x Rack 3 G8s ) for Rados
Cache Tier
- Update CRUSH map to make Rack as minimum failure domain. So almost all
data is replicated across racks and in case one of the host dies the
storage still works.
- Single bonded network (4x10GBe) connected to ToR switches.
- Same public and cluster network

Questions

- First of all, is this kind of setup workable?
- I have seen that Ceph uses /dev/sdx names in guides; is that a good
approach, considering that disks die and can come back with a different
/dev/sdx identifier on reboot?
- What should be the approximate size of the WAL and DB partitions for my kind
of setup?
- Can I install Ceph in a VM and run other VMs on these hosts? Is Ceph too
CPU demanding?

Thanks,
Shantur
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating first Ceph cluster

2018-04-19 Thread Alfredo Deza
On Thu, Apr 19, 2018 at 11:10 AM, Shantur Rathore
 wrote:
> Hi,
>
> I am building my first Ceph cluster from hardware leftover from a previous
> project. I have been reading a lot of Ceph documentation but need some help
> to make sure I going the right way.
> To set the stage below is what I have
>
> Rack-1
>
> 1 x HP DL360 G9 with
>- 256 GB Memory
>- 5 x 300GB HDD
>- 2 x HBA SAS
>- 4 x 10GBe Networking Card
>
> 1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
> Enterprise 1.7TB HDD
> Chassis and HP server are connected with 2 x SAS HBA for redundancy.
>
>
> Rack-2 (Same as Rack-1)
>
> 1 x HP DL360 G9 with
>- 256 GB Memory
>- 5 x 300GB HDD
>- 2 x HBA SAS
>- 4 x 10GBe Networking Card
>
> 1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
> Enterprise 1.7TB HDD
> Chassis and HP server are connected with 2 x SAS HBA for redundancy.
>
>
> Rack-3
>
> 5 x HP DL360 G8 with
>- 128 GB Memory
>- 2 x 400GB HP Enterprise SSD
>- 3 x 1.7TB Enterprise HDD
>
> Requirements
> - To serve storage to around 200 VMware VMs via iSCSI. VMs use disks
> moderately.
> - To serve storage to some docker containers using ceph volume driver
> - To serve storage to some legacy apps using NFS
>
> Plan
>
> - Create a ceph cluster with all machines
> - Use Bluestore as osd backing ( 3 x SSD for DB and WAL in SuperMicro
> Chassis and 1 x SSD for DB and WAL in Rack 3 G8s)
> - Use remaining SSDs ( 14 x in SuperMicro and 1 x Rack 3 G8s ) for Rados
> Cache Tier
> - Update CRUSH map to make Rack as minimum failure domain. So almost all
> data is replicated across racks and in case one of the host dies the storage
> still works.
> - Single bonded network (4x10GBe) connected to ToR switches.
> - Same public and cluster network
>
> Questions
>
> - First of all, is this kind of setup workable.
> - I have seen that Ceph uses /dev/sdx names in guides, is it a good approach
> considering the disks die and can come up with different /dev/sdx identifier
> on reboot.

In the case of ceph-volume, these will not matter, since it uses LVM
behind the scenes and LVM takes care of figuring out whether /dev/sda1 is
now really /dev/sdb1 after a reboot.

If using ceph-disk, however, the detection is done a bit differently,
by reading partition labels and depending on UDEV triggers that can
sometimes be troublesome, especially on reboot. In the case of a
successful detection via UDEV the non-persistent names still wouldn't
matter much.
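
As an illustration (not part of the original question), creating a
bluestore OSD with ceph-volume so that LVM metadata rather than the
/dev/sdX name identifies the devices could look like this; the device
names are placeholders:

# /dev/sdc is the data disk, /dev/sdb1 a partition on the SSD for block.db
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sdb1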

> - What should be the approx size of WAL and DB partitions for my kind of
> setup?
> - Can i install ceph in a VM and use other VMs on these hosts. Is Ceph too
> CPU demanding?
>
> Thanks,
> Shantur
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Impact on changing ceph auth cap

2018-04-19 Thread Sven Barczyk
Hi,

does anyone have experience with changing auth caps in production environments?
I'm trying to add an additional pool with rwx to my client.libvirt (OpenNebula).

ceph auth caps client.libvirt mon 'allow r' mgr 'allow r' osd 'profile rbd,
allow rwx pool=one, allow rwx pool=two'

Does this change have any impact on my running VMs?

Kind Regards
Sven


bringe Informationstechnik GmbH
Zur Seeplatte 12
D-76228 Karlsruhe
Germany

Fon

+49 721 94246-0

Fax

+49 721 94246-66


Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe
UStID: DE812936645

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bug in rgw quota calculation?

2018-04-19 Thread Matthew Vernon

Hi,

TL;DR there seems to be a problem with quota calculation for rgw in our 
Jewel / Ubuntu 16.04 cluster. Our support people suggested we raise it 
with upstream directly; before I open a tracker item I'd like to check 
I've not missed something obvious :)


Our cluster is running Jewel on Ubuntu 16.04 (rgw version
10.2.7-0ubuntu0.16.04.2~sanger1 [0]). A user complained that they'd
deleted a bucket with lots of partially-uploaded parts in it, but their quota
was still being counted as if the contents were still there.


radosgw-admin user stats --sync-stats reports:
"total_entries": 6590,
"total_bytes": 1767693041863,
"total_bytes_rounded": 1767700045824

If I do bucket stats --uid=as45 (or search the output of bucket stats by
hand), I find 4 buckets that sum to (details in footnote 1):

num_objects: 3370
size_kb: 774880722
size_kb_actual: 774887560

Taking the larger of these x 1024 gives 793,484,861,440, considerably
smaller than the quota number above.
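
For anyone who wants to reproduce the comparison, this is roughly what I
did (sketch; it assumes the Jewel JSON layout with a per-bucket
usage["rgw.main"] section and that jq is installed - adjust the paths if
your output differs):

# user-level accounting as rgw sees it
radosgw-admin user stats --uid=as45 --sync-stats | grep total_bytes

# sum size_kb_actual over the user's buckets and convert to bytes
radosgw-admin bucket stats --uid=as45 \
  | jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add * 1024'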


We have done "bucket check" on all the users' buckets, all return 0. We 
have done "orphan find" and removed all the leaked objects returned.


I attach the output of
radosgw-admin -n client.rgw.sto-1-2 user stats --sync-stats --uid=as45 
--debug_rgw=20 >/tmp/rgwoutput2 2>&1


(compressed).

This looks like a bug to me; should I open a tracker item?

Thanks,

Matthew

[0] The Sanger1 suffix is a RH-provided patch to fix a MIME issue with 
uploads

[1] 4 buckets:
"size_kb": 33599604,
"size_kb_actual": 33602384,
"num_objects": 1390
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
"size_kb": 707556170,
"size_kb_actual": 707556172,
"num_objects": 2
"size_kb": 33724948,
"size_kb_actual": 33729004,
"num_objects": 1978



--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 

rgwoutput2.gz
Description: application/gzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Hans van den Bogert
I take it that the first bench is with replication size 2 and the second bench
is with replication size 3? Same for the 4-node OSD scenario?

Also, please let us know how you set up block.db and WAL; are they on the SSD?

On Thu, Apr 19, 2018, 14:40 Steven Vacaroaia  wrote:

> Sure ..thanks for your willingness to help
>
> Identical servers
>
> Hardware
> DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
> Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
> (Toshiba PX05SMB040Y)
>
>
> All tests done with the following command
> rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq
>
>
> ceph osd pool ls detail
> "pool_name": "rbd",
> "flags": 1,
> "flags_names": "hashpspool",
> "type": 1,
> "size": 2,
> "min_size": 1,
> "crush_rule": 1,
> "object_hash": 2,
> "pg_num": 64,
> "pg_placement_num": 64,
> "crash_replay_interval": 0,
> "last_change": "354",
> "last_force_op_resend": "0",
> "last_force_op_resend_preluminous": "0",
> "auid": 0,
> "snap_mode": "selfmanaged",
> "snap_seq": 0,
> "snap_epoch": 0,
> "pool_snaps": [],
> "removed_snaps": "[]",
> "quota_max_bytes": 0,
> "quota_max_objects": 0,
> "tiers": [],
> "tier_of": -1,
> "read_tier": -1,
> "write_tier": -1,
> "cache_mode": "none",
> "target_max_bytes": 0,
> "target_max_objects": 0,
> "cache_target_dirty_ratio_micro": 40,
> "cache_target_dirty_high_ratio_micro": 60,
> "cache_target_full_ratio_micro": 80,
> "cache_min_flush_age": 0,
> "cache_min_evict_age": 0,
> "erasure_code_profile": "",
> "hit_set_params": {
> "type": "none"
> },
> "hit_set_period": 0,
> "hit_set_count": 0,
> "use_gmt_hitset": true,
> "min_read_recency_for_promote": 0,
> "min_write_recency_for_promote": 0,
> "hit_set_grade_decay_rate": 0,
> "hit_set_search_last_n": 0,
> "grade_table": [],
> "stripe_width": 0,
> "expected_num_objects": 0,
> "fast_read": false,
> "options": {},
> "application_metadata": {}
> }
>
>
> ceph osd crush rule dump
> [
> {
> "rule_id": 0,
> "rule_name": "replicated_rule",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 1,
> "rule_name": "rbd",
> "ruleset": 1,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -9,
> "item_name": "sas"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
> ]
>
>
>
>
>
>
>
>
>
> 2 servers, 2 OSD
>
> ceph osd tree
> ID  CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
>  -9   4.0 root sas
> -10   1.0 host osd01-sas
>   2   hdd 1.0 osd.2  up0 1.0
> -11   1.0 host osd02-sas
>   3   hdd 1.0 osd.3  up0 1.0
> -12   1.0 host osd03-sas
>   5   hdd 1.0 osd.5  up  1.0 1.0
> -19   1.0 host osd04-sas
>   6   hdd 1.0 osd.6  up  1.0 1.0
>
>
> 2018-04-19 09:19:01.266010 min lat: 0.0412473 max lat: 1.03227 avg lat:
> 0.331163
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
>40  16  1941  1925   192.478   1920.315461
> 0.331163
>41  16  1984  1968   191.978   1720.262268
> 0.331529
>42  16  2032  2016   191.978   1920.326608
> 0.332061
>43  16  2081  2065   192.071   1960.345757
> 0.332389
>44  16  2123  2107   191.524   1680.307759
> 0.332745
>45  16  2166  2150191.09   1720.318577
> 0.333613
>46  16  2214  2198   191.109   1920.329559
> 0.333703
>47  16  2257  2241   190.702   1720.423664
>  0.33427
>48  16  2305  2289   190.729   1920.357342
> 0.334386
>49  16  2348  2332   190.346   172 0.30218
> 0.334735
>50  16  2396  2380   190.379   

Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Steven Vacaroaia
replication size is always 2

DB/WAL on HDD in this case

I tried OSDs with WAL/DB on SSD - they exhibit the same symptoms
(cur MB/s 0).

In summary, it does not matter
- which server (any 2 will work better than any 3 or 4)
- replication size (I tried with size 2 and 3)
- location of WAL/DB (on a separate SSD or the same HDD)


Thanks
Steven

On Thu, 19 Apr 2018 at 12:06, Hans van den Bogert 
wrote:

> I take it that the first bench is with replication size 2, the second
> bench is with replication size 3? Same for the 4 node OSD scenario?
>
> Also please let us know how you setup block.db and Wal, are they on the
> SSD?
>
> On Thu, Apr 19, 2018, 14:40 Steven Vacaroaia  wrote:
>
>> Sure ..thanks for your willingness to help
>>
>> Identical servers
>>
>> Hardware
>> DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
>> Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
>> (Toshiba PX05SMB040Y)
>>
>>
>> All tests done with the following command
>> rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq
>>
>>
>> ceph osd pool ls detail
>> "pool_name": "rbd",
>> "flags": 1,
>> "flags_names": "hashpspool",
>> "type": 1,
>> "size": 2,
>> "min_size": 1,
>> "crush_rule": 1,
>> "object_hash": 2,
>> "pg_num": 64,
>> "pg_placement_num": 64,
>> "crash_replay_interval": 0,
>> "last_change": "354",
>> "last_force_op_resend": "0",
>> "last_force_op_resend_preluminous": "0",
>> "auid": 0,
>> "snap_mode": "selfmanaged",
>> "snap_seq": 0,
>> "snap_epoch": 0,
>> "pool_snaps": [],
>> "removed_snaps": "[]",
>> "quota_max_bytes": 0,
>> "quota_max_objects": 0,
>> "tiers": [],
>> "tier_of": -1,
>> "read_tier": -1,
>> "write_tier": -1,
>> "cache_mode": "none",
>> "target_max_bytes": 0,
>> "target_max_objects": 0,
>> "cache_target_dirty_ratio_micro": 40,
>> "cache_target_dirty_high_ratio_micro": 60,
>> "cache_target_full_ratio_micro": 80,
>> "cache_min_flush_age": 0,
>> "cache_min_evict_age": 0,
>> "erasure_code_profile": "",
>> "hit_set_params": {
>> "type": "none"
>> },
>> "hit_set_period": 0,
>> "hit_set_count": 0,
>> "use_gmt_hitset": true,
>> "min_read_recency_for_promote": 0,
>> "min_write_recency_for_promote": 0,
>> "hit_set_grade_decay_rate": 0,
>> "hit_set_search_last_n": 0,
>> "grade_table": [],
>> "stripe_width": 0,
>> "expected_num_objects": 0,
>> "fast_read": false,
>> "options": {},
>> "application_metadata": {}
>> }
>>
>>
>> ceph osd crush rule dump
>> [
>> {
>> "rule_id": 0,
>> "rule_name": "replicated_rule",
>> "ruleset": 0,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>> {
>> "op": "take",
>> "item": -1,
>> "item_name": "default"
>> },
>> {
>> "op": "chooseleaf_firstn",
>> "num": 0,
>> "type": "host"
>> },
>> {
>> "op": "emit"
>> }
>> ]
>> },
>> {
>> "rule_id": 1,
>> "rule_name": "rbd",
>> "ruleset": 1,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>> {
>> "op": "take",
>> "item": -9,
>> "item_name": "sas"
>> },
>> {
>> "op": "chooseleaf_firstn",
>> "num": 0,
>> "type": "host"
>> },
>> {
>> "op": "emit"
>> }
>> ]
>> }
>> ]
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2 servers, 2 OSD
>>
>> ceph osd tree
>> ID  CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
>>  -9   4.0 root sas
>> -10   1.0 host osd01-sas
>>   2   hdd 1.0 osd.2  up0 1.0
>> -11   1.0 host osd02-sas
>>   3   hdd 1.0 osd.3  up0 1.0
>> -12   1.0 host osd03-sas
>>   5   hdd 1.0 osd.5  up  1.0 1.0
>> -19   1.0 host osd04-sas
>>   6   hdd 1.0 osd.6  up  1.0 1.0
>>
>>
>> 2018-04-19 09:19:01.266010 min lat: 0.0412473 max lat: 1.03227 avg lat:
>> 0.331163
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>> lat(s)
>>40  16  1941  1925   192.478   1920.315461
>> 0.331163
>>41  16  1984  1968   191.978   1720.262268
>> 0.331529
>>42  16  2032  2016   191.978   1920.326608
>> 0.332061
>>43  16  20

Re: [ceph-users] Impact on changing ceph auth cap

2018-04-19 Thread Alex Gorbachev
On Thu, Apr 19, 2018 at 11:32 AM, Sven Barczyk  wrote:

> Hi,
>
>
>
> does anyone have experience in changing auth cap in production
> environments?
>
> I’m trying to add an additional pool with rwx to my client.libvirt
> (OpenNebula).
>
>
>
> ceph auth cap client.libvirt mon ‘allow r’ mgr ‘allow r’ osd ‘profile rbd,
> allow rwx pool=one , allow rwx pool=two’
>
>
>
> Does this move has any impact on my running vm’s ?
>

I often do this in production with rbd; there is no impact on other rbd
clients.  I sometimes have to restart the rbd VM for it to recognize the new
caps.  Sorry, no experience with libvirt on this, but the caps process
seems to work well.

--
Alex Gorbachev
Storcium



>
>
> Kind Regards
>
> Sven
>
>
>
> 
> bringe Informationstechnik GmbH
>
> Zur Seeplatte 12
>
> D-76228 Karlsruhe
>
> Germany
>
>
>
> Fon
>
> +49 721 94246-0
>
> Fax
>
> +49 721 94246-66
>
>
>
> Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe
>
> UStID: DE812936645
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact on changing ceph auth cap

2018-04-19 Thread Jason Dillaman
On Thu, Apr 19, 2018 at 11:32 AM, Sven Barczyk  wrote:

> Hi,
>
>
>
> does anyone have experience in changing auth cap in production
> environments?
>
> I’m trying to add an additional pool with rwx to my client.libvirt
> (OpenNebula).
>
>
>
> ceph auth cap client.libvirt mon ‘allow r’ mgr ‘allow r’ osd ‘profile rbd,
> allow rwx pool=one , allow rwx pool=two’
>

Note that "osd ‘profile rbd, allow rwx pool=one , allow rwx pool=two’"
grants rxw in all pools due to the lack of pool-restrictions on 'profile
rbd', so the trailing 'allow rwx pool=one , allow rwx pool=two’ are
unnecessary.
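
If the intent was to limit the rbd profile to those two pools, something
along these lines should do it (a sketch, please double-check against your
setup before applying):

ceph auth caps client.libvirt mon 'allow r' mgr 'allow r' \
  osd 'profile rbd pool=one, profile rbd pool=two'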


>
> Does this move has any impact on my running vm’s ?
>

>
> Kind Regards
>
> Sven
>
>
>
> 
> bringe Informationstechnik GmbH
>
> Zur Seeplatte 12
>
> D-76228 Karlsruhe
>
> Germany
>
>
>
> Fon
>
> +49 721 94246-0
>
> Fax
>
> +49 721 94246-66
>
>
>
> Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe
>
> UStID: DE812936645
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Hans van den Bogert
I see, the second one is the read bench. Even in the 2 node scenario the
read performance is pretty bad. Have you verified the hardware with micro
benchmarks such as 'fio'? Also try to review storage controller settings.
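
For example, something along these lines against the raw device, outside
of Ceph (destructive, only on a disk that holds no data; the device name is
a placeholder):

fio --name=seqwrite --filename=/dev/sdX --rw=write --bs=4M \
    --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 \
    --time_based --group_reporting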

On Apr 19, 2018 5:13 PM, "Steven Vacaroaia"  wrote:

replication size is always 2

DB/WAL on HDD in this case

I tried with  OSDs with WAL/DB on SSD - they exhibit the same symptoms  (
cur MB/s 0 )

In summary, it does not matter
- which server ( any 2 will work better than any 3 or 4)
- replication size ( it tried with size 2 and 3 )
- location of WAL/DB ( on separate SSD or same HDD)


Thanks
Steven

On Thu, 19 Apr 2018 at 12:06, Hans van den Bogert 
wrote:

> I take it that the first bench is with replication size 2, the second
> bench is with replication size 3? Same for the 4 node OSD scenario?
>
> Also please let us know how you setup block.db and Wal, are they on the
> SSD?
>
> On Thu, Apr 19, 2018, 14:40 Steven Vacaroaia  wrote:
>
>> Sure ..thanks for your willingness to help
>>
>> Identical servers
>>
>> Hardware
>> DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
>> Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
>> (Toshiba PX05SMB040Y)
>>
>>
>> All tests done with the following command
>> rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq
>>
>>
>> ceph osd pool ls detail
>> "pool_name": "rbd",
>> "flags": 1,
>> "flags_names": "hashpspool",
>> "type": 1,
>> "size": 2,
>> "min_size": 1,
>> "crush_rule": 1,
>> "object_hash": 2,
>> "pg_num": 64,
>> "pg_placement_num": 64,
>> "crash_replay_interval": 0,
>> "last_change": "354",
>> "last_force_op_resend": "0",
>> "last_force_op_resend_preluminous": "0",
>> "auid": 0,
>> "snap_mode": "selfmanaged",
>> "snap_seq": 0,
>> "snap_epoch": 0,
>> "pool_snaps": [],
>> "removed_snaps": "[]",
>> "quota_max_bytes": 0,
>> "quota_max_objects": 0,
>> "tiers": [],
>> "tier_of": -1,
>> "read_tier": -1,
>> "write_tier": -1,
>> "cache_mode": "none",
>> "target_max_bytes": 0,
>> "target_max_objects": 0,
>> "cache_target_dirty_ratio_micro": 40,
>> "cache_target_dirty_high_ratio_micro": 60,
>> "cache_target_full_ratio_micro": 80,
>> "cache_min_flush_age": 0,
>> "cache_min_evict_age": 0,
>> "erasure_code_profile": "",
>> "hit_set_params": {
>> "type": "none"
>> },
>> "hit_set_period": 0,
>> "hit_set_count": 0,
>> "use_gmt_hitset": true,
>> "min_read_recency_for_promote": 0,
>> "min_write_recency_for_promote": 0,
>> "hit_set_grade_decay_rate": 0,
>> "hit_set_search_last_n": 0,
>> "grade_table": [],
>> "stripe_width": 0,
>> "expected_num_objects": 0,
>> "fast_read": false,
>> "options": {},
>> "application_metadata": {}
>> }
>>
>>
>> ceph osd crush rule dump
>> [
>> {
>> "rule_id": 0,
>> "rule_name": "replicated_rule",
>> "ruleset": 0,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>> {
>> "op": "take",
>> "item": -1,
>> "item_name": "default"
>> },
>> {
>> "op": "chooseleaf_firstn",
>> "num": 0,
>> "type": "host"
>> },
>> {
>> "op": "emit"
>> }
>> ]
>> },
>> {
>> "rule_id": 1,
>> "rule_name": "rbd",
>> "ruleset": 1,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>> {
>> "op": "take",
>> "item": -9,
>> "item_name": "sas"
>> },
>> {
>> "op": "chooseleaf_firstn",
>> "num": 0,
>> "type": "host"
>> },
>> {
>> "op": "emit"
>> }
>> ]
>> }
>> ]
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2 servers, 2 OSD
>>
>> ceph osd tree
>> ID  CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
>>  -9   4.0 root sas
>> -10   1.0 host osd01-sas
>>   2   hdd 1.0 osd.2  up0 1.0
>> -11   1.0 host osd02-sas
>>   3   hdd 1.0 osd.3  up0 1.0
>> -12   1.0 host osd03-sas
>>   5   hdd 1.0 osd.5  up  1.0 1.0
>> -19   1.0 host osd04-sas
>>   6   hdd 1.0 osd.6  up  1.0 1.0
>>
>>
>> 2018-04-19 09:19:01.266010 min lat: 0.0412473 max lat: 1.03227 avg lat:
>> 0.331163
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>>

Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Steven Vacaroaia
fio is fine and the megacli settings are as below (the device with WT is the SSD)


 Vendor Id  : TOSHIBA

Product Id : PX05SMB040Y

Capacity   : 372.0 GB



Results

Jobs: 20 (f=20): [W(20)] [100.0% done] [0KB/447.1MB/0KB /s] [0/115K/0 iops]
[eta 00m:00s]



Vendor Id  : SEAGATE

Product Id : ST600MM0006

Capacity   : 558.375 GB



Results

Jobs: 10 (f=10): [W(10)] [100.0% done] [0KB/100.5MB/0KB /s] [0/25.8K/0
iops] [eta 00m:00s]




 megacli -LDGetProp -cache -Lall -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone,
Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached,
No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached,
No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached,
No Write Cache if bad BBU

Exit Code: 0x00
[root@osd01 ~]# megacli -LDGetProp -dskcache -Lall -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Disk's Default
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default
Adapter 0-VD 2(target id: 2): Disk Write Cache : Disk's Default
Adapter 0-VD 3(target id: 3): Disk Write Cache : Disk's Default


On Thu, 19 Apr 2018 at 14:22, Hans van den Bogert 
wrote:

> I see, the second one is the read bench. Even in the 2 node scenario the
> read performance is pretty bad. Have you verified the hardware with micro
> benchmarks such as 'fio'? Also try to review storage controller settings.
>
> On Apr 19, 2018 5:13 PM, "Steven Vacaroaia"  wrote:
>
> replication size is always 2
>
> DB/WAL on HDD in this case
>
> I tried with  OSDs with WAL/DB on SSD - they exhibit the same symptoms  (
> cur MB/s 0 )
>
> In summary, it does not matter
> - which server ( any 2 will work better than any 3 or 4)
> - replication size ( it tried with size 2 and 3 )
> - location of WAL/DB ( on separate SSD or same HDD)
>
>
> Thanks
> Steven
>
> On Thu, 19 Apr 2018 at 12:06, Hans van den Bogert 
> wrote:
>
>> I take it that the first bench is with replication size 2, the second
>> bench is with replication size 3? Same for the 4 node OSD scenario?
>>
>> Also please let us know how you setup block.db and Wal, are they on the
>> SSD?
>>
>> On Thu, Apr 19, 2018, 14:40 Steven Vacaroaia  wrote:
>>
>>> Sure ..thanks for your willingness to help
>>>
>>> Identical servers
>>>
>>> Hardware
>>> DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
>>> Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
>>> (Toshiba PX05SMB040Y)
>>>
>>>
>>> All tests done with the following command
>>> rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq
>>>
>>>
>>> ceph osd pool ls detail
>>> "pool_name": "rbd",
>>> "flags": 1,
>>> "flags_names": "hashpspool",
>>> "type": 1,
>>> "size": 2,
>>> "min_size": 1,
>>> "crush_rule": 1,
>>> "object_hash": 2,
>>> "pg_num": 64,
>>> "pg_placement_num": 64,
>>> "crash_replay_interval": 0,
>>> "last_change": "354",
>>> "last_force_op_resend": "0",
>>> "last_force_op_resend_preluminous": "0",
>>> "auid": 0,
>>> "snap_mode": "selfmanaged",
>>> "snap_seq": 0,
>>> "snap_epoch": 0,
>>> "pool_snaps": [],
>>> "removed_snaps": "[]",
>>> "quota_max_bytes": 0,
>>> "quota_max_objects": 0,
>>> "tiers": [],
>>> "tier_of": -1,
>>> "read_tier": -1,
>>> "write_tier": -1,
>>> "cache_mode": "none",
>>> "target_max_bytes": 0,
>>> "target_max_objects": 0,
>>> "cache_target_dirty_ratio_micro": 40,
>>> "cache_target_dirty_high_ratio_micro": 60,
>>> "cache_target_full_ratio_micro": 80,
>>> "cache_min_flush_age": 0,
>>> "cache_min_evict_age": 0,
>>> "erasure_code_profile": "",
>>> "hit_set_params": {
>>> "type": "none"
>>> },
>>> "hit_set_period": 0,
>>> "hit_set_count": 0,
>>> "use_gmt_hitset": true,
>>> "min_read_recency_for_promote": 0,
>>> "min_write_recency_for_promote": 0,
>>> "hit_set_grade_decay_rate": 0,
>>> "hit_set_search_last_n": 0,
>>> "grade_table": [],
>>> "stripe_width": 0,
>>> "expected_num_objects": 0,
>>> "fast_read": false,
>>> "options": {},
>>> "application_metadata": {}
>>> }
>>>
>>>
>>> ceph osd crush rule dump
>>> [
>>> {
>>> "rule_id": 0,
>>> "rule_name": "replicated_rule",
>>> "ruleset": 0,
>>> "type": 1,
>>> "min_size": 1,
>>> "max_size": 10,
>>> "steps": [
>>> {
>>> "op": "take",
>>> "item": -1,
>>> "item_name": "default"
>>> },
>>> 

Re: [ceph-users] ceph luminous 12.2.4 - 2 servers better than 3 ?

2018-04-19 Thread Hans van den Bogert
The last thing I can come up with is doing a 2-node scenario with at least one
of the nodes being a different one. Maybe you've already done that.

But again, even the read performance in the bench you showed for the 2-node
cluster is pretty bad.

The premise of this thread, that a 2-node cluster does work well, is not
true (imo).

Hans

On Thu, Apr 19, 2018, 19:28 Steven Vacaroaia  wrote:

> fio is fine and megacli setings are as below ( device with WT is the SSD)
>
>
>  Vendor Id  : TOSHIBA
>
> Product Id : PX05SMB040Y
>
> Capacity   : 372.0 GB
>
>
>
> Results
>
> Jobs: 20 (f=20): [W(20)] [100.0% done] [0KB/447.1MB/0KB /s] [0/115K/0
> iops] [eta 00m:00s]
>
>
>
> Vendor Id  : SEAGATE
>
> Product Id : ST600MM0006
>
> Capacity   : 558.375 GB
>
>
>
> Results
>
> Jobs: 10 (f=10): [W(10)] [100.0% done] [0KB/100.5MB/0KB /s] [0/25.8K/0
> iops] [eta 00m:00s]
>
>
>
>
>  megacli -LDGetProp -cache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone,
> Direct, Write Cache OK if bad BBU
> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
>
> Exit Code: 0x00
> [root@osd01 ~]# megacli -LDGetProp -dskcache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Disk Write Cache : Disk's Default
> Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default
> Adapter 0-VD 2(target id: 2): Disk Write Cache : Disk's Default
> Adapter 0-VD 3(target id: 3): Disk Write Cache : Disk's Default
>
>
> On Thu, 19 Apr 2018 at 14:22, Hans van den Bogert 
> wrote:
>
>> I see, the second one is the read bench. Even in the 2 node scenario the
>> read performance is pretty bad. Have you verified the hardware with micro
>> benchmarks such as 'fio'? Also try to review storage controller settings.
>>
>> On Apr 19, 2018 5:13 PM, "Steven Vacaroaia"  wrote:
>>
>> replication size is always 2
>>
>> DB/WAL on HDD in this case
>>
>> I tried with  OSDs with WAL/DB on SSD - they exhibit the same symptoms  (
>> cur MB/s 0 )
>>
>> In summary, it does not matter
>> - which server ( any 2 will work better than any 3 or 4)
>> - replication size ( it tried with size 2 and 3 )
>> - location of WAL/DB ( on separate SSD or same HDD)
>>
>>
>> Thanks
>> Steven
>>
>> On Thu, 19 Apr 2018 at 12:06, Hans van den Bogert 
>> wrote:
>>
>>> I take it that the first bench is with replication size 2, the second
>>> bench is with replication size 3? Same for the 4 node OSD scenario?
>>>
>>> Also please let us know how you setup block.db and Wal, are they on the
>>> SSD?
>>>
>>> On Thu, Apr 19, 2018, 14:40 Steven Vacaroaia  wrote:
>>>
 Sure ..thanks for your willingness to help

 Identical servers

 Hardware
 DELL R620, 6 cores, 64GB RAM, 2 x 10 GB ports,
 Enterprise HDD 600GB( Seagate ST600MM0006), Enterprise grade SSD 340GB
 (Toshiba PX05SMB040Y)


 All tests done with the following command
 rados bench -p rbd 50 write --no-cleanup && rados bench -p rbd 50 seq


 ceph osd pool ls detail
 "pool_name": "rbd",
 "flags": 1,
 "flags_names": "hashpspool",
 "type": 1,
 "size": 2,
 "min_size": 1,
 "crush_rule": 1,
 "object_hash": 2,
 "pg_num": 64,
 "pg_placement_num": 64,
 "crash_replay_interval": 0,
 "last_change": "354",
 "last_force_op_resend": "0",
 "last_force_op_resend_preluminous": "0",
 "auid": 0,
 "snap_mode": "selfmanaged",
 "snap_seq": 0,
 "snap_epoch": 0,
 "pool_snaps": [],
 "removed_snaps": "[]",
 "quota_max_bytes": 0,
 "quota_max_objects": 0,
 "tiers": [],
 "tier_of": -1,
 "read_tier": -1,
 "write_tier": -1,
 "cache_mode": "none",
 "target_max_bytes": 0,
 "target_max_objects": 0,
 "cache_target_dirty_ratio_micro": 40,
 "cache_target_dirty_high_ratio_micro": 60,
 "cache_target_full_ratio_micro": 80,
 "cache_min_flush_age": 0,
 "cache_min_evict_age": 0,
 "erasure_code_profile": "",
 "hit_set_params": {
 "type": "none"
 },
 "hit_set_period": 0,
 "hit_set_count": 0,
 "use_gmt_hitset": true,
 "min_read_recency_for_promote": 0,
 "min_write_recency_for_promote": 0,
 "hit_set_grade_decay_rate": 0,
 "hit_set_search_last_n": 0,
 "grade_table": [],
 "stripe_w

[ceph-users] Tens of millions of objects in a sharded bucket

2018-04-19 Thread Robert Stanford
 The rule of thumb is not to have tens of millions of objects in a radosgw
bucket, because reads will be slow.  If using bucket index sharding (with
128 or 256 shards), does this eliminate this concern?  Has anyone tried
tens of millions (20-40M) of objects with sharded indexes?

 Thank you
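
For reference, a sketch of how the sharding itself is usually configured
(the shard count and bucket name are examples only):

# ceph.conf on the rgw hosts: shard the index of newly created buckets
rgw_override_bucket_index_max_shards = 128

# or, on Luminous and later, reshard an existing bucket
radosgw-admin reshard add --bucket=mybucket --num-shards=128
radosgw-admin reshard process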
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph 12.2.4 MGR spams syslog with "mon failed to return metadata for mds"

2018-04-19 Thread Charles Alva
Hi All,

Just noticed on 2 Ceph Luminous 12.2.4 clusters, Ceph mgr spams the syslog
with lots of "mon failed to return metadata for mds" every second.

```
2018-04-20 06:06:03.951412 7fca238ff700  1 mgr send_beacon active
2018-04-20 06:06:04.934477 7fca14809700  0 ms_deliver_dispatch: unhandled
message 0x55bf897f0a00 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
10.100.100.114:6800/4132681434
2018-04-20 06:06:04.934937 7fca25102700  1 mgr finish mon failed to return
metadata for mds.mds1: (2) No such file or directory
```

How can I fix this issue, or disable it completely, to reduce disk I/O and
increase the SSD life span?
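
As a possible stopgap (not a fix for the missing-metadata lookup itself),
the mgr log verbosity can be lowered on the active mgr node via the admin
socket; the mgr id below is a placeholder, and this may only reduce, not
eliminate, the messages:

ceph daemon mgr.<id> config set debug_mgr 1/5
ceph daemon mgr.<id> config set debug_ms 0/0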


Kind regards,

Charles Alva
Sent from Gmail Mobile
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com