Re: [ceph-users] wal and db device on SSD partitions?

2018-03-21 Thread Caspar Smit
2018-03-21 7:20 GMT+01:00 ST Wong (ITSC) :

> Hi all,
>
>
>
> We got some decommissioned servers from other projects for setting up
> OSDs.  They have 10 2TB SAS disks and 4 2TB SSDs.
>
> We want to test BlueStore and hope to place the WAL and DB devices on
> SSD.  We need advice on some newbie questions:
>
>
>
> 1. As there are more SAS disks than SSDs, is it possible/recommended to put
> the WAL and DB of multiple OSDs in partitions on the same SSD (instead of
> using the whole SSD device)?
>
>
Yes, that's common practice, although the general advice is no more than 5
OSDs per WAL/DB SSD.


> 2. if ok to do 1., how to size the SSD partition for wal/db ?
>

~10 GB of DB per TB of OSD capacity; in your case, 20 GB partitions for the 2TB drives.
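A quick arithmetic sketch of the setup described above (the even spread of OSDs across the SSDs is my assumption):

```python
import math

def db_partitions_per_ssd(num_osds, num_ssds):
    """WAL/DB partitions each SSD carries if OSDs are spread evenly."""
    return math.ceil(num_osds / num_ssds)

def db_partition_gb(osd_capacity_tb, gb_per_tb=10):
    """Partition size under the rough 10 GB-per-TB rule of thumb."""
    return osd_capacity_tb * gb_per_tb

# 10 SAS OSDs over 4 SSDs -> at most 3 WAL/DB partitions per SSD,
# comfortably within the ~5-OSDs-per-SSD advice; 2TB OSDs -> 20 GB each.
print(db_partitions_per_ssd(10, 4))  # -> 3
print(db_partition_gb(2))            # -> 20
```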


> 3. do we need to tune parameters like bluestore_cache_size_ssd when SSD
> is used in OSD?
>
>
>

Only if you're low on RAM.
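For reference, these are the knobs in question. The defaults below are the Luminous ones as I understand them, and the lowered values are purely illustrative — the point is only that you would shrink them when a node is short on RAM:

```ini
[osd]
# BlueStore cache per OSD (Luminous defaults: 1 GiB on HDD-backed,
# 3 GiB on SSD-backed OSDs). Lower only if RAM is tight, e.g.:
bluestore_cache_size_hdd = 536870912    # 512 MiB instead of 1 GiB
bluestore_cache_size_ssd = 1073741824   # 1 GiB instead of 3 GiB
```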

Caspar

Thanks a lot.
>
> Best Regards,
>
> /ST Wong
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


[ceph-users] Separate BlueStore WAL/DB : best scenario ?

2018-03-21 Thread Hervé Ballans

Hi all,

I have a question regarding a possible scenario to put both the WAL and DB
on a separate SSD device, for an OSD node composed of 22 OSDs (1.8 TB 10k
SAS HDDs).


I'm thinking of 2 options (at about the same price) :

- add 2 write-intensive (10 DWPD) SAS SSDs

- or add a single 800 GB NVMe SSD (it's the minimum capacity currently
on the market !..)


In both cases, that's a lot of partitions on each SSD, especially in
the second option, where we would have 44 partitions (22 WAL and 22 DB) !


Is this solution workable (I mean in terms of I/O speed), or is it
unsafe despite the high PCIe bus transfer rate ?


I just want to talk here about throughput performance, not data
integrity on the node in case of SSD crashes...


Thanks in advance for your advice,

Hervé



Re: [ceph-users] wal and db device on SSD partitions?

2018-03-21 Thread Ján Senko
2018-03-21 8:56 GMT+01:00 Caspar Smit :

> 2018-03-21 7:20 GMT+01:00 ST Wong (ITSC) :
>
>> Hi all,
>>
>>
>>
>> We got some decommissioned servers from other projects for setting up
>> OSDs.  They have 10 2TB SAS disks and 4 2TB SSDs.
>>
>> We want to test BlueStore and hope to place the WAL and DB devices on
>> SSD.  We need advice on some newbie questions:
>>
>>
>>
>> 1. As there are more SAS disks than SSDs, is it possible/recommended to put
>> the WAL and DB of multiple OSDs in partitions on the same SSD (instead of
>> using the whole SSD device)?
>>
>>
> Yes, that's common practice, although the general advice is no more than 5
> OSDs per WAL/DB SSD.
>

I understand this was general advice in the FileStore era, because *all* writes
had to go through the journal first.
However, IIRC, in BlueStore only some writes need to go through the DB. Is this
number therefore different now?


>
>
>> 2. if ok to do 1., how to size the SSD partition for wal/db ?
>>
>
> ~10 GB of DB per TB of OSD capacity; in your case, 20 GB partitions for the 2TB drives.
>
>

Again, this was common for FileStore journals. As we know, BlueStore can
live without a separate DB, and will store excess DB data on the data drive
itself. What is the current recommendation?
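For what it's worth, later BlueStore sizing guidance is often quoted as a percentage of the data device rather than a flat 10 GB/TB. The 4% figure below is my assumption from memory — check the current docs — but the comparison shows how far apart the two rules land:

```python
def db_size_gb_pct(data_tb, pct=4):
    """block.db size under a 'percentage of the data device' guideline."""
    return data_tb * 1000 * pct / 100

def db_size_gb_flat(data_tb, gb_per_tb=10):
    """block.db size under the flat 10 GB-per-TB rule of thumb."""
    return data_tb * gb_per_tb

# For a 2 TB OSD: 4% guidance -> 80 GB, the flat rule -> only 20 GB.
print(db_size_gb_pct(2))   # -> 80.0
print(db_size_gb_flat(2))  # -> 20
```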


> 3. do we need to tune parameters like bluestore_cache_size_ssd when SSD
>> is used in OSD?
>>
>>
>>
>
> Only if you're low on RAM.
>
> Caspar
>
> Thanks a lot.
>>
>> Best Regards,
>>
>> /ST Wong
>>


-- 
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-21 Thread Martin Palma
Just ran into this problem on our production cluster.

It would have been nice if the release notes for 12.2.4 had been
updated to inform users about this.

Best,
Martin

On Wed, Mar 14, 2018 at 9:53 PM, Gregory Farnum  wrote:
> On Wed, Mar 14, 2018 at 12:41 PM, Lars Marowsky-Bree  wrote:
>> On 2018-03-14T06:57:08, Patrick Donnelly  wrote:
>>
>>> Yes. But the real outcome is not "no MDS [is] active" but "some or all
>>> metadata I/O will pause" -- and there is no avoiding that. During an
>>> MDS upgrade, a standby must take over the MDS being shutdown (and
>>> upgraded).  During takeover, metadata I/O will briefly pause as the
>>> rank is unavailable. (Specifically, no other rank can obtain locks or
>>> communicate with the "failed" rank; so metadata I/O will necessarily
>>> pause until a standby takes over.) Single active vs. multiple active
>>> upgrade makes little difference in this outcome.
>>
>> Fair, except that there's no standby MDS at this time in case the update
>> goes wrong.
>>
>>> > Is another approach theoretically feasible? Have the updated MDS only go
>>> > into the incompatible mode once there's a quorum of new ones available,
>>> > or something?
>>> I believe so, yes. That option wasn't explored for this patch because
>>> it was just disambiguating the compatibility flags and the full
>>> side-effects weren't realized.
>>
>> Would such a patch be accepted if we ended up pursuing this? Any
>> suggestions on how to best go about this?
>
> It'd be ugly, but you'd have to set it up so that
> * new MDSes advertise the old set of required values
> * but can identify when all the MDSes are new
> * then mark somewhere that they can use the correct values
> * then switch to the proper requirements
>
> I don't remember the details of this CompatSet code any more, and it's
> definitely made trickier by the MDS having no permanent local state.
> Since we do luckily have both the IDs and the strings, you might be
> able to do something in the MDSMonitor to identify whether booting
> MDSes have "too-old", "old-featureset-but-support-new-feature", or
> "new, correct feature advertising" and then either massage that
> incoming message down to the "old-featureset-but-support-new-feature"
> (if not all the MDSes are new) or do an auto-upgrade of the required
> features in the map. And you might also need compatibility code in the
> MDS to make sure it sends out the appropriate bits on connection, but
> I *think* the CompatSet checks are only done on the monitor and when
> an MDS receives an MDSMap.
> -Greg


Re: [ceph-users] Separate BlueStore WAL/DB : best scenario ?

2018-03-21 Thread Ronny Aasen

On 21. mars 2018 11:27, Hervé Ballans wrote:

Hi all,

I have a question regarding a possible scenario to put both the WAL and DB
on a separate SSD device, for an OSD node composed of 22 OSDs (1.8 TB 10k
SAS HDDs).


I'm thinking of 2 options (at about the same price) :

- add 2 write-intensive (10 DWPD) SAS SSDs

- or add a single 800 GB NVMe SSD (it's the minimum capacity currently
on the market !..)


In both cases, that's a lot of partitions on each SSD, especially in
the second option, where we would have 44 partitions (22 WAL and 22 DB) !


Is this solution workable (I mean in terms of I/O speed), or is it
unsafe despite the high PCIe bus transfer rate ?


I just want to talk here about throughput performance, not data
integrity on the node in case of SSD crashes...


Thanks in advance for your advice,




If you put the WAL and DB on the same device anyway, there is no real
benefit to having a separate partition for each. The reason you can split
them up is for when they are on different devices, e.g. DB on SSD but WAL
on NVRAM. It is easier to just colocate the WAL and DB in the same
partition, since they live on the same device in your case anyway.


If you put too many OSDs' DBs on the same SSD, you may end up with the
SSD being the bottleneck. Four OSDs' DBs per SSD has been a "golden
rule" on this mailing list for a while; for NVRAM you can possibly go
somewhat higher.


But the bottleneck is only one part of the problem. When the NVMe holding
22 DB partitions dies, it brings down 22 OSDs at once, which will be a
huge pain for your cluster (depending on how large it is...).
I would spread the DBs across more devices to reduce both the bottleneck
and the failure domain in this situation.
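The blast-radius trade-off between the two options can be made concrete with a trivial sketch (device counts are from the mail above; the even-spread assumption is mine):

```python
def osds_lost_on_device_failure(total_osds, db_devices):
    """OSDs taken down when one shared WAL/DB device fails,
    assuming OSDs are spread evenly across the DB devices."""
    return -(-total_osds // db_devices)  # ceiling division

# 22 OSDs on a single NVMe: one device failure kills all 22.
# The same 22 OSDs split over 2 SSDs: one failure kills 11.
print(osds_lost_on_device_failure(22, 1))  # -> 22
print(osds_lost_on_device_failure(22, 2))  # -> 11
```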



kind regards
Ronny Aasen




[ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Willem Jan Withagen
Hi,

I just ran into this table for a 10G Netgear switch we use:

Fiber delays:
10 Gbps fiber delay (64-byte packets): 1.827 µs
10 Gbps fiber delay (512-byte packets): 1.919 µs
10 Gbps fiber delay (1024-byte packets): 1.971 µs
10 Gbps fiber delay (1518-byte packets): 1.905 µs

Copper delays:
10 Gbps copper delay (64-byte packets): 2.728 µs
10 Gbps copper delay (512-byte packets): 2.85 µs
10 Gbps copper delay (1024-byte packets): 2.904 µs
10 Gbps copper delay (1518-byte packets): 2.841 µs

Fiber delays:
1 Gbps fiber delay (64-byte packets): 2.289 µs
1 Gbps fiber delay (512-byte packets): 2.393 µs
1 Gbps fiber delay (1024-byte packets): 2.423 µs
1 Gbps fiber delay (1518-byte packets): 2.379 µs

Copper delays:
1 Gbps copper delay (64-byte packets): 2.707 µs
1 Gbps copper delay (512-byte packets): 2.821 µs
1 Gbps copper delay (1024-byte packets): 2.866 µs
1 Gbps copper delay (1518-byte packets): 2.826 µs

So the difference is serious: 900ns out of a total of 1900ns for a 10G packet.
Another strange thing is that 1024-byte packets are slower than 1518-byte ones.

So that might warrant connecting boxes preferably with optics
instead of CAT cabling if you are trying to squeeze the max out of a setup.

Sad thing is that they do not report figures for jumbo frames, and doing these
measurements yourself is not easy...
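A rough sanity check of these figures: at 10 Gbps, just serializing a 1518-byte frame onto the wire accounts for about 1.2 µs, so the quoted fiber latency looks consistent with a store-and-forward switch, with the copper PHY adding roughly another 0.9 µs on top:

```python
def serialization_delay_us(packet_bytes, link_bps):
    """Time to clock a packet onto the wire, ignoring switch processing."""
    return packet_bytes * 8 / link_bps * 1e6

# 1518-byte frame at 10 Gbps: ~1.21 µs of pure serialization time,
# the bulk of the 1.905 µs the vendor quotes for fiber.
print(round(serialization_delay_us(1518, 10e9), 2))  # -> 1.21
```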

--WjW



Re: [ceph-users] Object Gateway - Server Side Encryption

2018-03-21 Thread Vik Tara


On 15/03/18 10:45, Vik Tara wrote:
>
> On 14/03/18 12:31, Amardeep Singh wrote:
>
>> Though I have now another issue because I am using Multisite setup
>> with one zone for data and second zone for metadata with elastic
>> search tier.
>>
>> http://docs.ceph.com/docs/master/radosgw/elastic-sync-module/
>>
>> When a document is encrypted, its metadata is not pushed to
>> Elasticsearch; if a document is uploaded without encryption, it works
>> fine.
>>
>>  /2018-03-14 15:48:02.397490 7f0b4cbce700 20
>> cr:s=0x560a726c4000:op=0x560a7276e800:20RGWPutRESTResourceCRI15es_obj_metadataiE:
>> operate()
>> 2018-03-14 15:48:02.397492 7f0b4cbce700 20
>> cr:s=0x560a726c4000:op=0x560a7276e800:20RGWPutRESTResourceCRI15es_obj_metadataiE:
>> operate()
>> *2018-03-14 15:48:02.397633 7f0b4cbce700 20 sending request to
>> http://192.168.95.60:9200/newbucket/object/ee560b67-c330-4fd0-af50-aefff93735d2.4163.1:testdocument:null*
>> 2018-03-14 15:48:02.397653 7f0b4cbce700 20 register_request
>> mgr=0x560a720d5d58 req_data->id=1759, easy_handle=0x560a7348a000
>> 2018-03-14 15:48:02.397666 7f0b4cbce700 20 run: stack=0x560a726c4000
>> is io blocked
>> 2018-03-14 15:48:02.397685 7f0b4b3cb700 20 link_request
>> req_data=0x560a727fae00 req_data->id=1758, easy_handle=0x560a733e6000
>> 2018-03-14 15:48:02.397725 7f0b4b3cb700 20 link_request
>> req_data=0x560a72f31e00 req_data->id=1759, easy_handle=0x560a7348a000
>> 2018-03-14 15:48:02.398609 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.398631 7f0b4b3cb700 10 received header:HTTP/1.1
>> 100 Continue
>> 2018-03-14 15:48:02.398636 7f0b4b3cb700 10 received header:HTTP/1.1
>> 2018-03-14 15:48:02.398638 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.398639 7f0b4b3cb700 10 received header:
>> 2018-03-14 15:48:02.398659 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.398661 7f0b4b3cb700 10 received header:HTTP/1.1
>> 100 Continue
>> 2018-03-14 15:48:02.398662 7f0b4b3cb700 10 received header:HTTP/1.1
>> 2018-03-14 15:48:02.398663 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.398664 7f0b4b3cb700 10 received header:
>> 2018-03-14 15:48:02.443530 7f0b4b3cb700 10 receive_http_header
>> *2018-03-14 15:48:02.443556 7f0b4b3cb700 10 received header:HTTP/1.1
>> 400 Bad Request*
>> 2018-03-14 15:48:02.443563 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.443565 7f0b4b3cb700 10 received header:Warning:
>> 299 Elasticsearch-5.6.2-57e20f3 "Content type detection for rest
>> requests is deprecated. Specify the content type using the
>> [Content-Type] header." "Wed, 14 Mar 2018 10:17:35 GMT"
>> 2018-03-14 15:48:02.443574 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.443575 7f0b4b3cb700 10 received
>> header:content-type: application/json; charset=UTF-8
>> 2018-03-14 15:48:02.443588 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.443591 7f0b4b3cb700 10 received
>> header:content-length: 374
>> 2018-03-14 15:48:02.443594 7f0b4b3cb700 10 receive_http_header
>> 2018-03-14 15:48:02.443595 7f0b4b3cb700 10 received header:
>> 2018-03-14 15:48:02.443663 7f0b4cbce700 20
>> cr:s=0x560a732f4d20:op=0x560a72fa8000:20RGWPutRESTResourceCRI15es_obj_metadataiE:
>> operate()
>> 2018-03-14 15:48:02.443675 7f0b4cbce700  5 failed to wait for op,
>> ret=-22: PUT
>> http://192.168.95.60:9200/newbucket/object/ee560b67-c330-4fd0-af50-aefff93735d2.4163.1:testdocument:null
>> 2018-03-14 15:48:02.443754 7f0b4cbce700 20
>> cr:s=0x560a732f4d20:op=0x560a72fa8000:20RGWPutRESTResourceCRI15es_obj_metadataiE:
>> operate() returned r=-22
>> 2018-03-14 15:48:02.443773 7f0b4cbce700 20
>> cr:s=0x560a732f4d20:op=0x560a7276c800:29RGWElasticHandleRemoteObjCBCR:
>> operate()
>> 2018-03-14 15:48:02.443787 7f0b4cbce700 20
>> cr:s=0x560a732f4d20:op=0x560a7276c800:29RGWElasticHandleRemoteObjCBCR:
>> operate() returned r=-22
>> 2018-03-14 15:48:02.443791 7f0b4cbce700 20
>> cr:s=0x560a732f4d20:op=0x560a72efb800:27RGWElasticHandleRemoteObjCR:
>> operate()/
>
> This change 7 days ago looks like it deals with the encoding that ES
> requires.
>
> https://github.com/ceph/ceph/pull/20707/files/13978bb28b7be809033bf24550b21ed2713ddc9b
>
> I'll make a build that incorporates this commit so you can test it :)

After testing, it looks like that commit does not solve the issue with
indexing encrypted objects - will raise a bug for this / move the
discussion to the dev list.



Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Paul Emmerich
Hi,

2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+ DAC
connections should be faster: switches are typically in the range of ~500ns
to 1µs.


But you'll find that this small difference in latency induced by the switch
will be quite irrelevant in the grand scheme of things when using the Linux
network stack...

Paul

2018-03-21 12:16 GMT+01:00 Willem Jan Withagen :

> Hi,
>
> I just ran into this table for a 10G Netgear switch we use:
>
> Fiber delays:
> 10 Gbps fiber delay (64-byte packets): 1.827 µs
> 10 Gbps fiber delay (512-byte packets): 1.919 µs
> 10 Gbps fiber delay (1024-byte packets): 1.971 µs
> 10 Gbps fiber delay (1518-byte packets): 1.905 µs
>
> Copper delays:
> 10 Gbps copper delay (64-byte packets): 2.728 µs
> 10 Gbps copper delay (512-byte packets): 2.85 µs
> 10 Gbps copper delay (1024-byte packets): 2.904 µs
> 10 Gbps copper delay (1518-byte packets): 2.841 µs
>
> Fiber delays:
> 1 Gbps fiber delay (64-byte packets): 2.289 µs
> 1 Gbps fiber delay (512-byte packets): 2.393 µs
> 1 Gbps fiber delay (1024-byte packets): 2.423 µs
> 1 Gbps fiber delay (1518-byte packets): 2.379 µs
>
> Copper delays:
> 1 Gbps copper delay (64-byte packets): 2.707 µs
> 1 Gbps copper delay (512-byte packets): 2.821 µs
> 1 Gbps copper delay (1024-byte packets): 2.866 µs
> 1 Gbps copper delay (1518-byte packets): 2.826 µs
>
> So the difference is serious: 900ns out of a total of 1900ns for a 10G packet.
> Another strange thing is that 1024-byte packets are slower than 1518-byte ones.
>
> So that might warrant connecting boxes preferably with optics
> instead of CAT cabling if you are trying to squeeze the max out of a
> setup.
>
> Sad thing is that they do not report figures for jumbo frames, and doing these
> measurements yourself is not easy...
>
> --WjW
>



-- 
-- 
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Prometheus RADOSGW usage exporter

2018-03-21 Thread Berant Lemmenes
My apologies, I don't seem to be getting notifications on PRs. I'll review
this week.

Thanks,
Berant

On Mon, Mar 19, 2018 at 5:55 AM, Konstantin Shalygin  wrote:

> Hi Berant
>
>
> I've created prometheus exporter that scrapes the RADOSGW Admin Ops API and
>> exports the usage information for all users and buckets. This is my first
>> prometheus exporter so if anyone has feedback I'd greatly appreciate it.
>> I've tested it against Hammer, and will shortly test against Jewel; though
>> looking at the docs it should work fine for Jewel as well.
>>
>> https://github.com/blemmenes/radosgw_usage_exporter
>>
>
>
> It would be nice if you could take a look at the PRs.
>
>
>
>
> k
>
>


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Willem Jan Withagen
On 21-3-2018 13:47, Paul Emmerich wrote:
> Hi,
> 
> 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
> DAC connections should be faster: switches are typically in the range of
> ~500ns to 1µs.
> 
> 
> But you'll find that this small difference in latency induced by the
> switch will be quite irrelevant in the grand scheme of things when using
> the Linux network stack...

But I think it does matter when people start to worry about selecting
high clock-speed CPUs versus packages with more cores...

900ns is quite a lot if you have that mindset.
And probably 1800ns at that, because the delay will be at both ends.
Or perhaps even 3600ns, because the delay is added at every Ethernet
connector?

But I'm inclined to believe you that the network stack could take quite
some time...


--WjW


> Paul
> 
> 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen  >:
> 
> Hi,
> 
> I just ran into this table for a 10G Netgear switch we use:
> 
> Fiber delays:
> 10 Gbps fiber delay (64-byte packets): 1.827 µs
> 10 Gbps fiber delay (512-byte packets): 1.919 µs
> 10 Gbps fiber delay (1024-byte packets): 1.971 µs
> 10 Gbps fiber delay (1518-byte packets): 1.905 µs
>
> Copper delays:
> 10 Gbps copper delay (64-byte packets): 2.728 µs
> 10 Gbps copper delay (512-byte packets): 2.85 µs
> 10 Gbps copper delay (1024-byte packets): 2.904 µs
> 10 Gbps copper delay (1518-byte packets): 2.841 µs
>
> Fiber delays:
> 1 Gbps fiber delay (64-byte packets): 2.289 µs
> 1 Gbps fiber delay (512-byte packets): 2.393 µs
> 1 Gbps fiber delay (1024-byte packets): 2.423 µs
> 1 Gbps fiber delay (1518-byte packets): 2.379 µs
>
> Copper delays:
> 1 Gbps copper delay (64-byte packets): 2.707 µs
> 1 Gbps copper delay (512-byte packets): 2.821 µs
> 1 Gbps copper delay (1024-byte packets): 2.866 µs
> 1 Gbps copper delay (1518-byte packets): 2.826 µs
>
> So the difference is serious: 900ns out of a total of 1900ns for a 10G
> packet.
> Another strange thing is that 1024-byte packets are slower than 1518-byte ones.
>
> So that might warrant connecting boxes preferably with optics
> instead of CAT cabling if you are trying to squeeze the max out of
> a setup.
>
> Sad thing is that they do not report figures for jumbo frames, and doing these
> measurements yourself is not easy...
> 
> --WjW
> 
> 
> 
> 
> 
> 
> -- 
> -- 
> Paul Emmerich
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90



[ceph-users] IO rate-limiting with Ceph RBD (and libvirt)

2018-03-21 Thread Andre Goree
I'm trying to determine the best way to go about configuring IO 
rate-limiting for individual images within an RBD pool.


Here [1], I've found that OpenStack appears to use libvirt's "iotune"
parameter; however, I seem to recall reading about being able to do so
via Ceph's settings.


Is there a place in Ceph to set IO limits on individual images within an 
RBD pool?  Thanks in advance for the advice.


[1] https://ceph.com/geen-categorie/openstack-ceph-rbd-and-qos/
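For what it's worth, I don't believe RBD itself exposes per-image IOPS limits in this Luminous-era time frame (librbd QoS options arrived in later releases), so the usual approach is indeed libvirt's iotune, enforced by QEMU per guest disk. A minimal sketch of a throttled RBD disk in a domain definition — the pool/image name, target device, and limit values are purely illustrative:

```xml
<disk type='network' device='disk'>
  <source protocol='rbd' name='rbd/test-image'/>
  <!-- throttle this guest disk in libvirt/QEMU, not in Ceph -->
  <iotune>
    <total_iops_sec>500</total_iops_sec>
    <total_bytes_sec>52428800</total_bytes_sec>  <!-- 50 MB/s -->
  </iotune>
  <target dev='vda' bus='virtio'/>
</disk>
```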

--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-


[ceph-users] Bluestore cluster, bad IO perf on blocksize<64k... could it be throttling ?

2018-03-21 Thread Frederic BRET

Hi all,

The context :
- Test cluster aside production one
- Fresh install on Luminous
- choice of Bluestore (coming from Filestore)
- Default config (including wpq queuing)
- 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb nodes, far more Gb at each 
switch uplink...

- R3 pool, 2 nodes per site
- separate db (25GB) and wal (600MB) partitions on SSD for each OSD to 
be able to observe each kind of IO with iostat

- RBD client fio --ioengine=libaio --iodepth=128 --direct=1
- RBD client :  rbd map rbd/test_rbd -o queue_depth=1024
- Just to point out, this is not a thread on SSD performance or on the
match between SSDs and number of OSDs. These 12Gb SAS 10DWPD SSDs perform
perfectly, with lots of headroom, on the production cluster, even with
XFS filestore and journals on SSDs.
- This thread is about a possible bottleneck on small block sizes with
rocksdb/wal/Bluestore.


To begin with, Bluestore performance is really breathtaking compared to
filestore/XFS : we saturate the 20Gb client bandwidth on this small
test cluster as soon as the IO blocksize reaches 64k, a thing we couldn't
achieve with Filestore and journals, even at 256k.


The downside : all small IO blocksizes (4k, 8k, 16k, 32k) are considerably
slower and appear somewhat capped.


Just to compare, here are observed latencies at 2 consecutive values for 
blocksize 64k and 32k :

64k :
 write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
lat (msec): min=2, max=867, avg=17.29, stdev=32.31

32k :
 write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
lat (msec): min=1, max=5111, avg=78.81, stdev=430.50

Whereas the 64k run almost fills the 20Gb client connection, the 32k
run gets only a mere 1/10th of the bandwidth, and IO latencies
are multiplied by 4.5 (or get a ~60ms pause ? ... )


And we see the same constant latency at 16k, 8k and 4k :
16k :
 write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08

8k :
 write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61

4k :
 write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
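The cap is easy to see by converting fio's IOPS back into bandwidth; a quick check of the figures above:

```python
def bw_mb_s(iops, blocksize_kb):
    """Back out bandwidth (MB/s) from fio's IOPS at a given block size."""
    return iops * blocksize_kb / 1024

# 64k: 29586 IOPS -> ~1849 MB/s (saturating ~20Gb/s of client bandwidth)
# 32k:  6488 IOPS -> ~203 MB/s, roughly 1/9th of the 64k figure, while
# 16k/8k/4k all sit at the same ~6500 IOPS regardless of block size.
print(round(bw_mb_s(29586, 64)))  # -> 1849
print(round(bw_mb_s(6488, 32)))   # -> 203
```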

To compare with filestore : on the 4k IO results I have on hand from the
previous install, we were getting almost 2x the Bluestore perfs on the
exact same cluster :

WRITE: io=1221.4MB, aggrb=41477KB/,s maxt=30152msec

The thing is, during these small-blocksize fio benchmarks, neither the nodes'
CPUs, OSDs, SSDs, nor of course the network are saturated (ie. I think this
has nothing to do with write amplification), yet client IOPS
starve at low values.

Shouldn't Bluestore IOPs be far higher than Filestore on small IOs too ?

To summarize, here is what we can observe :


Seeking counters, I found in "perf dump" incrementing values with slow 
IO benchs, here for 1 run of 4k fio :

   "deferred_write_ops": 7631,
   "deferred_write_bytes": 31457280,

Does this mean throttling or some other QoS mechanism may be the cause, and
that default config values may be artificially limiting small-IO performance
on our architecture ? And has anyone an idea on how to circumvent it ?
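The deferred_write_* counters point at BlueStore's deferred-write path: writes at or below bluestore_prefer_deferred_size_hdd (32 KiB by default on HDD-backed OSDs) are committed to the WAL first and flushed to the data device later, which lines up with the 64k-vs-32k cliff described above — though that correlation is my reading, not a confirmed diagnosis. A sketch of the related options, with what I understand to be the Luminous defaults; treat changing them as an experiment, not a recommendation:

```ini
[osd]
# Writes <= this size take the deferred (WAL-first) path on HDD OSDs.
bluestore_prefer_deferred_size_hdd = 32768
# Throttles on outstanding BlueStore transactions.
bluestore_throttle_bytes = 67108864
bluestore_throttle_deferred_bytes = 134217728
```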


The OSD Config Reference documentation may touch on these aspects in
the QoS/mClock/Caveats section, but I'm not sure I understand the whole
picture.


Could someone help ?

Thanks
Frederic




Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-21 Thread Kjetil Joergensen
I retract my previous statement(s).

My current suspicion is that this isn't a leak so much as it being
load-driven; after enough waiting, it generally seems to settle around
some equilibrium. We do seem to sit at mempools x 2.4 ~ ceph-osd RSS,
which is on the higher side (I see documentation alluding to expecting
~1.5x).

-KJ

On Mon, Mar 19, 2018 at 3:05 AM, Konstantin Shalygin  wrote:

>
> We don't run compression as far as I know, so that wouldn't be it. We do
>> actually run a mix of bluestore & filestore - due to the rest of the
>> cluster predating a stable bluestore by some amount.
>>
>
>
> 12.2.2 -> 12.2.4 at 2018/03/10: I don't see increase of memory usage. No
> any compressions of course.
>
>
> http://storage6.static.itmages.com/i/18/0319/h_1521453809_
> 9131482_859b1fb0a5.png
>
>
>
>
> k
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Subhachandra Chandra
Latency is a concern if your application is sending one packet at a time
and waiting for a reply. If you are streaming large blocks of data, the
first packet is delayed by the network latency but after that you will
receive a 10Gbps stream continuously. The latency for jumbo frames vs 1500
byte frames depends upon the switch type. On a cut-through switch there is
very little difference but on a store-and-forward switch it will be
proportional to packet size. Most modern switching ASICs are capable of
cut-through operation.
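The per-hop cost of store-and-forward can be estimated from the serialization time alone, since the switch must buffer the whole frame before forwarding it; a small sketch (the link speed and frame sizes are illustrative):

```python
def store_and_forward_delay_us(packet_bytes, link_bps):
    """Extra per-hop delay when a switch buffers the whole frame
    before forwarding (vs. cut-through, which forwards after the header)."""
    return packet_bytes * 8 / link_bps * 1e6

# At 10 Gbps: a 1518-byte frame adds ~1.2 µs per store-and-forward hop,
# a 9000-byte jumbo frame ~7.2 µs; a cut-through switch avoids both.
print(round(store_and_forward_delay_us(1518, 10e9), 1))  # -> 1.2
print(round(store_and_forward_delay_us(9000, 10e9), 1))  # -> 7.2
```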

Subhachandra

On Wed, Mar 21, 2018 at 7:15 AM, Willem Jan Withagen 
wrote:

> On 21-3-2018 13:47, Paul Emmerich wrote:
> > Hi,
> >
> > 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
> > DAC connections should be faster: switches are typically in the range of
> > ~500ns to 1µs.
> >
> >
> > But you'll find that this small difference in latency induced by the
> > switch will be quite irrelevant in the grand scheme of things when using
> > the Linux network stack...
>
> But I think it does matter when people start to worry about selecting
> high clock-speed CPUs versus packages with more cores...
>
> 900ns is quite a lot if you have that mindset.
> And probably 1800ns at that, because the delay will be at both ends.
> Or perhaps even 3600ns, because the delay is added at every Ethernet
> connector?
>
> But I'm inclined to believe you that the network stack could take quite
> some time...
>
>
> --WjW
>
>
> > Paul
> >
> > 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen  > >:
> >
> > Hi,
> >
> > I just ran into this table for a 10G Netgear switch we use:
> >
> > Fiber delays:
> > 10 Gbps fiber delay (64-byte packets): 1.827 µs
> > 10 Gbps fiber delay (512-byte packets): 1.919 µs
> > 10 Gbps fiber delay (1024-byte packets): 1.971 µs
> > 10 Gbps fiber delay (1518-byte packets): 1.905 µs
> >
> > Copper delays:
> > 10 Gbps copper delay (64-byte packets): 2.728 µs
> > 10 Gbps copper delay (512-byte packets): 2.85 µs
> > 10 Gbps copper delay (1024-byte packets): 2.904 µs
> > 10 Gbps copper delay (1518-byte packets): 2.841 µs
> >
> > Fiber delays:
> > 1 Gbps fiber delay (64-byte packets): 2.289 µs
> > 1 Gbps fiber delay (512-byte packets): 2.393 µs
> > 1 Gbps fiber delay (1024-byte packets): 2.423 µs
> > 1 Gbps fiber delay (1518-byte packets): 2.379 µs
> >
> > Copper delays:
> > 1 Gbps copper delay (64-byte packets): 2.707 µs
> > 1 Gbps copper delay (512-byte packets): 2.821 µs
> > 1 Gbps copper delay (1024-byte packets): 2.866 µs
> > 1 Gbps copper delay (1518-byte packets): 2.826 µs
> >
> > So the difference is serious: 900ns out of a total of 1900ns for a 10G
> > packet.
> > Other strange thing is that 1024-byte packets are slower than 1518-byte ones.
> >
> > So that might warrant connecting boxes preferably with optics
> > instead of CAT cabling if you are trying to squeeze the max out of
> > a setup.
> >
> > Sad thing is that they do not report figures for jumbo frames, and doing
> > these measurements yourself is not easy...
> >
> > --WjW
> >
> >
> >
> >
> >
> > --
> > --
> > Paul Emmerich
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io 
> > Tel: +49 89 1896585 90
>


[ceph-users] DELL R620 - SSD recommendation

2018-03-21 Thread Steven Vacaroaia
Hi,

It would be appreciated if you could recommend some SSD models (200GB or
less).

I am planning to deploy 2 SSDs and 6 HDDs (for a 1-to-3 ratio) in a few Dell
R620s with 64GB RAM.

Also, what is the highest HDD capacity that you were able to use in the
R620 ?


Note :
I apologize for asking "research easy" kind of questions, but I am hoping
for confirmed / hands-on info / details.

Many thanks
Steven


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Subhachandra Chandra
Looking at the latency numbers in this thread, it seems to be a cut-through
switch.

Subhachandra

On Wed, Mar 21, 2018 at 12:58 PM, Subhachandra Chandra <
schan...@grailbio.com> wrote:

> Latency is a concern if your application is sending one packet at a time
> and waiting for a reply. If you are streaming large blocks of data, the
> first packet is delayed by the network latency but after that you will
> receive a 10Gbps stream continuously. The latency for jumbo frames vs 1500
> byte frames depends upon the switch type. On a cut-through switch there is
> very little difference but on a store-and-forward switch it will be
> proportional to packet size. Most modern switching ASICs are capable of
> cut-through operation.
>
> Subhachandra
>
> On Wed, Mar 21, 2018 at 7:15 AM, Willem Jan Withagen 
> wrote:
>
>> On 21-3-2018 13:47, Paul Emmerich wrote:
>> > Hi,
>> >
>> > 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
>> > DAC connections should be faster: switches are typically in the range of
>> > ~500ns to 1µs.
>> >
>> >
>> > But you'll find that this small difference in latency induced by the
>> > switch will be quite irrelevant in the grand scheme of things when using
>> > the Linux network stack...
>>
>> But I think it does when people start to worry about selecting high
>> clock-speed CPUs versus packages with more cores...
>>
>> 900 ns is quite a lot if you have that mindset.
>> And probably 1800 ns at that, because the delay will be at both ends.
>> Or perhaps even 3600 ns, because the delay is added at every Ethernet
>> connector???
>>
>> But I'm inclined to believe you that the network stack could take quite
>> some time...
>>
>>
>> --WjW
>>
>>
>> > Paul
>> >
>> > 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen:
>> >
>> > Hi,
>> >
>> > I just ran into this table for a 10G Netgear switch we use:
>> >
>> > Fiber delays:
>> > 10 Gbps fiber delay (64-byte packets): 1.827 µs
>> > 10 Gbps fiber delay (512-byte packets): 1.919 µs
>> > 10 Gbps fiber delay (1024-byte packets): 1.971 µs
>> > 10 Gbps fiber delay (1518-byte packets): 1.905 µs
>> >
>> > Copper delays:
>> > 10 Gbps copper delay (64-byte packets): 2.728 µs
>> > 10 Gbps copper delay (512-byte packets): 2.85 µs
>> > 10 Gbps copper delay (1024-byte packets): 2.904 µs
>> > 10 Gbps copper delay (1518-byte packets): 2.841 µs
>> >
>> > Fiber delays:
>> > 1 Gbps fiber delay (64-byte packets): 2.289 µs
>> > 1 Gbps fiber delay (512-byte packets): 2.393 µs
>> > 1 Gbps fiber delay (1024-byte packets): 2.423 µs
>> > 1 Gbps fiber delay (1518-byte packets): 2.379 µs
>> >
>> > Copper delays:
>> > 1 Gbps copper delay (64-byte packets): 2.707 µs
>> > 1 Gbps copper delay (512-byte packets): 2.821 µs
>> > 1 Gbps copper delay (1024-byte packets): 2.866 µs
>> > 1 Gbps copper delay (1518-byte packets): 2.826 µs
>> >
>> > So the difference is serious: 900 ns on a total of ~1900 ns for a 10G
>> > packet.
>> > Another strange thing is that 1024-byte packets are slower than
>> > 1518-byte ones.
>> >
>> > So that might warrant preferably connecting boxes with optics
>> > instead of CAT cabling if you are trying to squeeze the max out of
>> > a setup.
>> >
>> > Sad thing is that they do not report numbers for jumbo frames, and
>> > doing these measurements yourself is not easy...
>> >
>> > --WjW
>> >
>> >
>> >
>> >
>> >
>> > --
>> > --
>> > Paul Emmerich
>> >
>> > croit GmbH
>> > Freseniusstr. 31h
>> > 81247 München
>> > www.croit.io
>> > Tel: +49 89 1896585 90
>>
>
>


Re: [ceph-users] DELL R620 - SSD recommendation

2018-03-21 Thread Nghia Than
If you want speed and IOPS, try the Samsung PM863a or SM863a (the PM863a
is slightly cheaper).

If you want high endurance, try the Intel DC S3700 series.

Do not use consumer SSDs for caching, or desktop HDDs for OSDs.

what is the highest HDD capacity that you were able to use in the R620 ?


This depends on your RAID controller, not your server.
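On the endurance point: SSD endurance is usually quoted as DWPD (drive writes per day) over the warranty period, which converts directly into total terabytes written. The figures below (10 DWPD over 5 years for the write-intensive DC S3700 class, ~0.3 DWPD for a typical consumer drive) are assumptions based on commonly quoted specs; check the datasheet for your exact model:

```python
def endurance_tbw(dwpd: float, capacity_gb: float, warranty_years: float = 5.0) -> float:
    """Total terabytes written (TBW) implied by a DWPD rating."""
    return dwpd * (capacity_gb / 1000.0) * 365 * warranty_years

# Assumed ratings, for illustration only:
print(endurance_tbw(10.0, 200))   # write-intensive 200 GB drive
print(endurance_tbw(0.3, 250))    # typical consumer-class 250 GB drive
```

The roughly 25x gap is why consumer drives wear out quickly in front of several OSDs: a journal/DB device absorbs the small synchronous writes of every OSD behind it.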

On Thu, Mar 22, 2018 at 2:40 AM, Steven Vacaroaia wrote:

> Hi,
>
> It will be appreciated if you could recommend some SSD models ( 200GB or
> less)
>
> I am planning to deploy 2 SSD and 6 HDD ( for a 1 to 3 ratio) in few DELL
> R620 with 64GB RAM
>
> Also, what is the highest HDD capacity that you were able to use in the
> R620 ?
>
>
> Note
> I apologize for asking "research easy" kind of questions but I am hoping
> for confirmed / hands on info / details
>
> Many thanks
> Steven
>


-- 
==
Nghia Than


Re: [ceph-users] IO rate-limiting with Ceph RBD (and libvirt)

2018-03-21 Thread Wido den Hollander


On 03/21/2018 06:48 PM, Andre Goree wrote:
> I'm trying to determine the best way to go about configuring IO
> rate-limiting for individual images within an RBD pool.
> 
> Here [1], I've found that OpenStack appears to use Libvirt's "iotune"
> parameter, however I seem to recall reading about being able to do so
> via Ceph's settings.
> 
> Is there a place in Ceph to set IO limits on individual images within an
> RBD pool?  Thanks in advance for the advice.

No, there is not. Right now you will need to limit this through
libvirt/Qemu indeed.

People are thinking about a QoS mechanism inside the OSDs, but that's
not there yet, so don't count on it.

Wido

> 
> [1] https://ceph.com/geen-categorie/openstack-ceph-rbd-and-qos/
> 
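For reference, the per-disk mechanism that OpenStack drives is libvirt's `<iotune>` element inside the domain XML. A minimal sketch for an RBD-backed disk; the monitor host, pool/image name, and limit values here are placeholders, not recommendations:

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
  <iotune>
    <!-- caps apply to this disk only: ~100 MB/s and 1000 IOPS total -->
    <total_bytes_sec>104857600</total_bytes_sec>
    <total_iops_sec>1000</total_iops_sec>
  </iotune>
</disk>
```

The same limits can also be adjusted on a running domain with `virsh blkdeviotune <domain> vdb --total-bytes-sec N --total-iops-sec N`.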