Hello Robert, 

Sorry for the late answer. 

Thanks for your reply. I updated to Infernalis and applied all your
recommendations, but it didn't change anything, with or without cache
tiering :-/ 

I also compared XFS with EXT4 and BTRFS, but it makes no
difference. 

The fio command from Sebastien Han tells me my disks can actually do
100k IOPS, so it's really frustrating :-S 
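
To make the gap concrete, here is a minimal Python sketch of the arithmetic
from my original message below. It only uses figures quoted in this thread
(12 cache SSDs, 40k IOPS per SSD in my first bench, about 2.5k IOPS per SSD
observed, replication factor 2). The function name is made up for
illustration, and the model assumes each client write costs exactly one SSD
write per replica; it ignores journal writes and any Ceph overhead.

```python
# Back-of-the-envelope model using only numbers quoted in this thread.
# client_write_iops is a hypothetical helper name, not part of any Ceph tool.

def client_write_iops(ssd_count: int, iops_per_ssd: float, replicas: int) -> float:
    """Client-visible write IOPS, assuming one SSD write per replica."""
    return ssd_count * iops_per_ssd / replicas


cache_ssds = 2 * 6   # 2 cache OSD servers x 6 data SSDs each
replicas = 2         # osd pool default size = 2

# Ceiling computed from the per-SSD bench (40k IOPS each).
theoretical = client_write_iops(cache_ssds, 40_000, replicas)

# What the monitoring shows: about 2.5k IOPS per cache SSD.
observed = client_write_iops(cache_ssds, 2_500, replicas)

print(f"theoretical ceiling: {theoretical:.0f} IOPS")
print(f"observed:            {observed:.0f} IOPS")
print(f"fraction of ceiling: {observed / theoretical:.2%}")
```

Under these assumptions the observed 15k IOPS is one sixteenth of the 240k
ceiling, so the SSDs themselves are nowhere near saturated.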

Rémi 

On 2015-11-07 15:59, Robert LeBlanc wrote: 

> You most likely used the wrong test to get the baseline IOPS of Ceph or of
> your SSDs. Ceph is really hard on SSDs: it does direct sync writes, which
> drives handle very differently, even between models of the same brand. Start
> with
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> as your base numbers, and just realize that Hammer still can't use all those
> IOPS. I was able to gain 50% in SSD IOPS by: disabling transparent huge
> pages, LD_PRELOADing jemalloc (uses a little more RAM but your config should
> be OK), enabling numad, dialing irqbalance, setting vfs_cache_pressure to
> 500, greatly increasing the network buffers, and disabling TCP slow
> start. We are also using EXT4, which I've found is a bit faster, but it has
> recently been reported that someone is having deadlocks/crashes with it. We
> are having an XFS log issue on one of our clusters causing an OSD or two to
> fail every week. 
> 
> When I tested the same workload in an SSD cache tier the performance was only 
> 50% of what I was able to achieve on the pure SSD tier (I'm guessing overhead 
> of the cache tier). And this was with having the entire test set in the SSD 
> tier so there was no spindle activity. 
> 
> Short answer is that you will need a lot more SSDs to hit your target with 
> Hammer. Or, if you can wait for Jewel, you may be able to get by with only 
> a little bit more. 
> 
> Robert LeBlanc 
> 
> Sent from a mobile device please excuse any typos. 
> On Nov 7, 2015 1:24 AM, "Rémi BUISSON" <remi-buis...@orange.fr> wrote:
> 
>> Hi guys,
>> 
>> I need your help to figure out performance issues on my Ceph cluster.
>> I've read pretty much every thread on the net concerning this topic, but I 
>> haven't managed to get acceptable performance.
>> In my company, we are planning to replace our existing virtualization 
>> infrastructure NAS with a Ceph cluster in order to improve overall platform 
>> performance, scalability and security. Our current NAS handles about 
>> 50k IOPS.
>> 
>> For this we bought:
>> 2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB RAM,
>>    2 x 10Gbps network interfaces (bonding)
>> 3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB RAM,
>>    2 x 10Gbps network interfaces (bonding)
>> 2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB RAM,
>>    2 x 10Gbps network interfaces (bonding)
>> 2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz,
>>    256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal,
>>    6 x SSD INTEL SSDSC2BX016T4R (1.4 TB) for data,
>>    2 x 10Gbps network interfaces (bonding)
>> 4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz,
>>    256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200GB) for journal,
>>    18 x HGST Ultrastar HUC101818CS4204 (1.8TB) for data,
>>    2 x 10Gbps network interfaces (bonding)
>> 
>> The total of this is 84 OSDs.
>> 
>> I created two pools of 4096 PGs each, one called rbd-cold-storage and the 
>> other rbd-hot-storage. As you may guess, rbd-cold-storage is made up of the 
>> 4 OSD servers with platter disks and rbd-hot-storage of the 2 OSD servers 
>> with SSDs.
>> On rbd-cold-storage, I created an RBD device which is mapped on the NFS 
>> server.
>> 
>> I benchmarked each of our SSDs and each can handle 40k IOPS. As my 
>> replication factor is 2, the theoretical performance of the cluster is (2 x 
>> 6 (OSD cache) x 40k) / 2 = 240k IOPS.
>> 
>> I'm currently benchmarking the cluster with the fio tool from one NFS 
>> server. Here is my fio job file:
>> [global]
>> ioengine=libaio
>> iodepth=32
>> runtime=300
>> direct=1
>> filename=/dev/rbd0
>> group_reporting=1
>> gtod_reduce=1
>> randrepeat=1
>> size=4G
>> numjobs=1
>> 
>> [4k-rand-write]
>> new_group
>> bs=4k
>> rw=randwrite
>> stonewall
>> 
>> The problem is that I can't get more than 15k IOPS for writes. In my 
>> monitoring engine, I can see that each of the OSD (cache) SSDs is doing no 
>> more than 2.5k IOPS, which seems to correspond to 6 x 2.5k = 15k IOPS. I 
>> don't expect to reach the theoretical value, but reaching 100k IOPS would be 
>> perfect.
>> 
>> My cluster is running on Debian jessie with the Ceph Hammer v0.94.5 Debian 
>> package (compiled with the --with-jemalloc option; I also tried without). 
>> Here is my ceph.conf:
>> 
>> [global]
>> fsid = 5046f766-670f-4705-adcc-290f434c8a83
>> 
>> # basic settings
>> mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
>> mon host = 10.10.69.254,10.10.69.253,10.10.69.252
>> mon osd allow primary affinity = true
>> # network settings
>> public network = 10.10.69.128/25
>> cluster network = 10.10.69.0/25
>> 
>> # auth settings
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> 
>> # default pools settings
>> osd pool default size = 2
>> osd pool default min size = 1
>> osd pool default pg num = 8192
>> osd pool default pgp num = 8192
>> osd crush chooseleaf type = 1
>> 
>> # debug settings
>> debug lockdep = 0/0
>> debug context = 0/0
>> debug crush = 0/0
>> debug buffer = 0/0
>> debug timer = 0/0
>> debug journaler = 0/0
>> debug osd = 0/0
>> debug optracker = 0/0
>> debug objclass = 0/0
>> debug filestore = 0/0
>> debug journal = 0/0
>> debug ms = 0/0
>> debug monc = 0/0
>> debug tp = 0/0
>> debug auth = 0/0
>> debug finisher = 0/0
>> debug heartbeatmap = 0/0
>> debug perfcounter = 0/0
>> debug asok = 0/0
>> debug throttle = 0/0
>> 
>> throttler perf counter = false
>> osd enable op tracker = false
>> 
>> ## OSD settings
>> [osd]
>> # OSD FS settings
>> osd mkfs type = xfs
>> osd mkfs options xfs = -f -i size=2048
>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>> 
>> # OSD journal settings
>> osd journal block align = true
>> osd journal aio = true
>> osd journal dio = true
>> 
>> # Performance tuning
>> filestore xattr use omap = true
>> filestore merge threshold = 40
>> filestore split multiple = 8
>> filestore max sync interval = 10
>> filestore queue max ops = 100000
>> filestore queue max bytes = 1GiB
>> filestore op threads = 20
>> filestore journal writeahead = true
>> filestore fd cache size = 10240
>> osd op threads = 8
>> 
>> Disabling throttling doesn't change anything.
>> After everything I've read, I would like to know whether, since those 
>> threads from a few months ago, anyone has managed to fix this kind of 
>> problem. Any ideas or thoughts on how to improve this?
>> 
>> Thanks.
>> 
>> Rémi
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

