[ceph-users] About IOPS num
Guys: we have a working Ceph cluster whose nodes are connected with 10Gb links. We set fio's bs=4k, and the RBD object size is 4MB. The client node is connected to the cluster over a 1000Mb link. With fio's latency kept under 20ms, we only get 27 IOPS. 4MB * 27 = 108MB/s, which is nearly 1000Mb/s, so is that reasonable? lixue...@chinacloud.com.cn ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
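A quick unit check of that arithmetic (plain bash; it simply restates the assumption in the message above that each of the 27 operations ends up moving a whole 4MB object to the client):

# Back-of-the-envelope check of the 4MB*27 figure quoted above.
# Assumes every one of the 27 IOPS transfers a full 4 MB object, as the message implies.
iops=27
object_mb=4
mb_per_s=$(( iops * object_mb ))        # 108 MB/s
mbit_per_s=$(( mb_per_s * 8 ))          # ~864 Mbit/s
echo "${mb_per_s} MB/s = ${mbit_per_s} Mbit/s, close to a saturated 1000Mb client link"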
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy wrote: > Thanks Haomai ! > > Here is some of the data from my setup. > > > > -- > > Set up: > > > > > > 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) -> > one OSD. 5 client m/c with 12 core cpu and each running two instances of > ceph_smalliobench (10 clients total). Network is 10GbE. > > > > Workload: > > - > > > > Small workload – 20K objects with 4K size and io_size is also 4K RR. The > intent is to serve the ios from memory so that it can uncover the > performance problems within single OSD. > > > > Results from Firefly: > > -- > > > > Single client throughput is ~14K iops, but as the number of client increases > the aggregated throughput is not increasing. 10 clients ~15K iops. ~9-10 cpu > cores are used. > > > > Result with latest master: > > -- > > > > Single client is ~14K iops, but scaling as number of clients increases. 10 > clients ~107K iops. ~25 cpu cores are used. > > > > -- > > > > > > More realistic workload: > > - > > Let’s see how it is performing while > 90% of the ios are served from disks > > Setup: > > --- > > 40 cpu core server as a cluster node (single node cluster) with 64 GB RAM. 8 > SSDs -> 8 OSDs. One similar node for monitor and rgw. Another node for > client running fio/vdbench. 4 rbds are configured with ‘noshare’ option. 40 > GbE network > > > > Workload: > > > > > > 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data. Io_size = 4K RR. > > > > Results from Firefly: > > > > > > Aggregated output while 4 rbd clients stressing the cluster in parallel is > ~20-25K IOPS , cpu cores used ~8-10 cores (may be less can’t remember > precisely) > > > > Results from latest master: > > > > > > Aggregated output while 4 rbd clients stressing the cluster in parallel is > ~120K IOPS , cpu is 7% idle i.e ~37-38 cpu cores. > > > > Hope this helps. > > > Thanks Roy, the results are very promising! Just two moments - are numbers from above related to the HT cores or you renormalized the result for real ones? And what about percentage of I/O time/utilization in this test was (if you measured this ones)? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
On Fri, Aug 29, 2014 at 4:03 PM, Andrey Korolyov wrote: > On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy > wrote: > > Thanks Haomai ! > > > > Here is some of the data from my setup. > > > > > > > > > -- > > > > Set up: > > > > > > > > > > > > 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) > -> > > one OSD. 5 client m/c with 12 core cpu and each running two instances of > > ceph_smalliobench (10 clients total). Network is 10GbE. > > > > > > > > Workload: > > > > - > > > > > > > > Small workload – 20K objects with 4K size and io_size is also 4K RR. The > > intent is to serve the ios from memory so that it can uncover the > > performance problems within single OSD. > > > > > > > > Results from Firefly: > > > > -- > > > > > > > > Single client throughput is ~14K iops, but as the number of client > increases > > the aggregated throughput is not increasing. 10 clients ~15K iops. ~9-10 > cpu > > cores are used. > > > > > > > > Result with latest master: > > > > -- > > > > > > > > Single client is ~14K iops, but scaling as number of clients increases. > 10 > > clients ~107K iops. ~25 cpu cores are used. > > > > > > > > > -- > > > > > > > > > > > > More realistic workload: > > > > - > > > > Let’s see how it is performing while > 90% of the ios are served from > disks > > > > Setup: > > > > --- > > > > 40 cpu core server as a cluster node (single node cluster) with 64 GB > RAM. 8 > > SSDs -> 8 OSDs. One similar node for monitor and rgw. Another node for > > client running fio/vdbench. 4 rbds are configured with ‘noshare’ option. > 40 > > GbE network > > > > > > > > Workload: > > > > > > > > > > > > 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data. Io_size = 4K RR. > > > > > > > > Results from Firefly: > > > > > > > > > > > > Aggregated output while 4 rbd clients stressing the cluster in parallel > is > > ~20-25K IOPS , cpu cores used ~8-10 cores (may be less can’t remember > > precisely) > Good job! I would like to perform it later. > > > > > > > > Results from latest master: > > > > > > > > > > > > Aggregated output while 4 rbd clients stressing the cluster in parallel > is > > ~120K IOPS , cpu is 7% idle i.e ~37-38 cpu cores. > > > > > > > > Hope this helps. > > > > > > > > Thanks Roy, the results are very promising! > > Just two moments - are numbers from above related to the HT cores or > you renormalized the result for real ones? And what about percentage > of I/O time/utilization in this test was (if you measured this ones)? > -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RFC: A preliminary Chinese version of Calamari
Hi, We have set up a preliminary Chinese version of Calamari at https://github.com/xiaoxianxia/calamari-clients-cn. The main work done so far is translating the English text on the web interface into Chinese; we did not change the localization infrastructure, so any help in that direction is appreciated. Any suggestions, tests and technical involvement are also welcome, to make it ready to be merged upstream. Cheers, Li Wang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Thanks a lot for the answers, even if we drifted from the main subject a little bit. Thanks Somnath for sharing this, when can we expect any codes that might improve _write_ performance? @Mark thanks trying this :) Unfortunately using nobarrier and another dedicated SSD for the journal (plus your ceph setting) didn’t bring much, now I can reach 3,5K IOPS. By any chance, would it be possible for you to test with a single OSD SSD? On 28 Aug 2014, at 18:11, Sebastien Han wrote: > Hey all, > > It has been a while since the last thread performance related on the ML :p > I’ve been running some experiment to see how much I can get from an SSD on a > Ceph cluster. > To achieve that I did something pretty simple: > > * Debian wheezy 7.6 > * kernel from debian 3.14-0.bpo.2-amd64 > * 1 cluster, 3 mons (i’d like to keep this realistic since in a real > deployment i’ll use 3) > * 1 OSD backed by an SSD (journal and osd data on the same device) > * 1 replica count of 1 > * partitions are perfectly aligned > * io scheduler is set to noon but deadline was showing the same results > * no updatedb running > > About the box: > > * 32GB of RAM > * 12 cores with HT @ 2,4 GHz > * WB cache is enabled on the controller > * 10Gbps network (doesn’t help here) > > The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops > with random 4k writes (my fio results) > As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom > guys!). > > O_DIECT and D_SYNC don’t seem to be a problem for the SSD: > > # dd if=/dev/urandom of=rand.file bs=4k count=65536 > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s > > # du -sh rand.file > 256Mrand.file > > # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s > > See my ceph.conf: > > [global] > auth cluster required = cephx > auth service required = cephx > auth client required = cephx > fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 > osd pool default pg num = 4096 > osd pool default pgp num = 4096 > osd pool default size = 2 > osd crush chooseleaf type = 0 > > debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [mon] > mon osd down out interval = 600 > mon osd min down reporters = 13 >[mon.ceph-01] >host = ceph-01 >mon addr = 172.20.20.171 > [mon.ceph-02] >host = ceph-02 >mon addr = 172.20.20.172 > [mon.ceph-03] >host = ceph-03 >mon addr = 172.20.20.173 > >debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [osd] > osd mkfs type = xfs > osd mkfs options xfs = -f -i size=2048 > osd mount options xfs = rw,noatime,logbsize=256k,delaylog > osd journal size = 20480 > cluster_network = 172.20.20.0/24 > public_network = 172.20.20.0/24 > osd mon heartbeat interval = 30 > # Performance 
tuning > filestore merge threshold = 40 > filestore split multiple = 8 > osd op threads = 8 > # Recovery tuning > osd recovery max active = 1 > osd max backfills = 1 > osd recovery op priority = 1 > > >debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > Disabling all debugging made me win 200/300 more IOPS. > > See my fio template: > > [global] > #logging > #write_iops_log=write_iops_log > #write_bw_log=write_bw_log > #write_lat_log=write_lat_lo > > time_based > runtime=60 > > ioengine=rbd > clientname=admin >
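For readers who want to reproduce this kind of test, here is a rough sketch of a complete fio job file for the rbd engine. It is not the exact template used above (which is cut off); the pool, image and client names below are placeholders, and the queue depth and runtime are only examples:

# Hypothetical fio job for the rbd engine -- a sketch, not the template from this thread.
# 'rbd' pool, 'fio-test' image and 'admin' client are placeholder names.
cat > rbd-4k-randwrite.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randwrite
bs=4k
iodepth=32
time_based
runtime=60

[rbd-job]
EOF

# The image must exist before fio can open it.
rbd create fio-test --size 10240 --pool rbd
fio rbd-4k-randwrite.fio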
[ceph-users] Difference between "object rm" and "object unlink" ?
Hi all, Using the radosgw-admin command:

# radosgw-admin object rm --object=my_test_file.txt --bucket=test_buck
# radosgw-admin object unlink --object=my_test_file.txt --bucket=test_buck

Both of the above "delete" the object from the bucket, and both give the expected result. What is the difference between these two commands? Could someone tell me the details? Thanks very much. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Hi Sebastien, Here’s my recipe for max IOPS on a _testing_ instance with SSDs: osd op threads = 2 osd disk threads = 2 journal max write bytes = 1048576 journal queue max bytes = 1048576 journal max write entries = 1 journal queue max ops = 5 filestore op threads = 2 filestore max sync interval = 60 filestore queue max ops = 5 filestore queue max bytes = 1048576 filestore queue committing max bytes = 1048576 filestore queue committing max ops = 5 filestore wbthrottle xfs bytes start flusher = 4194304000 filestore wbthrottle xfs bytes hard limit = 4194304 filestore wbthrottle xfs ios start flusher = 5 filestore wbthrottle xfs ios hard limit = 50 filestore wbthrottle xfs inodes start flusher = 5 filestore wbthrottle xfs inodes hard limit = 50 Basically the goal there is to ensure no IOs are blocked from entering any queue. (And don’t run that in production!) IIRC I can get up to around 5000 IOPS to a single fio/rbd client. Related to the sync interval, I was also playing with vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs to disable the background page flushing (disables FileStore flushing). That way, the only disk activity becomes the journal writes. You can confirm that with iostat. Another thing that comes to mind is at some point a single fio/librbd client will be the bottleneck. Did you try running two simultaneous fio’s (then adding the results)? Cheers, Dan -- Dan van der Ster || Data & Storage Services || CERN IT Department -- On 28 Aug 2014, at 18:11, Sebastien Han wrote: > Hey all, > > It has been a while since the last thread performance related on the ML :p > I’ve been running some experiment to see how much I can get from an SSD on a > Ceph cluster. > To achieve that I did something pretty simple: > > * Debian wheezy 7.6 > * kernel from debian 3.14-0.bpo.2-amd64 > * 1 cluster, 3 mons (i’d like to keep this realistic since in a real > deployment i’ll use 3) > * 1 OSD backed by an SSD (journal and osd data on the same device) > * 1 replica count of 1 > * partitions are perfectly aligned > * io scheduler is set to noon but deadline was showing the same results > * no updatedb running > > About the box: > > * 32GB of RAM > * 12 cores with HT @ 2,4 GHz > * WB cache is enabled on the controller > * 10Gbps network (doesn’t help here) > > The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops > with random 4k writes (my fio results) > As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom > guys!). 
> > O_DIECT and D_SYNC don’t seem to be a problem for the SSD: > > # dd if=/dev/urandom of=rand.file bs=4k count=65536 > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s > > # du -sh rand.file > 256Mrand.file > > # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s > > See my ceph.conf: > > [global] > auth cluster required = cephx > auth service required = cephx > auth client required = cephx > fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 > osd pool default pg num = 4096 > osd pool default pgp num = 4096 > osd pool default size = 2 > osd crush chooseleaf type = 0 > > debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [mon] > mon osd down out interval = 600 > mon osd min down reporters = 13 >[mon.ceph-01] >host = ceph-01 >mon addr = 172.20.20.171 > [mon.ceph-02] >host = ceph-02 >mon addr = 172.20.20.172 > [mon.ceph-03] >host = ceph-03 >mon addr = 172.20.20.173 > >debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [osd] > osd mkfs type = xfs > osd mkfs options xfs = -f -i size=2048 > osd mount options xfs = rw,noatime,logbsize=256k,delaylog > osd journal size = 20480 > cluster_network = 172.20.20.0/24 > pub
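Regarding the vm.dirty_expire_centisecs / vm.dirty_writeback_centisecs tuning Dan mentions, a test-box-only sketch of what that might look like (the values here are illustrative guesses, not his exact settings; the effect is to stop background page flushing so that journal writes become the only disk activity visible in iostat):

# Testing only: effectively disable background dirty-page writeback.
sysctl -w vm.dirty_writeback_centisecs=0       # 0 disables the periodic flusher wakeups
sysctl -w vm.dirty_expire_centisecs=360000     # dirty pages only become "old" after an hour
# Then watch the journal device while fio runs:
iostat -x 1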
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Hi Sébastien, On Thu, Aug 28, 2014 at 06:11:37PM +0200, Sebastien Han wrote: > Hey all, (...) > We have been able to reproduce this on 3 distinct platforms with some > deviations (because of the hardware) but the behaviour is the same. > Any thoughts will be highly appreciated, only getting 3,2k out of an 29K IOPS > SSD is a bit frustrating :). Yes, it's the OSD code running through ~20k lines of code on every IO and making ~1000 system calls for 1 real IO. Please have a look at the attached page, which I presented last Sept at SDC. => Ceph (v0.61) used 1600 µs for a 4k IO, SSD ~60 µs, Network ~20-200 µs (on v0.84 and all debug set to 0/0 it is still ~1100 µs) => so, tune or re-write the OSD code ... (I'm serious about this statement) Mit freundlichen Grüßen / Best regards Dieter > > Cheers. > > Sébastien Han > Cloud Architect > > "Always give 100%. Unless you're giving blood." > > Phone: +33 (0)1 49 70 99 72 > Mail: sebastien@enovance.com > Address : 11 bis, rue Roquépine - 75008 Paris > Web : www.enovance.com - Twitter : @enovance > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Ceph-OSD-TAT.pdf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
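To put Dieter's latency figures in IOPS terms, a rough conversion (this assumes a single fully serialized stream where every IO waits for the previous one; parallelism raises the ceiling, but the per-IO software cost stays):

# Rough per-stream IOPS ceiling from per-IO latency (1,000,000 us in a second).
for lat_us in 60 1100 1600; do
    echo "latency ${lat_us} us -> at most $(( 1000000 / lat_us )) IOPS per serial stream"
done
# ~60 us   (bare SSD)            -> ~16666 IOPS
# ~1100 us (v0.84, debug at 0/0) -> ~909 IOPS
# ~1600 us (v0.61)               -> ~625 IOPS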
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
On 08/29/2014 06:10 AM, Dan Van Der Ster wrote: Hi Sebastien, Here’s my recipe for max IOPS on a _testing_ instance with SSDs: osd op threads = 2 With SSDs, In the past I've seen increasing the osd op thread count can help random reads. osd disk threads = 2 journal max write bytes = 1048576 journal queue max bytes = 1048576 journal max write entries = 1 journal queue max ops = 5 filestore op threads = 2 filestore max sync interval = 60 filestore queue max ops = 5 filestore queue max bytes = 1048576 filestore queue committing max bytes = 1048576 filestore queue committing max ops = 5 filestore wbthrottle xfs bytes start flusher = 4194304000 filestore wbthrottle xfs bytes hard limit = 4194304 filestore wbthrottle xfs ios start flusher = 5 filestore wbthrottle xfs ios hard limit = 50 filestore wbthrottle xfs inodes start flusher = 5 filestore wbthrottle xfs inodes hard limit = 50 It's also probably worth trying disabling all in-memory logging. Unfortunately I don't think we have a global flag to do this, so you have to do it on a per-log basis which is annoying. Basically the goal there is to ensure no IOs are blocked from entering any queue. (And don’t run that in production!) IIRC I can get up to around 5000 IOPS to a single fio/rbd client. Related to the sync interval, I was also playing with vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs to disable the background page flushing (disables FileStore flushing). That way, the only disk activity becomes the journal writes. You can confirm that with iostat. Another thing that comes to mind is at some point a single fio/librbd client will be the bottleneck. Did you try running two simultaneous fio’s (then adding the results)? Cheers, Dan -- Dan van der Ster || Data & Storage Services || CERN IT Department -- On 28 Aug 2014, at 18:11, Sebastien Han wrote: Hey all, It has been a while since the last thread performance related on the ML :p I’ve been running some experiment to see how much I can get from an SSD on a Ceph cluster. To achieve that I did something pretty simple: * Debian wheezy 7.6 * kernel from debian 3.14-0.bpo.2-amd64 * 1 cluster, 3 mons (i’d like to keep this realistic since in a real deployment i’ll use 3) * 1 OSD backed by an SSD (journal and osd data on the same device) * 1 replica count of 1 * partitions are perfectly aligned * io scheduler is set to noon but deadline was showing the same results * no updatedb running About the box: * 32GB of RAM * 12 cores with HT @ 2,4 GHz * WB cache is enabled on the controller * 10Gbps network (doesn’t help here) The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops with random 4k writes (my fio results) As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom guys!). 
O_DIECT and D_SYNC don’t seem to be a problem for the SSD: # dd if=/dev/urandom of=rand.file bs=4k count=65536 65536+0 records in 65536+0 records out 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s # du -sh rand.file 256Mrand.file # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct 65536+0 records in 65536+0 records out 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s See my ceph.conf: [global] auth cluster required = cephx auth service required = cephx auth client required = cephx fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 osd pool default pg num = 4096 osd pool default pgp num = 4096 osd pool default size = 2 osd crush chooseleaf type = 0 debug lockdep = 0/0 debug context = 0/0 debug crush = 0/0 debug buffer = 0/0 debug timer = 0/0 debug journaler = 0/0 debug osd = 0/0 debug optracker = 0/0 debug objclass = 0/0 debug filestore = 0/0 debug journal = 0/0 debug ms = 0/0 debug monc = 0/0 debug tp = 0/0 debug auth = 0/0 debug finisher = 0/0 debug heartbeatmap = 0/0 debug perfcounter = 0/0 debug asok = 0/0 debug throttle = 0/0 [mon] mon osd down out interval = 600 mon osd min down reporters = 13 [mon.ceph-01] host = ceph-01 mon addr = 172.20.20.171 [mon.ceph-02] host = ceph-02 mon addr = 172.20.20.172 [mon.ceph-03] host = ceph-03 mon addr = 172.20.20.173 debug lockdep = 0/0 debug context = 0/0 debug crush = 0/0 debug buffer = 0/0 debug timer = 0/0 debug journaler = 0/0 debug osd = 0/0 debug optracker = 0/0 debug objclass = 0/0 debug filestore = 0/0 debug journal = 0/0 debug ms = 0/0 debug monc = 0/0 debug tp = 0/0 debug auth = 0/0 debug finisher = 0/0 debug heartbeatmap = 0/0 debug perfcounter = 0/0 debug asok = 0/0 debug throttle = 0/0 [osd]
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
@Dan: thanks for sharing your config, with all your flags I don’t seem to get more than 3,4K IOPS and they even seem to slow me down :( This is really weird. Yes, I already tried to run two simultaneous processes, and each of them got only half of 3,4K. @Kasper: thanks for these results, I believe some improvement could be made in the code as well :). FYI I just tried on Ubuntu 12.04 and it looks a bit better because I’m getting iops=3783. On 29 Aug 2014, at 13:10, Dan Van Der Ster wrote: > vm.dirty_expire_centisecs Cheers. Sébastien Han Cloud Architect "Always give 100%. Unless you're giving blood." Phone: +33 (0)1 49 70 99 72 Mail: sebastien@enovance.com Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - Twitter : @enovance ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Excellent, I've been meaning to check into how the TCP transport is going. Are you using a hybrid threadpool/epoll approach? That I suspect would be very effective at reducing context switching, especially compared to what we do now. Mark On 08/28/2014 10:40 PM, Matt W. Benjamin wrote: Hi, There's also an early-stage TCP transport implementation for Accelio, also EPOLL-based. (We haven't attempted to run Ceph protocols over it yet, to my knowledge, but it should be straightforward.) Regards, Matt - "Haomai Wang" wrote: Hi Roy, As for messenger level, I have some very early works on it(https://github.com/yuyuyu101/ceph/tree/msg-event), it contains a new messenger implementation which support different event mechanism. It looks like at least one more week to make it work. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Hi Mark, Yeah. The application defines portals which are active threaded, then the transport layer is servicing the portals with EPOLL. Matt - "Mark Nelson" wrote: > Excellent, I've been meaning to check into how the TCP transport is > going. Are you using a hybrid threadpool/epoll approach? That I > suspect would be very effective at reducing context switching, > especially compared to what we do now. > > Mark > > On 08/28/2014 10:40 PM, Matt W. Benjamin wrote: > > Hi, > > > > There's also an early-stage TCP transport implementation for > Accelio, also EPOLL-based. (We haven't attempted to run Ceph > protocols over it yet, to my knowledge, but it should be > straightforward.) > > > > Regards, > > > > Matt > > > > - "Haomai Wang" wrote: > > > >> Hi Roy, > >> > >> > >> As for messenger level, I have some very early works on > >> it(https://github.com/yuyuyu101/ceph/tree/msg-event), it contains > a > >> new messenger implementation which support different event > mechanism. > >> It looks like at least one more week to make it work. > >> > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Matt Benjamin The Linux Box 206 South Fifth Ave. Suite 150 Ann Arbor, MI 48104 http://linuxbox.com tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Results are with HT enabled. We are yet to measure the performance with HT disabled. No, I didn't measure I/O time/utilization. -Original Message- From: Andrey Korolyov [mailto:and...@xdel.ru] Sent: Friday, August 29, 2014 1:03 AM To: Somnath Roy Cc: Haomai Wang; ceph-users@lists.ceph.com Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy wrote: > Thanks Haomai ! > > Here is some of the data from my setup. > > > > -- > -- > -- > > Set up: > > > > > > 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and > data) -> one OSD. 5 client m/c with 12 core cpu and each running two > instances of ceph_smalliobench (10 clients total). Network is 10GbE. > > > > Workload: > > - > > > > Small workload – 20K objects with 4K size and io_size is also 4K RR. > The intent is to serve the ios from memory so that it can uncover the > performance problems within single OSD. > > > > Results from Firefly: > > -- > > > > Single client throughput is ~14K iops, but as the number of client > increases the aggregated throughput is not increasing. 10 clients ~15K > iops. ~9-10 cpu cores are used. > > > > Result with latest master: > > -- > > > > Single client is ~14K iops, but scaling as number of clients > increases. 10 clients ~107K iops. ~25 cpu cores are used. > > > > -- > -- > -- > > > > > > More realistic workload: > > - > > Let’s see how it is performing while > 90% of the ios are served from > disks > > Setup: > > --- > > 40 cpu core server as a cluster node (single node cluster) with 64 GB > RAM. 8 SSDs -> 8 OSDs. One similar node for monitor and rgw. Another > node for client running fio/vdbench. 4 rbds are configured with > ‘noshare’ option. 40 GbE network > > > > Workload: > > > > > > 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data. Io_size = 4K RR. > > > > Results from Firefly: > > > > > > Aggregated output while 4 rbd clients stressing the cluster in > parallel is ~20-25K IOPS , cpu cores used ~8-10 cores (may be less > can’t remember > precisely) > > > > Results from latest master: > > > > > > Aggregated output while 4 rbd clients stressing the cluster in > parallel is ~120K IOPS , cpu is 7% idle i.e ~37-38 cpu cores. > > > > Hope this helps. > > > Thanks Roy, the results are very promising! Just two moments - are numbers from above related to the HT cores or you renormalized the result for real ones? And what about percentage of I/O time/utilization in this test was (if you measured this ones)? PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: Ceph Filesystem - Production?
I am running active/standby and it didn't swap over to the standby. If I shutdown the active server it swaps to the standby fine though. When there were issues, disk access would back up on the webstats servers and a cat of /sys/kernel/debug/ceph/*/mdsc would have a list of entries whereas normally it would only list one or two if any. I have 4 cores and 2GB of ram on the mds machines. Watching it right now it is using most of the ram and some of swap although most of the active ram is disk cache. I lowered the memory.swappiness value to see if that helps. I'm also logging top output if it happens again. On Thu, Aug 28, 2014 at 8:22 PM, Yan, Zheng wrote: > On Fri, Aug 29, 2014 at 8:36 AM, James Devine wrote: > > > > On Thu, Aug 28, 2014 at 1:30 PM, Gregory Farnum > wrote: > >> > >> On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman > >> wrote: > >> > Is Ceph Filesystem ready for production servers? > >> > > >> > The documentation says it's not, but I don't see that mentioned > anywhere > >> > else. > >> > http://ceph.com/docs/master/cephfs/ > >> > >> Everybody has their own standards, but Red Hat isn't supporting it for > >> general production use at this time. If you're brave you could test it > >> under your workload for a while and see how it comes out; the known > >> issues are very much workload-dependent (or just general concerns over > >> polish). > >> -Greg > >> Software Engineer #42 @ http://inktank.com | http://ceph.com > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > I've been testing it with our webstats since it gets live hits but isn't > > customer affecting. Seems the MDS server has problems every few days > > requiring me to umount and remount the ceph disk to resolve. Not sure if > > the issue is resolved in development versions but as of 0.80.5 we seem > to be > > hitting it. I set the log verbosity to 20 so there's tons of logs but > ends > > with > > The cephfs client is supposed to be able to handle MDS takeover. > what's symptom makes you umount and remount the cephfs ? 
> > > > > 2014-08-24 07:10:19.682015 7f2b575e7700 10 mds.0.14 laggy, deferring > > client_request(client.92141:6795587 getattr pAsLsXsFs #1026dc1) > > 2014-08-24 07:10:19.682021 7f2b575e7700 5 mds.0.14 is_laggy 19.324963 > > 15 > > since last acked beacon > > 2014-08-24 07:10:20.358011 7f2b554e2700 10 mds.0.14 beacon_send up:active > > seq 127220 (currently up:active) > > 2014-08-24 07:10:21.515899 7f2b575e7700 5 mds.0.14 is_laggy 21.158841 > > 15 > > since last acked beacon > > 2014-08-24 07:10:21.515912 7f2b575e7700 10 mds.0.14 laggy, deferring > > client_session(request_renewcaps seq 26766) > > 2014-08-24 07:10:21.515915 7f2b575e7700 5 mds.0.14 is_laggy 21.158857 > > 15 > > since last acked beacon > > 2014-08-24 07:10:21.981148 7f2b575e7700 10 mds.0.snap check_osd_map > > need_to_purge={} > > 2014-08-24 07:10:21.981176 7f2b575e7700 5 mds.0.14 is_laggy 21.624117 > > 15 > > since last acked beacon > > 2014-08-24 07:10:23.170528 7f2b575e7700 5 mds.0.14 handle_mds_map epoch > 93 > > from mon.0 > > 2014-08-24 07:10:23.175367 7f2b532d5700 0 -- 10.251.188.124:6800/985 >> > > 10.251.188.118:0/2461578479 pipe(0x5588a80 sd=23 :6800 s=2 pgs=91 cs=1 > l=0 > > c=0x2cbfb20).fault with nothing to send, going to standby > > 2014-08-24 07:10:23.175376 7f2b533d6700 0 -- 10.251.188.124:6800/985 >> > > 10.251.188.55:0/306923677 pipe(0x5588d00 sd=22 :6800 s=2 pgs=7 cs=1 l=0 > > c=0x2cbf700).fault with nothing to send, going to standby > > 2014-08-24 07:10:23.175380 7f2b531d4700 0 -- 10.251.188.124:6800/985 >> > > 10.251.188.31:0/2854230502 pipe(0x5589480 sd=24 :6800 s=2 pgs=881 cs=1 > l=0 > > c=0x2cbfde0).fault with nothing to send, going to standby > > 2014-08-24 07:10:23.175438 7f2b534d7700 0 -- 10.251.188.124:6800/985 >> > > 10.251.188.68:0/2928927296 pipe(0x5588800 sd=21 :6800 s=2 pgs=7 cs=1 l=0 > > c=0x2cbf5a0).fault with nothing to send, going to standby > > 2014-08-24 07:10:23.184201 7f2b575e7700 10 mds.0.14 my compat > > compat={},rocompat={},incompat={1=base v0.20,2=client writeable > > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds > > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline > data} > > 2014-08-24 07:10:23.184255 7f2b575e7700 10 mds.0.14 mdsmap compat > > compat={},rocompat={},incompat={1=base v0.20,2=client writeable > > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds > > uses versioned encoding,6=dirfrag is stored in omap} > > 2014-08-24 07:10:23.184264 7f2b575e7700 10 mds.-1.-1 map says i am > > 10.251.188.124:6800/985 mds.-1.-1 state down:dne > > 2014-08-24 07:10:23.184275 7f2b575e7700 10 mds.-1.-1 peer mds gid 94665 > > removed from map > > 2014-08-24 07:10:23.184282 7f2b575e7700 1 mds.-1.-1 handle_mds_map i > > (10.251.188.124:6800/985) dne in the mdsmap, re
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Sebastien, No timeline yet.. Couple of issues here. 1. I don't think FileStore is a solution for SSDs because of its double write. WA will be >=2 always. So far can't find a way to replace Ceph journal for filesystem backend. 2. The existing key-value backend like leveldb/Rocksdb also are not flash optimized. Yes, Ceph journal will not be needed anymore but I guess compaction overhead for Flash will be too much. Thanks & Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han Sent: Friday, August 29, 2014 3:17 AM To: ceph-users Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS Thanks a lot for the answers, even if we drifted from the main subject a little bit. Thanks Somnath for sharing this, when can we expect any codes that might improve _write_ performance? @Mark thanks trying this :) Unfortunately using nobarrier and another dedicated SSD for the journal (plus your ceph setting) didn't bring much, now I can reach 3,5K IOPS. By any chance, would it be possible for you to test with a single OSD SSD? On 28 Aug 2014, at 18:11, Sebastien Han wrote: > Hey all, > > It has been a while since the last thread performance related on the > ML :p I've been running some experiment to see how much I can get from an SSD > on a Ceph cluster. > To achieve that I did something pretty simple: > > * Debian wheezy 7.6 > * kernel from debian 3.14-0.bpo.2-amd64 > * 1 cluster, 3 mons (i'd like to keep this realistic since in a real > deployment i'll use 3) > * 1 OSD backed by an SSD (journal and osd data on the same device) > * 1 replica count of 1 > * partitions are perfectly aligned > * io scheduler is set to noon but deadline was showing the same > results > * no updatedb running > > About the box: > > * 32GB of RAM > * 12 cores with HT @ 2,4 GHz > * WB cache is enabled on the controller > * 10Gbps network (doesn't help here) > > The SSD is a 200G Intel DC S3700 and is capable of delivering around > 29K iops with random 4k writes (my fio results) As a benchmark tool I used > fio with the rbd engine (thanks deutsche telekom guys!). 
> > O_DIECT and D_SYNC don't seem to be a problem for the SSD: > > # dd if=/dev/urandom of=rand.file bs=4k count=65536 > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s > > # du -sh rand.file > 256Mrand.file > > # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct > 65536+0 records in > 65536+0 records out > 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s > > See my ceph.conf: > > [global] > auth cluster required = cephx > auth service required = cephx > auth client required = cephx > fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 > osd pool default pg num = 4096 > osd pool default pgp num = 4096 > osd pool default size = 2 > osd crush chooseleaf type = 0 > > debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [mon] > mon osd down out interval = 600 > mon osd min down reporters = 13 >[mon.ceph-01] >host = ceph-01 >mon addr = 172.20.20.171 > [mon.ceph-02] >host = ceph-02 >mon addr = 172.20.20.172 > [mon.ceph-03] >host = ceph-03 >mon addr = 172.20.20.173 > >debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >debug osd = 0/0 >debug optracker = 0/0 >debug objclass = 0/0 >debug filestore = 0/0 >debug journal = 0/0 >debug ms = 0/0 >debug monc = 0/0 >debug tp = 0/0 >debug auth = 0/0 >debug finisher = 0/0 >debug heartbeatmap = 0/0 >debug perfcounter = 0/0 >debug asok = 0/0 >debug throttle = 0/0 > > [osd] > osd mkfs type = xfs > osd mkfs options xfs = -f -i size=2048 osd mount options xfs = > rw,noatime,logbsize=256k,delaylog osd journal size = 20480 > cluster_network = 172.20.20.0/24 public_network = 172.20.20.0/24 osd > mon heartbeat interval = 30 # Performance tuning filestore merge > threshold = 40 filestore split multiple = 8 osd op threads = 8 # > Recovery tuning osd recovery max active = 1 osd max backfills = 1 > osd recovery op priority = 1 > > >debug lockdep = 0/0 >debug context = 0/0 >debug crush = 0/0 >debug buffer = 0/0 >debug timer = 0/0 >debug journaler = 0/0 >
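A rough illustration of what the FileStore double write means for the raw device budget (the numbers are only examples taken from this thread; replication, filesystem metadata and compaction would push write amplification higher still):

# Illustrative only: journal + filestore double write halves the usable 4k write IOPS.
ssd_iops=29000      # the DC S3700 4k random-write figure quoted earlier in the thread
wa=2                # write amplification >= 2: journal write plus filestore write
replicas=1          # Sebastien's test uses a replica count of 1; multiply by size otherwise
echo "upper bound on client 4k write IOPS per device: $(( ssd_iops / (wa * replicas) ))"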
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
Hi Somnath, we're in the process of evaluating SanDisk SSDs for Ceph (filesystem and journal on each), 8 OSDs / SSDs per host on a Xeon E3 1650. Which one can you recommend? Greets, Stefan Excuse my typos, sent from my mobile phone. > On 29.08.2014 at 18:33, Somnath Roy wrote: > > Somnath ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole
Hi All, Does anyone have a script or sequence of commands to prepare all drives on a single computer for use by ceph, and then start up all OSDs on the computer at one time? I feel this would be faster and cause less network traffic than adding one drive at a time, which is what the current script does. Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole
Hi, You could use ceph-deploy. There won't be any difference in the total amount of data being moved. On 29.08.2014 at 18:53, Chad Seys wrote: Hi All, Does anyone have a script or sequence of commands to prepare all drives on a single computer for use by ceph, and then start up all OSDs on the computer at one time? I feel this would be faster and cause less network traffic than adding one drive at a time, which is what the current script does. Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Konrad Gutkowski ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
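For Chad's question, one low-tech alternative to ceph-deploy/puppet is to keep the new OSDs from being marked in until every drive on the node has been prepared, so the data movement happens in one step rather than once per drive (as Konrad notes, the total amount moved ends up the same either way). A rough, untested sketch, assuming the stock ceph-disk tooling; the device names and colocated journals are placeholders:

#!/bin/bash
# Sketch only: prepare every drive on one node, then bring all the new OSDs in together.
set -e
DRIVES="/dev/sdb /dev/sdc /dev/sdd /dev/sde"

ceph osd set noin                     # freshly started OSDs stay "out", so no rebalancing yet

for dev in $DRIVES; do
    ceph-disk prepare "$dev"          # partition, mkfs and register a new OSD
    ceph-disk activate "${dev}1"      # start the daemon; it joins the map but stays out
done

ceph osd unset noin
for id in $(ceph osd ls); do          # mark the still-out OSDs in; already-in OSDs are unaffected
    ceph osd in "$id"
done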
Re: [ceph-users] 'incomplete' PGs: what does it mean?
Hmm, so you've got PGs which are out-of-date on disk (by virtue of being an older snapshot?) but still have records of them being newer in the OSD journal? That's a new failure node for me and I don't think we have any tools designed for solving. If you can *back up the disk* before doing this, I think I'd try flushing the OSD journal, then zapping it and creating a new one. You'll probably still have an inconsistent store, so you'll then need to, for each incomplete pg (and for every OSD at the same point on it) use the OSD store tool extract the "pginfo" entry and rewrite it to the versions which the pg actually is on disk. These are uncharted waters and you're going to need to explore some dev tools; good luck! (Unless Sam or somebody has a better solution.) -Greg On Thursday, August 28, 2014, John Morris wrote: > Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE > thread. They were enough to get PGs 'incomplete' due to 'not enough > OSDs hosting' resolved by rolling back to a btrfs snapshot. I promise > to write a full post-mortem (embarrassing as it will be) after the > cluster is fully healthy. > > As is my kind of luck, the cluster *also* suffers from eight of the > *other* 'missing log' sort of 'incomplete' PGs: > > 2014-08-28 23:42:03.350612 7f1cc9d82700 20 osd.1 pg_epoch: 13085 > pg[2.5c( v 10143'300715 (9173'297714,10143'300715] local-les=11890 > n=3404 ec=1 les/c 11890/10167 12809/12809/12809) [7,3,0,4] > r=-1 lpr=12809 pi=8453-12808/138 lcod 0'0 > inactive NOTIFY] > handle_activate_map: Not dirtying info: > last_persisted is 13010 while current is 13085 > > The data clearly exists in the renegade OSD's data directory. There > are no reported 'unfound' objects, so 'mark_unfound_lost revert' doesn't > apply. No apparently useful data from 'ceph pg ... query', but an > example is pasted below. > > Since the beginning of the cluster rebuild, all ceph clients have been > turned off, so I believe there's no fear of lost data by reverting to > these PGs, and besides there're always backup tapes. > > How can Ceph be told to accept the versions it sees on osd.1 as the > most current version, and forget the missing log history? 
> >John > > > $ ceph pg 2.5c query > { "state": "incomplete", > "epoch": 13241, > "up": [ > 7, > 1, > 0, > 4], > "acting": [ > 7, > 1, > 0, > 4], > "info": { "pgid": "2.5c", > "last_update": "10143'300715", > "last_complete": "10143'300715", > "log_tail": "9173'297715", > "last_backfill": "0\/\/0\/\/-1", > "purged_snaps": "[]", > "history": { "epoch_created": 1, > "last_epoch_started": 11890, > "last_epoch_clean": 10167, > "last_epoch_split": 0, > "same_up_since": 13229, > "same_interval_since": 13229, > "same_primary_since": 13118, > "last_scrub": "10029'298459", > "last_scrub_stamp": "2014-08-18 17:36:01.079649", > "last_deep_scrub": "8323'284793", > "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106", > "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649"}, > "stats": { "version": "10143'300715", > "reported_seq": "1764", > "reported_epoch": "13241", > "state": "incomplete", > "last_fresh": "2014-08-29 01:35:44.196909", > "last_change": "2014-08-29 01:22:50.298880", > "last_active": "0.00", > "last_clean": "0.00", > "last_became_active": "0.00", > "last_unstale": "2014-08-29 01:35:44.196909", > "mapping_epoch": 13223, > "log_start": "9173'297715", > "ondisk_log_start": "9173'297715", > "created": 1, > "last_epoch_clean": 10167, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "10029'298459", > "last_scrub_stamp": "2014-08-18 17:36:01.079649", > "last_deep_scrub": "8323'284793", > "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106", > "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649", > "log_size": 3000, > "ondisk_log_size": 3000, > "stats_invalid": "0", > "stat_sum": { "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_degraded": 0, > "num_objects_unfound": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, > "num_keys_recovered": 0}, > "stat
Re: [ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole
Hello, - Mail original - > De: "Chad Seys" > À: ceph-users@lists.ceph.com > Envoyé: Vendredi 29 Août 2014 18:53:19 > Objet: [ceph-users] script for commissioning a node with multiple osds, > added to cluster as a whole > > Hi All, > Does anyone have a script or sequence of commands to prepare all drives on > a > single computer for use by ceph, and then start up all OSDs on the computer > at > one time? > I feel this would be faster and less network traffic than adding one drive > at a time, which is what the current script does. You can use puppet : https://github.com/cernceph/puppet-ceph https://github.com/enovance/puppet-ceph You may also have a look here : https://github.com/cernceph Cheers, Olivier. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question about monitor and paxos relationship
On Thu, Aug 28, 2014 at 9:52 PM, pragya jain wrote: > I have some basic question about monitor and paxos relationship: > > As the documents says, Ceph monitor contains cluster map, if there is any > change in the state of the cluster, the change is updated in the cluster > map. monitor use paxos algorithm to create the consensus among monitors to > establish a quorum. > And when we talk about the Paxos algorithm, documents says that monitor > writes all changes to the Paxos instance and Paxos writes the changes to a > key/value store for strong consistency. > > #1: I am unable to understand what actually the Paxos algorithm do? all > changes in the cluster map are made by Paxos algorithm? how it create a > consensus among monitors Paxos is an algorithm for making decisions and/or safely replicating data in a distributed system. The Ceph monitor cluster uses it for all changes to any of its data. > My assumption is: cluster map is updated when OSD report monitor about any > changes, there is no role of Paxos in it. Paxos write changes made only for > the monitors. Please somebody elaborate at this point. Every change the monitors incorporate to any data structure, most definitely including the OSD map's changes based on reports from OSDs, is passed through paxos. > #2: why odd no. of monitors are recommended for production cluster, not even > no.? This is because of a trait of the Paxos' systems durability and uptime guarantees. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question about monitor and paxos relationship
On Fri, Aug 29, 2014 at 12:52 AM, pragya jain wrote: > #2: why odd no. of monitors are recommended for production cluster, not even no.?

Because to achieve a quorum, you must always have participation of more than 50% of the monitors. Not 50%. More than 50%. With an even number of monitors, half is not a quorum so you need half + 1. With an odd number of monitors, there's no such thing as half. So with an even number of monitors, one is always "wasted."

1 monitors -> 1 to make quorum -> 0 can be lost
2 monitors -> 2 to make quorum -> 0 can be lost
3 monitors -> 2 to make quorum -> 1 can be lost
4 monitors -> 3 to make quorum -> 1 can be lost
5 monitors -> 3 to make quorum -> 2 can be lost
6 monitors -> 4 to make quorum -> 2 can be lost
7 monitors -> 4 to make quorum -> 3 can be lost

So an even number N of monitors doesn't give you any better fault resilience than N-1 monitors. And the more monitors you have, the more traffic there is between them. So when N is even, N monitors consume more resources and provide no extra benefit compared to N-1 monitors. Hope that helps! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
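The pattern in that table is just integer division (majority = floor(N/2) + 1); a short loop reproduces it:

# Majority quorum: strictly more than half the monitors, i.e. N/2 + 1 with integer division.
for n in 1 2 3 4 5 6 7; do
    q=$(( n / 2 + 1 ))
    echo "$n monitors -> $q to make quorum -> $(( n - q )) can be lost"
done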
Re: [ceph-users] question about monitor and paxos relationship
On 08/29/2014 11:22 PM, J David wrote: So an even number N of monitors doesn't give you any better fault resilience than N-1 monitors. And the more monitors you have, the more traffic there is between them. So when N is even, N monitors consume more resources and provide no extra benefit compared to N-1 monitors. Except for more copies ;) But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5. As long as you don't go with 1 you should be okay. Only go with 1 if you're truly okay with losing whatever you're storing if that one monitor's disk is fried. -Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com