Nik, If you increase num_jobs beyond 4 , is it helping further ? Try 8 or so. Yeah, libsoft* is definitely consuming some cpu cycles , but I don't know how to resolve that. Also, acpi_processor_ffh_cstate_enter popped up and consuming lot of cpu. Try disabling cstate and run cpu in maximum performance mode , this may give you some boost.
Thanks & Regards Somnath -----Original Message----- From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz] Sent: Sunday, May 10, 2015 11:32 PM To: Somnath Roy Cc: ceph-users; n...@linuxbox.cz Subject: Re: [ceph-users] very different performance on two volumes in the same pool #2 On Mon, May 11, 2015 at 06:07:21AM +0000, Somnath Roy wrote: > Yes, you need to run fio clients on a separate box, it will take quite a bit > of cpu. > Stopping OSDs on other nodes, rebalancing will start. Have you waited cluster > to go for active + clean state ? If you are running while rebalancing is > going on , the performance will be impacted. I set noout, so there was no rebalancing, I forgot to mention that.. > > ~110% cpu util seems pretty low. Try to run fio_rbd with more num_jobs (say > 3 or 4 or more), io_depth =64 is fine and see if it improves performance or > not. ok, increasing jobs to 4 seems to squeeze a bit more from the cluster, about 43.3K iops.. OSD cpu util jumps to ~300% on both alive nodes, so there seems to be still a bit of reserves.. > Also, since you have 3 OSDs (3 nodes?), I would suggest to tweak the > following settings > > osd_op_num_threads_per_shard > osd_op_num_shards > > May be (1,10 / 1,15 / 2, 10 ?). tried all those combinations, but it doesn't make almost any difference.. do you think I could get more then those 43k? one more think that makes me wonder a bit is this line I can see in perf: 2.21% libsoftokn3.so [.] 0x000000000001ebb2 I suppose this has something to do with resolving, 2.2% seems quite a lot to me.. Should I be worried about it? Does it make sense to enable kernel DNS resolving support in ceph? thanks for your time Somnath! nik > > Thanks & Regards > Somnath > > -----Original Message----- > From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz] > Sent: Sunday, May 10, 2015 10:33 PM > To: Somnath Roy > Cc: ceph-users; n...@linuxbox.cz > Subject: Re: [ceph-users] very different performance on two volumes in > the same pool #2 > > > On Mon, May 11, 2015 at 05:20:25AM +0000, Somnath Roy wrote: > > Two things.. > > > > 1. You should always use SSD drives for benchmarking after preconditioning > > it. > well, I don't really understand... ? > > > > > 2. After creating and mapping rbd lun, you need to write data first > > to read it afterword otherwise fio output will be misleading. In > > fact, I think you will see IO is not even hitting cluster (check > > with ceph -s) > yes, so this approves my conjecture. ok. > > > > > > Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check > > the following. > > > > 1. Check client or OSd node cpu is saturating or not. > On OSD nodes, I can see cpeh-osd CPU utilisation of ~110%. On client node > (which is one of OSD nodes as well), I can see fio eating quite lot of CPU > cycles.. I tried stopping ceph-osd on this node (thus only two nodes are > serving data) and performance got a bit higher, to ~33k IOPS. But still I > think it's not very good.. > > > > > > 2. With 4K, hope network BW is fine > I think it's ok.. > > > > > > 3. Number of PGs/pool should be ~128 or so. > I'm using pg_num 128 > > > > > > 4. If you are using krbd, you might want to try latest krbd module where > > TCP_NODELAY problem is fixed. If you don't want that complexity, try with > > fio-rbd. > I'm not using RBD (only for writing data to volume), for benchmarking, I'm > using fio-rbd. > > anything else I could check? > > > > > > Hope this helps, > > > > Thanks & Regards > > Somnath > > > > -----Original Message----- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On > > Behalf Of Nikola Ciprich > > Sent: Sunday, May 10, 2015 9:43 PM > > To: ceph-users > > Cc: n...@linuxbox.cz > > Subject: [ceph-users] very different performance on two volumes in > > the same pool #2 > > > > Hello ceph developers and users, > > > > some time ago, I posted here a question regarding very different > > performance for two volumes in one pool (backed by SSD drives). > > > > After some examination, I probably got to the root of the problem.. > > > > When I create fresh volume (ie rbd create --image-format 2 --size > > 51200 ssd/test) and run random io fio benchmark > > > > fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 > > --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k > > --iodepth=64 --readwrite=randread > > > > I get very nice performance of up to 200k IOPS. However once the volume is > > written to (ie when I map it using rbd map and dd whole volume with some > > random data), and repeat the benchmark, random performance drops to ~23k > > IOPS. > > > > This leads me to conjecture that for unwritten (sparse) volumes, read is > > just a noop, simply returning zeroes without really having to read data > > from physical storage, and thus showing nice performance, but once the > > volume is written, performance drops due to need to physically read the > > data, right? > > > > However I'm a bit unhappy about the performance drop, the pool is backed by > > 3 SSD drives (each having random io performance of 100k iops) on three > > nodes, and object size is set to 3. Cluster is completely idle, nodes are > > quad core Xeons E3-1220 v3 @ 3.10GHz, 32GB RAM each, centos 6, kernel > > 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading > > gperftools-libs to 2.4) Nodes are connected using 10gb ethernet, with jumbo > > frames enabled. > > > > > > I tried tuning following values: > > > > osd_op_threads = 5 > > filestore_op_threads = 4 > > osd_op_num_threads_per_shard = 1 > > osd_op_num_shards = 25 > > filestore_fd_cache_size = 64 > > filestore_fd_cache_shards = 32 > > > > I don't see anything special in perf: > > > > 5.43% [kernel] [k] acpi_processor_ffh_cstate_enter > > 2.93% libtcmalloc.so.4.2.6 [.] 0x0000000000017d2c > > 2.45% libpthread-2.12.so [.] pthread_mutex_lock > > 2.37% libpthread-2.12.so [.] pthread_mutex_unlock > > 2.33% [kernel] [k] do_raw_spin_lock > > 2.00% libsoftokn3.so [.] 0x000000000001f455 > > 1.96% [kernel] [k] __switch_to > > 1.32% [kernel] [k] __schedule > > 1.24% libstdc++.so.6.0.13 [.] std::basic_ostream<char, > > std::char_traits<char> >& std::__ostream_insert<char, > > std::char_traits<char> >(std::basic_ostream<char, std::char > > 1.24% libc-2.12.so [.] memcpy > > 1.19% libtcmalloc.so.4.2.6 [.] operator delete(void*) > > 1.16% [kernel] [k] __d_lookup_rcu > > 1.09% libstdc++.so.6.0.13 [.] 0x000000000007d6be > > 0.93% libstdc++.so.6.0.13 [.] std::basic_streambuf<char, > > std::char_traits<char> >::xsputn(char const*, long) > > 0.93% ceph-osd [.] crush_hash32_3 > > 0.85% libc-2.12.so [.] vfprintf > > 0.84% libc-2.12.so [.] __strlen_sse42 > > 0.80% [kernel] [k] get_futex_key_refs > > 0.80% libpthread-2.12.so [.] pthread_mutex_trylock > > 0.78% libtcmalloc.so.4.2.6 [.] > > tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, > > unsigned long, int) > > 0.71% libstdc++.so.6.0.13 [.] std::basic_string<char, > > std::char_traits<char>, std::allocator<char> >::basic_string(std::string > > const&) > > 0.68% ceph-osd [.] ceph::log::Log::flush() > > 0.66% libtcmalloc.so.4.2.6 [.] tc_free > > 0.63% [kernel] [k] resched_curr > > 0.63% [kernel] [k] page_fault > > 0.62% libstdc++.so.6.0.13 [.] std::string::reserve(unsigned long) > > > > I'm running benchmark directly on one of nodes, which I know is not > > optimal, but it's still able to give those 200k iops for empty volume, so I > > guess it shouldn't be problem.. > > > > Another story is random write performance, which is totally poor, but I't > > like to deal with read performance first.. > > > > > > so my question is, are those numbers normal? If not, what should I check? > > > > I'll be very grateful for all the hints I could get.. > > > > thanks a lot in advance > > > > nik > > > > > > -- > > ------------------------------------- > > Ing. Nikola CIPRICH > > LinuxBox.cz, s.r.o. > > 28.rijna 168, 709 00 Ostrava > > > > tel.: +420 591 166 214 > > fax: +420 596 621 273 > > mobil: +420 777 093 799 > > www.linuxbox.cz > > > > mobil servis: +420 737 238 656 > > email servis: ser...@linuxbox.cz > > ------------------------------------- > > > > ________________________________ > > > > PLEASE NOTE: The information contained in this electronic mail message is > > intended only for the use of the designated recipient(s) named above. If > > the reader of this message is not the intended recipient, you are hereby > > notified that you have received this message in error and that any review, > > dissemination, distribution, or copying of this message is strictly > > prohibited. If you have received this communication in error, please notify > > the sender by telephone or e-mail (as shown above) immediately and destroy > > any and all copies of this message in your possession (whether hard copies > > or electronically stored copies). > > > > > > -- > ------------------------------------- > Ing. Nikola CIPRICH > LinuxBox.cz, s.r.o. > 28.rijna 168, 709 00 Ostrava > > tel.: +420 591 166 214 > fax: +420 596 621 273 > mobil: +420 777 093 799 > www.linuxbox.cz > > mobil servis: +420 737 238 656 > email servis: ser...@linuxbox.cz > ------------------------------------- > -- ------------------------------------- Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz ------------------------------------- _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com