On Mon, May 11, 2015 at 05:20:25AM +0000, Somnath Roy wrote:
> Two things..
>
> 1. You should always use SSD drives for benchmarking after preconditioning
> them.

well, I don't really understand what you mean by preconditioning here... ?
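If you mean filling the drives and driving them into steady state before
measuring, I guess I could run something along these lines first (just a
sketch; /dev/sdX is a placeholder for the SSD under test, and it would of
course wipe the device, so the OSD would have to be recreated afterwards):

  # 1) sequential fill of the whole device
  fio --name=fill --filename=/dev/sdX --rw=write --bs=1M --iodepth=32 \
      --ioengine=libaio --direct=1

  # 2) sustained 4k random writes until performance settles (steady state)
  fio --name=precondition --filename=/dev/sdX --rw=randwrite --bs=4k \
      --iodepth=32 --ioengine=libaio --direct=1 --time_based --runtime=1800

or did you have something else in mind?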
>
> 2. After creating and mapping the rbd lun, you need to write data first to
> read it afterwards, otherwise fio output will be misleading. In fact, I think
> you will see IO is not even hitting the cluster (check with ceph -s)

yes, so this confirms my conjecture. ok.

> Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check the
> following.
>
> 1. Check client or OSD node cpu is saturating or not.

On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client
node (which is one of the OSD nodes as well), I can see fio eating quite a lot
of CPU cycles.. I tried stopping ceph-osd on this node (so only two nodes are
serving data) and performance got a bit higher, to ~33k IOPS. But I still
think it's not very good..

> 2. With 4K, hope network BW is fine

I think it's ok..

> 3. Number of PGs/pool should be ~128 or so.

I'm using pg_num 128.

> 4. If you are using krbd, you might want to try the latest krbd module where
> the TCP_NODELAY problem is fixed. If you don't want that complexity, try with
> fio-rbd.

I'm not using krbd (only for writing data to the volume); for benchmarking,
I'm using fio-rbd.

Is there anything else I could check?
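For completeness, the rough sequence I use to fill the volume before
re-running the read test is something like the following (the /dev/rbd0 path
is just an example of what rbd map returns, and the pool/image names are the
ones from my original mail below):

  # map the image via krbd and overwrite the whole volume with random data
  rbd map ssd/test
  dd if=/dev/urandom of=/dev/rbd0 bs=4M oflag=direct
  rbd unmap /dev/rbd0

  # then repeat the random read benchmark through fio-rbd
  fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test \
      --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 \
      --readwrite=randread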
> Hope this helps,
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nikola Ciprich
> Sent: Sunday, May 10, 2015 9:43 PM
> To: ceph-users
> Cc: n...@linuxbox.cz
> Subject: [ceph-users] very different performance on two volumes in the same
> pool #2
>
> Hello ceph developers and users,
>
> some time ago, I posted here a question regarding very different performance
> for two volumes in one pool (backed by SSD drives).
>
> After some examination, I have probably got to the root of the problem..
>
> When I create a fresh volume (i.e. rbd create --image-format 2 --size 51200
> ssd/test) and run a random io fio benchmark
>
> fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test
> --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64
> --readwrite=randread
>
> I get very nice performance of up to 200k IOPS. However, once the volume has
> been written to (i.e. when I map it using rbd map and dd the whole volume
> with some random data) and I repeat the benchmark, random performance drops
> to ~23k IOPS.
>
> This leads me to the conjecture that for unwritten (sparse) volumes, a read
> is just a noop, simply returning zeroes without really having to read data
> from physical storage, and thus showing nice performance; but once the
> volume is written, performance drops due to the need to physically read the
> data, right?
>
> However, I'm a bit unhappy about the performance drop. The pool is backed by
> 3 SSD drives (each having random io performance of 100k iops) on three
> nodes, and the pool size (replication) is set to 3. The cluster is
> completely idle; the nodes are quad core Xeon E3-1220 v3 @ 3.10GHz with 32GB
> RAM each, centos 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I
> even tried upgrading gperftools-libs to 2.4). The nodes are connected using
> 10gb ethernet, with jumbo frames enabled.
>
> I tried tuning the following values:
>
> osd_op_threads = 5
> filestore_op_threads = 4
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
>
> I don't see anything special in perf:
>
>   5.43%  [kernel]              [k] acpi_processor_ffh_cstate_enter
>   2.93%  libtcmalloc.so.4.2.6  [.] 0x0000000000017d2c
>   2.45%  libpthread-2.12.so    [.] pthread_mutex_lock
>   2.37%  libpthread-2.12.so    [.] pthread_mutex_unlock
>   2.33%  [kernel]              [k] do_raw_spin_lock
>   2.00%  libsoftokn3.so        [.] 0x000000000001f455
>   1.96%  [kernel]              [k] __switch_to
>   1.32%  [kernel]              [k] __schedule
>   1.24%  libstdc++.so.6.0.13   [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char
>   1.24%  libc-2.12.so          [.] memcpy
>   1.19%  libtcmalloc.so.4.2.6  [.] operator delete(void*)
>   1.16%  [kernel]              [k] __d_lookup_rcu
>   1.09%  libstdc++.so.6.0.13   [.] 0x000000000007d6be
>   0.93%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
>   0.93%  ceph-osd              [.] crush_hash32_3
>   0.85%  libc-2.12.so          [.] vfprintf
>   0.84%  libc-2.12.so          [.] __strlen_sse42
>   0.80%  [kernel]              [k] get_futex_key_refs
>   0.80%  libpthread-2.12.so    [.] pthread_mutex_trylock
>   0.78%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   0.71%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>   0.68%  ceph-osd              [.] ceph::log::Log::flush()
>   0.66%  libtcmalloc.so.4.2.6  [.] tc_free
>   0.63%  [kernel]              [k] resched_curr
>   0.63%  [kernel]              [k] page_fault
>   0.62%  libstdc++.so.6.0.13   [.] std::string::reserve(unsigned long)
>
> I'm running the benchmark directly on one of the nodes, which I know is not
> optimal, but it's still able to give those 200k iops for the empty volume,
> so I guess it shouldn't be a problem..
>
> Another story is random write performance, which is totally poor, but I'd
> like to deal with read performance first..
>
> So my question is: are those numbers normal? If not, what should I check?
>
> I'll be very grateful for all the hints I could get..
>
> thanks a lot in advance
>
> nik
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com