Thanks, Mark. Yes, we're using XFS and 3x replication, although we might switch to 2x replication since we're not too worried about resiliency.
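(If we do drop to 2x, my understanding is that it should just be a matter of something like the following on each of our pools; "images" is only an example pool name here:)

# ceph osd pool set images size 2   # "images" is just an example; would need repeating per pool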
I did some tests on single disks with dd and am able to get about 152 MB/s writes and 191 MB/s reads from a single disk. I also ran the same test on all 13 disks in parallel and didn't notice much of a drop in throughput: 140 MB/s and 183 MB/s, respectively.

Another point to make is that our network doesn't seem to be performing at its best. Tests with iperf revealed that we're only getting between 4 and 6 Gbit/s between hosts.

I guess we'll need to experiment with moving the journals off the data disks, but I'm not quite sure what the best practices are. From what I understand, journals are rather small (5GB by default?), and at the moment we don't have much flexibility to add more disks to these servers. So, given what we have, what would be the ideal setup? Would it make sense to put the journals of all 12 OSDs on the same 900GB disk?
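For reference, the disk and network numbers above came from quick-and-dirty tests roughly along these lines (the paths, sizes and peer host below are only examples, not the exact commands I ran):

# dd if=/dev/zero of=/srv/disk1/ddtest bs=1M count=4096 oflag=direct   # per-disk write test (example path)
# dd if=/srv/disk1/ddtest of=/dev/null bs=1M iflag=direct              # per-disk read test
# iperf -c <peer-host>                                                 # between two of the Ceph hosts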
Sergio

On Thu, Apr 7, 2016 at 6:03 PM, Mark Nelson <mnel...@redhat.com> wrote:
> Hi Sergio,
>
> On 04/07/2016 07:00 AM, Sergio A. de Carvalho Jr. wrote:
>
>> Hi all,
>>
>> I've set up a testing/development Ceph cluster consisting of 5 Dell PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz, dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5 and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root partition and the other 13 disks are assigned to OSDs, so we have 5 x 13 = 65 OSDs in total. We also run 1 monitor on every host. Journals are 5GB partitions on each disk (this is something we obviously will need to revisit later). The purpose of this cluster will be to serve as backend storage for Cinder volumes and Glance images in an OpenStack cloud.
>>
>> With this setup, I'm getting what I'm considering "okay" performance:
>>
>> # rados -p images bench 5 write
>> Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds or 0 objects
>>
>> Total writes made: 394
>> Write size: 4194304
>> Bandwidth (MB/sec): 299.968
>>
>> Stddev Bandwidth: 127.334
>> Max bandwidth (MB/sec): 348
>> Min bandwidth (MB/sec): 0
>> Average Latency: 0.212524
>> Stddev Latency: 0.13317
>> Max latency: 0.828946
>> Min latency: 0.07073
>>
>> Does that look acceptable? How much more can I expect to achieve by fine-tuning and perhaps using a more efficient setup?
>
> I'll assume 3x replication for these tests. In reasonable conditions you should be able to get about 70MB/s raw per standard 7200rpm spinning disk with Ceph using filestore with XFS. For 65 OSDs, let's say about 4.5GB/s. Divide that by 3 for replication and you get 1.5GB/s. Now add the journal double-write penalty and you are down to about 750MB/s. So I'd say your aggregate throughput here is lower than what you might ideally see.
>
> The first step would probably be to increase the concurrency and see how much that helps.
>
>> I do understand the bandwidth above is a product of running 16 concurrent writes and rather small object sizes (4MB). Bandwidth lowers significantly with 64MB and 1 thread:
>>
>> # rados -p images bench 5 write -b 67108864 -t 1
>> Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds or 0 objects
>>
>> Total writes made: 7
>> Write size: 67108864
>> Bandwidth (MB/sec): 71.520
>>
>> Stddev Bandwidth: 24.1897
>> Max bandwidth (MB/sec): 64
>> Min bandwidth (MB/sec): 0
>> Average Latency: 0.894792
>> Stddev Latency: 0.0547502
>> Max latency: 0.99311
>> Min latency: 0.832765
>>
>> Is such a drop expected?
>
> Yep! Concurrency is really important for distributed systems, and Ceph is no exception. If you only keep 1 write in flight, you can't really expect better than the performance of a single OSD. Ceph writes a full copy of the data to the journal before sending a write acknowledgment to the client. In fact, every replica write also has to be fully written to the journal on the secondary OSDs as well. These writes happen in parallel, but it adds latency, and you'll only be as fast overall as the slowest of all of these journal writes. In your case, you also have the filesystem writes contending with the journal writes, since the journals are co-located.
>
> In this case you are probably only getting 71MB/s because the test is so short. With co-located journals, I'd expect a longer-running test to actually get less than this in practice.
>
>> Now, what I'm really concerned about is upload times. Uploading a randomly-generated 1GB file takes a bit too long:
>>
>> # time rados -p images put random_1GB /tmp/random_1GB
>>
>> real    0m35.328s
>> user    0m0.560s
>> sys     0m3.665s
>>
>> Is this normal? If so, if I set up this cluster as a backend for Glance, does that mean uploading a 1GB image will require 35 seconds (plus whatever time Glance requires to do its own thing)?
>
> And here's where you are getting less. I'd hope for a little faster than 29MB/s, but given how your cluster is set up, 30-40MB/s is probably about right. If you need this use case to be faster, you have a couple of options.
>
> 1) Wait for bluestore to become production-ready. This is the new OSD backend that specifically avoids full-data journal writes for large sequential write IO. Expect per-OSD speed to be around 1.5-2X as fast in this case for spinning-disk-only clusters.
>
> 2) Move the journals off the data disks. A common way to do this is to buy a couple of very fast, high-write-endurance NVMe drives or SSDs. Some of the newer NVMe drives are fast enough to support journals for 15-20 spinning disks each. Just make sure they have enough write endurance to meet your needs. Assuming no other bottlenecks, this is usually close to a 2X performance improvement for large write IO.
>
> 3) If that's not good enough, you might consider buying a small set of SSDs/NVMes for a dedicated SSD pool for specific cases like this. Even in this setup, you'll likely see higher performance with more concurrency. Here's an example I just ran on a 4-node cluster using a single fast NVMe drive per node:
>
> rados -p cbt-librbdfio bench 30 write -t 1
> Bandwidth (MB/sec): 180.205
>
> rados -p cbt-librbdfio bench 30 write -t 16
> Bandwidth (MB/sec): 1197.11
>
>> Thanks,
>>
>> Sergio
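Based on Mark's suggestion above, the next test on my list is a longer run with more writes in flight, something along these lines (duration and thread count are just a first guess on my part):

# rados -p images bench 60 write -t 32   # 60s run, 32 concurrent 4MB writes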
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com