Hi, folks.
I'm running a series of experiments with CephFS and have hit a behavior I can't seem to control. I set up a 5-node Ceph cluster on enterprise servers. Each server has 10 x 6TB HDDs and 2 x 800GB SSDs. I configured the SSDs as a RAID-1 device for journaling, and also set aside two of the HDDs for the same purpose, for comparison. The remaining 8 HDDs are configured as OSDs. The servers have 196GB of RAM, the private (cluster) network runs through a 40Gb/s Brocade switch, and the frontend network is 10Gb/s.

Benchmarking the HDDs directly, here is the write performance I get:

  # dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1 oflag=direct
  0+1 records in
  0+1 records out
  2147479552 bytes (2.1 GB) copied, 11.684 s, 184 MB/s

(Only ~2.1GB was written despite bs=10G because Linux caps a single write() at just under 2GiB, hence the "0+1" partial record.)

And the read performance:

  # dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1 iflag=direct
  0+1 records in
  0+1 records out
  2147479552 bytes (2.1 GB) copied, 8.30168 s, 259 MB/s

Now, when I benchmark an OSD configured with the HDD-based journal, here is what I get:

  [root@cephnode1 ceph-cluster]# ceph tell osd.1 bench
  { "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 426840870.000000}

which looks coherent (~427 MB/s). If I switch to the SSD-based journal, here is the new figure:

  [root@cephnode1 ~]# ceph tell osd.1 bench
  { "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 805229549.000000}

which, again, looks as expected to me (~805 MB/s). Finally, when I run rados bench, here is what I get:

  # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq

  Total time run:          300.345098
  Total writes made:       48327
  Write size:              4194304
  Bandwidth (MB/sec):      643.620
  Stddev Bandwidth:        114.222
  Max bandwidth (MB/sec):  1196
  Min bandwidth (MB/sec):  0
  Average Latency:         0.0994289
  Stddev Latency:          0.112926
  Max latency:             1.85983
  Min latency:             0.0139412
  ----------------------------------------
  Total time run:          300.121930
  Total reads made:        31990
  Read size:               4194304
  Bandwidth (MB/sec):      426.360
  Average Latency:         0.149346
  Max latency:             1.77489
  Min latency:             0.00382452

The pool is configured for 3-way replication (size 3, i.e. each object is stored three times), so these numbers fall within my expectations.

So far so good, but here comes the issue: I configured CephFS and mounted a share locally on one of my servers. When I write data to it, throughput is abnormally high for about the first 5 seconds, stalls for about 20 seconds, and then picks up again. For long-running tests, the observed write throughput is very close to what rados bench reported (about 640 MB/s), but for short-lived tests I see peaks of over 5GB/s. I know journaling is expected to cause spiky performance patterns like that, but not to this degree, which makes me think CephFS is buffering my writes and returning control to the client before persisting them to the journal, which looks undesirable.

I've spent a couple of days searching the web for a way to disable this apparent write buffering, but couldn't find anything. So here comes my question: how can I disable it?

Thanks and regards,
F. Lucchese
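P.S. In case it helps frame the question: to rule out the client-side page cache, I plan to rerun the short tests with synchronous/direct I/O. A minimal sketch of what I have in mind, assuming the share is mounted at /mnt/cephfs (the mount point is just my example):

  # bypass the client page cache entirely with direct I/O
  dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 oflag=direct

  # or keep buffered writes, but only count them done after a final flush
  dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 conv=fdatasync

If the short-test numbers then drop to roughly the rados bench figure (~640 MB/s) instead of 5GB/s, that would confirm that the peaks are just the page cache absorbing the writes.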
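P.P.S. If it does turn out to be the page cache, I assume the mount-level way to force synchronous behavior would be the generic sync mount option, along the lines of:

  mount -t ceph cephnode1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,sync

but I'm not sure whether the kernel CephFS client actually honors it, so corrections are welcome. (The monitor address and secretfile path above are placeholders.)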
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com