Hi, folks.


I'm running a series of experiments and tests with CephFS and have been 
facing a behavior over which I can't seem to have much control.
I configured a 5-node Ceph cluster running on enterprise servers. Each server 
has 10 x 6TB HDDs and 2 x 800GB SSDs. I configured the SSDs as a RAID-1 device 
for journaling, and also set aside two of the HDDs for the same purpose for the 
sake of comparison. The remaining 8 HDDs are configured as OSDs. The servers 
have 196GB of RAM, and our private (cluster) network is backed by a 40Gb/s 
Brocade switch (the frontend network is 10Gb/s).
When benchmarking the HDDs directly, here's the performance I get:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1 oflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 11.684 s, 184 MB/s
For read performance:
dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1 iflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 8.30168 s, 259 MB/s
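(Side note: with bs=10G, dd only actually transfers 2147479552 bytes, i.e. about 
2.1 GB, because Linux caps a single read()/write() call at just under 2 GiB; 
that's why dd reports "0+1 records". If anyone wants the full 10 GB moved, 
something along the lines of bs=1G count=10 should do it:

dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=1G count=10 oflag=direct

Either way, the MB/s figure is what I'm comparing here.)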
Now, when I benchmark the OSDs configured with HDD-based journaling, here's 
what I get:
[root@cephnode1 ceph-cluster]# ceph tell osd.1 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 426840870.000000
}
which looks coherent. If I switch to the SSD-based journal, here's the new 
figure:
[root@cephnode1 ~]# ceph tell osd.1 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 805229549.000000
}
which, again, looks as expected to me.
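If I'm reading the documentation correctly, the bench command also accepts the 
total bytes and the block size as optional arguments, in case anyone wants to 
compare against smaller writes, e.g.:

ceph tell osd.1 bench 1073741824 4096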
Finally, when I run the rados bench, here's what I get:
rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
Total time run:         300.345098
Total writes made:      48327
Write size:             4194304
Bandwidth (MB/sec):     643.620
Stddev Bandwidth:       114.222
Max bandwidth (MB/sec): 1196
Min bandwidth (MB/sec): 0
Average Latency:        0.0994289
Stddev Latency:         0.112926
Max latency:            1.85983
Min latency:            0.0139412
----------------------------------------
Total time run:        300.121930
Total reads made:      31990
Read size:             4194304
Bandwidth (MB/sec):    426.360
Average Latency:       0.149346
Max latency:           1.77489
Min latency:           0.00382452
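For what it's worth, I left rados bench at its default concurrency (16 
operations in flight, if I'm not mistaken); I assume a deeper queue could be 
tried with the -t flag, along the lines of:

rados bench -p cephfs_data 300 write -t 32 --no-cleanup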
I configured the cluster to replicate data twice (3 copies in total), so these 
numbers fall within my expectations. So far so good, but here comes the issue: 
I configured CephFS and mounted a share locally on one of my servers. When I 
write data to it, it shows abnormally high performance for the first ~5 seconds, 
stalls for about 20 seconds, and then picks up again. For long-running tests, 
the observed write throughput is very close to what rados bench reported (about 
640 MB/s), but for short-lived tests I see peaks of over 5 GB/s. I know that 
journaling is expected to cause spiky performance patterns like that, but not to 
this level, which makes me think that CephFS is buffering my writes and 
returning control to the client before persisting them to the journal, which 
looks undesirable.
I searched the web for a couple of days looking for ways to disable this 
apparent write buffering, but couldn't find anything. So here comes my 
question: how can I disable it?
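In case it helps, my plan for the next round of tests on the CephFS mount is to 
take the client page cache out of the picture with direct I/O, or at least force 
the flush to be counted, roughly like this (the mount point below is just a 
placeholder):

dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 oflag=direct
dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 conv=fdatasync

My understanding is that oflag=direct bypasses the page cache on the client 
entirely, while conv=fdatasync keeps the cache but includes the final flush in 
the reported rate. Still, I'd prefer a way to disable the buffering at the 
CephFS level rather than only working around it in the benchmark.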
Thanks and regards,
F. Lucchese

  