Thanks very much, Mark! Yes, I put the data and journal on the same disks; no SSDs in my environment. My controllers are generic SATA II.
Some more questions inline below.

Date: Mon, 19 Aug 2013 07:48:23 -0500
From: mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Poor write/random read/random write performance

On 08/19/2013 06:28 AM, Da Chun Ng wrote:
    I have a 3-node, 15-OSD Ceph cluster setup:
      * 15 7200 RPM SATA disks, 5 per node.
      * 10G network.
      * Intel(R) Xeon(R) CPU E5-2620 (6 cores) @ 2.00GHz per node.
      * 64G RAM per node.

    I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
    Both the data and metadata pools are set with replica size 3.
    Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

    The sequential read performance looks good:
    fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
    read : io=10630MB, bw=181389KB/s, iops=11336, runt=60012msec

    Sounds like readahead and/or caching is helping out a lot here.
    Btw, you might want to make sure this is actually coming from the
    disks with iostat or collectl or something.
I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes before 
every test. I used collectl to watch every disk IO, the numbers should match. I 
think readahead is helping here.
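A minimal way to cross-check where the reads are served from, assuming the
sysstat and collectl packages are installed and using /dev/sdi purely as an
example device name, is to watch per-disk throughput while the fio run is in
flight:

iostat -x 1 /dev/sdi     # extended per-device stats, one-second intervals
collectl -sD             # or collectl's per-disk detail view

If the summed rKB/s across the OSD disks stays well below fio's reported
bandwidth, the reads are coming from cache or readahead rather than the
platters.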

    But the sequential write/random read/random write performance is very poor:
    fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
    write: io=397280KB, bw=6618.2KB/s, iops=413, runt=60029msec

    One thing to keep in mind is that unless you have SSDs in this
    system, you will be doing 2 writes for every client write to the
    spinning disks (since data and journals will both be on the same
    disk).
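
    As a quick sanity check that the journals really are co-located with the
    data (a hedged aside; the paths assume a default ceph-deploy layout and
    osd.0 is just an example):

    ls -l /var/lib/ceph/osd/ceph-0/journal
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep journal

    If the journal path resolves to a file or partition on the same spindle as
    the data directory, every client write hits that disk twice.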

    

    So let's do the math:

    6618.2 KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS/drive
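
    For anyone redoing this arithmetic against their own numbers, the same
    calculation as a throwaway one-liner (the values plugged in are the ones
    from this thread):

    # client KB/s * replicas * journal factor * 1024 / write size in bytes / drives
    awk 'BEGIN { print 6618.2 * 3 * 2 * 1024 / 16384 / 15 }'    # ~165.4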

    If there is no write coalescing going on, this isn't terrible. If there is, this is terrible.
How can I know if there is write coalescing going on?

    Have you tried buffered writes with the sync engine at the same IO size?

Do you mean as below?
fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
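
One hedged way to see whether the block layer is merging writes, again with
/dev/sdi only as an example device, is iostat's merge counters:

iostat -x 1 /dev/sdi    # watch wrqm/s and avgrq-sz during the fio run

wrqm/s counts write requests merged per second; if it is high, or avgrq-sz is
much larger than 32 sectors (i.e. 16K), writes are being coalesced before they
reach the disk.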

    fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
    read : io=665664KB, bw=11087KB/s, iops=692, runt=60041msec

    In this case:

    11087 KB/s * 1024 (KB->bytes) / 16384 / 15 drives = ~46 IOPS/drive

    Definitely not great! You might want to try fiddling with readahead both
    on the CephFS client and on the block devices under the OSDs themselves.
Could you please tell me how to enable readahead on the CephFS client?
For the block devices under the OSDs, the current readahead value is:
[root@ceph0 ~]# blockdev --getra /dev/sdi
256
How big is appropriate for it?
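For what it's worth, a sketch of both knobs, with 4096 sectors (2MB) as an
arbitrary illustrative value rather than a recommendation:

blockdev --setra 4096 /dev/sdi    # block-device readahead, in 512-byte sectors
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=4194304    # client readahead, in bytes

The rasize mount option assumes a kernel CephFS client recent enough to
support it; mon1 is a placeholder for a monitor address.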
    

    One thing I did notice back during bobtail is that increasing the
    number of osd op threads seemed to help small object read
    performance.  It might be worth looking at too.
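
    For reference, that setting lives in ceph.conf; a minimal sketch, with 8
    threads as an arbitrary illustrative value (the OSDs need a restart to
    pick it up):

    [osd]
        osd op threads = 8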

    

http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread

    

    Other than that, if you really want to dig into this, you can use
    tools like iostat, collectl, blktrace, and seekwatcher to try and
    get a feel for what the IO going to the OSDs looks like.  That can
    help when diagnosing this sort of thing.
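
    A hedged example of that workflow, with the device and output names
    purely illustrative:

    blktrace -d /dev/sdi -o sdi_trace      # capture a block-level trace during a fio run
    seekwatcher -t sdi_trace -o sdi.png    # plot the seek pattern from the blktrace output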

    fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
    write: io=361056KB, bw=6001.1KB/s, iops=375, runt=60157msec

    6001.1 KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS/drive

    I am mostly surprised by the seq write performance compared to the raw
    SATA disk performance (it can get 4127 IOPS when mounted with ext4). My
    cephfs only gets 1/10 the performance of the raw disk.
    7200 RPM spinning disks typically top out at something like 150 IOPS (and
    some are lower). With 15 disks, to hit 4127 IOPS you were probably seeing
    some write coalescing effects (or if these were random reads, some
    benefit from readahead).
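
    To put a number on what one drive can really do, a hedged baseline run
    against a file on a single idle ext4-mounted disk (the path is
    illustrative) keeps direct IO and a queue depth of 1:

    fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=1G -numjobs=1 -group_reporting -name=baseline -filename=/mnt/sdi/testfile -runtime 60

    With numjobs=1, a 7200 RPM drive should land near that ~150 IOPS figure
    unless its write cache is coalescing.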

    

    How can I tune my cluster to improve the sequential write/random
    read/random write performance?
        
      
    
    I don't know what kind of controller you have, but in cases where
    journals are on the same disks as the data, using writeback cache
    helps a lot because the controller can coalesce the direct IO
    journal writes in cache and just do big periodic dumps to the
    drives.  That really reduces seek overhead for the writes.  Using
    SSDs for the journals accomplishes much of the same effect, and lets
    you get faster large IO writes too, but in many chassis there is a
    density (and cost) trade-off.
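
    For the SSD-journal route, a sketch of how a journal can be placed on a
    separate device at OSD-creation time with ceph-deploy (host, data disk,
    and journal device names are all illustrative):

    # HOST:DATA-DISK:JOURNAL-DEVICE
    ceph-deploy osd create ceph0:sdb:/dev/ssd1

    One SSD typically serves journals for several spinners, so the journal
    device is often one partition per OSD.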

    

    Hope this helps!

    

    Mark

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
