One thing I do right now for Ceph performance testing is run a copy of collectl during every test. This gives you a TON of information about CPU usage, network stats, disk stats, etc., and it's pretty easy to import the output data into gnuplot. Mark Seger (the creator of collectl) also has some tools to gather aggregate statistics across multiple nodes. Beyond collectl, you can get a ton of useful data out of the Ceph admin socket. I especially like dump_historic_ops, as it is sometimes enough to avoid having to parse through debug 20 logs.
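For example, here is a rough Python sketch of pulling the slowest recent ops out of a single OSD's admin socket. The "ceph daemon osd.N dump_historic_ops" command itself is standard, but the JSON key names ("ops" vs "Ops", "duration") have changed between releases, so treat the field handling below as an assumption to check against your version:

    #!/usr/bin/env python
    # Rough sketch: print the slowest recent ops from one OSD's admin socket.
    # Run this on the OSD host itself, since the admin socket is local.
    import json
    import subprocess
    import sys

    def historic_ops(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "dump_historic_ops"])
        data = json.loads(out)
        # Older releases use "Ops", newer ones "ops"; adjust as needed.
        return data.get("ops") or data.get("Ops") or []

    if __name__ == "__main__":
        osd_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0
        ops = historic_ops(osd_id)
        # Sort by duration (seconds, as a string in the JSON) and show the top 5.
        ops.sort(key=lambda op: float(op.get("duration", 0)), reverse=True)
        for op in ops[:5]:
            print("%8.3fs  %s" % (float(op.get("duration", 0)),
                                  op.get("description", "?")))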

While the following tools have too much overhead to be practical for general system monitoring, they are very useful for specific performance investigations:

1) perf with dwarf/unwind support (see the sketch after this list)
2) blktrace (optionally with seekwatcher)
3) valgrind (cachegrind, callgrind, massif)
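For (1), here is a minimal sketch of how you might drive perf against a single ceph-osd process. It assumes your perf build has dwarf/unwind support and that you supply the target pid yourself; the 99 Hz sample rate and 30-second default are arbitrary choices:

    #!/usr/bin/env python
    # Minimal sketch: profile one ceph-osd process with DWARF call-graph unwinding.
    # Usage: profile_osd.py <pid> [seconds]
    import subprocess
    import sys

    if len(sys.argv) < 2:
        sys.exit("usage: profile_osd.py <pid> [seconds]")

    pid = sys.argv[1]
    seconds = sys.argv[2] if len(sys.argv) > 2 else "30"

    # Sample at 99 Hz with DWARF-based stack unwinding for the given duration.
    subprocess.check_call(
        ["perf", "record", "--call-graph", "dwarf", "-F", "99",
         "-p", pid, "--", "sleep", seconds])

    # Afterwards, inspect the resulting perf.data interactively with: perf report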

Beyond that, there are some collectd plugins for Ceph, and last time I checked DreamHost was using Graphite for a lot of visualizations. There's always Ganglia too. :)

Mark

On 04/12/2014 09:41 AM, Jason Villalta wrote:
I know Ceph throws some warnings if there is high write latency, but I
would be most interested in the delay for IO requests, since it links
directly to IOPS.  If IOPS start to drop because the disks are
overwhelmed, then latency for requests would be increasing.  That would
tell me that I need to add more OSDs/nodes.  I am not sure there is a
specific metric in Ceph for this, but it would be awesome if there were.
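One place you can already get at this is "ceph osd perf", which reports per-OSD commit and apply latency. Below is a rough Python sketch that flags outliers; the JSON field names and the 100 ms threshold are assumptions to verify against your release and hardware:

    #!/usr/bin/env python
    # Rough sketch: flag OSDs whose commit/apply latency exceeds a threshold.
    # Field names (osd_perf_infos, perf_stats, *_latency_ms) are assumed from
    # "ceph osd perf -f json" output; check them against your Ceph version.
    import json
    import subprocess

    THRESHOLD_MS = 100  # arbitrary; tune for your disks

    out = subprocess.check_output(["ceph", "osd", "perf", "-f", "json"])
    for info in json.loads(out).get("osd_perf_infos", []):
        stats = info.get("perf_stats", {})
        commit = stats.get("commit_latency_ms", 0)
        apply_ = stats.get("apply_latency_ms", 0)
        if max(commit, apply_) > THRESHOLD_MS:
            print("osd.%s commit=%dms apply=%dms"
                  % (info.get("id"), commit, apply_))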


On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier <greg.poir...@opower.com> wrote:

    Curious as to how you define cluster latency.


    On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta <ja...@rubixnet.com> wrote:

        Hi, I haven't done anything with metrics yet, but the only ones
        I personally would be interested in are total capacity
        utilization and cluster latency.

        Just my 2 cents.


        On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier
        <greg.poir...@opower.com> wrote:

            I'm in the process of building a dashboard for our Ceph
            nodes. I was wondering if anyone out there had instrumented
            their OSD / MON clusters and found particularly useful
            visualizations.

            At first, I was trying to do ridiculous things (like
            graphing % used for every disk in every OSD host), but I
            realized quickly that that is simply too many metrics and
            far too visually dense to be useful. I am attempting to put
            together a few simpler, more dense visualizations like...
            overall cluster utilization, aggregate CPU and memory
            utilization per OSD host, etc.
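
            For overall cluster utilization, "ceph df" already gives the
            totals you would graph. A rough Python sketch follows; the
            JSON field names under "stats" are assumptions (older
            releases report KB under total_space/total_used rather than
            bytes under total_bytes/total_used_bytes):

                #!/usr/bin/env python
                # Rough sketch: one number for overall cluster utilization.
                import json
                import subprocess

                out = subprocess.check_output(["ceph", "df", "-f", "json"])
                stats = json.loads(out).get("stats", {})
                used = float(stats.get("total_used_bytes",
                                       stats.get("total_used", 0)))
                total = float(stats.get("total_bytes",
                                        stats.get("total_space", 1)))
                print("cluster utilization: %.1f%%" % (100.0 * used / total))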

            Just looking for some suggestions.  Thanks!





        --
        Jason Villalta
        Co-founder
        800.799.4407x1230 | www.RubixTechnology.com





--
Jason Villalta
Co-founder
800.799.4407x1230 | www.RubixTechnology.com




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
