On Apr 6, 2015, at 7:04 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> I see that ceph has 'ceph osd perf' that gets the latency of the OSDs.
> Is there a similar command that would provide some performance data
> about RBDs in use? I'm concerned about our ability to determine which
> RBD(s) may be "abusing" our storage at any given time.
>
> What are others doing to locate performance issues in their Ceph clusters?
I graph aggregate stats from `ceph --admin-daemon
/var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too
far from my mean latency, I know to go look for the troublemaker. My graphs
look something like this:
So on Thursday just before noon a drive dies. The blue min-latency line for all
disks spikes up because all disks are recovering the data that was on the lost
OSD. The min drops back down to normal pretty quickly, but then the red max
line spikes way up for the single new disk that replaced the dead drive. It
stays pretty high until the data has been moved back onto it, at which point it
returns to normal just before midnight.
I do this style of graphing because I have 30 OSDs per chassis, and a chart
with 30 individual lines would be pretty tough to read. On less dense nodes,
per-OSD lines would probably be the way to go.
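For anyone wanting to wire this up themselves, here is a rough sketch of the
collection side. It assumes jq is available and that `perf dump` exposes an
`osd.op_latency` counter with `avgcount` and `sum` fields; the counter layout
varies between Ceph releases, so verify the field names against your own
`perf dump` output first.

    #!/bin/bash
    # Poll every OSD admin socket on this host and print the average op
    # latency per OSD, for feeding into whatever graphing stack you use.
    # NOTE: the .osd.op_latency field names are assumptions based on my
    # cluster's version -- check them against your own `perf dump` output.
    # Also note sum/avgcount is a since-daemon-start average; for a recent
    # view you would diff successive samples.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        osdid=$(basename "$sock" .asok | sed 's/^ceph-osd\.//')
        ceph --admin-daemon "$sock" perf dump | \
            jq -r --arg id "$osdid" '
                .osd.op_latency
                | (if .avgcount > 0 then .sum / .avgcount else 0 end) as $avg
                | "osd.\($id) avg_op_latency_s=\($avg)"'
    done

From there it is just a matter of shipping those numbers to your graphing tool
and plotting the min/mean/max across all OSDs, which gives the aggregate view
described above.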