On 11/6/19 3:49 PM, Thomas Lamprecht wrote:
On 11/6/19 8:36 AM, Dominik Csapak wrote:
On 11/5/19 6:33 PM, Thomas Lamprecht wrote:
On 11/5/19 1:51 PM, Dominik Csapak wrote:
+ my $entry = {
+ time => $time,
+ ops_r => $pgmap->{read_op_per_sec},
+ ops_w => $pgmap->{write_op_per_sec},
+ bytes_r => $pgmap->{read_bytes_sec},
+ bytes_w => $pgmap->{write_bytes_sec},
These can all be undefined if there is no activity on the ceph cluster.
That's not too tragic, the gui can handle it, but yes, we do not have to
add undef values to the entry in the first place.
I'd rather set it to 0 if not defined, as that is the value it implicitly
has then. Otherwise you get undefined datapoints in the graphs.
ok
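A minimal sketch of that defaulting with the defined-or operator, reusing
the $pgmap fields from the hunk above:

    my $entry = {
        time    => $time,
        ops_r   => $pgmap->{read_op_per_sec}  // 0,  # 0 instead of undef when idle
        ops_w   => $pgmap->{write_op_per_sec} // 0,
        bytes_r => $pgmap->{read_bytes_sec}   // 0,
        bytes_w => $pgmap->{write_bytes_sec}  // 0,
    };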
+ push @$perf_cache, $entry;
+
+ # remove entries older than $keep_seconds
+ $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];
why not just keep N entries, and remove all those which are outside
of that length restriction, something like (untested):
my $max_cache_depth = 512;
my $cache_depth = scalar @$cache;
if ($cache_depth > $max_cache_depth) {
    splice(@$cache, 0, $cache_depth - $max_cache_depth);
}
Probably faster, and it limits this to what we actually want: a maximal
entry count. As each entry has its time saved anyway, a client can still
cut off after a certain time, if desired.
Yeah, makes sense, although I would not use such a high number; 32-64
would probably be enough.
You already have the data, and the overhead for keeping this is roughly
5 * 8 bytes, so with a bit of overhead ~50 bytes, times 512 is about 25 KB,
which is the least of our memory problems with perl... Having longer data
windows can be useful when investigating. Maybe it could still be worth
doing this with RRD...
The way my patch works would mean a higher overhead, since we JSON-encode
the hash and store it as a string, which means a single datapoint is
'{"time":123123123,"ops_r":1234123, [..]'
But yeah, if we have a special ipcc call for this, we can optimize here
and only send the necessary data.
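Just to illustrate the overhead point, a compact layout could send the
field names only once instead of per datapoint (the structure below is
only an illustration, not a proposed wire format):

    use JSON;

    # field names once, datapoints as plain arrays in the same order
    my $compact = {
        fields => [qw(time ops_r ops_w bytes_r bytes_w)],
        data   => [ map { [ @$_{qw(time ops_r ops_w bytes_r bytes_w)} ] } @$perf_cache ],
    };
    my $json = encode_json($compact);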
+
+ my $data = encode_json($perf_cache);
+ PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
Not sure if we want this. I mean, those are pve ceph-server stats and so
they're the same on the whole cluster... So we're uselessly broadcasting
this (nodecount - 1) times.
A lockless design to have only one sending node would be to just let the
node with the highest node-id from the quorate partition broadcast this
into a cluster-wide status hashtable?
If we can broadcast, we're quorate, and thus we can trust the membership
info; even for the short race window where a node with a higher priority
becomes quorate, we or it will effectively just overwrite other, also
valid, values in this hash.
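A rough sketch of that selection; the exact shape of the membership info
(node id and online flag per node) is an assumption here:

    # pick one broadcaster per quorate partition: the online node with the
    # highest node id; $members is assumed to look like
    # { nodename => { id => ..., online => ... } }, e.g. from
    # PVE::Cluster::get_members()
    sub i_am_designated_broadcaster {
        my ($members, $my_nodename) = @_;

        my @online = grep { $members->{$_}->{online} } keys %$members;
        return 0 if !@online;

        my ($highest) = sort { $members->{$b}->{id} <=> $members->{$a}->{id} } @online;
        return $highest eq $my_nodename;
    }

    # only the selected node would then call broadcast_node_kv() (or a future
    # cluster-wide variant) with the encoded ceph stats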
Mhmm, I am not so sure about this, for various reasons:
* I did not want to touch pve-cluster/pmxcfs for this and wanted to reuse
the existing interface. Could we reuse the ipcc calls for a cluster-wide
kvstore? AFAIU, if we omit the nodename in 'ipcc_get_status' we get the
cluster-wide content?
No, it's all always per node. A cluster-wide, not per-node, status
hashtable would need to be added, but that would be useful in general.
sure, makes sense
* Letting only one node broadcast is a bit problematic, since we cannot
be sure that any given node is able to reach the ceph portion of the
cluster, as not all nodes have to be part of it (I have seen such setups
in the wild, although strange). Also, since the ceph cluster is mostly
independent of the pve cluster, we cannot be sure that a quorate pve
cluster member can also reach the quorate portion of the ceph monitors.
Letting all nodes broadcast their data would eliminate such gaps in the
data.
librados is still installed and ceph.conf is still available, so that's
not true, no?
Yes, librados and the ceph config are available, but that does not mean
the cluster is designed so that all nodes can reach the monitor nodes...
E.g.: 5 nodes, with node0-node2 being ceph nodes, node3 a 'compute' node,
and node4 a node in the same cluster that shares only the 'pve cluster
network' with the others, not the ceph or VM network. This node will
never be able to reach the ceph monitors...
The issue of "not reaching a quorate ceph monitor" is also unrelated,
and, if valid at all, should be solved transparently in librados
(reconnect to others)
Not really, since we cannot guarantee that the quorate partition of the
pve cluster has anything to do with the ceph network, e.g. if the ceph
network is on a completely different switch and the node with the highest
id (or whichever node is chosen to transmit that data) has a broken cable
there... (I know, not redundant, but still.) All nodes can be happily
quorate from the perspective of pve, but that one node is not able to
connect to the ceph portion of the cluster at all...
* Having multiple nodes query it distributes the load between the nodes,
especially when considering my comment on patch 2/4, where I suggested
that we reduce the number of updates here; since the pvestatd loops are
not synchronized, we get more datapoints with fewer rados calls per node.
Makes no sense, you multiply the (cluster traffic) load between nodes,
you do not reduce it... All nodes producing cluster traffic for this is
NAK'd by me.
I am not really up to speed on the network traffic that current
corosync/pmxcfs versions produce, but I would imagine that 1 node syncing
m datapoints should be roughly the same as n nodes syncing m/n datapoints
each? We could scale that with the number of nodes, for example...
+}
+
+sub get_cached_perf_data {
+
+ # only get new data if the already cached one is older than 10 seconds
+ if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
+ return $perf_cache;
+ }
+
+ my $raw = PVE::Cluster::get_node_kv("ceph-perf");
+
+ my $res = [];
+ my $times = {};
+
+ for my $host (keys %$raw) {
why multi host? Those stats are the same ceph-cluster-wide, AFAICT,
distributed through the MgrStatMonitor PAXOS child class. E.g., I never
saw different values when I executed the following command cluster-wide
at the same time:
perl -we 'use PVE::RADOS; use PVE::RPCEnvironment;
  PVE::RPCEnvironment->setup_default_cli_env();
  my $rados = PVE::RADOS->new();
  my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
  print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
See my comment above; the update calls are (by chance) not done at the
same time.
That becomes obsolete once this is done once per cluster; also, I normally
don't want to have guaranteed-unpredictable time intervals in this
sampling.
I still see a problem with selecting one node as the source of truth
(for the reasons above), and in every scenario we will have (at least
sometimes) uneven intervals (think pvestatd updates that take longer,
network congestion, nodes leaving/entering the quorate partition, etc.).
Also, the intervals are not unpredictable (besides my point above), they
are just not evenly spaced...
+ my $tmp = eval { decode_json($raw->{$host}) };
+ for my $entry (@$tmp) {
+ my $etime = $entry->{time};
+ push @$res, $entry if !$times->{$etime}++;
+ }
+ }
+
+ $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
+ return $perf_cache;
+}
+
1;
_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel