i only partially quoted, since most is already clear


yes, librados and the ceph config are available, but that does not mean the
cluster is designed so that all nodes can reach the monitor nodes...
e.g.:

5 nodes, with node0-node2 being ceph nodes, node3 a 'compute' node, and
node4 a node in the same cluster that shares only the 'pve cluster
network' with the others, not the ceph or vm network.. this node will
never be able to reach the ceph monitors...


You know which nodes host ceph. You can even limit this to monitor
nodes, and do the same there (lowest or highest node-id sends; if none
of those are quorate the monitor probably isn't either, and even if it
is, it simply does not hurt)


this makes sense and is very doable, something like:

if (quorate && monitor_node) {
        // get list of quorate nodes

        // get list of monitor nodes
        
        // check if i am the lowest/highest quorate monitor node
        // if yes, collect/broadcast data
}
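
or, a bit more concretely, a rough Perl sketch (get_monitor_nodes() is just a
placeholder here, e.g. built from the ceph config, and i am assuming
PVE::Cluster::get_members() reports the online state per node; the selection
itself is only illustrative):

use PVE::Cluster;
use PVE::INotify;

my $nodename = PVE::INotify::nodename();

if (PVE::Cluster::check_cfs_quorum(1)) { # 1 = noerr
    my $members  = PVE::Cluster::get_members();  # pmxcfs membership info
    my $monitors = get_monitor_nodes();          # placeholder, e.g. from ceph.conf

    # only consider monitor nodes that are currently online/quorate,
    # sorted so that every node deterministically picks the same one
    my @candidates = sort grep { $members->{$_} && $members->{$_}->{online} } @$monitors;

    if (scalar(@candidates) && $candidates[0] eq $nodename) {
        # i am the 'lowest' quorate monitor node -> collect/broadcast data
    }
}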

the trade-off remains (the selected node could happen to be pve-quorate
while its monitor is not quorate), but in a 'normal' cluster with
3 monitors the chance of that would be 1/3 (which is ok) and it
converges towards 1/2 for very many monitors (an unlikely setup)
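
(to spell out where the 1/3 and the ~1/2 come from - assuming the usual
majority rule, the monitor quorum survives up to floor((m-1)/2) monitors
being down, and in the worst case our fixed pick is exactly one of those:)

# worst case: floor((m-1)/2) monitors can be non-quorate while the
# monitor quorum as a whole still holds
for my $m (3, 5, 7, 15, 101) {
    my $bad = int(($m - 1) / 2);
    printf "%3d monitors: chance of picking a 'bad' one <= %d/%d = %.2f\n",
        $m, $bad, $m, $bad / $m;
}
# -> 0.33, 0.40, 0.43, 0.47, 0.50 (well, ~0.495), i.e. towards 1/2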

but yes as you say below, in such a scenario, the admin has
different problems ;)


* having multiple nodes query it distributes the load between
the nodes, especially when considering my comment on patch 2/4
where i suggested that we reduce the number of updates here, and
since the pvestatd loops are not synchronized, we get more datapoints
with fewer rados calls per node

makes no sense, you multiply the (cluster traffic) load between nodes, not
reduce it.. All nodes producing cluster traffic for this is NAK'd by me.

i am not really up to speed on the network traffic the current
corosync/pmxcfs versions produce, but i would imagine that if we have
1 node syncing m datapoints, it should be roughly the same as
n nodes syncing m/n datapoints? we could scale that with the number
of nodes, for example...

what? if each node syncs whatever bytes, all nodes send and receive that many,
so you get (n-1) * (n-1) * (payload bytes + overhead), where the overhead with
crypto and all is not exactly zero. A single sender means one (n-1) term less,
i.e. O(n^2) vs. O(n)..
Plus, with the monitor nodes as senders one saves "n - (monitor_count)" status
sends too.


i obviously did not convey my idea very well...
yes, as my patch currently is, you are right that we have
(n-1)*(n-1)*size network traffic
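
to put rough numbers on that formula (payload and overhead are just assumed
here for illustration):

# following the (n-1) * (n-1) estimate above, numbers purely illustrative
my $n        = 16;  # cluster nodes
my $payload  = 512; # bytes per status update (assumed)
my $overhead = 128; # crypto/framing overhead per message (assumed)

my $all_senders   = ($n - 1) * ($n - 1) * ($payload + $overhead); # 144000 bytes
my $single_sender = ($n - 1) * ($payload + $overhead);            #   9600 bytes

printf "all senders: %d, single sender: %d bytes per interval\n",
    $all_senders, $single_sender;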

but what i further tried to propose was a mechanism by which
each node only sends the data on every nth iteration of the loop
(where n could be fixed, or even depend on the (quorate) node count),

so that each node only sends 1/n of the datapoints per pvestatd loop (on average)
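
purely as a sketch of that idea (assuming PVE::Cluster::get_nodelist() for the
node names; the counter handling, node indexing and the two ceph helpers are
made up, and quorum handling is left out for brevity):

use PVE::Cluster;
use PVE::INotify;

my $iteration = 0;

sub update_ceph_status {
    my $nodename = PVE::INotify::nodename();
    $iteration++;

    # stable ordering so every node gets its own 'slot'
    my $nodelist = PVE::Cluster::get_nodelist();
    my @nodes = sort @$nodelist;
    my ($my_index) = grep { $nodes[$_] eq $nodename } 0 .. $#nodes;
    return if !defined($my_index);

    # only send on 'my' iteration, so on average one node
    # broadcasts per pvestatd loop instead of all of them
    return if ($iteration % scalar(@nodes)) != $my_index;

    my $status = collect_ceph_status();   # made up: wraps the rados calls
    broadcast_ceph_status($status);       # made up: kv/rrd broadcast
}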



see my above comment, the update calls are (by chance) not done at the same time

becomes obsolete once this is done once per cluster; also, I normally don't
want to have guaranteed-unpredictable time intervals in this sampling.

i still see a problem with selecting one node as the source of truth
(for the above reasons), and in every scenario we will (at least sometimes)
have uneven intervals (think pvestatd updates that take longer, network
congestion, nodes leaving/entering the quorate partition, etc.)

you can still have that if you do this per node, so I don't see your
point.


also, the intervals are not unpredictable (besides my point above),
they are just not evenly spaced...

Didn't you just say "not even intervals"? So how are they not
guaranteed-unpredictable if, e.g., 16 nodes all send that stuff..
That's for sure not equidistant - a single node having control over
this is the best it can get.


we mean the same thing, but express it differently ;)

the timestamps 1,2,9,11,12,19,21,22,29,... (e.g. three nodes each sending
every 10 seconds at fixed offsets) are not equidistant (what i meant by
'not evenly spaced'), but they are also not 'unpredictable' (which i
understand as 'randomly spaced')

in summary, we have two proposals with different trade-offs:

1. have one node selected (in a sane/stable way) which is the only one that updates
   pros: mostly regular updates of the data
         we can select the most sensible node ourselves
   cons: if we select a 'bad' node, we have no data at all for as long as we select that node

2. have each node try to broadcast in an interval dependent on the number of nodes (to not produce too much traffic)
   pros: as long as most nodes have a connection to ceph, we get data (at least some of the time)
   cons: irregular updates of the data
         if ceph has a problem, it impacts more nodes' pvestatd

and as my proposal has more cons and fewer pros, i give up ;)
i am sending a reworked v2 soon(tm), but will first try to reuse
our 'rrd' mechanism to save the data, since at first glance it seems easier
to extend for this purpose than to have a completely new interface
(also, the data is time series data, so putting it inside some kv store
seems wrong..., even if i sent it that way in the first place^^)
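
(just to illustrate the direction, not the actual v2: pvestatd already pushes
node/VM metrics via PVE::Cluster::broadcast_rrd(), so the ceph data could
presumably take a similar path - the key name and column layout below are
made up, and pmxcfs would of course need to learn the new schema:)

use PVE::Cluster;

# hypothetical sketch - key name and column layout are invented here,
# pmxcfs would need to know this rrd schema for it to actually work
sub broadcast_ceph_rrd {
    my ($nodename, $total, $used, $avail, $rd_ops, $wr_ops) = @_;

    my $ctime = time();
    my $data = "$ctime:$total:$used:$avail:$rd_ops:$wr_ops";

    PVE::Cluster::broadcast_rrd("pve2-ceph/$nodename", $data);
}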

thanks for reviewing and pushing me in the right direction :)
