> One of the outcomes from Juno will be horizontal scalability in the
> central agent and alarm evaluator via partitioning[1]. The compute
> agent will get the same capability if you choose to use it, but it
> doesn't make quite as much sense there.
>
> I haven't investigated the alarm evaluator side closely yet, but one
> concern I have with the central agent partitioning is that, as far
> as I can tell, it will result in stored samples that give no
> indication of which (of potentially very many) central agents they
> came from.
>
> This strikes me as a debugging nightmare when something goes wrong
> with the content of a sample that makes it all the way to storage.
> We need some way, via the artifact itself, to narrow the scope of
> our investigation.
>
> a) Am I right that no indicator is there?
>
> b) Assuming there should be one:
>
>    * Where should it go? Presumably it needs to be an attribute of
>      each sample, because as agents leave and join the group, where
>      samples are published from can change.
>
>    * How should it be named? The never-ending problem.
>
> Thoughts?
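To make b) concrete before getting into the discussion, what's being
asked for might look something like the following. This is only a
rough Python sketch: the simplified Sample class and the `agent_id`
name are purely illustrative, not existing ceilometer code.

    import socket
    import uuid

    # Identify the agent by host plus a per-process token, so that
    # restarts on the same host are also distinguishable.
    AGENT_ID = '%s:%s' % (socket.gethostname(), uuid.uuid4().hex[:8])


    class Sample(object):
        # simplified stand-in for ceilometer.sample.Sample
        def __init__(self, name, volume, resource_id,
                     resource_metadata=None):
            self.name = name
            self.volume = volume
            self.resource_id = resource_id
            self.resource_metadata = resource_metadata or {}


    def stamp(sample):
        # Record the originating agent as ordinary resource metadata,
        # so it travels with the sample all the way to storage.
        sample.resource_metadata['agent_id'] = AGENT_ID
        return sample


    s = stamp(Sample('cpu_util', 42.0, 'instance-0001'))
    print(s.resource_metadata['agent_id'])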
Probably best to keep the bulk of this discussion on-gerrit, but FWIW
here's the riff I just commented there ...

Cheers,
Eoghan

WRT marking each sample with an indication of the originating agent:

First, IIUC, true provenance would require that the full
chain-of-ownership could be reconstructed for the sample, so we'd
need to also record the individual collector that persisted each
sample. So let's assume that we're only talking here about
associating the originating agent with the sample.

For most classes of bugs/issues that could impact an agent, we'd
expect an equivalent impact on all agents. However, I guess there
would be a subset of issues, e.g. an agent being "left behind" after
an upgrade, that could be localized to particular agents.

So in the classic ceilometer approach to metadata, one could imagine
the agent identity being recorded in the sample itself. However, I
think this would become a lot more problematic after a shift to pure
timeseries data. In that case, we wouldn't necessarily want to
pollute the limited number of dimensions that can be efficiently
associated with a datapoint with information purely related to the
implementation/architecture of ceilometer.

So how about turning the issue on its head, and putting the onus on
the agent to record its allocated resources for each cycle? The
obvious way to do that would be via logging (see the sketch below the
footnotes). Then, in order to determine which agent was responsible
for polling a particular resource at a particular time, the problem
would collapse down to a distributed search over the agent log files
for that period (perhaps aided by whatever log retention scheme is in
use, e.g. logstash).

> [1] https://review.openstack.org/#/c/113549/
> [2] https://review.openstack.org/#/c/115237/
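To make that per-cycle logging concrete, here's a rough sketch;
nothing in it is existing ceilometer code, and the log format is just
an example:

    import logging
    import socket

    logging.basicConfig(level=logging.INFO)
    LOG = logging.getLogger('ceilometer.agent')

    AGENT_ID = socket.gethostname()


    def run_cycle(allocated_resources, poll):
        # A single structured line per cycle is enough to later answer
        # "which agent was polling resource X at time T?" by searching
        # the agent logs for that period.
        LOG.info('agent=%s cycle start, resources=%s',
                 AGENT_ID, ','.join(sorted(allocated_resources)))
        for resource_id in sorted(allocated_resources):
            poll(resource_id)


    run_cycle({'instance-0001', 'instance-0002'},
              lambda r: LOG.info('agent=%s polled resource=%s',
                                 AGENT_ID, r))

With a log line like that, the distributed search collapses to
something like grep 'resources=.*<resource-id>' over the agent logs
for the window in question, or the equivalent logstash query.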