On 13 May 2015 at 10:30, Vinod Pandarinathan (vpandari) <vpand...@cisco.com> wrote:
> - Traditional monitoring tools (Nagios, Zabbix, ....) are necessary anyway
> for infrastructure monitoring (CPU, RAM, disks, operating system, RabbitMQ,
> databases and more) and diagnostic purposes. Adding OpenStack service
> checks is fairly easy if you already have the toolchain.
>
> The solution is for health-checking, which includes periodically running
> light/mid/heavy
> Control and data plane tests and provide test data. The tool shall not
> have any dependency on one particular monitoring tool
> If monitoring tool is installed, then monitoring data shall be exposed to
> the applications in a consumable fashion.
> As I mentioned earlier, we are not replacing any monitoring solution
> available out there we are leveraging those solutions
> and provide a clean interface so that the application/tenants and
> Operators know if the cloud is healthy.
>

To rephrase this:

- Zabbix and friends will monitor an operator's cloud and tell the operator
  bad things are happening. Or they can monitor an application's VMs and see
  if the app is happy, and tell the app or its owner.
- Ceilometer will front cloud monitoring solutions and offer those
  statistics to tenants of the cloud in ways that (ideally) make sense to
  the client. It lets tenants see stats they couldn't get for themselves.

This isn't quite what we're trying to address.

We had one specific use case: a cloud application that needs to provide
reasonably high availability uses the OpenStack APIs occasionally to try
and correct problems (VM died, app overloaded, etc.) - a pretty normal cloud
application. If you're interested in maintaining service, you need to know
about single points of failure so you can work around them, and a failed
cloud control plane is one of them. The APIs stop working, and the app runs
just fine until a second failure causes them to be used; if you haven't
done something by that point, you get a meltdown.

The idea of CloudPulse was to be able to say 'the cloud APIs are operating
normally' to applications that are interested. If they're *not* normal then
the application can take corrective action; for instance, spinning up extra
capacity in another cloud and moving traffic over there.

As you can see, that's a cross-domain sort of monitoring similar to
Ceilometer - the tenant finding out information about the infrastructure
that they can't see directly. That said, it's a very concise summary
('working'), and we also had in mind that you would run the tests to freshen
the results only if they hadn't been run recently, rather than looping them
continually. Also, the history of the results is not really relevant - my
app cares about whether the control plane works *now*, not whether it
worked for 8 hours out of the last 24.

We're scratching an itch. Absolutely the point of mailing everyone about it
was to see if anyone had better scratching tools, and if people would like
to chat about it at the summit. What seems to have come out of it is that
yes, there are tools out there that might be usable for the purpose, and
we'd love to hear your opinions and what ideas you have about how we should
do this. Apparently there are also a lot of people with slightly different
itches to scratch, and I hope you all take the opportunity to get together
at the summit too.

--
Ian.
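
For what it's worth, here is a rough sketch of the "freshen the result only
if it's stale, and only care about *now*" idea. It is not CloudPulse's
interface; the class name, the `probe` callable and the `max_age` parameter
are all made up for illustration, and a real probe would exercise the cloud
APIs rather than return a constant.

```python
import time
from typing import Callable, Optional


class FreshHealthCheck:
    """Keeps only the latest probe result and re-runs the probe when stale.

    Hypothetical illustration: `probe` stands in for whatever control-plane
    test is run; no history of past results is retained.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        max_age: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe          # returns True if the APIs look healthy
        self._max_age = max_age      # seconds a result is considered fresh
        self._clock = clock          # injectable so tests can fake time
        self._checked_at: Optional[float] = None
        self._healthy = False

    def is_healthy(self) -> bool:
        """Answer 'does the control plane work now?', probing only if stale."""
        now = self._clock()
        stale = (
            self._checked_at is None
            or now - self._checked_at >= self._max_age
        )
        if stale:
            try:
                self._healthy = bool(self._probe())
            except Exception:
                # A probe that raises is treated as "not working".
                self._healthy = False
            self._checked_at = now
        return self._healthy


if __name__ == "__main__":
    check = FreshHealthCheck(probe=lambda: True, max_age=60.0)
    if not check.is_healthy():
        print("control plane unhealthy: take corrective action")
    else:
        print("control plane looks fine")
```

An application would call `is_healthy()` whenever it wants an answer; calls
inside the `max_age` window reuse the last result, so the tests are not
looped continually.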
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev