Monitoring needs to be part of the definition of done. In an ideal world you would not be able to deploy an application without deploying services to monitor the ongoing activity and health of the application. If you care enough to deploy an application, you should care enough to make sure it keeps working within expected parameters.
When this is not adhered to, you end up in a bad place. We have a large number of bespoke services developed over a period of ~12 years in many different technology stacks, yet we have very poor visibility into their operation. In many cases it is difficult to determine whether the services are still operating within expected parameters or even whether they are still being used. For almost all of them, it is our users who alert us to performance or correctness problems. This is a terrible situation and is hated by everyone: developers, operators and users alike.
Our newer applications are built under a strict regime where a service is not done until it has monitoring. We have adopted Chef as our configuration management tool of choice and have started using an attribute-driven approach for developing our cookbooks. The rest of this post outlines some of the techniques we use to keep our systems in check.
We tend to break down monitoring into four major orientations: infrastructure, system, service and business characteristics. The infrastructure characteristics tend to be elements such as utilization of the underlying network infrastructure, but we do not have a good story for this yet. The system characteristics are the node characteristics such as CPU usage, memory usage, disk operations and bytes transmitted. The service characteristics include metrics such as queue sizes and throughput rates. The business characteristics include things like the number of wildfires still going, the number of resources deployed to each emergency event and the number of incident control teams activated. (We write software for emergency services.)
For most characteristics we monitor, we want to graph the characteristic over time. Alerts need to be generated if the metric values are outside an expected range or are trending towards leaving that range. We also want our releases to be relatively self-contained: if we release a new application, the release should update all of the graphs, monitors and alert configurations for that application. The pattern we use to implement this is to have the cookbook for the application publish the probe and graph definitions as attributes on the node. The configuration is then discovered by recipes that run on the local node or on remote nodes and is used to drive configuration of the monitoring tools.
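The publish-and-discover pattern can be modelled outside Chef in a few lines. The following is a minimal sketch, not our cookbook code: the `NODES` list stands in for the node data a Chef search would return, and the `monitoring`/`graphs` attribute layout, host names and metric names are all hypothetical.

```python
# Hypothetical stand-in for node data returned by a Chef search.
# The "monitoring" -> "graphs" attribute layout is an assumption for illustration.
NODES = [
    {
        "name": "app1.example.com",
        "environment": "production",
        "attributes": {
            "monitoring": {
                "graphs": {"queue_depth": {"targets": ["myapp.queue.size"]}},
            },
        },
    },
    {
        # A node that publishes no monitoring attributes at all.
        "name": "db1.example.com",
        "environment": "production",
        "attributes": {},
    },
    {
        "name": "app1.dev.example.com",
        "environment": "development",
        "attributes": {
            "monitoring": {
                "graphs": {"queue_depth": {"targets": ["myapp.queue.size"]}},
            },
        },
    },
]


def discover_graphs(nodes, environment):
    """Collect graph definitions published by nodes in one environment.

    Returns a dict keyed by (node name, graph name) so that two nodes
    publishing the same graph name do not overwrite each other.
    """
    found = {}
    for node in nodes:
        if node["environment"] != environment:
            continue
        graphs = node["attributes"].get("monitoring", {}).get("graphs", {})
        for graph_name, definition in graphs.items():
            found[(node["name"], graph_name)] = definition
    return found
```

Calling `discover_graphs(NODES, "production")` returns only the graph published by `app1.example.com`. The consuming side never needs to know which applications exist; it only needs to know where on a node to look.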
Right now our tool of choice for persistence and graphing of metric data is graphite. The graphite product suite allows flexibility in how data is collected, aggregated, presented and analyzed, but it has a very poor user experience, so we have adopted gdash to build our dashboards.
System Level Monitoring
We use a number of different tools to monitor the system level characteristics but the default answer on Ubuntu/Linux hosts is to use collectd. It has many plugins to monitor all sorts of characteristics but we tend to use it to measure system level characteristics such as CPU usage. The collectd agents publish directly to a graphite server using the write_graphite plugin.
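For readers unfamiliar with what ends up on the wire, graphite's carbon daemon accepts a plaintext protocol in which each sample is one line of the form `path value timestamp`, conventionally on TCP port 2003. The sketch below shows that format; the metric path and host used in the example are hypothetical, and the exact path layout collectd produces depends on the plugin's prefix and escaping settings.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample for graphite's plaintext protocol.

    The result is "path value timestamp" followed by a newline.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_samples(host, lines, port=2003, timeout=5.0):
    """Send pre-formatted lines to a carbon plaintext listener.

    Port 2003 is carbon's usual plaintext port; adjust if yours differs.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

For example, `graphite_line("collectd.web01.load.load.shortterm", 0.42, 1700000000)` yields a single line that could be handed to `send_samples`.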
We have rewritten the existing collectd cookbook to use an attribute-driven approach, so that each node need only define the appropriate attributes and include the appropriate collectd recipe to activate the desired plugins. A typical block used to configure a collectd node would look something like the following:
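As a rough illustration of the attribute-driven idea, the sketch below turns a dictionary of plugin settings into collectd configuration text. The dictionary layout is an assumption and not the cookbook's actual attribute schema; the `LoadPlugin` and `<Plugin>` block syntax is collectd's own, and `MountPoint` and `IgnoreSelected` are options of its `df` plugin.

```python
def format_value(value):
    """Render a Python value in collectd config syntax."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return '"%s"' % value


def render_collectd_plugins(plugins):
    """Render {plugin_name: {option: value}} as collectd configuration.

    Plugins are emitted in sorted order so the output is stable between
    runs, which avoids needless service restarts on converge.
    """
    lines = []
    for name in sorted(plugins):
        options = plugins[name]
        lines.append(f"LoadPlugin {name}")
        if options:
            lines.append(f"<Plugin {name}>")
            for key, value in options.items():
                lines.append(f"  {key} {format_value(value)}")
            lines.append("</Plugin>")
    return "\n".join(lines) + "\n"


# Hypothetical attribute data for one node.
EXAMPLE_PLUGINS = {
    "df": {"MountPoint": "/", "IgnoreSelected": False},
    "cpu": {},
}
```

Rendering `EXAMPLE_PLUGINS` produces a bare `LoadPlugin cpu` line followed by a `df` plugin block with its two options.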
Of course, we also consider statistics about our Chef execution to be important enough to monitor, and we use the graphite_handler cookbook to collect these statistics.
This is enough to configure the publishing of data about the nodes to graphite, but we also want to customize graphite's retention policy for the data collected. This is done by attribute configuration: the graphite node discovers the configuration on its next converge. Below is an example configuration for the Chef statistics.
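On the graphite side, retention is controlled by carbon's `storage-schemas.conf`, where each section has a `pattern` regular expression and a comma-separated `retentions` list of `precision:duration` pairs. A minimal sketch of generating that file from discovered rules follows; the rule dictionaries and the `chef.` metric prefix are assumptions for illustration, not the attributes our cookbooks actually publish.

```python
def render_storage_schemas(rules):
    """Render an ordered list of retention rules as storage-schemas.conf text.

    Order is preserved because carbon applies the first section whose
    pattern matches a metric name.
    """
    sections = []
    for rule in rules:
        sections.append(
            "[{name}]\npattern = {pattern}\nretentions = {retentions}\n".format(
                name=rule["name"],
                pattern=rule["pattern"],
                retentions=",".join(rule["retentions"]),
            )
        )
    return "\n".join(sections)


# Hypothetical rules: keep Chef run statistics at 60s resolution for 7 days,
# then at 10 minute resolution for a year; everything else gets a default.
EXAMPLE_RULES = [
    {"name": "chef", "pattern": r"^chef\.", "retentions": ["60s:7d", "10m:1y"]},
    {"name": "default", "pattern": ".*", "retentions": ["60s:1d"]},
]
```

The catch-all rule is listed last so that more specific patterns win.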
So far we have collected data and published it into graphite with custom retention rules. To build up a dashboard we use a similar technique. Each application publishes attribute data defining a series of graphs and dashboard components. We use a modified gdash cookbook that uses search to discover all the published graph components and constructs a dashboard from the data. An example of such a configuration is below:
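Whatever the dashboard tool, a graph definition ultimately becomes a request to graphite's render API, which accepts repeated `target` parameters along with options such as `title`, `from`, `width` and `height`. The sketch below builds such a URL from a graph definition; the definition's keys, the host name and the metric names are hypothetical, and this is not how gdash itself is implemented.

```python
from urllib.parse import urlencode


def render_url(base_url, graph, period="-1h", width=600, height=300):
    """Build a graphite render URL for one graph definition.

    `graph` is a dict with a "title" and a list of "targets"; each target
    becomes its own `target` query parameter.
    """
    params = [("target", target) for target in graph["targets"]]
    params += [
        ("title", graph["title"]),
        ("from", period),
        ("width", width),
        ("height", height),
    ]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


# Hypothetical graph definition as an application might publish it.
EXAMPLE_GRAPH = {
    "title": "Queue depth",
    "targets": ["myapp.queue.size", "myapp.queue.consumers"],
}
```

`render_url("http://graphite.example.com/", EXAMPLE_GRAPH)` returns a URL with both targets plotted on one graph covering the last hour.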
Application Level Monitoring
Our applications come in all shapes and sizes, but we have many Java (and JRuby) based applications, so we collect a lot of the monitoring data via JMX. We developed a small tool (spydle) and a corresponding Chef cookbook that periodically polls the applications using JMX and pushes the data to the graphite server. The cookbook uses search to discover the configurations that other nodes have published and adds them to the poller's configuration.
We use the OpenMQ message broker that is part of the GlassFish server. To collect data about its operation we use a snippet that is not unlike the following to grab data out of JMX. This configuration is defined on the OpenMQ node.
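One step any JMX poller has to perform is mapping an MBean name and attribute to a graphite metric path. The sketch below shows one way to do that; it is not spydle's configuration format or implementation. The object name in the example follows the pattern OpenMQ uses for destination monitor MBeans as best I recall, so treat it (and the `NumMsgs` attribute) as an assumption to verify against your broker. The parser also assumes property values contain no commas.

```python
import re


def parse_object_name(object_name):
    """Split a JMX ObjectName of the form 'domain:k1=v1,k2=v2' into parts.

    Simplification: quoted values containing commas are not supported.
    """
    domain, _, key_props = object_name.partition(":")
    props = {}
    for pair in key_props.split(","):
        key, _, value = pair.partition("=")
        props[key.strip()] = value.strip().strip('"')
    return domain, props


def sanitize(segment):
    """Replace characters that would be awkward in a graphite path segment."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", segment)


def metric_path(prefix, object_name, attribute, keys):
    """Build a graphite path from selected ObjectName properties.

    `keys` picks which properties become path segments, in order.
    """
    _, props = parse_object_name(object_name)
    segments = [prefix] + [sanitize(props[key]) for key in keys]
    segments.append(sanitize(attribute))
    return ".".join(segments)


# Hypothetical probe for the message count of a queue named "orders".
EXAMPLE_NAME = (
    'com.sun.messaging.jms.server:type=Destination,subtype=Monitor,'
    'desttype=q,name="orders"'
)
```

With these inputs, `metric_path("openmq", EXAMPLE_NAME, "NumMsgs", ["desttype", "name"])` gives `openmq.q.orders.NumMsgs`.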
Business Level Monitoring
The business level metrics are collected from all sorts of places, but the two main sources of information are JMX characteristics exposed by our applications and values in the database. The metric data collected from the database is often an aggregate SQL query against either our operational data store or our warehouse database. Spydle also supports queries against a database, so it is the tool of choice at this level as well.
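To make the database case concrete, the sketch below runs an aggregate query and pairs the result with a metric path, ready to be formatted for graphite. It uses an in-memory SQLite database so it is self-contained; the `incidents` table, its columns and the metric name are invented for illustration and bear no relation to our real schema or to spydle's internals.

```python
import sqlite3


def collect_db_metrics(conn, queries, timestamp):
    """Run each aggregate query and return (path, value, timestamp) samples.

    Each query is expected to return a single row with a single column.
    """
    samples = []
    for path, sql in queries.items():
        value = conn.execute(sql).fetchone()[0]
        samples.append((path, value, timestamp))
    return samples


# Hypothetical operational data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, kind TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (kind, status) VALUES (?, ?)",
    [
        ("wildfire", "going"),
        ("wildfire", "contained"),
        ("wildfire", "going"),
        ("flood", "going"),
    ],
)

# Hypothetical metric definition: one graphite path per aggregate query.
QUERIES = {
    "business.wildfires.going": (
        "SELECT COUNT(*) FROM incidents "
        "WHERE kind = 'wildfire' AND status = 'going'"
    ),
}

samples = collect_db_metrics(conn, QUERIES, 1700000000)
```

Because the query is just data, an application can publish it as an attribute in the same way it publishes graph definitions.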
To release an application, we promote a new application-specific cookbook and then converge the application node. This ensures that the application is deployed and that the attribute data for the monitoring system is published on the node. We then converge the monitoring nodes; they discover the new configuration for the application via search and update the graphite, gdash and spydle configuration as necessary. Rollback is simple, as it is just another release followed by a re-converge of the application nodes and the monitoring nodes.
So far, what we have works well. It is easy to monitor and graph data about a node or an application. In reality the configuration is a little more complex than is indicated above; we tend to have short retention times for data in environments other than production and we tend to limit the generation of graphs to environments we care about. We are trialing a few tools to generate alerts, mostly by querying graphite. The alerts will follow the same approach as our other monitoring infrastructure and we will publish the alerts in the application node's attributes. Once that is in place we will have much better insight into how our systems behave.
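As a sketch of what a graphite-driven alert check might look like, the function below classifies a series of datapoints against an expected range. It is not one of the tools we are trialing. It assumes the `[value, timestamp]` pair layout of graphite's JSON render output, and the trend test is a crude two-point extrapolation from the first to the last sample rather than a proper regression; the thresholds and horizon are hypothetical.

```python
def check_series(datapoints, low, high, horizon):
    """Classify a series as "ok", "warning", "critical" or "unknown".

    datapoints: list of [value, timestamp] pairs; None values are skipped.
    low, high:  inclusive bounds of the expected range.
    horizon:    seconds into the future to project the current trend.
    """
    points = [(ts, value) for value, ts in datapoints if value is not None]
    if not points:
        return "unknown"

    last_ts, last_value = points[-1]
    if not (low <= last_value <= high):
        return "critical"

    first_ts, first_value = points[0]
    if last_ts > first_ts:
        # Straight line through the first and last samples.
        slope = (last_value - first_value) / (last_ts - first_ts)
        projected = last_value + slope * horizon
        if not (low <= projected <= high):
            return "warning"
    return "ok"
```

For instance, a queue that grew from 50 to 80 over a minute, checked against a 0 to 100 range with a five minute horizon, is reported as a warning even though it is still within bounds.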