In recent years I’ve come to view Operations as a traditional bottleneck that developers have become comfortable with. I think this is changing rapidly, both to give Developers more visibility into how their applications behave in production & to allow faster delivery of value into production environments through practices like continuous deployment.
One of the areas where Operations is often a bottleneck is monitoring. The traditional model is for Ops to ask Dev what metrics they need monitored & then set those up. This often means that monitoring can’t start until the metrics are available in the code, and even then it’s days or weeks before some Ops person has time to set up the monitoring system to pull them. This is broken and unnecessary.
If you are operating in a service delivery model where you have control over all the systems you monitor, you should be working to get out of the way. You should be working to make the monitoring happen automatically without Ops involvement. This doesn’t mean that Dev does all the work; it means that Ops selects monitoring systems that allow for discovery of new metrics & automatic collection of those metrics, without additional incremental work each time.
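To make the idea concrete, here’s a minimal sketch of a collector that accepts whatever metrics it is handed - nothing here is from a real monitoring product, and the metric names are made up, but it shows the property to select for: unknown metrics require no registration step.

```python
class AutoCollector:
    """A toy collector: any metric it is handed gets tracked, no pre-definition."""

    def __init__(self):
        self.series = {}  # metric name -> list of observed values

    def record(self, name, value):
        # An unseen metric is created on first sight - no Ops-side setup step.
        self.series.setdefault(name, []).append(value)


collector = AutoCollector()
collector.record("app.requests", 1)
collector.record("app.latency_ms", 42)  # never declared anywhere, still tracked
```

The point is the `setdefault` line: the cost of a Developer adding a new metric is zero on the collection side.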
Some of this is technology selection, some of this is architecture, and some of this is just doing the work. It does take work - but I would be hard pressed to find an example where the work required to set this up isn’t offset by the work saved in the long run by not having to respond to every new metric that gets added. Below are some concrete examples of what I’m talking about - if you aren’t familiar with these tools, read on.
The Yammer metrics library has made it really easy to expose your application metrics automatically. It also provides hooks into tools like Ganglia and Graphite for pushing metrics to the monitoring system. As you look at how to scale a monitoring system, these are great tools to allow for that. Another popular data collection tool is statsd. The main idea is that you want to use collection tools that don’t require metrics to be pre-defined. If you give them a value for a metric, they track it - that’s all. The more often you give it to them, the more numbers they store.
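statsd is a good illustration of how simple this can be: clients fire single UDP datagrams in the form `<name>:<value>|<type>`, and the server tracks whatever names arrive. A minimal sketch (the metric name is made up; 8125 is statsd’s default port):

```python
import socket


def send_stat(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a metric in statsd's line protocol: "<name>:<value>|<type>"."""
    payload = f"{name}:{value}|{kind}".encode()
    # UDP means the app never blocks on (or breaks because of) the monitoring system.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()


# A counter increment; statsd aggregates however many of these arrive.
send_stat("app.logins", 1)
```

Note there is no registration call anywhere - the first datagram a statsd server sees for a name is what creates the metric.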
Ganglia is great for allowing you to programmatically define graphs and manage those via your CM system of choice, like puppet or chef. Another approach is something like Graphite, which provides a rich and generic UI for taking whatever metrics you collect & combining them into a graph. Building custom dashboards and such in Graphite is where its strength lies.
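The combining happens through Graphite’s render API, where targets are expressions over metric paths. The hostname and metric names below are invented for illustration, but the wildcard and `sumSeries` function are Graphite’s own:

```
# Sum request counters across every web host into one series, last hour:
http://graphite.example.com/render?target=sumSeries(stats.web*.requests)&from=-1h
```

Because targets match on wildcards, a graph like this picks up new hosts and new metrics without anyone editing the dashboard.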
Nagios. We all dislike it, but it works pretty well. The main advantage Nagios has over more “intelligent” systems is that it can be configured through your CM system of choice. Additionally, Nagios has a massive community behind it. When building out Nagios or whatever you use, do your best to drive your configuration through CM and try to get things to the point where you don’t have to do any incremental monitoring work for each new system you add. New systems that are the same type as a system that’s already defined should just get monitored for “free”.
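This is what “free” monitoring looks like in Nagios object configuration - a hypothetical snippet your CM system might render, where the check is attached to a hostgroup rather than to individual hosts:

```
# Any host your CM adds to the "webservers" hostgroup picks up this
# check automatically - no per-host service definition needed.
define service {
    use                 generic-service
    hostgroup_name      webservers
    service_description HTTP
    check_command       check_http
}
```

With CM managing hostgroup membership, bringing up a tenth web server adds monitoring with zero incremental Nagios work.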
Think of monitoring like a service, like any other application in your architecture. You want it to discover what’s out there and configure itself as much as possible. Doing this isn’t completely simple yet - but it’s possible, and if you set your mind to it you might even find a way to do it better that you can contribute back to the community. In doing this, try to get out of the way of your Developers and strive to have the metrics they expose in their applications automatically show up in your monitoring system of choice. Make it very low cost for them to add new metrics & you will probably be surprised at how much monitoring you get when all a Developer has to do is write the code to track a metric & it shows up in prod.