How many times have you had this happen?
Ren: Hey, I had to fix this broken server yesterday – it was a major pain. Took me 3 hours to figure out that I had to pull the clinker to fix it.
Mike: Yeah, I had that happen to me 3 times last week. Sorry, I should have let you know.
Ren: Really? Did you ask anybody about it?
Mike: I asked Bob about it, he wasn’t sure.
Sound familiar? There’s a better way.
I’m a classic over-communicator – I communicate about all kinds of things in different ways. I’m sure plenty of folks at the companies I’ve worked for had an “Aaron filter” to file my stuff into folders and then never read it. I think about this and try hard to make my communication relevant & get to the point at the start of the message, not the end, but that’s another post. The point of this post is that, as a member of any organization, at any level, if you want someone to know stuff is broken you have to tell them about it.
So, at the risk of adding yet more Aaron spam to the mailbox of my beloved co-workers, when I have a problem I try to communicate it. If I don’t have time to fully dig into it – I at least send something off like this:
Subject: bambam ran out of memory last night, looks like it was the thumper process
Around 3am bambam alerted that it was out of memory. I got in and managed to spot the thumper process consuming about 30gig of memory. I had to kill it off, but wanted to see if folks were aware and if someone could review logs to see if there’s something more we can do.
I send this message to two groups:
The Operations team responsible for resolution of future occurrences of this problem.
The Engineering team (as limited a distribution as is appropriate) that is responsible for this component.
At a bare minimum I include:
What time the problem happened. For every person that says “Duh” I can show you 3 examples where this isn’t included.
What system this happened on.
What I did to fix it.
If I have the time I like to include the following:
Were there any obvious error messages in the logs?
Was the service that was having problems still functioning? This may sound silly, but often you get alerts about load / memory consumption before the service actually stops working – it’s relevant.
When did the event start and stop? Were there any other alerts around the same time?
Were there any core files produced, any other services impacted, or any other activity around the same time that may be relevant?
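If it helps to have a template, here is a minimal Python sketch of that checklist. The IncidentReport class and its field names are my own invention for illustration, not part of any tool; the point is just that the three bare-minimum items are required and everything else is optional.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical container for the checklist above."""

    # Bare minimum: these three are always required.
    when: str      # what time the problem happened
    system: str    # what system it happened on
    fix: str       # what was done to fix it
    # Nice to have: filled in only when there was time to look.
    summary: str = ""
    service_was_working: str = ""
    log_errors: List[str] = field(default_factory=list)
    related_events: List[str] = field(default_factory=list)

    def subject(self) -> str:
        # Put the point in the subject line, not at the end of the message.
        what = self.summary or f"had a problem at {self.when}"
        return f"Subject: {self.system} {what}"

    def body(self) -> str:
        lines = [
            f"When: {self.when}",
            f"System: {self.system}",
            f"What I did to fix it: {self.fix}",
        ]
        # Optional sections are left out entirely when empty.
        if self.service_was_working:
            lines.append(f"Was the service still working: {self.service_was_working}")
        if self.log_errors:
            lines.append("Log errors:")
            lines.extend(f"  {entry}" for entry in self.log_errors)
        if self.related_events:
            lines.append("Other alerts or activity around the same time:")
            lines.extend(f"  {entry}" for entry in self.related_events)
        return "\n".join(lines)
```

Filling it in with the bambam example (when="around 3am", system="bambam", fix="killed the thumper process") and printing subject() and body() gives something ready to paste into a mail to both teams.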
Sometimes that first communication doesn’t fix the problem – persist
Often what I get back is “Sorry, there really isn’t any info in the logs, not much we can do”. And that’s ok, because I’ve already accomplished two goals:
The guys who have to fix this next time know that it happened and how I fixed it.
The Engineering team is aware of the issue & I can move forward with the next step.
The next step: Ask more questions.
Again, be part of the solution. This is everyone’s problem – even if they don’t realize it – so if you have to be the guy engaging other teams to get to the bottom of it, do it. Saying “it’s obviously an application problem” doesn’t get you any more sleep, it doesn’t get developers any closer to fixing the problem, and it doesn’t help the company as a whole.
Start asking questions about how you can get more detail to help:
What additional information would help you diagnose this problem? Can we be monitoring something we aren’t?
What has changed with this component that may have triggered this to start happening?
Is there anything unusual in the logs in terms of trending? (This is where getting familiar with your friends grep, awk & wc comes in real handy.)
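For the trending question, a grep | awk | wc pipeline is usually the quickest route. The same idea in Python looks roughly like the sketch below, which counts matching log lines per hour. It assumes every log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp, which may well not match your logs; the function name matches_per_hour is made up for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: lines start with "YYYY-MM-DD HH:MM:SS". Adjust for your logs.
HOUR_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):")


def matches_per_hour(lines: Iterable[str], pattern: str) -> Counter:
    """Count lines matching `pattern`, bucketed by the hour they were logged."""
    wanted = re.compile(pattern)
    counts: Counter = Counter()
    for line in lines:
        stamp = HOUR_PREFIX.match(line)
        # Lines without a recognizable timestamp are skipped, not guessed at.
        if stamp and wanted.search(line):
            counts[stamp.group(1)] += 1
    return counts
```

Pass it an open log file and a pattern such as "ERROR", then print the sorted items. A jump in one hour's count compared to its neighbours is the kind of trend worth bringing to the developers.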
And then be willing to set those things up to get more detail. Some of the most long-running problems I’ve had to deal with are those strange things that crop up in the middle of the night and only get noticed the next morning in graphs. Those types of problems sometimes require hacky data gathering to spot the issue – but you just gotta do it:
Run top in batch mode (-b) every minute dumping to a log to see what processes are doing what at various points throughout the night.
Set up sar to capture every minute instead of every 10 minutes.
Send an SMS to your phone when the system gets into the state you see on the graphs so you can look at it. CPU goes to 100% for 30 minutes at 3am – have it page you when it does this and get up and have a look at it. Sucks to be you, get over it.
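Here is a rough Python sketch of the first and last of those ideas. It assumes a Linux box with the procps version of top, where -b is batch mode and -n 1 limits it to a single iteration. The function names are hypothetical, and the actual paging step is left out because how you send an SMS depends entirely on your own setup.

```python
import subprocess
import time
from datetime import datetime
from typing import Iterable


def snapshot_top() -> str:
    """Return one batch-mode iteration of top (assumes procps top on Linux)."""
    result = subprocess.run(
        ["top", "-b", "-n", "1"], capture_output=True, text=True, check=True
    )
    return result.stdout


def log_top_forever(path: str, interval_seconds: int = 60) -> None:
    """Append a timestamped top snapshot to `path` every `interval_seconds`."""
    while True:
        with open(path, "a") as log:
            log.write(f"===== {datetime.now().isoformat()} =====\n")
            log.write(snapshot_top())
        time.sleep(interval_seconds)


def sustained(samples: Iterable[float], threshold: float, needed: int) -> bool:
    """True if `needed` consecutive samples are at or above `threshold`."""
    run = 0
    for value in samples:
        # Any sample below the threshold resets the streak.
        run = run + 1 if value >= threshold else 0
        if run >= needed:
            return True
    return False
```

With one CPU sample per minute, sustained(samples, 100, 30) is the "100% for 30 minutes" condition. Checking for a streak rather than a single spike keeps the 3am pages limited to the state you actually want to look at.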
Ask the developers for help.
We’re all busy and some of us are more engaged on problems than others. Usually the Ops guys being woken up at 3am by a problem are more engaged than the developers – but that doesn’t mean the developers aren’t able to help. If you need more info out of the application to get to the bottom of your problem – ask for it.
I had a recent problem where I needed to validate that some metrics our app was producing were accurate – they seemed to contradict what I was observing on the network interface (which I was more inclined to trust than our app metrics). I asked for some additional logging to see what the app was doing – it took a few hours for the developer to whip up a build and I had it in production the same day. Those logs helped immensely in identifying the source of the problem.
But if you don’t ask – you’ll never get it.
If you like to fix problems at 3am over and over – ignore this post. If, however, you are more of the “fix it once, then automate it or kill it” type – then work on communicating your problems to others who can help you. It’s not magic, it doesn’t always fix the problem right off the bat, but more often than not it helps others understand what is going on and leads to more understanding and eventually – more happiness.