Why do we insist on consensus on the role of Ops?

March 19th, 2012 No comments

I’ve seen so many threads over the last few weeks about who should do what, why, and what you should do about it if you don’t conform. I don’t get it. Ops is a team in a company – there are lots of types of companies. Companies typically have a few goals:

1) Make money

2) Change the world, as long as we can do #1.

Lots of companies accomplish these goals doing things wrong. If you want proof, read Good to Great; there are oodles of examples of companies that didn’t qualify as “great” but that you would recognize as successful.

When wagon trains migrated families west across the US, the idea of driving 40mph, of crossing a state in a day, would have been crazy talk. Then came the locomotive.

When locomotives moved people across the country, the idea of a car making an interstate trip would have been crazy. It would be madness if everyone operated their own car. Then came cars, and roads, and traffic signals, and road signs. This took time, lots of mistakes, lots of retrospectives, and year over year progress.

Progress isn’t made by conforming to the conventions of today, it’s made by pushing for something better. That’s what some folks are doing in Ops today – they are trying to push the limits and do what works for them. Others are observing these patterns and following suit. Still others are sitting back and saying “That ain’t right, my process works just fine”. Perfect.

It wasn’t necessary for automobile manufacturers to convince railroad operators that the car was the future. The car became the future because people adopted it, because it worked, and because over time the infrastructure that supported it became more mature.

As our tools get better, as our patterns become more and more repeatable, as we start to understand what roads & traffic signals & road signs we need for Ops to get out of the way of Developers making changes in production, things will move. In the meantime – talk about what works for you, why it works for you, and don’t bother convincing other people why it should work for them.

Categories: Operations Tags:

Sometimes it takes 2 days to do 2 hours of work

February 21st, 2012 2 comments

I hear all the time:

“I could get that done in a few hours, easy”

“I could whip that up in 2 seconds”

So what? Instead of bragging about how awesome people should think you are because of how fast you say you can get things done, how about you ask some questions?

“When do you need it done by?”

“Do we have time to improve this other widget to make fixing your thing easier?”

We are so focused on trying to crank out as much as we possibly can, we sometimes think it’s better to talk about how quickly we can get things done. Instead, under-commit & over-deliver. If you have 2 days to get something done and you only need 1/2 day – spend some time improving something that will make *your* life easier. Understand what the expectations are before you commit & see if you can get in some extra benefit. If you hit a snag & end up taking 2 days to finish then nobody is disappointed, but if you don’t then you get some extra work done & still meet expectations.

I know folks prefer accurate estimates & like to fill your day with the stuff they want done – but don’t complain that you can’t get your stuff done if you aren’t under-committing once in a while.

Categories: Operations Tags:

Getting out of the way – Monitoring

February 20th, 2012 No comments

In recent years I’ve come to view Operations as a traditional bottleneck that developers have become comfortable with. I think this fact is changing rapidly to give Developers more visibility into how their application behaves in production & to allow faster delivery of value into production environments through things like continuous deployment.

One of the areas where Operations is often a bottleneck is monitoring. The traditional model is to have Ops ask Dev what metrics they need monitored & to set those up. This often means that monitoring can’t start until the metrics are available in the code, and then it can be days or weeks before some Ops person has time to set up the monitoring system to pull those. This is broken and unnecessary.

If you are operating in a service delivery model where you have control over all the systems you monitor, you should be working to get out of the way. You should be working to make the monitoring happen automatically without Ops involvement. This doesn’t mean that Dev does all the work; it means that Ops selects monitoring systems that allow for discovery of new metrics & automatic collection of those metrics without additional incremental work each time.

Some of this is technology selection, some of this is architecture, and some of this is just doing the work. This does take work – but I would be hard pressed to find an example where the work required to set this up is not offset by the work saved in the long run not having to respond to every new metric that gets added. Below are some concrete examples of what I’m talking about – if you aren’t familiar with this read these.

Metric Collection

The yammer metrics library has made it really easy to expose your application metrics automatically. They have additionally provided hooks into tools like Ganglia and Graphite for pushing metrics to the monitoring system. As you look at how to scale a monitoring system, these are great tools to allow for that. Another popular data collection tool is statsd. The main idea is that you want to use collection tools that don’t have to have metrics pre-defined for them. If you give them a value for a metric they track it, that is all. The more often you give it to them, the more numbers they store.
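To make the "no pre-defined metrics" idea concrete, here is a minimal sketch in Python. It is not the statsd or yammer metrics API; the class and method names (`MetricStore`, `record`) and the metric names are made up for illustration. The only point is that a metric starts existing the first time someone reports a value for it.

```python
import time
from collections import defaultdict


class MetricStore:
    """Hypothetical in-memory store: a metric is created on first use."""

    def __init__(self):
        # No schema and no registration step: an unknown name starts a new series.
        self._series = defaultdict(list)

    def record(self, name, value, timestamp=None):
        """Store one (timestamp, value) sample for the named metric."""
        ts = time.time() if timestamp is None else timestamp
        self._series[name].append((ts, value))

    def names(self):
        """Return every metric name seen so far, sorted."""
        return sorted(self._series)

    def values(self, name):
        """Return the recorded values for one metric, oldest first."""
        # .get() avoids creating an empty series just by asking about it.
        return [value for _, value in self._series.get(name, [])]


store = MetricStore()
store.record("checkout.latency_ms", 120)
store.record("checkout.latency_ms", 95)
store.record("signup.count", 1)  # a brand-new metric, nothing to set up first
```

The more often the application calls `record`, the more samples are kept. Nothing on the collection side has to change when a Developer adds a new metric name.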

Graph presentation

Ganglia is great for allowing you to programmatically define graphs and manage those via your CM system of choice, like Puppet or Chef. Another approach is something like Graphite, which provides a rich and generic UI for taking whatever metrics you collect & combining them into a graph. Building custom dashboards and such is where Graphite’s strength lies.

Alerting

Nagios. We all dislike it, but it works pretty well. The main advantage Nagios has over more “intelligent” systems is that it can be configured through your CM system of choice. Additionally, Nagios has a massive community behind it. When building out Nagios or whatever you use, do your best to drive your configuration through CM and try to get things to the point where you don’t have to do any incremental monitoring work for each new system you add. New systems that are the same type as a system that’s already defined should just get monitored for “free”.
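As a rough sketch of what "monitored for free" can look like, the Python below renders Nagios object definitions from an inventory. Everything specific here is an assumption for illustration: the host names, addresses and role-to-check mapping are made up, and the `generic-host` and `generic-service` templates and the `check_http` and `check_ssh` commands are assumed to already be defined in your Nagios configuration. In practice the inventory would come from your CM system rather than a hard-coded list.

```python
# Hypothetical mapping from a system type to the checks every system of that type gets.
CHECKS_BY_ROLE = {
    "web": ["check_http", "check_ssh"],
    "db": ["check_ssh"],
}


def render_host(name, address, role):
    """Render one Nagios host plus the service checks its role implies."""
    blocks = [
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}"
    ]
    for command in CHECKS_BY_ROLE[role]:
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"
            f"    host_name            {name}\n"
            f"    service_description  {command}\n"
            f"    check_command        {command}\n"
            "}"
        )
    return "\n\n".join(blocks)


def render_all(hosts):
    """Render the whole inventory into one configuration file body."""
    return "\n\n".join(render_host(**host) for host in hosts) + "\n"


# Hypothetical inventory: adding web03 here is the only step needed to monitor it.
hosts = [
    {"name": "web01", "address": "10.0.0.11", "role": "web"},
    {"name": "web02", "address": "10.0.0.12", "role": "web"},
]
config = render_all(hosts)
```

The design choice is that checks hang off the system type, not the individual host, so a new host of a known type needs no monitoring work of its own.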

Summary

Think of monitoring like a service, like any other application in your architecture. You want it to discover what’s out there and configure itself as much as possible. Doing this isn’t completely simple yet – but it’s possible and if you set your mind to it you might even find a way to do it better that you can contribute back to the community. In doing this, try to get out of the way of your Developers and strive to have metrics they expose in their application automatically show up in your monitoring system of choice. Try to make it very low cost for them to add new metrics to see new information & you will probably be surprised at the amount of monitoring you get when all a Developer has to do is write the code to track the metric & it shows up in prod.

Categories: Operations Tags:

Overcoming Obstacles

February 17th, 2012 No comments

Three days a week, unless there’s something seriously preventing it, I scarf down my lunch and head down to the bouldering gym for an hour of climbing. This isn’t a post about rock climbing, but in the short time I’ve been climbing (about 8 months) I’ve found the whole practice to change the way I think about some problems. Rock climbing is a physical sport, but it’s also a very mental sport. Climbing difficult routes is a combination of strength, technique and determination. The process of learning is interesting as well, because you don’t hear a lot of people giving advice in the climbing gym. Folks lead by example.

So, I wanted to draw some parallels to my observations about climbing and working through obstacles in tech. We deal with new challenges every day and some can be pretty intimidating. We also learn and teach a lot, and I think there are some lessons about that as well.

Don’t allow obstacles to defeat you before you start.

When you approach a new route it can be intimidating: you aren’t really sure what to expect or what part of the route is going to be most difficult. This is true of approaching many problems, but just because a problem looks intimidating does not mean it cannot be overcome. There is a lot we don’t understand until we are working our way through a problem, and telling yourself you can’t do this isn’t going to help. Make one move at a time & do your best – rarely is the situation so critical that you cannot afford to adjust as you learn.

When you miss, inspect and adapt.

The bouldering gym is full of big pads. Those aren’t there because nobody ever falls. Everybody falls. This is part of the process of challenging yourself, part of the process of trying new things. You go to the gym to fall because it’s safe to challenge yourself there & learn how to improve.

Too often I hear folks who are afraid to fall, afraid they might choose the wrong path when working through a problem in their life, their career, or some technical issue. It just isn’t possible to know the right path 100% of the time so don’t bother, do your best. When the inevitable fall happens, take another look at your moves, try to understand what went wrong, and try again a different way.

Inspecting and adapting to what you learn is one of the greatest skills you can learn. Freeing yourself to make mistakes removes a lot of barriers that you thought were there when they actually weren’t.

Watch others

In the gym this means literally watching other climbers. Some climb with such grace that they make things look easy. This is true of a lot of things – so look at what others are doing. We all experience problems in different ways and we all solve them in different ways; learning from each other is key to progress. But keep in mind too that what works for one may not work for another. A tall person will climb a route much differently than a short person: they have a longer reach and a different center of gravity. Use ideas you see, but don’t get too upset if those same ideas don’t work for you.

Be patient

When you first start to climb, as when you first get into most things, there is a period of fast improvement. You feel great, you are learning fast, you must be awesome. As you learn more and as you start to approach more difficult challenges, your ascent will seem to slow. You are getting better, you are learning stuff, but it’s not as easy as it used to be. Once you’ve been doing this for 15 years, the problems that are hard to overcome aren’t about learning how to use some new programming language or learning to deal with some new technology, they are the finer points that actually make you better day to day. Those things take time to overcome, they are hard problems that require discipline and persistence like you have never needed before.

Climbers who have been climbing for many years will tell you that it becomes very hard to progress to the next level. Each progressive level requires significant improvement & a lot of work. You have to be patient & keep at it, you have to love climbing to climb, and you will improve.

Be Helpful

The only reason I am where I am is because there were people who were willing to help me along the way. When I first set foot in the climbing gym there were people who showed me the basics. When I was clearly struggling with a route, there were people who climbed it & offered advice. When I have had problems finding that missing semicolon in a sea of code, there have been others willing to lend a 2nd set of eyes to find it.

We need each other to overcome obstacles and we each bring a different set of skills to the table. Your being helpful contributes to that, just as you have leveraged others’ helpfulness to get where you are. Give back and help out.

Be Nice

It’s easy to be arrogant. It’s easy to tell someone that your challenges are more difficult than theirs. It also serves no one but yourself. Be kind as you work your way through your challenges because relationships matter more than any ability you could ever learn.

Categories: Operations Tags:

Keep decisions low cost so they are easier to make

February 16th, 2012 No comments

Which question is easier to answer?

1: Select the storage platform which will serve all our needs for the next 3 years up to 10PB.

2: Select a standard disk drive to use in the next 5 servers we purchase.

Most people could answer #2 pretty easily with a little bit of information. It might take a day or two to figure this out. And if you get it wrong, what’s the worst thing that happens? You’re replacing 10 disks or so. #1 though, that’s harder – that requires knowing what you are going to be doing for the next 3 years, how much money you can spend, how much flexibility you are going to need, etc.

The cost of being wrong for question #1 could be millions of dollars.

The cost of being wrong for question #2 probably means a few thousand dollars.

You don’t always have the luxury of changing the question, but when faced with a decision that seems to be hard to make, look for ways to make it easier. One of the ways to make a question easier to answer is to lower the cost of being wrong. For question #1, how can you answer that question for the next year without excluding the possibility of expanding to 3 years? How can you get more data about your needs by starting small, gathering data, building understanding? How can you constrain the scope, not solve all your problems but solve the most pressing ones?

Try not to solve all your problems at once because there’s a good chance you aren’t going to succeed. Pick a problem, and do the simplest thing that works to fix that problem.

Categories: Operations Tags:

Work-life balance is personal

February 15th, 2012 No comments

I was once (or three times) told that I was working when I should not. Usually this happens in response to a 10pm email, or an email on the weekend, or three of them. The thing is, that’s an important part of my balance and if I don’t do it I get thrown off.

I like the idea that companies try to embrace work-life balance, the ideal that says you should have a life outside of work. What I do not like, however, is that anyone thinks they know what time of day I should be working. Just because someone 50 years ago said that 40 hours a week is ‘normal’ doesn’t mean it’s appropriate for everyone. I get ideas at 10pm & I want to share them while I’m inspired.

Inspiration & opportunity come my way when they do, and when they come I have to decide if I want to make my move and get something done or wait until I’m “on the clock”. Doing the latter almost always means not doing it at all. Inspiration, after all, expires rather quickly.

What if work-life balance meant being honest with yourself and your employer? What if it meant that when there was an opportunity, you jumped on it, and for doing that you were given the flexibility to take that time back when there were personal opportunities outside of work? A number of companies do this already, it’s just informal and largely based on trust. The problem with this is that when it becomes a problem because of an individual, it has to get corrected for the whole team. Usually that correction comes in the form of “core hours” or some other lame blanket of consistency placed on the many because of problems with the few.

Part of the reason individuals see a problem in my work habits is because they feel pressure to conform to their interpretation of my actions. If I send a note at 10pm, that doesn’t mean that I am working 8 hours during the day + 2 hours at night. Some days it might mean I worked 10 hours, other days I might put in 6. The problem is that people feel pressure to do the same. They see someone else working “overtime” and feel like they need to. But they don’t know what I’ve been doing – they just see me working off-hours and assume it’s “overtime”. This turns into the ultra-competitive work-all-day-and-night environment that some companies have. That’s not what I’m talking about. That’s not what I do.

I have the good fortune of having work that overlaps with my passion, which is often my hobby. My hobbies contribute to my work, and vice versa. This blurs the distinction between work & life. My wife struggles with this – “I can’t tell if you are working or not when you are on the computer”. To which I ask “Why does it matter?”. Whether I’m working for my own goals, or working to get paid, the parameters are really the same. She doesn’t like that answer because she doesn’t want to interrupt me when I’m working, but if I’m not working it’s ok. The truth is, it’s always ok – and I’ll tell her (nicely, I hope) if it isn’t.

Maybe if we were more honest about how we worked & more open about what we expect from each other it would be easier to work this way. If we managed more based on achievement than hours or individual tasks. Maybe then we could all work better.

Categories: Career, Operations Tags:

Are you making blameless, data driven decisions?

February 14th, 2012 No comments

On this day of emotion & emotional decisions, this post is about not making decisions using emotions.

When I hear people use words like “could be” or “should be” I start to wonder if there’s enough data on the table to make a good decision. It’s true that sometimes you just don’t have the answer, so you have to ask more questions. How can we learn more to make this decision easier to make? How can we make this decision easier to change if we get it wrong?

Decisions made in the absence of data have another problem: the only collateral is personal responsibility. You are trusting someone’s gut, or someone’s experience, or someone’s opinion. When things don’t go right, you blame that someone.

If you are making a decision that is hard to change, it should be made when the answers to questions are “it is…” or “it is not…”. You use those words because you are confident: you have tested & have data to back it up. You should be saying “We know” and not saying “We think”. Saying you know without data is just being dishonest. I don’t care if you’ve done this before, the people in the room have only your credibility to trust. Solutions don’t work in different environments for a variety of reasons, but if your credibility was the basis for a decision it will be the only thing that gets blamed.

Try to make decisions easy to change.

Try to make decisions based on fact, not opinion.

Try to make decisions that do not rely on your credibility.

Categories: Operations Tags:

If you aren’t using feature toggles, start… now

February 13th, 2012 No comments

Search for ‘feature toggle’ in Google, check out the results. The simple fact is that branching using a revision control system still has its place, but its place is not controlling when you release a feature to your customers. Feature toggles create a distinction between deploying your feature & making that feature available for use. They also remove the requirement that to disable a feature, or to go back to ‘old behavior’, you have to roll back your deployment to an older version of code. There are lots of other benefits too, as well as some challenges.

Bottom line though, if you aren’t using these you need to really seriously consider whether they would be a benefit. If you control your software release & you operate a multi-tenant system, and you want to increase the amount of control you have around the features you release, you need to be using these.
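For anyone who hasn't seen one, here is a minimal sketch of a feature toggle in Python for a multi-tenant system. It isn't taken from any of the posts linked below; the `FeatureToggles` class, the `new_checkout` feature and the tenant names are all hypothetical. A real implementation would keep the toggle state somewhere you can change at runtime, such as a database or config service, instead of in memory.

```python
class FeatureToggles:
    """Hypothetical toggle registry: features are off unless switched on."""

    def __init__(self):
        self._enabled_for_all = set()
        self._enabled_tenants = {}  # feature name -> set of tenant ids

    def enable(self, feature, tenant=None):
        """Turn a feature on for one tenant, or for everyone if no tenant is given."""
        if tenant is None:
            self._enabled_for_all.add(feature)
        else:
            self._enabled_tenants.setdefault(feature, set()).add(tenant)

    def disable(self, feature):
        """Turn a feature off everywhere without redeploying or rolling back."""
        self._enabled_for_all.discard(feature)
        self._enabled_tenants.pop(feature, None)

    def is_enabled(self, feature, tenant):
        return (feature in self._enabled_for_all
                or tenant in self._enabled_tenants.get(feature, set()))


toggles = FeatureToggles()


def render_checkout(tenant):
    # Both code paths are deployed; the toggle decides which one runs.
    if toggles.is_enabled("new_checkout", tenant):
        return "new checkout"
    return "old checkout"


# Release to a single tenant first; everyone else keeps the old behavior.
toggles.enable("new_checkout", tenant="acme")
```

Deploying the code and releasing the feature are now separate steps: the new path ships dark, gets switched on tenant by tenant, and `disable` takes you back to the old behavior with no rollback.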

Here are some related blog posts:

http://code.flickr.com/blog/2009/12/02/flipping-out/

http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/

http://www.rallydev.com/engblog/2011/12/09/the-awesomeness-of-feature-toggles/

http://www.rallydev.com/engblog/2011/12/12/the-best-part-of-feature-toggles-removing-them/

http://www.rallydev.com/engblog/2011/12/14/testing-feature-toggled-code/

http://www.rallydev.com/engblog/2011/12/20/feature-toggles-branching-in-code/

 

Categories: Operations Tags:

Establishing ownership in Ops Teams

February 5th, 2012 No comments

I’ve been having some discussions about this lately so figured I would write something about the topic. Being a member of an Ops team can be pretty challenging at times. The job can be high pressure and often it feels like you spend all your time fighting fires, shaving yaks, etc. One of the difficult parts of being in Ops is that it’s often hard to put your mark on things, to use your skills to leave a lasting impression.

The reason it’s hard to leave a mark isn’t because there’s a lack of work, but because the work changes so frequently that influencing the long term outcome of a project can be hard. This can often be even more difficult in Operations teams following Agile methodologies because the work is broken into smaller stories and those stories may get worked by multiple folks. Even within these teams though, there are individuals with skills in certain areas and often there is more than one person with passion for a particular topic. Someone who’s passionate about a topic is more likely to do a great job, in my experience, and so we should see how we can leverage it.

Roles and Responsibilities Matrix

One successful tool I was shown was a Roles and Responsibilities matrix. The goal of this is to establish some basic ownership of components within an infrastructure so that individuals can focus their work. This often happens naturally in teams, but doing this formally accomplishes a few important goals:

  • Allows individuals with no experience, but with an interest, to raise their hand and work with new things.
  • Allows the team to agree on who is responsible for what infrastructure pieces. This is not sole ownership, but more about establishing expertise & creating less contention over decisions.
  • Helps you, as the manager, formalize who to work with on specific issues.

The matrix is pretty simple: for each component (you can partition this however you want) you define two roles, a “P1” and a “P2”. These are the primary and secondary points of contact for that component. But there’s more to this than just having a primary and secondary:

  • P1: This person is the current “non-expert”, the trainee. All escalations for this component should go to them first. If they don’t know the answer it’s their responsibility to work with the P2 & resolve the issue. In this process, they learn.
  • P2: This person is the current expert, the trainer. They understand that they are P2 and are to work with the P1 on issues where they need help.

I have also observed this setup where there’s only a P1 and they are the expert because there just aren’t enough folks to have a P1/P2 for that component (or it’s not a priority). Another reason for the P1 to be the expert is if the system is going through a lot of changes and you want someone to keep tight reins on what changes are made.

Here is what an example matrix might look like

Looking over this, each person is a P1 for one component & a P2 for some other. In a perfect world it works out like this, but the world ain’t perfect. Do your best with what you have – but try to set up something like this.
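If you would rather keep the matrix somewhere a script can read it, here is a minimal sketch in Python. The components and people are made up for illustration and are not the matrix from my team; the `escalation_path` and `roles_for` helpers are hypothetical too.

```python
# Hypothetical matrix: each person is P1 for one component and P2 for another.
MATRIX = {
    "monitoring": {"P1": "alice", "P2": "bob"},
    "networking": {"P1": "bob", "P2": "carol"},
    "storage": {"P1": "carol", "P2": "alice"},
}


def escalation_path(component):
    """Return who to contact, in order: the P1 (trainee) first, then the P2 (expert)."""
    roles = MATRIX[component]
    path = [roles["P1"]]
    if roles.get("P2"):  # some components only have a P1
        path.append(roles["P2"])
    return path


def roles_for(person):
    """List the (component, role) pairs a person currently holds."""
    return sorted(
        (component, role)
        for component, roles in MATRIX.items()
        for role, holder in roles.items()
        if holder == person
    )
```

For example, `escalation_path("monitoring")` gives `["alice", "bob"]`: the escalation goes to the P1 first, and the P2 is there to help them resolve it.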

This is usually established during a meeting every quarter or every 6 months. You walk through the list of functional areas and ask for volunteers. This more often than not ends with very little contest, but in the event where there are concerns about who is P1 or P2 you should try to understand why it’s important to each person to have a role in this, what they want to accomplish, and consider what other areas they also want to accomplish things in. Often, after discussing their vision on this component along with other stuff they are working on, it’ll become more self-evident who is the best P1 & you can get agreement.

Defining cross-functional areas

The matrix above works well, but the first question from folks is usually something like “what about monitoring, if I own that does it mean I have to do all that work for everyone else?”. The answer is “no” in most cases. There are some functional areas which are pretty clear & mostly self contained but there are others which cut across all the other areas. Examples where something intersects with everything else are Monitoring, Networking, Configuration Management and sometimes things like Storage, depending on your architecture.

For areas where your area of expertise is a dependency for others there needs to be shared ownership of those tasks. I generally look at it this way, using Monitoring as an example:

  • The P1 is responsible for overall architecture & infrastructure, training, documentation & escalations for that system. They are responsible for enabling the other team members to use the system effectively & for bringing any major changes to the team for review & consensus.
  • The P1 owners of other components are responsible for integrating their systems with monitoring, for writing any monitors, and for establishing meaningful metrics & thresholds for that system.
  • Both P1 owners work together to make sure any monitoring / metrics are done in a consistent way that is in line with what the team has agreed is the architecture.

In this way you are avoiding making the monitoring owner’s job suck by having to spend all day writing monitors for a million different components, but they have ownership of the overall success of the monitoring infrastructure. Individuals who own other components are making decisions about how best to monitor their own systems within the constraints of the best practices for the monitoring system & they can work with the monitoring owner if they want to break new ground on doing things a different way.

Working outside of Operations

One of the most important roles Operations plays (in my opinion) is in working with Development as closely as possible. This is becoming more and more obvious and more teams are starting to give it names, like DevOps. Some Ops folks are better at this than others and some will go out and find Developers to work with and others need to be prodded a bit.

Defining clear roles for individuals in Ops is a good way to force this collaboration. By assigning one Ops person to an upcoming Dev project & setting clear expectations around that role, you help foster their involvement and empower them to start working with other teams. That Dev team becomes a functional area, and they get a P1 & P2 like any other component.

What I would typically advocate for smaller Dev organizations is integrating one Ops person per Dev team if you can. This means that Ops person attends stand-ups, they go to planning meetings, and they are familiar with all the stuff that Dev team is working on. Should a need arise for Ops related work (or communication, which is always needed), the assigned Ops team member is responsible for that role. They aren’t necessarily responsible for all of the work but they are responsible for making sure the work is communicated & making sure it gets done.

Another approach is to assign Ops team members to individual projects. As projects arise, team members start to attend those meetings & start to get involved with any stand-ups and work around that project. I don’t like this approach as much because it relies on the Dev teams reaching out and saying “Ok, we’re ready for an Ops person now” most of the time – and that often happens late. Having Ops members already in position inside teams gives you much earlier warning and helps shape the end result much earlier.

Tracking & Communicating work

Now that everyone is working on their own projects, there will be a tendency to communicate that work less often & less completely. It takes some work to avoid this but it’s actually not all that hard. The important aspect of this is that each team member is talking about what they worked on each day at stand up & are being clear about their priorities during planning sessions. How you achieve this is up to you – but I’ll throw out some ideas.

Kanban works well as a visualization tool for work in progress. From an Ops perspective, I think that’s where the role of Kanban starts and ends. Operations is an inherently interrupt driven team and while many organizations get out of that mode through lots of practice – if you are at that point you probably don’t need my help in tracking & communicating work. Where I have seen Kanban work really well is in prioritizing work during planning (abc must come before xyz, move the card) and in visually showing what you did, what you are doing, and what you will be doing next.

Daily stand-ups are really, really helpful. Things change day to day in Ops teams and taking 10 minutes each morning to get everyone in sync with what’s going on is a huge help. Identifying blocks and talking about how to clear those is a big part of this. When everyone is there talking about things, saying “I’m blocked waiting for xyz” is an opportunity to get that problem solved today.

Also documenting proposals using a shared document system like Google Docs is a massive improvement. I can write up a proposal for something and instead of asking for feedback, people can add it right to the document – they can make comments, etc. We get together for a 30-60 minute meeting to review the document & the feedback and we take a shot at a final proposal. If there are still open questions we go back and answer those. The key is that much of the work is done asynchronously rather than asking that everyone bring their best, most un-distracted thoughts, into a meeting.

Rotating roles

Lastly, with all of this, there is change. Nobody wants to be stuck in the same role for years – people in Operations want to learn new things, they want an opportunity to take something that needs improvement and leave their mark on it. In every infrastructure there are some cool projects and there are some lame projects. There are also those parts of the system that are just a pain in the ass to maintain & nobody wants to do it. It’s important to rotate these around.

What has worked in my experience is a periodic review of the priorities. You start with a review of work in progress so that folks know what they are signing up for if they want to tackle an area they aren’t working in today. Then you wipe the slate clean & go functional area by functional area asking who wants to be involved.

The trick with this process is to try to allow folks who have projects in flight to maintain that responsibility while giving someone else a shot at learning about the system. This is where the P1/P2 roles can really be leveraged. If you are re-building your network and you really need the same guy to maintain his momentum in that project – he becomes the P2, continuing that work. You assign a new P1 (if someone new wants to be involved) and you have them tackle the day to day interrupts. The two members work together on it and the new person gets to learn while the previous owner gets to finish their project.
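The hand-off described above is small enough to sketch in a few lines of Python. The `rotate` helper and the names are hypothetical, and the sketch assumes a matrix entry is simply a dict with a P1 and a P2.

```python
def rotate(entry, new_p1):
    """Return a new matrix entry: the current P1 becomes P2 and new_p1 takes over as P1.

    The person stepping back keeps their in-flight project as P2, while the new P1
    picks up the day to day interrupts. Whoever held P2 before is dropped from the entry.
    """
    return {"P1": new_p1, "P2": entry["P1"]}


# Hypothetical example: dave is mid-way through a network rebuild and erin wants to learn it.
networking = {"P1": "dave", "P2": None}
networking = rotate(networking, "erin")
```

After the rotation erin is the first point of contact and dave is the expert she works with.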

If a functional area has no work in progress and you really want to move something new forward there, find the person who’s passionate about making that change and make them a P1. Find a P2 that can help enable them and let them go for it.

Wrapping up

Ownership is an important part of any job and in Operations it has been the light that keeps me coming back. Giving that ability to every member of your team is important, and hopefully this gives you some ideas about how to do that.

Categories: collaboration, Operations Tags:

Thanks Seth

February 3rd, 2012 No comments

There are few people I can point to on the Internet and say “you made an impact”. A few have, and countless others have contributed in various ways to the ideas I have today. No other, however, has had the impact of Seth Godin. I’m sure you know who he is already; if not, then I hope you gain something from this.

I try to read every post – I don’t always manage, and sometimes I read them out of order. Today I read this one. Like so many others, each having its own impact on me, this one made me pause and think “is this what I’m doing?”. Seth puts something out there every day. He always shows up & I always look forward to it.

So I just wanted to say thanks. Thanks for a variety of great books. Thanks for introductions to great people. And thanks for being a daily source of new thoughts.

That is all.

Categories: Awesome Tags: