I’ve been having some discussions about this lately so I figured I would write something about the topic. Being a member of an Ops team can be pretty challenging at times. The job can be high pressure and often it feels like you spend all your time fighting fires, shaving yaks, etc. One of the difficult parts of being in Ops is that it’s often hard to put your mark on things, to use your skills to leave a lasting impression.
The reason it’s hard to leave a mark isn’t a lack of work; it’s that the work changes so frequently that influencing the long-term outcome of a project can be hard. This can be even more difficult in Operations teams following Agile methodologies, because the work is broken into smaller stories and those stories may get worked on by multiple folks. Even within these teams, though, there are individuals with skills in certain areas, and often there is more than one person with passion for a particular topic. Someone who’s passionate about a topic is more likely to do a great job, in my experience, so we should see how we can leverage that passion.
Roles and Responsibilities Matrix
One successful tool I was shown was a Roles and Responsibilities matrix. The goal of this is to establish some basic ownership of components within an infrastructure so that individuals can focus their work. This often happens naturally in teams, but doing this formally accomplishes a few important goals:
- Allows individuals with no experience, but with an interest, to raise their hand and work with new things.
- Allows the team to agree on who is responsible for what infrastructure pieces. This is not sole ownership, but more about establishing expertise & creating less contention over decisions.
- Helps you, as the manager, formalize who to work with on specific issues.
The matrix is pretty simple: for each component (you can partition this however you want) you define two roles, a “P1” and a “P2”. These are the primary and secondary points of contact for that component. But there’s more to this than just having a primary and secondary:
- P1: This person is the current “non-expert”, the trainee. All escalations for this component should go to them first. If they don’t know the answer it’s their responsibility to work with the P2 & resolve the issue. In this process, they learn.
- P2: This person is the current expert, the trainer. They understand that they are P2 and are to work with the P1 on issues where they need help.
I have also observed this setup where there’s only a P1 and they are the expert, because there just aren’t enough folks to have a P1/P2 for that component (or it’s not a priority). Another reason for the P1 to be the expert is if the system is going through a lot of changes and you want someone to keep a tight rein on what changes are made.
Here is what an example matrix might look like
Looking over this, each person is a P1 for one component & a P2 for some other. In a perfect world it works out like this, but the world ain’t perfect. Do your best with what you have, but try to set up something like this.
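If you’d rather keep the matrix somewhere you can query it, it fits in a very small data structure. What follows is only a minimal sketch: the people, the components, and the helper names (`Ownership`, `escalation_order`, `roles_for`) are all made up for illustration, and the only rules it encodes are the ones described above (escalations go to the P1 first, then to the P2 if there is one).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ownership:
    p1: str                    # first point of contact (often the trainee)
    p2: Optional[str] = None   # current expert backing up the P1, if any


# Hypothetical team and components, for illustration only.
MATRIX = {
    "monitoring": Ownership(p1="alice", p2="bob"),
    "networking": Ownership(p1="bob", p2="carol"),
    "storage": Ownership(p1="carol", p2="alice"),
    "config management": Ownership(p1="dave"),  # P1-only: dave is the expert
}


def escalation_order(component: str) -> list[str]:
    """Return who to contact for a component: P1 first, then P2 if assigned."""
    owner = MATRIX[component]
    return [owner.p1] + ([owner.p2] if owner.p2 else [])


def roles_for(person: str) -> dict[str, list[str]]:
    """List the components a person holds as P1 and as P2."""
    roles: dict[str, list[str]] = {"P1": [], "P2": []}
    for component, owner in MATRIX.items():
        if owner.p1 == person:
            roles["P1"].append(component)
        if owner.p2 == person:
            roles["P2"].append(component)
    return roles
```

Here `escalation_order("monitoring")` gives `["alice", "bob"]`, and `roles_for("alice")` shows her as P1 for monitoring and P2 for storage, which is the “P1 for one thing, P2 for another” shape you are aiming for.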
This is usually established during a meeting every quarter or every 6 months. You walk through the list of functional areas and ask for volunteers. More often than not this ends with very little contest, but where there are concerns about who is P1 or P2, try to understand why having a role matters to each person, what they want to accomplish, and what other areas they also want to accomplish things in. Often, after discussing their vision for the component along with the other stuff they are working on, it becomes self-evident who is the best P1 & you can get agreement.
Defining cross-functional areas
The matrix above works well, but the first question from folks is usually something like “what about monitoring, if I own that does it mean I have to do all that work for everyone else?”. The answer is “no” in most cases. Some functional areas are pretty clear & mostly self-contained, but others cut across all the other areas. Examples of areas that intersect with everything else are Monitoring, Networking, Configuration Management and sometimes things like Storage, depending on your architecture.
For areas where your area of expertise is a dependency for others there needs to be shared ownership of those tasks. I generally look at it this way, using Monitoring as an example:
- The P1 is responsible for overall architecture & infrastructure, training, documentation & escalations for that system. They are responsible for enabling the other team members to use the system effectively & for bringing any major changes to the team for review & consensus.
- The P1 owners of other components are responsible for integrating their systems with monitoring, for writing any monitors, and for establishing meaningful metrics & thresholds for that system.
- Both P1 owners work together to make sure any monitoring / metrics are done in a consistent way that is in line with what the team has agreed is the architecture.
In this way you avoid making the monitoring owner’s job suck by having them spend all day writing monitors for a million different components, while they keep ownership of the overall success of the monitoring infrastructure. Individuals who own other components are making decisions about how best to monitor their own systems within the constraints of the best practices for the monitoring system & they can work with the monitoring owner if they want to break new ground on doing things a different way.
Working outside of Operations
One of the most important roles Operations plays (in my opinion) is in working with Development as closely as possible. This is becoming more and more obvious and more teams are starting to give it names, like DevOps. Some Ops folks are better at this than others: some will go out and find Developers to work with, while others need to be prodded a bit.
Defining clear roles for individuals in Ops is a good way to force this collaboration. By assigning one Ops person to an upcoming Dev project & setting clear expectations around that role, you help foster their involvement and empower them to start working with other teams. That Dev team becomes a functional area, and they get a P1 & P2 like any other component.
What I would typically advocate for smaller Dev organizations is integrating one Ops person per Dev team if you can. This means that Ops person attends stand-ups, goes to planning meetings, and is familiar with all the stuff that Dev team is working on. Should a need arise for Ops related work (or communication, which is always needed), the assigned Ops team member is responsible for that role. They aren’t necessarily responsible for all of the work but they are responsible for making sure the work is communicated & making sure it gets done.
Another approach is to assign Ops team members to individual projects. As projects arise, team members start to attend those meetings & start to get involved with any stand-ups and work around that project. I don’t like this approach as much because it relies on the Dev teams reaching out and saying “Ok, we’re ready for an Ops person now” most of the time – and that often happens late. Having Ops members already in position inside teams gives you much earlier warning and helps shape the end result much earlier.
Tracking & Communicating work
Now that everyone is working on their own projects, there will be a tendency to communicate that work less often & less completely. It takes some work to avoid this but it’s actually not all that hard. The important part is that each team member talks about what they worked on each day at stand-up & is clear about their priorities during planning sessions. How you achieve this is up to you, but I’ll throw out some ideas.
Kanban works well as a visualization tool for work in progress. From an Ops perspective, I think that’s where the role of Kanban starts and ends. Operations is an inherently interrupt-driven team. Many organizations get out of that mode through lots of practice, but if you are at that point you probably don’t need my help in tracking & communicating work. Where I have seen Kanban work really well is in prioritizing work during planning (abc must come before xyz, move the card) and in visually showing what you did, what you are doing, and what you will be doing next.
Daily stand-ups are really, really helpful. Things change day to day in Ops teams and taking 10 minutes each morning to get everyone in sync with what’s going on is a huge help. Identifying blocks and talking about how to clear those is a big part of this. When everyone is there talking about things, saying “I’m blocked waiting for xyz” is an opportunity to get that problem solved today.
Documenting proposals in a shared document system like Google Docs is also a massive improvement. I can write up a proposal for something and, instead of me asking for feedback separately, people can add it right to the document: they can make comments, etc. We get together for a 30-60 minute meeting to review the document & the feedback and we take a shot at a final proposal. If there are still open questions we go back and answer those. The key is that much of the work is done asynchronously rather than asking that everyone bring their best, most undistracted thoughts into a meeting.
Lastly, with all of this, there is change. Nobody wants to be stuck in the same role for years – people in Operations want to learn new things, they want an opportunity to take something that needs improvement and leave their mark on it. In every infrastructure there are some cool projects and there are some lame projects. There are also those parts of the system that are just a pain in the ass to maintain & nobody wants to do it. It’s important to rotate these around.
What has worked in my experience is a periodic review of the priorities. You start with a review of work in progress so that folks know what they are signing up for if they want to tackle an area they aren’t working in today. Then you wipe the slate clean & go functional area by functional area asking who wants to be involved.
The trick with this process is to try to allow folks who have projects in flight to maintain that responsibility while giving someone else a shot at learning about the system. This is where the P1/P2 roles can really be leveraged. If you are re-building your network and you really need the same person to maintain their momentum in that project, they become the P2, continuing that work. You assign a new P1 (if someone new wants to be involved) and you have them tackle the day to day interrupts. The two members work together on it: the new owner gets to learn while the previous owner gets to finish their project.
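The hand-off described here is mechanical enough to write down. This is just a sketch under my own assumptions: `Ownership` and `rotate_in` are hypothetical names, the people are invented, and it assumes the person with the project in flight is the current P1 (whoever was P2 before simply rotates off).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ownership:
    p1: str                    # first point of contact, handles interrupts
    p2: Optional[str] = None   # expert / person finishing in-flight work


def rotate_in(owner: Ownership, new_p1: str) -> Ownership:
    """Make new_p1 the first point of contact; the previous P1 stays on as P2."""
    if new_p1 == owner.p1:
        return owner  # nothing to rotate
    return Ownership(p1=new_p1, p2=owner.p1)


# Hypothetical example: "erin" is partway through a network rebuild.
network = Ownership(p1="erin", p2="frank")
network = rotate_in(network, "grace")
```

After the rotation, “grace” takes the day to day interrupts as P1 and “erin” stays on as P2 to finish the rebuild. In real life you would agree on this in the review meeting rather than in code, but writing the rule down makes it harder to forget who moved where.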
If a functional area has no work in progress and you really want to move something new forward there, find the person who’s passionate about making that change and make them a P1. Find a P2 that can help enable them and let them go for it.
Ownership is an important part of any job and in Operations it has been the light that keeps me coming back. Giving that ability to every member of your team is important, and hopefully this gives you some ideas about how to do that.