Good & Bad patterns in Development and Operations

As part of my role at a new company I’ve been asked to provide feedback about structuring Dev & Ops as well as what sorts of things work and don’t. I certainly don’t claim to have all the answers, but I’ve seen some very functional and some very dysfunctional organizations. I’ve spent a fair amount of time thinking about what works & why.

Below is a cleaned-up version of a message I sent to our CEO, who asked for my thoughts on what does and doesn’t work. It was intended as scaffolding for further discussion, so I didn’t go into deep detail. If you want more on any particular area, just throw some comments out there.

I realize not all these issues are black & white to many folks – there are gray areas. My goal with this message was to drive conversation.

I figure this is probably review to many folks, but maybe it’ll help someone.

First, there are some very simple goals that all these bullets drive toward & they’re somewhat exclusive to SaaS companies:

Customers should continuously receive value from Developers as code is incrementally pushed out
Developers should get early feedback from customers on changes by enabling features for customers to test
We can address problems for customers very quickly – often in a matter of hours
We can inspect and understand customer behavior very deeply, gathering exceptional detail about how they use the service.
We can swap out components & substantially change the underlying software without the customer knowing (if we do it right)
We can measure how happy customers are with changes as we make them based on behavior & feedback

The lists below are what I feel makes that possible (Good) and what inhibits it (Bad).

Culture & Communication

Good:

Stand ups.
Retrospectives
Small, self-formed teams (Let folks work on their area of passion)
Use Information Radiators whenever possible (Kanban boards, stats on big monitors, etc)
Decisions by teams, Leaders facilitate consensus
Discovering what doesn’t work is part of finding the right solution, not something to fear.
Hackathons allow Developers to do things they are passionate about
Hire for personality & team fit first, technical ability second
Data driven decisions, strive to have facts to back up decisions.
Make the right behavior the easiest thing to do – build a low resistance path to doing the right thing.

Bad:

Top down decision making
Strict role assignments & Silos
Fear of not getting it right the first time
Hiring for technical ability thinking team fit will come later
Creating process out of fear that makes it difficult to do the right thing.

Eliminate Manual Processes

Good:

Continuous Deployment / Delivery
Fully automated testing
Test Driven Development
Fully automated system monitoring, configuration & provisioning
Separate Deploy & Release (Feature toggles)
Deploy from master, do not branch (Forces particular behaviors)

Bad:

Manual testing by a QA Team – sometimes it’s necessary, but should be avoided
Deploying off a branch, slows things down & allows for other bad behaviors
Writing tests after writing code, code isn’t written with testing in mind
Developers relying on other teams to perform tasks that could be automated.
Processes that are the result of fear rather than necessary business process.

If it moves, measure it

Good:

Collect high resolution metrics about everything you possibly can
Developers can add new metrics by pushing new code, do not rely on additional configuration by other teams.
Graphs & metrics can be seen by anyone – Developers should rely on these.
There should be individuals or teams who are passionate about data visualization & analysis.
Dev teams rely on these metrics to make decisions, help identify what metrics are important
Developers watch metrics after pushing new code, watch for trend changes (Devs take responsibility for availability)

Bad:

Operations has to configure new metrics after developers have added support for them (Manual)
Operations monitors metrics & asks Dev teams when they think there’s a problem
Developers don’t look at metrics unless something is brought to their attention
Code doesn’t expose metrics until someone else asks for it

And here is the long version of all of that…

#1 Culture & Communication

Above all else I consider these most important. I think most problems in other areas of the business can be overcome if you do well in these areas. Rally has been, by far, the best example of a very successful model that I’ve seen in this area. They aren’t unique – there are other companies with similar models & similar successes.

Main points

Stand ups. By far the most effective tool for keeping everyone in touch. As teams grow you have to break them apart, so you have a 2nd standup where teams can bring cross-team items to share.
Projects are tackled by relatively small, typically self-formed teams. Get individuals who are interested in working in an area together & they feed on each other’s passion.
Perform retrospectives. This gives individuals & small groups the ability to voice concerns in a way that fosters resolution. There’s an art to facilitating this but it works well when done right. It also allows recognition of things that are done well.
Use open information radiators – it should be easy to see what’s going on by looking at status somewhere vs. having to ask for status, go to meetings, etc. Kanban boards are great for this.
Leaders exist to facilitate and help drive consensus but decisions are largely made by teams, not leaders. This makes being a leader harder, but it makes the teams more empowered.
Accept that things may not work & the team and company will adjust when things do not work. This makes it easy to try new things & easy for people to vocalize when they think it isn’t working. If it’s hard to change process then people are more resistant to try new things. This goes back to retrospectives for keeping things in check. Also important in this are “spikes” or time boxed efforts explicitly designed to explore possibilities.
Give developers time to pursue their own projects for the company. Many awesome features have come out of Hackathons where developers spent their own time to build something they were passionate about.
Hire for personality fit first. I have seen many awesome people find a special niche in a company because they grew into a role that you couldn’t hire for – but what made that possible was that they worked well with the team as an individual. Hiring purely for a technical skill also means you lose that skill when that person leaves; I would prefer to have cross-functional teams.
Data driven decisions. This helps keep emotion and “I think xyz” out of the discussion & focuses on the data we do and do not have. If we don’t have data we either get more or acknowledge we may not be making the right decision but we’re going to move forward.
Make the right thing the easiest thing. I’ve seen too many companies put process out there that makes the “right thing” really difficult, so it gets bypassed. The right thing should be an express train to done – very little resistance and very easy to do. It’s when you start wanting to do things differently that it should become harder, more painful.

Also, everyone owns the quality of the service. This includes availability, performance, user experience, cost to deliver, etc. At my last company, there was exceptional collaboration between Operations, Engineering and Product (and across engineering teams) on all aspects of the service and there was a strong culture of shared ownership & very little finger pointing.

If you want more details on this specific to Rally I wrote a blog post with some more info: Blog Post

#2 Obsessively eliminate manual process – let computers do what they are good at.

This is so much easier to do up front. There should be as little manual process as possible standing between a developer adding value for customers (writing code) and that code getting into production. There may be business process that controls when that feature is enabled for customers – but the act of deploying & testing that code should not be blocked by manual process. I refer to this as separating “Deploy” from “Release” – those are two very different things.
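To make the Deploy vs. Release split concrete, here is a minimal Python sketch of a feature toggle. Everything in it (the FeatureToggles class, the "new_dashboard" feature name, the customer IDs) is hypothetical and only for illustration; a real system would normally keep toggle state in configuration or a dedicated service rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FeatureToggles:
    """Hypothetical in-memory toggle store (illustration only)."""

    # feature name -> set of customer IDs the feature is released to
    enabled_for: Dict[str, Set[str]] = field(default_factory=dict)

    def release(self, feature: str, customer_id: str) -> None:
        """Business decision: turn an already-deployed feature on for one customer."""
        self.enabled_for.setdefault(feature, set()).add(customer_id)

    def is_enabled(self, feature: str, customer_id: str) -> bool:
        return customer_id in self.enabled_for.get(feature, set())


def render_dashboard(toggles: FeatureToggles, customer_id: str) -> str:
    # Both code paths are deployed; the toggle decides which one a customer sees.
    if toggles.is_enabled("new_dashboard", customer_id):
        return "new dashboard"
    return "old dashboard"
```

The new code path can sit in production, fully tested, for as long as the business wants. Calling release() for a customer is the "Release" step and needs no deployment.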

Testing should only be manual to invalidate assumptions; validating assumptions should be automatic. When we assume that if x is true then y will occur, there should be a test to validate that this is true. Testers should not manually validate these sorts of things unless there is just no way to automate them (rare). Testers are valuable for invalidating assumptions: they should be looking at the assumptions made by Developers and helping identify those that may not always be correct.
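As a small illustration of "if x is true then y will occur" captured as an automated test, here is a sketch using Python's standard unittest module. The shipping_cost function and its 50.00 threshold are made up for the example.

```python
import unittest


def shipping_cost(order_total: float) -> float:
    """Hypothetical rule: orders of 50.00 or more ship free, otherwise 5.00."""
    return 0.0 if order_total >= 50.0 else 5.0


class ShippingAssumptions(unittest.TestCase):
    """Each test pins down one assumption so a machine re-checks it on every commit."""

    def test_orders_at_or_over_threshold_ship_free(self):
        self.assertEqual(shipping_cost(50.0), 0.0)
        self.assertEqual(shipping_cost(120.0), 0.0)

    def test_orders_under_threshold_pay_shipping(self):
        self.assertEqual(shipping_cost(49.99), 5.0)


if __name__ == "__main__":
    unittest.main()
```

The tests validate the assumptions the developer already knew about. A tester adds the most value by asking what the developer did not consider, for example a negative total or a partial refund that drops an order below the threshold, and each of those questions then becomes another automated test.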

Too many organizations rely on manual testing because it’s “easier”, but it has some serious drawbacks:

You can only change your system as fast as your team can manually test it – which is very slow.
Your testing is done by humans who make mistakes and don’t behave predictably so you get inaccurate results.
The # of tests will only grow over time, requiring either more humans or more time, or both. It doesn’t scale.

Over time software quality drops, testing takes longer, and the test results become less reliable. This is a death spiral for many companies, who eventually find it very hard to make changes due to fear & low confidence in testing.

Avoiding this requires that developers spend more time up front writing automated tests. This means developers might spend 60-70% of their time developing tests vs. writing code – this is the cost of doing business if you want to produce high quality software.

That may seem excessive, but the payoff is significant:

Much higher code quality that stays high (the tests are always run, so regressions, i.e. re-introduced bugs, get caught)
Faster developer onboarding: the tests describe how the code should behave and act as documentation.
Refactoring code becomes easier because you know the tests describe what it should do.
Each commit to the codebase is fully tested, allowing nearly immediate deployment to production if done right.
Problems that make it into production feed back into more tests & continually improve code quality.

Much of the time developing tests is spent thinking about how to solve the problem, but you are also writing code with the intent of making it testable. Code is often written differently when the developer knows tests need to pass vs. someone manually testing it. It’s much harder to come along later and write tests for existing code.

You will hear me talk about Continuous Deployment & Continuous Integration – I feel these practices are extremely important to driving the above “good” behaviors. If you strive for Continuous Deployment then everything else falls into place without much disagreement because it has to be that way. This has a lot of benefits beyond what’s listed above:

Value can be delivered to customers in days or hours instead of weeks or months
Developers can get immediate feedback on their change in production
New features can be tuned & tweaked while they are fresh in a developer’s mind
You can focus on making it fast to resolve defects, no matter how unpredictable they are, rather than trying to predict all the ways things might go wrong.
Most of the tools and behaviors that enable Continuous Deployment scale to very large teams & very frequent deployments. Amazon is a prime example of this, deploying something, somewhere, about every 11 seconds. Many companies that are in the 30-100 engineer size talk about deploying tens of times per day.
This also impacts how you hire QA/Testers. This is a longer discussion, but you want to hire folks who can help during the test planning phase & can help Developers write better tests. Ideally your testers are also developers & work in a way that’s similar to Operations, helping your Developers to be better at their jobs.

#3 If it moves, measure it

As I mentioned above, two big advantages a SaaS organization has are the amount it can learn about how customers use the product & the ability to change things rapidly. Both of these require obsessive measurement of everything that is going on so that you know if things are better or worse. Some of these metrics are about user behavior & experience to understand how the service is being used. Other metrics are about system performance & behavior.

The ability to expose some % of your customer base to a new feature & measure their response to it is huge. Plenty of companies have perfected the art of A/B testing, but at the heart of it is the ability to measure behavior. Similar to testing, the software has to be built in a way which allows this behavior to be measured.
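One common way to expose a fixed percentage of customers to a feature is to hash the customer ID into a stable bucket. The Python sketch below shows that idea; the function names and the feature/customer identifiers are hypothetical, and this is just one possible approach rather than how any particular company does it.

```python
import hashlib


def rollout_bucket(feature: str, customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(feature: str, customer_id: str, percent: int) -> bool:
    """True if this customer falls inside the first `percent` buckets."""
    return rollout_bucket(feature, customer_id) < percent
```

Because the bucket depends only on the feature name and the customer ID, a customer sees the same variant on every request, and raising the percentage only adds customers without reshuffling the ones already exposed. Measuring the effect still requires recording which variant each customer saw alongside the behavior metrics.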

System performance similarly requires a lot of instrumentation to identify changes in trends, to identify problem areas in the application & to verify when changes actually improve the situation.
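Here is a minimal sketch of what "the code exposes its own metrics" can look like in Python. The TIMINGS and COUNTERS registries and the handle_request function are hypothetical stand-ins; in practice the measurements would be shipped to whatever metrics system you run instead of being kept in process memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, List

# Hypothetical in-process registries (illustration only).
TIMINGS: DefaultDict[str, List[float]] = defaultdict(list)  # name -> durations in seconds
COUNTERS: DefaultDict[str, int] = defaultdict(int)          # name -> event count


@contextmanager
def timed(name: str):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[name].append(time.perf_counter() - start)
        COUNTERS[name + ".calls"] += 1


def handle_request() -> str:
    # Hypothetical request handler: the metric ships with the code that emits it.
    with timed("handle_request"):
        return "ok"
```

The point of the design is that adding a metric is a code change the developer makes alone: wrapping a block in timed() is all it takes, with no configuration step owned by another team.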

I’ve been at too many companies where they simply had no idea how the system was performing today compared to last week to understand if things were better or worse. At my last company I saw a much more mature approach to this measurement which worked pretty well, but it required investment. They had two people fully dedicated to performance analysis & customer behavior analysis.
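Answering "is it better or worse than last week?" can start very simply once the measurements exist. The sketch below compares 95th percentile latency between two sets of samples using Python's standard statistics module; the function names are hypothetical and the choice of p95 is an assumption, not something prescribed above.

```python
import statistics
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile of a sample; needs at least two data points."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


def relative_change(last_week: Sequence[float], this_week: Sequence[float]) -> float:
    """Fractional change in p95 latency; positive means this week is slower."""
    before = p95(last_week)
    after = p95(this_week)
    return (after - before) / before
```

A result of 0.2 would mean the slowest 5% of requests got roughly 20% slower week over week, which is the kind of trend change a developer should be able to see without anyone having to ask.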

Operation Bootstrap

Web Operations, Culture, Security & Startups.
