You can watch the discussion here:
The general idea in this discussion was that while we can optimize development workflows to improve velocity for those teams, work often gets held up downstream of development during testing or deployment into production. We wanted to focus on what can be done to improve downstream throughput.
In general the topics covered were as follows:
* Culture and Team Structure
* Value Stream mapping
* Optimize inter-team handoffs
* Automate
* Version & share binaries, environments and processes
* Blameless post-mortems
As an industry we’ve become really good at making things faster & more iterative but we haven’t yet figured out how to make light move faster. Most work toward improving the speed of a thing involves eliminating waste – improving efficiency & effectiveness. You’ll probably hear this theme over and over in my responses below but the simplest way to improve velocity through a series of steps is to eliminate steps & let computers do the ones you cannot eliminate. It’s not always possible – but should always be considered a possibility.
This, in my view, is the most important aspect of the overall velocity of development. Assuming that we view “development velocity” as the time it takes from an idea being prioritized so that someone can spend time on it, to the time a customer can touch it, then we have to consider the entire delivery pipeline when we evaluate our velocity.
If we look at an entire delivery pipeline and all the handoffs that might occur, there are a lot of opportunities for waste. We could focus on making handoffs more efficient or we could focus on how we automate them – which often mostly eliminates the traditional idea of a handoff. This is usually a core objective of Continuous Deployment.
The most effective and high velocity team structures I’ve seen are those where developers have the ability to make changes that flow through a fully automated pipeline to production. Typically putting something like this in place requires that the development team be a cross-functional team with development, testing, product, operations and leadership expertise. The more of those functions you remove from the immediate team, the less efficient the team becomes. Whether that means they have lower velocity or are less effective depends on the organization – but there’s a good chance those traits are impacted as well.
When a development team includes representatives from the areas of the business it needs to interact with, and has the ability to rapidly experiment with changes to its software in a production environment, its ability to make accurate and timely decisions about product changes is vastly improved.
Watching the video you’ll see that I don’t spend much time on this topic because I’ve never really done much of it. I have, however, listened to others talk about how effective it can be at bringing groups together to talk about and understand what happens at each stage of a delivery pipeline. Again, I refer to a “delivery pipeline” as the entire process of getting some idea into customer hands. If you are a software company, your delivery pipeline includes nearly your entire organization. It is very rare for a single person to have a complete and accurate view of all the steps that occur in this process – so bringing folks together to document the current state of things can bring about some surprising results.
Have you ever watched the making of Jelly Belly beans and thought “Holy Cow, I had no idea there were so many steps in making a Jelly Bean!”? You might be surprised how little you know about your own processes. You might also be surprised to find out who in your organization is crucial to those processes completing correctly & on time. Are those folks doing ok?
Watch Jelly Bellys get made!
As I mentioned above, my first approach to this problem is to eliminate the handoff, but this isn’t always possible. When it isn’t possible, the above value stream mapping exercise can help you understand what opportunities there are for automation. Often the most time consuming part of any handoff between groups is discussion & establishing trust. If I am receiving a thing from a group I trust a lot & I have received things from them in the past with great success, I’m probably not going to require a lot of discussion. On the other hand, if everything that group has handed off to me has erupted into a tire fire, then I will probably have some questions. Worse, my distrust may be based on nothing more than my own prejudice – making it a difficult situation to correct.
Trust is a fickle thing for a company because it’s usually between two individuals. There can be trust between groups, but that trust is often contingent on each group’s membership – as members change then so too does the trust level. Humans are also imperfect – they forget, they have bad days, and sometimes they just don’t show up. For this reason it’s more effective to automate any of these decision points you can.
When you automate a handoff you could be simply documenting a set of possible outcomes and the conditions for each – like a protocol for handling things. This allows more autonomy and flexibility in handling the handoff – if the documented expectations are met – a set of steps can be followed to complete an action. You could also be adding some technical automation by having computers make some of the decisions for you & maybe do some of the work (if they aren’t already).
When a development team implements continuous delivery – they are essentially automating the handoff to Operations by saying “We’ve tested & met this set of expectations – this software will work if you press this button and deploy it”. When a team implements Continuous Deployment they’re taking it one step further and saying “We are taking more responsibility for the software and largely eliminating the handoff – we’ll both watch for problems and work together to resolve them”.
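To make this concrete, here’s a minimal sketch of what an automated handoff “protocol” could look like – the expectations two teams have agreed on are encoded as checks, and the handoff only proceeds when all of them pass. The names and checks below are hypothetical, not taken from any real pipeline.

```python
# Hypothetical handoff "protocol": documented expectations become checks,
# and the handoff proceeds automatically when every one of them is met.
from dataclasses import dataclass

@dataclass
class Candidate:
    version: str
    tests_passed: bool
    artifact_published: bool
    changelog_present: bool

def handoff_accepted(candidate: Candidate) -> bool:
    """Return True when the documented expectations for the handoff are met."""
    expectations = [
        candidate.tests_passed,        # automated suite is green
        candidate.artifact_published,  # versioned binary exists in the shared repository
        candidate.changelog_present,   # the receiving team can see what changed
    ]
    return all(expectations)

candidate = Candidate("1.4.2", tests_passed=True, artifact_published=True, changelog_present=True)
if handoff_accepted(candidate):
    print(f"{candidate.version}: expectations met, no meeting required")
else:
    print(f"{candidate.version}: expectations not met, talk to a human")
```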
Automation – “the use of largely automatic equipment in a system of manufacturing or other production process”
Automation takes that process you repeat over and over and allows some computer/mechanical processor to handle it. You can automate decision making (automated test suites/continuous integration) and you can automate actions (automated provisioning/deployment). I’m a super duper big fan of automation – it’s much of what people think I get paid to do. But the act of automating something you understand is not the hard part – or even really the interesting part to me anymore. The interesting part is understanding where you can apply automation, and then understanding how to tweak things here and there to make it possible to automate them.
An example of this is Feature Toggles. I can implement the automation to deploy the software a team builds on every commit. You push a commit and like magic – computers will whisk your code on out to production… where it will promptly provide a fine example of why “Continuous Deployment is BAD”. Outages, dying kittens, it all happens and then “Stop! We need a change control process!”. The issue isn’t that automating software deployment is hard, the issue is that making the software deployment automatable in a way that allows a high degree of success is hard.
Feature Toggles (Flags, Flippers, Switches, whatever) allow you to make changes to software while maintaining existing customer-facing behavior. This means that you can try some new thing in a Continuous Deployment environment in a way where customers should never be impacted. Yes, there are caveats – but it’s beside my point. The capability to automate the software deployment is contingent on a capability of the software, not the deployment automation.
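Here’s a minimal sketch of the idea, with function names invented for the example: the new code path ships to production with every deploy, but customers keep getting the existing behavior until the toggle is flipped – deploying the change and releasing it stay separate decisions.

```python
# Minimal feature-toggle sketch. The toggle and functions are illustrative only.
FEATURE_TOGGLES = {
    "new_checkout_flow": False,  # deployed, but not yet released to customers
}

def is_enabled(name: str) -> bool:
    return FEATURE_TOGGLES.get(name, False)

def legacy_checkout(cart):
    return {"flow": "legacy", "items": len(cart)}

def new_checkout(cart):
    return {"flow": "new", "items": len(cart)}

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)    # the change being evaluated
    return legacy_checkout(cart)     # existing customer-facing behavior

print(checkout(["widget", "gadget"]))  # legacy behavior until the toggle flips
```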
There’s a similar dependency on automated testing, not because test quality is required – it’s required either way – but because testing must be automated in order for Continuous Deployment to even be possible. If I have to wait for Jimmy to finish his testing before my commit can go out to production – then by definition that’s not “Continuous Deployment”.
So automation is important – but not always in the way people think. Folks think they’ll just hire a DevOp to DevOp the heck out of this problem and automate it all away – but unless you also have a dev team that has built software capable of being deployed in this way, it’s not gonna happen.
This is a point I bring up during the discussion and I wanted to touch on it. If you haven’t, you should go and watch Jeff Hackert’s talk (below) on building humane systems. I see this failure so often and it’s truly avoidable if we treat our automation systems like we treat our products – building them with an understanding that our customers are humans, and they have feelings & experiences that differ from ours, and getting their feedback is good. I can’t add much to the talk – watch it.
This is perhaps more obvious to some than others – this is table stakes for building a modern development workflow. Very often the “environments” and “processes” pieces are harder than others – especially when different groups handle the production environment vs. the development environment. If you can make them the same – do it – but I have yet to be somewhere that can do this 100%.
My preference is to focus on the production environment. This is more meaningful in a SaaS environment than perhaps someone building desktop software, but I think it’s legitimate to consider either way. The effort involved in completely replicating your production environment – the traffic, the processes, the systems, the network, everything – is… so. much. work. I would argue that in many (most?) cases you can put that same degree of effort (or less) into making it possible to evaluate software changes in production. Further, by doing so you take advantage of all the organic variations which are experienced in a production environment.
Much has been written about doing this – for your googling pleasure take a look at the idea of a “dark launch”. The basic premise is exposing a system to production traffic in a way which doesn’t put customers at risk. The process is too involved to discuss here – but it’s a way to leverage the existing environment and traffic you have to evaluate changes in a way that provides better realism than any test environment I’ve seen while still allowing experimentation.
Evaluating changes in a production environment means you are using the same process, the same systems, the same network & ideally the same traffic patterns to evaluate the results of change. And as those things evolve, so too does your testing environment.
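As a rough illustration of the premise (everything below is made up for the example), the customer always gets the response from the existing code path while the same request is also run through the new system, so its behavior can be observed without putting anyone at risk:

```python
# Bare-bones "dark launch" sketch: production traffic exercises the new path,
# but only the existing path's response is ever returned to the customer.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dark-launch")

def current_search(query):
    return ["result-a", "result-b"]

def new_search(query):
    return ["result-a", "result-b", "result-c"]

def handle_request(query):
    response = current_search(query)        # what the customer actually sees
    try:
        shadow = new_search(query)           # same traffic, new system
        if shadow != response:
            log.info("dark launch mismatch for %r: %s vs %s", query, response, shadow)
    except Exception:
        log.exception("new path failed; customer unaffected")
    return response

print(handle_request("devops"))
```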
This is a big one for me. I prefer retrospectives, not everyone does, but I like to look back on the good and bad to learn from the past on a regular basis. There are times when an event was clearly bad, but there are usually positive things that happened that you can acknowledge and reinforce during a retrospective.
That said, the “blameless” part is super important. If you want to bother to look back on what went wrong and what didn’t then it’s worthwhile to invest in making sure everyone is honest and open. This is hard because different people have different tolerance for criticism – so we minimize it. I’ve written in the past about this as have others – I’d suggest further reading if you aren’t familiar with the idea.
This impacts velocity by identifying areas where improvements can be made & identifying where there is friction today. If you aren’t looking at the past then it’s much more difficult to improve the future.
I care about these topics a lot and I always look forward to chatting with others about their different experiences. If you want to talk about any of this – just give me a shout.
insert over-used and possibly inaccurate Einstein quote here
On the other side of the coin – I’ve worked for companies where stuff worked pretty OK. I’ve participated in what I thought was a reasonably good version of a high functioning team. I left that company because I got bored, or just didn’t feel like the company itself was giving me the opportunity to do what I wanted. What I wanted was to grow a team into that high functioning place.
So I recently found myself frustrated that my attempts to change a team weren’t meeting my expectations. Unlike the me of a number of years ago, I’ve become a bit more introspective about things and so I started asking myself some questions:
Item #1 is that I have expectations that have been established on assumptions not in evidence. This is much the same as being frustrated that when you let go of a bowling ball it falls to the ground – but you expected it to go up. Have you ever observed an object fall up? Why would you expect it? Wanting it, thinking it’s so obvious that it should, does not change physics. Few folks get all bent out of shape when it rains – you can’t control it, you accept it. So it is with other people – they cannot be controlled, they have to be understood and accepted, and there’s no real value in getting frustrated.
Teams do change, and there are many stories of teams changing. What is often missing from those stories is some perspective on the effort, time, and patience required to get to that end. This is usually measured in years unless there’s dramatic (and usually very disruptive) action. So, patience.
Item #2 is why I’m here in the first place, this is what I want. I could go work for some company that already has their shit together but what fun is that? Nope, I need to learn how to find joy in this journey and how to become successful at it because this is what drives me. I need only look at my selection of books on Amazon for evidence of where my interests lean. Improving my ability to do this means being in an environment where things aren’t right & practicing. When you want to improve your code you find a problem to solve – same goes here.
Again, patience. Whether I like to believe it or not, I’m changing & learning as much as the rest of the team. By adjusting my behavior, others adjust theirs – we learn and adapt to each other, we build rapport. As this happens it unlocks new opportunities and makes possible things that weren’t before.
Item #3 is my fundamental belief that all people want to do good. With few exceptions, people are driven by the same desires I am – to do well & succeed at their goals. When we observe people who aren’t doing what we expect them to do, we have to ask why. Sometimes they don’t even know why, so we have to allow them to show us why. Sometimes what we expect isn’t correct – so we have to be willing to learn.
This is also a core tenet of how I believe this process can work. There’s another blog post brewing about building a “pit of success” – about making the right thing the easiest thing. Doing this requires setting aside your belief about what works & observing and learning from behavior of others. It requires letting things go wrong & then asking how it can be improved.
Item #4 is my reality check. I can absolutely get passionate, maybe even dogmatic, about how I think things should happen. Given enough resistance I can turn into a righteous asshole. Do you listen to people like that? I don’t. I’m not doing anyone any favors if I’m not listening, learning, and asking how I can help people do well.
I have to sit down with people and ask how I’m doing. I have to check in and make sure I’m not destroying relationships. It shouldn’t be necessary to create adversaries in this process – if we are listening & learning we should be finding common ground. This takes effort, but it’s worth it.
All of this gave me some perspective to look back over the last 6-8 months and what progress has been made. Actually, there’s been a lot. There are definitely some problems, but there’s now evidence that the play dough actually does move – it’s not rigid. The process might not be as fast or as direct as I expect, and it might not even look like what I expect when we get there – but there is movement. I need to realize that it is this movement & the positive results of it that are my journey and when we reach our destination, so too does my journey end and I go looking for another tribe to walk with.
No sense in rushing to that point.
So what’s new?
* --kill and --restart options to ease stopping spoon containers without destroying them or having to use the docker CLI.

The release has been tested against Docker 1.4 and is known not to work with Docker 1.2.
There are a bunch more reasons you might like docker-spoon, those are just a few.
This tool grew out of a desire to streamline developer workstation setup but it went even further to the point where developer environments are disposable, can be provisioned in seconds & allow local and remote pairing at any time. Are there other ways to do this? Yep. I’ve seen this tool used successfully by entire dev teams at two companies now – it’s new to you but it’s been around a bit.
The idea behind docker-spoon is pretty basic: you create a docker image with all your developer needs built into it, then you run spoon <name>, where name is the name you assign to a new spoon instance running in Docker. Spoon does all the heavy lifting – currently including:
So after installing spoon and creating a config, the time from running spoon to being at a command prompt in your new dev environment is just a few seconds. How easy is it? Demo time!
Video can be viewed directly here
There are some features not shown in this demo which already exist:
That demo used the example docker container included in the docker-spoon repository. If you want to try it out quickly just follow the directions in the docker-spoon repo.
So here’s the thing, there are some caveats – all this awesomeness doesn’t come without some conditions. Spoon takes advantage of the idea that working in a terminal with tmux is low latency & easily shared. There is some work being done to use VNC inside spoon to allow for the use of GUI apps but that’s not the optimal use case.
If you find docker-spoon useful let me know. If you want to see something different, submit a pull request or shoot me a note. The usage should be thoroughly documented in the README.
We’ve come a long way as a community with tools like Github, making it easier and easier to get an OSS project out in the wild and accept contributions. As we’ve done this though, projects seem to support users with varying degrees of success. Some projects flourish, graciously accepting community contributions. Other projects do quite well accepting feedback but largely having maintainers drive the direction of the project. Still others fall to extremes of accepting every pull request that comes along with reckless abandon or apathetically allowing the project to rot and become another abandoned github relic.
Along with this, more and more companies are building their businesses on OSS. Not just a piece here and there, but entire complex ecosystems of software maintained by all kinds of different folks. Each of these businesses has its own unique quirks, personalities, and constraints.
I’ve worked for a few of these businesses and I know that those constraints and those personalities, they aren’t easy for me to change. I have to work with what I have and try to find tools that do what I need. When I find a tool that’s 90% of what I need and I’m willing to put in the effort to push it across the line and make it work for me I get to come face to face with the maintainers of that project.
I’d like to say this always goes smoothly, that I’m always able to express my use case in a way that folks understand, and that my use case always falls in line with the intent of the project. I’d love if my skillset in a particular language was always up to par, and that my ability to write quality tests was always appropriate. I’d love to always understand the roadmap for the project & the maintainers’ expectations around how I should contribute. This just ain’t so.
As much as I may feel like I don’t understand a particular project well enough, sometimes it feels like the maintainers don’t understand me. What’s worse is when I interpret (accurately or not) an attitude from maintainers that I “just don’t get it” rather than making an attempt to help me understand.
This sounds really familiar to me. This sounds like that whole observation the world made that Dev & Ops need to work more closely. The observation that the people who build the software need to interact more closely with the people who use the software. That building empathy and a collaborative environment where everyone can communicate and be involved will help get us past our differences. For me, this problem isn’t just with the developers in my company. This problem extends to the developers who build the systems I use every day and most of those developers don’t work for my company.
Just as we have to invest in building relationships within a company, when you use OSS you have to invest in building a relationship with those you work with outside your company as well. This goes both ways, and the more effort maintainers put into understanding users, the better this is for everyone. This isn’t easy – so if you aren’t a maintainer and you’re reading this, understand that very often the interactions you have with maintainers are done outside their full-time job, outside their ordinary deadlines and pressures, and they happen because they enjoy working on a project. If folks are like me, when something stops being enjoyable for too long, I stop doing it. Keep them in mind when you have a complaint.
Further, for a popular project the ratio of users to maintainers is pretty imbalanced and not in the favor of the maintainers. They have a hard job and for those that do that job with grace and empathy for their users, my hat is off to you.
So when we talk about DevOps and Empathy and all these great ideas to make your company work better – don’t stop there. Think about all those projects that make your job exist, and all those users that make your project a success. Try to take a little time and understand each other and work together to make more awesome.
kthxbye
If you have thoughts on this – I have some questions:
Keep in mind, I’m not talking about just deploying code, I’m talking about creating an entire dev workflow pipeline from desktop to prod – automatically. The assumption is that if you aren’t doing Continuous Deployment, you are at least doing Continuous Delivery. This probably looks something like a private PaaS at the end of the day – but automation that extends beyond just spinning up machines, it extends to CI, monitoring, everything.
If this story has been told in a blog post you have, or if you just copied what someone else wrote in their blog post, point me that way, but I want to hear about YOUR experience implementing it.
See, I am fairly certain that the answer to this question changes depending on the makeup of the group that built it. Further, I suspect that without the necessary increments – any group will build it wrong (like any software). So for a group that’s been down that road, I’m more interested in the journey than the destination.
Reply via email, in the comments, add links, use the twitters, whatever. I just want to learn – this is not the start of a debate.
In tandem with this I’ve watched (and written) countless discussions about what DevOps means. I’ve heard countless definitions of what people think DevOps means which differ from my own definition. I’ve watched organizations create entirely new teams centered around what they understand DevOps to be. The typical charter of these teams centers around working with developers, yes, but also around automation & tooling.
In this process the word/title/team name of “DevOps” has become synonymous with “Operationally focused Development”… or Ops folks who code (sometimes). To me, this is just Operations and Development, but I’m an open minded guy and this post isn’t really about that – if doing this is so different from what you believe Operations is that you need to call it something different, so be it.
For organizations which I observe to be actually embracing the spirit of DevOps as it was originally intended, I find a few things that seem true:
Why does that last bullet matter? Because giving it a name doesn’t make it so. Actually giving it a name, I think, removes power from the teams to define how things should work. We already have names for this stuff: Collaboration, Communication, Teamwork. When you call it “DevOps” then I start to wonder what you mean, because it must be different from something I already have a name for.
So today I heard a reference to Cargo Cult, decided to lookup this term I’ve used in the past to make sure I was using it correctly, and was struck by how it applies so perfectly to what I see as wrong with the way many folks interpret DevOps.
We’ve seen examples of Cargo Cult in the past. Agile implementations are surely rife with examples of companies implementing a process but not embracing the principles. The world of marketing uses the idea every day to sell you stuff you don’t need:
The result you are trying to reproduce is embodied in something that is a subset of what actually produced it. Taking that subset and dropping it into your life doesn’t give you all the things that produced it, you get an empty shell of the thing.
I love to rock climb and spend a fair amount of time at it. Rock Climbers have some observable characteristics – strong hands & upper body, relatively good balance, maybe less sanity than most folks. Many folks first approach climbing thinking that strength is the primary barrier to improvement. They think that to be a good climber they have to get strong. Then you watch some massive muscle-bound gym rat try to climb and you realize that can’t be right.
The reality is that climbers get strong by climbing. Climbers climb because they love the challenge, and they get better by being persistent and having the mental discipline to overcome doubt and fear. You can’t watch a climber and see passion, fear, doubt and their response to it. You can’t read their thoughts and know that, despite that move looking incredibly easy for them, it required very precise movement and exceptional focus and attention. You may not realize that the reason that particular sequence of movements worked for them was because they are 5’10” and have unusually long arms.
Climbers may enjoy the strength benefits of climbing, but if their objective was to become strong, there are more direct means. They become good climbers because they love something more fundamental about it.
In organizations where DevOps works, it isn’t Developers working with Operations that make it work, it’s people wanting to work with other people and the organization encouraging them to find the right solution together that makes it work. Operations seems to work well with Development in these organizations, an observable outcome of the culture, but reproducing that practice in another company isn’t likely to produce the same results. I’d go so far as to say it’s guaranteed to not produce the expected results.
I wrote a long winded post about what I see as things leading to a functional software development organization. DevOps is not a practice within these things that is singularly important, nor is it a team which is relevant to success. It’s now become a distracting misnomer for a subset of observable traits in successful organizations, few of which contribute to overall success when practiced in isolation. The factors that do contribute to success have been defined for quite a while now, they were defined in Good to Great, they are described in The Phoenix Project, and to a large extent they are at the core of what Agile was intended to be.
If you Cargo Cult DevOps into your organization then you’re just implementing a subset of what successful companies do & you are bound not to see the results you expect, unless your expectations are low.
On the other hand, if your goal is to use it as a hiring tool to clarify to Ops folks that the job you are offering is working on automation, I get it, but wish there was another term for that – because it isn’t DevOps. It’s Operations, or Development, or both. Something we all should be doing anyways.
I’m not really sure this post helps anyone, but it helped me – so thanks for reading.
I mentioned that I don’t typically contribute to product features, preferring to focus in these areas. When asked why that was, my first response was – “I suck at algorithms and I never went to school” – both true but inaccurate explanations for why I got where I am. Neither of those things really holds me back from building or contributing to a product, and neither explains why I’m passionate about what I do.
I’ve never been one to choose to do something because I can’t do what I really want to do. Well, except that if I could jump out of airplanes all day I’d probably do that – totally awesome fun. Instead I rock climb.
I love writing code. I love putting together bits and pieces and building something useful – I’m terrible at finishing coding books because without a practical problem to solve, the incentive just isn’t there. I also get bored pretty quickly. For me, having a stack of things to do is optimal – I move one to the point where I’m blocked and move to something new. I like to focus for periods of time, but then have to switch gears. Occasionally I’ll find something that really gets my attention and it’s like the worst video game addiction – my kids notice, my wife notices, nothing else gets done. That’s rare for me and I am horrible at handling those things other than to just knock them out and be done.
Managing this shifting of priorities & still getting things done is a skill I’ve worked hard to get right. At this point I’m pretty good at it, and struggle working any other way.
Operations has been this for me. In most organizations Ops are the ultimate generalists, both using their own experience to solve problems as well as being adept at engaging other domain experts. We know where the right folks hang out on IRC, we know when to call support and when to just dig in, we learn quickly and are super resourceful. We are exceptionally familiar with the tough problem cycle – initial interest, fear that this is too big, despair that you aren’t going to solve it, the glimmer of hope leading to resolution and awesomeness. I have another post brewing on that specific topic. This isn’t exclusive to Ops, but it’s something you get very used to.
Wait, wasn’t this post about Infracoding?
Yes, the only reason I’m still doing Ops work is because I get to write code. If you were to offer me an Ops job where all I did all day was figure out tough problems for other people to code solutions, I’d tell you to suck it. If you suggest that I can pair with a developer and we can fix it together, I’m happier. If I can get proficient enough to code my own solutions and have other people tell me how to improve what I did, I’d sign up. It’s for this reason that I love the pull request model. I’ve learned from both Sr and Jr team members through PR feedback and it’s a great way for folks to watch and observe what feedback others get.
But along with writing code I get to do other stuff too. Helping to debug strange inconsistencies between monitoring systems – without the ability to code I wouldn’t be able to write the small tools that make testing theories easier. Packaging up and deploying infrastructure, more opportunities to build tooling. Make it easier for Developers to provision their workstations, more coding, more variety. Building networks these days is often pretty limited if you aren’t willing to write code. I know many Ops folks have been doing this for a while, but more and more the % of time an Ops guy spends writing code is increasing, not decreasing. This isn’t universally true, but it’s true of work I am interested in calling Ops.
All along the way, as I get more proficient with coding, as I have become more familiar with patterns you see in production, I can work with developers more closely to help them build better systems. Where developers can definitely get exposed to these same patterns, Ops is pretty directly involved with how things operate in the real world. We get to pull together the resources to make things work and we get to learn in the process. This is the more traditional part of Ops for me, being the guy who helps developers understand the realities of production. But more and more that knowledge gap is closing and Ops are as much an Engineering team as the UI team is. Availability and Agility are features of your product, you have to engineer them in and that requires Developer awareness that is on par with Ops.
Alright Aaron, so what’s your point?
I don’t do feature work because frankly, it seems too focused for me, too specific. Within the Ops role I can go as deep into code as I want but then surface to work on many other things. If I want to go deep on something, there are a bazillion tools out there I can (and sometimes do) contribute to, I can build my own, or not. I get to work with the sharp edges of many tools and find ways to wrap Nerf bits around them. I get to choose when good enough is the enemy of perfection or when perfection is, in fact, the enemy.
I love Ops and every day that role moves closer and closer to awesomeness for me. Although I generally say I don’t know what I want to be when I grow up, I’m pretty sure it’ll look a lot like Ops.
One thing irked me a bit though: I heard a number of comments about how there was too much talk about culture.
I am confused – but I think I understand.
I attended one Open Space session called “Culture Hacks” which ended up not being so much about hacks as a plea for help. A number of folks expressed concerns about being in a difficult situation where they associate their problems with organizational cultural problems and they were looking for ideas on how to make things better. The suggestions that folks raised, myself included, sounded a lot like typical lean/agile tools – retrospectives, stand-ups, putting developers on-call, communicating metrics, providing incentives, hackathons. I think these are all good suggestions, the problem is that they only go so far. The unfortunate reality is that old saying – you can lead a horse to water, but you can’t make him drink. These are all tools but they don’t solve a problem in an organization that doesn’t want to change.
I think the topic of culture is pretty big and scary for many folks. It’s also very subjective, as John Willis raised in his talk – the culture in which a group of crooks thrive is very different than the culture in which a group of hippies would thrive (John didn’t use Hippies in his example). For each group though, there is a specific culture that allows those individuals to reach their goals in an optimal way. Is one culture right or wrong? Functional or dysfunctional? I think the answer is that if the culture is aligned with the team and makes them perform in an optimal way then it’s good – but it’s good for that group, not for everyone.
As such, if you haven’t defined an objective for your culture, if you’ve hired a mish-mash of people with different objectives and principles then you really aren’t going to find a culture that makes everyone achieve their goals. You can make it better for some, but it’ll likely not be better for others. Getting the folks off the bus who do not align with your desired culture is an important part of a change like this and is not something that an Ops person in one group can do. Here, I think, lies one of the main barriers to DevOps being about culture change – it’s driven by Ops people.
Ops working with Dev is great, we can all do that, but we are individuals who can only set an example and hope the organization follows suit. If they don’t – you can vote with your feet or suck it up, I’m not sure there are many other options. Setting an example often manifests as Ops folks trying to get closer to Dev – the most natural way to do this is to write code, help with release engineering, enable developers to gain access to monitoring, logs, config management, etc. All the tools that we say aren’t actually what Devops is about… These seem to me to be a manifestation of ops folks doing what they can to get closer to Dev, that’s all.
Of course it isn’t culture, company culture is bigger than this, but it can change a small part of an organization & help set an example for the broader organization. Sometimes it can have a bigger impact – I love this one by John Allspaw as an example. Still, the change was isolated.
I recently read & really liked this TechCrunch article because I feel like it hits on an important point. If I work at a company where the CEO, CTO, COO, VP of Engineering or a variety of other high level positions push a culture that I don’t like – my chances of changing that are small. Further still, there is already a culture which is somewhat defined by the team you’ve hired. Unless you’ve worked very hard to hire for a specific person & team fit then no amount of effort is going to change the organizational culture.
So what is my point with all of this? While Dev & Ops collaboration is important to having a healthy development process – it is so easily undermined by problems with the larger organization. I don’t think that means you do nothing – I just think it helps clarify why talking about culture is hard. I also think having more actionable steps, more examples to replicate would be helpful. Right now the examples we have are of companies who have a good culture throughout – but what about companies who can’t do that? How do I fix my piece? How do I get my 10 person startup pointed in the right direction? How can I build a functional bubble inside a dysfunctional behemoth? What can I do to make my small corner of the world suck less?
There aren’t enough good examples for these questions – and this is why we need to talk about it. Talking about the little bits and pieces that work for you helps – so don’t avoid talking about it because you don’t have all the answers – share what you can. In that Open Space I attended, plenty of folks had little ideas about things that worked for them – they didn’t have a complete solution, but they had some ideas.
I shared my thoughts about how this may look when you start from the ground up and we’re implementing some of these ideas at my current company. Are they all the right things for this company? Probably not – the team will decide. Some of the things that worked at a past company won’t work here – does that mean we are doomed? I don’t think it does.
Culture is very personal – but so are a lot of things that we have some established patterns for. We are homing in on some ideas that work – we need more ideas – and we need to talk about them more. I hope this comes up more at other Devops Days events in the future. It’s an important topic.
Below is a cleaned up version of a message I sent to our CEO who asked for my thoughts on what does and doesn’t work. This was intended as scaffolding for further discussion so I didn’t go into deep details. If you want more details on any particular area just throw some comments out there.
I realize not all these issues are black & white to many folks – there are gray areas. My goal with this message was to drive conversation.
I figure this is probably review to many folks, but maybe it’ll help someone.
First, there are some very simple goals that all these bullets drive toward & they’re somewhat exclusive to SaaS companies:
The lists below are what I feel make that possible (Good) and what inhibit it (Bad)
And here is the long version of all of that…
Above all else I consider these most important. I think most problems in other areas of the business can be overcome if you do well in these areas. Rally has been, by far, the best example of a very successful model that I’ve seen in this area. They aren’t unique – there are other companies with similar models & similar successes.
Also, everyone owns the quality of the service. This includes availability, performance, user experience, cost to deliver, etc. At my last company, there was exceptional collaboration between Operations, Engineering and Product (and across engineering teams) on all aspects of the service and there was a strong culture of shared ownership & very little finger pointing.
If you want more details on this specific to Rally I wrote a blog post with some more info: Blog Post
This is so much easier to do up front. There should be as little manual process as possible standing between a developer adding value for customers (writing code) and that code getting into production. There may be business process that controls when that feature is enabled for customers – but the act of deploying & testing that code should not be blocked by manual process. I refer to this as separating “Deploy” from “Release” – those are two very different things.
Testing should only be manual to invalidate assumptions; validating assumptions should be automatic. When we assume that if x is true then y will occur, there should be a test to validate that this is true. Testers should not manually validate these sorts of things unless there is just no way to automate them (rare). Testers are valuable to invalidate assumptions. Testers should be looking at the assumptions made by Developers and helping identify those assumptions that may not always be correct.
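As a trivial illustration – the function and the business rule are invented for the example – an assumption like “orders over $100 ship free” can be pinned down by an automated test so nobody has to re-verify it by hand on every release:

```python
# Validating an assumption automatically: "if the order total exceeds $100
# then shipping is free". The rule and function are hypothetical.
def shipping_cost(order_total: float) -> float:
    return 0.0 if order_total > 100 else 7.50

def test_orders_over_100_ship_free():
    assert shipping_cost(150.00) == 0.0

def test_orders_at_or_under_100_pay_shipping():
    assert shipping_cost(100.00) == 7.50

if __name__ == "__main__":
    test_orders_over_100_ship_free()
    test_orders_at_or_under_100_pay_shipping()
    print("assumptions hold")
```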
Too many organizations rely on manual testing because it’s “easier”, but it has some serious drawbacks:
Over time the software quality gets lower, takes longer to test, and the test results become less reliable. This is a death spiral for many companies who eventually find it very hard to make changes due to fear & low confidence in testing.
Avoiding this requires that developers spend more time up front writing automated tests. This means developers might spend 60-70% of their time developing tests vs. writing code – this is the cost of doing business if you want to produce high quality software.
That may seem excessive, but the tradeoffs are significant:
Much of the time developing tests is spent thinking about how to solve the problem, but you are also writing code with the intent of making it testable. Code is often written differently when the developer knows tests need to pass vs. someone manually testing it. It’s much harder to come along later and write tests for existing code.
You will hear me talk about Continuous Deployment & Continuous Integration – I feel these practices are extremely important to driving the above “good” behaviors. If you strive for Continuous Deployment then everything else falls into place without much disagreement because it has to be that way. This has a lot of benefits beyond what’s listed above:
I mentioned above, two big advantages a SaaS organization has are the amount it can learn about how customers use the product & the ability to change things rapidly. Both of these require obsessive measurement of everything that is going on so that you know if things are better or worse. Some of these metrics are about user behavior & experience to understand how the service is being used. Other metrics are about system performance & behavior.
The ability to expose some % of your customer base to a new feature & measure their feedback to that is huge. Plenty of companies have perfected the art of A/B testing but at the heart of it is the ability to measure behavior. Similar to testing, the software has to be built in a way which allows this behavior to be measured.
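A rough sketch of what that can look like, with the bucketing scheme and metric names made up for illustration (a real A/B system handles much more, like consistency and statistical significance): a stable hash decides which customers see the new feature, and exposure is recorded so feedback from each group can be compared.

```python
# Percentage-based exposure plus measurement, sketched with invented names.
import hashlib
from collections import Counter

ROLLOUT_PERCENT = 10  # show the new feature to roughly 10% of customers

def in_rollout(customer_id: str) -> bool:
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT  # the same customer always lands in the same bucket

metrics = Counter()

def render_page(customer_id: str) -> str:
    variant = "new" if in_rollout(customer_id) else "current"
    metrics[f"pageview.{variant}"] += 1  # record exposure so behavior can be compared
    return variant

for i in range(1000):
    render_page(f"customer-{i}")
print(metrics)
```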
System performance similarly requires a lot of instrumentation to identify changes in trends, to identify problem areas in the application & to verify when changes actually improve the situation.
I’ve been at too many companies where they simply had no idea how the system was performing today compared to last week to understand if things were better or worse. At my last company I saw a much more mature approach to this measurement which worked pretty well, but it required investment. They had two people fully dedicated to performance analysis & customer behavior analysis.
Would you listen to that argument? I wouldn’t. Everybody has an opinion about how to do things, what makes one better than another?
I’m on my 4th SaaS company at this point. I’m starting early this time and hoping to steer things in the right direction. I feel like I’ve observed some good and some bad and some really ugly at this point and I have a pretty good idea of what patterns and anti-patterns are important. The problem hasn’t changed though, just because I feel they are important doesn’t make them a priority for the company.
When I go to work for a new company a very important question to answer is whether the company is ready and willing to implement all the cultural and technical requirements for Continuous Deployment. I’ve at least figured out that, from my position, it’s exceptionally hard (so far impossible for me) to convince a company they want to do this – they have to already want to. I know how to implement, I know how to enact change, but I need support that has to already exist.
I focus on Continuous Deployment not because it’s a technical solution that you just plug in and go. I focus on it because it drives conversation around all the other areas where organizations should improve. For each improvement you make working toward Continuous Deployment, you make your development process better and your software better. These aren’t things that only provide benefit once you are doing Continuous Deployment – but when you’ve done them all it becomes a fairly easy decision to deploy continuously.
I’m early into my latest venture but already the attention is there, the interest in doing this right. I’m being asked for my thoughts on what we need to prioritize to move toward Continuous Deployment, what do we need to focus on early so that it’s easier later on. I’m also being asked to educate folks on what is important and why, what have I watched work and what have I watched fail. What mistakes can we avoid and what mistakes are we just going to have to make on our own?
Oh, you were hoping not to make mistakes? Good luck with that. The best I can hope for is to make my own mistakes.
My first flight on an airplane, in my life, was a skydiving trip. When the instructors discovered this as we ascended toward altitude they said “Well this is perfect, next time you fly you can tell the person next to you that you’ve flown before, but you’ve never landed”. My risk tolerance may differ slightly from others. I like to rock climb as well, plenty of people won’t do that. The thing is, both of these sports involve a very risky activity offset by copious amounts of safety. Still, when you watch the girl up on the cliff hanging by a limb all you can think of is “what if she falls?”. The answer is, she’ll get a little bruised up maybe, but if she’s doing things right she’ll keep right on climbing.
When you’ve been climbing a bit, when you understand the safety mechanisms, you pay much more attention to the climb, to the technique, to each movement. You know that the climber is probably safe, because you know what keeps them safe. You can take more risk with each movement knowing that any single mistake will only set you back so far.
These activities, like Continuous Deployment, look more risky from the outside than they are. If you don’t take the time to understand all the safety mechanisms then you can’t accurately evaluate the risk. For a company who pushes software every few weeks to consider pushing every commit without substantial other changes would be insane. Just like I would never go rock climbing without the right equipment (I’m allergic to free soloing – sorry). The act of Continuous Deployment is a realization of a ton of other effort – and all that effort has to be prioritized before you can ever get on the rock face.
Let’s say you deploy every week – I’m being generous here, but let’s just pretend. So you deploy on Thursday during the day because you have an awesome deploy process and you know it’s better to spot problems when everyone is in the office. You spot a problem, what do you do? I’m guessing your deploy was from a branch, so you just fix that branch & deploy. Then you merge the fix into master.
Friday comes along, hey there’s another critical issue. Fix, branch, deploy. Lather, rinse, repeat. Meanwhile, depending on how involved the fix is and what other stuff you have going on, you’ve got a bunch of merging to get right and the closer to your next branch you get, the more of a problem this becomes. How about a fix on Wednesday before the next deploy? I’m guessing you’ve already cut the next branch, so now you apply the fix to 3 branches (last week, this week, master).
All this deploying and merging and branching, it’s all work. The problem is – it’s not automated work, it’s work asking for mistakes to be made. It’s risk. Where are your safety mechanisms? Are they your manual testers? Your automated test suite? If those automated tests aren’t good enough to test each commit before it goes to production, why are they good enough to test each weeks deploy? Because you do manual testing?
This all sounds risky to me, but for some reason it sounds less risky than Continuous Deployment to some. I think this can only be because of a lack of understanding around the safety mechanisms, the pre-requisites. The proof is in the pudding though, and if you still produce shitty software when doing Continuous Deployment because you write bad tests and don’t do retrospectives and don’t prioritize the important work of making the system work right – then you’re sunk either way.
There are some companies that are probably better off deploying every 8 weeks.
The practices that surround Continuous Deployment/Delivery substantially reduce risk – things like Feature Toggles, automated testing, automated deployments, deploying off master, retrospectives, monitoring, accountability, access, ownership, reduced MTTR, and the list goes on. These all add up to make a software development and deployment environment so safe, anyone can commit code – if it doesn’t work it will not make it to production.
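To sketch how those safety mechanisms add up – every function below is a stand-in, not a real deployment system – each commit has to pass automated gates, a canary is checked before production, and anything unhealthy is rolled back rather than left to linger:

```python
# Illustrative deploy "gate": tests, canary, health checks, and rollback
# between a commit and production. All functions are hypothetical stubs.
def run_test_suite(commit): return True
def deploy(commit, environment): print(f"deploying {commit} to {environment}")
def healthy(environment): return True
def rollback(environment): print(f"rolling back {environment}")

def ship(commit: str) -> bool:
    if not run_test_suite(commit):
        print("tests failed; nothing leaves the pipeline")
        return False
    deploy(commit, "canary")
    if not healthy("canary"):
        rollback("canary")
        return False
    deploy(commit, "production")
    if not healthy("production"):
        rollback("production")  # detect fast, resolve fast – reduced MTTR
        return False
    return True

ship("abc123")
```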
But, things will still break. In my experience you have to break things in very subtle ways for code to get into production & as time goes on and you build better monitoring, even those issues should be detected fast & resolved fast.
It can take a while to reach the end goal, but you’ve got to start somewhere. However, even if you never actually practice Continuous Deployment, all of these practices will produce better software and probably happier developers.
Here are a few other good resources to learn about Continuous Deployment/Delivery
I looked up the definition of “culture” – here are a few definitions:
Note that culture is the manifestation of intellectual achievement. It’s the evidence and result of achievement. I think the 3rd definition is most appropriate for DevOps – what are the behaviors that are characteristic of a well integrated Development & Operations organization?
The challenge, the discussion, was how we can re-balance the scales and get the word out that this is actually about culture and that tools happen as a result of culture, not the other way around. This post begins my contribution to that effort.
The question was asked – do we all agree that culture is the most important thing when it comes to creating a successful business? The short answer is “yes”. If you wanted to hear all the if/and/but/what-if/etc discussions, you should have come to Devopsdays. For the sake of this blog post – culture is the most important factor. If you want case studies and analysis that proves that culture matters – read Jim Collins’ Good to Great.
My present company has a really excellent culture of Developer / Ops cooperation and collaboration. I wasn’t there when it wasn’t that way (if ever) and so I can’t tell a story about how to change your organization. What I can tell you is what a healthy and thriving Dev/Ops practicing organization looks like and what I think some of the key factors in that success are. I see this as two components – there are fundamental core values that enable and support the culture, and then there are tactical things that are done to make the culture work for us. I’d like to talk about both. The culture is the result of these actions and ideas put into practice.
Background
I work for a company with a well defined set of core values. Those values set forth parameters under which the culture exists. Here’s what they are:
These values are public and they matter – they matter a lot.
These might sound hokey to you – but every single one of them is held high at the company & strongly defended. Defending a list of values like this is hard sometimes. When someone doesn’t show respect to others, how do you uphold that core value? When someone’s idea of “work life balance” is different than another person, how do you support both of them? When creating your own reality means you don’t want to work for Rally anymore – what do you do?
I’m proud to say that in Rally’s case – they are generally true to the core values. Putting “Create your own reality” on a list of core values doesn’t create culture – what creates culture is having repeated examples where individuals have followed their passion & the company has supported them. This support doesn’t just mean they have permission, it means the company uses whatever resources it can to help. Sometimes this means using your resources to help someone find another job. Sometimes this means helping them get an education they can use at another company. Usually though, it means getting them into a role where they can do their best work. Whatever the case – Rally’s culture is to always be true to that core value and do whatever they can to support an employee in creating their own reality.
This is repeated for all of the core values. By being explicit & public about these values they set the stage for what an employee can expect from Rally as a workplace. But there’s more to it – you have to make sure these core values are upheld and you have to make sure they thrive – and this is where some of the tactical parts come in.
What are the tactical things?
There are plenty of other things here and there but you get the general idea. We talk a lot & tell each other what we’re doing, we enlist passionate individuals in areas they have interest, we embrace & seek out change and we empower individuals to drive change by working with others.
So what? What does that have to do with Devops?
Everything
2.5 years ago the company had some performance & stability problems. Technical debt had caught up with them and the only real way to fix the problem was to completely change the way the company did development & prioritized their work. The good news is that they did it, but it was made possible by the fact that individuals were empowered to drive that change. Almost overnight, two teams were formed to focus on architectural issues. A council was formed to prioritize architectural work. The things we all complain about never being able to prioritize became a priority and remain a priority to a degree I’ve never experienced before at other companies. Prioritizing this work is defended and advocated by the development teams – something only possible because of the collaborative environment in which we operate.
I have been personally involved in two services that literally started out as a skeleton of an app when they went into production. The goal was to lay the groundwork to allow fast production deployments, get monitoring in place & enable visibility while the system was being developed. This was all done because the developers understand the value of these things, but they don’t know exactly how to build it – they need Ops help. Having tight Ops and Dev collaboration on these projects has made them examples of what works in our organization. These projects become examples for other teams in the company and they push the envelope on new tech. These two projects have:
I’m sure the list will continue to go on… it’s fantastic stuff.
The Rub – culture isn’t much of anything without people who embrace it.
Along with a responsibility for pushing change from the bottom up in Rally comes responsibility for defending culture – or changing it. This means that when you hire people, they have to align with your core values – they have to be willing to defend that culture or the company as a whole needs to shift culture. All those core values and tactical things will not maintain a culture that the team members do not support. Rally’s culture is what it is because everyone takes it seriously and that includes taking it seriously when there’s a problem that needs fixing.
This has happened. There are core values that used to be on that list above but aren’t anymore. At one point or another things changed and those core values were eroding other core values. This takes time to surface, and it takes time to collect the data to show it’s true, but when the teams start to observe the trend they have to take action. This isn’t the job of management alone – it’s the job of every member of the company. When a voice begins to develop asking for change, you need a culture that allows that change to take place and lets everyone agree on the new shape things take.
That said, it also isn’t possible if management doesn’t support those same core values. Management has the same responsibility to take those core values seriously.
DevOps is our little corner of a much bigger idea
There’s a problem that we’re trying to fix – we’re trying to improve the happiness of people, the quality of software, and the general health of our industry. Our industry is totally healthy when you look at the bottom line, but we’re looking for something more. We want a happy and healthy development organization (including Ops, because Ops is part of the Development organization), but we also want our other teams to be part of that. As Ops folks and Developers, we can clean up our side of the street – we can do better. We seek to set an example for the rest of the organization.
For culture to really improve in companies it has to go beyond Dev and Ops into Executives, Product, Support, Marketing, Sales and everyone else. You ALL own quality by building a healthy substrate (culture) on top of which all else evolves.
But in the end it’s about culture. It’s really only about culture for now – because when you get culture right the other problems are easy to solve.
Congratulations to those of you who read this far – shoot me a note and let me know, because you probably share the same passion about this that I do. Also – putting up blog posts from 32,000 feet is awesome – thanks Southwest.
“Can you rate your skills in the following areas on a scale of 1 to 10:
Linux
Perl
SAN Storage
Networking
etc”
My response is generally, “no and goodbye”. This isn’t out of arrogance, but there’s just no point in moving forward. When we’re talking about a job I’m interviewing your company every step of the way. One key question I’m always asking myself is “does this seem like a hiring process that would hire people I want to work with?”. I don’t get a huge chance to interview my co-workers when I’m coming into a new place so my only real guide as to what type of folks they are is… the hiring process.
If your hiring process starts by selecting people based on how they rate skills, think of who you are excluding. Also, think of who ends up at the top of that list – arrogant, self-important folks who think they are experts in everything. Perhaps the HR person is clever and is looking for folks who rate themselves poorly to suggest humility but there are better ways.
I want to work on a team where you hire people because they fit well into the team, because they have passion about their job, and because they can learn. I don’t want you to hire someone who already thinks they know everything because every conversation with that person is going to be an argument about why something new isn’t as good as what they’ve already done before. People do what has worked for them in the past unless they are tinkerers who like to try new things – in which case they probably don’t become experts, they become generalists. The best folks I’ve worked with didn’t know how to do most of their day to day tasks when they started the job. They learned, they asked questions, they were humble and eager and engaging. Those are the people I want to work with.
Yes, I’m restricting my potential job opportunities by doing this – which is exactly the point.
So, I’m letting it go. I’ve thought a lot about this, and it would be different if I had ever had an opportunity that appeared to be gained by having the certification – I haven’t. Or if the knowledge I gained from the certification helped me do my job – it doesn’t. Or if people even cared that I have the certification – they don’t (except for one guy in Korea who I had BBQ with at 4am; he seemed excited about it). And lastly, it would be different if I ever thought I was going to go after a security job – but I’m pretty sure I’m not.
I think certifications have their place and I applaud anyone who goes out and earns them. I just personally have better things to do with my time than maintain something that really hasn’t added any value to my career. I encourage others to seek out certifications that do add value to their career and I encourage those who help maintain certifications to keep them evolving & continue to challenge folks to get them and keep them.
This started when I began attending this team’s daily stand up. My goal was to get more involved with a single team in dev to get a better idea of how the process worked in general. This team was one of our two Infrastructure teams, which focus on scalability, stability & performance-enhancing changes. Initially this team was creating a new Web Services API service, which I wrote a little bit about here.
Eventually that service was set aside and the team moved on to a new Authorization and Authentication service. For this new service the decision was made to use Continuous Deployment. We were already doing fully automated deploys at least once per week but there was a bit of a jump to giving the developers the tools they needed to deploy every commit including monitoring & deployment automation changes.
I had also noticed, leading up to this, that the few times I had sat over with the team I was immediately more involved in discussions – they asked me questions (because I was there) and I had the option of attending planning sessions. There was literally a 20 foot difference between my own desk & the desk I sat at “with them” but it made a world of difference. As such, I talked to my management about sitting with that team all the time and they agreed to try it.
Now, this team is a bit unique. It is made up of a handful of developers working on the code, but it is also the home of the Build & Release guy as well as our Sysadmin who manages the testing infrastructure. Sitting with this team gave me an opportunity to not only be involved in the development of this new service but also to become more involved in the Build & Release process, getting familiar with the day to day problems that come up, as well as pairing with folks on our puppet configurations, which are shared between dev & prod. This team structure, with me added, also made them uniquely suited to tackle the Continuous Deployment problem (at least for this service) completely within a single team.
As part of the Continuous Deployment implementation we wanted to make it as easy as possible for developers to get access to the metrics they needed. We already had Splunk for log access, but our monitoring system required Ops involvement to manage new metrics. So as part of this new service we also had to perform a spike on a new metric collection/trending system – we looked at Ganglia & Graphite. We weren’t trying to tackle alerting – we just made it a requirement that any system we selected be able to expose metrics to Nagios. I worked with the developers to test out a variety of ways for our application to push metrics into each of these systems while also evaluating each system for good Operational fit (ease of management, performance, scalability, etc).
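To give a concrete (if simplified) picture of what “pushing metrics” can look like: Graphite’s carbon listener, for example, accepts plaintext lines of “metric value timestamp” over TCP, so getting a datapoint in can be as small as opening a socket and writing a line. The sketch below is hypothetical – the host and metric names are made up and this isn’t our actual code – but it shows how low the barrier is:

```java
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class GraphitePushSpike {
    public static void main(String[] args) throws Exception {
        // Hypothetical carbon host; the plaintext listener defaults to TCP port 2003.
        try (Socket socket = new Socket("graphite.example.com", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8")) {
            long now = System.currentTimeMillis() / 1000; // carbon expects epoch seconds
            // One datapoint per line: "<metric.path> <value> <timestamp>\n"
            out.write("authservice.requests.count 42 " + now + "\n");
            out.flush();
        }
    }
}
```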
Throughout this process there were also a lot of questions about how to perform deployments. How many previous builds do we keep? When and how do we rollback? What is our criteria for calling a deployment successful? How do we make sure it fails in test before it fails in production? What do we have to build into the service to allow rolling deploys to not interrupt service? The list goes on – these are all things that you should think about with any service but when the Developers are building the deployment tools they become very aware of all of this – it was awesome.
After about 45 days we had the monitoring system selected & running in production and test, we had deployments going to our testing systems, and we were just starting to deploy into production. We then had to start our dark launch: sending traffic from our production system to the new service without impacting production traffic, so we could see how this backend service performed, whether it responded correctly & generally get a better understanding of its behavior under prod traffic. Today this service is still operating dark as we tweak and tune a variety of things to make sure it’s ready for production – again, it’s awesome.
60 days in, things started winding down. We had been dark launched for a few weeks and the developers largely had access to everything they needed – they could look at graphs and logs, and if they needed new metrics they just added them to the code and they showed up in monitoring as soon as they deployed. We got deploy lines added onto the graphs so we could correlate deployments with trends – more awesome. But my own work was wrapping up; there were fewer and fewer Operational questions coming up and I was starting to move back toward other Ops projects.
As I looked back on the last 60 days working with this team I realized the same 20 feet that kept me from being involved with the development team had now kept me from being involved with the Ops team. I was really conflicted but it felt like the healthy thing to do would be to move back over into Ops now that the work was winding down. I immediately realized the impact it had as people made comments “wow, you’re back!”… seriously folks, I was 20 feet away! You shot me with nerf darts!
So now I’ve been back over in Ops for a few weeks and there has actually been a change – I’m still much more involved with that Dev team than I was at the start. They still include me in planning & they come to me when Operational questions or issues come up around the service. However, that 20 feet is there again; I can’t hear all the conversations and I know there are questions that would get asked if someone didn’t have to stand up and walk over. Our Dev teams tend to do a lot of pairing, so they aren’t often on IM and email responses are usually delayed – pairing certainly cuts down on the email checking.
Was I happy I did it? Absolutely. Would I do it again? I think I would – but I would constrain it and set expectations. The physical proximity to the team helped a lot in moving quickly and tossing ideas around while the service was being developed and decisions were being made, but it did have an impact on my relationship with the Ops team that I wish I could have avoided. Continuing to move back and forth – spending regular time with the Ops team – would have helped. I actually did spend my on-call weeks (every 4th week) in Ops instead of sitting with the Dev team, but next time I would try to find some time during the 3 weeks in between to be over there too – it was just too much absence.
All that said, I think overall the company and the service are better for the way this turned out, and for me personally it was a super insightful experience that I wish every Ops person could try sometime.
I’m not sure how ready it is for prime time, but take a look – it sounds pretty interesting.
I think I’ll always have an interest in monitoring problems – not because I have to as part of my job, but because the problem has such a wide variety of potential solutions with different benefits. It’s also one of those areas that’s very rarely just plug and go; you have to architect it like any other service, which keeps it interesting.
1) Make money
2) Change the world, as long as we can do #1.
Lots of companies accomplish these goals doing things wrong. If you want proof, read Good to Great; there are oodles of examples of companies that didn’t qualify as “great” but that you would recognize as successful.
When wagon trains migrated families west across the US, the idea of driving 40mph, of crossing a state in a day, would have been crazy talk. Then came the locomotive.
When locomotives moved people across the country, the idea of a car making an interstate trip would have been crazy. It would be madness if everyone operated their own car. Then came cars, and roads, and traffic signals, and road signs. This took time, lots of mistakes, lots of retrospectives, and year over year progress.
Progress isn’t made by conforming to the conventions of today, it’s made by pushing for something better. That’s what some folks are doing in Ops today – they are trying to push the limits and do what works for them. Others are observing these patterns and following suit. Still others are sitting back and saying “That ain’t right, my process works just fine”. Perfect.
It wasn’t necessary for automobile manufacturers to convince railroad operators that the car was the future. The car became the future because people adopted it, because it worked, and because over time the infrastructure that supported it became more mature.
As our tools get better, as our patterns become more and more repeatable, as we start to understand what roads & traffic signals & road signs we need for Ops to get out of the way of Developers making changes in production, things will move. In the meantime – talk about what works for you, why it works for you, and don’t bother convincing other people why it should work for them.
“I could get that done in a few hours, easy”
“I could whip that up in 2 seconds”
So what? Instead of bragging about how awesome people should think you are because of how fast you say you can get things done, how about you ask some questions?
“When do you need it done by?”
“Do we have time to improve this other widget to make fixing your thing easier?”
We are so focused on trying to crank out as much as we possibly can, we sometimes think it’s better to talk about how quickly we can get things done. Instead, under commit & over deliver. If you have 2 days to get something done and you only need ½ day – spend some time improving something that will make your life easier. Understand what the expectations are before you commit & see if you can get in some extra benefit. If you hit a snag & end up taking 2 days to finish then nobody is disappointed, but if you don’t then you get some extra work done & still meet expectations.
I know folks prefer accurate estimates & like to fill your day with the stuff they want done – but don’t complain that you can’t get your stuff done if you aren’t under committing once in a while.
One of the areas where Operations is often a bottleneck is monitoring. The traditional model is to have Ops ask Dev what metrics they need monitored & to set those up. This often means that monitoring can’t start until the metrics are available in the code, and even then it can be days or weeks before some Ops person has time to set up the monitoring system to pull them. This is broken and unnecessary.
If you are operating in a service delivery model where you have control over all the systems you monitor, you should be working to get out of the way. You should be working to make the monitoring happen automatically, without Ops involvement. This doesn’t mean that Dev does all the work; it means that Ops selects monitoring systems that allow for discovery of new metrics & automatic collection of those metrics without additional incremental work each time.
Some of this is technology selection, some of this is architecture, and some of this is just doing the work. It does take work – but I would be hard pressed to find an example where the work required to set this up is not offset by the work saved in the long run by not having to respond to every new metric that gets added. Below are some concrete examples of what I’m talking about – if you aren’t familiar with these tools, take a look at them.
Metric Collection
The Yammer metrics library has made it really easy to expose your application metrics automatically, and it provides hooks into tools like Ganglia and Graphite for pushing those metrics to the monitoring system. As you look at how to scale a monitoring system, these are great tools to allow for that. Another popular data collection tool is statsd. The main idea is that you want to use collection tools that don’t need metrics pre-defined for them: if you give them a value for a metric, they track it – that’s all. The more often you give it to them, the more numbers they store.
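To give a feel for what “automatically” means here, a minimal sketch with the Yammer metrics library looks something like the following. The class, metric and host names are made up, and the exact calls vary between versions of the library (this is roughly the 2.x style), but the shape is the point: define a metric once in code, mark it where it happens, and let a reporter ship the whole registry to Graphite on a schedule.

```java
import java.util.concurrent.TimeUnit;

import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Meter;
import com.yammer.metrics.reporting.GraphiteReporter;

public class AuthService {
    // Defining the metric in code is the only "setup" - no monitoring-system config required.
    private static final Meter requests =
            Metrics.newMeter(AuthService.class, "requests", "requests", TimeUnit.SECONDS);

    public void handleRequest() {
        requests.mark(); // every new metric a developer adds shows up in monitoring automatically
        // ... actual request handling ...
    }

    public static void main(String[] args) {
        // Ship everything in the default registry to Graphite once a minute
        // (hypothetical host; 2003 is carbon's plaintext listener port).
        GraphiteReporter.enable(1, TimeUnit.MINUTES, "graphite.example.com", 2003);
    }
}
```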
Graph presentation
Ganglia is great for allowing you to programmatically define graphs and manage those via your CM system of choice, like puppet or chef. Another approach is something like Graphite, which provides a rich and generic UI for taking whatever metrics you collect & combining them into a graph. Building custom dashboards and the like is where Graphite’s strength lies.
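As a made-up illustration of what that generic UI buys you: Graphite’s render API lets you compose a graph from whatever has been collected, with nothing pre-defined. A URL along these lines (hypothetical host and metric names) sums per-host request counters into a single 24-hour graph:

```
http://graphite.example.com/render?target=sumSeries(authservice.*.requests.count)&from=-24hours
```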
Alerting
Nagios. We all dislike it, but it works pretty well. The main advantage Nagios has over more “intelligent” systems is that it can be configured through your CM system of choice. Additionally, Nagios has a massive community behind it. When building out Nagios or whatever you use, do your best to drive your configuration through CM and try to get things to the point where you don’t have to do any incremental monitoring work for each new system you add. New systems that are the same type as a system that’s already defined should just get monitored for “free”.
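As a rough, hypothetical sketch of what monitoring “for free” looks like in Nagios terms: bind service checks to hostgroups rather than to individual hosts, and have your CM system drop each new node into the right hostgroup. The names and the generic-service template below are placeholders, not our actual configuration:

```
define hostgroup {
    hostgroup_name  app-servers
    alias           Application Servers
}

define service {
    use                  generic-service
    hostgroup_name       app-servers
    service_description  HTTP
    check_command        check_http
}
```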
Summary
Think of monitoring like a service, like any other application in your architecture. You want it to discover what’s out there and configure itself as much as possible. Doing this isn’t completely simple yet – but it’s possible and if you set your mind to it you might even find a way to do it better that you can contribute back to the community. In doing this, try to get out of the way of your Developers and strive to have metrics they expose in their application automatically show up in your monitoring system of choice. Try to make it very low cost for them to add new metrics to see new information & you will probably be surprised at the amount of monitoring you get when all a Developer has to do is write the code to track the metric & it shows up in prod.
So, I wanted to draw some parallels between my observations about climbing and working through obstacles in tech. We deal with new challenges every day and some can be pretty intimidating. We also learn and teach a lot, and I think there are some lessons about that as well.
Don’t allow obstacles to defeat you before you start.
When you approach a new route it can be intimidating – you aren’t really sure what to expect or which part of the route is going to be most difficult. This is true of approaching many problems, but just because a problem looks intimidating does not mean it cannot be overcome. There is a lot we don’t understand until we are working our way through a problem, and telling yourself you can’t do it isn’t going to help. Make one move at a time & do your best – rarely is the situation so critical that you cannot afford to adjust as you learn.
When you miss, inspect and adapt.
The bouldering gym is full of big pads. Those aren’t there because nobody ever falls. Everybody falls. This is part of the process of challenging yourself, part of the process of trying new things. You go to the gym to fall because it’s safe to challenge yourself there & learn how to improve.
Too often I hear folks who are afraid to fall, afraid they might choose the wrong path when working through a problem in their life, their career, or some technical issue. It just isn’t possible to know the right path 100% of the time, so don’t bother trying – just do your best. When the inevitable fall happens, take another look at your moves, try to understand what went wrong, and try again a different way.
Inspecting and adapting to what you learn is one of the greatest skills you can learn. Freeing yourself to make mistakes removes a lot of barriers that you thought were there when they actually weren’t.
Watch others
In the gym this means literally watching other climbers. Some climb with such grace that they make things look easy. This is true of a lot of things – so look at what others are doing. We all experience problems in different ways and we all solve them in different ways; learning from each other is key to progress. But keep in mind that what works for one may not work for another. A tall person will climb a route much differently than a short person – they have longer reach, and they also have a different center of gravity. Use ideas you see, but don’t get too upset if those same ideas don’t work for you.
Be patient
When you first start to climb, as when you first get into most things, there is a period of fast improvement. You feel great, you are learning fast, you must be awesome. As you learn more and as you start to approach more difficult challenges, your ascent will seem to slow. You are getting better, you are learning stuff, but it’s not as easy as it used to be. Once you’ve been doing this for 15 years, the problems that are hard to overcome aren’t about learning how to use some new programming language or learning to deal with some new technology, they are the finer points that actually make you better day to day. Those things take time to overcome, they are hard problems that require discipline and persistence like you have never needed before.
Climbers who have been climbing for many years will tell you that it becomes very hard to progress to the next level. Each progressive level requires significant improvement & a lot of work. You have to be patient & keep at it, you have to love climbing to climb, and you will improve.
Be Helpful
The only reason I am where I am is that there were people who were willing to help me along the way. When I first stepped foot in the climbing gym there were people who showed me the basics. When I was clearly struggling with a route, there were people who climbed it & offered advice. When I have had problems finding that missing semicolon in a sea of code, there have been others willing to lend a 2nd set of eyes to find it.
We need each other to overcome obstacles and we each bring a different set of skills to the table. Being helpful contributes to that, just as you have leveraged others’ helpfulness to get where you are. Give back and help out.
Be Nice
It’s easy to be arrogant. It’s easy to tell someone that your challenges are more difficult than theirs. It also serves no one but yourself. Be kind as you work your way through your challenges because relationships matter more than any ability you could ever learn.